Command line tools for large tabular data files.
Visit the Tools Reference main page
Visit the TSV Utilities main page
Note: See the tsv-filter description in the project README for a tutorial style introduction.
Synopsis: tsv-filter [options] [fileā¦]
Filter lines by comparison tests against fields. Multiple tests can be specified. By default, only lines satisfying all tests are output. This can be changed using the --or
option. A variety of tests are available.
--help
- Print help.--help-verbose
- Print detailed help.--help-options
- Print the options list by itself.--help-fields
- Print help on specifying fields.--V|version
- Print version information and exit.--H|header
- Treat the first line of each file as a header.--d|delimiter CHR
- Field delimiter. Default: TAB. (Single byte UTF-8 characters only.)--or
- Evaluate tests as an OR rather than an AND.--v|invert
- Invert the filter, printing lines that do not match.--c|count
- Print only a count of the matched lines.--label STR
- Rather than filter, add a new field to each record indicating if the filter criteria was satisfied. STR is the header, ignored if there is no header line. Values 1
and 0
are used to indicate pass/no-pass. This can be customized using --label-values
.--label-values STR1:STR2
- The pass/no-pass values used by --label
. Defaults to 1
and 0
.--line-buffered
- Immediately output every matched line.--empty <field-list>
- True if the field is empty (no characters)--not-empty <field-list>
- True if the field is not empty.--blank <field-list>
- True if the field is empty or all whitespace.--not-blank <field-list>
- True if the field contains a non-whitespace character.--is-numeric <field-list>
- True if the field can be interpreted as a number.--is-finite <field-list>
- True if the field can be interpreted as a number, and it is not NaN or infinity.--is-nan <field-list>
- True if the field is NaN (including: "nan", "NaN", "NAN").--is-infinity <field-list>
- True if the field is infinity (including: "inf", "INF", "-inf", "-INF")--le <field-list>:NUM
- FIELD <= NUM (numeric).--lt <field-list>:NUM
- FIELD < NUM (numeric).--ge <field-list>:NUM
- FIELD >= NUM (numeric).--gt <field-list>:NUM
- FIELD > NUM (numeric).--eq <field-list>:NUM
- FIELD == NUM (numeric).--ne <field-list>:NUM
- FIELD != NUM (numeric).--str-le <field-list>:STR
- FIELD <= STR (string).--str-lt <field-list>:STR
- FIELD < STR (string).--str-ge <field-list>:STR
- FIELD >= STR (string).--str-gt <field-list>:STR
- FIELD > STR (string).--str-eq <field-list>:STR
- FIELD == STR (string).--istr-eq <field-list>:STR
- FIELD == STR (string, case-insensitive).--str-ne <field-list>:STR
- FIELD != STR (string).--istr-ne <field-list>:STR
- FIELD != STR (string, case-insensitive).--str-in-fld <field-list>:STR
- FIELD contains STR (substring search).--istr-in-fld <field-list>:STR
- FIELD contains STR (substring search, case-insensitive).--str-not-in-fld <field-list>:STR
- FIELD does not contain STR (substring search).--istr-not-in-fld <field-list>:STR
- FIELD does not contain STR (substring search, case-insensitive).--regex <field-list>:REGEX
- FIELD matches regular expression.--iregex <field-list>:REGEX
- FIELD matches regular expression, case-insensitive.--not-regex <field-list>:REGEX
- FIELD does not match regular expression.--not-iregex <field-list>:REGEX
- FIELD does not match regular expression, case-insensitive.--char-len-le <field-list>:NUM
- FIELD character length <= NUM.--char-len-lt <field-list>:NUM
- FIELD character length < NUM.--char-len-ge <field-list>:NUM
- FIELD character length >= NUM.--char-len-gt <field-list>:NUM
- FIELD character length > NUM.--char-len-eq <field-list>:NUM
- FIELD character length == NUM.--char-len-ne <field-list>:NUM
- FIELD character length != NUM.--byte-len-le <field-list>:NUM
- FIELD byte length <= NUM.--byte-len-lt <field-list>:NUM
- FIELD byte length < NUM.--byte-len-ge <field-list>:NUM
- FIELD byte length >= NUM.--byte-len-gt <field-list>:NUM
- FIELD byte length > NUM.--byte-len-eq <field-list>:NUM
- FIELD byte length == NUM.--byte-len-ne <field-list>:NUM
- FIELD byte length != NUM.--ff-le FIELD1:FIELD2
- FIELD1 <= FIELD2 (numeric).--ff-lt FIELD1:FIELD2
- FIELD1 < FIELD2 (numeric).--ff-ge FIELD1:FIELD2
- FIELD1 >= FIELD2 (numeric).--ff-gt FIELD1:FIELD2
- FIELD1 > FIELD2 (numeric).--ff-eq FIELD1:FIELD2
- FIELD1 == FIELD2 (numeric).--ff-ne FIELD1:FIELD2
- FIELD1 != FIELD2 (numeric).--ff-str-eq FIELD1:FIELD2
- FIELD1 == FIELD2 (string).--ff-istr-eq FIELD1:FIELD2
- FIELD1 == FIELD2 (string, case-insensitive).--ff-str-ne FIELD1:FIELD2
- FIELD1 != FIELD2 (string).--ff-istr-ne FIELD1:FIELD2
- FIELD1 != FIELD2 (string, case-insensitive).--ff-absdiff-le FIELD1:FIELD2:NUM
- abs(FIELD1 - FIELD2) <= NUM--ff-absdiff-gt FIELD1:FIELD2:NUM
- abs(FIELD1 - FIELD2) > NUM--ff-reldiff-le FIELD1:FIELD2:NUM
- abs(FIELD1 - FIELD2) / min(abs(FIELD1), abs(FIELD2)) <= NUM--ff-reldiff-gt FIELD1:FIELD2:NUM
- abs(FIELD1 - FIELD2) / min(abs(FIELD1), abs(FIELD2)) > NUMTip: Bash completion is very helpful when using commands like tsv-filter
that have many options. See Enable bash-completion for details.
$ # 'Count' field non-zero
$ tsv-filter --header --ne Count:0
$ # Field 2 non-zero
$ tsv-filter --ne 2:0 data.tsv
$ # Field 1 == 0 and Field 2 >= 100, first line is a header.
$ tsv-filter --header --eq 1:0 --ge 2:100 data.tsv
$ # 'Count' field == -1 or 'Count' field > 100
$ tsv-filter --or --eq Count:-1 --gt Count:100
$ # 'Name1' field is foo, 'Name2' field contains bar
$ tsv-filter -H --str-eq Name1:foo --str-in-fld Name2:bar data.tsv
$ # 'start_date' field == 'end-date' field (numeric test)
$ tsv-filter -H --ff-eq start_date:end_date data.tsv
Field lists can be used to run the same test on multiple fields. For example:
$ # Test that fields 1-10 are not blank
$ tsv-filter --not-blank 1-10 data.tsv
$ # Test that fields 1-5 are not zero
$ tsv-filter --ne 1-5:0 data.tsv
$ # Test that all the '_time' fields are not zero
$ tsv-filter -H --ne '*_time:0' data.tsv
$ # Test that fields 1-5, 7, and 10-20 are less than 100
$ tsv-filter --lt 1-5,7,10-20:100 data.tsv
See Field syntax for more information on field lists and specifying fields by name.
The regular expression syntax supported is that defined by the D regex library. The basic syntax has become quite standard and is used by many tools. It will rarely be necessary to consult the D language documentation. A general reference such as the guide available at Regular-Expressions.info will suffice in nearly all cases. (Note: Unicode properties are supported.)
$ # Field 2 has a sequence with two a's, one or more digits, then 2 a's.
$ tsv-filter --regex '2:aa[0-9]+aa' data.tsv
$ # Same thing, except the field starts and ends with the two a's.
$ tsv-filter --regex '2:^aa[0-9]+aa$' data.tsv
$ # 'Name' field is a sequence of "word" characters with two or more embedded
$ # whitespace sequences (match against entire field)
$ tsv-filter -H --regex 'Name:^\w+\s+(\w+\s+)+\w+$' data.tsv
$ # 'Title' field containing at least one cyrillic character.
$ tsv-filter -H --regex 'Title:\p{Cyrillic}' data.tsv
Numeric tests like --gt
(greater-than) assume field values can be interpreted as numbers. An error occurs if the field cannot be parsed as a number, halting the program. This can be avoided by including a testing ensure the field is recognizable as a number. For example:
$ # Ensure the 'count' field is a number before testing for greater-than 10.
$ tsv-filter -H --is-numeric count --gt count:10 data.tsv
$ # Ensure field 2 is a number, not NaN or infinity before greater-than test.
$ tsv-filter --is-finite 2 --gt 2:10 data.tsv
The above tests work because tsv-filter
short-circuits evaluation, only running as many tests as necessary to filter each line. Tests are run in the order listed on the command line. In the first example, if --is-numeric 2
is false, the remaining tests do not get run.
The --c|count
option changes the operation of tsv-filter
to produce a count of the matching records. For example:
$ # Number of records with an empty 'name' field
$ tsv-filter -H --count --empty name data.tsv
127
The --count
operation is similar to piping tsv-filter
output to the Unix wc -l
command, except that --count
excludes the header line. Using --count
is also faster. Example:
$ # Number of records with 'Year' field greater than 1995
$ tsv-filter -H --count --gt Year:1995 data.tsv
925
$ # Same operation with wc -l. This includes one more line, the header.
$ tsv-filter -H --gt Year:1995 data.tsv | wc -l
926
The --label
option adds a field to each record indicating if the filter criteria was satisfied. The --label-values
option can be used to customize the indicator values. A small data file is used to illustrate this in the example below.
$ # The data file
$ tsv-pretty data.tsv
id color year
100 green 1978
101 red 1992
102 red 2007
103 yellow 1995
$ # Mark records indicating years in the 1990s
$ tsv-filter -H --label 1990s --ge year:1990 --le year:2000 data.tsv | tsv-pretty
id color year 1990s
100 green 1978 0
101 red 1992 1
102 red 2007 0
103 yellow 1995 1
$ # Customizing the indicator values
$ tsv-filter -H --label 1990s --label-values yes:no --ge year:1990 --le year:2000 data.tsv | tsv-pretty
id color year 1990s
100 green 1978 no
101 red 1992 yes
102 red 2007 no
103 yellow 1995 yes
The --line-buffered
option changes input and output buffering so each line is read and written as soon as it is available. It is typically used when reading from an input stream that produces records slowly and output is desired as soon as available. An example:
$ complex_operation | tsv-filter -H --gt score:50 --line-buffered
The above command will output lines satisfying the filter as soon as they are received from the complex_operation
command. Without the --line-buffered
option output would wait until enough data is available to fill the buffers used by tsv-filter
.
The effect can be seen using the sleep
command when printing a series of lines:
$ # Print ten numbers, sleeping 1 second between each line. Use tsv-filter
$ # to filter the results, with --line-buffering on.
$ #
$ for i in `seq 10` ; do echo $i ; sleep 1 ; done | tsv-filter --ne 1:3 --line-buffered
The command line above outputs a new line every second. However, if the --line-buffered
option is removed, there is no output for ten seconds.