tsvutils -- utilities for processing tab-separated files

tsvutils are scripts that can convert and manipulate tabular data in the TSV file format: tab-separated values, sometimes with a header. They build on top of standard Unix utilities to allow ad-hoc, efficient, and reliable processing and summarization of tabular data from the shell.

Overview of scripts

Convert into tsv:

csv2tsv - convert from Excel-compatible csv.
json2tsv - convert from concatenated JSON records.
xlsx2tsv - convert from Excel's .xlsx format.
others: eq2tsv ssv2tsv uniq2tsv yaml2tsv ...

Manipulate tsv:

tsvawk - gives you column names in your awk.
hwrap - wraps pipeline process but preserves stdin's header.
tsvcat - concatenate tsv's, aligning common columns.
namecut - like 'cut' but with header names.
tabsort, tabawk - wrappers for tab delimitation.

Convert out of tsv:

tsv2csv - convert tsv to Excel-compatible csv.
tsv2my - load tsv into a new MySQL table.
tsv2fmt - format as ASCII-art table.
tsv2html - format as HTML table.
others: tsv2yaml tsv2tex ...

By "tsv" we mean honest-to-goodness tab-separated values, often with a header. No quoting, escaping, or comments. All rows should have the same number of fields. Rows end with a unix \n newline. Cell values cannot have tabs or newlines.

These conditions are all enforced in scripts that output tsv. For programs that take tsv input, if these assumptions do not hold, the script's behavior is undefined.

TSV is an easy format for other programs to handle:

After stripping the newline, split("\t") correctly parses a row.
To strip out the header beforehand: "tail -n+2" or "tail +2".

Weak naming convention: programs that don't work well with headers call that format "tab"; ones that either need a header or are agnostic call that "tsv". E.g., for tabsort you don't want to sort the header, but tsv2my is impossible without the header. csv2tsv and tsv2csv are agnostic, since a csv file may or may not have a header.

Examples: pipelines

The TSV format is intended to work with many other pipeline-friendly programs. Examples include:

General
- cat, head, tail, tail -n+X, cut, merge, diff, comm, sort, uniq, uniq -c, wc -l
Multipurpose
- perl -pe, ruby -ne, awk, sed, tr
SQL to TSV
- echo 'select a,b from bla' | mysql
- echo a b | ssv2tsv; echo "select a,b from bla" | sqlite3 -separator $(echo -e '\t')
- echo a b | ssv2tsv; echo "select a,b from bla" | psql -tqAF $(echo -e '\t')
GUI to TSV
- Excel: copy-and-paste cells <-> text as tsv (though kills double quotes)
- Web browsers: copy rendered HTML table -> text as tsv
Misc
- pv (Highly recommended!)
- shuffle
- md5sort
- setdiff

The tsvutils scripts' comments include further examples.

Examples: Named columns in programs

Here are examples of parsing TSV-with headers in several script-y languages, such that you get to refer to columns by their names, instead of positions. This makes the scripts much more maintainable.

Python has a built-in facility for TSV-with-headers:

tsv_reader = lambda f: csv.DictReader(f, dialect=None, delimiter='\t', quoting=csv.QUOTE_NONE)
for record in tsv_reader(sys.stdin):
  print record  # => hash of key/values

Or equivalently:

cols = sys.stdin.readline()[:-1]
for line in sys.stdin:
  vals = line[:-1].split("\t")
  record = dict((cols[j],vals[j]) for j in range(len(cols)))
  print record  # => hash of key/values

Ruby:

cols = STDIN.readline.chomp.split("\t")
STDIN.each do |line|
  vals = line.chomp.split("\t")
  record = (0...cols.size).map {|j| [cols[j], vals[j]]}.to_h
  p record  # => hash of key/values
end

R has a built-in facility:

data = read.delim("data.tsv", sep="\t")

Installation

It's probably useful to look at or tweak these scripts, so you're best off just putting the entire directory on your PATH.

The philosophy of tsvutils

There are many data processing and analysis situations where data consists of tables. A "table" is a list of flat records each with the same set of named attributes, where it's easy to manipulate a particular attribute across all records -- a "column". The main data structures in SQL, R, and Excel are tables.

TSV-with-headers sits in a sweet spot on the spectrum of data format complexity.

A more complex alternative is to encode in arbitrarily nested structures (XML, JSON). These have greater representational capacity, but are less convenient for data analysis. Since they can have high structural complexity, it's often error-prone to use them -- ad-hoc querying is generally difficult. Furthermore, it's wasteful of space to repeat key names over and over if all records are known to have the same set of keys. Finally, when doing data analysis, especially statistical analysis, you want to turn columns into vectors, which presupposes a flatter, more table-like structure.
A simpler alternative is a table with positional, un-named columns. The main weakness is that for more than several columns, it's hard to remember which column is which. Named columns improve maintainability.

But: SQL databases and Excel spreadsheets are often inconvenient data management environments compared to the filesystem on the unix commandline. Unfortunately, the most common file format for tables is CSV, which is complex and has several incompatible versions. It plays only moderately nicely with the unix commandline, which is the best ad-hoc processing tool for filesystem data. Often the only correct way to handle CSV is to use a parsing library, but it's inconvenient to fire up a python/perl/ruby session just to do simple sanity checks and casually inspect data.

To balance these needs, so far I've found that TSV-with-headers is the most convenient canonical format for table data in the filesystem/commandline environment, or at least the lingua franca in shell pipelines. These utilities are just a little bit of glue to make TSV play nicely with CSV, Excel, MySQL, and Unix tools. Interfaces in and out of other table-centric environments could easily be added.

On the philosophy of having NO escaping or special data value conventions: If you want those things in your data, make up your own convention (like backslash escaping, URL escaping, or whatever) and have your application be aware of it. Our philosophy is, a data processing utility should ignore that stuff in order to have safe and predictable behavior. I've seen too many bugs because some intermediate program imposed a special meaning on "NA" or "\N" or "NULL", etc., when really a program further downstream should have had sole responsibility for this interpretation.

In Conclusion

Hope you enjoy tsvutils!

tsvutils: http://github.com/brendano/tsvutils
By Brendan O'Connor: http://anyall.org

Name		Name	Last commit message	Last commit date
Latest commit History 81 Commits
json2tsv_testdata		json2tsv_testdata
test		test
xlsx_test		xlsx_test
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
README.md		README.md
bridges.rb		bridges.rb
csv2json		csv2json
csv2tsv		csv2tsv
eq2tsv		eq2tsv
fmt2tsv		fmt2tsv
gs2tsv		gs2tsv
html2tsv		html2tsv
hwrap		hwrap
json2tsv		json2tsv
lamecut		lamecut
namecut		namecut
sas2tsv		sas2tsv
ssv2tsv		ssv2tsv
tabawk		tabawk
tabsort		tabsort
tar2tsv		tar2tsv
tsv2csv		tsv2csv
tsv2fmt		tsv2fmt
tsv2html		tsv2html
tsv2json		tsv2json
tsv2my		tsv2my
tsv2ssv		tsv2ssv
tsv2tex		tsv2tex
tsv2yaml		tsv2yaml
tsvawk		tsvawk
tsvcat		tsvcat
tsvsort		tsvsort
tsvutil.py		tsvutil.py
uniq2tsv		uniq2tsv
xls2csvfiles		xls2csvfiles
xls2tsv		xls2tsv
xlsx2tsv		xlsx2tsv
yaml2tsv		yaml2tsv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

tsvutils -- utilities for processing tab-separated files

Overview of scripts

Examples: pipelines

Examples: Named columns in programs

Installation

The philosophy of tsvutils

In Conclusion

About

Releases

Packages

Contributors 2

Languages

License

brendano/tsvutils

Folders and files

Latest commit

History

Repository files navigation

tsvutils -- utilities for processing tab-separated files

Overview of scripts

Examples: pipelines

Examples: Named columns in programs

Installation

The philosophy of tsvutils

In Conclusion

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages