tsv-join

6a284fb · Jun 10, 2021

Name	Last commit message	Last commit date
src/tsv_utils	Minor doc edits. (#347)	Jun 10, 2021
tests	Use FILE.dos_tsv for DOS line ending tests and configure git to respe…	Jun 6, 2021
README.md	Minor doc edits. (#347)	Jun 10, 2021
dub.json	2021 copyright updates (#330)	Feb 20, 2021
makefile	Code coverage reports generated, added to travis-ci and codecov (#54)	Apr 5, 2017

README.md

Visit the eBay TSV utilities main page

tsv-join

Joins lines from multiple files based on a common key. One file, the 'filter' file, contains the records (lines) being matched. The other input files are scanned for matching records. Matching records are written to standard output, along with any designated fields from the filter file. In database parlance this is a hash semi-join. This is similar to the "stream-static" joins available in Spark Structured Streaming and "KStream-KTable" joins in Kafka. (The filter file plays the same role as the Spark static dataset or Kafka KTable.)

Example:

$ tsv-join -H --filter-file filter.tsv --key-fields Country,City --append-fields Population,Elevation data.tsv

This reads filter.tsv and builds a lookup table keyed on the Country and City fields. data.tsv is then read, and lines with a matching key are written to standard output with the Population and Elevation fields from filter.tsv appended. This is an inner join. Left outer joins and anti-joins are also supported.
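The lookup-then-stream behavior described above can be sketched in Python. This is only an illustration of the general hash-join idea, not tsv-join's actual implementation (which is written in D). The function name `inner_join`, the helper `split_tsv`, and the sample rows are all made up for this sketch, and keeping the first entry for a duplicated filter key is a simplification assumed here.

```python
def split_tsv(line):
    """Split one TSV line into fields (no quoting or escaping is handled)."""
    return line.rstrip("\n").split("\t")


def inner_join(filter_lines, data_lines, key_fields, append_fields):
    """Return data lines whose key is in the filter, with filter fields appended."""
    filter_iter = iter(filter_lines)
    filter_header = split_tsv(next(filter_iter))
    key_idx = [filter_header.index(name) for name in key_fields]
    append_idx = [filter_header.index(name) for name in append_fields]

    # Build the in-memory lookup table: key tuple -> fields to append.
    lookup = {}
    for line in filter_iter:
        row = split_tsv(line)
        key = tuple(row[i] for i in key_idx)
        # Assumption for this sketch: the first entry for a key wins.
        lookup.setdefault(key, [row[i] for i in append_idx])

    # Stream the data, looking up each key by header name, not position.
    data_iter = iter(data_lines)
    data_header = split_tsv(next(data_iter))
    data_key_idx = [data_header.index(name) for name in key_fields]

    out = ["\t".join(data_header + list(append_fields))]
    for line in data_iter:
        row = split_tsv(line)
        key = tuple(row[i] for i in data_key_idx)
        if key in lookup:
            out.append("\t".join(row + lookup[key]))
    return out


# Made-up sample rows, for illustration only.
FILTER = [
    "Country\tCity\tPopulation\tElevation",
    "Japan\tTokyo\t100\t40",
    "Peru\tLima\t200\t154",
]
DATA = [
    "City\tCountry\tVisits",
    "Tokyo\tJapan\t12",
    "Oslo\tNorway\t3",
    "Lima\tPeru\t7",
]

for joined in inner_join(FILTER, DATA, ["Country", "City"], ["Population", "Elevation"]):
    print(joined)
```

Only the filter rows are held in the dictionary. The data rows are handled one at a time, which is why the data side can be far larger than the filter side.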

Common uses for tsv-join are to join related datasets or to filter one dataset based on another. Filter file entries are kept in memory, which limits the size that can be handled effectively. The author has found that filter files up to about 10 million lines are processed effectively, but performance starts to degrade after that.
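The "filter one dataset based on another" use case can be sketched as a key-set lookup. This is a rough illustration under stated assumptions, not the tool's code. The function `filter_by_keys` and its `key_index` and `exclude` parameters are hypothetical names chosen for this sketch, and the inputs are assumed to have no header line.

```python
def filter_by_keys(filter_lines, data_lines, key_index=0, exclude=False):
    """Yield data lines whose key field is (or, with exclude, is not) in the filter."""
    # Only the filter keys are held in memory, so memory use grows
    # with the filter size rather than with the data size.
    keys = {line.rstrip("\n").split("\t")[key_index] for line in filter_lines}

    # Data lines are streamed one at a time.
    for line in data_lines:
        fields = line.rstrip("\n").split("\t")
        matched = fields[key_index] in keys
        if matched != exclude:
            yield line.rstrip("\n")


# Made-up sample rows, for illustration only.
wanted = ["b", "c"]
records = ["a\t1", "b\t2", "c\t3"]

print(list(filter_by_keys(wanted, records)))                # keep matches
print(list(filter_by_keys(wanted, records, exclude=True)))  # keep non-matches (anti-join)
```

Because the key set is the only structure that grows, the size of the filter file, not the data file, is what determines memory use in this sketch.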

See the tsv-join reference for details.
