Join two tables by a fuzzy comparison of text columns.
- Command line utility to quickly join CSV files.
- Ngram blocking to reduce the total number of comparisons.
- Pure Python Levenshtein edit distance using pylev.
- Fast Levenshtein edit distance using editdistance.
- License: MIT
- Pure Python:
pip install fuzzyjoin
- Optimized:
pip install fuzzyjoin[fast]
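The two install variants differ in which Levenshtein backend is available at runtime. A minimal sketch of that kind of fallback is shown here; it is not taken from the package source. The wrapper name `levenshtein` is hypothetical, and the final pure-Python branch is an addition so the example runs with neither package installed.

```python
# Sketch: prefer the C-backed editdistance package, then pylev, and finally
# a small built-in implementation (added for this example only).
try:
    import editdistance

    def levenshtein(a: str, b: str) -> int:
        return editdistance.eval(a, b)

except ImportError:
    try:
        import pylev

        def levenshtein(a: str, b: str) -> int:
            return pylev.levenshtein(a, b)

    except ImportError:

        def levenshtein(a: str, b: str) -> int:
            # Classic dynamic-programming edit distance, one row at a time.
            previous = list(range(len(b) + 1))
            for i, char_a in enumerate(a, start=1):
                current = [i]
                for j, char_b in enumerate(b, start=1):
                    cost = 0 if char_a == char_b else 1
                    current.append(min(
                        previous[j] + 1,         # deletion
                        current[j - 1] + 1,      # insertion
                        previous[j - 1] + cost,  # substitution
                    ))
                previous = current
            return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

All three branches return the same distance, so the choice of backend affects speed only.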
The goal of this package is to provide a quick and convenient way to
join two tables on a pair of text columns, which often contain variations
of names for the same entity. fuzzyjoin covers the simple and common case
of joining on a single column from each table, for datasets in the
thousands of records.
For a more sophisticated and comprehensive treatment of the topic that will allow you to join records using multiple fields, see the packages below:
\> fuzzyjoin --help
Usage: fuzzyjoin_cli.py [OPTIONS] LEFT_CSV RIGHT_CSV

  Inner join <left_csv> and <right_csv> by a fuzzy comparison of
  <left_field> and <right_field>.

Options:
  -f, --fields TEXT...   <left_field> <right_field>  [required]
  -t, --threshold FLOAT  Only return matches above this score.  [default: 0.7]
  -o, --output TEXT      File to write the matches to.
  --multiples TEXT       File for left IDs with multiple matches.
  --exclude TEXT         Function used to exclude records. See:
                         <fuzzyjoin.compare.default_exclude>
  --collate TEXT         Function used to collate <fields>. See:
                         <fuzzyjoin.collate.default_collate>
  --compare TEXT         Function used to compare records. See:
                         <fuzzyjoin.compare.default_compare>
  --numbers-exact        Numbers and order must match exactly.
  --numbers-permutation  Numbers must match but may be out of order.
  --numbers-subset       Numbers must be a subset.
  --ngram-size INTEGER   The ngram size to create blocks with.  [default: 3]
  --no-progress          Do not show comparison progress.
  --debug                Exit to PDB on exception.
  --yes                  Yes to all prompts.
  --help                 Show this message and exit.
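To make the `--threshold` option concrete, here is one common way to turn an edit distance into a score between 0 and 1. This is a sketch under an assumption: the normalization by the longer string's length is illustrative, and the package's `default_compare` may compute its score differently. The function name `similarity` is hypothetical.

```python
def similarity(distance: int, left: str, right: str) -> float:
    """Map an edit distance onto a 0..1 score (1.0 means identical)."""
    longest = max(len(left), len(right))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - distance / longest


THRESHOLD = 0.7  # the CLI default for --threshold

# "Jon Smith" -> "John Smith" needs 1 insertion: score 1 - 1/10 = 0.9
print(similarity(1, "Jon Smith", "John Smith") > THRESHOLD)  # True
# "kitten" -> "sitting" needs 3 edits: score 1 - 3/7, about 0.57
print(similarity(3, "kitten", "sitting") > THRESHOLD)  # False
```

Under this normalization, raising the threshold toward 1.0 keeps only near-identical pairs, while lowering it admits looser matches.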
# Use field `name` from left.csv and field `full_name` from right.csv
\> fuzzyjoin --fields name full_name left.csv right.csv
# Export rows with multiple matches from left.csv to a separate file.
\> fuzzyjoin --multiples multiples.csv --fields name full_name left.csv right.csv
# Increase the ngram size, which reduces execution time but means tokens shorter than
# `ngram_size` can no longer produce matches.
\> fuzzyjoin --ngram-size 5 --fields name full_name left.csv right.csv
# Ensure any numbers that appear are in both fields and in the same order.
\> fuzzyjoin --numbers-exact --fields name full_name left.csv right.csv
# Ensure any numbers that appear are in both fields but may be in a different order.
\> fuzzyjoin --numbers-permutation --fields name full_name left.csv right.csv
# Ensure numbers that appear in one field are at least a subset of the other.
\> fuzzyjoin --numbers-subset --fields name full_name left.csv right.csv
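The three number options above can be read as three predicates over the numbers found in each field. The sketch below illustrates that reading and is not the package's implementation. The function names are hypothetical, and treating a "number" as a run of digits is an assumption.

```python
import re
from collections import Counter


def numbers(text: str) -> list:
    """Return the runs of digits in text, in order of appearance."""
    return re.findall(r"\d+", text)


def numbers_exact(left: str, right: str) -> bool:
    # Same numbers in the same order.
    return numbers(left) == numbers(right)


def numbers_permutation(left: str, right: str) -> bool:
    # Same numbers in any order (duplicates are counted).
    return Counter(numbers(left)) == Counter(numbers(right))


def numbers_subset(left: str, right: str) -> bool:
    # Every number on one side also appears on the other side.
    a, b = set(numbers(left)), set(numbers(right))
    return a <= b or b <= a


a, b = "Unit 4, 12 Main St", "12 Main St Unit 4"
print(numbers_exact(a, b))                  # False: order differs
print(numbers_permutation(a, b))            # True: same numbers
print(numbers_subset("12 Main St", a))      # True: {12} is within {4, 12}
```

Each mode is stricter than the next: an exact match is also a permutation, and a permutation also satisfies the subset check.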
# Use importable function `package.func` from PATH as the comparison function
# instead of `fuzzyjoin.compare.default_compare`.
\> fuzzyjoin --compare package.func --fields name full_name left.csv right.csv
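The `--ngram-size` example above relies on ngram blocking: only rows that share at least one ngram are compared at all. The following is a minimal sketch of the idea, not the package's code. The names `ngrams` and `candidate_pairs` are hypothetical, and lowercasing and splitting on whitespace are assumptions.

```python
from collections import defaultdict
from typing import List, Set, Tuple


def ngrams(text: str, size: int = 3) -> Set[str]:
    """Return the character ngrams of each whitespace-separated token."""
    grams = set()
    for token in text.lower().split():
        # Tokens shorter than `size` contribute no ngrams at all.
        for i in range(len(token) - size + 1):
            grams.add(token[i:i + size])
    return grams


def candidate_pairs(left: List[str], right: List[str],
                    size: int = 3) -> Set[Tuple[int, int]]:
    # Index right-hand rows by every ngram they contain.
    blocks = defaultdict(set)
    for j, text in enumerate(right):
        for gram in ngrams(text, size):
            blocks[gram].add(j)
    # A left row is only paired with right rows sharing at least one ngram.
    pairs = set()
    for i, text in enumerate(left):
        for gram in ngrams(text, size):
            for j in blocks.get(gram, ()):
                pairs.add((i, j))
    return pairs


left = ["Acme Corp", "Globex"]
right = ["ACME Corporation", "Initech", "Globex Inc"]
print(sorted(candidate_pairs(left, right)))          # [(0, 0), (1, 2)]
print(sorted(candidate_pairs(left, right, size=5)))  # [(1, 2)]
```

With the default size of 3, only 2 of the 6 possible pairs need a full comparison. With size 5, "Acme" and "Corp" are too short to produce any ngram, so that row finds no candidates, which is the trade-off described in the `--ngram-size` example.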
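from fuzzyjoin import io
# Assumption: Options is importable from the top-level package;
# adjust this import if it lives in another module.
from fuzzyjoin import Options

# Specify which field to use from the left and right CSV files.
options = Options(
    field_1='name',
    field_2='full_name'
)

# Inner join the two files, then write the matched rows to a CSV file.
matches = io.inner_join_csv_files('left.csv', 'right.csv', options)
io.write_matches(matches, output_file='matches.csv')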
- Test transformation and exclude functions.
- Implement left join and full join.
- Check that the ID is actually unique.
- Add documentation.
- Option to rename headers and disambiguate duplicate header names.
- Fix API Usage history docs in README.
- Update usage docs.
- Rename key_* params to field_* for consistency.
- Removed ID field requirement.
- Options parameter.
- Fix function defaults.
- Minor optimizations.
- Additional CLI parameters.
- Cleanup checks.
- Include basic installation instructions.
- Minor README updates.
- Use editdistance if available, otherwise fall back to pylev.
- Report progress by default.
- Number comparison options.
- Renamed get_multiples to filter_multiples.
- Additional docs and tests.
- Write multiple matches to a separate file.
- Added types and docstrings.
- Duplicate release of 0.1.1
- First release on PyPI.