Releases: dathere/qsv
0.21.0
MAJOR NEW FEATURES
- added `apply geocode` caching, more than doubling performance in the geocode benchmark.
- added `--random` and `--seed` options to `sort` command from @pjsier, enabling reproducible, randomized "scrambling" of CSVs.
- Bash shell qsv tab completion
- additional `apply operations` subcommands:
  - Match Trim operations - enables trimming of more than just whitespace, but also of multiple trim characters in one pass:
    - mtrim: Trims `--comparand` matches left & right of the string (trim_matches wrapper)
    - mltrim: Left trim `--comparand` matches (trim_start_matches wrapper)
    - mrtrim: Right trim `--comparand` matches (trim_end_matches wrapper)
  - replace: Replace all matches of a pattern (using `--comparand`) with a string (using `--replacement`) (Std::String replace wrapper).
  - regex_replace: Replace the leftmost-first regex match with `--replacement` (regex replace wrapper).
  - titlecase: capitalizes English text using Daring Fireball titlecase style https://daringfireball.net/2008/05/title_case
  - censor_check: check if profanity is detected (boolean)
  - censor: profanity filter
- added parameter validation to `apply operations` subcommands
- added more robust parameter validation to `apply` command by leveraging docopt
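The Match Trim and replace operations above are described as wrappers around Rust's standard string methods. The following is a minimal sketch of that idea, not qsv's actual code: the function names mirror the operation names, and treating the comparand as a set of individual characters to trim is an assumption based on the "multiple trim characters in one pass" wording.

```rust
/// Trim every leading and trailing character that appears in `comparand`.
/// Assumption: the comparand is treated as a set of characters.
fn mtrim<'a>(s: &'a str, comparand: &str) -> &'a str {
    let chars: Vec<char> = comparand.chars().collect();
    s.trim_matches(&chars[..])
}

/// Left trim only (wraps `str::trim_start_matches`).
fn mltrim<'a>(s: &'a str, comparand: &str) -> &'a str {
    let chars: Vec<char> = comparand.chars().collect();
    s.trim_start_matches(&chars[..])
}

/// Right trim only (wraps `str::trim_end_matches`).
fn mrtrim<'a>(s: &'a str, comparand: &str) -> &'a str {
    let chars: Vec<char> = comparand.chars().collect();
    s.trim_end_matches(&chars[..])
}

/// Replace all literal matches of `comparand` with `replacement`
/// (wraps `str::replace`).
fn replace_all(s: &str, comparand: &str, replacement: &str) -> String {
    s.replace(comparand, replacement)
}

fn main() {
    println!("{}", mtrim("xx--hello--xx", "x-")); // hello
    println!("{}", mltrim("xx--hello--xx", "x-")); // hello--xx
    println!("{}", mrtrim("xx--hello--xx", "x-")); // xx--hello
    println!("{}", replace_all("a-b-c", "-", "+")); // a+b+c
}
```

Passing a `&[char]` slice as the pattern is what lets a single call strip several different characters at once, which plain `trim()` cannot do.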
More benchmark script improvements:
- allow binary to be changed, so users can benchmark xsv and other xsv forks by simply replacing the $bin shell variable
- now uses a much larger data file - a 1M row, 512 MB, 41 column sampling of NYC's 311 data
See CHANGELOG for details.
0.20.0
MAJOR NEW FEATURES
- major refactoring of `apply` command:
  - to take advantage of docopt parsing/validation.
  - instead of one big command, broke down apply to several subcommands:
    - operations
    - emptyreplace
    - datefmt
    - geocode
- added string similarity operations to `apply` command:
  - simdl: Damerau-Levenshtein similarity
  - simdln: Normalized Damerau-Levenshtein similarity (between 0.0 & 1.0)
  - simjw: Jaro-Winkler similarity (between 0.0 & 1.0)
  - simsd: Sørensen-Dice similarity (between 0.0 & 1.0)
  - simhm: Hamming distance. Number of positions where characters differ.
  - simod: OSA Distance.
  - soundex: sounds like (boolean)
- added progress bars to commands that may spawn long-running jobs - for this release, `apply`, `foreach`, and `lua`. Progress bars can be suppressed with `--quiet` option.
- added progress bar helper functions to utils.rs.
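To make two of the similarity measures listed above concrete, here is a small standard-library sketch of Hamming distance and Sørensen-Dice similarity. It is illustrative only and is not how qsv computes them: the function names are hypothetical, and the Dice variant below compares sets of character bigrams, whereas other implementations may count repeated bigrams or ignore whitespace.

```rust
use std::collections::HashSet;

/// Hamming distance: number of positions where the characters differ.
/// Returns None for strings of different length, where it is undefined.
fn hamming(a: &str, b: &str) -> Option<usize> {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b.iter()).filter(|(x, y)| x != y).count())
}

/// Set of adjacent character pairs in `s`.
fn bigrams(s: &str) -> HashSet<(char, char)> {
    let chars: Vec<char> = s.chars().collect();
    chars.windows(2).map(|w| (w[0], w[1])).collect()
}

/// Sørensen-Dice similarity over bigram sets: 2 * |X ∩ Y| / (|X| + |Y|).
/// The result is always between 0.0 and 1.0.
fn sorensen_dice(a: &str, b: &str) -> f64 {
    let (x, y) = (bigrams(a), bigrams(b));
    if x.is_empty() && y.is_empty() {
        // Strings too short to have bigrams: fall back to exact equality.
        return if a == b { 1.0 } else { 0.0 };
    }
    let shared = x.intersection(&y).count();
    2.0 * shared as f64 / (x.len() + y.len()) as f64
}

fn main() {
    println!("{:?}", hamming("karolin", "kathrin")); // Some(3)
    println!("{}", sorensen_dice("night", "nacht")); // 0.25
}
```

In the second call, "night" and "nacht" each have four bigrams and share only "ht", giving 2 * 1 / 8 = 0.25.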
Benchmark improvements:
- added `apply` to benchmarks.
- added sample NYC 311 data to benchmarks.
- added records per second (RECS_PER_SEC) to benchmarks
See CHANGELOG for details.
0.19.0
MAJOR NEW FEATURES
- new `scramble` command. Randomly scrambles a CSV's records.
- read/write buffer capacity can now be set using environment variables `QSV_RDR_BUFFER_CAPACITY` and `QSV_WTR_BUFFER_CAPACITY` (in bytes).
- benchmark script revamped. Now produces aligned output onscreen, while also creating a benchmark TSV file; downloads the sample file from GitHub; benchmarks more commands. Designed to help users tailor and maximize qsv's performance in their environment.
- added a Performance Tuning section in the README.
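As a rough illustration of how a buffer capacity might be picked up from an environment variable such as `QSV_RDR_BUFFER_CAPACITY`, the sketch below parses a byte count and falls back to a default. Only the variable name comes from the notes above; the helper names, the fallback behavior for invalid values, and the 16 KiB default are hypothetical and do not reflect qsv's actual implementation.

```rust
use std::env;
use std::io::{BufReader, Read};

// Hypothetical default; qsv's real default is not stated in these notes.
const DEFAULT_RDR_CAPACITY: usize = 16 * 1024;

/// Parse a capacity in bytes, falling back to `default` when the value
/// is missing, not a number, or zero.
fn parse_capacity(raw: Option<&str>, default: usize) -> usize {
    raw.and_then(|v| v.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(default)
}

/// Read the capacity from the named environment variable.
fn capacity_from_env(var: &str, default: usize) -> usize {
    parse_capacity(env::var(var).ok().as_deref(), default)
}

fn main() -> std::io::Result<()> {
    let cap = capacity_from_env("QSV_RDR_BUFFER_CAPACITY", DEFAULT_RDR_CAPACITY);

    // An in-memory byte slice stands in for a CSV file here.
    let data = b"a,b\n1,2\n";
    let mut rdr = BufReader::with_capacity(cap, &data[..]);

    let mut out = String::new();
    rdr.read_to_string(&mut out)?;
    println!("capacity={} bytes, read {} bytes", rdr.capacity(), out.len());
    Ok(())
}
```

With the sketch compiled to a binary, a user would set the variable for a single run, for example `QSV_RDR_BUFFER_CAPACITY=65536 ./sketch`, to experiment with larger buffers.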
See CHANGELOG for details.
0.18.2
0.18.1
0.18.0
MAJOR NEW FEATURES
- `stats` `mode` is now also multi-modal - i.e. returns multiple modes when detected. e.g. mode[1,1,2,2,3,4,6,6] will return [1,2,6]. It will continue to return one mode if only one is detected.
- `stats` `quartile` now also computes IQR, lower/upper fences and skew (using Pearson's median skewness). For code simplicity, skew is calculated as part of `quartile`.
- `join` now also supports `left-semi` and `left-anti` joins, the same way Spark does.
- `search` `--flag` option now returns row number, not just '1'.
- `searchset` `--flag` option now returns row number, followed by a semicolon, and a list of matching regexes.
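The multi-modal `mode` behavior and the IQR fences can be sketched in a few lines. This is a hedged illustration, not qsv's code: the function names are hypothetical, the fence multiplier of 1.5 is the conventional Tukey value and is assumed here, and the quartiles are taken as inputs because quartile estimation methods differ between tools.

```rust
use std::collections::BTreeMap;

/// Return every value that occurs with the highest frequency, in ascending order.
/// Note: when all values occur once, this sketch returns all of them;
/// a real implementation may choose to handle that case differently.
fn modes(values: &[i64]) -> Vec<i64> {
    let mut counts: BTreeMap<i64, usize> = BTreeMap::new();
    for &v in values {
        *counts.entry(v).or_insert(0) += 1;
    }
    let max = counts.values().copied().max().unwrap_or(0);
    counts
        .into_iter()
        .filter(|&(_, c)| c == max)
        .map(|(v, _)| v)
        .collect()
}

/// Given the first and third quartiles, return (IQR, lower fence, upper fence).
/// Assumes the conventional 1.5 * IQR fences.
fn fences(q1: f64, q3: f64) -> (f64, f64, f64) {
    let iqr = q3 - q1;
    (iqr, q1 - 1.5 * iqr, q3 + 1.5 * iqr)
}

fn main() {
    println!("{:?}", modes(&[1, 1, 2, 2, 3, 4, 6, 6])); // [1, 2, 6]
    println!("{:?}", modes(&[5, 5, 7])); // [5]
    println!("{:?}", fences(2.0, 6.0)); // (4.0, -4.0, 12.0)
}
```

The first call reproduces the example from the release note: 1, 2 and 6 each occur twice, so all three are reported as modes.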
See CHANGELOG for details.
0.17.3
0.17.2
0.17.1
0.17.0
MAJOR NEW FEATURES
- `searchset` command. Match multiple regexes in a single pass.
- `unicode` option on `search`, `searchset` and `replace` commands. Previously, regex unicode support was on by default, which comes at the cost of performance for these "expensive" regex operations. Unicode support is now off by default for these commands. "Inexpensive" regex operations (apply, select, partition, foreach), which primarily scan headers and do input validation, still have unicode support on by default.
- `stats` now has quartiles and a new, faster variance algorithm that also eliminates intermittent unit test failures on macOS.
See CHANGELOG for details.