Releases: dathere/qsv

0.123.0

05 Mar 14:18
b833e47

OPEN DATA DAY 2024 Release! 🎉🎉🎉

In celebration of Open Data Day, we're releasing qsv 0.123.0 - the biggest release ever with 330+ commits! qsv 0.123.0 continues to focus on performance, stability and reliability as we continue setting the stage for qsv's big brother - qsv pro.

We've been baking qsv pro for a while now, and it's almost ready for release. qsv pro is a cross-platform Desktop Data Wrangling tool marrying an Excel-like UI with the power of qsv, backed by a cloud-based data cleaning, enrichment and enhancement service that's easy to use for casual Excel users and Data Publishers, yet powerful enough for data scientists and data engineers.

Stay tuned!

Highlights:

# with fast path optimization turned off
/usr/bin/time qsv sqlp taxi.csv --no-optimizations "select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID"
VendorID,total_amount
1,52377417.52985942
2,89959869.13054822
4,600584.610000027
(3, 2)
        6.09 real         6.82 user         0.16 sys

# with fast path optimization, fully exploiting Polars' multithreaded, mem-mapped CSV reader!
 /usr/bin/time qsv sqlp taxi.csv "select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID"
VendorID,total_amount
1,52377417.52985942
2,89959869.13054822
4,600584.610000027
(3, 2)
        0.14 real         1.09 user         0.09 sys

# in contrast, csvq takes 72.46 seconds - 517.57x slower
/usr/bin/time csvq "select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID"
+----------+---------------------+
| VendorID |  SUM(total_amount)  |
+----------+---------------------+
| 1        |  52377417.529256366 |
| 2        |    89959869.1264675 |
| 4        |   600584.6099999828 |
+----------+---------------------+
       72.46 real        65.15 user        75.17 sys

"Traditional" SQL engines

qsv and csvq both operate on "bare" CSVs. For comparison, let's contrast qsv's performance against "traditional" SQL engines
that require setup and import (aka ETL). Not counting setup and import time (which alone takes several minutes), we get:

sqlite 3.43.2 takes 2.910 seconds - 20.79x slower

sqlite> .timer on
sqlite> select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID;
1,52377417.53
2,89959869.13
4,600584.61
Run Time: real 2.910 user 2.569494 sys 0.272972

PostgreSQL 15.6 using PgAdmin 4 v6.12 takes 18.527 seconds - 132.34x slower

even with an index, qsv sqlp is still 5.96x faster
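
The slowdown factors quoted above are each engine's wall-clock ("real") time divided by the 0.14 seconds qsv sqlp took on the fast path. A minimal Python sketch of that arithmetic, using only the timings reported in this release note (the names `timings` and `slowdown` are illustrative):

```python
# Wall-clock ("real") seconds as reported in the benchmarks above.
QSV_FAST_PATH = 0.14

timings = {
    "qsv sqlp --no-optimizations": 6.09,
    "csvq": 72.46,
    "sqlite 3.43.2": 2.910,
    "PostgreSQL 15.6": 18.527,
}

def slowdown(seconds: float, baseline: float = QSV_FAST_PATH) -> float:
    """Return how many times slower `seconds` is than the baseline."""
    return round(seconds / baseline, 2)

factors = {name: slowdown(t) for name, t in timings.items()}
for name, factor in factors.items():
    print(f"{name}: {factor}x slower than qsv sqlp with the fast path")
```

The timing of the indexed PostgreSQL run is only shown in a screenshot, so it is left out of the sketch.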

  • sqlp now supports JSONL output format and adds compression support for Avro and Arrow output formats.
  • fetch now has a --disk-cache option, so you can cache web service responses to disk, complete with cache control and expiry handling!
  • jsonl is now multithreaded with additional --batch and --job options.
  • split now has three modes: split by record count, split by number of chunks and split by file size.
  • datefmt is a new top-level command for date formatting. We extracted it from apply to make it easier to use, and to set the stage for expanded date and timezone handling.
  • enum now has a --start option.
  • excel now has a --keep-zero-time option and now has improved datetime/duration parsing/handling with upgrade of calamine from 0.23 to 0.24.
  • tojsonl now has --trim and --no-boolean options, and no longer produces false positive boolean inferences.
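
To illustrate the size-based mode of split, the sketch below greedily packs rows into chunks whose encoded size stays at or under a byte budget. It is a simplified Python illustration of the idea, not qsv's implementation; the helper names `encoded_size` and `chunk_rows_by_size`, and the choice to count the header toward every chunk, are assumptions made for the example.

```python
import csv
import io

def encoded_size(row):
    """Size in bytes of one row once written as a CSV line."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(row)
    return len(buf.getvalue().encode("utf-8"))

def chunk_rows_by_size(header, rows, max_bytes):
    """Greedily group rows so each chunk (header included) is <= max_bytes.

    A single row too large to fit alongside the header still gets its own
    chunk, so that no data is dropped.
    """
    header_size = encoded_size(header)
    chunks, current, current_size = [], [], header_size
    for row in rows:
        size = encoded_size(row)
        # Close the current chunk before it would exceed the budget.
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], header_size
        current.append(row)
        current_size += size
    if current:
        chunks.append(current)
    return chunks

header = ["id", "name"]
rows = [[str(i), "x" * 10] for i in range(10)]
chunks = chunk_rows_by_size(header, rows, max_bytes=50)
print([len(c) for c in chunks])
```

Splitting by record count or by number of chunks only needs row counts; the size-based mode is the one that has to measure each row as it is written.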

Added

  • apply: add gender_guess operation #1569
  • datefmt: new top-level command for date formatting. #1638
  • enum: add --start option #1631
  • excel: added --keep-zero-time option; improved datetime/duration parsing/handling with upgrade of calamine from 0.23 to 0.24 #1595
  • fetch: add --disk-cache option #1621
  • jsonl: major performance refactor! Now multithreaded with additional --batch and --job options #1553
  • sniff: added additional mimetype/file formats detected by bumping file-format from 0.23 to 0.24 #1589
  • split: add <outdir> error handling and add usage text examples #1585
  • split: added --chunks option #1587
  • split: add --kb-size option #1613
  • sqlp: added JSONL output format and compression support for AVRO and Arrow output formats in #1635
  • tojsonl: add --trim option #1554
  • Add QSV_DOTENV_PATH env var #1562
  • Add license scan report and status by @fossabot in #1550
  • Added several benchmarks for new/changed commands

Changed

  • luau: bumped Luau from 0.606 to 0.614
  • freq: major performance refactor - 1a3a4b4
  • split: migrate to rayon from threadpool #1555
  • split: refactored to actually create chunks <= desired --kb-size, obviating need for hacky --sep-factor option #1615
  • tojsonl: improved true/false boolean inferencing false positive handling #1641
  • tojsonl: fine-tune boolean inferencing #1643
  • schema: use parallel sort when sorting enums for fields 523c60a
  • Use array for rustflags to avoid conflicts with user flags by @clarfonthey in #1548
  • Make it easier and more consistent to package for distros by @alerque in #1549
  • Replace simple_home_dir with simple_expand_tilde crate #1578
  • build(deps): bump rayon from 1.8.0 to 1.8.1 by @dependabot in #1547
  • build(deps): bump rayon from 1.8.1 to 1.9.0 by @dependabot in #1623
  • build(deps): bump uuid from 1.6.1 to 1.7.0 by @dependabot in #1551
  • build(deps): bump jql-runner from 7.1.2 to 7.1.3 by @dependabot in #1552
  • build(deps): bump jql-runner from 7.1.3 to 7.1.5 by @dependabot in #1602
  • build(deps): bump jql-runner from 7.1.5 to 7.1.6 by @dependabot in #1637
  • build(deps): bump flexi_logger from 0.27.3 to 0.27.4 by @dependabot in #1556
  • build(deps): bump regex from 1.10.2 to 1.10.3 by @dependabot in #1557
  • build(deps): bump cached from 0.47.0 to 0.48.0 by @dependabot in #1558
  • build(deps): bump cached from 0.48.0 to 0.48.1 by @dependabot in #1560
  • build(deps): bump cached from 0.48.1 to 0.49.2 by @dependabot in #1618
  • build(deps): bump chrono from 0.4.31 to 0.4.32 by @dependabot in #1559
  • build(deps): bump chrono from 0.4.32 to 0.4.33 by @dependabot in #1566
  • build(deps): bump mlua from 0.9.4 to 0.9.5 by @dependabot in #1565
  • build(deps): bump mlua from 0.9.5 to 0.9.6 by @dependabot in #1632
  • build(deps): bump serde from 1.0.195 to 1.0.196 by @dependabot in #1568
  • build(deps): bump serde from 1.0.196 to 1.0.197 by @dependabot in #1612
  • build(deps): bump serde_json from 1.0.111 to 1.0.112 by @dependabot in #1567
  • build(deps): bump serde_json from 1.0.112 to 1.0.113 by @dependabot in #1576
  • build(deps): bump serde_json from 1.0.113 to 1.0.114 by @dependabot in #1610
  • bump Polars from 0.36 to 0.37 #1570
  • build(deps): bump polars from 0.37.0 to 0.38.0 by @dependabot in #1629
  • build(deps): bump polars from 0.38.0 to 0.38.1 by @dependabot in #1634
  • build(deps): bump strum from 0.25.0 to 0.26.1 by @dependabot in #1572
  • build(deps): bump indexmap from 2.1.0 to 2.2.1 by @dependabot in https://g...
  1. measurements taken on an Apple Mac Mini 2023 model with an M2 Pro chip with 12 CPU cores & 32GB of RAM, running macOS Sonoma 14.4

0.122.0

17 Jan 04:54
4ff43bc

👉 REQUEST FOR USE CASES: 👈

Please help define the future of qsv.
Add what you're currently using qsv for here - #1529

Not only does it help us catalog what use cases we should optimize for, posters will get higher priority access to the qsv pro preview.

Highlights:

  • qsvpy is now available in the prebuilt binaries for select platforms! It's a new qsv binary variant with the python feature, enabling the py command. Three subvariants are available - qsvpy310, qsvpy311 and qsvpy312, corresponding to Python 3.10, 3.11 and 3.12 respectively.
  • Removed the generate command, as its main dependency is unmaintained and has old dependencies. generate was also not used much: the test data it generated was not well suited for training models and it was too slow, so we decided to remove it even before the synthesize (#235) command is ready.
  • reverse now has index support and can work in "streaming" mode and handle larger than memory CSV files.
  • sort and sample: users can now choose from three Random Number Generator (RNG) algorithms with the --rng option - standard, faster & cryptosecure.
  • pseudo now has --start, --increment & --formatstr options.
  • fmt now has a --no-final-newline option to suppress the final newline for better interoperability with other tools, specifically Excel. It also treats "T" as a special value for the tab character in the --out-delimiter option.
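
A rough Python analogue of what these two fmt options produce, offered as an illustration and not as a description of qsv's internals. The helper `format_rows` and its parameter names are hypothetical: it maps the special value "T" to a tab delimiter and can drop the trailing newline.

```python
import csv
import io

def format_rows(rows, out_delimiter=",", final_newline=True):
    """Render rows as delimited text.

    "T" is treated as shorthand for the tab character, mirroring the
    behavior described for --out-delimiter.
    """
    delimiter = "\t" if out_delimiter == "T" else out_delimiter
    buf = io.StringIO()
    csv.writer(buf, delimiter=delimiter, lineterminator="\n").writerows(rows)
    text = buf.getvalue()
    # Mirror --no-final-newline by trimming only the last line terminator.
    if not final_newline and text.endswith("\n"):
        text = text[:-1]
    return text

rows = [["a", "b"], ["1", "2"]]
output = format_rows(rows, out_delimiter="T", final_newline=False)
print(repr(output))
```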

Added

  • reverse: now has index support and can work in "streaming" mode #1531
  • sort: added --rng <kind> for different kinds of RNGs - standard, faster & cryptosecure #1535
  • sample: added --rng <kind> option (standard, faster & cryptosecure) #1532
  • pseudo: major refactor. Added --start, --increment & --formatstr options #1541
  • fmt: add --no-final-newline option #1545
  • added additional benchmarks
  • added additional tests for new options. We now have ~1,300 tests!

Changed

  • fmt: --out-delimiter now treats "T" as a special value for the tab character #1546
  • build(deps): bump whatlang from 0.16.3 to 0.16.4 by @dependabot in #1525
  • build(deps): bump serde_json from 1.0.110 to 1.0.111 by @dependabot in #1524
  • build(deps): bump pyo3 from 0.20.1 to 0.20.2 by @dependabot in #1526
  • build(deps): bump sysinfo from 0.30.3 to 0.30.4 by @dependabot in #1523
  • build(deps): bump sysinfo from 0.30.4 to 0.30.5 by @dependabot in #1530
  • build(deps): bump serial_test from 2.0.0 to 3.0.0 by @dependabot in #1534
  • build(deps): bump mlua from 0.9.2 to 0.9.3 by @dependabot in #1540
  • build(deps): bump mlua from 0.9.3 to 0.9.4 by @dependabot in #1542
  • build(deps): bump simple-home-dir from 0.2.1 to 0.2.3 by @dependabot in #1544
  • apply select clippy suggestions
  • update several indirect dependencies

Removed

  • removed generate command #1527
  • removed generate feature from GitHub Action workflows #1528
  • sample: removed --faster RNG sampling option, replacing it with --rng #1532

Full Changelog: 0.121.0...0.122.0

0.121.0

03 Jan 13:24
12957d3

Two days ago, qsv 0.120.0 was released. Hours later, significant updates occurred in our ecosystem: Polars upgraded to version 0.36, Homebrew rolled out support for Rust 1.75.0, and our pull request for 'cached' was merged.

In light of these developments, we're releasing 0.121.0 out of cycle to leverage the new features, fixes and performance enhancements in these key components integral to qsv.


👉 REQUEST FOR USE CASES: 👈
Please help define the future of qsv.
Add what you're currently using qsv for here - #1529

Not only does it help us catalog what use cases we should optimize for, posters will get higher priority access to the qsv pro preview.


Added

  • sqlp: with Polars 0.36, it now supports:
  • sqlp: now supports writing to Apache Avro format 32f2fbb
  • sqlp: when writing to CSV --format, if the --output file has a TSV or TAB extension, it will automatically use the tab delimiter c97048c
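
The extension-based delimiter choice described in the last item can be sketched in a few lines of Python. This is an illustration only: the function name `delimiter_for_output` is hypothetical, and matching the extension case-insensitively is an assumption of the example, not something the release note states.

```python
from pathlib import Path

def delimiter_for_output(path: str, default: str = ",") -> str:
    """Pick a tab delimiter for TSV/TAB outputs, else the default.

    Assumption: the extension is compared case-insensitively.
    """
    ext = Path(path).suffix.lower()
    return "\t" if ext in {".tsv", ".tab"} else default

for name in ("result.csv", "result.tsv", "result.TAB"):
    print(name, repr(delimiter_for_output(name)))
```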

Changed

  • Bump polars from 0.35 to 0.36 #1521
  • build(deps): bump serde from 1.0.193 to 1.0.194 by @dependabot in #1520
  • build(deps): bump serde_json from 1.0.109 to 1.0.110 by @dependabot in #1519
  • build(deps): bump semver from 1.0.20 to 1.0.21 by @dependabot in #1518
  • build(deps): bump serde_stacker from 0.1.10 to 0.1.11 by @dependabot in #1517
  • build(deps): bump cached from 0.46.1 to 0.47.0 by @dependabot in #1522
  • bumped MSRV to 1.75.0

Fixed

  • cat: fixed performance regression in rowskey by moving unchanging variables out of hot loop - 96a40e9
  • sqlp: Polars 0.36 fixed the SQL SUBSTR() function

Full Changelog: 0.120.0...0.121.0

0.120.0

02 Jan 04:49
f1f19db

Happy New Year! 🎉🎉🎉
Here's the first release of 2024, the biggest ever with 280+ commits! qsv 0.120.0 continues to focus on performance, stability and reliability as we continue setting the stage for qsv's big brother - qsv pro.

Apart from wrapping qsv with a User Interface, qsv pro also comes with a retinue of related cloud-based data cleaning, enrichment and enhancement services along with expanded metadata inferencing to make your Data Useful, Usable and Used!

qsv pro draws inspiration from OpenRefine, but reimagined without its file size and speed limitations, with qsv pro having the ability to process multi-gigabyte files in seconds.

It incorporates hard lessons we learned in the past 12 years deploying Data Portals and Data Pipelines to create a new Data/Metadata Wrangling and AI-assisted Data Publishing service that's easy to use for casual Excel users and Data Publishers, yet powerful enough for data scientists and data engineers.

But it's not quite ready for release yet, so stay tuned!

We're now taking signups for a preview release however, so if you're interested, please sign up here!

Excitingly, qsv was also mentioned on Hacker News in this thread on Dec 23, 2023! As a result, we're now at almost 2,000 stars on GitHub, up from 900 stars on Dec 22! 🎉🎉🎉

Stay tuned for more advancements in 2024 – it's set to be a landmark year for qsv! 🦄🦄🦄


Added

  • cat: add rowskey --group options; increased perf of rowskey #1508
  • validate: add --trim and --quiet options #1452
  • apply & applydp: the regex_replace operation now supports an empty --replacement with the "<NULL>" special value #1470 and #1471
  • exclude: also consider rows with empty fields #1498
  • extsort: add --tmp-dir option ca1f461

Changed

  • validate: Faster RFC4180 validation with byterecords and SIMD-accelerated utf8 validation #1440
  • excel: minor performance tweaks #1446
  • apply, applydp, explode, geocode, pseudo: consolidate redundant code and use one replace_column_value helper fn in util.rs #1456
  • excel: bump calamine from 0.22 to 0.23 #1473
  • excel & joinp: use atoi_simd for faster &[u8] to int conversion 9521f3e
  • cat, describegpt, headers, sqlp, to, tojsonl: refactor commands that accept multiple input files to use improved process_input helper #1496
  • fetch & fetchpost: get_response refactor for maintainability and performance #1507
  • luau: replaced the --no-colindex option with a --colindex option. --colindex slows down processing and is not often used, so it is now opt-in instead of the default. a0c8568
  • make thousands crate optional with apply feature in #1453
  • build(deps): bump uuid from 1.6.0 to 1.6.1 by @dependabot in #1430
  • build(deps): bump serde from 1.0.192 to 1.0.193 by @dependabot in #1432
  • build(deps): bump data-encoding from 2.4.0 to 2.5.0 by @dependabot in #1435
  • build(deps): bump mlua from 0.9.1 to 0.9.2 by @dependabot in #1436
  • build(deps): bump url from 2.4.1 to 2.5.0 by @dependabot in #1437
  • build(deps): bump jql-runner from 7.0.6 to 7.0.7 by @dependabot in #1439
  • build(deps): bump jql-runner from 7.0.7 to 7.1.0 by @dependabot in #1447
  • build(deps): bump jql-runner from 7.1.0 to 7.1.1 by @dependabot in #1457
  • build(deps): bump jql-runner from 7.1.1 to 7.1.2 by @dependabot in #1486
  • build(deps): bump hashbrown from 0.14.2 to 0.14.3 by @dependabot in #1441
  • build(deps): bump redis from 0.23.3 to 0.23.4 by @dependabot in #1442
  • build(deps): bump redis from 0.23.3 to 0.24.0 by @dependabot in #1455
  • build(deps): bump atoi_simd from 0.15.3 to 0.15.4 by @dependabot in #1444
  • build(deps): bump atoi_simd from 0.15.4 to 0.15.5 by @dependabot in #1445
  • build(deps): bump atoi_simd from 0.15.5 to 0.15.6 by @dependabot in #1512
  • build(deps): bump actions/setup-python from 4.7.1 to 4.8.0 by @dependabot in #1454
  • build(deps): bump actions/setup-python from 4.8.0 to 5.0.0 by @dependabot in #1459
  • build(deps): bump actions/stale from 8 to 9 by @dependabot in #1463
  • build(deps): bump itoa from 1.0.9 to 1.0.10 by @dependabot in #1464
  • build(deps): bump tokio from 1.34.0 to 1.35.0 by @dependabot in #1465
  • build(deps): bump tokio from 1.35.0 to 1.35.1 by @dependabot in #1483
  • build(deps): bump ryu from 1.0.15 to 1.0.16 by @dependabot in #1466
  • build(deps): bump file-format from 0.22.0 to 0.23.0 by @dependabot in #1468
  • build(deps): bump github/codeql-action from 2 to 3 by @dependabot in #1476
  • build(deps): bump geosuggest-utils from 0.5.1 to 0.5.2 by @dependabot in #1479
  • build(deps): bump geosuggest-core from 0.5.1 to 0.5.2 by @dependabot in #1478
  • build(deps): bump reqwest from 0.11.22 to 0.11.23 by @dependabot in #1480
  • build(deps): bump calamine from 0.23.0 to 0.23.1 by @dependabot in #1481
  • build(deps): bump qsv-sniffer from 0.10.0 to 0.10.1 by @dependabot in #1484
  • build(deps): bump anyhow from 1.0.75 to 1.0.76 by @dependabot in #1485
  • build(deps): bump futures from 0.3.29 to 0.3.30 by @dependabot in #1492
  • build(deps): bump futures-util from 0.3.29 to 0.3.30 by @dependabot in #1491
  • build(deps): bump crossbeam-channel from 0.5.9 to 0.5.10 by @dependabot in #1490
  • build(deps): bump sysinfo from 0.29.10 to 0.29.11 by @dependabot in #1443
  • Bump sysinfo from 0.29.11 to 0.30 #1489
  • build(deps): bump sysinfo from 0.30.0 to 0.30.1 by @dependabot in #1495
  • build(deps): bump sysinfo from 0.30.1 to 0.30.2 by @dependabot in #1504
  • build(deps): bump sysinfo from 0.30.2 to 0.30.3 by @dependabot in #1509
  • build(deps): bump tabwriter from 1.3.0 to 1.4.0 by @dependabot in #1500
  • build(deps): bump tempfile from 3.8.1 to 3.9.0 by @dependabot in #1502
  • build(deps): bump qsv_docopt from 1.4.0 to 1.5.0 by @dependabot in #1503
  • build(deps): bump ahash from 0.8.6 to 0.8.7 by @dependabot in #1510
  • build(deps): bump serde_json from 1.0.108 to 1.0.109 by @dependabot in #1511
  • apply select clippy suggestions
  • update several indirect dependencies
  • pin Rust nightly to 2023-12-23

Fixed

  • apply: Fix for dynfmt and calcconv subcommands not working in release mode #1467
  • luau: fix check for excess mapped columns earlier. Otherwise, we'll get a CSV different field count error db15811

Removed

  • luau: remove unneeded --jit option as we precompile luau scripts to bytecode #1438

Full Changelog: 0.119.0...0.120.0

0.119.0

20 Nov 04:48
367987a

Highlights:

As we prepare for version 1.0, we're focusing on performance, stability and reliability as we set the stage for qsv pro - a cloud-backed UI version of qsv powered by Tauri, set to be released in 2024. Stay tuned!

  • diff is now out of beta and blazingly fast! Give "the fastest CSV-diff in the world" a try 😉!
  • joinp now supports snappy automatic compression/decompression!
  • sqlp & joinp now recognize the QSV_COMMENT_CHAR environment variable, allowing you to skip comment lines in your input CSV files. They're also faster with the upgrade to Polars 0.35.4.
  • sqlp now supports subqueries, table aliases, and more!
  • luau: upgraded embedded Luau from 0.599 to 0.604; refactored code to reduce unneeded allocations and increase performance (more than doubling it!) as we prepare for extended recipe support.
  • cat is now even faster with the --flexible option. If you know your CSV files are valid, you can use this option to skip CSV validation and make cat run twice as fast!
  • qsv can now add a Byte Order Mark (BOM) header sequence to produce Excel-friendly CSVs with the QSV_OUTPUT_BOM environment variable.
  • stats, sort, schema & validate are now faster with the use of atoi_simd to directly convert &[u8] to integer, skipping unnecessary utf8 validation, while also using SIMD CPU instructions for noticeably faster performance.
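
For readers unfamiliar with the Byte Order Mark mentioned above: assuming UTF-8 output, the BOM is the three bytes EF BB BF placed before the first character, which Excel uses as a hint that the file is UTF-8. The Python sketch below shows the effect; the helper `to_csv_bytes` is hypothetical and does not reflect how qsv implements QSV_OUTPUT_BOM.

```python
import csv
import io

UTF8_BOM = b"\xef\xbb\xbf"

def to_csv_bytes(rows, with_bom=False):
    """Serialize rows as UTF-8 CSV bytes, optionally prefixed with a BOM."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    data = buf.getvalue().encode("utf-8")
    return UTF8_BOM + data if with_bom else data

rows = [["name"], ["Zoë"]]
with_bom = to_csv_bytes(rows, with_bom=True)
without_bom = to_csv_bytes(rows)
print(with_bom[:3].hex(), len(with_bom) - len(without_bom))
```

Decoding with Python's `utf-8-sig` codec strips the BOM again, which is a convenient way to read such files back.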

Added

  • diff: added option/flag for headers in output by @janriemer in #1395
  • diff: added option/flag --delimiter-output by @janriemer in #1402
  • cat: added --flexible option to make cat rows faster still #1408
  • sqlp & joinp: both commands now recognize QSV_COMMENT_CHAR env var #1412
  • joinp: added snappy compression/decompression support #1413
  • geocode: now automatically decompresses snappy-compressed index files #1429
  • Add Byte Order Mark (BOM) output support #1424
  • Added Codacy code quality badge 9959129

Changed

  • stats, sort, schema & validate: use atoi_simd to directly convert &[u8] to integer skipping unnecessary utf8 validation, while also using SIMD instructions for noticeably faster performance
  • cat: faster cat rows #1407
  • count: optimize --width option #1411
  • luau: upgrade embedded Luau from 0.603 to 0.604 #1426
  • use atoi_simd for fast &[u8] to int conversion #1423
  • luau: performance refactor 4cebd7c
  • build(deps): bump csv-diff from 0.1.0-beta.4 to 0.1.0 by @dependabot in #1394
  • build(deps): bump serde_json from 1.0.107 to 1.0.108 by @dependabot in #1393
  • build(deps): bump indexmap from 2.0.2 to 2.1.0 by @dependabot in #1397
  • build(deps): bump jql-runner from 7.0.4 to 7.0.5 by @dependabot in #1399
  • build(deps): bump jql-runner from 7.0.5 to 7.0.6 by @dependabot in #1400
  • build(deps): bump file-format from 0.21.0 to 0.22.0 by @dependabot in #1401
  • build(deps): bump cached from 0.46.0 to 0.46.1 by @dependabot in #1403
  • build(deps): bump serde from 1.0.190 to 1.0.192 by @dependabot in #1404
  • build(deps): bump tokio from 1.33.0 to 1.34.0 by @dependabot in #1409
  • build(deps): bump flexi_logger from 0.27.2 to 0.27.3 by @dependabot in #1410
  • build(deps): bump qsv-stats from 0.11.0 to 0.12.0 by @dependabot in #1415
  • build(deps): bump itertools from 0.11.0 to 0.12.0 by @dependabot in #1418
  • build(deps): bump rust_decimal from 1.33.0 to 1.33.1 by @dependabot in #1420
  • build(deps): bump polars from 0.35.2 to 0.35.4 by @dependabot in #1425
  • build(deps): bump uuid from 1.5.0 to 1.6.0 by @dependabot in #1428
  • bump MSRV to 1.74.0
  • apply select clippy suggestions
  • update several indirect dependencies
  • pin Rust nightly to 2023-11-18

Fixed

  • pseudo: detect when more than one column is selected for pseudonymization 0b09372
  • dotenv (.env) tweaks/fixes #1427
  • fix several typos 723443e
  • fix several markdown lints

Removed

  • remove fast-float as std float parse is now also using Eisel-Lemire algorithm #1414

Full Changelog: 0.118.0...0.119.0


NOTE:

To verify prebuilt binary zip archives - click here.

0.118.0

27 Oct 13:24
bd23d3f

Highlights:

  • With the Polars upgrade to 0.34.2, the sqlp and joinp commands enjoy expanded capabilities and a noticeable performance boost. 🦄🏇
  • We now publish the 500, 1000, 5000 and 15000 Geonames cities indices for the geocode command, with users able to easily switch indices with the index-load subcommand. As the name implies, the 500 index contains cities with populations of 500 or more, the 1000 index contains cities with populations of 1000 or more, and so on.
    The 15000 index (default) is the smallest (13 MB) and fastest, with ~26k cities. The 500 index is the largest (56 MB) and slowest, with ~200k cities. The 5000 index is 21 MB with ~53k cities. The 1000 index is 44 MB with ~140k cities. 🎠
  • The geocode command now returns US Census FIPS codes for US places with the %json and %pretty-json formats, returning both US State and US County FIPS codes, with upcoming support for Cities and other US Census geographies (School Districts, Voting Districts, Congressional Districts, etc.) 🎠
  • Improved performance for stats, schema and tojsonl commands with the stats cache bincode refactor. This is especially noticeable for large CSV files as stats previously created large bincode cache files by default.
    The bincode cache allows other commands (currently, only schema and tojsonl) to skip recomputing statistics and deserialize the saved stats data structures directly into memory. Now, it will only create a bincode file if the --stats-binout option is specified (typically, before using the schema and tojsonl commands). stats will still continue to create a stats CSV cache file by default, but it will be much smaller than the bincode file, and is universally applicable, unlike the bincode cache. 🏇
  • self-update will now verify updates. This is done by verifying the zipsign signature of the release zip archive before applying it. This should make it harder for malicious actors to compromise the self-update process. Version 0.118.0 has the verification code, and future releases will use this new verification process.
    Regardless, we will zipsign all zip archives starting with this release.
    Users can manually verify the signatures by downloading the zipsign public key and running the zipsign command line tool. See Verifying the Integrity of the Prebuilt Binaries Zip Archive for more info. 🦄
  • The frequency command now supports the --ignore-case option for case-insensitive frequency counts. 🦄🎠
  • The schema command can now compile case-insensitive enum constraints. 🦄
  • Improved performance for apply and applydp commands with faster compile-time perfect hash functions for operations lookups. 🏇
  • Several minor performance improvements and bug fixes with snappy, sniff & cat commands. 🏇
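
As an illustration of what a case-insensitive frequency count means in practice, here is a small Python sketch. The function name `frequency` and the use of lowercasing to normalize values are assumptions of this example, not a description of how qsv's frequency command works internally.

```python
from collections import Counter

def frequency(values, ignore_case=False):
    """Count occurrences of each value, optionally folding case first."""
    if ignore_case:
        values = (v.lower() for v in values)
    return Counter(values)

column = ["Yes", "yes", "YES", "no"]
case_sensitive = frequency(column)
case_insensitive = frequency(column, ignore_case=True)
print(case_sensitive.most_common())
print(case_insensitive.most_common())
```

Without case folding, the three spellings of "yes" are counted as three distinct values; with it, they collapse into one.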

Added

  • frequency: added --ignore-case option #1386
  • geocode: added 500, 1000, 5000, 15000 Geonames cities convenience shortcuts to index subcommands bd9f4c3
  • schema: added --ignore-case option when compiling enum constraints; replaced Hashset with faster AHashset a16a1ca
  • snappy: added buf_size parm to compress helper fn e0c0d1f
  • sniff: added --just-mime option #1372
  • added zipsign signature verification to self-update #1389

Changed

  • apply & applydp: replaced binary_search with faster compile-time perfect hash functions for operations lookups #1371
  • stats, schema and tojsonl: stats cache bincode refactor #1377
  • luau: replaced sanitise-file-name with more popular sanitize-filename crate 8927cb7
  • cat: minor optimization by preallocating with capacity c13c341
  • sqlp & joinp: expanded speed/functionality with upgrade to Polars 0.34.2 #1385
  • tojsonl: improved boolean inferencing. Now correctly infers booleans, even if the enum domain range is more than 2, but has cardinality 2 case-insensitive 6345f2d
  • build(deps): bump strum_macros from 0.25.2 to 0.25.3 by @dependabot in #1368
  • build(deps): bump regex from 1.10.1 to 1.10.2 by @dependabot in #1369
  • build(deps): bump uuid from 1.4.1 to 1.5.0 by @dependabot in #1373
  • build(deps): bump hashbrown from 0.14.1 to 0.14.2 by @dependabot in #1376
  • build(deps): bump self_update from 0.38.0 to 0.39.0 by @dependabot in #1378
  • build(deps): bump ahash from 0.8.5 to 0.8.6 by @dependabot in #1383
  • build(deps): bump serde from 1.0.189 to 1.0.190 by @dependabot in #1388
  • build(deps): bump futures from 0.3.28 to 0.3.29 by @dependabot in #1390
  • build(deps): bump futures-util from 0.3.28 to 0.3.29 by @dependabot in #1391
  • build(deps): bump tempfile from 3.8.0 to 3.8.1 by @dependabot in 4f6200c
  • apply select clippy suggestions
  • update several indirect dependencies
  • pin Rust nightly to 2023-10-26

Fixed

  • dedup: fixed --ignore-case not being honored during internal sort option #1387
  • applydp: fixed wrong usage text using apply and not applydp c47ba86
  • geocode: fixed index-update not honoring --timeout parameter 3272a9e
  • geocode: fixed index-load to work properly with convenience shortcuts 5097326

Full Changelog: 0.117.0...0.118.0

0.117.0

15 Oct 13:45
1901fe3

Highlights:

  • geocode: added Federal Information Processing Standards (FIPS) codes to results for US places, so we can derive GEOIDs. This paves the way to doing data enrichment lookups (starting with the US Census) in an upcoming release. 🦄
  • Added Goals/Non-goals, explicitly codifying what qsv is and isn't, and what we're trying to achieve with the toolkit.
  • excel: CSV output processing is now multi-threaded, making it a bit faster. The bottleneck is still the Excel/ODS library we're using (calamine), which is single-threaded. But there are active discussions underway to make it much faster in the future. 🏇
  • Upgrading the MSRV to 1.73.0 has allowed us to use LLVM 17, which has resulted in a small performance boost. 🏇

Added

  • geocode: added Federal Information Processing Standards (FIPS) codes to results for US places.
  • Added Goals/Non-goals to README.md

Changed

  • cat: minor optimization 343bb66
  • excel: CSV output processing is now multi-threaded #1360
  • geocode: more efficient dynfmt processing #1367
  • frequency: optimize allocations before hot loop 655bebc
  • luau: upgraded embedded Luau from 0.596 to 0.599
  • deps: bump calamine from 0.22.0 to 0.22.1 4c4ed7e
  • docs: reorganized README, moving FEATURES and INTERPRETERS to their own markdown files.
  • build(deps): bump byteorder from 1.4.3 to 1.5.0 by @dependabot in #1347
  • build(deps): bump tokio from 1.32.0 to 1.33.0 by @dependabot in #1354
  • build(deps): bump regex from 1.9.6 to 1.10.0 by @dependabot in #1356
  • build(deps): bump semver from 1.0.19 to 1.0.20 by @dependabot in #1358
  • build(deps): bump pyo3 from 0.19.2 to 0.20.0 by @dependabot in #1359
  • build(deps): bump serde from 1.0.188 to 1.0.189 by @dependabot in #1361
  • build(deps): bump flate2 from 1.0.27 to 1.0.28 by @dependabot in #1363
  • build(deps): bump regex from 1.10.0 to 1.10.1 by @dependabot in #1366
  • deps: update several indirect dependencies
  • pin Rust nightly to 2023-10-14
  • bump MSRV to 1.73.0

Removed

  • excel: removed --progressbar option as Excel/ODS maximum sheet size is just too small (1,048,576 rows) to make it useful.

Fixed

  • Fixed Jupyter Notebook Viewer Link by @a5dur in #1349

Full Changelog: 0.116.0...0.117.0

0.116.0

05 Oct 20:14
edf73a3

Highlights: 🎉 🚀

  • Benchmarks refinements galore with more benchmarks and more comprehensive benchmarking instructions. 🎠
  • geocode: The Geonames index's configuration metadata is now available with the geocode index-check subcommand. No need to maintain a separate metadata JSON file. This should make it even easier to maintain multiple Geonames index files with different configurations without having to worry if you're looking at the right metadata JSON file. 🎠
  • cat: rowskey subcommand is now 27% faster 🏇🏽
  • tojsonl: parallelized with rayon, making it 33% faster! 🏇🏽
  • smaller qsv binary size and faster compile times if the to_parquet feature is disabled. If sqlp's ability to create a parquet file from a SQL query is good enough for you, disabling the feature makes qsv's binary markedly smaller and its compile time markedly faster. 🏇🏽
  • minor perf tweaks & optimizations - count and luau commands 🏇🏽

Added

  • geocode: added Geonames index file metadata to index-check subcommand
  • tojsonl: parallelized with rayon #1338
  • to: added to_parquet feature. #1341
  • benchmarks: upgraded from 3.0.0 to 3.3.1
    • you can now specify a separate benchmarking binary as we dogfood qsv for the benchmarks and some features are required that may not be in the qsv binary variant being benchmarked
    • added additional count benchmarks with --width option
    • added additional luau benchmarks with single/multi filter options
    • added additional search benchmark with --unicode option
    • show absolute path of qsv binaries used (both the one we're dogfooding and the one being benchmarked) and their version info before running the benchmarks proper
    • ensured schema benchmark was not using the stats cache with the --force option

Changed

  • cat: use an empty byte_record var instead of repeatedly allocating a new one in a hot loop eddafd1
  • count: minor optimization bb113c0
  • luau: minor perf tweaks c71cd16 and f9c1e3c
  • (deps): bump Geosuggest from 0.4.5 to 0.5.1 #1333
  • (deps): use patched version of calamine which has unreleased fixes since 0.22.0
  • build(deps): bump flexi_logger from 0.27.0 to 0.27.2 by @dependabot in #1328
  • build(deps): bump indexmap from 2.0.0 to 2.0.1 by @dependabot in #1329
  • build(deps): bump hashbrown from 0.14.0 to 0.14.1 by @dependabot in #1334
  • build(deps): bump file-format from 0.20.0 to 0.21.0 by @dependabot in #1335
  • build(deps): bump indexmap from 2.0.1 to 2.0.2 by @dependabot in #1336
  • build(deps): bump regex from 1.9.5 to 1.9.6 by @dependabot in #1337
  • build(deps): bump jql-runner from 7.0.3 to 7.0.4 by @dependabot in #1340
  • build(deps): bump csvs_convert from 0.8.7 to 0.8.8 by @dependabot in #1339
  • build(deps): bump actions/setup-python from 4.7.0 to 4.7.1 by @dependabot in #1342
  • build(deps): bump reqwest from 0.11.21 to 0.11.22 by @dependabot in #1343
  • build(deps): bump csv from 1.2.2 to 1.3.0 by @dependabot in #1344
  • build(deps): bump actix-governor from 0.4.1 to 0.5.0 by @dependabot in #1346
  • applied select clippy suggestions
  • update several indirect dependencies
  • pin Rust nightly to 2023-10-04
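
The cat change listed above (reusing one record buffer instead of allocating a new one on every iteration) is a general hot-loop pattern. Here is a minimal sketch of the idea using only the Rust standard library. A `String` line buffer stands in for the CSV byte record, and `count_fields_reusing` is a hypothetical name for illustration, not qsv's actual code:

```rust
use std::io::{BufRead, Cursor};

/// Counts comma-separated fields across all lines, reusing one buffer.
/// Hypothetical helper: it splits naively on commas and ignores CSV quoting.
fn count_fields_reusing<R: BufRead>(mut reader: R) -> std::io::Result<usize> {
    let mut line = String::new(); // allocated once, outside the hot loop
    let mut total = 0;
    loop {
        line.clear(); // drops the contents but keeps the allocated capacity
        if reader.read_line(&mut line)? == 0 {
            break; // end of input
        }
        let trimmed = line.trim_end(); // strip the trailing newline
        if !trimmed.is_empty() {
            total += trimmed.split(',').count();
        }
    }
    Ok(total)
}

fn main() -> std::io::Result<()> {
    let data = Cursor::new("a,b,c\n1,2,3\n4,5,6\n");
    println!("{}", count_fields_reusing(data)?);
    Ok(())
}
```

Clearing and refilling the same buffer avoids one heap allocation per row, which adds up when a command streams millions of rows.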

Removed

  • geocode: removed separate metadata JSON file for Geonames index files. The metadata is now embedded in the index file itself and can be viewed with the index-check subcommand.
  • removed redundant setting from profile.release-samply in Cargo.toml 2a35be5

Fixed

  • geocode: the now subcommands (suggestnow, reversenow, countryinfonow) now produce valid JSON when JSON output is requested. Previously, the JSON had escaped/extra quotes because it was formatted for inclusion in CSV files. That escaping is required for the suggest, reverse and countryinfo subcommands, which process CSVs with multiple rows. The now subcommands return a single result, so there's no need to escape the quotes in the JSON. #1345
  • schema: fixed --force flag not being honored
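
To illustrate the geocode fix above: JSON placed inside a CSV field has to be quoted, with each embedded double quote doubled (the RFC 4180 convention), so the same text is no longer valid JSON when printed on its own. Below is a minimal sketch, assuming RFC 4180 style quoting. The `csv_quote` helper and the sample record are hypothetical and not taken from qsv:

```rust
/// Quotes a field the RFC 4180 way: wrap it in double quotes and
/// double every embedded double quote. Hypothetical helper for illustration.
fn csv_quote(field: &str) -> String {
    format!("\"{}\"", field.replace('"', "\"\""))
}

fn main() {
    // Made-up geocode-style result, for illustration only.
    let json = r#"{"name":"Paris","country":"FR"}"#;

    // Standalone output: valid JSON, as a single-result lookup should print it.
    println!("{}", json);

    // The same value embedded as a CSV field: quotes doubled, not valid JSON.
    println!("{}", csv_quote(json));
}
```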

Full Changelog: 0.115.0...0.116.0

0.115.0

26 Sep 13:19
1c47a87

We continue to refine the benchmark suite, and have added a new setup argument to set up and install the tools it requires. We've also added more comprehensive checks to ensure that the required tools are installed before running the benchmarks. 🎠

For geocode, we've added a JSON file describing the Geonames index file configuration. This should help users maintain several Geonames index files with different configurations. 🎠

geocode should also be a tad faster now, thanks to the cached crate making ahash its default hashing algorithm and upgrading hashbrown. Microbenchmarks show a 33% performance improvement. 🏇🏽

We also added a release-samply profile so we can make it easier to squeeze more performance out of the toolkit with samply. 🏇🏽


Added

  • geocode: added a JSON file describing the Geonames index file configuration in #1324
  • benchmarks: v3.0.0 release
    • added setup argument to set up and install required tools for the benchmark suite
    • added more comprehensive required tools check
    • added more realistic luau benchmarks, using helper luau scripts
      (dt_format.luau and turnaround_time.luau)
    • added stats with_cache and create_cache benchmarks
    • added benchmark_aggregations.luau script for benchmark analysis
    • added binary, total_mean and qsv_env columns to benchmark results
      binary is the qsv binary variant used
      total_mean is the sum of all the mean run times of the benchmarks
      qsv_env are the qsv-relevant environment variables active while running the benchmarks
    • expanded README.md and benchmark suite usage instructions
  • added release-samply profile to Cargo.toml to facilitate continued performance optimization with samply
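
As a small worked example of the total_mean column described above: it is the sum of the mean run times of the individual benchmarks. The sketch below uses made-up timings, and the `total_mean` function is illustrative only, not the benchmark suite's implementation:

```rust
/// Sums the per-benchmark mean run times (in seconds).
/// Illustrative helper, not part of the qsv benchmark suite.
fn total_mean(means: &[f64]) -> f64 {
    means.iter().sum()
}

fn main() {
    // Hypothetical mean run times for three benchmarks, in seconds.
    let means = [0.5, 1.25, 2.0];
    println!("total_mean = {}", total_mean(&means));
}
```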

Changed

  • readme: move tab completion instructions/script to scripts/misc
  • geocode: updated bundled Geonames index to 2023-09-25
  • bump embedded luau from 0.594 to 0.596
  • build(deps): bump flexi_logger from 0.26.1 to 0.27.0 by @dependabot in #1317
  • build(deps): bump indicatif from 0.17.6 to 0.17.7 by @dependabot in #1318
  • build(deps): bump semver from 1.0.18 to 1.0.19 by @dependabot in #1320
  • build(deps): bump cached from 0.45.1 to 0.46.0 by @dependabot in #1322
  • build(deps): bump geosuggest-core from 0.4.3 to 0.4.5 by @dependabot in #1323
  • build(deps): bump geosuggest-utils from 0.4.3 to 0.4.5 by @dependabot in #1321
  • build(deps): bump fastrand from 2.0.0 to 2.0.1 by @dependabot in #1325
  • bump MSRV from Rust 1.72.0 to 1.72.1
  • cargo update bump several indirect dependencies
  • pin Rust nightly to 2023-09-25

Fixed

  • benchmarks: fixed luau benchmark that used an invalid luau command

Full Changelog: 0.114.0...0.115.0

0.114.0

21 Sep 11:03
c471321

The long-overdue Benchmarks revamp is finally here! 🎉 https://qsv.dathere.com/benchmarks

The benchmarks have been completely rewritten to be more reproducible, and now use hyperfine instead of time. The new benchmarks are now run as part of the release process, and the results are compiled into a single page that is published on the new Quicksilver website.

The new benchmarks are also more comprehensive, and designed to be run on a variety of hardware and operating systems. This allows users to adapt the benchmarks to their own workloads and environments.

Other release highlights include:

  • geocode is now fully-featured and ready for production use! 🎉 Though it only currently features Geonames city-level lookup support, it provides a solid foundation on top of which we'll add more geocoding providers in the future (next up - OpenCage support with street-level geocoding).
  • Polars has been bumped from 0.32.1 to 0.33.2, which includes a number of performance improvements for the sqlp and joinp commands.
  • major performance increase on several regex/aho-corasick powered commands on Apple Silicon thanks to various under-the-hood improvements in the aho-corasick crate.

Big thanks to @rzmk, @a5dur, @minhajuddin2510 and @samibaig for helping me finally push out the revamped Benchmarks!


Added

  • Added autoindex size threshold, replacing QSV_AUTOINDEX env var with QSV_AUTOINDEX_SIZE. Resolves #1300. in #1301 69e25ac
  • diff: Added test for different delimiters by @janriemer in #1297
  • benchmarks: Added qsv benchmark notebook. by @a5dur in #1309
  • geocode: Added countryinfo/now subcommand made available in geosuggest 0.4.3 #1311
  • geocode: Added --language option so users can specify the language of the geocoding results. This requires running the index-update subcommand with the --languages option to rebuild the index with the desired languages.
  • sqlp: add example of using columns with embedded spaces in SQL queries f7bf4f6
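
The autoindex size threshold above can be pictured as a simple size comparison. The sketch below is a guess at the behavior, not qsv's implementation. It assumes QSV_AUTOINDEX_SIZE holds a byte count, that files at or above that size are indexed automatically, and that an unset, unparsable or zero value disables auto-indexing. `should_autoindex` is a hypothetical name:

```rust
use std::env;

/// Decides whether a file should be auto-indexed, given its size in bytes
/// and the raw threshold string (e.g. the value of QSV_AUTOINDEX_SIZE).
/// Assumption: an unset, unparsable, or zero threshold disables auto-indexing.
fn should_autoindex(file_size: u64, threshold: Option<&str>) -> bool {
    match threshold.and_then(|s| s.trim().parse::<u64>().ok()) {
        Some(min_size) if min_size > 0 => file_size >= min_size,
        _ => false,
    }
}

fn main() {
    // Read the threshold from the environment, if it is set.
    let threshold = env::var("QSV_AUTOINDEX_SIZE").ok();
    let file_size = 50_000_000; // hypothetical file size in bytes
    println!("{}", should_autoindex(file_size, threshold.as_deref()));
}
```

Taking the threshold as a parameter rather than reading the environment inside the function keeps the decision logic easy to test.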

Changed

  • benchmarks: Benchmarks revamped #1298, #1310 d8eeb94
  • build(deps): bump serde_json from 1.0.106 to 1.0.107 by @dependabot in #1302
  • build(deps): bump mimalloc from 0.1.38 to 0.1.39 by @dependabot in #1303
  • build(deps): bump simple-home-dir from 0.1.4 to 0.2.0 by @dependabot in #1304
  • build(deps): bump chrono from 0.4.30 to 0.4.31 by @dependabot in #1305
  • (deps): bump Polars from 0.32.1 to Polars 0.33.2 #1308
  • build(deps): bump cpc from 1.9.2 to 1.9.3 by @dependabot in #1313
  • build(deps): bump rayon from 1.7.0 to 1.8.0 by @dependabot in #1315
  • (deps): update several indirect dependencies
  • pin Rust nightly to 2023-09-21

Full Changelog: 0.113.0...0.114.0