Querying partitioned tables #31

Open
Dietr1ch opened this issue Jan 11, 2025 · 0 comments

I want to merge multiple tables with the same schema into a single one.

Say I have a table Sales(transaction_id, date, amount) and sharded files in my file system,

sales/
  - 2024/
    - 12/
      - 30.csv
      - 31.csv
  - 2025/
    - 01/
      - 01.csv
      - 02.csv
      - 03.csv

Is there a convenient way to treat sales/**/*.csv as a single table?

So far it seems that bdt query supports 2 flags for input tables,

  • --table path/to/single_file.csv
    • A single table is read with name single_file
  • --tables path/to/directory/
    • Multiple tables are read, each one with its own basename
      • This imports N tables
      • I don't see much value here; why not use the shell to expand something like path/to/directory/*.csv?

I kind of want a new input file flag that expects a table name, and a set of (compatible) files,

# The shell expands the glob before bdt sees the arguments
bdt query \
  --partitioned_table sales sales/**/*.csv \
  --sql "
    select
      count(*)
    from
      sales
  "

This would be a flag taking 1+N arguments (after expansion, --partitioned_table sales sales/2024/12/30.csv sales/2024/12/31.csv sales/2025/01/01.csv sales/2025/01/02.csv sales/2025/01/03.csv) that makes the table sales available.
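
To make the expansion concrete, here is a small sketch of how the N file arguments could be produced. One caveat about the glob itself: in bash, ** only recurses into subdirectories after `shopt -s globstar` (zsh does it by default), so the sketch uses find instead, which works in any POSIX shell. The helper name list_partitions is made up for illustration and is not part of bdt.

```shell
# list_partitions DIR
# Hypothetical helper (not part of bdt): prints every .csv file under DIR,
# one path per line, in a stable byte-wise sorted order.
list_partitions() {
  find "$1" -type f -name '*.csv' | LC_ALL=C sort
}

# Example, run from the directory that contains sales/:
#   list_partitions sales
# With the layout above this prints the five shard paths, which is the list
# the proposed --partitioned_table flag would receive after the table name.
```

Since date-named paths sort chronologically, the sorted order is also the order in which the shards were written.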

Is there a way to get this today? I tried the --tables flag, but instead got N different tables that were hard to work with as a unit.

It's not hard to create a single file that concatenates all the tables, but it'd be nice not to need one, as that would allow writing queries straight from the shell. With a tiny rewrite, --partitioned_table sales sales/2024/12/*.csv would get me info about sales in December 2024 without any unnecessary disk writes.
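
For reference, the concatenation workaround I mean looks roughly like this. It is a minimal sketch that assumes every shard starts with the same header row; concat_csv is a made-up helper name, and the bdt invocation in the comments only assumes --table and --sql behave as described above.

```shell
# concat_csv FILE...
# Hypothetical helper: prints the header row of the first file once,
# followed by the data rows of every file.
# Assumes each file begins with an identical one-line header.
concat_csv() {
  # FNR == 1 is the first line of each file; NR != 1 excludes the very first
  # line overall, so only the repeated headers are skipped.
  awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}

# Example (this is the extra disk write I would like to avoid):
#   concat_csv sales/2024/12/*.csv > /tmp/sales.csv
#   bdt query --table /tmp/sales.csv --sql "select count(*) from sales"
```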
