Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flights-3m on v2.10.0 exceeds configured limit #627

Open
dangotbanned opened this issue Nov 13, 2024 · 10 comments · Fixed by #628
Open

flights-3m on v2.10.0 exceeds configured limit #627

dangotbanned opened this issue Nov 13, 2024 · 10 comments · Fixed by #628
Labels

Comments

@dangotbanned
Copy link
Member

dangotbanned commented Nov 13, 2024

In vega/altair#3631 I'm working on a new wrapper to this repo in altair.

After updating v2.9.0 -> v2.10.0, I'm seeing an error for flights-3m only:

File size exceeded the configured limit of 50 MB.

Reading through jsdelivr-restrictions, it looks like the default was raised from 20MB -> 50MB.
However flights-3m has increased to 69.42MB.

Related

@domoritz
Copy link
Member

Can you ignore the file? It can't be much smaller if it contains all 3m flights.

@dangotbanned
Copy link
Member Author

dangotbanned commented Nov 13, 2024

Can you ignore the file? It can't be much smaller if it contains all 3m flights.

@domoritz yeah for now, that is essentially what I'm doing.

But unless I'm missing something, the dataset itself isn't accesible via the cdn.

https://cdn.jsdelivr.net/npm/[email protected]/data/flights-3m.csv

Image

Manually grabbing from github is working.
I imagine that we could get the file size down quite a bit by switching to another format.
Will have a look now

@domoritz
Copy link
Member

Yeah, we could switch to parquet.

@dangotbanned
Copy link
Member Author

dangotbanned commented Nov 13, 2024

Yeah, we could switch to parquet.

Strong agree @domoritz

Open the Chart in the Vega Editor

Image

@dsmedia
Copy link
Contributor

dsmedia commented Nov 14, 2024

@domoritz should I open a PR to save flights-3m as a parquet file (and update flights.py)?

@domoritz
Copy link
Member

domoritz commented Nov 14, 2024

that would be great, thanks!

when you create a parquet file, make sure it doesn't have an extra index column

@dangotbanned
Copy link
Member Author

Thanks @domoritz, @dsmedia for such a quick response!

This might be better served by a separate issue.
But more broadly adding .parquet variants of the datasets would have benefits beyond reducing the file size.

You'd also have:

If you decided to go down this route, altair#3631-...metadata.parquet may be helpful in selecting other candidates for a .parquet variant

@domoritz
Copy link
Member

I agree that parquet is generally better for the reasons you mentioned but csv and JSON are still common, can be read with a text editor, and require parsing which is instructional. Since these datasets are designed for examples, teaching, and demos I think it's useful for them to be what they are.

@dangotbanned
Copy link
Member Author

I agree that parquet is generally better for the reasons you mentioned but csv and JSON are still common, can be read with a text editor, and require parsing which is instructional. Since these datasets are designed for examples, teaching, and demos I think it's useful for them to be what they are.

Absolutely see the value in keeping the accessible formats.
I'd only be proposing adding alternative formats in-addition-to what is available already.

Are there any concerns around the total number of files, e.g. some limits from jsDelivr?

@domoritz
Copy link
Member

It would increase the size of the npm package. I don't see many people asking for parquet right now so I'd say let's not do it right now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants