Should we output FERC-714 hourly demand to Parquet? #2460

zaneselvans · 2023-03-27T20:43:37Z

zaneselvans
Mar 27, 2023
Maintainer

We've finally integrated the cleaned-up FERC-714 (and EIA-861) tables into the PUDL DB! 🎉

However, the hourly demand table in FERC-714 is pretty big at about 15M records, and the majority of the "data" in there (in SQLite or a naive dataframe) is the timezone column, which only has distinct 6 values, and should be better understood as a categorical column.

The other big hourly timeseries that we produce is the hourly_emissions_epacems and we write it to Apache Parquet, but it's 800M records (~50x as large).

Should FERC-714 hourly demand be stored in the SQLite DB? Or in Parquet?

Parquet

Pros

Very fast to read/write.
Smaller on disk and in memory with rich data types / categoricals.

Cons

Hourly demand data wouldn't be available via Datasette
Would need to refactor some other code (or maybe just update PudlTabl to deal with the Parquet.
We would need to implement a FK check outside of the DB, or not check that the IDs in the hourly demand table are valid, since it wouldn't be in the DB anymore.

SQLite

Pros

FERC-714 demand data is in the DB alongside everything else, including the respondent_id_ferc714 table with which it has a FK relationship.

Cons

Adding hourly demand to the SQLite DB roughly doubles the size of pudl.sqlite
Writing to the PUDL DB takes 2-5 minutes in our ETL.
Reading the whole table out of SQLite takes 1 minute.
Making it work within the GitHub runner memory limits required doing some chunking & imposition of the categorical dtypes in the output layer (maybe to be moved into the SQLite IO Manager and generalized in Manage dtypes and memory usage in SQLiteIOManager #2431).
Doubling the size of the SQLite DB will affect the resources we need to dedicate to the Datasette deployment, maybe requiring the instances to get bumped up again to the next larger size.

asg017 · 2023-03-28T19:55:18Z

asg017
Mar 28, 2023

I'd say writing to Parquet makes the most sense!

If you want to query Parquet files from SQLite (and therefore from Datasette), there are a few strategies you can look at.

The most popular solution is sqlite-parquet-vtable, but you have to compile it yourself and it hasn't had updates in a while.

I have the sqlite-parquet extension in the works, which would allow you to mount Parquet files as SQLite virtual tables and query them like they're normal tables. But it's not complete yet!

There's also datasette-parquet, which allows you to mount DuckDB/Parquet files in a separate database, but it comes with a number of caveats and is still fairly new and beta.

Another distant SQLite extension I'm building and will open source is sqlite-duckdb, a SQLite -> DuckDB extension. Since DuckDB can read Parquet files by default, then you would be able to query Parquet files like normal tables through SQLite, with most of the performance benefits that DuckDB brings.

That's all to say: I'd say it makes sense to output FERC-714 hourly demand to Parquet for now, and in a few months when these projects mature, then you can bridge Datasette + Parquet fairly easily!

6 replies

asg017 Mar 28, 2023

Currently, they would need to be inside the container to work! Though it looks like datasette-parquet has a httpfs option to enable HTTP. I do have plans for adding HTTP/S3 support to sqlite-parquet, but that'll take some time. The future sqlite-duckdb extension will also have an option to turn on DuckDB's http/s3 support.

Though one thing to consider with these approaches: They would make the S3/HTTP requests to the parquet files at query time. Which means if you have 100 people looking at the same table at the same time and they all query the S3-backed virtual table, then you'll have 100 concurrent S3 requests (for the same data!) that you'll have to pay for. I don't think there's any "materialized view"-like feature in any of these options.

So if you are storing these Parquet files in S3/storage bucket somewhere, then I'd imagine you'd want to download + cache them inside the container when you start it up. That would slow down startups a bit (and would be cost a little bit in S3 fees), but the Parquet files would be fairly compact anyway and hopefully wouldn't take too long to download anyway.

zaneselvans Mar 28, 2023
Maintainer Author

Would the fees be from the data egress? Or from the queries themselves? If our Datasette is running on GCP, and the data is in a bucket in GCP, would there still be a cost since it's within the same cloud provider / region?

We're also publishing our nightly build outputs & versioned data releases to free, public buckets that are part of the AWS Open Data catalog, and those can be accessed without cost I think.

asg017 Mar 28, 2023

Egress fees would be a part of it! Every query to a S3 backed Parquet file (when this is implemented for sqlite-parquet, and what I assume what DuckDB does) involves a GET request, and nothing is cached by default.

If our Datasette is running on GCP, and the data is in a bucket in GCP, would there still be a cost since it's within the same cloud provider / region?

Not too familar with google cloud pricing, but from here, it sounds like there wouldn't be egress fees if you're requesting from the same region, but there would be "operations fees" for every GET request. So there would still some some cost to doing that.

We're also publishing our nightly build outputs & versioned data releases to free, public buckets that are part of the AWS Open Data catalog, and those can be accessed without cost I think.

That's an option! Though the GCP VM that's running the Datasette instance would have some network costs to handle (since it'll be downloading a lot of data). Also, if your Datasette instance gets a lot of traffic (ether organic or bots/bad actors DDOS'ing), too many queries might rack up those network costs, or can overwhelm your VM and cause crashes.

Again, not likely, but something to consider with that kind of setup!

gschivley Mar 29, 2023

Would creating both a separate sqlite db (rather than including in pudl.sqlite) and parquet files be another short-term option? I'm not sure how the Datasette process works and if you'd be able to make it all available through a separate db. You'd also lose the FK relationship in the short term.

zaneselvans Mar 29, 2023
Maintainer Author

Creating a separate SQLite DB feels like it's got the disadvantages of both of the other options: disconnected from the rest of the data, but also a bigger file with slow read/write access. The parquet file option would still be easily available from the AWS Open Data Registrty (either via s3:// or http://) so folks can download or read directly with pd.read_parquet("http://path.to.storage") etc. And it's small enough that folks don't have to use a tool like Dask to avoid running out of memory (with the timezone as a catetorical it's a ~500MB dataframe, as a string it's ~1.5GB)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Catalyst Cooperative

Should we output FERC-714 hourly demand to Parquet? #2460

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment 6 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Catalyst Cooperative

Should we output FERC-714 hourly demand to Parquet? #2460

zaneselvans Mar 27, 2023 Maintainer

Parquet

Pros

Cons

SQLite

Pros

Cons

Replies: 1 comment · 6 replies

asg017 Mar 28, 2023

asg017 Mar 28, 2023

zaneselvans Mar 28, 2023 Maintainer Author

asg017 Mar 28, 2023

gschivley Mar 29, 2023

zaneselvans Mar 29, 2023 Maintainer Author

zaneselvans
Mar 27, 2023
Maintainer

Replies: 1 comment 6 replies

asg017
Mar 28, 2023

zaneselvans Mar 28, 2023
Maintainer Author

zaneselvans Mar 29, 2023
Maintainer Author