Should we output FERC-714 hourly demand to Parquet? #2460
I'd say writing to Parquet makes the most sense! If you want to query Parquet files from SQLite (and therefore from Datasette), there are a few strategies you can look at. The most popular solution is sqlite-parquet-vtable, but you have to compile it yourself and it hasn't been updated in a while. There's also datasette-parquet, which lets you mount DuckDB/Parquet files as a separate database, but it comes with a number of caveats and is still fairly new and beta. A more distant option is a SQLite extension I'm building and will open source.

That's all to say: I'd say it makes sense to output FERC-714 hourly demand to Parquet for now, and in a few months, when these projects mature, you'll be able to bridge Datasette + Parquet fairly easily!
---
We've finally integrated the cleaned-up FERC-714 (and EIA-861) tables into the PUDL DB! 🎉
However, the hourly demand table in FERC-714 is pretty big, at about 15M records, and the majority of the "data" in there (in SQLite or a naive dataframe) is the `timezone` column, which has only 6 distinct values and is better understood as a categorical column.

The other big hourly timeseries we produce is `hourly_emissions_epacems`, which we write to Apache Parquet, but it's 800M records (~50x as large).

Should FERC-714 hourly demand be stored in the SQLite DB, or in Parquet?
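To put a rough number on the `timezone` point, here's a sketch in pandas (scaled down to 1M rows, with made-up timezone values) comparing the column stored as plain strings versus as a categorical:

```python
import pandas as pd

# ~1M rows where one column repeats only 6 distinct timezone strings,
# a scaled-down stand-in for the ~15M-record hourly demand table.
n = 1_000_000
tz = pd.Series(
    ["America/New_York", "America/Chicago", "America/Denver",
     "America/Los_Angeles", "America/Anchorage", "Pacific/Honolulu"] * (n // 6)
)

as_object = tz.memory_usage(deep=True)
as_category = tz.astype("category").memory_usage(deep=True)
# The categorical representation is more than an order of magnitude smaller,
# since each row stores a small integer code instead of a Python string.
```

Parquet's dictionary encoding gives a similar win on disk, which is part of why a column like this is cheap in Parquet but bulky in SQLite.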
**Parquet**

Pros:
- Handles a large timeseries like this efficiently, and low-cardinality columns like `timezone` compress well.
- Consistent with how we already store `hourly_emissions_epacems`.

Cons:
- `PudlTabl` would need special handling to deal with the Parquet output.

**SQLite**

Pros:
- The table can live alongside the `respondent_id_ferc714` table, with which it has a FK relationship.

Cons:
- It would add ~15M records to `pudl.sqlite`.
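On the FK point: even if the demand table moves to Parquet, the relationship to `respondent_id_ferc714` can still be exercised (though not enforced by the database) with an ordinary pandas merge. A minimal sketch, with hypothetical column values:

```python
import pandas as pd

# Stand-ins for the respondent table (in SQLite) and the hourly demand
# table (read from Parquet); values are hypothetical.
respondents = pd.DataFrame({
    "respondent_id_ferc714": [1, 2],
    "respondent_name": ["Utility A", "Utility B"],
})
demand = pd.DataFrame({
    "respondent_id_ferc714": [1, 1, 2],
    "demand_mwh": [100.0, 110.0, 50.0],
})
# validate="m:1" makes pandas raise if any demand row lacks a unique
# matching respondent, approximating the FK check SQLite would do natively.
merged = demand.merge(respondents, on="respondent_id_ferc714", validate="m:1")
```

So the FK pro for SQLite is really about enforcement at write time, not about losing the ability to join.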