Deprecation of the PudlTabl
output caching class
#2503
Replies: 10 comments · 10 replies
-
@arengel @grgmiller @jrea-rmi @gschivley Do you have feelings about which of the two near-term options we go with above?
-
I think I'd prefer transitioning directly to the database since it means that there is only one time that we have to update our code. Also, pandas 2.0.0 has introduced a

For OGE, we're currently using an older forked version of pudl anyway, so this would not have an immediate impact on us until we get around to updating our dependency on pudl (which it sounds like we should wait to do until the dust settles on this change anyway).
-
I'd also prefer transitioning directly to the database. Avoiding multiple disruptions and extra work, as well as removing the PUDL dependency, are probably the main selling points for me. I think the main thing we would want in order to make this transition as painless as possible is to get the names / naming convention for what

It would also be helpful to know what the dtypes of columns would have been had they come directly from
-
Thanks for writing up the discussion @zaneselvans and the helpful feedback, everyone!

Dtypes

I'm a little concerned about going straight to just distributing the database without a solution for correcting dtypes. Ideally, we could distribute the database as a duckdb but it's not quite ready yet. I think distributing

Naming conventions

The nice thing about semi-supporting PudlTabl is that we can punt decisions about table naming conventions 😄 If we move immediately to just distributing the database, we should iron out the naming conventions so users can expect stable table names. Maybe next sprint we can allocate some time to research and chat with @turbo3136 about best practices here.

Workflow

If we decide to go straight to distributing data, how should we structure the transition? Create a big branch off of dev that converts all of the output tables and rips out PudlTabl? Or should we continue to convert output tables and slowly remove
-
Questions and context for non-Catalyst data engineering folks

Our ETL extracts data from spreadsheets and databases, cleans the data using pandas, then loads it into a sqlite database. For the most part we have two tables for each entity (boiler, generator, plant, utility…). One table contains attributes that change monthly or yearly (net generation…) and another table contains static information (utility name, location, fuel type). We refer to these as our "normalized" tables.

We have a Python class called PudlTabl that reads the normalized tables back into pandas to further denormalize the data. For example, we create a table that joins plants' annual, static, and utility information. PudlTabl also contains some methods that perform imputation, aggregation, and record linkage. Historically, we've referred to the data created in PudlTabl as output and analysis tables. To interact with the output and analysis tables, users need to install the pudl package to access the PudlTabl class.

With dagster, we are now converting the pandas logic in PudlTabl to dagster assets so they can be easily written to the database and distributed. With all of these new output and analysis tables in the database, we need to establish a naming convention and database organization structure. We figured there are some best practices out there we should adopt.

PUDL-specific Questions

I think our "normalized" data roughly follows a star/snowflake schema. Our annual/monthly varying tables resemble fact tables, and our static tables resemble dimension tables. I'm tempted to adopt this popular model and naming convention, but I'm not 100% sure it is appropriate.

By the end of converting and writing all of the data created by logic in PudlTabl, we'll have dozens of new tables in the database. We want to design a handful of tables that can serve most of our users' needs. For example, we are considering creating a table where each row contains information about a generator for a given year. The table will have dozens of columns that describe each generator: static information about the plant, utility, and generator, and annually varying information like net generation. Is it appropriate to include all information for a given entity, or should it be separated by attribute theme?

General data warehouse questions
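To make the fact/dimension split above concrete, here is a minimal sketch of the kind of denormalization being described: joining an annually varying "fact" table onto a static "dimension" table to get one wide row per generator-year. The miniature tables and their contents are invented for illustration; only the column-naming style mirrors PUDL's.

```python
import pandas as pd

# Hypothetical "fact" table: attributes that vary by year.
generation = pd.DataFrame({
    "plant_id_eia": [1, 1, 2],
    "generator_id": ["a", "a", "b"],
    "report_year": [2020, 2021, 2020],
    "net_generation_mwh": [1000.0, 1100.0, 500.0],
})

# Hypothetical "dimension" table: static generator attributes.
generators = pd.DataFrame({
    "plant_id_eia": [1, 2],
    "generator_id": ["a", "b"],
    "technology_description": ["Solar Photovoltaic", "Conventional Steam Coal"],
})

# Denormalize: one wide row per generator-year. validate="many_to_one"
# guards against accidental row duplication from the dimension side.
generator_year = generation.merge(
    generators,
    on=["plant_id_eia", "generator_id"],
    how="left",
    validate="many_to_one",
)
print(generator_year.shape)  # (3, 5)
```

Whether to publish one such wide table per entity, or several theme-specific ones, is exactly the design question posed above.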
-
@zaneselvans I didn't have time to reply last month but agree with moving away from the

Do you plan to continue making all data available via the data portal? I'm debating a switch from having users download the full DB via zenodo to just querying what is needed from the data portal.
-
Thank you for all of the input! After a few months of research and design, we've decided on a new naming convention for our data tables and assets. This google sheet catalogs how table and asset names are expected to change.

Background

With the adoption of dagster there is a huge influx in the number of data assets PUDL produces / processes. The new assets are coming from two parts of our system:

Now that there are hundreds of assets in PUDL, we need to define a standard naming convention to improve consistency, organization, and interpretability. The naming conventions should:

How are PUDL assets currently organized?

Currently, PUDL has roughly four layers of data assets:

What are some issues with the current conventions?

New naming convention proposal

PUDL assets will be organized into three layers:

Asset/table names will use the following naming convention:

Underscores!

Users will be able to access

If you're interested to learn more about the naming convention and the design process, you can read and comment on the full design doc. Over the next couple of weeks, the naming convention will be applied to assets and tables on the
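As one rough illustration of how a layered, prefix-based convention helps users once everything lives in the database: final-layer tables become discoverable with a simple query against SQLite's catalog. The specific table names and `out_` prefix below are hypothetical stand-ins loosely modeled on a "layer_source__frequency_name" style; the authoritative names are in the linked google sheet.

```python
import sqlite3

# In-memory stand-in for the PUDL SQLite DB, populated with
# hypothetical table names following a layer-prefix convention.
con = sqlite3.connect(":memory:")
for name in [
    "core_eia860__scd_generators",
    "core_eia923__monthly_generation",
    "out_eia__yearly_generators",
]:
    con.execute(f'CREATE TABLE "{name}" (dummy INTEGER)')

# A typical user can list just the final "out_" layer tables:
out_tables = [
    row[0]
    for row in con.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name LIKE 'out_%' ORDER BY name"
    )
]
print(out_tables)  # ['out_eia__yearly_generators']
```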
-
Hi Catalyst team, I've heard about some of the big upcoming changes including the deprecation of

For now, would the safest bet be to start downloading
-
It looks like native storage is stable and backwards compatible as of DuckDB 0.10.0. I know there was already a move towards making parquet files available, but maybe it's worth also distributing the database as `pudl.duckdb`? I converted the 14.5 GB sqlite database into a 2.9 GB duckdb database.
-
Part of the motivation behind our move to Dagster is the proliferation of useful output tables that are derived from the public data we curate.
Some of these are simple denormalized tables that are more legible for users because they include names as well as IDs for plants and utilities, and also provide useful plant, boiler, or generator attributes alongside timeseries information. In other cases these tables facilitate joining datasets that share no common key, as with the FERC Form 1 to EIA record linkage. And then there are also analytical outputs, like our estimates of generator level heat rates, capacity factors, and fuel costs, or allocations of net generation and fuel consumption to each generator, or state-level electricity demand estimates or historical utility & balancing authority service territories.
The Problem
Up until now we have coordinated these calculations and cached the resulting dataframes using the `PudlTabl` class. However, this system has several disadvantages: the web of calculations that `PudlTabl` coordinates has gotten complicated and outgrown our homebrew system, and we want to keep adding new derived outputs, so we need a more robust system.

The Solution
Rather than requiring everybody to install a bunch of software just so they can all run essentially the exact same calculations over and over again, we are going to pre-compute these outputs and write them into the PUDL DB.
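A minimal sketch of what reading these pre-computed outputs directly from the database might look like for end users, including the dtype-restoration step discussed below. The table and column names here are invented for illustration, and the in-memory database stands in for the real `pudl.sqlite` file.

```python
import sqlite3

import pandas as pd

# Stand-in for pudl.sqlite; in practice you would connect to the real
# database file. Table/column names are hypothetical.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE out_generators ("
    "plant_id_eia INTEGER, report_year INTEGER, "
    "operational_status BOOLEAN, timezone TEXT)"
)
con.executemany(
    "INSERT INTO out_generators VALUES (?, ?, ?, ?)",
    [(1, 2021, 1, "America/Denver"), (2, 2021, 0, "America/New_York")],
)

# Reading directly from the DB replaces a PudlTabl method call.
df = pd.read_sql("SELECT * FROM out_generators", con)

# SQLite has no real BOOLEAN or categorical types, so restore richer
# pandas dtypes after reading:
df = df.astype({"operational_status": "boolean", "timezone": "category"})
print(df.dtypes)
```

This two-line read-then-astype pattern is roughly what a thinned-out `PudlTabl` (option 1 below) would do on users' behalf.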
The Problems With The Solution
Many existing applications use the `PudlTabl` class. Our deprecation or modification of the class will break these applications, and they'll have to switch to reading data directly from the database.

Two Options for Now
There are a couple of ways we can go about replacing the underlying output functions as we wrap them in Dagster assets and write them into the database:
Preserve the `PudlTabl` interface for now (kind of)

We would keep the `PudlTabl` class to access the data, but instead of having it do any calculations it can just read the data out of the database (and apply pandas data types as appropriate). It could accommodate `start_date` and `end_date` pretty easily and just return the range of data requested. But you won't be able to turn the various data repair operations on and off -- where we backfill generator `technology_description` or plant `balancing_authority_code_eia` values, for example. This would keep the `PudlTabl` class around to read from the PUDL DB indirectly, but it would go away in a subsequent release as we move to focusing more exclusively on distributing data and not the data processing pipeline and its environment.

Transition directly to using the database only
Since under the first option `PudlTabl` still exists but behaves a bit differently, which could still be somewhat disruptive, we could instead just deprecate it now and move to data-only distribution, with the expectation that everyone will read the data out of the database directly. Users would lose whatever cleanup the `PudlTabl` class currently applies, by virtue of reading directly from the DB: `BOOLEAN` columns will show up as 0 or 1, some string columns that should be categoricals with just a few distinct values (like timezones) will get parsed as strings (which can take up a lot of memory), etc.

Two Options for Later
SQLite is great because it's self-contained, requires no setup, is pretty universally accessible, and retains a lot of the relational structure of our data. But it's slow for analytic workloads on millions of rows, and it has a pretty restricted universe of data types, which means there's some loss of fidelity if the database is the only information being distributed. DuckDB and Apache Parquet files are two possible future options for distributing our data that address both of those issues: