Deprecation of the PudlTabl
output caching class
#2503
Replies: 10 comments · 10 replies
-
@arengel @grgmiller @jrea-rmi @gschivley Do you have feelings about which of the two near-term options we go with above?
-
I think I'd prefer transitioning directly to the database since it means that there is only one time that we have to update our code. Also, pandas 2.0.0 has introduced a

For OGE, we're currently using an older forked version of pudl anyway, so this would not have an immediate impact on us until we get around to updating our dependency on pudl (which it sounds like we should wait to do until the dust settles on this change anyway).
-
I'd also prefer transitioning directly to the database. Avoiding multiple disruptions and extra work, as well as removing the PUDL dependency, are probably the main selling points for me. I think the main thing we would want in order to make this transition as painless as possible is to get the names / naming convention for what

It would also be helpful to know what the dtypes of columns would have been had they come directly from
-
Thanks for writing up the discussion @zaneselvans and the helpful feedback, everyone!

Dtypes

I'm a little concerned about going straight to just distributing the database without a solution for correcting dtypes. Ideally, we could distribute the database as a duckdb but it's not quite ready yet. I think distributing

Naming conventions

The nice thing about semi-supporting PudlTabl is that we can punt decisions about table naming conventions 😄 If we move immediately to just distributing the database, we should iron out the naming conventions so users can expect stable table names. Maybe next sprint we can allocate some time to research and chat with @turbo3136 about best practices here.

Workflow

If we decide to go straight to distributing data, how should we structure the transition? Create a big branch off of dev that converts all of the output tables and rips out PudlTabl? Or should we continue to convert output tables and slowly remove
-
Questions and context for non-Catalyst data engineering folks

Our ETL extracts data from spreadsheets and databases, cleans the data using pandas, then loads it into a sqlite database. For the most part we have two tables for each entity (boiler, generator, plant, utility…). One table contains attributes that change monthly or yearly (net generation…) and another table contains static information (utility name, location, fuel type). We refer to these as our "normalized" tables.

We have a Python class called PudlTabl that reads the normalized tables back into pandas to further denormalize the data. For example, we create a table that joins plants' annual, static, and utility information. PudlTabl also contains some methods that perform imputation, aggregation, and record linkage. Historically, we've referred to the data created in PudlTabl as output and analysis tables. To interact with the output and analysis tables, users need to install the pudl package to access the PudlTabl class.

With dagster, we are now converting the pandas logic in PudlTabl to dagster assets so they can be easily written to the database and distributed. With all of these new output and analysis tables in the database, we need to establish a naming convention and database organization structure. We figured there are some best practices out there we should adopt.

PUDL-specific Questions

I think our "normalized" data roughly follows a star/snowflake schema. Our annual/monthly varying tables resemble fact tables, and our static tables resemble dimension tables. I'm tempted to adopt this popular model and naming convention, but I'm not 100% sure it is appropriate.

By the end of converting and writing all of the data created by logic in PudlTabl, we'll have dozens of new tables in the database. We want to design a handful of tables that can serve most of our users' needs. For example, we are considering creating a table where each row contains information about a generator for a given year. The table will have dozens of columns that describe each generator: static information about the plant, utility, and generator, and annually varying information like net generation. Is it appropriate to include all information for a given entity, or should it be separated by attribute theme?

General data warehouse questions
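To make the fact/dimension split above concrete, here is a minimal sketch of the kind of denormalization being described: joining an annually varying "fact" table onto a static "dimension" table to get one wide row per generator-year. The miniature tables and their contents are invented for illustration; only the column-naming style mirrors PUDL's.

```python
import pandas as pd

# Hypothetical "fact" table: attributes that vary by year.
generation = pd.DataFrame({
    "plant_id_eia": [1, 1, 2],
    "generator_id": ["a", "a", "b"],
    "report_year": [2020, 2021, 2020],
    "net_generation_mwh": [1000.0, 1100.0, 500.0],
})

# Hypothetical "dimension" table: static generator attributes.
generators = pd.DataFrame({
    "plant_id_eia": [1, 2],
    "generator_id": ["a", "b"],
    "technology_description": ["Solar Photovoltaic", "Conventional Steam Coal"],
})

# Denormalize: one wide row per generator-year. validate="many_to_one"
# guards against accidental row duplication from the dimension side.
generator_year = generation.merge(
    generators,
    on=["plant_id_eia", "generator_id"],
    how="left",
    validate="many_to_one",
)
print(generator_year.shape)  # (3, 5)
```

Whether to publish one such wide table per entity, or several theme-specific ones, is exactly the design question posed above.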
-
@zaneselvans I didn't have time to reply last month but agree with moving away from the

Do you plan to continue making all data available via the data portal? I'm debating a switch from having users download the full DB via zenodo to just querying what is needed from the data portal.
-
Thank you for all of the input! After a few months of research and design, we've decided on a new naming convention for our data tables and assets. This google sheet catalogs how table and asset names are expected to change.

Background

With the adoption of dagster there is a huge influx in the number of data assets PUDL produces / processes. The new assets are coming from two parts of our system:

Now that there are hundreds of assets in PUDL, we need to define a standard naming convention to improve consistency, organization, and interpretability. The naming conventions should:

How are PUDL assets currently organized?

Currently, PUDL has roughly four layers of data assets:

What are some issues with the current conventions?

New naming convention proposal

PUDL assets will be organized into three layers:

Asset/table names will use the following naming convention:

Underscores!

Users will be able to access

If you're interested to learn more about the naming convention and the design process, you can read and comment on the full design doc. Over the next couple of weeks, the naming convention will be applied to assets and tables on the
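As one rough illustration of how a layered, prefix-based convention helps users once everything lives in the database: final-layer tables become discoverable with a simple query against SQLite's catalog. The specific table names and `out_` prefix below are hypothetical stand-ins loosely modeled on a "layer_source__frequency_name" style; the authoritative names are in the linked google sheet.

```python
import sqlite3

# In-memory stand-in for the PUDL SQLite DB, populated with
# hypothetical table names following a layer-prefix convention.
con = sqlite3.connect(":memory:")
for name in [
    "core_eia860__scd_generators",
    "core_eia923__monthly_generation",
    "out_eia__yearly_generators",
]:
    con.execute(f'CREATE TABLE "{name}" (dummy INTEGER)')

# A typical user can list just the final "out_" layer tables:
out_tables = [
    row[0]
    for row in con.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name LIKE 'out_%' ORDER BY name"
    )
]
print(out_tables)  # ['out_eia__yearly_generators']
```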
-
Hi Catalyst team, I've heard about some of the big upcoming changes including the deprecation of

For now, would the safest bet be to start downloading
-
It looks like native storage is stable and backwards compatible as of DuckDB 0.10.0. I know there was already a move towards making parquet files available, but maybe it's worth also distributing the database as `pudl.duckdb`? I converted the 14.5 GB sqlite database into a 2.9 GB duckdb database.
-
Part of the motivation behind our move to Dagster is the proliferation of useful output tables that are derived from the public data we curate.
Some of these are simple denormalized tables that are more legible for users because they include names as well as IDs for plants and utilities, and also provide useful plant, boiler, or generator attributes alongside timeseries information. In other cases these tables facilitate joining datasets that share no common key, as with the FERC Form 1 to EIA record linkage. And then there are also analytical outputs, like our estimates of generator level heat rates, capacity factors, and fuel costs, or allocations of net generation and fuel consumption to each generator, or state-level electricity demand estimates or historical utility & balancing authority service territories.
The Problem
Up until now we have coordinated these calculations and cached the resulting dataframes using the `PudlTabl` class. However, this system has several disadvantages: the web of calculations that `PudlTabl` coordinates has gotten complicated and outgrown our homebrew system, and we want to keep adding new derived outputs, so we need a more robust system.

The Solution
Rather than requiring everybody to install a bunch of software just so they can all run essentially the exact same calculations over and over again, we are going to pre-compute these outputs and write them into the PUDL DB.
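A minimal sketch of what reading these pre-computed outputs directly from the database might look like for end users, including the dtype-restoration step discussed below. The table and column names here are invented for illustration, and the in-memory database stands in for the real `pudl.sqlite` file.

```python
import sqlite3

import pandas as pd

# Stand-in for pudl.sqlite; in practice you would connect to the real
# database file. Table/column names are hypothetical.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE out_generators ("
    "plant_id_eia INTEGER, report_year INTEGER, "
    "operational_status BOOLEAN, timezone TEXT)"
)
con.executemany(
    "INSERT INTO out_generators VALUES (?, ?, ?, ?)",
    [(1, 2021, 1, "America/Denver"), (2, 2021, 0, "America/New_York")],
)

# Reading directly from the DB replaces a PudlTabl method call.
df = pd.read_sql("SELECT * FROM out_generators", con)

# SQLite has no real BOOLEAN or categorical types, so restore richer
# pandas dtypes after reading:
df = df.astype({"operational_status": "boolean", "timezone": "category"})
print(df.dtypes)
```

This two-line read-then-astype pattern is roughly what a thinned-out `PudlTabl` (option 1 below) would do on users' behalf.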
The Problems With The Solution
Many existing applications use the `PudlTabl` class. Our deprecation or modification of the class will break these applications, and they'll have to switch to reading data directly from the database.

Two Options for Now
There are a couple of ways we can go about replacing the underlying output functions as we wrap them in Dagster assets and write them into the database:
Preserve the `PudlTabl` interface for now (kind of)

We would keep the `PudlTabl` class to access the data, but instead of having it do any calculations it can just read the data out of the database (and apply pandas data types as appropriate). It could accommodate `start_date` and `end_date` pretty easily and just return the range of data requested. But you won't be able to turn the various data repair operations on and off -- where we backfill generator `technology_description` or plant `balancing_authority_code_eia` values, for example. This would keep the `PudlTabl` class around to read from the PUDL DB indirectly, but it would go away in a subsequent release as we move to focusing more exclusively on distributing data and not the data processing pipeline and its environment.

Transition directly to using the database only
Since under the first option `PudlTabl` still exists but behaves a bit differently, which could still be somewhat disruptive, we could instead just deprecate it now and move to data-only distribution, with the expectation that everyone will read the data out of the database directly. Users would lose whatever cleanup the `PudlTabl` class currently applies, by virtue of reading directly from the DB: `BOOLEAN` columns will show up as 0 or 1, some string columns that should be categoricals with just a few distinct values (like timezones) will get parsed as strings (which can take up a lot of memory), etc.

Two Options for Later
SQLite is great because it's self-contained, requires no setup, is pretty universally accessible, and retains a lot of the relational structure of our data. But it's slow for analytic workloads on millions of rows, and it has a pretty restricted universe of data types, which means there's some loss of fidelity if the database is the only information being distributed. DuckDB and Apache Parquet files are two possible future options for distributing our data that address both of those issues: