Update PUDL Dependencies

Question

Update PUDL Dependencies

Closed this issue 7 months ago · 1 comments

This issue replaces #284, which is now out of date from when it was originally written back in February.

The OGE pipeline depends on data from PUDL and also uses some of the code from the pudl.analysis module. Since the last OGE update in 2022, however, several big changes have been happening with PUDL that affect our dependency on the project:

Previously, we could only access PUDL data by downloading the data that Catalyst posted to Zenodo. Catalyst is now also publishing nightly builds of the data on AWS, which are based on the most up-to-date version of their dev branch. We will need to update download_data.py to add an option to download the pudl.sqlite file from the nightly build. Ideally, we would use a stable, versioned file for our inputs so that it can be cited/referenced in the final data; otherwise, we will need to archive the version of the database that we download and use for our outputs.
pudl_out is being deprecated; instead, all of these tables will be pre-compiled and saved in pudl.sqlite. Rather than going through pudl_out, we can read these tables directly from the database using pd.read_sql().
catalystcoop.pudl is no longer going to be published as a software package, which I think means that we will no longer be able to import and use the pudl.analysis functions in our pipeline. We'll have to see how much this affects us. For example, in data_cleaning.clean_eia923(), we use pudl.analysis.allocate_net_gen.allocate_gen_fuel_by_generator_energy_source() to load the data, and then we do our own transformations on the data before running pudl.analysis.allocate_net_gen.agg_by_generator(). This may just mean that we need to copy this second function into the OGE codebase, or try to merge some of our data cleaning into the PUDL codebase.
Many of the tables that we use, and the columns in them, are being renamed, so we will need to go through the codebase and update those names.
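As a rough sketch of what reading a table directly from the database could look like, the helper below wraps pd.read_sql() around a SQLite connection. The table name demo_generation, its columns, and the helper read_pudl_table() are made up for illustration and are not actual PUDL or OGE names; an in-memory database stands in for pudl.sqlite. It also assumes the table has an ISO-formatted report_date column.

```python
import sqlite3

import pandas as pd


def read_pudl_table(conn, table_name, year=None):
    """Read a full table, or a single report year, from an open SQLite connection."""
    # Table names cannot be bound as SQL parameters, so check the name against
    # the database schema before interpolating it into the query.
    known = pd.read_sql("SELECT name FROM sqlite_master WHERE type = 'table'", conn)
    if table_name not in set(known["name"]):
        raise ValueError(f"Table not found in database: {table_name}")

    query = f'SELECT * FROM "{table_name}"'
    params = None
    if year is not None:
        # Assumption: the table has an ISO-formatted report_date column.
        query += " WHERE strftime('%Y', report_date) = ?"
        params = (str(year),)
    return pd.read_sql(query, conn, params=params)


# Stand-in for sqlite3.connect("pudl.sqlite"): a tiny in-memory database
# with a hypothetical table, used only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demo_generation "
    "(plant_id_eia INTEGER, report_date TEXT, net_generation_mwh REAL)"
)
conn.executemany(
    "INSERT INTO demo_generation VALUES (?, ?, ?)",
    [(1, "2019-01-01", 10.0), (1, "2020-01-01", 12.5), (2, "2020-01-01", 7.0)],
)
conn.commit()

all_years = read_pudl_table(conn, "demo_generation")
only_2020 = read_pudl_table(conn, "demo_generation", year=2020)
```

Reading all years by default and filtering only on request keeps the historical data available for backfilling, while still allowing a single-year read where that is all a step needs.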

One of the additional benefits of this change is that the output tables that we will read from the database will contain all historical years, and thus will have better backfilling of missing data. When we were running pudl_out before, we were only loading data using a single year, which means that many of the backfill functions for things like BA codes would not work as effectively.
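To illustrate why loading all years helps, here is a toy example of filling missing BA codes within each plant. This is not PUDL's actual backfill logic, just a simple forward/backward fill on made-up data; the column names and the helper fill_ba_codes() are assumptions for the sketch.

```python
import pandas as pd


def fill_ba_codes(df):
    """Fill missing BA codes from other years reported for the same plant."""
    df = df.sort_values(["plant_id_eia", "report_year"])
    return df.groupby("plant_id_eia")["ba_code"].transform(
        lambda codes: codes.ffill().bfill()
    )


# Made-up plant records: each plant reports a BA code in only one year
plants = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 2, 2],
        "report_year": [2018, 2019, 2020, 2019, 2020],
        "ba_code": ["CISO", None, None, None, "PJM"],
    }
)

# With all years loaded, every gap can be filled from another year
filled_all_years = fill_ba_codes(plants)

# With only 2020 loaded, plant 1 has no other year to borrow a code from
filled_single_year = fill_ba_codes(plants[plants["report_year"] == 2020])
```

In the single-year case, plant 1's 2020 BA code stays missing, while in the all-years case it is filled from the 2018 record.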

I believe that all of the information about these updates from the PUDL side is documented in the following places:

To do:

Change data dependency on pudl.sqlite from the Zenodo version to the AWS nightly build
Replace all uses of pudl_out with pd.read_sql()
Update table names and column names to match new convention
Identify and address any dependencies on pudl.analysis code
Remove dependency on pudl software package in environment
Update docs
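For the first to-do item, one possible approach to the citation problem is to archive each downloaded copy with a timestamp and a checksum. The sketch below uses only the Python standard library. The function name download_and_archive() and the file naming scheme are hypothetical, and the URL is passed in as a parameter because the nightly build location is not given here. The demo uses a local file:// URL as a stand-in for the real download.

```python
import hashlib
import tempfile
import urllib.request
from datetime import datetime, timezone
from pathlib import Path


def download_and_archive(url, archive_dir):
    """Download a file, store it under a timestamped name, and record its SHA-256."""
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = archive_dir / f"pudl_{stamp}.sqlite"

    sha256 = hashlib.sha256()
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        # Stream in 1 MiB chunks so a large database is never held in memory
        for chunk in iter(lambda: response.read(1024 * 1024), b""):
            sha256.update(chunk)
            out.write(chunk)

    digest = sha256.hexdigest()
    # Sidecar checksum file that can be referenced alongside the outputs
    (archive_dir / f"{target.name}.sha256").write_text(f"{digest}  {target.name}\n")
    return target, digest


# Demo with a local stand-in file instead of the real nightly build
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.sqlite"
    source.write_bytes(b"stand-in for pudl.sqlite")
    archived, digest = download_and_archive(source.as_uri(), Path(tmp) / "archive")
    archived_bytes = archived.read_bytes()
    checksum_line = (archived.parent / f"{archived.name}.sha256").read_text()
```

Recording the checksum next to the archived file gives the outputs something concrete to reference even when the upstream nightly build has since changed.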

General steps:

Run pipeline for 2020 before any changes to make sure everything is still working
Update dependencies and get working with 2020 data
Try to run with 2022 data
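One way to check the second step against the first is to compare the 2020 outputs from before and after the dependency update. The sketch below is a minimal version of that check using pandas' testing helpers; the function outputs_match(), the column names, and the tolerance are assumptions, not part of the OGE codebase.

```python
import pandas as pd


def outputs_match(baseline, updated, key_cols, rtol=1e-6):
    """Return True if two output tables agree, ignoring row and column order."""

    def normalize(df):
        # Sort rows by the key columns and columns by name so ordering
        # differences between runs do not count as mismatches.
        return df.sort_values(key_cols).reset_index(drop=True).sort_index(axis=1)

    try:
        pd.testing.assert_frame_equal(
            normalize(baseline), normalize(updated), check_exact=False, rtol=rtol
        )
    except AssertionError:
        return False
    return True


# Made-up output from the pipeline before any changes
baseline = pd.DataFrame(
    {"plant_id_eia": [1, 2, 3], "net_generation_mwh": [10.0, 20.0, 30.0]}
)

# Same values in a different row order: should count as a match
reordered = baseline.iloc[::-1]

# One value changed: should be flagged as a mismatch
changed = baseline.copy()
changed.loc[0, "net_generation_mwh"] = 11.0

same_result = outputs_match(baseline, reordered, key_cols=["plant_id_eia"])
different_result = outputs_match(baseline, changed, key_cols=["plant_id_eia"])
```

Some differences are expected after the update (for example, improved backfilling), so a mismatch here is a prompt to inspect the rows rather than proof of a regression.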

Answer 1 · 2023-12-29T19:49:24.000Z

Addressed by #318