Update PUDL Dependencies
This issue replaces #284, which is now out of date from when it was originally written back in February.
The OGE pipeline depends on data from PUDL, and also currently makes use of some of the code from the pudl.analysis module. However, since the last OGE update in 2022, several big changes have been happening with PUDL that affect our dependency on the project:
- Previously, we could only access PUDL data by downloading data that Catalyst posted to Zenodo. However, Catalyst is now publishing nightly builds of the data on AWS, which are based on the most up-to-date version of their dev branch. We will need to update download_data.py to allow an option to download the pudl.sqlite file from the nightly build. Ideally, we will want to use a stable, versioned file for our inputs so that it can be cited/referenced in the final data; otherwise, we will need to archive the version of the database we download and use for our outputs.
- pudl_out is being deprecated, and instead all of these tables will be pre-compiled and saved in pudl.sqlite. Thus, instead of having to use pudl_out, we can now just directly read all of these tables from the database using pd.read_sql().
- catalystcoop.pudl is no longer going to be published as a software package, which I think means that we will no longer be able to import and use the pudl.analysis functions in our pipeline. We'll have to see how much this will affect our pipeline, but for example in data_cleaning.clean_eia923(), we use pudl.analysis.allocate_net_gen.allocate_gen_fuel_by_generator_energy_source() to load the data, and then we do our own transformations on the data before running pudl.analysis.allocate_net_gen.agg_by_generator(). This may just mean that we need to copy this second function into the OGE codebase, or try to merge some of our data cleaning into the pudl codebase.
- Many of the tables and column names from the tables that we use are being renamed, so we will need to go through the code base and update those.
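As a rough illustration of the pudl_out replacement described above, here is a minimal sketch of reading one table straight from pudl.sqlite with pd.read_sql(). The helper name read_pudl_table and any table name passed to it are hypothetical; the real table names need to come from the renamed PUDL schema.

```python
import sqlite3

import pandas as pd


def read_pudl_table(db_path, table):
    """Read a whole table from a PUDL SQLite file into a DataFrame.

    Hypothetical helper, not part of OGE or PUDL. Table names cannot be
    passed as SQL parameters, so the name is checked against the database's
    own list of tables and views before it is placed in the query text.
    """
    con = sqlite3.connect(db_path)
    try:
        known = pd.read_sql(
            "SELECT name FROM sqlite_master WHERE type IN ('table', 'view')",
            con,
        )["name"]
        if table not in set(known):
            raise ValueError(f"{table!r} not found in {db_path}")
        return pd.read_sql(f'SELECT * FROM "{table}"', con)
    finally:
        # Always release the file handle, even if the query fails.
        con.close()
```

A call such as read_pudl_table("pudl.sqlite", some_table_name) would then stand in for the corresponding pudl_out method, assuming the pre-compiled output table exists under that name.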
One of the additional benefits of this change is that the output tables that we will read from the database will contain all historical years, and thus will have better backfilling of missing data. When we were running pudl_out before, we were only loading data using a single year, which means that many of the backfill functions for things like BA codes would not work as effectively.
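To make the backfilling point concrete, here is a small sketch of filling a missing value from other years of the same plant. It is not the PUDL backfill logic, only an illustration of why more years help; the function name and the column names in the usage note are placeholders.

```python
import pandas as pd


def fill_from_other_years(df, id_col, year_col, value_col):
    """Fill missing values of value_col using other years of the same entity.

    Illustrative only: values are carried forward in time within each entity,
    and any remaining gaps are then filled backward from later years.
    """
    out = df.sort_values([id_col, year_col]).copy()
    out[value_col] = out.groupby(id_col)[value_col].ffill()
    out[value_col] = out.groupby(id_col)[value_col].bfill()
    return out
```

If a plant reports a BA code in 2019 but not in 2020, calling this on a frame holding only 2020 rows leaves the gap in place, while calling it on a frame with all years fills 2020 from 2019.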
I believe that all of the information about these updates from the PUDL side is documented in the following places:
- https://github.com/orgs/catalyst-cooperative/discussions/2503#discussioncomment-6682222
- https://catalystcoop-pudl.readthedocs.io/en/latest/release_notes.html#v2023-xx-xx
- https://registry.opendata.aws/catalyst-cooperative-pudl/
To do:
- Change data dependency on pudl.sqlite from Zenodo version to AWS nightly build
- Change all references of pudl_out to pd.read_sql()
- Update table names and column names to match new convention
- Identify and address any dependencies on pudl.analysis code
- Remove dependency on pudl software package in environment
- Update docs
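For the first to-do item, one possible shape for the download_data.py change is sketched below. It assumes the nightly file can be fetched from a plain URL (to be taken from the AWS registry page linked above; none is hard-coded here) and that a checksum plus download timestamp is an acceptable way to identify the archived copy. The function name and the provenance file layout are made up for illustration.

```python
import hashlib
import json
import shutil
import urllib.request
from datetime import datetime, timezone
from pathlib import Path


def download_and_record(url, dest_dir, filename="pudl.sqlite"):
    """Download a file and write a small provenance record next to it.

    Hypothetical helper: the record stores the source URL, the UTC download
    time, and the SHA-256 of the downloaded bytes, so that outputs can
    reference exactly which database build was used.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / filename

    # Stream to disk rather than holding the whole database in memory.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)

    sha256 = hashlib.sha256()
    with open(dest, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            sha256.update(chunk)

    record = {
        "source_url": url,
        "downloaded_at_utc": datetime.now(timezone.utc).isoformat(),
        "sha256": sha256.hexdigest(),
    }
    (dest_dir / f"{filename}.provenance.json").write_text(json.dumps(record, indent=2))
    return record
```

If the nightly build is published in compressed form, a decompression step would be needed before hashing the database itself; that detail is left out here because it depends on how Catalyst packages the file.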
General steps:
- Run pipeline for 2020 before any changes to make sure everything is still working
- Update dependencies and get working with 2020 data
- Try to run with 2022 data