singularity-energy/open-grid-emissions

Ensure complete timeseries data

We want to ensure that all monthly data always has 12 values and that all hourly data has 8760 values (8784 in leap years). Our validation checks are currently flagging timeseries with fewer values than expected, as well as timeseries with more than 8760 hourly values.
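
A minimal sketch of the expected counts, for reference. The names `expected_hours` and `EXPECTED_MONTHS` are hypothetical and do not refer to anything in the existing codebase.

```python
import calendar

# Every monthly timeseries should have one value per calendar month.
EXPECTED_MONTHS = 12


def expected_hours(year: int) -> int:
    """Number of hourly values a complete timeseries should have for `year`."""
    return 8784 if calendar.isleap(year) else 8760
```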

Here are some examples of various situations that are being flagged and some ideas on how to fix them:

Timeseries has 8758-8759 out of 8760 values.

  • Example: plant 540 in 2022
  • In ISONE, multiple petroleum plants are missing data on 4/1 at 4am UTC. These plants switch from using the 930 hourly profile in March to using CEMS data in April. Apparently, all CEMS data is reported in standard time, so for plants that start reporting mid-year (e.g. April 1 at midnight), the first reported hour actually represents April 1 at 1am local prevailing time, which means the first timestamp we have available is 5am UTC.
  • Potential fixes: 1) Treat this value as missing, but ensure that the timeseries still contains a row for this timestamp. Our validation check is not currently catching this gap because it only checks for a complete timeseries between the min and max timestamp in each month, so a missing first timestamp goes unnoticed (we should fix this check to use a complete date range rather than the min/max dates). 2) Try to interpolate this single hourly value.
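
Both fixes could be sketched roughly as follows with pandas. This is only an illustration under stated assumptions: the function names are hypothetical, the expected range covers a full data year defined in local standard time via a fixed UTC offset (e.g. -5 for EST), and observed timestamps are tz-aware UTC.

```python
import calendar

import pandas as pd


def expected_hourly_index(year, utc_offset_hours):
    """Hourly UTC timestamps for a calendar year defined in local standard time."""
    # Midnight local standard time on Jan 1, expressed in UTC (assumed convention).
    start = pd.Timestamp(f"{year}-01-01", tz="UTC") - pd.Timedelta(hours=utc_offset_hours)
    n_hours = 8784 if calendar.isleap(year) else 8760
    return pd.date_range(start, periods=n_hours, freq="h")


def missing_hours(timestamps, year, utc_offset_hours):
    """Expected timestamps that are absent from the observed ones.

    Compares against the complete expected range rather than the observed
    min/max, so a missing first or last hour is also reported.
    """
    expected = expected_hourly_index(year, utc_offset_hours)
    return expected.difference(pd.DatetimeIndex(timestamps))


def fill_single_hour_gaps(series, expected_index):
    """Add rows for missing hours and fill only isolated one-hour gaps.

    An isolated gap gets the mean of its two neighbors (equivalent to linear
    interpolation). Longer gaps and gaps at either end stay NaN.
    """
    full = series.reindex(expected_index)
    neighbor_mean = (full.shift(1) + full.shift(-1)) / 2
    return full.fillna(neighbor_mean)
```

Calling `missing_hours(observed_timestamps, 2022, -5)` on a plant like the one above would be expected to return the single 4am UTC timestamp on 4/1, which a min/max-based check within April would not see.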

Timeseries has far fewer than 8760 values

  • Example: plant 10549 in 2022
  • This is likely due to only having partial months of reported data in CEMS and/or EIA-923.
  • Solution: When loading 923 and CEMS data, we should ensure complete timestamps/report dates for each plant/subplant/unit, filling in missing values where no data is reported. However, this may balloon the size of the dataframes, so we want to be careful with this.
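
One way this could look in pandas is sketched below. The function name and the column names `plant_id_eia` and `datetime_utc` are assumptions for illustration, and the sketch works at the plant level only; subplant or unit keys would need to be added to the index.

```python
import pandas as pd


def complete_hourly_timeseries(df, expected_index,
                               id_col="plant_id_eia", time_col="datetime_utc"):
    """Give every plant a row for every expected hour.

    Hours with no reported data appear as rows with NaN values.
    Assumes each (plant, timestamp) pair occurs at most once in `df`.
    """
    full_index = pd.MultiIndex.from_product(
        [df[id_col].unique(), expected_index], names=[id_col, time_col]
    )
    return df.set_index([id_col, time_col]).reindex(full_index).reset_index()
```

The result has (number of plants) x (number of expected hours) rows, which is where the size concern comes from. Applying this only to plants that fail the completeness check, rather than to the whole dataframe, might keep that manageable.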

Timeseries has more than 8760 values

  • Example: plant 50240
  • Plant 50240 is located in ET, while MISO spans MT, CT, and ET. Its data starts at 5am UTC (as expected for EST) but also ends at 5am UTC (it should end at 4am). It looks like the 930 hourly profile we are using to shape this plant (MISO NG) is in central time, even though the plant is located in eastern time. Thus, when shaping this plant, we are adding an extra hour to the end (and this may also result in missing data for the first hour of a month when switching from CEMS to EIA).
  • Solution: We may need to use tz-aware profiles when shaping and assigning report dates for BAs with plants in multiple TZs. We may also want to add a validation check to flag when a BA has plants in multiple timezones.
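
The proposed validation check could be sketched like this. The function name and the column names `ba_code` and `timezone` are hypothetical, and the input is assumed to be a table with one row per plant.

```python
import pandas as pd


def bas_with_multiple_timezones(plants, ba_col="ba_code", tz_col="timezone"):
    """Count distinct time zones per BA and keep only BAs with more than one."""
    tz_counts = plants.groupby(ba_col)[tz_col].nunique()
    return tz_counts[tz_counts > 1]
```

The returned Series is indexed by BA, so an empty result means no BA spans multiple time zones, and any BA that does appear is a candidate for tz-aware shaping.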