Open-Power-System-Data/time_series

Interpolated values in stacked version

Closed this issue · 3 comments

The last rows of the stacked version of the time series contain markers for interpolated values (I believe). I see two issues with them:

  • they break automatic type detection of pandas (because they introduce strings to the otherwise float-only value column)
  • linking them back to the actual values is cumbersome, because one would need to properly split the string label, e.g. DE_50hertz_load_entsoe_transparency, into its parts

Instead, I'd propose adding an interpolated column that holds a boolean marker for each value in the time series, indicating whether that value has been interpolated.

>>> tail -3 time_series_60min_stacked.csv
interpolated_values,,,2017-12-25T17:00:00Z,DE_50hertz_load_entsoe_transparency
interpolated_values,,,2017-12-28T06:00:00Z,SI_load_entsoe_transparency
interpolated_values,,,2017-12-31T15:00:00Z,SI_load_entsoe_transparency

>>> tail -5737 time_series_60min_stacked.csv | head -3
UA_west,load,entsoe_power_statistics,2015-12-31T22:00:00Z,746.0
interpolated_values,,,2006-01-01T22:00:00Z,DE_50hertz_wind_generation_forecast
interpolated_values,,,2006-01-02T22:00:00Z,DE_50hertz_wind_generation_forecast
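
To make the proposal concrete, here is a minimal sketch of how the current layout could be turned into a float-only value column plus a boolean interpolated column. It assumes that each marker label is simply region, variable and attribute joined by underscores (which matches the excerpts above but is not confirmed for the whole file), so the label is rebuilt from the data rows instead of being split. The sample rows, the load values and the function name add_interpolated_column are made up for illustration.

```python
import io

import pandas as pd

# Tiny sample in the current stacked layout. The marker row is modeled on
# the excerpt above; the load values are invented.
SAMPLE = """region,variable,attribute,utc_timestamp,data
SI,load,entsoe_transparency,2017-12-31T14:00:00Z,1200.0
SI,load,entsoe_transparency,2017-12-31T15:00:00Z,1250.0
interpolated_values,,,2017-12-31T15:00:00Z,SI_load_entsoe_transparency
"""


def add_interpolated_column(raw: pd.DataFrame) -> pd.DataFrame:
    """Return the data rows with a float `data` and a boolean `interpolated` column."""
    is_marker = raw["region"] == "interpolated_values"
    markers = raw[is_marker]
    values = raw[~is_marker].copy()

    # Without the marker rows, the value column is purely numeric.
    values["data"] = values["data"].astype(float)

    # Assumption: label == region + "_" + variable + "_" + attribute.
    # Rebuilding the label avoids having to split it into its parts.
    labels = values["region"] + "_" + values["variable"] + "_" + values["attribute"]

    flagged = set(zip(markers["utc_timestamp"], markers["data"]))
    values["interpolated"] = [
        key in flagged for key in zip(values["utc_timestamp"], labels)
    ]
    return values.reset_index(drop=True)


# Read everything as strings first, because the marker rows put text
# into the value column.
raw = pd.read_csv(io.StringIO(SAMPLE), dtype=str, keep_default_na=False)
tidy = add_interpolated_column(raw)
print(tidy)
```

With the real file, the same function could be applied after reading time_series_60min_stacked.csv with dtype=str, provided the label assumption holds for every marker row.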

Thanks a lot for the suggestion. I like it!

The only caveat I see is that the data file then no longer is a "fully stacked" one, but has two instead of one value column, one containing the data and one containing the boolean interpolated marker. But I don't see any disadvantage resulting from that, so I would suggest going that way.

What I also noticed is that the interpolation marker column currently looks quite ugly on a filtered dataset: even if one downloads only a single column, one always gets the interpolation markers for all columns in the dataset, regardless of the selection. Jonathan, let's discuss how we can tackle that. My idea: either we handle it in the website filter, or we switch to boolean marker columns in general (as Tim suggested for the stacked version) and also use them for the singleindex etc. variants. That would increase the file size, but would make the data easier to work with for those who actually want to use it.

> The only caveat I see is that the data file then no longer is a "fully stacked" one, but has two instead of one value column, one containing the data and one containing the boolean interpolated marker.

I see, didn't think about that before. Now I also understand better why you chose the current format. I personally don't think too much about strict definitions of data format. What I proposed above feels natural to me, but I agree it's not strictly stacked and I'd hence understand if you simply left it as is.

Maybe one could solve my issues above and still stay strictly stacked? Instead of the current:

region,variable,attribute,utc_timestamp,data
interpolated_values,,,2017-12-31T15:00:00Z,SI_load_entsoe_transparency

Maybe something like:

region,variable,attribute,utc_timestamp,data
SI,load,interpolated,2006-01-01T22:00:00Z,1

This would:

(1) leave the data column strictly numeric, hence support pandas' type detection,
(2) leave a clear link back to the original data item,

while staying strictly stacked.
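
As a rough sketch of what reading such a file could look like: the sample below follows the proposed layout, with made-up load values and the marker row from the example above. With no strings in the value column, pandas should infer a float dtype on its own, and the marker rows can be pivoted next to the values they refer to.

```python
import io

import pandas as pd

# Sample in the proposed strictly stacked layout. The load values are
# invented; the marker row is the one from the example above.
PROPOSED = """region,variable,attribute,utc_timestamp,data
SI,load,entsoe_transparency,2006-01-01T21:00:00Z,1100.0
SI,load,entsoe_transparency,2006-01-01T22:00:00Z,1150.0
SI,load,interpolated,2006-01-01T22:00:00Z,1
"""

stacked = pd.read_csv(io.StringIO(PROPOSED))

# One row per (region, variable, timestamp), one column per attribute.
# Passing a list as `index` needs pandas 1.1 or newer.
wide = stacked.pivot(
    index=["region", "variable", "utc_timestamp"],
    columns="attribute",
    values="data",
)

# Timestamps without a marker row come out as NaN; treat them as
# "not interpolated".
wide["interpolated"] = wide["interpolated"].fillna(0).astype(bool)
print(wide)
```

One thing this layout glosses over: the marker row replaces the attribute with interpolated, so if a region and variable carry more than one attribute (the labels above include both load and wind_generation_forecast series for DE_50hertz, for instance), the marker no longer says which source series was interpolated.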

I solved this a while ago by deleting the column that contains the markers on interpolated values, as it was creating more problems than value.