openENTRANCE/openentrance

Standard date and time format

erikfilias opened this issue · 16 comments

This issue aims to reach agreement on a standard date and time format.
Several datetime-related questions will be discussed here, and all agreements will be summarized in #45.

So far, we have the following agreements:

  • Standard datetime format : UTC

For example: 2020-01-01, 2020-01-02, ... or 2020-01-01T13:00, ...
Inclusion of attributes is possible such as duration (More details here)

  • Using time zone : no relevance
  • Using winter/summer time :

To distinguish between different granularity levels of representative timeslices, the following was proposed: <Granularity>|<Name of timeslice>. For example: 2 Season-2 Times|Summer-Day.

  • Averaging values over the time span :

For flow variables, a value is always the average (i.e. it is the average between the subannual time and the subsequent one, where the average is contingent on the lowest level of granularity: if you use 2020-01-01T13:00, it is the hourly average, if you use 2020-01-01, it is the daily average...).
Reference comment -> here

  • Accumulating values over the time span:

For stock variables, a value is the value at the start of the period, valid until the start of the next timeslice.
For example, Capacity or Reservoir Level at 2020-01-01T13:00 is the value at 1pm, if subannual is given as 2020-01-01, it is understood as midnight that day, if it's January, it is understood as midnight on the first day of the month.
Reference comment -> here
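As a rough illustration of the two conventions above, here is a minimal Python sketch. The function names `period_of` and `describe` are hypothetical and not part of any openENTRANCE tool, and the sketch only handles the two label shapes used as examples in this issue (`2020-01-01` and `2020-01-01T13:00`), not month names.

```python
from datetime import datetime, timedelta
from typing import Tuple


def period_of(subannual: str) -> Tuple[datetime, datetime]:
    """Return (start, end) of the period implied by a subannual label.

    Assumption: a label with a time part is hourly, a date-only label is daily.
    """
    if "T" in subannual:
        start = datetime.strptime(subannual, "%Y-%m-%dT%H:%M")
        return start, start + timedelta(hours=1)
    start = datetime.strptime(subannual, "%Y-%m-%d")  # midnight that day
    return start, start + timedelta(days=1)


def describe(subannual: str, kind: str) -> str:
    """Say how a value should be read: 'flow' is an average, 'stock' a snapshot."""
    start, end = period_of(subannual)
    if kind == "flow":
        return f"average from {start.isoformat()} to {end.isoformat()}"
    if kind == "stock":
        return f"value at {start.isoformat()}"
    raise ValueError(f"unknown variable kind: {kind!r}")


print(describe("2020-01-01T13:00", "stock"))
print(describe("2020-01-01", "flow"))
```

The point of the sketch is only that the same label maps to a period for flow variables and to the start instant of that period for stock variables.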

If any detail has not been considered, please don't hesitate to comment.

Important question still pending: how will a user know whether a value set for time 2020-01-01T12:00 is the average of the 12 hours preceding it (morning) or the 12 hours following it (afternoon/evening)? (cf. #45 (comment))

As mentioned by @sandrinecharousset and @arght, two representative hours could be used, e.g. 2020-01-01T12:00 and 2020-01-02T00:00 could be used for the average of the 12 hours preceding each of them (morning and afternoon/evening, respectively).

I copy my last answer here:
how will a user know what the model/scenario time resolution is? => This is more a model parameter than data. E.g. in plan4eu we have a "parameters" file which contains the start and end dates and the time granularities (i.e. the duration of the 2 kinds of time steps that are used). Then we have a tool that reads the data and builds the input netcdf of the model; this tool deals with the time granularity, with some conventions. When the model requires hourly resolution and only weekly is available, the tool manages it (and it depends on the variables, e.g. for inflows, if they are given per week then the total quantity is associated with the first time step of the week and 0 elsewhere, because it makes no difference in the results and costs less in terms of data storage; for other kinds of data they are duplicated or spread....)
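To make the three conventions mentioned here concrete (total on the first time step, duplicated, spread), a minimal Python sketch follows. This is not the plan4eu tool; the function name `disaggregate` and the method labels are assumptions chosen for illustration.

```python
from typing import List


def disaggregate(value: float, n_steps: int, method: str) -> List[float]:
    """Expand one coarse value (e.g. weekly) onto n_steps finer time steps.

    method (hypothetical labels):
      "first"     : whole quantity on the first step, 0 elsewhere
      "duplicate" : same value repeated on every step
      "spread"    : quantity divided evenly over the steps
    """
    if n_steps <= 0:
        raise ValueError("n_steps must be positive")
    if method == "first":
        return [value] + [0.0] * (n_steps - 1)
    if method == "duplicate":
        return [value] * n_steps
    if method == "spread":
        return [value / n_steps] * n_steps
    raise ValueError(f"unknown method: {method!r}")


# One weekly inflow total mapped onto 168 hourly steps.
hourly = disaggregate(70.0, 168, "first")
print(hourly[:3], sum(hourly))
```

Note that "first" and "spread" preserve the total over the period, while "duplicate" preserves the level, which is why the right choice depends on the variable.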

is the value for 2020-01-01T12:00+00:00 related to

the value in that hour => I think the value for 2020-01-01T12:00+00:00 HAS TO BE the value of that hour
the average over the time span 12pm until 12am => In that case we should not have 2020-01-01T12:00+00:00 in the subannual column but another identifier that explicitly represents the time span 12pm until 12am
the average over the time span 6am until 6pm => In that case we should not have 2020-01-01T12:00+00:00 in the subannual column but another identifier that explicitly represents the time span 6am until 6pm
again summer time: should this model report its data in summer time as 2020-01-01T13:00+00:00 during summer-time (to have consistent timesteps over the year), or should the averaging be adjusted on the time-shift days? => I have no idea sorry..... never thought of that

You are right....
The conventions used in tepes and plan4eu are closely related but different.... But then there is 'how it is represented in the model' and 'how it is represented in the data' which may be different. What we should work on here is agree on a convention for the data that will be on the platform. I would think that if we incorporate time series then it should either be:

  • time series with a given fixed time granularity (one value per hour, or 4 values per day, 1 value per day, 1 value per week, 1 value per month...... depending on the data that we have);
  • or time series over representative timesets (see discussion by @omnipotent12 )
    In the first case we should not have ambiguities about which time horizon is the value related to if eg we have weeks identifiers (like Week1, Week2....), months identifiers (january....), days identifiers (12/10/2034).... which means each time a different granularity is used we need to define new identifiers (eg if we have a variable with a 2 weeks frequency, we should define identifiers for the 2-weeks periods like Fortnight1=Week1+Week2, Week1=Jan1 until jan7 included, ....);
    What do you think @erikfilias @danielhuppmann @arght @omnipotent12

(about #45 (comment))

I think that:

  • Averaging hours preceding: if the following representative hours are consecutive: 2020-01-01T12:00, 2020-01-01T18:00 and 2020-01-01T18:00, we can interpret the values as averages, with three representative hours per day (as an example). Representative days and weeks could work the same way, I think.

One advantage is that this can be understood easily by a line of code, and compared...

Hi, @tperger and I fully agree with the two points regarding UTC and no relevance for the time zones.
Furthermore, from our point of view, there are two options for the average hour issue:

  1. Add a specific key:
  • Day|Begin: this means, for example, that the value of the time step 2020-01-01T00:00 is valid for the following time period (the entire day).
  • Day|Middle: the value is in the middle of the valid time period: for example, the value at the time step 2020-01-01T12:00 is valid for the time steps between 2020-01-01T06:00 and 2020-01-01T18:00. This can easily be seen from the delta between the values and the corresponding time steps.
  2. Convention: average time series always have to be set at, for example, the beginning of a valid time period.
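A minimal Python sketch of how the Begin/Middle keys from option 1 could be turned into a validity window. The function name `valid_period` and the explicit `step` argument are assumptions for illustration; the step lengths used below are the ones from the two examples above.

```python
from datetime import datetime, timedelta
from typing import Tuple


def valid_period(timestamp: str, step: timedelta, anchor: str) -> Tuple[datetime, datetime]:
    """Return the (start, end) window for which a value is valid.

    anchor "Begin":  the timestamp marks the start of the window.
    anchor "Middle": the timestamp sits at the centre of the window.
    """
    t = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M")
    if anchor == "Begin":
        return t, t + step
    if anchor == "Middle":
        return t - step / 2, t + step / 2
    raise ValueError(f"unknown anchor: {anchor!r}")


# Day|Begin: the value at midnight is valid for the entire day.
print(valid_period("2020-01-01T00:00", timedelta(days=1), "Begin"))
# Day|Middle: the value at 12:00 is valid from 06:00 to 18:00.
print(valid_period("2020-01-01T12:00", timedelta(hours=12), "Middle"))
```

Under option 2 (a fixed convention), only the "Begin" branch would be needed and the key could be dropped.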

Hi, in addition to @sebastianzwickl 's comment: We should specify if we use

  • Value for specific time step: e.g. '2020-01-01T12:00' or '2040' is the value for this hour or year, respectively
  • Value for the consecutive time period (cumulative): if we have '2020-01-01T12:00' and the next time step is '2020-01-01T18:00', the value is the total value from 12 to 18
  • Value for the consecutive time period (average): same as previous, but average value from 12 to 18 (in line with @erikfilias comment)

Maybe the distinction can be specified in the subannual value (e.g. subannual = hourly, subannual = 6-hourly, subannual = 6-hourly average)
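To show how the cumulative and average readings relate, here is a small Python sketch that converts period totals into per-hour averages using the gap to the next time step. The function name `to_hourly_average` is hypothetical, and the sketch assumes the caller supplies one extra timestamp that closes the last period.

```python
from datetime import datetime
from typing import List

FMT = "%Y-%m-%dT%H:%M"


def to_hourly_average(timestamps: List[str], totals: List[float]) -> List[float]:
    """Convert cumulative values per period into average values per hour.

    timestamps must hold one more entry than totals: entry i starts period i,
    entry i + 1 ends it.
    """
    if len(timestamps) != len(totals) + 1:
        raise ValueError("need exactly one closing timestamp after the last total")
    times = [datetime.strptime(t, FMT) for t in timestamps]
    averages = []
    for start, end, total in zip(times, times[1:], totals):
        hours = (end - start).total_seconds() / 3600
        averages.append(total / hours)
    return averages


# A total of 600 over 12:00-18:00 corresponds to an average of 100 per hour.
print(to_hourly_average(["2020-01-01T12:00", "2020-01-01T18:00"], [600.0]))
```

The conversion only works when the period length is known, which is one argument for stating the interpretation explicitly rather than leaving it implicit.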

Also I would again come back to @omnipotent12 's comment on #45

Can we add representative hours to this? Or is it better to think of this as a separate issue?
For example, in CS1 we will use something like 'm1h0', denoting a representative (average) quantity in January (m1) for 00:00 - 01:00 (h0)

How should representative hours be included in this notation? In GENeSYS-MOD we use a similar representation as in CS1.

Apart from that: Our time-dependent input data is full hourly and easily can be converted to UTC timestamps without summer/winter time and no time-zones, so we are fine with the proposed conventions in this topic

As a next step to the suggestion of @tperger (and this could also address the representative hours issue from @tburandt), we could describe the time series as we proposed for variable&energy, like
Secondary Energy|<Output Fuel>|<Input Fuel>|<Specification>
Therefore, we could describe the time series as

  • Consecutive,
  • Representative,

and in more detail, for example,
Representative|January|Hour|Average and Representative|Summer-day|Hour.

Therefore, we could suggest the pattern
Consecutive|<Month/Season>|<Granularity>|<Specification1>
Representative|<Month/Season>|<Granularity>|<Specification1>

I like the suggestion from @tperger @sebastianzwickl
We could define patterns such as representative, cumulative, average, ...
If I understand well,

  • sometimes timeseries are defined with one value per eg hour, day, week, month or year
  • sometimes they are defined with representatives values for chosen timesets, like representative hour1 of each monday of january
  • sometimes they are cumulative values, meaning that the value given is the total (e.g. energy) for the period (and models can either attach it to the first timestep of the given period or spread it as they wish over the period)
  • sometimes they are duplicated values, meaning that the value given should be duplicated at every timestep of the given period
    This means we need to define the cumulative/consecutive...., the start time and the duration ?

Thanks @sebastianzwickl and @tperger! This looks like a good solution to me as well, assuming it will work with the current framework. If I am not misunderstanding the implementation of this, these values would be reflected in the Subannual column of the IAMC data format. I believe we will also need to include the final value denoting the time slot then. So,

Representative|January|Hour|Average and Representative|Summer-day|Hour.

becomes
Representative|January|Hour|Average and Representative|Summer-day|Hour|22.
where "|22" denotes the timeslot from 22:00 - 23:00, and "|Average" denotes the average hourly value for all 24 representative hours in January.

The general pattern suggested by @sebastianzwickl still holds, just that <Specification> is required and is either a number or timestamp (UTC) value.

Consecutive|<Month/Season>|<Granularity>|<Specification>
Representative|<Month/Season>|<Granularity>|<Specification>
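A minimal Python sketch of how such a pipe-separated key could be split and checked. The names `SubannualKey`, `parse_subannual` and `hour_slot` are hypothetical, and `hour_slot` only covers the case where the specification is an hour number, not a UTC timestamp.

```python
from typing import NamedTuple

KINDS = ("Consecutive", "Representative")


class SubannualKey(NamedTuple):
    kind: str           # "Consecutive" or "Representative"
    period: str         # <Month/Season>, e.g. "January" or "Summer-day"
    granularity: str    # e.g. "Hour"
    specification: str  # e.g. "22" or "Average"


def parse_subannual(key: str) -> SubannualKey:
    """Split '<Kind>|<Month/Season>|<Granularity>|<Specification>'."""
    parts = [p.strip() for p in key.split("|")]
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected four non-empty parts, got {key!r}")
    if parts[0] not in KINDS:
        raise ValueError(f"first part must be one of {KINDS}, got {parts[0]!r}")
    return SubannualKey(*parts)


def hour_slot(specification: str) -> str:
    """Render an hour index such as '22' as the slot '22:00 - 23:00'."""
    hour = int(specification)
    if not 0 <= hour <= 23:
        raise ValueError(f"hour out of range: {hour}")
    return f"{hour:02d}:00 - {(hour + 1) % 24:02d}:00"


key = parse_subannual("Representative|Summer-day|Hour|22")
print(key, hour_slot(key.specification))
```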

@omnipotent12 yes, adding this pipe structure to the subannual column is the idea. So instead of subannual=year we would have subannual = Consecutive|<Month/Season>|<Granularity>|<Specification1> . I'm just not convinced that we should add the UTC timestamp to the pipe, because that way we'd lose the column structure (we'd need a new row for each time step). So the UTC timestamps should remain for each column and with the subannual specification we know what they actually stand for (average, etc.).

Thanks to all for becoming active in this discussion, great to develop a common understanding!

I'm a bit worried that the suggestions here are becoming too long to be readable. So I would suggest to rely a bit more on the intuition of the users and an explicit common understanding:

  1. a list like January, February, ... or 2020-01-01, 2020-01-02, ... is obviously consecutive and terms like summer-day are obviously representative - no need to write it explicitly in the name. If someone needs it explicitly, we could add it as attributes (similar to the duration, see here).

  2. to distinguish between different granularity levels of representative timeslices, I would switch the order and write <Granularity>|<Name of timeslice>. For example, if we have a granularity-category with items Summer-Day, Summer-Night, Winter-Day, Winter-Night, this could be called 2 Season-2 Times|Summer-Day. This will make it easier to group the various timeslices and sort the lists.

  3. rather than adding more identifiers to consecutive timeslices, I would simply define that a value is always the average (for flow variables) or the value at the start of the period (for stock) over the time-period until the start of the next timeslice.
    So Capacity or Reservoir Level at 2020-01-01T13:00 is the value at 1pm, if subannual is given as 2020-01-01, it is understood as midnight that day, if it's January, it is understood as midnight on the first day of the month.
    For flow variables like Primary Energy, it is the average between the subannual time and the subsequent one, where average is contingent on the lowest level of granularity - if you use 2020-01-01T13:00, it is hourly average, if you use 2020-01-01, it is daily average...
    NB. This would require some models (@arght mentioned that they use reservoir level at the end of the timeslice) to do explicit conversion (shifting of the time column), but this would need to happen anyway even with having very long, explicit names.
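The conversion mentioned in the NB could look roughly like the following Python sketch, for a model that reports a stock variable at the end of each timeslice. The function name `end_to_start` is hypothetical, and the sketch assumes the level at the very start of the horizon is known separately.

```python
from typing import List


def end_to_start(end_values: List[float], initial: float) -> List[float]:
    """Shift end-of-timeslice stock values to the start-of-timeslice convention.

    The level at the end of slice i is the level at the start of slice i + 1,
    so the series moves one step later and the initial level fills slot 0.
    The last end value falls outside the original time column and is dropped.
    """
    if not end_values:
        return []
    return [initial] + list(end_values[:-1])


# Reservoir levels reported at the end of three timeslices, starting from 100.
print(end_to_start([90.0, 80.0, 70.0], initial=100.0))
```

To keep the final end value, the time column would need one extra timestamp after the last timeslice.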

I support the suggestion to use attributes. Therefore, we can add information in detail without getting overloaded and confusing in terms of the variable name.

arght commented

I like this format. No problem with these assumptions on my side

This would perfectly fit the needs that I have understood I think, I support it

Perfect, many thanks all. I'll include all of these agreements in #45.
If there is another detail or issue, please don't hesitate to comment.