Interchange format
Closed this issue · 19 comments
Is your feature request related to a problem? Please describe.
Reading data from csv files requires several steps to harmonize metadata and dimensions. It would be great to have an interchange format that has all metadata and dimension names as required for PRIMAP2 but is still in a table like format as in a csv file. This format could be used to interchange data with others in a specified format which is easy to export from and import to PRIMAP2 and other tools.
Describe the solution you'd like
For exporting, a function "to_interchange_format" should be added to the "pr" xarray accessor. For importing, a function "from_interchange_format" should be added to the "io" module, which is currently under development. Data reading from other formats should then first create an interchange format dataset (e.g. as a pandas dataframe), which can then be transformed into the PRIMAP2 xarray data format.
This would not only make our code easier to read but also enable better reuse of data reading functions outside of PRIMAP2.
The interchange format has every mandatory dimension + unit + entity + time points as columns. Optional dimensions can be present but don't have to be present. The column names follow the dimension names in PRIMAP2.
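To make the layout concrete, here is a minimal sketch of such a table written as CSV, using only the Python standard library. The column names, the made-up values, and the helper to_wide_csv are assumptions for illustration, not the actual PRIMAP2 dimension names or API.

```python
import csv
import io

# Assumed dimension columns; the real names would follow the PRIMAP2 dimensions.
DIMENSION_COLUMNS = ["source", "area", "entity", "unit"]
# One column per time point.
TIME_COLUMNS = ["2000", "2001", "2002"]

# Made-up example rows, not real data.
rows = [
    {"source": "demo", "area": "AAA", "entity": "CO2", "unit": "Gg",
     "2000": 1.0, "2001": 1.5, "2002": 2.0},
    {"source": "demo", "area": "BBB", "entity": "CH4", "unit": "Gg",
     "2000": 0.1, "2001": 0.2, "2002": 0.3},
]


def to_wide_csv(rows, dimension_columns, time_columns):
    """Write rows as CSV text: dimension columns first, then time columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=dimension_columns + time_columns,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


wide_csv = to_wide_csv(rows, DIMENSION_COLUMNS, TIME_COLUMNS)
print(wide_csv)
```

Each row is one combination of dimension values, and the time series runs across the row.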
An open question is how to store the attrs in the interchange format. It is possible to store them in columns, but repeating information like the reference in every row is a waste of space. For storage in memory we could try to use the pandas dataframe attrs, though the feature is still experimental. For storage on disk we should find a format which consists of a csv file with an additional metadata file.
This would be a "wide" format, with time points as columns, right?
I think your proposed solution makes sense, in particular because conversion between the proposed interchange format and primap2 xarray Datasets should be rather easy to implement.
Regarding metadata, an additional file is probably necessary (it could be a simple key/value CSV file) and in-memory it is maybe easiest to just have an additional dictionary. pandas attrs sadly get lost in many operations, so I think it will be easier to have a separate dict, also considering that usually not a lot of operations should be necessary on the metadata, right?
How about the 'old' option to have the metadata listed in the first rows of the csv-file, as key/value pairs, and then start the dataframe-like information below the metadata? To avoid having two files.
The problem with this option is mainly that this might be more difficult to parse for other people (since it is meant as an interchange format, we don't control the other parsers). And some of the metadata (notably, at least the "history") contain newlines, which might also be weird in that format.
But just to be clear: both are solvable problems, so having one file is also doable!
Do you know the data package standard? It could also be an option: https://datahub.io/docs/data-packages
Yeah, I know the standard, I also evaluated it as an option for primap2. It was the "Potemkin" option, with very nice documentation and then absolutely no working code. So, I guess we can build an exporter to data packages, but honestly, I think that everybody will just read it as CSV anyway because the data package "ecosystem" doesn't work.
Then a csv with a simpler format for the metadata would be better. The data package yaml files are not exactly easy to read, I think.
The format is now implemented in the data reading code. It still needs a description in the docs and a format for disk storage.
OK, I suggest using csv + the attrs dict as a yaml file with either the same name or "_metadata.yaml". Not zipped together, because zip files are blocked by email spam filters.
Any suggestions / objections?
I like it. zipping would be great for size, but I guess that can also be done "outside" the format itself, if necessary.
Or one could unzip for email. Does zipping pose any problems for reading? I assume there are tons of libraries for that in python?
Actually, there is more or less only one zipping library in python, because it is in the standard lib: zipfile. So, it is available even without installing anything.
The main reason why zipping "outside" could be better is that zipping 6 files is more efficient than zipping three times 2 files (if three datasets are shared, each resulting in two files).
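As a rough sketch of zipping "outside" the format, the standard-library zipfile module can bundle the csv and yaml files of several datasets into one archive. The file names, file contents, and the helpers bundle and unbundle are hypothetical and only illustrate the idea; the archive is built in memory to keep the example self-contained.

```python
import io
import zipfile

# Hypothetical file names and contents; in practice these would be files on disk.
files = {
    "dataset_a.csv": "area,entity,unit,2000\nAAA,CO2,Gg,1.0\n",
    "dataset_a.yaml": "attrs:\n  references: example\n",
    "dataset_b.csv": "area,entity,unit,2000\nBBB,CH4,Gg,0.1\n",
    "dataset_b.yaml": "attrs:\n  references: example\n",
}


def bundle(files):
    """Pack all files into a single zip archive and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


def unbundle(data):
    """Read every member of a zip archive back into a name -> text mapping."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return {name: archive.read(name).decode("utf-8") for name in archive.namelist()}


archive_bytes = bundle(files)
restored = unbundle(archive_bytes)
```

Because the archive is created outside the interchange format, the individual csv and yaml files stay usable on their own.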
True. On the other hand, zipping in primap2 would minimize the risk of losing the metadata.
And for more chaos, another argument for zipping outside: to publish data e.g. on zenodo we would like the files unzipped, and it would be good if we can publish in the interchange format directly and not in an unzipped version of it.
So I think I prefer not zipped.
What I dislike in the current interchange format is that the information which columns are time columns is only implicit. This also shows in from_interchange_format, which takes an additional regex not contained in data and attrs. I think we should explicitly store the time columns somehow.
My ideas would be:
- Define a more explicit standard column naming like time_<spec>, which would yield e.g. time_2012.
- Just store the full list of time columns in attrs["time_columns"] or so.
A related problem is that the format of the time is also implicit at the moment. We could maybe dodge the question by saying it has to be parseable by pd.to_datetime without the format argument, but maybe it would be better to also store the explicit format string in attrs["time_format"] or so?
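To illustrate what an explicit format string buys, here is a minimal sketch that parses time column names with a format taken from the metadata. It uses datetime.strptime from the standard library as a stand-in for pd.to_datetime with a format argument; the "%Y" value, the example column names, and the helper parse_time_columns are assumptions.

```python
from datetime import datetime

# Key name as floated above; the "%Y" value is only an example.
attrs = {"time_format": "%Y"}
time_columns = ["2000", "2001", "2012"]


def parse_time_columns(columns, time_format):
    """Parse each column name with the explicit format; raises ValueError on mismatch."""
    return [datetime.strptime(column, time_format) for column in columns]


parsed = parse_time_columns(time_columns, attrs["time_format"])
print(parsed[-1].year)
```

With the format stored explicitly, a column name that does not match fails loudly instead of being guessed at.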
Makes sense. I'm also unhappy with the current undefined status.
I would like to avoid specifying a list of time cols, as this will be very long for datasets like PRIMAP-hist, which hampers human readability of the yaml file. What about storing a regexp? When writing the file from primap2 we can format the cols accordingly and thus have no problem with different regexps.
Another thing we should specify is the file names. I think we could either have
- csv and yaml with the same name (specify csv, yaml will be used if present, warning otherwise)
- csv file name specified in yaml.
I think I prefer the second option, as it's a more clearly defined format with less ambiguity.
The hairy thing about a regex is that it needs to be a regex not matching the index columns. Maybe we can explicitly specify columns but the other way around? I.e. we list all index (non-date) columns in the yaml. That shouldn't be too bad regarding readability even in the PRIMAP-hist case, right?
And I like csv file name specified in yaml, with the recommendation that csv and yaml have the same name.
Sounds great.
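The layout agreed on above (index columns listed in the metadata, csv file name stored in the metadata) could be sketched roughly as follows. The metadata is shown as a plain dict standing in for the parsed yaml file; the key names "data_file" and "index_columns", the example values, and the helper time_columns_from_header are hypothetical, not the implemented format.

```python
import csv
import io

# Stand-in for the parsed yaml metadata file; key names are assumptions.
metadata = {
    "data_file": "example.csv",
    "index_columns": ["source", "area", "entity", "unit"],
    "attrs": {"references": "example"},
}

# Stand-in for the contents of the csv file named in metadata["data_file"].
csv_text = (
    "source,area,entity,unit,2000,2001\n"
    "demo,AAA,CO2,Gg,1.0,1.5\n"
)


def time_columns_from_header(header, index_columns):
    """Every column not listed as an index column is treated as a time column."""
    missing = [column for column in index_columns if column not in header]
    if missing:
        raise ValueError(f"index columns missing from data: {missing}")
    return [column for column in header if column not in index_columns]


header = next(csv.reader(io.StringIO(csv_text)))
time_columns = time_columns_from_header(header, metadata["index_columns"])
print(time_columns)
```

Listing the index columns keeps the metadata short even for long time series, and no regex has to be written that avoids matching the index columns.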