nismod/smif

smif prepare-convert will ignore combined datasets

tomalrussell opened this issue · 0 comments

An unintended feature of our method for reading data arrays from CSVs is that multiple data variables can be stored in extra columns in a single file.

E.g. population and GVA might share a region dimension and be defined over the same timesteps, so a CSV with timestep,region,pop,gva as a header could be read to load a pop data array or a gva data array.
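As a rough illustration of that behaviour (not smif's actual reader), the sketch below pulls either variable out of one combined CSV using only the standard library. The function name `read_variable`, the sample rows and the dict-based return shape are all assumptions made for the example.

```python
import csv
import io

# Hypothetical combined file: two data variables share the same dimensions
CSV_TEXT = """timestep,region,pop,gva
2020,north,100,1.5
2020,south,200,2.5
"""


def read_variable(csv_text, variable, dims=("timestep", "region")):
    """Return {(timestep, region): value} for a single data column.

    Any other data columns in the file are simply ignored, which is
    what lets one CSV serve several data arrays.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {
        tuple(row[dim] for dim in dims): float(row[variable])
        for row in reader
    }


pop = read_variable(CSV_TEXT, "pop")
gva = read_variable(CSV_TEXT, "gva")
```

Both calls read the same text and differ only in the column they select.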

The smif prepare-convert command reads all data arrays associated with a model run and writes them to parquet, one by one. When a CSV file contains more than one data array, the corresponding parquet file is written once per array, each write replacing the previous one, so it ends up containing only the last data array to be read and re-written.
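A minimal sketch of how the overwrite comes about, assuming each data array is converted independently and the output path is derived from the source file name. The `arrays` spec, the `naive_convert` function and the dict standing in for the filesystem are hypothetical, not smif internals.

```python
import os

# Hypothetical model-run spec: both data arrays point at the same CSV
arrays = [
    {"name": "pop", "source": "socio.csv"},
    {"name": "gva", "source": "socio.csv"},
]

# Stands in for the filesystem: output path -> columns written to it
store = {}


def naive_convert(arrays, store, dims=("timestep", "region")):
    """Write each data array to its own target, one by one."""
    for array in arrays:
        base, _ = os.path.splitext(array["source"])
        target = base + ".parquet"
        # Each write replaces whatever was at this path before
        store[target] = list(dims) + [array["name"]]


naive_convert(arrays, store)
print(store)  # {'socio.parquet': ['timestep', 'region', 'gva']}
```

The pop column never survives because the second write lands on the same path.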

Approaches:

  • maintain the unintended feature and allow it in parquet too - convert would need to be aware of all files with multiple data arrays, and to do some recombination before writing
  • avoid the unintended feature - would need to clean all data in any smif user's projects to separate out datasets
  • smif csv2parquet is a simpler and less flexible workaround (see f951de5) that sets up a usable binary data store from CSV. Sticking with this for now.
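For the first approach, the recombination step might look roughly like the sketch below: group data arrays by source file, then plan a single write per file carrying every data column. This is only an outline under assumed names (`plan_combined_writes`, the `arrays` spec) and does not reflect how smif or csv2parquet is implemented.

```python
import os
from collections import defaultdict

# Hypothetical model-run spec; two arrays share a CSV, one stands alone
arrays = [
    {"name": "pop", "source": "socio.csv"},
    {"name": "gva", "source": "socio.csv"},
    {"name": "demand", "source": "energy.csv"},
]


def plan_combined_writes(arrays, dims=("timestep", "region")):
    """Map each output path to the full list of columns it should hold."""
    names_by_source = defaultdict(list)
    for array in arrays:
        names_by_source[array["source"]].append(array["name"])

    plan = {}
    for source, names in names_by_source.items():
        base, _ = os.path.splitext(source)
        # One write per source file, carrying every data column
        plan[base + ".parquet"] = list(dims) + names
    return plan


plan = plan_combined_writes(arrays)
```

The sketch assumes all arrays in a file share the same dimensions, as in the pop and GVA example; files mixing dimensions would need separate handling.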