nismod/smif

timestep=None to Store.read_scenario_variant raises SmifMismatchError

willu47 opened this issue · 3 comments

Passing timestep=None to to Store.read_scenario_variant raises a SmifMismatchError, but it should return a DataArray with timestep as a dimension of the Spec.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/repository/nismod2/smif/src/smif/data_layer/data_array.py in from_df(cls, spec, dataframe)
    188             # reindex to ensure data order and fill out NaNs
--> 189             xr_data_array = _reindex_xr_data_array(spec, xr_data_array)
    190 

~/repository/nismod2/smif/src/smif/data_layer/data_array.py in _reindex_xr_data_array(spec, xr_data_array)
    315     coords = {c.name: c.ids for c in spec.coords}
--> 316     xr_data_array = xr_data_array.reindex(indexers=coords)
    317 

~/miniconda3/envs/smif/lib/python3.7/site-packages/xarray/core/dataarray.py in reindex(self, indexers, method, tolerance, copy, **indexers_kwargs)
    970         ds = self._to_temp_dataset().reindex(
--> 971             indexers=indexers, method=method, tolerance=tolerance, copy=copy)
    972         return self._from_temp_dataset(ds)

~/miniconda3/envs/smif/lib/python3.7/site-packages/xarray/core/dataset.py in reindex(self, indexers, method, tolerance, copy, **indexers_kwargs)
   1912             self.variables, self.sizes, self.indexes, indexers, method,
-> 1913             tolerance, copy=copy)
   1914         coord_names = set(self._coord_names)

~/miniconda3/envs/smif/lib/python3.7/site-packages/xarray/core/alignment.py in reindex_variables(variables, sizes, indexes, indexers, method, tolerance, copy)
    322                     'cannot reindex or align along dimension %r because the '
--> 323                     'index has duplicate values' % name)
    324 

ValueError: cannot reindex or align along dimension 'lad_uk_2016' because the index has duplicate values

The above exception was the direct cause of the following exception:

SmifDataMismatchError                     Traceback (most recent call last)
<ipython-input-17-601e02fbed84> in <module>
----> 1 store.read_scenario_variant_data('gva', 'gva_baseline', 'gva_per_head')

~/repository/nismod2/smif/src/smif/data_layer/store.py in read_scenario_variant_data(self, scenario_name, variant_name, variable, timestep)
    548         spec_dict = _pick_from_list(scenario['provides'], variable)
    549         spec = Spec.from_dict(spec_dict)
--> 550         return self.data_store.read_scenario_variant_data(key, spec, timestep)
    551 
    552     def write_scenario_variant_data(self, scenario_name, variant_name, data, timestep=None):

~/repository/nismod2/smif/src/smif/data_layer/file/file_data_store.py in read_scenario_variant_data(self, key, spec, timestep)
     83     def read_scenario_variant_data(self, key, spec, timestep=None):
     84         path = os.path.join(self.data_folders['scenarios'], key)
---> 85         data = self._read_data_array(path, spec, timestep)
     86         data.validate_as_full()
     87         return data

~/repository/nismod2/smif/src/smif/data_layer/file/file_data_store.py in _read_data_array(self, path, spec, timestep)
    394 
    395         if spec.dims:
--> 396             data_array = DataArray.from_df(spec, dataframe)
    397         else:
    398             # zero-dimensional case (scalar)

~/repository/nismod2/smif/src/smif/data_layer/data_array.py in from_df(cls, spec, dataframe)
    194             if dups:
    195                 msg = "Data for '{name}' contains duplicate values at {dups}"
--> 196                 raise SmifDataMismatchError(msg.format(name=name, dups=dups)) from ex
    197             else:
    198                 raise ex

SmifDataMismatchError: Data for 'gva_per_head' contains duplicate values at [{'lad_uk_2016': 'E06000001'}, {'lad_uk_2016': 'E06000001'}, {'lad_uk_2016': 'E06000001'}, {'lad_uk_2016': 'E06000001'}, {'lad_uk_2016': 'E06000001'}, {'lad_uk_2016': 'E06000001'}, {'lad_uk_2016': 'E06000001'},...

To recreate, place the following in tests/data_layer/test_data_store.py:TestDataArray

    def test_read_write_data_array_all(self, handler, scenario):
        data = np.array([0, 1], dtype=float)
        spec = Spec.from_dict(scenario['provides'][0])

        da = DataArray(spec, data)

        handler.write_scenario_variant_data('mortality.csv', da, 2010)
        handler.write_scenario_variant_data('mortality.csv', da, 2015)
        actual = handler.read_scenario_variant_data('mortality.csv', spec, timestep=None)

        expected = np.array([[0, 1], [0, 1]])

        np.testing.assert_array_equal(actual.as_ndarray(), expected)

This could be worked around on the usage side - add timestep with the expected timesteps as a dimension to the spec-to-read.

A slightly-magic-but-nice option could be for read_scenario_variant_data with timestep=None to add timestep as a dimension to the DataArray that's returned, with dimension coordinates as present in the data (e.g. sorted(df.timestep.unique())).

Does the data store need to support writing partial (i.e. single-timestep) data to a scenario variant? This is implied by the two calls to write_scenario_variant_data in the snippet above. Current implementations assume that calls to write_scenario_variant_data are independent and can safely override anything that was previously present in the store.

    def test_read_write_data_array_all(self, handler, scenario):
        spec = Spec.from_dict(deepcopy(scenario['provides'][0]))

        spec_with_t = scenario['provides'][0]
        spec_with_t['dims'].insert(0, 'timestep')
        spec_with_t['coords']['timestep'] = [2010, 2015]
        spec_with_t = Spec.from_dict(spec_with_t)
        da = DataArray(spec_with_t, np.array([[0, 1], [2, 3]], dtype='float'))

        handler.write_scenario_variant_data('mortality.csv', da)
        actual = handler.read_scenario_variant_data('mortality.csv', spec_with_t)
        expected = np.array([[0, 1], [2, 3]], dtype='float')
        np.testing.assert_array_equal(actual.as_ndarray(), expected)

        da_2010 = handler.read_scenario_variant_data('mortality.csv', spec, 2010)
        expected = np.array([0, 1], dtype='float')
        np.testing.assert_array_equal(da_2010.as_ndarray(), expected)

        da_2015 = handler.read_scenario_variant_data('mortality.csv', spec, 2015)
        expected = np.array([2, 3], dtype='float')
        np.testing.assert_array_equal(da_2015.as_ndarray(), expected)

Output as dataframe:

              mortality
timestep lad
2010     a          0.0
         b          1.0
2015     a          2.0
         b          3.0

2010:

     mortality
lad
a          0.0
b          1.0

2015:

     mortality
lad
a          2.0
b          3.0

This has been addressed in #373 (and associated PR #375) but the creation of a new method on the Store called read_scenario_variant_data_multiple_timesteps() and corresponding method on Results called read_scenario_data().

Note that there is no way of providing timesteps=None get get all available data, because the spec for the resulting DataFrame needs to be known, and timesteps are a property only of the model run.

I have now updated the original method read_scenario_variant_data() to take timestep as a non-optional argument.

This can now be closed as the original call (with timestep=None) is no longer permitted.