StimulusSet data types may clash with DataAssembly data types after s3 upload
When uploading a StimulusSet, the stimulus_id has to be coded as a string; otherwise zip packaging of the StimulusSet fails.
If the StimulusSet['stimulus_id'] field holds strings that contain, e.g., only digit characters, the string data type is not preserved when the StimulusSet is saved as a .csv and loaded from s3: any value that looks numeric is loaded as a different data type than the one that was saved. This is in contrast to DataAssemblies, which do preserve data types when loaded.
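The round trip described above can be reproduced without s3. The following is a minimal sketch, assuming the StimulusSet behaves like a plain pandas DataFrame that is written with `to_csv` and read back with `read_csv`; the column values and the in-memory buffer are illustrative stand-ins, not the actual brain-score upload path.

```python
import io

import pandas as pd

# Hypothetical stand-in for a StimulusSet: string ids made only of digits.
stimuli = pd.DataFrame({
    "stimulus_id": ["100", "35", "7"],
    "image_label": ["cat", "dog", "bird"],
})

# Write to csv and read back, using an in-memory buffer instead of s3.
buffer = io.StringIO()
stimuli.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

ids_before = stimuli["stimulus_id"].tolist()   # Python strings
ids_after = reloaded["stimulus_id"].tolist()   # no longer strings

print(type(ids_before[0]).__name__, "->", type(ids_after[0]).__name__)
```

The csv file carries no type information, so the reader has to guess, and a column of digit-only strings is read back as integers.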
When brain-score merges the StimulusSet into the DataAssembly along the stimulus_id dim while loading the DataAssembly, errors occur. This is because the stimulus_id needs to be a string in the StimulusSet for the StimulusSet to be uploaded, but it also needs to be a csv-inferrable type (rather than a string) in the DataAssembly for the merge of the two to succeed when the DataAssembly is loaded.
This issue also affects fields other than stimulus_id: string types are saved as .csv, and the data types of values are then inferred on a value-by-value basis. If a column of the StimulusSet contains some values that could be interpreted as strings and others that could be interpreted as integers (e.g., 'condition' = {'100', '35', 'contours', 'RGB'}), these are inferred differently, resulting in a mix of strings and integers in the StimulusSet after loading from s3. This causes errors in any tests that check the integrity of the data.
Since it does not seem to be possible to fix this as above by enforcing data types on the DataAssembly (DataArrays don't seem to allow mixed types), the two most reasonable workarounds seem to be either to code such values explicitly as strings (e.g., 'condition' = {'100a', '35a', 'contours', 'RGB'} instead of 'condition' = {'100', '35', 'contours', 'RGB'}) or to enforce the data types after loading.
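The second workaround, enforcing data types after loading, might look like the sketch below. It assumes the csv is read with pandas; the csv text and column names are made up for illustration. It also shows a closely related variant that declares the types at read time, which is not mentioned above but avoids the inference step entirely.

```python
import io

import pandas as pd

# Illustrative csv content; in practice this would be the file loaded from s3.
csv_text = "stimulus_id,condition\n100,100\n35,35\n7,contours\n8,RGB\n"

# Variant A: load with default inference, then cast the affected columns.
loaded = pd.read_csv(io.StringIO(csv_text))
for column in ("stimulus_id", "condition"):
    loaded[column] = loaded[column].astype(str)

# Variant B: tell the reader up front that these columns are strings.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"stimulus_id": str, "condition": str},
)

print(loaded["stimulus_id"].tolist())
print(typed["condition"].tolist())
```

Casting after loading cannot restore formatting that inference has already discarded (for example, leading zeros in an id), so declaring the types when reading is the safer of the two variants where it is available.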
I would suggest saving the StimulusSet in a data format that preserves data types, e.g. xarray netcdf4 instead of .csv, or adding more descriptive error messages when the aforementioned errors occur.
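As a rough idea of what a more descriptive error could look like, here is a small sketch. `check_id_types` is a hypothetical helper, not an existing brain-score or brainio function, and it works on plain Python lists of ids rather than on the real StimulusSet and DataAssembly objects.

```python
def check_id_types(stimulus_set_ids, assembly_ids, name="stimulus_id"):
    """Hypothetical pre-merge check: fail early with a readable message
    when the two id collections do not share a single Python type."""
    stimulus_types = sorted({type(v).__name__ for v in stimulus_set_ids})
    assembly_types = sorted({type(v).__name__ for v in assembly_ids})

    # A column with more than one value type was probably mangled by csv inference.
    if len(stimulus_types) > 1:
        raise TypeError(
            f"StimulusSet '{name}' contains mixed types {stimulus_types}; "
            "this can happen when a csv is loaded with type inference."
        )
    # Ids of different types never compare equal, so the merge cannot match them.
    if stimulus_types != assembly_types:
        raise TypeError(
            f"'{name}' is {stimulus_types} in the StimulusSet but "
            f"{assembly_types} in the DataAssembly; cast one side before merging."
        )


try:
    check_id_types([100, 35], ["100", "35"])
except TypeError as error:
    message = str(error)
    print(message)
```

A check of this kind would run just before the merge, so that the user sees which field and which types disagree instead of a downstream failure.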
Thanks @benlonnqvist for opening an issue - we will look into this and get back ASAP!