brain-score/vision

StimulusSet data types may clash with DataAssembly data types after s3 upload


When uploading StimulusSets, the stimulus_id has to be coded as a string, as otherwise zip packaging of the StimulusSet fails.

If a StimulusSet['stimulus_id'] value is a string made up only of digit characters, its string type is lost in the round trip: once the StimulusSet is saved as a .csv and loaded back from s3, such values are re-inferred from the text, so they come back with a different data type (e.g. integer) than the one that was saved.
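To make the round trip concrete, here is a minimal sketch using plain pandas rather than the actual brain-score packaging code. It assumes the loader relies on pandas-style type inference when reading the .csv; the DataFrame is a hypothetical stand-in for a StimulusSet, and whether the real loader exposes a `dtype` option is not confirmed here.

```python
import io

import pandas as pd

# Hypothetical stand-in for a StimulusSet: stimulus_id holds digit-only strings.
saved = pd.DataFrame({"stimulus_id": ["100", "35", "7"]})

# Write to an in-memory .csv instead of s3 to keep the sketch self-contained.
buffer = io.StringIO()
saved.to_csv(buffer, index=False)

# Default load: the column type is inferred from the text, so the strings become integers.
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Load with an explicit dtype: the values stay strings.
buffer.seek(0)
preserved = pd.read_csv(buffer, dtype={"stimulus_id": str})

print(inferred["stimulus_id"].dtype)
print(preserved["stimulus_id"].tolist())
```

If the loader does go through `pandas.read_csv`, passing an explicit `dtype` for known string columns would be one way to keep the saved types.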

DataAssemblies, by contrast, do preserve data types when they are loaded.

When brain-score loads a DataAssembly, it merges the StimulusSet into it along the stimulus_id dim, and the type mismatch then surfaces as hard-to-interpret errors. The requirements conflict: stimulus_id must be a string in the StimulusSet for the upload to succeed, but it must be a csv-inferrable type (rather than a string) in the DataAssembly for the merge to succeed when the DataAssembly is loaded.
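The mismatch can be illustrated without any brain-score code. The sketch below is not the actual merge logic; it only shows, with hypothetical ids, why string ids on one side and re-inferred integer ids on the other never line up.

```python
# Hypothetical ids: strings as stored in the DataAssembly,
# integers as re-inferred from the StimulusSet .csv.
assembly_ids = ["100", "35", "7"]
stimulus_set_ids = [100, 35, 7]

# A string never compares equal to an integer, so the two sides share no ids.
shared = set(assembly_ids) & set(stimulus_set_ids)

# Once both sides use the same type, every id matches again.
shared_after_cast = set(assembly_ids) & {str(i) for i in stimulus_set_ids}

print(shared)
print(sorted(shared_after_cast))
```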

The problem is not limited to stimulus_id: every string field is saved as .csv, and the data types are then inferred value by value on load. If a StimulusSet column mixes values that can be read as integers with values that can only be read as strings (e.g., 'condition' = {'100', '35', 'contours', 'RGB'}), the values are inferred differently, and the column holds a mix of strings and integers after loading from s3. Any test that checks the integrity of the data then fails.

This case does not seem to be fixable the way the stimulus_id case is, by matching the data types in the DataAssembly, since DataArrays don't seem to allow mixed types. The two most reasonable workarounds seem to be either to encode such values so that they are unambiguously strings (e.g., 'condition' = {'100a', '35a', 'contours', 'RGB'} instead of 'condition' = {'100', '35', 'contours', 'RGB'}), or to enforce the data types after loading.
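The second workaround, enforcing types after loading, could look roughly like the sketch below. `enforce_string_columns` is a hypothetical helper, not part of brain-score, and the DataFrame mimics a column that came back with mixed types.

```python
import pandas as pd


def enforce_string_columns(frame, columns):
    """Return a copy of frame in which every value of the listed columns is a str."""
    fixed = frame.copy()
    for column in columns:
        fixed[column] = fixed[column].astype(str)
    return fixed


# Hypothetical column as it might look after loading: integers mixed with strings.
loaded = pd.DataFrame(
    {"condition": pd.Series([100, 35, "contours", "RGB"], dtype=object)}
)
fixed = enforce_string_columns(loaded, ["condition"])

print(fixed["condition"].tolist())
```

Note that casting after the fact cannot restore information the inference already dropped: an id saved as '007' and loaded as the integer 7 becomes '7', not '007'.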

I would suggest either saving the StimulusSet in a data format that preserves data types, e.g. xarray netcdf4 instead of .csv, or adding more descriptive error messages for the errors described above.
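As a rough idea of what a more descriptive error could look like, the sketch below checks a loaded table for columns whose values have more than one Python type and names them in the message. `check_uniform_types` is a hypothetical helper, and the wording of the message is only a suggestion.

```python
import pandas as pd


def check_uniform_types(frame):
    """Raise a TypeError naming every column whose values have more than one Python type."""
    problems = {}
    for column in frame.columns:
        # Missing values (NaN) are floats, so a string column with gaps is flagged too.
        type_names = sorted({type(value).__name__ for value in frame[column]})
        if len(type_names) > 1:
            problems[column] = type_names
    if problems:
        details = "; ".join(
            f"'{column}' mixes {', '.join(names)}" for column, names in problems.items()
        )
        raise TypeError(
            f"Columns with mixed value types (possibly re-inferred from .csv): {details}"
        )


# Hypothetical column as it might look after loading: integers mixed with strings.
mixed = pd.DataFrame(
    {"condition": pd.Series([100, 35, "contours", "RGB"], dtype=object)}
)
try:
    check_uniform_types(mixed)
    message = ""
except TypeError as error:
    message = str(error)

print(message)
```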

Thanks @benlonnqvist for opening an issue - we will look into this and get back ASAP!