Cloud-Drift/clouddrift

๐Ÿ› Unit test test_subsurface_floats_opens fails

Closed this issue · 6 comments

Describe the bug
I get a failure when I run the unit test suite.

To Reproduce

git clone https://github.com/cloud-drift/clouddrift
cd clouddrift/
mamba env create -f environment.yml --force
conda activate clouddrift
mamba install -c conda-forge matplotlib cartopy
pip install -e .
python -m unittest tests/*.py


Additional context

ERROR: test_subsurface_floats_opens (tests.datasets_tests.datasets_tests.test_subsurface_floats_opens)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/mattia/MyGit/JOSS/clouddrift/tests/datasets_tests.py", line 47, in test_subsurface_floats_opens
    with datasets.subsurface_floats() as ds:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mattia/MyGit/JOSS/clouddrift/clouddrift/datasets.py", line 484, in subsurface_floats
    return _dataset_filecache(
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/mattia/MyGit/JOSS/clouddrift/clouddrift/datasets.py", line 630, in _dataset_filecache
    ds = get_ds()
         ^^^^^^^^
  File "/Users/mattia/MyGit/JOSS/clouddrift/clouddrift/adapters/subsurface_floats.py", line 49, in to_xarray
    source_data = scipy.io.loadmat(local_file)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mattia/miniforge3/envs/clouddrift/lib/python3.12/site-packages/scipy/io/matlab/_mio.py", line 227, in loadmat
    matfile_dict = MR.get_variables(variable_names)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mattia/miniforge3/envs/clouddrift/lib/python3.12/site-packages/scipy/io/matlab/_mio5.py", line 330, in get_variables
    res = self.read_var_array(hdr, process)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mattia/miniforge3/envs/clouddrift/lib/python3.12/site-packages/scipy/io/matlab/_mio5.py", line 290, in read_var_array
    return self._matrix_reader.array_from_header(header, process)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "_mio5_utils.pyx", line 665, in scipy.io.matlab._mio5_utils.VarReader5.array_from_header
  File "_mio5_utils.pyx", line 710, in scipy.io.matlab._mio5_utils.VarReader5.array_from_header
  File "_mio5_utils.pyx", line 885, in scipy.io.matlab._mio5_utils.VarReader5.read_cells
  File "_mio5_utils.pyx", line 663, in scipy.io.matlab._mio5_utils.VarReader5.read_mi_matrix
  File "_mio5_utils.pyx", line 694, in scipy.io.matlab._mio5_utils.VarReader5.array_from_header
  File "_mio5_utils.pyx", line 768, in scipy.io.matlab._mio5_utils.VarReader5.read_real_complex
  File "_mio5_utils.pyx", line 445, in scipy.io.matlab._mio5_utils.VarReader5.read_numeric
  File "_mio5_utils.pyx", line 350, in scipy.io.matlab._mio5_utils.VarReader5.read_element
  File "_streams.pyx", line 171, in scipy.io.matlab._streams.ZlibInputStream.read_string
  File "_streams.pyx", line 164, in scipy.io.matlab._streams.ZlibInputStream.read_into
OSError: could not read bytes

----------------------------------------------------------------------
Ran 216 tests in 34.575s

FAILED (errors=1)

JOSS: openjournals/joss-reviews#6742

I can't reproduce this... Can someone else try?

Side note - not a requirement for JOSS:
Do you perform a checksum validation on the test data?
If not, I suggest looking into pooch. Many SciPy packages use it, and in my experience it works well out of the box. Additionally, I'm personally not a big fan of test data being downloaded to a hidden directory in the home folder. By default, pooch uses the standard machine-specific cache directory.

  1. We don't do checksums at the moment. We could, but that would break tests and require more work when the underlying data (that we don't control) is updated.
  2. I did not implement the download module, but it looks like pooch could be used instead.
  3. Personally, I'm not a fan of the hidden folder either, haha. We do want to be able to save the data locally, rather than in a temporary cache folder, to avoid re-downloading the ~20k files of the GDP datasets.
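For what it's worth, a basic checksum check doesn't need any new dependency. This is only a sketch (the function names and the `.mat` file name are hypothetical, not part of clouddrift): it verifies a cached file's SHA-256 against a pinned digest before use, so a corrupted download like the one above would be caught and could trigger a re-download.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_cached_file(path: Path, expected_sha256: str) -> bool:
    """Check a downloaded file against a pinned checksum.

    Returns False (so the caller can re-download) if the file is
    missing or its digest does not match.
    """
    return path.is_file() and sha256_of(path) == expected_sha256
```

The caveat from point 1 still applies: the pinned digests would have to be updated whenever the upstream data changes.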

👍
Sorry, too much focus on pooch in my comment. My main point was: maybe you can't reproduce it because we are using different test data? I'll try again tomorrow.

Ah, OK. The tests should always pick up the latest datasets from their respective repos. There are sometimes timeout issues, but right now it seems to be OK. Let me know when you give it another go.

It worked this morning in the same environment. I guess it was either a flaky run or the downloaded data was corrupted.

We don't do checksums at the moment. We could, but that would break tests and require more work when the underlying data (that we don't control) is updated.

Maybe the best solution is to store a copy of the data that you need for unit testing somewhere you have full control, like a dedicated GitHub repo, Zenodo, a cloud provider, ...