audeering/audbcards

Add tests for audbcards.Dataset

Closed this issue · 22 comments

hagenw commented

Please add a file tests/test_dataset.py that uses the db fixture to test return values of all methods/properties/attributes of audbcards.Dataset.
As mentioned in #4, publication and repository_link do not support the file-system backend yet, so you could skip those for now.

For a start you could use this example:

import pandas as pd
import pytest

import audb
import audbcards


def test_dataset(db):

    dataset = audbcards.Dataset(pytest.NAME, pytest.VERSION)

    # __init__
    assert dataset.name == pytest.NAME
    assert dataset.version == pytest.VERSION
    assert dataset.repository == pytest.REPOSITORY
    expected_header = audb.info.header(
        pytest.NAME,
        version=pytest.VERSION,
    )
    assert str(dataset.header) == str(expected_header)
    expected_deps = audb.dependencies(
        pytest.NAME,
        version=pytest.VERSION,
    )
    pd.testing.assert_frame_equal(dataset.deps(), expected_deps())

    # archives
    assert dataset.archives == str(len(db.files))

    # ...

If some methods need multiple test parameters or are more complicated to test, it might make sense to add a separate def test_<method_name>() for each such method.
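As an illustration, such a dedicated, parametrized test could look like this. The file paths are made up, and keeping the last two path segments mirrors the expected_example computation shown later in this thread:

```python
import pytest


# Hypothetical sketch of a dedicated test for one property.
# The paths are illustrative and not taken from the real db fixture.
@pytest.mark.parametrize(
    'file, expected',
    [
        ('data/speech/f0.wav', 'speech/f0.wav'),
        ('data/noise/f1.wav', 'noise/f1.wav'),
    ],
)
def test_example(file, expected):
    # Dataset.example is expected to keep only the last
    # two path segments of the selected media file
    assert '/'.join(file.split('/')[-2:]) == expected
```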

While writing the tests I noticed two strange (or maybe normal) behaviors:

  1. When first executing the tests, I noticed that audb.dependencies() doesn't always return the dependencies in the same order. The changing order causes the current statement to fail:

    pd.testing.assert_frame_equal(dataset.deps(), expected_deps())

    The dependencies order is sometimes

    Index(['db.files.csv', 'db.segments.csv', 'db.speaker.csv', 'data/f0.wav',
              'data/f1.wav'],
             dtype='object')
    

    and other times

    Index(['db.files.csv', 'db.segments.csv', 'db.speaker.csv', 'data/f1.wav',
              'data/f0.wav'],
             dtype='object')
    

    So the two files alternate. This is fixed by specifying a cache_root argument when calling audb.dependencies().

  2. When Dataset.player gets called, the waveplot db.png is created under the project's root and not under build or cache. Is that the expected behavior? (See screenshot below)

[screenshot]

Another matter regards publication: the property is called in create_datacard_page() to generate the rst file, but it raises an error in the Dataset class, which prevents us from running the tests in the first place. Maybe we need to temporarily edit the code and avoid the Artifactory-related methods so we can test, or do you suggest something else?
Here's the code for reference:

@property
def publication(self) -> str:
    r"""Date and author uploading dataset to repository."""
    url = (
        f'{self.repository.host}/{self.repository.name}/{self.name}/'
        f'db/{self.version}/db-{self.version}.zip'
    )
    path = audfactory.path(url)
    stat = path.stat()
    return f'{stat.ctime:%Y-%m-%d} by {stat.created_by}'

The error comes from path.stat(), as path is None in our case.

One other thing, which I think requires opening another issue: both create_datacard_page() and create_datasets_page() require a /datasets folder to exist in order to create the rst file. However, the datasets folder is only created when the player property is called (as far as I could find), and that property is called in the later stages of create_datacard_page(), so the method fails because the folder does not exist before the code that creates it is reached. Or am I missing something?

hagenw commented

Another matter regards publication: the property is called in create_datacard_page() to generate the rst file, but it raises an error in the Dataset class, which prevents us from running the tests in the first place. Maybe we need to temporarily edit the code and avoid the Artifactory-related methods so we can test, or do you suggest something else? Here's the code for reference:

@property
def publication(self) -> str:
    r"""Date and author uploading dataset to repository."""
    url = (
        f'{self.repository.host}/{self.repository.name}/{self.name}/'
        f'db/{self.version}/db-{self.version}.zip'
    )
    path = audfactory.path(url)
    stat = path.stat()
    return f'{stat.ctime:%Y-%m-%d} by {stat.created_by}'

The error comes from path.stat(), as path is None in our case.

Yes, as stated in #4 and in the description of this issue, just skip the tests for publication and repository_link, as they do not support the file-system backend.
This then also means not including create_datacard_page() in the tests for now.
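A sketch of how the excluded tests could be marked, so that the skip reason stays visible in the pytest report (the shared marker name and the placeholder bodies are just a suggestion):

```python
import pytest

# Shared marker for everything that needs the Artifactory backend;
# see #4 for why the file-system backend is not supported yet
needs_artifactory = pytest.mark.skip(
    reason='publication/repository_link do not support file-system backend'
)


@needs_artifactory
def test_publication():
    ...


@needs_artifactory
def test_repository_link():
    ...
```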

hagenw commented

One other thing which I think it requires opening another issue, the method create_datacard_page() and create_datasets_page() both require a /datasets folder to exist in order to create the rst file, however the datasets folder is only created when calling the player property (from where I could find), and the property is called in the later stages of create_datacard_page(), so the method will fail because of the non-existence of the folder before reaching the code that creates the folder. Or maybe I am missing something ?

No, please open an issue.

hagenw commented

While writing the tests I noticed two strange (or maybe normal) behaviors:

  1. When first executing the tests, I noticed that audb.dependencies() doesn't always return the dependencies in the same order. The changing order causes the current statement to fail:

    pd.testing.assert_frame_equal(dataset.deps(), expected_deps())

    The dependencies order is sometimes

    Index(['db.files.csv', 'db.segments.csv', 'db.speaker.csv', 'data/f0.wav',
              'data/f1.wav'],
             dtype='object')
    

    and other times

    Index(['db.files.csv', 'db.segments.csv', 'db.speaker.csv', 'data/f1.wav',
              'data/f0.wav'],
             dtype='object')
    

    So the two files alternate. This is fixed by specifying a cache_root argument when calling audb.dependencies().

I created audeering/audb#298.
When running the tests here, it would anyway be a good idea to use a temp folder provided by pytest for the audb cache.
Compare https://github.com/audeering/audb/blob/1e4436efd13b4bfa3ac0b4dd187519c4616e54f0/tests/conftest.py#L22-L110
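A minimal sketch of such a fixture, assuming the installed audb version respects the AUDB_CACHE_ROOT environment variable (setting audb.config.CACHE_ROOT, as in the linked conftest, would be an alternative):

```python
import pytest


@pytest.fixture(autouse=True)
def cache(tmp_path, monkeypatch):
    r"""Redirect the audb cache to a fresh temporary folder per test.

    Assumption: audb reads AUDB_CACHE_ROOT;
    assigning audb.config.CACHE_ROOT would work as well.
    """
    cache_root = tmp_path / 'audb'
    cache_root.mkdir()
    monkeypatch.setenv('AUDB_CACHE_ROOT', str(cache_root))
    yield cache_root
```

With autouse=True every test starts from an empty cache, which also rules out the ordering problem caused by a pre-existing cache.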

hagenw commented
  1. When Dataset.player gets called, the waveplot db.png is created under the project's root and not under build or cache. Is that the expected behavior? (See screenshot below)

That's also a bug; please open an issue and exclude player from the tests for now.

One other thing, which I think requires opening another issue: both create_datacard_page() and create_datasets_page() require a /datasets folder to exist in order to create the rst file. However, the datasets folder is only created when the player property is called (as far as I could find), and that property is called in the later stages of create_datacard_page(), so the method fails because the folder does not exist before the code that creates it is reached. Or am I missing something?

No, please open an issue.

See #15

  1. When Dataset.player gets called, the waveplot db.png is created under the project's root and not under build or cache. Is that the expected behavior? (See screenshot below)

That's also a bug; please open an issue and exclude player from the tests for now.

See #16

While writing the tests I noticed two strange (or maybe normal) behaviors:

  1. When first executing the tests, I noticed that audb.dependencies() doesn't always return the dependencies in the same order. The changing order causes the current statement to fail:

    pd.testing.assert_frame_equal(dataset.deps(), expected_deps())

    The dependencies order is sometimes

    Index(['db.files.csv', 'db.segments.csv', 'db.speaker.csv', 'data/f0.wav',
              'data/f1.wav'],
             dtype='object')
    

    and other times

    Index(['db.files.csv', 'db.segments.csv', 'db.speaker.csv', 'data/f1.wav',
              'data/f0.wav'],
             dtype='object')
    

    So the two files alternate. This is fixed by specifying a cache_root argument when calling audb.dependencies().

I created audeering/audb#298. When running the tests here, it would anyway be a good idea to use a temp folder provided by pytest for the audb cache. Compare https://github.com/audeering/audb/blob/1e4436efd13b4bfa3ac0b4dd187519c4616e54f0/tests/conftest.py#L22-L110

Right, but the thing is that the cache is set explicitly in the Dataset class itself, see screenshot below:
[screenshot]

Shall we change that to the default audb cache instead?

Another thing I found is that the current statement fails on Python 3.9 but passes on Python 3.8, meaning dataset.example, which uses audb.dependencies(), returns the dependencies in a different order. I'm not sure whether this time the shift in order is due to the Python version or is random, as stated in audeering/audb#298:

         # example
        durations = [d.total_seconds() for d in db.files_duration(db.files)]
        median_duration = np.median([d for d in durations if 0.5 < d < 300])
        expected_example_index = min(
            range(len(durations)),
            key=lambda n: abs(durations[n] - median_duration)
        )
        expected_example = '/'.join(
            db.files[expected_example_index].split('/')[-2:]
        )
>       assert dataset.example == expected_example
E       AssertionError: assert 'data/f1.wav' == 'data/f0.wav'
E         - data/f0.wav
E         ?       ^
E         + data/f1.wav
E         ?       ^
hagenw commented

I think the problem is not audb.dependencies(), but maybe the code that tries to find the shortest file. Or do the files differ in duration?

hagenw commented

Shall we change that to the default audb cache instead?

Good question.

At least we should make the cache folder used by Dataset configurable, e.g. by adding it as an argument to __init__.
In another step we might then decide if we want to store the audb cache there as well, or maybe better use the audb cache that the user has configured for audb.
No, we should continue to use a separate cache.
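A minimal stand-in sketch of the proposed interface change (not the real audbcards.Dataset, and the default folder is purely illustrative):

```python
import os
import tempfile


class Dataset:
    """Stand-in showing only the proposed cache_root argument."""

    def __init__(self, name, version, *, cache_root=None):
        self.name = name
        self.version = version
        # Fall back to an illustrative package default
        # if the caller does not provide a cache folder
        self.cache_root = cache_root or os.path.join(
            tempfile.gettempdir(),
            'audbcards',
        )
```

The test suite could then pass a pytest tmp_path as cache_root and never touch the user's real cache.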

I think the problem is not audb.dependencies(), but maybe the code that tries to find the shortest file. Or do the files differ in duration?

The ordering bug appears in the right member of the assertion (see below), in the case where cache_root is not specified in expected_deps = audb.dependencies() and the cache folder already exists (the dataset has been loaded to the cache once before):

expected_deps = audb.dependencies(
    pytest.NAME,
    version=pytest.VERSION,
    # cache_root=CACHE,
)
pd.testing.assert_frame_equal(dataset.deps(), expected_deps())

fails with the error

E   DataFrame.index values are different (40.0 %)
E   [left]:  Index(['db.files.csv', 'db.segments.csv', 'db.speaker.csv', 'data/f0.wav',
E          'data/f1.wav'],
E         dtype='object')
E   [right]: Index(['db.files.csv', 'db.segments.csv', 'db.speaker.csv', 'data/f1.wav',
E          'data/f0.wav'],
E         dtype='object')
E   At positional index 3, first diff: data/f0.wav != data/f1.wav

pandas/_libs/testing.pyx:172: AssertionError

But in theory this shouldn't happen in the CI, as we start from an empty cache each time.
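Until the ordering is fixed in audb, the test comparison could also be made order-independent by sorting both indices first; a toy DataFrame stands in for the dependency table here:

```python
import pandas as pd

# Two copies of the same dependency table whose media rows
# appear in a different order, as observed above
left = pd.DataFrame(
    {'duration': [0.0, 1.0, 1.0]},
    index=['db.table.csv', 'f0.wav', 'f1.wav'],
)
right = left.reindex(['db.table.csv', 'f1.wav', 'f0.wav'])

# A direct comparison would fail on the index order;
# sorting both sides makes the assertion order-independent
pd.testing.assert_frame_equal(left.sort_index(), right.sort_index())
```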

hagenw commented

It's still strange, because the files are stored in a CSV file on the server and have a fixed order there.

It's still strange, because the files are stored in a CSV file on the server and have a fixed order there.

I'll look more into it; in parallel I'll open a new MR to add the cache_root argument to Dataset.

I've conducted the following test steps:

  1. Delete cache
  2. Run Dataset.example, which returned f1.wav
  3. Delete cache again
  4. Re-run Dataset.example; this time it returned f0.wav

To me it looks like it's actually coming from audb.dependencies() after all, as the example file index always results in 0, but deps.media is the one that changes.

Also, if relevant: under the created cache I cannot find db.files.csv or db.segments.csv (see below).

[screenshot]

hagenw commented

I created a minimal example to see if it is audb.dependencies() or the selection of the example media based on the duration.
To make the example work, you also need to remove the following line from audbcards/core/dataset.py:

cache_root=CACHE,

import tempfile

import numpy as np

import audb
import audbcards
import audeer
import audformat
import audiofile


with tempfile.TemporaryDirectory() as tmp:

    name = 'db'
    version = '1.0.0'

    host = audeer.mkdir(audeer.path(tmp, 'host'))
    repo = audeer.mkdir(audeer.path(host, 'repo'))
    db_root = audeer.mkdir(audeer.path(tmp, name))
    cache_root = audeer.mkdir(audeer.path(tmp, 'cache'))

    sampling_rate = 8000
    duration = 1
    signal = np.zeros((1, duration * sampling_rate))
    audiofile.write(audeer.path(db_root, 'f0.wav'), signal, sampling_rate)
    audiofile.write(audeer.path(db_root, 'f1.wav'), signal, sampling_rate)

    db = audformat.Database(name)
    index = audformat.filewise_index(['f0.wav', 'f1.wav'])
    db['table'] = audformat.Table(index)
    db['table']['column'] = audformat.Column()
    db['table']['column'].set(['label0', 'label1'])
    db.save(db_root)

    repository = audb.Repository(
        name='repo',
        host=host,
        backend='file-system',
    )
    audb.config.REPOSITORIES = [repository]
    audb.config.CACHE_ROOT = cache_root

    audb.publish(db_root, version, repository)

    deps = audb.dependencies(name, version=version)

    dataset = audbcards.Dataset(name, version)

    print('Dependencies:')
    print(dataset.deps())

    print('')
    print('Corresponding CSV file:')
    with open(audeer.path(db_root, 'db.csv'), 'r') as fp:
        print(fp.read())

    min_dur = 0.5
    max_dur = 300  # 5 min
    durations = [dataset.deps.duration(file) for file in dataset.deps.media]
    selected_duration = np.median(
        [d for d in durations if d >= min_dur and d <= max_dur]
    )

    index = min(
        range(len(durations)),
        key=lambda n: abs(durations[n] - selected_duration),
    )
    print(f'Selected index: {index}')
    print(f'Selected media: {dataset.deps.media[index]}')

If you repeat this several times you get:

Dependencies:                                                                                        
                                           archive  bit_depth  channels                          checksum  duration format  removed  sampling_rate  type version
db.table.csv                                 table          0         0  dee60fcda0ef3f99f99587b5073c835b       0.0    csv        0              0     0   1.0.0
f1.wav        0610c45c-d031-532c-7423-3f47d66428b6         16         1  8cc2ed04be3808f22bc866cb7dc33c1e       1.0    wav        0           8000     1   1.0.0
f0.wav        98a3ae95-9141-a79f-b668-188943854370         16         1  8cc2ed04be3808f22bc866cb7dc33c1e       1.0    wav        0           8000     1   1.0.0

Corresponding CSV file:
,archive,bit_depth,channels,checksum,duration,format,removed,sampling_rate,type,version
db.table.csv,table,0,0,dee60fcda0ef3f99f99587b5073c835b,0.0,csv,0,0,0,1.0.0
f1.wav,0610c45c-d031-532c-7423-3f47d66428b6,16,1,8cc2ed04be3808f22bc866cb7dc33c1e,1.0,wav,0,8000,1,1.0.0
f0.wav,98a3ae95-9141-a79f-b668-188943854370,16,1,8cc2ed04be3808f22bc866cb7dc33c1e,1.0,wav,0,8000,1,1.0.0

Selected index: 0
Selected media: f1.wav

...

Dependencies:                                                                                        
                                           archive  bit_depth  channels                          checksum  duration format  removed  sampling_rate  type version
db.table.csv                                 table          0         0  dee60fcda0ef3f99f99587b5073c835b       0.0    csv        0              0     0   1.0.0
f0.wav        98a3ae95-9141-a79f-b668-188943854370         16         1  8cc2ed04be3808f22bc866cb7dc33c1e       1.0    wav        0           8000     1   1.0.0
f1.wav        0610c45c-d031-532c-7423-3f47d66428b6         16         1  8cc2ed04be3808f22bc866cb7dc33c1e       1.0    wav        0           8000     1   1.0.0

Corresponding CSV file:
,archive,bit_depth,channels,checksum,duration,format,removed,sampling_rate,type,version
db.table.csv,table,0,0,dee60fcda0ef3f99f99587b5073c835b,0.0,csv,0,0,0,1.0.0
f0.wav,98a3ae95-9141-a79f-b668-188943854370,16,1,8cc2ed04be3808f22bc866cb7dc33c1e,1.0,wav,0,8000,1,1.0.0
f1.wav,0610c45c-d031-532c-7423-3f47d66428b6,16,1,8cc2ed04be3808f22bc866cb7dc33c1e,1.0,wav,0,8000,1,1.0.0

                 
Selected index: 0     
Selected media: f0.wav

So the problem is indeed audb.dependencies(). But as the CSV file that is loaded from the backend already differs, my guess is that the problem happens already during publication of the database.


So this needs to be fixed in audb for it to vanish. What we can do here in the tests is simply create f0.wav and f1.wav with different durations to enforce that always the same file is picked.

So this needs to be fixed in audb for it to vanish. What we can do here in the tests is simply create f0.wav and f1.wav with different durations to enforce that always the same file is picked.

If we have two files, we would still get either 0 or 1 as index, and the files will keep being shuffled, so the returned file will not be the same anyway, no? I've used 3 files instead and now the pipeline passes.

hagenw commented

If we have two files, we would still get either 0 or 1 as index

The index is calculated based on the file with the smallest duration. If only one file has a short duration (instead of all files having the same length), the index will automatically adjust to the actual position of that file in the dependency table, so you will always get back the same file.
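The effect can be demonstrated with the selection logic alone, a plain-Python stand-in for the median-based selection from the minimal example above: with distinct durations the chosen file no longer depends on the row order of the dependency table.

```python
import numpy as np

MIN_DUR = 0.5
MAX_DUR = 300  # 5 min, as in Dataset.example


def select_example(media, durations):
    """Return the media file whose duration is closest to the median."""
    selected = np.median(
        [d for d in durations if MIN_DUR <= d <= MAX_DUR]
    )
    index = min(
        range(len(durations)),
        key=lambda n: abs(durations[n] - selected),
    )
    return media[index]


# Distinct durations make the median file unambiguous,
# so any row order selects the same file
assert select_example(['f0.wav', 'f1.wav', 'f2.wav'], [1.0, 5.0, 10.0]) == 'f1.wav'
assert select_example(['f2.wav', 'f0.wav', 'f1.wav'], [10.0, 1.0, 5.0]) == 'f1.wav'
```

With two files of identical duration the tie is broken by row position, which is exactly why the picked file alternated before.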

What is missing here? Can this not be closed?

hagenw commented

There might still be a few tests missing, but we will find that out later.

Otherwise, everything discussed here is finished and fixed.