Add tests for audbcards.Dataset
Closed this issue · 22 comments
Please add a file `tests/test_dataset.py` that uses the `db` fixture to test the return values of all methods/properties/attributes of `audbcards.Dataset`.
As mentioned in #4, `publication` and `repository_link` do not support the file-system backend yet, so you could skip those for now.
For a start you could use this example:
```python
import pandas as pd
import pytest

import audb
import audbcards


def test_dataset(db):
    dataset = audbcards.Dataset(pytest.NAME, pytest.VERSION)

    # __init__
    assert dataset.name == pytest.NAME
    assert dataset.version == pytest.VERSION
    assert dataset.repository == pytest.REPOSITORY
    expected_header = audb.info.header(
        pytest.NAME,
        version=pytest.VERSION,
    )
    assert str(dataset.header) == str(expected_header)
    expected_deps = audb.dependencies(
        pytest.NAME,
        version=pytest.VERSION,
    )
    pd.testing.assert_frame_equal(dataset.deps(), expected_deps())

    # archives
    assert dataset.archives == str(len(db.files))

    # ...
```
If a method needs multiple test parameters or is more complicated to test, it might make sense to add a separate `def test_<method_name>()` for it.
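For illustration, a hedged sketch of such a dedicated test function; the `FakeDataset` stand-in and its values are invented for the sketch, not the real `audbcards.Dataset` or the `db` fixture:

```python
# Sketch only: make_dataset() and its values are stand-ins,
# not the real audbcards.Dataset / db fixture.
def make_dataset():
    class FakeDataset:
        name = 'db'
        version = '1.0.0'

        @property
        def archives(self):
            # pretend the dataset contains two media files
            return '2'

    return FakeDataset()


def test_archives():
    # dedicated test for a single, more complicated property
    dataset = make_dataset()
    assert dataset.archives == '2'
```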
While writing the tests I noticed two strange (or maybe normal) behaviors:

- When first executing the tests, I noticed that `audb.dependencies()` does not always return the dependencies in the same order. The changing order causes the current statement `pd.testing.assert_frame_equal(dataset.deps(), expected_deps())` to fail. The dependency order is sometimes

      Index(['db.files.csv', 'db.segments.csv', 'db.speaker.csv', 'data/f0.wav', 'data/f1.wav'], dtype='object')

  and other times

      Index(['db.files.csv', 'db.segments.csv', 'db.speaker.csv', 'data/f1.wav', 'data/f0.wav'], dtype='object')

  So the two media files alternate. This is fixed by specifying a `cache_root` argument when calling `audb.dependencies()`.

- When `Dataset.player` is called, the waveplot `db.png` is created under the project's root and not under build or cache. Is that the expected behavior? (See screenshot below.)
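For the first point, a generic pandas workaround (independent of `cache_root`) is to make the comparison order-insensitive by sorting both frames on their index before asserting equality; the frames below are toy stand-ins for the dependency tables:

```python
import pandas as pd

# Two dependency tables with the same rows in different order,
# mimicking the alternating Index shown above.
left = pd.DataFrame(
    {'duration': [1.0, 1.0]},
    index=['data/f0.wav', 'data/f1.wav'],
)
right = pd.DataFrame(
    {'duration': [1.0, 1.0]},
    index=['data/f1.wav', 'data/f0.wav'],
)

# Sorting by index makes the comparison robust to row order
pd.testing.assert_frame_equal(left.sort_index(), right.sort_index())
```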
Another matter regards `publication`: the property is called and used to generate the rst file in `create_datacard_page()`. The property `publication` raises an error in the `Dataset` class, which prevents us from running the tests, as the method fails in the first place. So maybe we need to edit the code temporarily and not use the Artifactory related methods to avoid errors and be able to test, or do you suggest something else?

Here's the code for reference:

```python
@property
def publication(self) -> str:
    r"""Date and author uploading dataset to repository."""
    url = (
        f'{self.repository.host}/{self.repository.name}/{self.name}/'
        f'db/{self.version}/db-{self.version}.zip'
    )
    path = audfactory.path(url)
    stat = path.stat()
    return f'{stat.ctime:%Y-%m-%d} by {stat.created_by}'
```

The error comes from `path.stat()`, as `path` is `None` in our case.
One other thing which I think requires opening another issue: the methods `create_datacard_page()` and `create_datasets_page()` both require a `datasets` folder to exist in order to create the rst file. However, the `datasets` folder is only created when calling the `player` property (from what I could find), and that property is called in the later stages of `create_datacard_page()`. So the method will fail because the folder does not exist before reaching the code that creates it. Or maybe I am missing something?
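A sketch of the kind of fix this would need (hypothetical helper and paths, not the actual audbcards code): create the output folder up front with `os.makedirs(..., exist_ok=True)` instead of relying on a later property to create it as a side effect.

```python
import os
import tempfile


def ensure_output_dir(build_dir, datasets_dir='datasets'):
    """Hypothetical helper: make sure <build_dir>/<datasets_dir> exists."""
    out_dir = os.path.join(build_dir, datasets_dir)
    os.makedirs(out_dir, exist_ok=True)  # no error if it already exists
    return out_dir


build = tempfile.mkdtemp()
out = ensure_output_dir(build)
```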
> Another matter regards `publication`: the property is called and used to generate the rst file in `create_datacard_page()`. [...] The error comes from `path.stat()`, as `path` is `None` in our case.
Yes, as stated in #4 and the description of this issue, just skip the tests for `publication` and `repository_link`, as they do not support the file-system backend.
This would then also mean: do not include `create_datacard_page()` in the tests for now.
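A minimal sketch of how such skips could look in `tests/test_dataset.py` (the test names and skip reasons are my assumptions):

```python
import pytest


# Skip the properties that the file-system backend does not support yet.
@pytest.mark.skip(reason='publication not supported by file-system backend')
def test_publication():
    pass


@pytest.mark.skip(reason='repository_link not supported by file-system backend')
def test_repository_link():
    pass
```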
> One other thing which I think requires opening another issue: the methods `create_datacard_page()` and `create_datasets_page()` both require a `datasets` folder to exist in order to create the rst file [...] Or maybe I am missing something?
No, please open an issue.
> When first executing the tests, I noticed that `audb.dependencies()` does not always return the dependencies in the same order [...] This is fixed by specifying a `cache_root` argument when calling `audb.dependencies()`.
I created audeering/audb#298.
When running the tests here it will anyway be a good idea to use a temp folder provided by `pytest` for the `audb` cache.
Compare https://github.com/audeering/audb/blob/1e4436efd13b4bfa3ac0b4dd187519c4616e54f0/tests/conftest.py#L22-L110
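A rough, self-contained sketch of the idea (plain `tempfile` instead of a real pytest fixture; `audb` is deliberately not imported so the sketch stands alone):

```python
import os
import tempfile


def temporary_audb_cache():
    """Sketch: a throwaway folder to use as audb cache during tests.

    In a real conftest.py this would be a pytest fixture (e.g. based on
    tmp_path_factory) that assigns the folder to audb.config.CACHE_ROOT,
    as the linked audb conftest.py does.
    """
    return tempfile.mkdtemp(prefix='audb-cache-')


cache_root = temporary_audb_cache()
# audb.config.CACHE_ROOT = cache_root  # what the real fixture would set
```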
> When `Dataset.player` is called, the waveplot `db.png` is created under the project's root and not under build or cache. Is that the expected behavior? (See screenshot below.)

That's also a bug. Please open an issue and exclude `player` in the tests for now.
> One other thing which I think requires opening another issue: the methods `create_datacard_page()` and `create_datasets_page()` both require a `datasets` folder to exist in order to create the rst file [...]
>
> No, please open an issue.

See #15
> When `Dataset.player` is called, the waveplot `db.png` is created under the project's root and not under build or cache [...]
>
> That's also a bug. Please open an issue and exclude `player` in the tests for now.

See #16
> When first executing the tests, I noticed that `audb.dependencies()` does not always return the dependencies in the same order [...]
>
> I created audeering/audb#298. When running the tests here it will anyway be a good idea to use a temp folder provided by `pytest` for the `audb` cache. Compare https://github.com/audeering/audb/blob/1e4436efd13b4bfa3ac0b4dd187519c4616e54f0/tests/conftest.py#L22-L110
Right, but the thing is that the cache is set explicitly in the `Dataset` class itself, see screenshot below.
Shall we change that to the default audb cache instead?
Another thing I found is that the current statement fails on Python 3.9 but passes on Python 3.8: `dataset.example`, which uses `audb.dependencies()`, returns the dependencies in a different order. Although I'm not sure if this time the shift in order is due to the Python version or is random, as stated in audeering/audb#298.
```python
# example
durations = [d.total_seconds() for d in db.files_duration(db.files)]
median_duration = np.median([d for d in durations if 0.5 < d < 300])
expected_example_index = min(
    range(len(durations)),
    key=lambda n: abs(durations[n] - median_duration)
)
expected_example = '/'.join(
    db.files[expected_example_index].split('/')[-2:]
)
```

```
>       assert dataset.example == expected_example
E       AssertionError: assert 'data/f1.wav' == 'data/f0.wav'
E         - data/f0.wav
E         ?       ^
E         + data/f1.wav
E         ?       ^
```
I think the problem is not `audb.dependencies()`, but maybe the code that tries to find the shortest file or to differentiate the files by duration?
> Shall we change that to the default audb cache instead?

Good question.
At least we should make the cache folder used by `Dataset` configurable, e.g. by adding it as an argument to `__init__`.
In another step we might then decide if we want to store the `audb` cache there as well, or maybe better use the `audb` cache that the user has configured for `audb`.
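A minimal sketch of what that `__init__` argument could look like (the class below is a stand-in, not the real `audbcards.Dataset`, and the default cache path is invented):

```python
import os


# Stand-in class to sketch the proposed API change;
# the default cache location '~/audbcards-cache' is an assumption.
class Dataset:
    def __init__(self, name, version, cache_root='~/audbcards-cache'):
        self.name = name
        self.version = version
        self.cache_root = os.path.expanduser(cache_root)


dataset = Dataset('db', '1.0.0', cache_root='/tmp/audbcards-test-cache')
```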
No, we should continue to use a separate cache.
> I think the problem is not `audb.dependencies()`, but maybe the code that tries to find the shortest file or to differentiate the files by duration?
The ordering bug appears on the right member of the assertion (see below). This is the case when `cache_root` is not specified in `expected_deps = audb.dependencies()` and the cache folder already exists (the dataset has been loaded to the cache once already):

```python
expected_deps = audb.dependencies(
    pytest.NAME,
    version=pytest.VERSION,
    # cache_root=CACHE,
)
pd.testing.assert_frame_equal(dataset.deps(), expected_deps())
```
returns the error

```
E   DataFrame.index values are different (40.0 %)
E   [left]:  Index(['db.files.csv', 'db.segments.csv', 'db.speaker.csv', 'data/f0.wav',
E          'data/f1.wav'],
E         dtype='object')
E   [right]: Index(['db.files.csv', 'db.segments.csv', 'db.speaker.csv', 'data/f1.wav',
E          'data/f0.wav'],
E         dtype='object')
E   At positional index 3, first diff: data/f0.wav != data/f1.wav

pandas/_libs/testing.pyx:172: AssertionError
```
But in theory this shouldn't happen in the CI as we're starting from an empty cache each time.
It's still strange, because the files are stored in a CSV file on the server and have a fixed order there.
> It's still strange, because the files are stored in a CSV file on the server and have a fixed order there.

I'll look more into it. In parallel I'll open a new MR to add the `cache_root` argument to `Dataset`.
I've conducted the following test steps:

- Delete the cache
- Run `Dataset.example`, which returned `f1.wav`
- Delete the cache again
- Re-run `Dataset.example`, which this time returned `f0.wav`

To me it looks like it's actually coming from `audb.dependencies()` after all, as the example file index always results in 0, but `deps.media` is what changes.
Also, if relevant: under the created cache I cannot find `db.files.csv` or `db.segments.csv` (see below).
I created a minimal example to see if it is `audb.dependencies()` or the selection of the example media based on the duration.
In order to make the example work, you also need to remove line 43 (at ce589ac) from `audbcards/core/dataset.py`:
```python
import tempfile

import numpy as np

import audb
import audbcards
import audeer
import audformat
import audiofile


with tempfile.TemporaryDirectory() as tmp:

    name = 'db'
    version = '1.0.0'
    host = audeer.mkdir(audeer.path(tmp, 'host'))
    repo = audeer.mkdir(audeer.path(host, 'repo'))
    db_root = audeer.mkdir(audeer.path(tmp, name))
    cache_root = audeer.mkdir(audeer.path(tmp, 'cache'))

    sampling_rate = 8000
    duration = 1
    signal = np.zeros((1, duration * sampling_rate))
    audiofile.write(audeer.path(db_root, 'f0.wav'), signal, sampling_rate)
    audiofile.write(audeer.path(db_root, 'f1.wav'), signal, sampling_rate)

    db = audformat.Database(name)
    index = audformat.filewise_index(['f0.wav', 'f1.wav'])
    db['table'] = audformat.Table(index)
    db['table']['column'] = audformat.Column()
    db['table']['column'].set(['label0', 'label1'])
    db.save(db_root)

    repository = audb.Repository(
        name='repo',
        host=host,
        backend='file-system',
    )
    audb.config.REPOSITORIES = [repository]
    audb.config.CACHE_ROOT = cache_root
    audb.publish(db_root, version, repository)

    deps = audb.dependencies(name, version=version)
    dataset = audbcards.Dataset(name, version)

    print('Dependencies:')
    print(dataset.deps())
    print('')
    print('Corresponding CSV file:')
    with open(audeer.path(db_root, 'db.csv'), 'r') as fp:
        print(fp.read())

    min_dur = 0.5
    max_dur = 300  # 5 min
    durations = [dataset.deps.duration(file) for file in dataset.deps.media]
    selected_duration = np.median(
        [d for d in durations if d >= min_dur and d <= max_dur]
    )
    index = min(
        range(len(durations)),
        key=lambda n: abs(durations[n] - selected_duration),
    )
    print(f'Selected index: {index}')
    print(f'Selected media: {dataset.deps.media[index]}')
```
If you repeat this several times you get:

```
Dependencies:
archive bit_depth channels checksum duration format removed sampling_rate type version
db.table.csv table 0 0 dee60fcda0ef3f99f99587b5073c835b 0.0 csv 0 0 0 1.0.0
f1.wav 0610c45c-d031-532c-7423-3f47d66428b6 16 1 8cc2ed04be3808f22bc866cb7dc33c1e 1.0 wav 0 8000 1 1.0.0
f0.wav 98a3ae95-9141-a79f-b668-188943854370 16 1 8cc2ed04be3808f22bc866cb7dc33c1e 1.0 wav 0 8000 1 1.0.0

Corresponding CSV file:
,archive,bit_depth,channels,checksum,duration,format,removed,sampling_rate,type,version
db.table.csv,table,0,0,dee60fcda0ef3f99f99587b5073c835b,0.0,csv,0,0,0,1.0.0
f1.wav,0610c45c-d031-532c-7423-3f47d66428b6,16,1,8cc2ed04be3808f22bc866cb7dc33c1e,1.0,wav,0,8000,1,1.0.0
f0.wav,98a3ae95-9141-a79f-b668-188943854370,16,1,8cc2ed04be3808f22bc866cb7dc33c1e,1.0,wav,0,8000,1,1.0.0

Selected index: 0
Selected media: f1.wav
```

...

```
Dependencies:
archive bit_depth channels checksum duration format removed sampling_rate type version
db.table.csv table 0 0 dee60fcda0ef3f99f99587b5073c835b 0.0 csv 0 0 0 1.0.0
f0.wav 98a3ae95-9141-a79f-b668-188943854370 16 1 8cc2ed04be3808f22bc866cb7dc33c1e 1.0 wav 0 8000 1 1.0.0
f1.wav 0610c45c-d031-532c-7423-3f47d66428b6 16 1 8cc2ed04be3808f22bc866cb7dc33c1e 1.0 wav 0 8000 1 1.0.0

Corresponding CSV file:
,archive,bit_depth,channels,checksum,duration,format,removed,sampling_rate,type,version
db.table.csv,table,0,0,dee60fcda0ef3f99f99587b5073c835b,0.0,csv,0,0,0,1.0.0
f0.wav,98a3ae95-9141-a79f-b668-188943854370,16,1,8cc2ed04be3808f22bc866cb7dc33c1e,1.0,wav,0,8000,1,1.0.0
f1.wav,0610c45c-d031-532c-7423-3f47d66428b6,16,1,8cc2ed04be3808f22bc866cb7dc33c1e,1.0,wav,0,8000,1,1.0.0

Selected index: 0
Selected media: f0.wav
```
So the problem is indeed `audb.dependencies()`. But as the CSV file which is loaded from the backend already differs, my guess is that the problem happens already during publication of the database.
So this needs to be fixed in `audb` to vanish. What we can do here in the tests is simply create `f0.wav` and `f1.wav` with different durations, to enforce that always the same file is picked.
> So this needs to be fixed in audb to vanish. What we can do here in the tests is simply create f0.wav and f1.wav with different durations, to enforce that always the same file is picked.

If we have two files we would still get either `0` or `1` as index, and the files will keep being shuffled, so the returned file would not be the same anyway, no? I've used 3 files instead and now the pipeline passes.
> If we have two files we would still get either 0 or 1 as index

The index is calculated based on the file with the smallest duration. If you have only one file with a short duration (instead of all files having the same length), the index will adjust automatically to the actual position of the file in the dependency table. So you will always get back the same file.
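The selection logic from the minimal example above can be checked in isolation: with distinct durations, the file nearest the median duration is picked regardless of row order (a small self-contained demo, not audbcards code; file names and durations are invented):

```python
import numpy as np


# Demo of the selection logic from the minimal example: pick the file
# whose duration is nearest the median of all durations in [min_dur, max_dur].
def select_example(durations, min_dur=0.5, max_dur=300):
    selected = np.median([d for d in durations if min_dur <= d <= max_dur])
    return min(
        range(len(durations)),
        key=lambda n: abs(durations[n] - selected),
    )


# The same three files (distinct durations) in two different row orders,
# mimicking the shuffled dependency table.
durations_by_file = {'f0.wav': 1.0, 'f1.wav': 2.0, 'f2.wav': 3.0}
order_1 = ['f0.wav', 'f1.wav', 'f2.wav']
order_2 = ['f2.wav', 'f0.wav', 'f1.wav']
pick_1 = order_1[select_example([durations_by_file[f] for f in order_1])]
pick_2 = order_2[select_example([durations_by_file[f] for f in order_2])]
assert pick_1 == pick_2 == 'f1.wav'  # same file either way
```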
What is missing here? Can this not be closed?
There might still be a few tests missing, but we will find that out later.
Otherwise, everything discussed here is finished and fixed.