Index Rounding? Error since audinterface 1.0.0
schruefer opened this issue · 14 comments
When running an interface on a MultiIndex, the timestamps are sometimes rounded.
For example, instead of the initial index "0 days 0 days 00:00:01.877812", audinterface returns a dataframe with the index "0 days 00:00:01.877812500".
This behavior has only occurred since version 1.0.0; the previous version 0.10.2 works fine.
import os

import audb
import audinterface

media = [
    'wav/03a01Fa.wav',
    'wav/03a01Nc.wav',
    'wav/16b10Wb.wav',
    'wav/03a01Wa.wav',
]
db = audb.load(
    'emodb',
    version='1.3.0',
    media=media,
    verbose=False,
)
files = list(db.files)
folder = os.path.dirname(files[0])
# Segmented index with levels (file, start, end)
df = db['emotion'].get(as_segmented=True, allow_nat=False)
print(df)

def features(signal, sampling_rate):
    return [signal.mean(), signal.std()]

interface = audinterface.Feature(
    ['mean', 'std'],
    process_func=features,
)
# Process the segments given by the index
df = interface.process_index(df.index)
print(df)
Outputs (for audinterface==1.0.0 and 1.0.1):
emotion emotion.confidence
file start end
/data/audb/emodb/1.3.0/d3b62a9b/wav/03a01Fa.wav 0 days 0 days 00:00:01.898250 happiness 0.90
/data/audb/emodb/1.3.0/d3b62a9b/wav/03a01Nc.wav 0 days 0 days 00:00:01.611250 neutral 1.00
/data/audb/emodb/1.3.0/d3b62a9b/wav/03a01Wa.wav 0 days 0 days 00:00:01.877812 anger 0.95
/data/audb/emodb/1.3.0/d3b62a9b/wav/16b10Wb.wav 0 days 0 days 00:00:02.522499 anger 1.00
mean std
file start end
/data/audb/emodb/1.3.0/d3b62a9b/wav/03a01Fa.wav 0 days 0 days 00:00:01.898250 -0.000311 0.082317
/data/audb/emodb/1.3.0/d3b62a9b/wav/03a01Nc.wav 0 days 0 days 00:00:01.611250 -0.000312 0.125304
/data/audb/emodb/1.3.0/d3b62a9b/wav/03a01Wa.wav 0 days 0 days 00:00:01.877812500 -0.000296 0.127394
/data/audb/emodb/1.3.0/d3b62a9b/wav/16b10Wb.wav 0 days 0 days 00:00:02.522499999 -0.000464 0.095558
Outputs (for audinterface==0.10.2):
emotion emotion.confidence
file start end
/data/audb/emodb/1.3.0/d3b62a9b/wav/03a01Fa.wav 0 days 0 days 00:00:01.898250 happiness 0.90
/data/audb/emodb/1.3.0/d3b62a9b/wav/03a01Nc.wav 0 days 0 days 00:00:01.611250 neutral 1.00
/data/audb/emodb/1.3.0/d3b62a9b/wav/03a01Wa.wav 0 days 0 days 00:00:01.877812 anger 0.95
/data/audb/emodb/1.3.0/d3b62a9b/wav/16b10Wb.wav 0 days 0 days 00:00:02.522499 anger 1.00
mean std
file start end
/data/audb/emodb/1.3.0/d3b62a9b/wav/03a01Fa.wav 0 days 0 days 00:00:01.898250 -0.000311 0.082317
/data/audb/emodb/1.3.0/d3b62a9b/wav/03a01Nc.wav 0 days 0 days 00:00:01.611250 -0.000312 0.125304
/data/audb/emodb/1.3.0/d3b62a9b/wav/03a01Wa.wav 0 days 0 days 00:00:01.877812 -0.000296 0.127394
/data/audb/emodb/1.3.0/d3b62a9b/wav/16b10Wb.wav 0 days 0 days 00:00:02.522499 -0.000464 0.095558
Python 3.8 all packages:
audb 1.4.2
audbackend 0.3.18
audeer 1.19.0
audfactory 1.0.12
audformat 0.16.1
audinterface 1.0.1
audiofile 1.2.1
audmath 1.2.1
audobject 0.7.9
audresample 1.2.1
certifi 2022.12.7
cffi 1.15.1
charset-normalizer 3.1.0
dohq-artifactory 0.8.4
filelock 3.10.7
idna 3.4
importlib-metadata 6.1.0
iso-639 0.4.5
iso3166 2.1.1
numpy 1.24.2
oyaml 1.0
pandas 2.0.0
pip 20.0.2
pkg-resources 0.0.0
pycparser 2.21
PyJWT 2.6.0
python-dateutil 2.8.2
pytz 2023.3
PyYAML 6.0
requests 2.28.2
setuptools 44.0.0
six 1.16.0
soundfile 0.12.1
tqdm 4.65.0
tzdata 2023.3
urllib3 1.26.15
zipp 3.15.0
Thanks for reporting, we will try to find out what's going on. As a temporary fix you can use preserve_index=True:
...
df = interface.process_index(df.index, preserve_index=True)
print(df)
file start end
/media/jwagner/Data/audb/emodb/1.3.0/d3b62a9b/... 0 days 0 days 00:00:01.898250 -0.000311 0.082317
0 days 00:00:01.611250 -0.000312 0.125304
0 days 00:00:01.877812 -0.000296 0.127394
0 days 00:00:02.522499 -0.000464 0.095558
Ok, it's actually an interesting issue. The reason we see a difference between the versions is that before 1.0.0 we kept the end time from the index, and now we overwrite it with the duration we calculate from the number of samples that are processed. Theoretically, these values should of course match. Maybe it's because we use sloppy=True when we calculate the duration in audb, or it's some rounding issue when the duration is stored to CSV as part of the dependency table. In any case, the behavior is not nice and we should make sure that we keep the end value from the index.
Or maybe not :)
One advantage of the current implementation is that it returns the correct time if end is out-of-bounds, e.g.:
file = '/media/jwagner/Data/audb/emodb/1.3.0/d3b62a9b/wav/16b10Wb.wav'
interface.process_file(file, end='999999s')
With pre 1.0.0 it returns:
mean std
file start end
/media/jwagner/Data/audb/emodb/1.3.0/d3b62a9b/... 0 days 11 days 13:46:39 -0.000464 0.095558
But with 1.0.0:
mean std
file start end
/media/jwagner/Data/audb/emodb/1.3.0/d3b62a9b/... 0 days 0 days 00:00:02.522499999 -0.000464 0.095558
So I would argue we should keep the new behavior and encourage the user to use preserve_index=True if the index must not change.
@hagenw opinion?
I also think that the current behavior makes sense.
But as an intermediate step we should try to find out at which place exactly we are getting rounding errors. Maybe there is a way to avoid those.
> But as an intermediate step we should try to find out at which place exactly we are getting rounding errors. Maybe there is a way to avoid those.
Can it be related to setting sloppy=True when we read the file duration in audb.publish()? Even if we work with WAV files?
No, sloppy is not applied to WAV files: https://github.com/audeering/audiofile/blob/0ae2de5ac552a2982417e7cfde0d9b39322ef7c4/audiofile/core/info.py#L161-L165
soundfile.info(file).duration most likely reads the duration from the header. I don't know if there is a way to create WAV files whose header duration does not match the number of samples. But different libraries might round 0.5 differently.
Ok, I think I have found the culprit:
>>> import pandas as pd
>>> dur = 2.5225
>>> pd.to_timedelta(dur, 's').total_seconds()
2.522499
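A possible explanation for the value above, sketched with the standard library only: 2.5225 has no exact binary representation, and the stored double lies slightly below it. If a conversion truncates to whole nanoseconds instead of rounding (an assumption about what happens inside the conversion, not something confirmed here), the last digit drops to 9.

```python
from decimal import Decimal

dur = 2.5225

# Exact decimal expansion of the double that Python stores for 2.5225
exact = Decimal(dur)
print(exact < Decimal("2.5225"))  # the stored value is slightly too small

# Scale to nanoseconds using decimal arithmetic
exact_ns = exact * 10**9

truncated_ns = int(exact_ns)   # int() drops the fractional part
rounded_ns = round(exact_ns)   # round() goes to the nearest integer

print(truncated_ns)
print(rounded_ns)
```

Truncation lands on 2522499999 ns, while rounding gives 2522500000 ns, which matches the difference between the two end times shown in the outputs.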
There is a workaround proposed in pandas-dev/pandas#46819:
>>> pd.to_timedelta(dur, 's') / pd.Timedelta(seconds=1)
2.522499999
I guess to achieve the exact same output we need to use less than nanosecond precision:
>>> round(pd.to_timedelta(dur, 's') / pd.Timedelta(seconds=1), ndigits=8)
2.5225
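The rounding step could be wrapped in a small helper so it can be reused wherever a timedelta is converted back to seconds. This is only a sketch: `timedelta_to_seconds` is a hypothetical name, not an existing function in audformat or audinterface, and the default of 8 digits is simply taken from the snippet above.

```python
import pandas as pd


def timedelta_to_seconds(td: pd.Timedelta, ndigits: int = 8) -> float:
    """Convert a timedelta to seconds, rounded below nanosecond precision.

    Hypothetical helper for illustration only.
    """
    # Dividing by a one-second timedelta keeps the nanosecond part,
    # unlike the total_seconds() result shown earlier
    seconds = td / pd.Timedelta(seconds=1)
    return round(seconds, ndigits=ndigits)


# Timedelta as it appears in the processed index
td = pd.Timedelta(2_522_499_999, unit="ns")
unrounded = td / pd.Timedelta(seconds=1)
rounded = timedelta_to_seconds(td)
print(unrounded, rounded)
```

Rounding to 8 digits discards the last nanosecond digit, so a value that is off by one nanosecond snaps back to 2.5225.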
Ah nice. I guess we need to apply it in several spots, though:
- https://github.com/audeering/audformat/blob/main/audformat/core/database.py#L420
- https://github.com/audeering/audinterface/blob/main/audinterface/core/process.py#L588
- https://github.com/audeering/audinterface/blob/main/audinterface/core/utils.py#L473
Possibly more...
I don't think we can handle this already when doing the pd.to_timedelta(dur, unit='s') conversion, e.g.
>>> pd.to_timedelta(round(dur, ndigits=8), 's')
Timedelta('0 days 00:00:02.522499999')
Looks like we can only do it when converting back to seconds.
Or as an alternative we could check if there is a way to avoid converting to timedelta in the first place.
> So I would argue we should keep the new behavior and encourage the user to use preserve_index=True if the index must not change.
Would it be possible to set preserve_index=True by default?
I would assume that the majority of people using process_index would like to keep the index.
We cannot easily do that, since so far we always return a segmented index by default. But with preserve_index=True it can happen that the result is a filewise index (if the input is also a filewise index).
The following workaround seems to work:
>>> pd.to_timedelta(dur * 10 ** 9, 'ns')
Timedelta('0 days 00:00:02.522500')
>>> pd.to_timedelta(dur * 10 ** 9, 'ns').total_seconds()
2.5225
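A variation of this workaround as a reusable function might look as follows. It is a sketch under assumptions: `seconds_to_timedelta` is a hypothetical name, and the explicit `round()` to an integer number of nanoseconds is an addition not present in the snippet above, included so the result does not depend on how a float nanosecond value is cast.

```python
import pandas as pd


def seconds_to_timedelta(seconds: float) -> pd.Timedelta:
    """Convert seconds to a timedelta via integer nanoseconds.

    Hypothetical helper for illustration only.
    """
    # Round to the nearest whole nanosecond before building the timedelta
    nanoseconds = round(seconds * 10**9)
    return pd.Timedelta(nanoseconds, unit="ns")


# Durations of two files from the report
for dur in [2.5225, 1.8778125]:
    td = seconds_to_timedelta(dur)
    print(td, td.total_seconds())
```

With this conversion both durations survive the round trip through a timedelta unchanged.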