Speed up feature extraction
frankenjoe opened this issue · 6 comments
When extracting features with Feature
we currently rely on Process
under the hood, which returns a pd.Series
with feature vectors. We then convert these to a list and afterwards call pd.concat(list)
to combine them into a single matrix. The last step can take quite long (sometimes as long as or longer than the feature extraction itself). We could speed this up by pre-allocating a matrix beforehand and assigning the values directly. At least when not processing with a sliding window this should be possible.
To demonstrate that there's quite some room for improvement:
```python
import time

import numpy as np
import pandas as pd

import audb
import audinterface
import audiofile


db = audb.load(
    'emodb',
    version='1.3.0',
    format='wav',
    sampling_rate=16000,
    mixdown=True,
)
files = db.files


def process_func(x, sr):
    return [x.mean(), x.std()]


# slow: let Feature collect per-file results and concatenate them
feature = audinterface.Feature(
    ['mean', 'std'],
    process_func=process_func,
)
t = time.time()
df = feature.process_files(files)
print(time.time() - t)

# fast: pre-allocate the result matrix and assign values directly
t = time.time()
data = np.empty(
    (len(files), 2),
    dtype=np.float32,
)
for idx, file in enumerate(files):
    signal, sampling_rate = audiofile.read(file)
    data[idx, :] = process_func(
        signal,
        sampling_rate,
    )
df_fast = pd.DataFrame(
    data,
    index=df.index,
    columns=df.columns,
)
print(time.time() - t)

pd.testing.assert_frame_equal(df, df_fast)
```
```
5.972992181777954
0.17418813705444336
```
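The gap is not specific to audio I/O. A synthetic comparison (all names below are just for illustration, not audinterface internals) shows that pd.concat over many small objects is the dominating cost compared to filling a pre-allocated array:

```python
import time

import numpy as np
import pandas as pd

num_rows, num_cols = 5000, 2

# slow path: build one small DataFrame per item, then concatenate
t = time.time()
pieces = [
    pd.DataFrame(np.zeros((1, num_cols), dtype=np.float32))
    for _ in range(num_rows)
]
df_slow = pd.concat(pieces, ignore_index=True)
slow = time.time() - t

# fast path: pre-allocate once and assign row by row
t = time.time()
data = np.empty((num_rows, num_cols), dtype=np.float32)
for idx in range(num_rows):
    data[idx, :] = 0.0
df_fast = pd.DataFrame(data)
fast = time.time() - t

print(f'concat: {slow:.3f}s, pre-allocated: {fast:.3f}s')
```

Both paths produce the same frame; only the construction strategy differs.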
> We then convert these to a list

I guess the idea for a solution is to avoid this step?
Yes, especially the concatenation of the DataFrames seems awfully slow. So the idea would be to create a matrix of the expected size (samples x features) and directly assign the extracted features. This is of course only possible if no sliding window is selected, as otherwise we cannot know the shape of the final matrix.
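Even in the sliding-window case, where the total number of frames is unknown up front, a possible middle ground (my own sketch, not an audinterface API) would be to collect plain numpy arrays per file and call np.concatenate once at the end, instead of running pd.concat over many small DataFrames:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
num_features = 2

# with a sliding window each file yields a different number of frames
frames_per_file = [3, 5, 2]
chunks = [
    rng.standard_normal((n, num_features)).astype(np.float32)
    for n in frames_per_file
]

# a single concatenation of raw arrays, then one DataFrame at the end
data = np.concatenate(chunks, axis=0)
df = pd.DataFrame(data, columns=['mean', 'std'])
print(df.shape)
```

Whether this is fast enough compared to true pre-allocation would need benchmarking, but it avoids the per-frame pandas overhead.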
I guess not; the comparison is also not 100% fair, as in the second case we rely on the index created by Feature
. What is still missing is a speed-up of Segment
. So we either expand this issue or create a new one.