How was the test data prepared?
How can I make p225_001.wav (taken from the VCTK Corpus and resampled to 16000 Hz) match the p225_001.wav test data, so that the output length computed at Line 137 in b67354a is 90?
To elaborate my question, have a look at the following code:
import os
import pickle

import numpy as np
from soundfile import read

# Project helpers used below: extract_f0, get_world_params, average_f0s,
# get_monotonic_wav, get_spenv
from utils import *


def get_spk_meta(spk_wav_path, spk_meta_path):
    # "data/test/p225_001.wav" -> spk = "p225_001", spk_key = "p225"
    spk = os.path.splitext(os.path.basename(spk_wav_path))[0]
    with open(spk_meta_path, "rb") as f:
        spk_meta = pickle.load(f)
    spk_key = spk.split("_")[0]
    spk_id_gender = spk_meta.get(spk_key)
    if spk_id_gender is None:
        raise Exception(f"{spk_key} speaker not found")
    return spk_id_gender + (spk_key, spk)


def main():
    spk_meta_path = 'spk_meta.pkl'
    files = ['data/test/p225_001.wav', 'data/test/p258_001.wav']
    for source_path in files:
        spk_id, spk_gender, spk_key, spk = get_spk_meta(source_path, spk_meta_path)
        wav, _ = read(source_path)
        fs = 16000
        # F0 search range in Hz depends on the speaker's gender
        if spk_gender == 'M':
            lo, hi = 50, 250
        else:
            lo, hi = 100, 600
        # Append one sample when the length is an exact multiple of 256
        if wav.shape[0] % 256 == 0:
            wav = np.concatenate((wav, np.array([1e-06])), axis=0)
        _, f0_norm = extract_f0(wav, fs, lo, hi)
        f0, sp, ap = get_world_params(wav, fs)
        f0 = average_f0s([f0])[0]
        wav_mono = get_monotonic_wav(wav, f0, sp, ap, fs)
        fea = get_spenv(wav_mono)
        print(f"len(fea)={len(fea)}, source_path={source_path}")


if __name__ == '__main__':
    main()
The code above prints:
len(fea)=90, source_path=data/vctk/test/p225_001.wav
len(fea)=60, source_path=data/vctk/test/p258_001.wav
When I use p225_001.wav taken from the VCTK Corpus and resampled to 16000 Hz, I get len(fea)=129.
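For a sense of scale, the two frame counts can be converted to approximate durations. This is only a rough sketch: the hop of 256 samples per frame is an assumption inferred from the `% 256` check in the script and is not confirmed anywhere in this thread, and `frames_to_seconds` is a hypothetical helper name.

```python
HOP_SAMPLES = 256    # ASSUMPTION: hop size guessed from the `% 256` check, not confirmed
SAMPLE_RATE = 16000  # sampling rate stated in the post


def frames_to_seconds(n_frames, hop=HOP_SAMPLES, fs=SAMPLE_RATE):
    """Approximate duration covered by n_frames feature frames."""
    return n_frames * hop / fs


test_len = frames_to_seconds(90)   # provided test file
raw_len = frames_to_seconds(129)   # resampled VCTK original
print(f"test: {test_len:.3f} s, raw: {raw_len:.3f} s, diff: {raw_len - test_len:.3f} s")
```

Under that assumption, the gap of 39 frames corresponds to roughly 0.6 s of audio that is present in the raw file but missing from the test file.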
How was the test data prepared?
FYI: the recipe I used to convert p225_001.wav from the VCTK Corpus to a 16000 Hz sampling rate is:
ffmpeg -y -i "$f" -ar 16000 "${dst_dir}/$(basename "$f")"
We removed silence from the test data, so there might be a length mismatch.
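The thread does not say how the silence was removed, so the following is only one possible approach, not the authors' method: a minimal energy-based trim of leading and trailing silence. The function name `trim_silence`, the 256-sample frame length, and the -40 dB threshold are all assumptions chosen for illustration.

```python
import numpy as np


def trim_silence(wav, frame_len=256, threshold_db=-40.0):
    """Drop leading/trailing frames whose RMS is below threshold_db
    relative to the loudest frame. ASSUMPTION: not the authors' recipe."""
    n_frames = len(wav) // frame_len
    if n_frames == 0:
        return wav
    # Split into whole frames; a trailing partial frame is discarded.
    frames = wav[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    peak = rms.max()
    if peak == 0:
        return wav  # all-zero signal, nothing to trim against
    db = 20 * np.log10(np.maximum(rms, 1e-10) / peak)
    voiced = np.flatnonzero(db > threshold_db)
    if voiced.size == 0:
        return wav  # only possible with a non-negative threshold
    start = voiced[0] * frame_len
    end = (voiced[-1] + 1) * frame_len
    return wav[start:end]


# Synthetic check: 0.5 s silence + 1 s tone + 0.5 s silence at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
silence = np.zeros(fs // 2)
wav = np.concatenate([silence, tone, silence])
trimmed = trim_silence(wav)
print(len(wav) // 256, "->", len(trimmed) // 256, "frames")
```

Running something like this on the resampled VCTK file before feature extraction should shorten `len(fea)`, but it will not necessarily reproduce exactly 90 frames, since the threshold and method used for the released test data are unknown.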
In demo.ipynb there is a function pad_fea(), defined as
def pad_fea(fea):
    return np.pad(fea, ((0,T-len(fea)), (0,0)), 'constant')
where
T = 192 # maximum number of frames in the output mel-spectrogram
How was the value of T determined?
For demo.ipynb, defining T as
T = max([config.max_len_pad, len(src_spk_fea), len(tgt_spk_fea)])
seems to work for any input source and target files, so I don't have to worry about preparing test data with len(fea)<=192.
I'm not sure this is the correct way to do it, and I'd highly appreciate your opinions.
When we trained the model, we truncated the data to guarantee that each utterance has a length of less than 192 frames. Feel free to modify the parameter to fit your needs.
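As an alternative to enlarging T, inputs can be kept within the training-time limit by truncating before padding. The sketch below is an illustration, not code from the repository: `fit_to_length` is a hypothetical helper, and the feature dimension of 80 in the demo is made up. It also sidesteps the fact that `np.pad` raises an error for negative pad widths, which is what happens in `pad_fea()` when `len(fea) > T`.

```python
import numpy as np

T = 192  # frame limit quoted in the thread


def fit_to_length(fea, max_len=T):
    """Truncate fea to at most max_len frames, then zero-pad to exactly max_len.
    Returns an array of shape (max_len, n_dims)."""
    fea = fea[:max_len]
    return np.pad(fea, ((0, max_len - len(fea)), (0, 0)), "constant")


# Demo with a hypothetical feature dimension of 80.
short = np.ones((90, 80))
long = np.ones((250, 80))
print(fit_to_length(short).shape, fit_to_length(long).shape)
```

Truncation discards the tail of long utterances, so whether this or a larger T is preferable depends on whether the model behaves well on lengths it did not see in training, which this thread does not settle.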
Thanks