biggytruck/SpeechSplit2

How was the test data prepared?

How can I preprocess p225_001.wav from the VCTK Corpus, resampled to 16000 Hz, so that it matches the p225_001.wav test data, i.e. so that the length of the output of

def get_spenv(wav, cutoff=3):

is 90?

To elaborate, consider the following code:

import os
import pickle

import numpy as np
from soundfile import read
from utils import *  # repository helpers used below (extract_f0, get_world_params, ...)


def get_spk_meta(spk_wav_path, spk_meta_path):
    spk = os.path.splitext(os.path.basename(spk_wav_path))[0]
    # Close the metadata file once it has been loaded
    with open(spk_meta_path, "rb") as f:
        spk_meta = pickle.load(f)
    spk_key = spk.split("_")[0]
    spk_id_gender = spk_meta.get(spk_key)
    if spk_id_gender is None:
        raise Exception(f"{spk_key} speaker not found")
    return spk_id_gender + (spk_key, spk)


def main():
    spk_meta_path = 'spk_meta.pkl'
    files = ['data/test/p225_001.wav', 'data/test/p258_001.wav']
    for source_path in files:
        spk_id, spk_gender, spk_key, spk = get_spk_meta(source_path, spk_meta_path)
        wav, _ = read(source_path)
        fs = 16000
        if spk_gender == 'M':
            lo, hi = 50, 250
        else:
            lo, hi = 100, 600
        if wav.shape[0] % 256 == 0:
            wav = np.concatenate((wav, np.array([1e-06])), axis=0)
        _, f0_norm = extract_f0(wav, fs, lo, hi)
        f0, sp, ap = get_world_params(wav, fs)
        f0 = average_f0s([f0])[0]
        wav_mono = get_monotonic_wav(wav, f0, sp, ap, fs)
        fea = get_spenv(wav_mono)
        print(f"len(fea)={len(fea)}, source_path={source_path}")


if __name__ == '__main__':
    main()

The code above prints:

len(fea)=90, source_path=data/vctk/test/p225_001.wav
len(fea)=60, source_path=data/vctk/test/p258_001.wav

When I use p225_001.wav taken directly from the VCTK Corpus and resampled to 16000 Hz, I get len(fea)=129.
How was the test data prepared?

FYI:
The command I used to resample p225_001.wav from the VCTK Corpus to 16000 Hz is

ffmpeg -y -i "$f" -ar 16000 "${dst_dir}/$(basename "$f")"

We removed silence from the test data, so there might be a length mismatch.
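
The thread does not say how the silence was removed. As a rough illustration only, the sketch below trims leading and trailing low-energy frames with NumPy and estimates the resulting frame count. The helper names, the 256-sample hop (suggested by the `% 256` check in the script above), and the -40 dB threshold are all assumptions, not the authors' actual preprocessing.

```python
import numpy as np


def trim_silence(wav, frame_len=256, threshold_db=-40.0):
    """Drop leading/trailing frames whose RMS is far below the loudest frame.

    Hypothetical helper: not the preprocessing used for the repository's test data.
    Samples after the last complete frame are discarded.
    """
    n_frames = len(wav) // frame_len
    if n_frames == 0:
        return wav
    frames = wav[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    peak = rms.max()
    if peak == 0:
        return wav  # all-zero signal, nothing to anchor the threshold on
    # Frames within threshold_db of the loudest frame count as speech.
    rms_db = 20.0 * np.log10(np.maximum(rms, 1e-12) / peak)
    voiced = np.flatnonzero(rms_db > threshold_db)
    start = voiced[0] * frame_len
    end = (voiced[-1] + 1) * frame_len
    return wav[start:end]


def estimate_num_frames(num_samples, hop=256):
    """Approximate frame count for a centered STFT with the assumed hop size."""
    return num_samples // hop + 1
```

Under that assumed hop, 90 frames correspond to roughly 1.4 s of audio at 16 kHz and 129 frames to roughly 2.1 s, which would be consistent with about 0.6 s of silence having been trimmed from the test file.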

In demo.ipynb there is a function pad_fea() defined as

def pad_fea(fea):
    return np.pad(fea, ((0,T-len(fea)), (0,0)), 'constant')

where

T = 192 # maximum number of frames in the output mel-spectrogram

How was the value of T determined?

In demo.ipynb, defining T as

T = max([config.max_len_pad, len(src_spk_fea), len(tgt_spk_fea)])

seems to work for any source and target input files.
That way, I don't have to worry about preparing test data with len(fea)<=192.
I'm not sure whether this is the correct way to do it.
I'd highly appreciate your opinion.
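
A minimal sketch of that idea, with `T` passed in explicitly rather than read from a global. The arrays are random placeholders, `max_len_pad = 192` stands in for `config.max_len_pad`, and the feature dimension of 80 is an assumption made only for illustration.

```python
import numpy as np


def pad_fea(fea, T):
    """Zero-pad a (frames, dims) feature array along the time axis to T frames."""
    return np.pad(fea, ((0, T - len(fea)), (0, 0)), "constant")


max_len_pad = 192  # stand-in for config.max_len_pad
src_spk_fea = np.random.rand(129, 80)  # placeholder source features
tgt_spk_fea = np.random.rand(230, 80)  # placeholder target features

# Pick T large enough for both utterances so the pad width is never negative.
T = max(max_len_pad, len(src_spk_fea), len(tgt_spk_fea))
src_padded = pad_fea(src_spk_fea, T)
tgt_padded = pad_fea(tgt_spk_fea, T)
```

With a fixed T = 192, an utterance longer than 192 frames would give a negative pad width, and np.pad raises a ValueError in that case.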

When we trained the model, we truncated the data to guarantee that each utterance is shorter than 192 frames. Feel free to modify the parameter to fit your needs.
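
The truncation described here could look like the following sketch. The helper name `crop_fea` and the choice to keep the first frames (rather than, say, a random window) are assumptions; the thread does not say how the training code selects the segment.

```python
import numpy as np


def crop_fea(fea, max_len=192):
    """Keep at most max_len frames from the start of a (frames, dims) array."""
    return fea[:max_len]


long_fea = np.zeros((250, 80))   # placeholder utterance longer than the limit
short_fea = np.zeros((90, 80))   # placeholder utterance already within the limit

cropped_long = crop_fea(long_fea)
cropped_short = crop_fea(short_fea)
```

Cropping before padding keeps the pad width in pad_fea() non-negative for a fixed T of 192.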

Thanks