realzza/espnet

Feb 10th Meeting

realzza opened this issue · 0 comments

Todos

  • Check: does VAD change speech data in data prep? (P1)

    No. The VAD step only computes the VAD information and stores it in the dumpdir, in the file vad.scp. It marks the non-speech segments so that they can be excluded from training. The excluded segments could affect our reconstruction loss, but excluding them can improve the quality of the synthesized audio. It is a tradeoff we need to be aware of.

  • Keep VITS with xvector and VAD training

  • #3

    • No, the decoded wav sample rate is still 22050 Hz. Trying the following steps:
      • check the training process
      • check how tts_inference.py uses the sample rate.
    • Inference jobs have not been eligible for submission since Feb 13th, so we couldn't decode to check whether the output meets the requirement.
    • Applied retrained model. Speaker information is integrated! /ocean/projects/cis210027p/zzhou5/espnet/egs2/librispeech_100/tts_vits/exp/16k_xvector/tts_beta_lib100_vits_tts_all16k_char_xvector/decode_with_trained_16k_vocoder
  • If #3 does not work, consult Jiatong (P2)

  • Run inference w/o trained vocoder

  • Integrate VITS model in cyclic systems (P3)
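
For the sample-rate check under #3, a minimal sketch of how the decoded wavs could be verified once inference jobs can run again. It reads only the WAV header with Python's standard `wave` module, so it works only for PCM WAV files. The function names and the 16000 Hz target are assumptions for illustration (the target is inferred from the "16k" experiment name); none of this is part of ESPnet.

```python
import wave


def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def find_mismatched(paths, expected_hz=16000):
    """Return (path, rate) pairs whose sample rate differs from expected_hz.

    expected_hz=16000 is an assumed target, not a value taken from the recipe.
    """
    mismatched = []
    for path in paths:
        rate = wav_sample_rate(path)
        if rate != expected_hz:
            mismatched.append((path, rate))
    return mismatched
```

Calling `find_mismatched` on the list of wav files in the decode directory would return an empty list if every file is at the expected rate, and otherwise list each offending file with the rate found in its header.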