Abstract
Samples
Audio samples are taken from the VCTK data set [1].
A. Traditional voice conversion
Traditional many-to-many voice conversion is performed between different speakers, all of whom are seen during training. Some samples are presented in the table below.
<tbody>
<tr>
<th scope="row">M2M</th>
<td>
<audio controls="" >
<source src="resources/audio/M2M_source.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
<td>
<audio controls="" >
<source src="resources/audio/M2M_target.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
<td>
<audio controls="" >
<source src="resources/audio/M2M_nvcneto.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
<td>
<audio controls="" >
<source src="resources/audio/M2M_nvcnet.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
</tr>
<tr>
<th scope="row">M2F</th>
<td>
<audio controls="" >
<source src="resources/audio/M2F_source.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
<td>
<audio controls="" >
<source src="resources/audio/M2F_target.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
<td>
<audio controls="" >
<source src="resources/audio/M2F_nvcneto.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
<td>
<audio controls="" >
<source src="resources/audio/M2F_nvcnet.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
</tr>
<tr>
<th scope="row">F2M</th>
<td>
<audio controls="" >
<source src="resources/audio/F2M_source.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
<td>
<audio controls="" >
<source src="resources/audio/F2M_target.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
<td>
<audio controls="" >
<source src="resources/audio/F2M_nvcneto.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
<td>
<audio controls="" >
<source src="resources/audio/F2M_nvcnet.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
</tr>
<tr>
<th scope="row">F2F</th>
<td>
<audio controls="" >
<source src="resources/audio/F2F_source.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
<td>
<audio controls="" >
<source src="resources/audio/F2F_target.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
<td>
<audio controls="" >
<source src="resources/audio/F2F_nvcneto.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
<td>
<audio controls="" >
<source src="resources/audio/F2F_nvcnet.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
</tr>
</tbody>
Source | Target | NVC-Net† | NVC-Net |
---|---|---|---|
M2M: Male to male; M2F: Male to female; F2M: Female to male; F2F: Female to female
B. Zero-shot voice conversion
Zero-shot many-to-many voice conversion is performed when the source speaker, the target speaker, or both are unseen during training. Some samples are presented in the table below.
<tbody>
<tr>
<th scope="row">S2U</th>
<td>
<audio controls="" >
<source src="resources/audio/S2U_source.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
<td>
<audio controls="" >
<source src="resources/audio/S2U_target.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
<td>
<audio controls="" >
<source src="resources/audio/S2U_nvcnet.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
</tr>
<tr>
<th scope="row">U2S</th>
<td>
<audio controls="" >
<source src="resources/audio/U2S_source.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
<td>
<audio controls="" >
<source src="resources/audio/U2S_target.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
<td>
<audio controls="" >
<source src="resources/audio/U2S_nvcnet.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
</tr>
<tr>
<th scope="row">U2U</th>
<td>
<audio controls="" >
<source src="resources/audio/U2U_source.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
<td>
<audio controls="" >
<source src="resources/audio/U2U_target.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
<td>
<audio controls="" >
<source src="resources/audio/U2U_nvcnet.wav" type="audio/wav">
Your browser does not support the audio element.
</audio>
</td>
</tr>
</tbody>
Source | Target | NVC-Net |
---|---|---|
S2U: Seen to unseen; U2S: Unseen to seen; U2U: Unseen to unseen
C. Diversity
NVC-Net can synthesize diverse samples by changing the latent representation of the speaker embedding. For a given reference utterance, the speaker network produces a Gaussian distribution. This allows us to sample multiple speaker embeddings.
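The sampling step described above can be illustrated with a minimal sketch. It assumes the speaker network returns the mean and log-variance of a diagonal Gaussian for the reference utterance; the function name `sample_speaker_embeddings`, the log-variance parameterization, and the placeholder statistics are all hypothetical and are not taken from the NVC-Net implementation.

```python
import math
import random


def sample_speaker_embeddings(mean, log_var, num_samples, seed=0):
    """Draw embeddings z = mean + std * eps, with eps ~ N(0, I).

    mean, log_var: per-dimension statistics of a diagonal Gaussian
    (assumed here to be the output of the speaker network).
    """
    rng = random.Random(seed)
    std = [math.exp(0.5 * lv) for lv in log_var]
    return [
        [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
        for _ in range(num_samples)
    ]


# Placeholder statistics standing in for a real speaker-network output.
mean = [0.0, 0.0, 0.0, 0.0]
log_var = [-2.0, -2.0, -2.0, -2.0]

embeddings = sample_speaker_embeddings(mean, log_var, num_samples=3)
print(len(embeddings), len(embeddings[0]))  # 3 4
```

Under these assumptions, each sampled embedding would be passed to the generator together with the same source content, giving a different rendering of the target voice for each sample.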
Source | Target |
---|---|
Samples produced by NVC-Net | ||
---|---|---|
D. Additional studies
Below are samples comparing the outputs of NVC-Net wo (without normalization on the content code) and NVC-Net w (with normalization on the content code).
Source | Target | NVC-Net wo | NVC-Net w |
---|---|---|---|
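As an illustration of what normalizing the content code could look like, the sketch below rescales each time step of a content code to unit L2 norm across channels. This page does not state which normalization is used, so the choice of L2 normalization, the function name `normalize_content_code`, and the list-of-frames layout are assumptions made for this example only.

```python
import math


def normalize_content_code(code, eps=1e-12):
    """Scale each time step's channel vector to unit L2 norm.

    code: list of time steps, each a list of channel values.
    eps guards against division by zero for all-zero frames.
    """
    normalized = []
    for frame in code:
        norm = math.sqrt(sum(v * v for v in frame))
        normalized.append([v / max(norm, eps) for v in frame])
    return normalized


# Toy content code with two time steps and two channels.
code = [[3.0, 4.0], [0.0, 2.0]]
unit_code = normalize_content_code(code)
print(unit_code)  # [[0.6, 0.8], [0.0, 1.0]]
```

With this kind of normalization the magnitude of the content code is fixed, so only its direction carries information to the generator.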