syang1993/gst-tacotron

Problem with training as a Tacotron1 script

dazenhom opened this issue · 4 comments

Thanks for your great work, but I found that if I set the hyperparameter use_gst=False and run, the behavior seemed different from my understanding of Tacotron1. The relevant part of tacotron.py is shown here:

      if reference_mel is not None:
        # Reference encoder
        refnet_outputs = reference_encoder(
          reference_mel, 
          filters=hp.reference_filters, 
          kernel_size=(3,3),
          strides=(2,2),
          encoder_cell=GRUCell(hp.reference_depth),
          is_training=is_training)                                                 # [N, 128]
        self.refnet_outputs = refnet_outputs                                       

        if hp.use_gst:
          # Style attention
          style_attention = MultiheadAttention(
            tf.expand_dims(refnet_outputs, axis=1),                                   # [N, 1, 128]
            tf.tanh(tf.tile(tf.expand_dims(gst_tokens, axis=0), [batch_size,1,1])),            # [N, hp.num_gst, 256/hp.num_heads]   
            num_heads=hp.num_heads,
            num_units=hp.style_att_dim,
            attention_type=hp.style_att_type)

          style_embeddings = style_attention.multi_head_attention()                   # [N, 1, 256]
        else:
          style_embeddings = tf.expand_dims(refnet_outputs, axis=1)                   # [N, 1, 128]
      else:
        print("Use random weight for GST.")
        random_weights = tf.random_uniform([hp.num_heads, hp.num_gst], maxval=1.0, dtype=tf.float32)
        random_weights = tf.nn.softmax(random_weights, name="random_weights")
        style_embeddings = tf.matmul(random_weights, tf.nn.tanh(gst_tokens))
        style_embeddings = tf.reshape(style_embeddings, [1, 1] + [hp.num_heads * gst_tokens.get_shape().as_list()[1]])

The original Tacotron1 shouldn't be trained with the reference encoder part, right?
However, your code passes the non-GST data through a reference_encoder, which seems strange.
Maybe we could swap the two if conditions to fix this:

      if hp.use_gst:
        ***
      if reference_mel is not None:
        ***
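For concreteness, one reading of that suggestion (just a control-flow sketch reusing the names from the snippet above; the elided bodies would stay the same as in the current code, and how a non-GST model would consume an empty style embedding is left open):

      if hp.use_gst:
        if reference_mel is not None:
          refnet_outputs = reference_encoder(...)   # run the reference encoder only in GST mode
          style_embeddings = ...                    # style attention over gst_tokens, as now
        else:
          style_embeddings = ...                    # random token weights, as in the current else-branch
      else:
        style_embeddings = None                     # plain Tacotron1: no style conditioning at all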

THANKS

@dazenhom Hi, thanks for your notes. In this repo, use_gst=False doesn't mean the Tacotron1 model. Google also has another paper, which uses a reference encoder to do style and multi-speaker synthesis. You can find it at https://arxiv.org/abs/1803.09047.
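(For readers unfamiliar with that paper: below is a minimal sketch of the reference-encoder structure it describes, a stack of strided 3x3 convolutions followed by a GRU and a tanh projection. The layer sizes are the paper's defaults, not necessarily this repo's hyperparameters, and this is an illustration rather than a copy of the repo's implementation. It assumes the mel dimension of reference_mel is statically known.)

      import tensorflow as tf
      from tensorflow.contrib.rnn import GRUCell

      def reference_encoder_sketch(reference_mel, is_training):
        # reference_mel: [N, T, n_mels]; add a channel axis for the 2-D convolutions.
        outputs = tf.expand_dims(reference_mel, axis=-1)                    # [N, T, n_mels, 1]
        for channels in [32, 32, 64, 64, 128, 128]:
          outputs = tf.layers.conv2d(outputs, filters=channels, kernel_size=(3, 3),
                                     strides=(2, 2), padding='same')
          outputs = tf.layers.batch_normalization(outputs, training=is_training)
          outputs = tf.nn.relu(outputs)
        # Collapse the downsampled frequency and channel axes so a GRU can run over time.
        static_shape = outputs.get_shape().as_list()                        # [N, T', F', C]
        dynamic_shape = tf.shape(outputs)
        outputs = tf.reshape(
          outputs,
          [dynamic_shape[0], dynamic_shape[1], static_shape[2] * static_shape[3]])
        # The final GRU state, projected with tanh, is the fixed-size reference embedding.
        _, state = tf.nn.dynamic_rnn(GRUCell(128), outputs, dtype=tf.float32)
        return tf.layers.dense(state, 128, activation=tf.nn.tanh)           # [N, 128]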

@syang1993 Thanks for your reply, I mistakenly took your work for Tacotron1. I shall find another implementation of Tacotron1 to run my test. Thanks anyway.

I have tried use_gst=False, but it seems to behave the same as Tacotron1? Although refnet_outputs changes, the generated audio hardly changes with different reference audios.

@hyzhan In my experience, it may be because of your data. If you use some expressive speakers as your training data and do the inference, the speech can be different (changed with the reference audio). Otherwise, it can show little difference, as you mentioned.
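(A quick way to check whether the conditioning signal itself is collapsing: compare the refnet_outputs embeddings for two different reference audios. This is only a sketch, assuming you have dumped the two 128-dim embeddings to .npy files; the file names here are hypothetical.)

      import numpy as np

      # Hypothetical dumps of refnet_outputs ([N, 128]) for two different reference audios.
      a = np.load("refnet_ref1.npy").flatten()
      b = np.load("refnet_ref2.npy").flatten()

      # A cosine similarity close to 1.0 means the reference encoder maps both audios to
      # nearly the same embedding, so the decoder gets almost no style signal to vary on.
      cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
      print("cosine similarity between reference embeddings:", cos)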