syang1993/gst-tacotron

Style Token Layer implementation question

acetylSv opened this issue · 20 comments

Thanks a lot for this nice implementation. In Section 3.2.2 of the original paper (https://arxiv.org/pdf/1803.09017.pdf), the authors mention that:

we found that applying a tanh activation to GSTs before applying attention led to greater token diversity.

So I am a little confused by this line in your implementation:

style_embeddings = tf.nn.tanh(style_attention.multi_head_attention()) # [N, 1, 256]

Should we first apply the tanh activation to the GST embeddings and then compute the multi-head attention weights, instead of first computing the weighted sum of the GSTs and then applying tanh?

@acetylSv

Thanks for your comment. To be honest, I'm not sure whether this implementation matches the details of the original paper, since the paper didn't talk much about it.

As for the tanh, applying it either before or after the attention step can compress the style embedding into the same scale as the encoder states. But as the paper suggests, it is probably better to apply it before the style attention. To match the paper, you can add the tanh operation to the line below.

tf.tile(tf.expand_dims(gst_tokens, axis=0), [batch_size,1,1]), # [N, hp.num_gst, 256/hp.num_heads]

I will also compare and change it. Thanks.
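
For reference, the two orderings discussed above can be compared with a minimal pure-Python sketch. It uses a single attention head and fixed attention weights; the helper names (`tanh_vec`, `weighted_sum`) and the toy numbers are illustrative assumptions, not code from this repo.

```python
import math

def tanh_vec(v):
    """Apply tanh element-wise to a list of floats."""
    return [math.tanh(x) for x in v]

def weighted_sum(weights, tokens):
    """Combine token vectors using attention weights (one weight per token)."""
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]

# Toy values (hypothetical): 2 style tokens of dimension 2, fixed attention weights.
gst_tokens = [[3.0, -3.0], [-1.0, 0.5]]
attn_weights = [0.5, 0.5]

# Ordering in the line quoted in the question: weighted sum first, tanh afterwards.
tanh_after = tanh_vec(weighted_sum(attn_weights, gst_tokens))

# Ordering described in the paper: tanh on the tokens first, then the weighted sum.
tanh_before = weighted_sum(attn_weights, [tanh_vec(t) for t in gst_tokens])
```

Both results stay inside (-1, 1), but they are not the same vector. In a real model the attention weights would also change, because the keys would be computed from the tanh-squashed tokens; the sketch holds the weights fixed only to isolate the effect of the ordering.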

@syang1993 great work! I'm having a go at this myself in pytorch - just to clarify one thing - the query to the style attention is the hidden state from the reference encoder and both the keys and values are the style tokens right?

@fatchord Yes, the query of the multi-head attention comes from the reference encoder. The values are the style tokens, and, as in other attention methods, a transform layer is applied to the tokens to get the keys.
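
The roles described above (query from the reference encoder, keys from a transform of the tokens, values being the tokens themselves) can be sketched for a single head as follows. This is a simplified illustration under stated assumptions: scaled dot-product scoring, a plain matrix `key_proj` as the transform layer, and made-up toy values. The function names are hypothetical and do not come from this repo.

```python
import math

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def style_attention(query, tokens, key_proj):
    """Single-head attention: query attends over keys derived from the tokens.

    query:    stand-in for the reference-encoder output, length d_k
    tokens:   list of style token vectors, used directly as values
    key_proj: (d_k x d_token) matrix mapping each token to its key
    """
    keys = [matvec(key_proj, tok) for tok in tokens]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(tokens[0])
    embedding = [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
    return weights, embedding

# Toy inputs (hypothetical values, chosen only for illustration).
tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
key_proj = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # maps 3-dim tokens to 2-dim keys
query = [2.0, 0.0]

weights, style_embedding = style_attention(query, tokens, key_proj)
```

Here the query aligns with the key of the first token, so that token receives the largest weight, and the style embedding is the corresponding weighted sum of the tokens.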

@syang1993 Thanks! It makes perfect sense now.

@syang1993 sorry to be bothering you again. I'm just curious if you'd like to swap notes?

I've implemented the GST model in PyTorch and the reference audio works quite well. The only problem is using the style tokens themselves without reference audio - I am not getting good results. Have you tried using just the style tokens? Any luck with it?

I've tried content attention too and again, reference audio works great but the style tokens aren't working by themselves.

@fatchord Sure, we can communicate and work together! In my earlier experiments, I tested the GST model without reference audio. To do this, I used random weights for the style tokens. Sometimes it generated good audio, but sometimes it did not.
I haven't tested the new code since I'm not at school this month, but I guess it may suffer from the same problem.
I'm also confused by this problem; I don't know whether it's caused by the data size or by an implementation error. What do you think? Besides, it would be very helpful if you could share your PyTorch repo. : ) I only started using PyTorch last month, so I think I can learn a lot from you.
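
As a rough illustration of conditioning on style tokens without reference audio, the style embedding can be built directly from hand-picked or random weights. The sketch below is a simplified single-head view with made-up token values; `style_from_weights` and `random_weights` are hypothetical helpers, not functions from this repo.

```python
import random

def style_from_weights(weights, tokens):
    """Build a style embedding as a weighted sum of style tokens."""
    if len(weights) != len(tokens):
        raise ValueError("need exactly one weight per token")
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]

def random_weights(num_tokens, rng):
    """Draw random positive weights that sum to 1."""
    raw = [rng.random() + 1e-9 for _ in range(num_tokens)]
    total = sum(raw)
    return [r / total for r in raw]

# Stand-in for trained tokens (hypothetical values; a real model would load them).
tokens = [[0.9, -0.2], [-0.5, 0.7], [0.1, 0.1]]

# Option 1: pick a single token with a one-hot weight vector.
one_hot_style = style_from_weights([0.0, 1.0, 0.0], tokens)

# Option 2: random weights, similar in spirit to the experiments described above.
rng = random.Random(0)
weights = random_weights(len(tokens), rng)
random_style = style_from_weights(weights, tokens)
```

The resulting vector would replace the attention output that the reference encoder normally produces. Whether a given weight vector yields good audio depends on how the model was trained, which is exactly the open question here.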

@syang1993 My initial thought was that perhaps the problem is the multi-head attention, since it increases the effective number of tokens by num_heads (or is my intuition wrong here?). So when I picked a single style token - it would have multiple 'heads' in it - in other words, the attention mechanism would likely never pick all heads of a single style token at any one time. And, same as you, I only had luck with random tokens.
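
To make this intuition concrete, here is a deliberately simplified sketch in which each head makes a hard choice of one token and contributes only its own slice of that token. Real multi-head attention uses soft weights and learned projections, so this is only an assumption-laden toy; `split_heads` and `combine` are hypothetical names.

```python
def split_heads(token, num_heads):
    """Split one token vector into num_heads equal chunks."""
    if len(token) % num_heads != 0:
        raise ValueError("token length must be divisible by num_heads")
    size = len(token) // num_heads
    return [token[i * size:(i + 1) * size] for i in range(num_heads)]

def combine(tokens, choice_per_head):
    """Concatenate, for each head, that head's chunk of the token it picked."""
    num_heads = len(choice_per_head)
    out = []
    for head, token_index in enumerate(choice_per_head):
        out.extend(split_heads(tokens[token_index], num_heads)[head])
    return out

# Two toy tokens of dimension 4, two heads (hypothetical values).
tokens = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]

same_token = combine(tokens, [0, 0])  # both heads pick token 0
mixed = combine(tokens, [0, 1])       # head 0 picks token 0, head 1 picks token 1
```

In this toy setting the output equals a whole token only when every head picks the same one. If training mostly produces mixed choices across heads, a single whole token at inference may be an input the decoder has rarely seen.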

So then I tried content attention and I just realised a couple hours ago that I made a silly mistake in the attention - after training for 700k steps! So I have to retrain it again. Anyway, I'm thinking the advantage of the content attention is that it will allow for straightforward selection of a single style token.

I'm in two minds about creating a repo for it. I really need to polish the code - and there could be more silly mistakes in there so I'll have to recheck every single line again.

@fatchord Thanks for your thoughts. I'm also not so sure about the multi-head attention since the paper didn't go into the details. I will do more experiments to verify this after I go back to school.
If you get any new results or conclusions, could you share them with me?

Yeah, sometimes a small mistake can have a big impact. My earlier repo also had mistakes, so its performance wasn't as good as the current one. I'm looking forward to your new repo when you finish it.

So I trained the content based attention for the style tokens and again, I get the same problem. Not sure how to move forward on this problem.

@syang1993 The only thing I can think of - perhaps the softmax attention is sharpened? That would force the model to choose mainly one style token at any time - thus making it more 'natural' to condition the decoder on a single style token at inference. What do you think?
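
One common way to sharpen a softmax is a temperature below 1. The sketch below is only an assumption about what "sharpened" could mean here (the paper is not quoted as doing this), with made-up scores. The `entropy` helper gives a simple number for checking how peaked a set of token weights is.

```python
import math

def softmax(scores, temperature=1.0):
    """Softmax with a temperature; values below 1.0 sharpen the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; lower means the weights are more peaked."""
    return -sum(p * math.log(p) for p in probs if p > 0)

scores = [1.0, 0.5, 0.2, 0.1]  # hypothetical attention scores for 4 tokens
plain = softmax(scores)
sharp = softmax(scores, temperature=0.1)
```

With the lower temperature, almost all of the weight lands on the highest-scoring token. Logging the entropy of the attention weights during training or inference would show whether a model already behaves this way.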

@fatchord Yeah, I will check the weights of each token to see what happens. If one token always gets a large weight, this may be the problem you mentioned.

But in the paper, they said they found that each token had a specific meaning, such as speaking speed. I didn't find this either, and I'm not sure whether that's because of the limited data. I'll get more data soon, then I can check it.

@fatchord Is your pytorch code available online?

@fazlekarim not at the minute. It'll probably be online sometime next week or so.

@syang1993 Not sure if you're aware of it - but new paper from the tacotron crew: https://arxiv.org/pdf/1808.01410.pdf

@fatchord That's so sad! When you are about to come up with something new, they just come up with something even better.

@fazlekarim That's the way it goes I guess! On the upside, the additional ideas introduced in the paper should be fairly straightforward to implement.

@fatchord Thanks for the reminder! I just took a look at it yesterday. As you said, it's easy to implement since they only added an extra module to predict the style embedding from text.

@syang1993 One other thing - I had a look at the Blizzard2013 dataset, and it looks like they stripped out all the quotation marks. I think this could be a problem because the woman narrating changes her voice style dramatically when the text is in quotes. Without them, the model should find it more difficult to model prosody I think.

@syang1993 is there an update to this repo with the TP-GST feature?

@fatchord It would be very helpful if you could share your PyTorch repo. : )