lucidrains/x-transformers

qk_norm and kv_heads conflict

Closed this issue · 12 comments

Hi @lucidrains, hope all is good.

I believe there is a conflict when using both qk_norm and kv_heads != heads, which should be solved by changing:

self.qk_norm_k_scale = nn.Parameter(torch.ones(heads, 1, dim_head))

to:
self.qk_norm_k_scale = nn.Parameter(torch.ones(kv_heads, 1, dim_head))

Thanks!
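For context, a minimal sketch (illustrative only, not the library's actual forward pass) of why the shape matters: with kv_heads != heads the keys only carry kv_heads heads, so a per-head scale built with heads cannot broadcast against them, while one built with kv_heads can.

import torch
import torch.nn as nn
import torch.nn.functional as F

batch, heads, kv_heads, seq, dim_head = 2, 8, 2, 32, 64

# with kv_heads != heads (grouped key/value heads), keys only carry kv_heads heads
k = torch.randn(batch, kv_heads, seq, dim_head)

# a scale built with `heads` cannot broadcast against the kv_heads dimension:
# F.normalize(k, dim = -1) * torch.ones(heads, 1, dim_head)   # -> shape mismatch

# a scale built with `kv_heads` broadcasts cleanly, matching the proposed fix
qk_norm_k_scale = nn.Parameter(torch.ones(kv_heads, 1, dim_head))
k_normed = F.normalize(k, dim = -1) * qk_norm_k_scale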

thanks Alejandro!

@alexdemartos have you seen e2-tts btw?

Yes, this was an interesting work. I tried a very similar approach. In my case it works generally OK, but I'm finding some stability issues with long sequences (basically the model doesn't seem to extrapolate well to longer sequences than those seen during training). Explicit duration prediction is robust to long sequences at the expense of minor naturalness degradation.

What was your experience?

@alexdemartos indeed, although i've also heard it improves with training on said longer sequences

a student has open sourced quite a good model based on e2-tts. think we will see a lot of follow up research along these lines soon

quite a surprising paper!

Thanks for sharing, I'll take a closer look tomorrow.

I'm very much interested in works trying to address stability/hallucination issues in both AR and E2TTS-like architectures. Most people/companies often omit details or try to hide them...

FYI, on this topic I just found out about a new Tacotron-series paper (Very Attentive Tacotron):
https://arxiv.org/pdf/2410.22179

@alexdemartos nice! so some position engineering and an RNN do the trick

@alexdemartos have you ever tried adding relative positions to the cross attention block? i have a few ideas

Only for synchronous sequences, so not really. I guess the tricky part here is how to solve the alignment problem between target and cross-attended sequences (i.e. how to center the relative position in the cross-attention sequence at each target timestep). Happy to hear about your ideas!

@alexdemartos haha don't get too excited, it's just a simple idea for generating attention bias with monotonicity

@alexdemartos i'll throw it into a separate branch and ping you when i do
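(Purely as an illustration of what a monotonic cross-attention bias could look like, and not necessarily the idea referred to above: one simple option is to penalize the attention logits by their distance from a linearly interpolated diagonal between target and source positions, which also gives one answer to the centering question by placing the expected source position at tgt_index * src_len / tgt_len.)

import torch

def monotonic_alignment_bias(tgt_len, src_len, strength = 1.):
    # expected (diagonal) source position for each target timestep,
    # assuming a roughly linear alignment between the two sequences
    tgt_pos = torch.arange(tgt_len).float()
    expected_src = tgt_pos * (src_len - 1) / max(tgt_len - 1, 1)

    src_pos = torch.arange(src_len).float()

    # (tgt_len, src_len) bias, to be added to the cross-attention logits before softmax
    return -strength * (src_pos[None, :] - expected_src[:, None]).abs()

bias = monotonic_alignment_bias(tgt_len = 100, src_len = 25)
# attn_logits = attn_logits + bias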

Hi @lucidrains. (FYI) Again, on the same topic, I found the following work relevant; it proposes Position-Aware Cross-Attention (Section 2.1): https://arxiv.org/pdf/2406.04467