lucidrains/voicebox-pytorch

suggestion for symmetric alibi implementation.

seastar105 opened this issue · 7 comments

as you mentioned, alibi is not trivial to apply to a bidirectional encoder.

there's some work from meta, data2vec 2.0, where they employed alibi with a bidirectional encoder, in particular for the audio encoder.

voicebox adopted a similar architecture, incorporating both convolutional positional encoding and alibi, and it appears they used a symmetric version of alibi. you can see the alibi bias part of the data2vec 2.0 code here:

https://github.com/facebookresearch/fairseq/blame/100cd91db19bb27277a06a25eb4154c805b10189/examples/data2vec/models/modalities/base.py#L568-L577
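for reference, here is a minimal framework-free sketch of what "symmetric alibi" usually means: the bias depends only on the absolute distance `|i - j|`, scaled by a per-head slope. this is not the data2vec 2.0 code itself. the helper names `alibi_slopes` and `symmetric_alibi_bias` are made up for illustration, and the slopes assume the geometric schedule from the original ALiBi paper with a power-of-two head count.

```python
def alibi_slopes(num_heads):
    # geometric slope schedule from the original ALiBi paper
    # (assumption: num_heads is a power of two)
    start = 2 ** (-8 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def symmetric_alibi_bias(num_heads, seq_len):
    # bias[h][i][j] = -slope_h * |i - j|
    # the same penalty applies to keys on the left and on the right of the query
    return [
        [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
        for slope in alibi_slopes(num_heads)
    ]


bias = symmetric_alibi_bias(num_heads=8, seq_len=4)
```

the bias is added to the attention logits before the softmax. with 8 heads the steepest slope is 0.5, so the first head is the most local one.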

what do you think about applying symmetric alibi?

@seastar105 hey Haesung, i've tried symmetric Alibi before, but the best results i've had are still with dynamic positional bias. that is what i would use if i had to use attention bias

since flash attention came on the scene though, it is not preferable to use attention bias. rotary embeddings should be a good fit here, given it has been proven out in a number of significant models (palm, llama)

@seastar105 actually, i spoke too soon - let me read how they did their implementation this morning. do you know if there was a paper with the necessary data to show the effectiveness of their proposal?

@seastar105 ok, just took a quick look. i would say that is not correct. how would the network differentiate between left and right given the same relative distance? my past attempt involved allowing the network to learn different slopes for left and right, but it didn't work as well as just letting the bias be completely parameterized by an MLP (like NeRF)
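to make the left/right point concrete, here is a rough sketch with a hypothetical helper `directional_alibi_bias`. the slopes are fixed floats here purely for illustration (in the attempt described above they would be learned parameters), and this is not code from this repo.

```python
def directional_alibi_bias(seq_len, left_slope, right_slope):
    # hypothetical helper: keys to the left of the query (j < i) are penalized
    # with left_slope, keys to the right (j > i) with right_slope
    bias = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            slope = left_slope if j < i else right_slope
            row.append(-slope * abs(i - j))
        bias.append(row)
    return bias


# equal slopes recover the symmetric variant: left and right look identical
symmetric = directional_alibi_bias(5, left_slope=0.5, right_slope=0.5)

# different slopes give the bias a sense of direction
asymmetric = directional_alibi_bias(5, left_slope=0.5, right_slope=0.25)
```

in the symmetric case the entries for positions 1 and 3 are identical for a query at position 2, so the bias alone carries no direction information. with separate slopes the two entries differ.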

i don't think i'm going to go with this until i see a follow up paper

actually i have no idea, and i don't know of any papers showing why symmetric alibi bias would work better than dynamic positional bias, and especially rope, in a bidirectional encoder, or why symmetric alibi would be better than the asymmetric variant. i just noticed that alibi is used for audio in data2vec 2.0, and also in voicebox.

my hypothesis was that a learned alibi bias could be a lightweight alternative to conformer, since it penalizes attention scores so that attention stays local (at least, 0.5 is quite a high penalty for attention scores). conformer performs better than vanilla transformers in many works, even for speech generation, as shown in espnet's report (https://arxiv.org/pdf/2010.13956.pdf). so, the suggestion is not backed by concrete experiments.

i really appreciate you sharing your experience with alibi bias in bidirectional transformers. i'm gonna compare conformer models trained for ASR against a fine-tuned alibi feature encoder. if i get any decent results, i'm gonna report them here.
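a small sketch of the locality effect, under the assumption that the 0.5 above is the per-position penalty (slope) and that all raw attention scores are equal, so any structure in the output comes from the bias alone. `biased_attention_row` is a made-up name for illustration.

```python
import math


def softmax(xs):
    # numerically stable softmax over a list of floats
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_row(scores, query_pos, slope):
    # add the symmetric alibi penalty -slope * |query_pos - j| before the softmax
    return softmax([s - slope * abs(query_pos - j) for j, s in enumerate(scores)])


# nine keys with identical raw scores, query sitting in the middle
weights = biased_attention_row([0.0] * 9, query_pos=4, slope=0.5)
```

each step away from the query multiplies the attention weight by `exp(-0.5)`, roughly 0.61, so the mass concentrates around the query position.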

@lucidrains Sorry this is a bit late, but I'm writing this for completeness. We updated the paper with details on Alibi bias. We did find Alibi bias to converge faster than fixed positional embeddings or no positional embeddings. It is similar to the symmetric option here: ofirpress/attention_with_linear_biases#5

Note: Flash Attention v2 also supports Alibi bias now.


@apoorv2904 sounds good, i think i will stick with rotary for this repo