dbolya/tomesd

bidirectional version?

Xynonners opened this issue · 4 comments

Hello there,
I just came across another paper which proposed bidirectional bipartite matching.

https://arxiv.org/pdf/2403.10030 (near the end of page 4)

"""
Still, as shown in Step 2 of Figure 3, the number of target tokens Xβ cannot be reduced. To handle this issue, MCTF performs bidirectional bipartite soft matching by conducting the matching in the opposite direction with the updated token sets X̃α and X̃β, as in Steps 3 and 4 of Figure 3.
"""

Would it be possible to apply this to tomesd as well?
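
For reference, here is a minimal sketch of what I understand the bidirectional step to be, layered on top of ToMe-style bipartite soft matching. The function names and merge details are my own simplification, not code from either paper:

```python
import torch

def soft_match(src: torch.Tensor, dst: torch.Tensor, r: int):
    """Merge the r most similar `src` tokens into `dst` tokens (one ToMe-style pass)."""
    with torch.no_grad():
        s = src / src.norm(dim=-1, keepdim=True)       # cosine similarity
        d = dst / dst.norm(dim=-1, keepdim=True)
        sim, match = (s @ d.T).max(dim=-1)             # best dst for each src
        order = sim.argsort(descending=True)
        merged, kept = order[:r], order[r:]
    out = dst.clone()
    out.index_add_(0, match[merged], src[merged])      # sum merged srcs into their dst
    counts = torch.ones(dst.size(0), device=dst.device)
    counts.index_add_(0, match[merged], torch.ones(r, device=dst.device))
    return src[kept], out / counts.unsqueeze(-1)       # mean over dst + merged srcs

def bidirectional_soft_matching(x: torch.Tensor, r1: int, r2: int) -> torch.Tensor:
    src, dst = x[0::2], x[1::2]              # ToMe's alternating 50-50 split
    src, dst = soft_match(src, dst, r1)      # forward pass: shrink src
    dst, src = soft_match(dst, src, r2)      # reversed pass: now dst can shrink too
    return torch.cat([src, dst], dim=0)      # N - r1 - r2 tokens

x = torch.randn(64, 320)
print(bidirectional_soft_matching(x, r1=16, r2=8).shape)  # 64 -> 40 tokens
```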

Thanks for the suggestion!

I gave the paper a read, and yeah it looks like it would be usable. The only issue is it likely wouldn't help much (if at all). Even in their own paper, they gain only +0.1% accuracy from using it (Table 4). That's probably not even worth the extra time taken from doing the additional matching step (even with reusing the similarity scores as they suggest).

Also, tomesd specifically has fewer tokens in the "Xβ" set than the original ToMe: by default, only 25% of tokens. So I'd imagine the accuracy increase would be strictly worse on tomesd than ToMe (which split the tokens 50-50).
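
To illustrate the difference in splits, here's a toy version (the deterministic top-left selection below mirrors tomesd's no-rand path; the real code randomizes which token in each 2x2 block becomes the destination):

```python
import torch

def tome_split(n: int):
    """Original ToMe: alternate tokens, so dst gets 50% of the set."""
    idx = torch.arange(n)
    return idx[0::2], idx[1::2]                # (src, dst)

def tomesd_split(h: int, w: int):
    """tomesd-style: one dst token per 2x2 block, so dst is only 25%."""
    idx = torch.arange(h * w).view(h, w)
    dst = idx[0::2, 0::2].flatten()            # top-left of every 2x2 block
    mask = torch.ones(h * w, dtype=torch.bool)
    mask[dst] = False
    return idx.flatten()[mask], dst            # (src: 75%, dst: 25%)

src, dst = tomesd_split(8, 8)
print(len(src), len(dst))                      # 48 16
```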

Thanks for the quick response!

I see. I was reading the code and wondering what it meant by "only the top left corner."

If the accuracy gains are so low, then you're right that it's probably not worth it.

I was also reading another paper (CrossGET) that uses a full similarity matrix (though I believe they do say the algorithm is "approximate"): https://arxiv.org/abs/2305.17455. It does some other things too, like token stacking, but I think the O(nD) complexity probably also wouldn't be worth it.

Yeah, I even have a full similarity matrix matching algorithm in the original ToMe paper, and it only gets +0.11% over the bipartite matching scheme, which is why I didn't even consider it. That kind of thing could be useful in some situations, but with the "destination" or "Xβ" set being as small as it is in tomesd, I think most of these methods are unlikely to help enough for their speed hit to be worth it.
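
If anyone wants to see why the full-matrix version is slower, here's a rough sketch of greedy full-matrix merging (my simplification, not the exact algorithm from the ToMe paper). The key differences are the O(N^2) similarity matrix over all pairs and the sequential greedy loop, versus a single parallel argmax for bipartite matching:

```python
import torch

def greedy_full_matching(x: torch.Tensor, r: int) -> torch.Tensor:
    """Greedily merge the r globally most similar token pairs.
    x: [N, C]; returns [N - r, C]. A fuller version would also
    refresh the merged row's similarities after each step."""
    n = x.size(0)
    xn = x / x.norm(dim=-1, keepdim=True)
    sim = xn @ xn.T                            # full N x N similarity
    sim.fill_diagonal_(-float("inf"))
    out = x.clone()
    weight = torch.ones(n)                     # how many tokens each slot holds
    alive = torch.ones(n, dtype=torch.bool)
    for _ in range(r):                         # sequential: the speed problem
        i, j = divmod(sim.argmax().item(), n)  # most similar remaining pair
        out[j] = (weight[i] * out[i] + weight[j] * out[j]) / (weight[i] + weight[j])
        weight[j] += weight[i]
        alive[i] = False
        sim[i, :] = -float("inf")              # retire token i entirely
        sim[:, i] = -float("inf")
    return out[alive]
```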

What might help more is improving the similarity metric (e.g., by adding an importance weight). You'd have to be careful when computing it though, since computing the similarity matrix is the most expensive part of ToMe. I notice a lot of these papers that include other metrics don't report speed but instead only FLOPs. Suspicious.
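
As a concrete (purely hypothetical) example of what I mean: the importance weight can just rescale the rows of the similarity matrix you've already computed, so the only extra cost is an O(N*C) signal. The feature-norm proxy below is a placeholder; you'd want something better in practice:

```python
import torch

def importance_weighted_match(x: torch.Tensor, r: int):
    """Sketch: bias the matching so 'important' src tokens resist merging.
    Only rescales existing similarity rows, so there is no second big matmul."""
    src, dst = x[0::2], x[1::2]
    s = src / src.norm(dim=-1, keepdim=True)
    d = dst / dst.norm(dim=-1, keepdim=True)
    sim = s @ d.T                                  # the expensive part (unchanged)
    imp = src.norm(dim=-1)                         # placeholder importance signal
    imp = (imp - imp.min()) / (imp.max() - imp.min() + 1e-6)
    sim = sim * (1.0 - imp)[:, None]               # important src -> low merge score
    score, match = sim.max(dim=-1)
    merged = score.argsort(descending=True)[:r]    # indices of src tokens to merge
    return merged, match[merged]                   # and their dst targets
```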

Thanks for the responses! Closing now.