Distance Embedding

Question

kizunasunhy opened this issue 2 years ago · 4 comments

Could you please kindly explain why the distance embedding should be like this?

array([[19, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13],
       [ 1, 19, 10, 11, 11, 12, 12, 12, 12, 13, 13, 13],
       [ 2,  1, 19, 10, 11, 11, 12, 12, 12, 12, 13, 13],
       [ 2,  2,  1, 19, 10, 11, 11, 12, 12, 12, 12, 13],
       [ 3,  2,  2,  1, 19, 10, 11, 11, 12, 12, 12, 12],
       [ 3,  3,  2,  2,  1, 19, 10, 11, 11, 12, 12, 12],
       [ 3,  3,  3,  2,  2,  1, 19, 10, 11, 11, 12, 12],
       [ 3,  3,  3,  3,  2,  2,  1, 19, 10, 11, 11, 12],
       [ 4,  3,  3,  3,  3,  2,  2,  1, 19, 10, 11, 11],
       [ 4,  4,  3,  3,  3,  3,  2,  2,  1, 19, 10, 11],
       [ 4,  4,  4,  3,  3,  3,  3,  2,  2,  1, 19, 10],
       [ 4,  4,  4,  4,  3,  3,  3,  3,  2,  2,  1, 19]])

Thank you.

Answer 1 · 2022-11-09T13:15:01.000Z

This is a distance index matrix, not the embedding itself. Each index is used to look up the corresponding distance embedding.
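To make the lookup step concrete, here is a minimal NumPy sketch. The table size (20 rows), the embedding width (4), and the names `table`, `index_matrix`, and `dist_emb` are assumptions chosen for illustration, not taken from the repository; in a real model the table would be a learned embedding layer rather than random numbers.

```python
import numpy as np

# Hypothetical embedding table: one row per distance index.
# 20 rows and width 4 are illustrative assumptions.
rng = np.random.default_rng(0)
num_indices, emb_dim = 20, 4
table = rng.normal(size=(num_indices, emb_dim))

# Top-left 3x3 corner of the index matrix from the question.
index_matrix = np.array([
    [19, 10, 11],
    [ 1, 19, 10],
    [ 2,  1, 19],
])

# Integer-array indexing replaces each index with its table row,
# giving one embedding vector per token pair.
dist_emb = table[index_matrix]
print(dist_emb.shape)
```

Every cell holding the same index (for example, all the 19s on the diagonal) receives the same vector.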

Answer 2 · 2022-11-09T13:25:15.000Z

Oh yes, sorry for the mistake. But why is it organized by powers of 2, and why is the number on the diagonal 19?

Answer 3 · 2022-11-09T13:48:05.000Z

The distance indices are bucketed by powers of 2 to avoid data sparsity: token pairs that are far apart occur with low frequency, so long distances share a bucket. Index 0 is reserved for padding, so I use 19 for distance 0 instead.
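As a rough illustration, the bucketing can be sketched in NumPy as follows. The function name `distance_index_matrix` and its parameters are hypothetical, and the cap of nine buckets per direction (plus the offset of 9 for the other direction) is an assumption inferred from the matrix in the question, not a copy of the repository code.

```python
import numpy as np

def distance_index_matrix(length, num_buckets=9, offset=9, zero_index=19):
    """Sketch: map signed token distances to power-of-2 bucket indices.

    All parameter values are assumptions inferred from the printed matrix.
    """
    pos = np.arange(length)
    dist = pos[:, None] - pos[None, :]        # dist[i, j] = i - j
    # Bucket edges 1, 2, 4, ..., 2**(num_buckets - 1).
    edges = 2 ** np.arange(num_buckets)
    # |d| = 1 -> 1, 2..3 -> 2, 4..7 -> 3, 8..15 -> 4, ...; |d| = 0 -> 0.
    bucket = np.searchsorted(edges, np.abs(dist), side="right")
    # Shift one direction (j after i) so the two directions stay distinct.
    idx = np.where(dist < 0, bucket + offset, bucket)
    # Index 0 is reserved for padding, so distance 0 gets its own index.
    idx[dist == 0] = zero_index
    return idx

m = distance_index_matrix(12)
print(m)
```

Under these assumptions, indices 1 to 9 cover one direction, 10 to 18 the other, and 19 is the next free index for distance 0, which leaves 0 available for padding. Calling `distance_index_matrix(12)` yields the same 12x12 matrix as in the question.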

Answer 4 · 2022-11-09T15:09:51.000Z

Thank you!