Paper to code mapping
hoagy-davis-digges opened this issue · 6 comments
Hi, thanks for publishing the code for this paper.
I’m trying to understand how the lines 514 to 597 map on to the update rule you laid out in Figure (7) in the paper. Is there any further explanation you could offer as to how the variables match up? Particularly what variables f and h represent and how the cosine calculations are being made.
Hi,
Thanks for the question. I have added more comments to the source code file jose.c, especially on the optimization part. I hope those comments help. Please feel free to let me know if anything still remains unclear to you!
Best,
Yu
Thanks, that's really helpful, I just had two things that I'm still struggling to understand.
Firstly, I'm still not clear on where the (I - xxT) component of the update rule is. Secondly, why do you multiply h by 2 to calculate the step for the negative word update?
- I will take the following line for updating positive center word embeddings as an example, and the other implementations are similar.
Spherical-Text-Embedding/src/jose.c
Lines 518 to 519 in 0d79061
When updating the embedding for a positive center wordu
,syn0[c + l1]
refers to the positive center word embeddingu
;syn1neg[c + l3]
refers to the context word embeddingv
;f = cos(v, u) = v * u
. According to Equation 3 (the objective) in the paper, the negative Euclidean gradient ofu
is equal tov + d
(if the margin is not satisfied; otherwise it's 0). In the above code, we updateu
according to word context only, i.e., ignoring the document contextd
, so we only havev
as the negative Euclidean gradient. After multiplying it (v
) with the linear projection term(I - u u^T)
, we obtain its negative Riemannian gradientv - u u^T v = v - u * f
, which is implemented in the above code. Next, we updateu
according to document context only, i.e., ignoring the word contextv
, so we only haved
as the negative Euclidean gradient. Similarly, we have the following implementation:
Spherical-Text-Embedding/src/jose.c
Lines 573 to 574 in 0d79061
- As mentioned in the paper, the "step" value (which is a multiplier to the calculated gradient) is computed differently for positive samples and negative samples. For positive ones, we use cosine distance (i.e.,
1 - cosine similarity
) between the current embedding direction and the gradient direction as the multiplier, whose range is[0, 2]
. For negative ones, we use negative cosine similarity as the multiplier, whose range is[-1, 1]
. To make the updating scale (i.e., the maximum magnitude of the "step" value) the same for both positive and negative samples, we multiply 2 to the negative cosine similarity so that its range becomes[-2, 2]
.
Then you might wonder why the positive samples have an asymmetrical multiplier range ([0, 2]
) while the negative ones have a symmetrical multiplier range ([-2, 2]
). This is because the positive center word embeddings should always move closer to the context word embedding, while the negative sample embeddings do not always move farther away from the context word embedding. Recall that the negative center words are randomly sampled from the vocabulary, and it is very unlikely that it has exactly the same or exactly the opposite meaning with the real center word. Therefore, when the negative samples' embeddings are too close to the context word embedding, the model applies a big positive step to push it away from the positive center word; when the negative samples' embeddings are too far away from the context word embedding, the model applies a big negative step to drag it closer to the center word. In whichever case for negative samples, the maximum magnitude of the step value should match that of the positive one, so that the positive and negative samples are updated at the same pace. That's the reason why we multiply 2 to the negative step value.
I hope my explanations help. Please feel free to let me know if you have any further questions!
Best,
Yu
Thank you, that has completely cleared up the second point. I should have been clearer with my first question, I can see the reasoning behind including that in the update rule, but I can't see where in the code it has been implemented.
Thanks for the clarification. I have edited my comments above. I hope it answers your question. Please let me know there is still anything unclear!
Ah, I see, was having trouble spotting it because it had already been multiplied out into a different form. That all makes sense now, thanks so much!