Currently, only a Transformer with a pointer-generator is implemented!
It supports:
- Dynamic vocab during generation
- Beam search
- Span Positional Encoding
- Decaying Copy Rate
Since BPE almost eliminates the need for OOV copying in Machine Translation, the dev branch implements a new positional encoding and a decaying copy rate to guide the model to copy an entire span of tokens. If you still need OOV copy capability, use the master branch instead.
Each token is encoded as a single number showing how far it is from the closest space to its left. E.g.
▁F ollow ▁him ▁on ▁Twitter ▁at ▁https :// t witter . com / ge or ge na v atr ue .
would have
0 1 0 0 0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
As you can see, tokens within the same span are encoded monotonically, which helps the model recognize the whole span better.
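A minimal sketch of how this encoding could be computed for SentencePiece-style tokens (the function name and details are illustrative, not the repo's actual code):

```python
# Sketch: each token gets its distance to the closest span boundary ("▁")
# on its left; "▁" marks the start of a new span (a preceding space).
def span_positions(tokens):
    positions = []
    dist = 0
    for tok in tokens:
        if tok.startswith("▁"):  # a new span starts here
            dist = 0
        positions.append(dist)
        dist += 1
    return positions

tokens = "▁F ollow ▁him ▁on ▁Twitter ▁at ▁https :// t witter . com / ge or ge na v atr ue .".split()
print(span_positions(tokens))
# [0, 1, 0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
```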
So if the model copies ▁https at step t, then for the next step, the probability of copying that token is partially transferred to the next token within the same span, i.e. :// will receive some probability from ▁https.
$$ p'_t(token_i) = Prob_{t-1}(token_{i-1}) \cdot 0.1 + 0.9 \cdot Prob_t(token_i) \\ Prob'_t(token) = \mathrm{softmax}(p'_t(token)) $$
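A minimal sketch of this update in PyTorch; the 0.1/0.9 mixing weights follow the formula above, while the function name, tensor shapes, and everything else are assumptions rather than the repo's actual code:

```python
import torch
import torch.nn.functional as F

def decayed_copy_probs(prev_probs, cur_probs, decay=0.1):
    # prev_probs: (src_len,) copy distribution over source tokens at step t-1
    # cur_probs:  (src_len,) copy distribution over source tokens at step t
    shifted = torch.zeros_like(prev_probs)
    shifted[1:] = prev_probs[:-1]                   # Prob_{t-1}(token_{i-1})
    mixed = decay * shifted + (1 - decay) * cur_probs
    return F.softmax(mixed, dim=-1)                 # Prob'_t(token)
```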