Differences with subword-nmt
loretoparisi opened this issue · 5 comments
I have found that when I load the same fastBPE
codes and vocabulary into subword-nmt,
I get different BPE segmentations:
Using fastBPE
hoy quiero que te qu@@ ede &@@ apo@@ s@@ ; a dormir
this song is gonna make you mad
Using subword-nmt
ho@@ y qui@@ ero que te que@@ de &@@ apo@@ s@@ ; a dor@@ mir
th@@ is son@@ g is gon@@ na make you mad
using the same codes and vocabulary, with minimal adaptation for the latter package. My understanding of BPE
was that the two implementations should behave almost identically.
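For context, both tools implement the same core idea: merges from the codes file are applied greedily to each word, best-ranked (earliest-learned) pair first. A minimal, illustrative sketch (not either library's actual code; the toy merge table below is made up):

```python
# Minimal sketch of greedy BPE application: repeatedly merge the adjacent
# symbol pair with the lowest rank (i.e. the pair learned earliest) until
# no learned pair remains in the word.

def apply_bpe(word, merges):
    """Segment `word` using `merges`, a dict mapping a symbol pair such as
    ("e", "n") to its rank (its line number in the codes file)."""
    symbols = list(word) + ["</w>"]  # end-of-word marker, as in the codes file
    while len(symbols) > 1:
        # Rank every adjacent pair; unknown pairs get rank infinity.
        pairs = [(merges.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(pairs)
        if rank == float("inf"):
            break  # no learned merge applies any more
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]  # apply the merge
    return symbols

# Toy merge table in codes-file order; real files list tens of thousands.
toy_merges = {("e", "n"): 0, ("d", "e"): 1, ("de", "r"): 2}
print(apply_bpe("enden", toy_merges))  # → ['en', 'd', 'en', '</w>']
```

Given identical codes, this greedy loop is deterministic, which is why differing outputs usually point to a difference in how the vocabulary is used, not in the merge algorithm itself.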
I have asked subword-nmt
author as well: rsennrich/subword-nmt#76
[UPDATE]
Given the new Python API wrapper, subword-nmt
is not necessary anymore; still, it would be interesting to understand those differences!
Thanks a lot!
How large is the dataset on which you learned the BPE codes? I believe the original implementation does not merge two BPE splits if the resulting word appears only once in the original corpus. This makes a difference for small training datasets (typically, if the word "gonna" appears only once in your training set), but otherwise there should be no difference.
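The behaviour described above corresponds to subword-nmt's vocabulary filter (exposed on the command line as `--vocabulary` / `--vocabulary-threshold`): a merged symbol is kept only if it clears a frequency threshold, otherwise the merge is undone. A hedged sketch, with made-up frequencies and a simplified fallback to characters:

```python
# Illustrative vocabulary-threshold filter: after BPE segmentation, any
# symbol whose training-corpus frequency is at or below the threshold is
# split back apart (here, all the way to characters for simplicity).

def filter_segmentation(symbols, vocab, threshold=1):
    """Keep a symbol only if vocab[symbol] > threshold; otherwise undo
    the merge by falling back to its characters."""
    out = []
    for sym in symbols:
        if vocab.get(sym, 0) > threshold:
            out.append(sym)
        else:
            out.extend(list(sym))  # rare symbol: revert to characters
    return out

# Toy frequencies: "gonna" was seen only once in the training corpus.
vocab = {"gon": 5, "na": 7, "gonna": 1}
print(filter_segmentation(["gonna"], vocab))     # rare, so it is split up
print(filter_segmentation(["gon", "na"], vocab))  # both frequent, kept as-is
```

This is exactly the kind of step that can make two otherwise identical BPE implementations diverge on rare words.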
@glample it's FAIR LASER :)
So codes and vocabulary are
Codes: https://dl.fbaipublicfiles.com/laser/models/93langs.fcodes
Vocab: https://dl.fbaipublicfiles.com/laser/models/93langs.fvocab
root@3f40ea8e2cc4:/tornado_api# head -n10 /root/laser_models/93langs.fvocab
. 87264459
, 78156033
de 19001435
- 13731976
? 13338524
a 13062980
i 8917603
en 8272731
" 8258142
la 7623301
root@3f40ea8e2cc4:/tornado_api# head -n10 /root/laser_models/93langs.fcodes
e n 52708119
e r 51024442
e n</w> 47209692
a n 46619244
i n 44583543
s t 42633672
a r 34974160
o n 31941788
t i 30717853
d e 30509691
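For reference, both dumps above follow a simple line format: the vocab file is `token count`, and the codes file is `left right count`, where the line order gives the merge rank and `</w>` marks a word-final symbol. An illustrative parser (function name and structure are my own, not from either library):

```python
# Parse a fastBPE-style codes file: each line is "left right count";
# the line's position gives the merge priority (earlier = applied first).

def load_codes(lines):
    merges = {}
    for rank, line in enumerate(lines):
        left, right, _count = line.split()
        merges[(left, right)] = rank  # count is kept in the file but unused here
    return merges

# First three lines of 93langs.fcodes, as shown above.
codes = ["e n 52708119", "e r 51024442", "e n</w> 47209692"]
merges = load_codes(codes)
print(merges[("e", "n")])      # → 0, the highest-priority merge
print(merges[("e", "n</w>")])  # → 2, a word-final variant of the same pair
```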
Why does subword-nmt's vocab contain the "@@" marker, while fastBPE's does not?
Closing this, because it solved my issue. Not sure about @zpppy's question.