glample/fastBPE

getvocab issue using french text

ctoffo opened this issue · 0 comments

Hello guys,

I ran getvocab on a French text with the following command:
./fast getvocab marie_claire.txt > new_vocab

However, I noticed what looks like a bug (if it is one): some tokens are duplicated, and the second copy is written with a line break. Here is an example (a short extract of the full vocab output):

[Screenshot, 2019-10-24 at 17:00:32]

You can see this with et and de in the example above. Furthermore, the vocab begins with exactly this: a line break, a space, and the frequency (2439). Is that also a bug?

Here is the French text:
wget -O marie_claire.txt http://www.gutenberg.org/cache/epub/58501/pg58501.txt
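
One possible explanation, not confirmed in this thread, is that the downloaded file uses Windows-style line endings (`\r\n`). If the tokenizer splits only on spaces and `\n` (an assumption about getvocab, not something verified here), a word at the end of a line keeps its trailing `\r` and is counted separately from the same word elsewhere, and a blank line becomes a token consisting of `\r` alone. The Python sketch below reproduces that effect on a small made-up sample; `count_tokens` and `strip_carriage_returns` are hypothetical helpers written for this illustration, not part of fastBPE.

```python
from collections import Counter


def count_tokens(text: str) -> Counter:
    """Count tokens, splitting only on ' ' and '\\n'.

    Assumption: this mimics how the vocab tool tokenizes; a '\\r'
    is therefore treated as an ordinary character inside a token.
    """
    counts = Counter()
    for line in text.split("\n"):
        for token in line.split(" "):
            if token:  # skip empty strings produced by repeated separators
                counts[token] += 1
    return counts


def strip_carriage_returns(text: str) -> str:
    """Normalize '\\r\\n' and lone '\\r' line endings to '\\n'."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Made-up sample with CRLF line endings and one blank line.
sample = "le chat et le chien\r\n\r\nle pain et\r\nle vin de\r\n"

raw = count_tokens(sample)
clean = count_tokens(strip_carriage_returns(sample))

# Tokens that carry a hidden carriage return.
suspicious = sorted(token for token in raw if "\r" in token)
print([repr(token) for token in suspicious])
print("et before:", raw["et"], "+", raw["et\r"], "| et after:", clean["et"])
```

To try this on the real file, read it with `open("marie_claire.txt", newline="")` so that Python does not translate the line endings, then pass the contents to `count_tokens`. If tokens containing `\r` show up, writing the output of `strip_carriage_returns` to a new file and running getvocab on that file would be a reasonable next test.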

Any ideas?

Thanks a lot for your help :)