getvocab issue with French text
ctoffo opened this issue · 0 comments
Hello guys,
I ran getvocab on a French text with the following command:
./fast getvocab marie_claire.txt > new_vocab
However, I have seen what may be a bug: some tokens are duplicated, and the second copy of the token is written with a line break. Here is an example (a short extract of the full vocab output):
You can see this for et and de in the example above. Furthermore, the vocab starts exactly as follows: a line break, a space and the frequency (2439). Is this also a bug?
Here is the French text:
wget -O marie_claire.txt http://www.gutenberg.org/cache/epub/58501/pg58501.txt
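One possible cause worth checking (an assumption, not something confirmed in this report): the downloaded file may use CRLF line endings. If getvocab splits tokens only on spaces and "\n", a word at the end of a line keeps a trailing "\r" and is counted separately from the same word elsewhere, and each empty line contributes a token consisting of "\r" alone. The sketch below imitates that behaviour on a tiny made-up sample; `count_tokens` is a hypothetical helper written for illustration, not the tool's actual code.

```python
from collections import Counter


def count_tokens(text: str) -> Counter:
    """Count tokens, splitting ONLY on spaces and '\n'.

    Assumption: this mimics a tokenizer that does not treat '\r'
    as whitespace. It is not taken from the getvocab source.
    """
    counts = Counter()
    for line in text.split("\n"):
        for tok in line.split(" "):
            if tok:  # skip empty strings produced by repeated separators
                counts[tok] += 1
    return counts


# Made-up sample with CRLF line endings and one empty line.
sample = "et de\r\n\r\nde et\r\n"

raw = count_tokens(sample)                      # '\r' stays glued to tokens
clean = count_tokens(sample.replace("\r", ""))  # carriage returns stripped

for tok, freq in raw.most_common():
    print(repr(tok), freq)
print(dict(clean))
```

If the real file shows the same pattern, stripping carriage returns before running getvocab (for example with `tr -d '\r' < marie_claire.txt > marie_claire_lf.txt`, where the output file name is only an example) should merge the duplicated entries.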
Any ideas?
Thanks a lot for your help :)