Need some information about the tokenizer
xtfocus opened this issue · 0 comments
xtfocus commented
Hi, thanks for the great work.
I'm new to the Vietnamese language modeling scene. In several prominent articles from 2019-2021, word segmentation is treated as the standard step before tokenization. I appreciate this, but I'm still not sure whether it is actually necessary, and I couldn't find much information to answer that myself.
Then I took a look at your paper: you trained a BPE tokenizer (a form of sub-word tokenization). I have a few questions:
- Is it correct that word segmentation was not used at all when creating PhoGPT? If so, I would love to hear the reasoning.
- You used word segmentation in PhoBERT. Why didn't you use BPE back then?
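To make sure I understand the two setups I'm asking about, here is a toy sketch in Python (the example sentence and its segmentation are my own, not taken from either paper):

```python
# Vietnamese words can span multiple syllables, e.g. "sinh viên" ("student").
sentence = "Tôi là sinh viên"  # "I am a student"

# PhoBERT-style preprocessing: a word segmenter first joins the syllables of
# each multi-syllable word with "_", and the subword tokenizer runs on that.
segmented = "Tôi là sinh_viên"

# Without segmentation (as I understand PhoGPT), BPE sees the raw syllables:
print(sentence.split())   # ['Tôi', 'là', 'sinh', 'viên']

# With segmentation, "sinh_viên" is a single unit before BPE is applied:
print(segmented.split())  # ['Tôi', 'là', 'sinh_viên']
```

So my question boils down to whether losing the word boundary between "sinh" and "viên" actually hurts a BPE-based model in practice.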