VinAIResearch/PhoGPT

Need some information about the tokenizer

xtfocus opened this issue · 0 comments

Hi, thanks for the great work.

I'm new to Vietnamese language modeling. In several major articles from 2019-2021, word segmentation is treated as the standard preprocessing step before tokenization. I can see the appeal, but I'm still not sure whether it is actually necessary, and I couldn't find much information to answer that myself.
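To make the question concrete, here is a toy sketch of what Vietnamese word segmentation does before any tokenizer runs (the dictionary and function below are made up for illustration; real pipelines use tools such as VnCoreNLP's RDRSegmenter):

```python
# Illustrative only, not the actual PhoBERT/PhoGPT pipeline.
# Word segmentation joins the syllables of a multi-syllable word with "_",
# so the downstream tokenizer sees one token per word rather than per syllable.

# "học sinh" (student) is one word made of two syllables.
MULTI_SYLLABLE_WORDS = {("học", "sinh"): "học_sinh"}  # toy dictionary

def segment(text: str) -> str:
    """Greedy dictionary-based segmenter (toy example)."""
    syllables = text.split()
    out, i = [], 0
    while i < len(syllables):
        pair = tuple(syllables[i:i + 2])
        if pair in MULTI_SYLLABLE_WORDS:
            out.append(MULTI_SYLLABLE_WORDS[pair])
            i += 2
        else:
            out.append(syllables[i])
            i += 1
    return " ".join(out)

print(segment("học sinh học sinh học"))  # → học_sinh học_sinh học
```

With segmentation, a BPE tokenizer is trained on `học_sinh`-style input (as in PhoBERT); without it, BPE operates directly on raw syllables and must learn word boundaries from data.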

Then I took a look at your paper: you trained a BPE tokenizer (a form of subword tokenization). I have a few questions:


  1. Is it correct that word segmentation is not used at all to create PhoGPT? If that's correct, I would love some reasoning.
  2. You used word segmentation in PhoBERT. Why didn't you apply BPE to unsegmented text back then?