malaysia-ai/prepare-tokenizer
Prepare SentencePiece and BPE tokenizers on Malaysian texts (Jawi, Melayu, Manglish, Mandarin, Tamil).
Jupyter Notebook
Issues
- #3 Train sentencepiece tokenizer 32k size (opened by huseinzol05)
- #2 combine texts (opened by huseinzol05)
- #1 Dedup texts (opened by huseinzol05)
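
The "combine texts" and "Dedup texts" issues do not say how the corpora are merged or deduplicated. As a rough illustration only, the sketch below assumes exact-match deduplication at line level after Unicode NFC normalization and whitespace collapsing. The function names `normalize` and `combine_and_dedup` are hypothetical and not taken from the repository.

```python
import hashlib
import unicodedata
from typing import Iterable, Iterator


def normalize(line: str) -> str:
    # Assumption: NFC normalization plus whitespace collapsing is enough
    # to make trivially different copies of a line compare equal.
    return " ".join(unicodedata.normalize("NFC", line).split())


def combine_and_dedup(sources: Iterable[Iterable[str]]) -> Iterator[str]:
    """Yield each distinct non-empty line once, in first-seen order."""
    seen = set()
    for source in sources:
        for line in source:
            text = normalize(line)
            if not text:
                continue  # skip blank lines
            # Store a fixed-size digest instead of the full line to bound memory.
            digest = hashlib.sha1(text.encode("utf-8")).digest()
            if digest in seen:
                continue
            seen.add(digest)
            yield text


if __name__ == "__main__":
    corpus_a = ["saya  suka makan", "hello"]
    corpus_b = ["saya suka makan", "", "வணக்கம்"]
    for line in combine_and_dedup([corpus_a, corpus_b]):
        print(line)
```

Exact matching will miss near-duplicates such as lines that differ by punctuation or casing. A fuzzy method like MinHash would be a separate design choice.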