
Pretrained models and training code for SentencePiece


Sentencepiece-Pretrained-Models

Pretrained models

| pretrained model | dataset | vocab size | model type | max sentence length |
|---|---|---|---|---|
| movie-corpus_8000 | Cornell Movie-Dialogs Corpus | 8,000 | bpe | 999,999 |
| kowiki_8000 | Korean Wikipedia dumps | 8,000 | bpe | 999,999 |
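
The .model files above can be loaded directly with the sentencepiece Python package. A minimal sketch, assuming the movie-corpus model sits at the path used in the sample command below:

```python
import sentencepiece as spm

# Load a pretrained model file (path taken from the sample command below).
sp = spm.SentencePieceProcessor(model_file="output/movie-corpus_8000.model")

# Encode text into subword IDs or pieces, and decode IDs back to text.
ids = sp.encode("i am ironman", out_type=int)
pieces = sp.encode("i am ironman", out_type=str)
print(ids)             # token IDs, e.g. [7701, 397, 6636, 627]
print(pieces)          # the corresponding subword pieces
print(sp.decode(ids))  # back to "i am ironman"
```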


Test Model (sample.py)

| flag | description | example | default |
|---|---|---|---|
| -m, --model | pretrained model file path | -m output/movie-corpus_8000.model | required |
| -i, --inputs | input texts as a list | -i "I am Iron man" "hello from the other side" | required |
$ python sample.py -i "i am ironman" "hello from the other side" -m output/movie-corpus_8000.model
i am ironman
[7701, 397, 6636, 627]

hello from the other side
[3619, 279, 21, 467, 1250]
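
For reference, a script with the interface described above could look roughly like the following; this is a sketch of the same behaviour, not necessarily the repository's actual sample.py:

```python
import argparse
import sentencepiece as spm


def main():
    # Flags mirror the table above: -m/--model and -i/--inputs are both required.
    parser = argparse.ArgumentParser(description="Encode texts with a pretrained SentencePiece model")
    parser.add_argument("-m", "--model", required=True, help="pretrained model file path")
    parser.add_argument("-i", "--inputs", required=True, nargs="+", help="input texts as a list")
    args = parser.parse_args()

    sp = spm.SentencePieceProcessor(model_file=args.model)
    for text in args.inputs:
        # Print each input followed by its encoded token IDs, as in the example output.
        print(text)
        print(sp.encode(text, out_type=int))
        print()


if __name__ == "__main__":
    main()
```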


Train Model (train.py)

| flag | description | example | default |
|---|---|---|---|
| -m, --mode | preprocessing mode | -m wiki | required |
| -i, --input | input file path | -i src/movie-corpus/utterances.jsonl | required |
| -p, --prefix | custom name for model | -p movie-corpus | required |
| -v, --vocab | vocab size | -v 8000 | required |
| -t, --model_type | model type | -t bpe | bpe |
| -l, --max_sentence_length | max sentence length | -l 999999 | 999999 |
  • trained models will be saved in outputs/
python train.py --mode movie-corpus --input src/movie-corpus/utterances.jsonl --prefix movie-corpus --vocab 8000
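
For orientation, training with these options comes down to a single SentencePieceTrainer.train call once the input has been preprocessed into plain text. The sketch below is an assumption about what --mode movie-corpus roughly does (the JSONL "text" field, the intermediate corpus file, and the output paths are illustrative), not the actual contents of train.py:

```python
import json
import os

import sentencepiece as spm

# Assumed preprocessing for --mode movie-corpus: pull the raw utterance text
# out of the JSONL file into a plain-text corpus, one utterance per line.
# The "text" field name and the intermediate corpus file are assumptions.
os.makedirs("output", exist_ok=True)
with open("src/movie-corpus/utterances.jsonl", encoding="utf-8") as f_in, \
        open("output/movie-corpus.txt", "w", encoding="utf-8") as f_out:
    for line in f_in:
        utterance = json.loads(line)
        f_out.write(utterance["text"].strip() + "\n")

# Training options map directly onto the CLI flags above
# (--prefix, --vocab, --model_type, --max_sentence_length).
spm.SentencePieceTrainer.train(
    input="output/movie-corpus.txt",
    model_prefix="output/movie-corpus_8000",
    vocab_size=8000,
    model_type="bpe",
    max_sentence_length=999999,
)
```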


ISSUE

  • wikiextractor
    • (Error) ValueError: cannot find context for 'fork'
      • This is a known issue on Windows 10. No official fix is available yet, but you can patch the affected files yourself as described in huggingface/transformers#16898 (see the sketch below).
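
For context, the error comes from Python's multiprocessing start methods: the 'fork' start method only exists on Unix-like systems, so a request for a 'fork' context fails on Windows. A minimal illustration (not a fix):

```python
import multiprocessing

# On Linux this list includes 'fork'; on Windows it is only ['spawn'],
# which is why multiprocessing.get_context("fork") raises
# ValueError: cannot find context for 'fork'.
print(multiprocessing.get_all_start_methods())
```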