
Pretrained models and training code for SentencePiece


Sentencepiece-Pretrained-Models

Pretrained models

| pretrained model | dataset | vocab size | model type | max sentence length |
|---|---|---|---|---|
| movie-corpus_8000 | Cornell Movie-Dialogs Corpus | 8,000 | bpe | 999,999 |
| kowiki_8000 | Korean Wikipedia dumps | 8,000 | bpe | 999,999 |
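
The .model files above can be loaded directly with the sentencepiece Python package. A minimal sketch, assuming the movie-corpus model sits at the path used in the sample command below:

```python
import sentencepiece as spm

# Load a pretrained model file (path taken from the sample command below).
sp = spm.SentencePieceProcessor(model_file="output/movie-corpus_8000.model")

# Encode text into subword IDs or pieces, and decode IDs back to text.
ids = sp.encode("i am ironman", out_type=int)
pieces = sp.encode("i am ironman", out_type=str)
print(ids)             # token IDs, e.g. [7701, 397, 6636, 627]
print(pieces)          # the corresponding subword pieces
print(sp.decode(ids))  # back to "i am ironman"
```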


Test Model (sample.py)

| flag | description | example | default |
|---|---|---|---|
| -m, --model | pretrained model file path | -m output/movie-corpus_8000.model | required |
| -i, --inputs | input texts as a list | -i "I am Iron man" "hello from the other side" | required |
$ python sample.py -i "i am ironman" "hello from the other side" -m output/movie-corpus_8000.model
i am ironman
[7701, 397, 6636, 627]

hello from the other side
[3619, 279, 21, 467, 1250]
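
For reference, a script with the interface described above could look roughly like the following; this is a sketch of the same behaviour, not necessarily the repository's actual sample.py:

```python
import argparse
import sentencepiece as spm


def main():
    # Flags mirror the table above: -m/--model and -i/--inputs are both required.
    parser = argparse.ArgumentParser(description="Encode texts with a pretrained SentencePiece model")
    parser.add_argument("-m", "--model", required=True, help="pretrained model file path")
    parser.add_argument("-i", "--inputs", required=True, nargs="+", help="input texts as a list")
    args = parser.parse_args()

    sp = spm.SentencePieceProcessor(model_file=args.model)
    for text in args.inputs:
        # Print each input followed by its encoded token IDs, as in the example output.
        print(text)
        print(sp.encode(text, out_type=int))
        print()


if __name__ == "__main__":
    main()
```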


Train Model (train.py)

| flag | description | example | default |
|---|---|---|---|
| -m, --mode | preprocessing mode | -m wiki | required |
| -i, --input | input file path | -i src/movie-corpus/utterances.jsonl | required |
| -p, --prefix | custom name for model | -p movie-corpus | required |
| -v, --vocab | vocab size | -v 8000 | required |
| -t, --model_type | model type | -t bpe | bpe |
| -l, --max_sentence_length | max sentence length | -l 999999 | 999999 |
  • trained models will be saved in outputs/
python train.py --mode movie-corpus --input src/movie-corpus/utterances.jsonl --prefix movie-corpus --vocab 8000
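
For orientation, training with these options comes down to a single SentencePieceTrainer.train call once the input has been preprocessed into plain text. The sketch below is an assumption about what --mode movie-corpus roughly does (the JSONL "text" field, the intermediate corpus file, and the output paths are illustrative), not the actual contents of train.py:

```python
import json
import os

import sentencepiece as spm

# Assumed preprocessing for --mode movie-corpus: pull the raw utterance text
# out of the JSONL file into a plain-text corpus, one utterance per line.
# The "text" field name and the intermediate corpus file are assumptions.
os.makedirs("output", exist_ok=True)
with open("src/movie-corpus/utterances.jsonl", encoding="utf-8") as f_in, \
        open("output/movie-corpus.txt", "w", encoding="utf-8") as f_out:
    for line in f_in:
        utterance = json.loads(line)
        f_out.write(utterance["text"].strip() + "\n")

# Training options map directly onto the CLI flags above
# (--prefix, --vocab, --model_type, --max_sentence_length).
spm.SentencePieceTrainer.train(
    input="output/movie-corpus.txt",
    model_prefix="output/movie-corpus_8000",
    vocab_size=8000,
    model_type="bpe",
    max_sentence_length=999999,
)
```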


ISSUE

  • wikiextractor
    • (Error) ValueError: cannot find context for 'fork'
      • This is a known issue on Windows 10. No official fix is available yet, but you can patch the affected files yourself as described in huggingface/transformers#16898 (see the sketch below).
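
For context, the error comes from Python's multiprocessing start methods: the 'fork' start method only exists on Unix-like systems, so a request for a 'fork' context fails on Windows. A minimal illustration (not a fix):

```python
import multiprocessing

# On Linux this list includes 'fork'; on Windows it is only ['spawn'],
# which is why multiprocessing.get_context("fork") raises
# ValueError: cannot find context for 'fork'.
print(multiprocessing.get_all_start_methods())
```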