- Download corpus texts

      # download Tekstaro texts
      python download-tekstaro.py \
        --tmp_dir=./tmp \
        --output_dir=./corpus

      # download Wikipedia texts
      python download-wikipedia.py \
        --page_list=./wikipedia-featured.txt \
        --output_dir=./corpus
      python download-wikipedia.py \
        --page_list=./wikipedia-legindaj.txt \
        --output_dir=./corpus

      # download Marvirinstrato
      wget -O ./corpus/marvirinstrato.txt \
        https://www.smashwords.com/books/download/267558/6/latest/0/0/marvirinstrato-originalaj-noveloj-en-esperanto-esperanto-edi.txt

      # download OSCAR text
      wget -O ./corpus/oscar.eo.txt \
        https://cdn-datasets.huggingface.co/EsperBERTo/data/oscar.eo.txt

  The corpus consists of text from the following sources:

  - Tekstaro excluding Homaranismo (1906)
  - Wikipedia Elstaraj artikoloj and Legindaj artikoloj
  - Marvirinstrato
  - Esperanto subset of OSCAR
- Compile the corpus

      python compile-corpus.py --split --split_len=2048
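To give a rough idea of what splitting the corpus into fixed-length pieces could look like, here is a minimal sketch. It assumes that `--split_len=2048` is a maximum number of characters per chunk; the actual `compile-corpus.py` may count tokens or lines instead. The function name `split_text` is hypothetical and not taken from the repository.

```python
def split_text(text: str, split_len: int = 2048) -> list[str]:
    """Split text into chunks of at most split_len characters.

    Assumption: the limit is counted in characters and chunks break on
    whitespace. A single word longer than split_len is kept whole, so
    that one chunk may exceed the limit.
    """
    chunks = []
    current = []      # words collected for the chunk being built
    current_len = 0   # length of " ".join(current)
    for word in text.split():
        # +1 accounts for the joining space when the chunk is non-empty
        added = len(word) + (1 if current else 0)
        if current and current_len + added > split_len:
            # The word does not fit: close the chunk and start a new one
            chunks.append(" ".join(current))
            current = [word]
            current_len = len(word)
        else:
            current.append(word)
            current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    for chunk in split_text("la hundo kuras rapide", split_len=10):
        print(repr(chunk))
```

Breaking on whitespace keeps words intact, which matters if the chunks are later fed to a tokenizer.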
- Train the tokenizer

      python train-tokenizer.py --init --train_file=corpus.txt
- Split training and test data

      python split.py
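For readers unfamiliar with this step, the following is a minimal sketch of a random train/test split over corpus lines. It is not the code in `split.py`: the function name `split_lines`, the 10% test ratio, and the fixed seed are all assumptions chosen for illustration.

```python
import random


def split_lines(lines, test_ratio=0.1, seed=0):
    """Shuffle lines and return (train, test) lists.

    Assumptions: a line-level random split, a 10% test share, and a fixed
    seed so the split is reproducible. The real split.py may differ.
    """
    rng = random.Random(seed)   # local generator, leaves global state alone
    shuffled = list(lines)      # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    corpus = [f"linio {i}" for i in range(100)]
    train, test = split_lines(corpus)
    print(len(train), len(test))
```

Fixing the seed means repeated runs produce the same split, so evaluation results stay comparable across training runs.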
- Train the model

      python train.py --init --epocs=1
- Prompt the model

      python inference.py --text="Saluton"