Here I tested three approaches to text generation.
Using PyTorch and word embeddings. Original post: https://machinetalk.org/2019/02/08/text-generation-with-pytorch/
I added the spaCy library to test its embedding algorithm.
To use it, modify the train_file hyperparameter to point to the *.txt file you want to train the model with. In Source/TXT/merged.txt you can find a file made from 178 books in Spanish. Once you have defined your train_file (and any other hyperparameters you want to modify), you can execute the script.
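As a rough sketch of how pretrained word vectors can seed a PyTorch embedding layer (the vocabulary, dimension, and random stand-in vectors below are illustrative, not taken from this repo; with spaCy you would instead read `nlp.vocab[word].vector` from a model that ships vectors, such as `es_core_news_md`):

```python
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
# Stand-ins for pretrained vectors; with spaCy you would fill each row
# from nlp.vocab[word].vector after loading a model with vectors.
vocab = ["todo", "era", "caos", "<unk>"]
dim = 50
weights = rng.standard_normal((len(vocab), dim)).astype(np.float32)

# Build the embedding layer from the pretrained matrix; freeze=False
# lets the vectors keep training along with the rest of the model.
embedding = nn.Embedding.from_pretrained(torch.from_numpy(weights), freeze=False)

ids = torch.tensor([[0, 1, 2]])   # one sequence of three token ids
vectors = embedding(ids)
print(vectors.shape)              # torch.Size([1, 3, 50])
```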
Using TensorFlow and one-hot encoding. Original post: https://www.analyticsvidhya.com/blog/2018/03/text-generation-using-python-nlp/
To use it, modify the path to the *.txt file you want to train the model with, on line 9. In Source/TXT/merged.txt you can find a file made from 178 books in Spanish. Once you have defined your train file, you can execute the script.
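A minimal sketch of the one-hot step itself (the toy text and helper name are illustrative, not the script's actual code): each character becomes a vector with a single 1 at its vocabulary index, which is what the network consumes.

```python
import numpy as np

text = "todo era caos"
chars = sorted(set(text))                       # character vocabulary
char_to_ix = {c: i for i, c in enumerate(chars)}

def one_hot(seq):
    # One row per character, one column per vocabulary symbol.
    out = np.zeros((len(seq), len(chars)), dtype=np.float32)
    for t, ch in enumerate(seq):
        out[t, char_to_ix[ch]] = 1.0
    return out

X = one_hot(text)
print(X.shape)   # (13, 9): 13 characters, 9 unique symbols
```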
Using Transformers under PyTorch. Original post: https://towardsdatascience.com/train-a-gpt-2-transformer-to-write-harry-potter-books-edf8b2e3f3db
To use it, train the model with the following command:
!python /content/run_lm_finetuning.py --output_dir=output --model_type=gpt2 --model_name_or_path=gpt2-medium --do_train --train_data_file='/content/TXT/merged.txt' --do_eval --eval_data_file='/content/TXT/merged copia.txt' --overwrite_output_dir --block_size=200 --per_gpu_train_batch_size=1 --save_steps 4000 --num_train_epochs=4
If you want to know more about the available arguments, please refer to the original post.
For testing, call the generation script with the following command:
!python /content/run_generation.py --model_type=gpt2 --model_name_or_path=output --length 300 --prompt "Todo era caos, y nadie alcanzaba a comprender." --temperature=1.0
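The `--temperature` argument rescales the model's logits before sampling: values below 1.0 sharpen the distribution toward the likeliest tokens, values above 1.0 flatten it. A minimal NumPy sketch of that idea (the function name and toy logits are illustrative, not code from run_generation.py):

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    # Divide logits by the temperature, softmax, then draw one token id.
    rng = rng or np.random.default_rng(0)
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()                        # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.1]
token = sample_with_temperature(logits, temperature=1.0)
```

With a very low temperature the sampler collapses to the argmax token; with a high one it approaches uniform choice over the vocabulary.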
You can find in Source/ two scripts to convert a PDF file to text for data processing.