This project was carried out as part of the Natural Language Processing class taught by Matthias Gallé, Naverlabs, for the MSc AI (CentraleSupelec 2019/2020). Project members are:
- Gaël de Léséleuc
- Alexandre Duval
- Thomas Lamson
A detailed description of the project can be found in the report folder. The idea is to develop an automatic writing tool that assists authors. The intended final product is a web service where authors can write text and request the automatic generation of a paragraph based on the preceding text, the following text, a desired size and theme, a list of entities to mention, and an optional summary of the content. More concretely, we fine-tuned an OpenAI GPT2 model on novel-specific data for controllable and contextualised text generation.
You can either install the project to simply use it and have fun with text generation, or you can dive deeper and install it for training.
A frontend interface is available at http://textgen.thomas-lamson.com/, but there is no backend running behind it (for cost reasons). However, you can easily run the model yourself locally and use the frontend to interact with it! Here are the steps to do so:
- Clone this repository:
git clone https://github.com/WeazelDev/AITextGenerator.git
cd AITextGenerator
pip install -r requirements.txt
- Download the pretrained models:
The archive contains the NER model, which we used as is, and our trained version of GPT2-Small.
- Download the archive at: https://drive.google.com/open?id=1svTqyugLI36zaX6Fo6hr-Od4ATRxH2vf (~1.6 GB)
- Extract it directly into the project's root folder.
- Go into webserver/running_directory/ and run launch_allinone_backend.py.
This will run a local backend on the default port 7777 (you don't need to open the port, since the backend is only accessed on localhost). To change the port, to change the NER or generation models, or to run the master, generation and NER backends as separate servers on separate ports, edit the values inside config.json.
- Navigate to http://textgen.thomas-lamson.com/ with your favorite browser (works best with Google Chrome).
- Check the box labelled "Run on local server?" at the top of the page.
Again, change the connection port on the frontend page if needed.
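Before switching the frontend to the local server, it can help to confirm that something is actually listening on the backend port. The following is a minimal sketch using only Python's standard library. The helper name is_port_open is hypothetical and not part of this repository, and 7777 is simply the default port mentioned above.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake and raises OSError on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 7777 is the default port; adjust it if you changed the value in config.json.
    status = "reachable" if is_port_open("127.0.0.1", 7777) else "not reachable"
    print(f"Local backend on port 7777 is {status}.")
```

Note that an open port only shows that the server process has started. The models may still be loading behind it.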
Please note that the first NER and generation requests will likely fail with a "Servers overloaded" error while the heavy models are loading on the backend side. Just wait a bit, type something in the editor to refresh it, and it should work fine. If not, your machine may not have enough computing power to stay under the frontend's timeout delays. In that case, you might want to grab your own copy of the frontend's source code and modify it: https://github.com/WeazelDev/AITextGeneratorFront.
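The "wait and try again" advice above can also be automated if you script requests against the local backend yourself. This is a generic retry sketch, not the frontend's actual logic: retry_until_ready and its parameters are hypothetical names, and the request argument is a placeholder for whatever call you make to the backend.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_until_ready(
    request: Callable[[], T],
    attempts: int = 5,
    delay_seconds: float = 2.0,
) -> T:
    """Call request() until it succeeds, sleeping delay_seconds after each failure.

    Re-raises the last exception if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            time.sleep(delay_seconds)  # give the models time to finish loading
    raise AssertionError("unreachable")  # the loop always returns or raises
```

A fixed delay keeps the sketch simple. If the models take long to load on your machine, increasing attempts or delay_seconds is the first thing to try.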
If you want to train the model, adapt it to your own projects or even plug another model into our general code structure, feel free to do so! Here is how to get started with the original model:
- Clone this repository:
git clone https://github.com/WeazelDev/AITextGenerator.git
cd AITextGenerator
pip install -r requirements.txt
- Download the data archive (if needed):
It contains all the input and output data extracted and generated by this project.
- Download the archive at: https://drive.google.com/open?id=19b_x5dsie21Z6ZW7R6vnvwN3IPaKPXMv (~250 MB)
- Extract it directly into the project's root folder.
Refer to the data.py file to see the framework we used to obtain this data. If you want to process new books or re-run the code, you first need to populate the Gutenberg cache (a long process). To do so, uncomment the first lines of the code in main.py.
- Download the pretrained models:
The archive contains the NER model, which we used as is, and our trained version of GPT2-Small.
- Download the archive at: https://drive.google.com/open?id=1svTqyugLI36zaX6Fo6hr-Od4ATRxH2vf (~1.6 GB)
- Extract it directly into the project's root folder.
To run our experiments, we used the following scripts, which can be found in the project root:
- splitter.py: splits the raw text files into paragraphs and saves them in JSON files
- ner.py: performs entity recognition on the paragraphs
- summarization.py: summarizes the paragraphs with different summarizers
- finetuning.py: fine-tunes GPT2. We used the fine-tuning script proposed by Huggingface and made a few changes to load our custom torch dataset and correctly handle some specificities of our project.
- evaluation.py: generates text with a fine-tuned GPT2 and computes our custom metrics on the generated paragraphs
- json_generation handles all the text preprocessing: from raw text and metadata (extracted from Gutenberg) to the final JSON file containing the novel split into paragraphs and the related information: the list of entities inside each paragraph, size, summaries, etc.
- torch_loader loads and vectorizes the data (the preprocessed JSON file) on the fly so that it can be fed directly into a GPT2 model for fine-tuning
- model_training contains the script used to fine-tune the GPT2 model. It is simply an adaptation of huggingface/run_language_modeling that allows it to be used on our custom dataset
- model_evaluation is used to evaluate the output quality of our fine-tuned GPT2 model
- model_use interfaces our fine-tuned GPT2 model with the web_service backend
- web_server contains the backend interface of our web service. The web service frontend has been pushed to a separate repo and can be found at: https://github.com/WeazelDev/AITextGeneratorFront
- third_party contains several frameworks that have been cloned directly into our project.
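To make the preprocessing output more concrete, here is a minimal sketch of splitting raw text into paragraphs and storing one record per paragraph in JSON. It is not the code from splitter.py or json_generation: the function names and the field names (text, size, entities, summaries) are assumptions chosen to mirror the description above. Paragraphs are assumed to be separated by blank lines, size is assumed to be a word count, and the entities and summaries fields are left empty for later steps to fill.

```python
import json
import re
from typing import Dict, List


def split_into_paragraphs(raw_text: str) -> List[str]:
    """Split text on blank lines and collapse whitespace inside each paragraph."""
    blocks = re.split(r"\n\s*\n", raw_text)
    paragraphs = [" ".join(block.split()) for block in blocks]
    return [paragraph for paragraph in paragraphs if paragraph]


def build_records(raw_text: str) -> List[Dict[str, object]]:
    """Build one record per paragraph, using a hypothetical schema."""
    return [
        {
            "text": paragraph,
            "size": len(paragraph.split()),  # assumption: size counted in words
            "entities": [],   # placeholder, to be filled by an NER step
            "summaries": {},  # placeholder, to be filled by a summarization step
        }
        for paragraph in split_into_paragraphs(raw_text)
    ]


def save_records(records: List[Dict[str, object]], path: str) -> None:
    """Write the records to a UTF-8 JSON file under a 'paragraphs' key."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"paragraphs": records}, handle, ensure_ascii=False, indent=2)
```

For example, build_records("One two.\n\nThree four five.") yields two records, and save_records(records, "novel.json") writes them to disk, where novel.json is just an example file name.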