CrossOver is an NLP project built to create a crossover between two documents. The goal is to produce an output story where the hero of one input fights the villain of the other, or maybe the heroes of both inputs fight each other! The possibilities are endless, as we attempt to do something only fanfic writers have done before.
Summarizing a document means "shortening" it so that it contains only the important text from the input. CrossOver, in contrast, does not aim to summarize but to mix relevant portions of the two texts together to create meaningful crossovers.
CrossOver is a generalized pipeline that can be invoked on any pdf or text files. You can create crossovers between your favourite comics (as many DC-Marvel crossovers as you want!), your favourite books (Harry Potter meets Katniss Everdeen?), or two of your favourite news articles (political news mixed with some Hollywood drama?). Any textual document that exists as a pdf or a text file can be used as an input!
The pipeline takes two input files, from which the crossover is to be created, and produces the crossover generated from those inputs as its output. A sample config file is shown below:
{
"filename1":"twilight.pdf",
"filename2": "hunger_games.pdf",
"no_of_sentences":1000,
"similarity_threshold": 0.5,
"TRAIN_WITH_GPT2": true,
"num_train_epochs_value": 1,
"per_gpu_train_batch_size_value": 2,
"max_length_words": 500,
"temperature": 0.5,
"max_sequence_length_gru": 100,
"step_size_gru": 3,
"max_len_chars": 1500,
"n_neurons_gru": 256,
"starting_text": "Edward and Katniss were alone in the dark",
"preprocessed_file_names" : "hunger_games_afterPreprocessing.pkl,twilight_afterPreprocessing.pkl",
"similarity_df_name" : "twilight_hunger_games_similarity_df.pkl",
"train_data_name" : "twilight_hunger_games_train_data.pkl",
"model_name": "",
"skip_gpt2_train": true
}
The different parameters are described below:
- filename1 & filename2 - The names of the pdf/txt files. Note that these files should be present in the Data Files folder.
- no_of_sentences - The number of sentences to take from each input. If you have a powerful system, you can go with a higher count (greater than 5k).
- similarity_threshold - In order to generate proper training data, our pipeline does an NER swap between similar sentences. This value is the threshold beyond which two sentences (s1 from input1 and s2 from input2) are considered similar.
- TRAIN_WITH_GPT2 - A boolean indicating whether to train and use GPT2 for generating crossovers. Highly advisable to set this flag to true if you have a dedicated GPU.
- num_train_epochs_value - A common parameter for GPT2 and GRU. The number of training epochs to run for the chosen model. More epochs can lead to better outputs, but are computationally expensive and might overfit the data.
- per_gpu_train_batch_size_value - GPT2 hyperparameter. Values greater than 2 can put significant load on the GPU, so only go higher if you have a strong GPU. (We used an RTX 2060, and values like 4 overloaded the GPU in certain cases.)
- max_length_words - GPT2 parameter. The number of words that should be present in the generated output.
- temperature - A common parameter for both GPT2 and GRU. Lower values bias the output towards the most common word/character, while higher values lead to more randomness. Values in the range 0.5 - 0.7 have given interesting results.
- max_sequence_length_gru - GRU parameter. The maximum sequence length (in characters) to use for each training row.
- step_size_gru - GRU parameter. The GRU training data is generated by sliding a window of max_sequence_length_gru characters across the inputs; step_size_gru is the number of characters the window is shifted before the next window is taken.
- max_len_chars - GRU parameter. The number of characters that should be present in the output generated by the GRU.
- n_neurons_gru - GRU hyperparameter. The number of neurons in the GRU layer.
- starting_text - A common parameter for both GPT2 & GRU. The seed text from which crossover generation starts. It can be left blank, in which case a random sentence from the train set is picked.
- Saved-files params - The cached intermediate files (see the caching notes below) can be read back through the config file using the params "preprocessed_file_names", "similarity_df_name", "train_data_name" and "model_name".
- skip_gpt2_train - A boolean indicating whether to skip retraining GPT2 and reuse the saved model. This is helpful if you want to run the pipeline again with a different starting text or a larger number of generated words, since it cuts down the run time by skipping the training process.
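For reference, here is a minimal sketch of how these parameters might be read at the start of a run. The file name config.json and its location are assumptions here (the actual config ships in the Data Files folder), and the loading code in main may differ:

```python
import json

# Load the run configuration (assumes the sample config shown above is saved
# as "config.json" inside the Data Files folder - adjust the path if needed).
with open("Data Files/config.json", "r") as f:
    config = json.load(f)

filename1 = config["filename1"]                # e.g. "twilight.pdf"
filename2 = config["filename2"]                # e.g. "hunger_games.pdf"
no_of_sentences = config["no_of_sentences"]    # sentences sampled from each input
train_with_gpt2 = config["TRAIN_WITH_GPT2"]    # True -> fine-tune GPT2, False -> GRU only

print(f"Creating a crossover between {filename1} and {filename2}")
```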
Since the entire pipeline is heavy, we cache intermediate files in a folder called SavedFiles. The files that we cache are -
- Preprocessing file
- Similarity dataframe
- Final data that will be used for training the model.
If the pipeline is run again with minimal changes, it can read from these cached files instead of recomputing them, which speeds up the overall run and gets results quickly.
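A minimal sketch of how such pickle-based caching could look (the helper names save_file and read_file are illustrative; the actual helpers live in utils):

```python
import os
import pickle

SAVED_FILES_DIR = "SavedFiles"  # cache folder used by the pipeline

def save_file(obj, filename):
    """Pickle an intermediate result so later runs can skip recomputing it."""
    os.makedirs(SAVED_FILES_DIR, exist_ok=True)
    with open(os.path.join(SAVED_FILES_DIR, filename), "wb") as f:
        pickle.dump(obj, f)

def read_file(filename):
    """Load a cached intermediate result, or return None if it is not cached yet."""
    path = os.path.join(SAVED_FILES_DIR, filename)
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)
```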
The project is divided into three stages -
- Preprocessing
- Train Set Creation
- Text Generation
As anyone familiar with NLP would tell you, preprocessing is the most fundamental part of the project. The input files go through basic preprocessing and data cleaning necessary for the later stages of the pipeline.
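As an illustration, the cleaning pass could look roughly like this. The exact steps in the preprocessing module may differ; this sketch only assumes NLTK's sentence tokenizer:

```python
import re
import nltk

nltk.download("punkt", quiet=True)  # sentence tokenizer data

def basic_preprocess(raw_text, no_of_sentences):
    """Light cleaning of a raw input document before the similarity step."""
    text = re.sub(r"[^A-Za-z0-9.,!?;:'\s]", " ", raw_text)  # drop stray symbols from the PDF
    text = re.sub(r"\s+", " ", text).strip()                # collapse whitespace and newlines
    sentences = nltk.sent_tokenize(text)
    return sentences[:no_of_sentences]                      # keep only the configured count
```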
After preprocessing, we aim to create one single train set from the two input files.
- Similarity - First we get the embeddings for each sentence of both inputs. (The embedding type can be chosen; we went with GloVe (50d) for convenience.) Then, using the Smooth Inverse Frequency method, we find pairs of sentences, one from each document, that are extremely similar to each other. (The threshold for "similar" is configurable.)
- NER Swapping - The most similar sentence pairs are then passed through a BERT-based named entity recognition model, which tags each named entity as a person, location, etc. Named entities of the two sentences that belong to the same tag are swapped, and each newly created sentence is added to the train set. A rough sketch of both steps is shown below.
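This is roughly what the two steps amount to. The sketch assumes pre-loaded 50d GloVe vectors in a dict and corpus word frequencies, omits the principal-component removal of full SIF, uses a generic Hugging Face NER pipeline, and keeps only one entity per tag; the real similarity_pipeline and ner_swaps modules may handle more cases:

```python
import numpy as np
from transformers import pipeline

def sif_embedding(sentence, glove, word_counts, total_words, a=1e-3, dim=50):
    """Smooth Inverse Frequency sentence embedding built from 50d GloVe vectors."""
    vec, weight_sum = np.zeros(dim), 0.0
    for word in sentence.lower().split():
        if word in glove:
            weight = a / (a + word_counts.get(word, 0) / total_words)  # down-weight frequent words
            vec += weight * glove[word]
            weight_sum += weight
    return vec / weight_sum if weight_sum > 0 else vec

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

ner = pipeline("ner", aggregation_strategy="simple")  # BERT-based NER tagger

def swap_entities(s1, s2):
    """Swap named entities that share the same tag between two similar sentences."""
    ents1 = {e["entity_group"]: e["word"] for e in ner(s1)}  # e.g. {"PER": "Edward"}
    ents2 = {e["entity_group"]: e["word"] for e in ner(s2)}  # e.g. {"PER": "Katniss"}
    for tag in set(ents1) & set(ents2):                      # PER, LOC, ORG, ...
        s1 = s1.replace(ents1[tag], ents2[tag])
        s2 = s2.replace(ents2[tag], ents1[tag])
    return s1, s2  # both swapped sentences go into the train set
```

Sentence pairs whose cosine similarity exceeds similarity_threshold would be fed to swap_entities, and the swapped sentences collected as the train set.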
Using this train set, we train text generation models to produce the new text that makes up the output file. There are two alternative ways of completing this task -
- Character Level Text Generation
- Word Level Text Generation
The character-level method uses a GRU model, while the word-level method fine-tunes a GPT2 transformer.
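As an illustration of the character-level branch, here is a minimal Keras-style sketch that uses the GRU parameters from the config; the actual code in LSTM_pipeline may be organised differently:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense

def build_char_gru(text, max_sequence_length_gru=100, step_size_gru=3, n_neurons_gru=256):
    """Prepare sliding-window training data and a small character-level GRU."""
    chars = sorted(set(text))
    char_to_idx = {c: i for i, c in enumerate(chars)}

    # Slide a window of max_sequence_length_gru characters, shifting it by step_size_gru.
    sequences, next_chars = [], []
    for i in range(0, len(text) - max_sequence_length_gru, step_size_gru):
        sequences.append(text[i:i + max_sequence_length_gru])
        next_chars.append(text[i + max_sequence_length_gru])

    # One-hot encode the windows (x) and the character each window should predict (y).
    x = np.zeros((len(sequences), max_sequence_length_gru, len(chars)), dtype=bool)
    y = np.zeros((len(sequences), len(chars)), dtype=bool)
    for i, seq in enumerate(sequences):
        for t, ch in enumerate(seq):
            x[i, t, char_to_idx[ch]] = True
        y[i, char_to_idx[next_chars[i]]] = True

    model = Sequential([
        GRU(n_neurons_gru, input_shape=(max_sequence_length_gru, len(chars))),
        Dense(len(chars), activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model, x, y
```

Generation then starts from starting_text and samples max_len_chars characters one at a time, scaling the predicted distribution by temperature before each draw.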
- Clone the git repo from GitHub to your local system: git clone https://github.com/SauravPattnaikCS60/Cross-Over-Alpha.git
- Set up your environment by installing the necessary dependencies from the requirements.txt file: pip install -r requirements.txt
- Head over to the Data Files folder in the cloned repository on your local system. You can either select two files that are already present, or place your desired files there. The input file names are passed through the config file: "filename1": "twilight.pdf", "filename2": "hunger_games.pdf"
- The config file can also be found in the Data Files folder. Please make the required changes in the config file to suit your requirements, but don't change the name of the file or its keys. For more information on the parameters that can be passed through this file, please check the parameters section above.
- Open the project in a suitable code editor like PyCharm or VSCode and run the main file.
- You may check the results in the Results folder and the intermediate files in the SavedFiles folder. For more information on which files are saved, please check the caching notes above.
- main - The entrypoint of the project. This file acts as an orchestration layer, running all the different modules, saving the necessary files and storing the outputs.
- read_pdf - Reads the input files.
- preprocessing - Performs basic preprocessing on the inputs.
- similarity_pipeline - Runs the similarity module on the inputs to generate pairs of similar sentences based on a threshold.
- ner_swaps - Computes the NER swaps between the pairs of similar sentences.
- LSTM_pipeline - Trains the character-level (GRU) model on the training data.
- GPT2_pipeline - Fine-tunes the GPT2 model on the training data.
- utils - Helper functions to save and read intermediate files.
- run_lm_finetuning - Used for fine-tuning the GPT2 model; this file is taken from the Hugging Face repository.
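For the GPT2 branch, generation from the fine-tuned checkpoint roughly follows the standard Hugging Face pattern below. This is a sketch rather than the exact code in GPT2_pipeline; the local folder name gpt2_finetuned is an assumption, and max_length is counted in tokens, standing in for the max_length_words setting:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def generate_crossover(starting_text, max_length_words=500, temperature=0.5,
                       model_dir="gpt2_finetuned"):
    """Generate crossover text from a locally saved, fine-tuned GPT2 checkpoint."""
    tokenizer = GPT2Tokenizer.from_pretrained(model_dir)
    model = GPT2LMHeadModel.from_pretrained(model_dir)

    inputs = tokenizer(starting_text, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        do_sample=True,                       # sample instead of greedy decoding
        temperature=temperature,              # lower -> safer text, higher -> more random
        max_length=max_length_words,          # length budget for the generated story
        pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad-token warning
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```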
There are primarily two future plans for the CrossOver project -
- Enhancements
- Making it open source
Some of the enhancements that are in progress are -
- Optimizing the project - The similarity module and training take a significant chunk of the time. To get good outputs we might need to run the pipeline for 3-5 hours, assuming you have a good laptop/desktop. We are currently looking at ways to reduce the time taken.
- Better crossover generation by building better training sets - Similarity-backed NER swaps did help mix characters from both inputs into a crossover. However, significant work is still needed to improve the grammar and meaning of the generated crossover, and further work is needed to bring the GRU-based pipeline up to par with the GPT2-based one.
- Web application deployment - With the above two critical tasks done, the final aim will be to make the overall pipeline deployable as a web application.
We plan to make the project open source so that we can complete the planned enhancements more quickly.
MIT License