Repository to store notebooks and data used to generate realistic text using a model trained on reddit data.
Example data sets and models are included in this repository. For an exploration of the other, more complex models used in this project, please see /text_gen/ on this page. NOTE: this repo is a cleaned example version of the project; if you wish to view the full repo, expect less documentation.
Feel free to download your own copy, or submit a pull request.
This project uses recurrent neural networks and text downloaded from Reddit to produce language models. It was intended as a fun exercise in implementing some theory, and as such the models are trained on relatively small data sets and for short periods of time.
There are four main components to this project, as follows:
- Data gathering
- Data pre-processing
- Model training and text generation
- Feature-Target analysis
To run the code in this repo you will need common Python data analysis modules (numpy, matplotlib, etc.). In addition, you will need keras, tensorflow and scikit-learn.
TODO: add requirements.txt
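Until that file is added, the following is a rough, unpinned sketch of the dependencies mentioned in this README (versions are not specified here):

```
numpy
matplotlib
keras
tensorflow
scikit-learn
praw
psaw
```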
A collection of raw data is provided within /raw_data/, with all text posts in one file, separated by new lines.
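For orientation, a raw file can be read back into a list of posts with something like the sketch below; the filename is a placeholder, not an actual file in /raw_data/:

```python
# Read a raw data file: one Reddit text post per line (placeholder filename).
with open("raw_data/example_subreddit.txt", encoding="utf-8") as f:
    posts = [line.strip() for line in f if line.strip()]

print(f"Loaded {len(posts)} posts")
print(posts[0][:80])  # preview the start of the first post
```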
To collect new data from different subreddits, or of different sizes / date ranges, you will need to run /data_gathering/reddit_download.ipynb.
I have excluded the authentication tokens and logins used to do this; a user can generate their own.
The download modules used interface with praw and psaw. However, your device needs to be authenticated using the method described in Reddit's API documentation. How to do this can be found here.
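A minimal, hedged sketch of the authentication and download step is shown below; the client id, secret, user agent, subreddit and post limit are all placeholders you supply yourself, and the real notebook may structure this differently:

```python
import praw
from psaw import PushshiftAPI

# Credentials come from a Reddit "script" app created in your account
# preferences; the values below are placeholders, not part of this repo.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="text_gen data gathering script",
)
api = PushshiftAPI(reddit)

# Pull a small batch of submissions from a subreddit of your choosing.
submissions = list(api.search_submissions(subreddit="AskReddit", limit=100))
print(len(submissions), "posts downloaded")
```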
The text data downloaded contains metadata in the file title. This is described within the notebook itself.
The raw data is, unsurprisingly, not suitable to be passed directly to a neural network. There are two main reasons for processing the data:
- convert it into a structure suitable for training
- clean and transform the data in accordance with the task in mind
Which raw file to process is selected at the top of the /src/pre-processing.ipynb notebook.
NOTE: some cells will take a significant amount of memory and time to run. The largest data set I was able to process on a laptop was 50,000 posts; there may be unknown bottlenecks above this, so I would recommend going above 20,000 posts at your own risk.
Parameters such as the sequence length and the minimum frequency of word occurrence can be specified.
Processed data is output as a .pickle file. This is done to preserve all structure, and to pass parameters (such as sequence length) and objects (such as the tokeniser used) on to the model.
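As an illustration of the shape of this output, the sketch below tokenises a few stand-in posts, builds fixed-length feature-target windows, and pickles the results together with the parameters and tokeniser; the values and filename are illustrative and the notebook itself may differ:

```python
import pickle
from tensorflow.keras.preprocessing.text import Tokenizer

SEQ_LENGTH = 10  # illustrative value, not necessarily the notebook's default
posts = ["example cleaned post text goes here", "another cleaned post"]  # stand-in for a raw data file

# Fit a tokenizer on the cleaned posts and encode them as integer sequences.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(posts)
encoded = tokenizer.texts_to_sequences(posts)

# Build fixed-length windows: SEQ_LENGTH feature words followed by one target word.
sequences = []
for seq in encoded:
    for i in range(SEQ_LENGTH, len(seq)):
        sequences.append(seq[i - SEQ_LENGTH : i + 1])

# Persist the sequences together with the parameters and tokenizer the
# training notebook needs to rebuild its inputs.
with open("processed_example.pickle", "wb") as f:
    pickle.dump(
        {"sequences": sequences, "seq_length": SEQ_LENGTH, "tokenizer": tokenizer},
        f,
    )
```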
To train the model, the processed data is required. The values in the name of the .pickle file give information about the data contained within (size, sequence length, etc.).
The Keras model used is Sequential. To better model the complexity of the data, feel free to increase the embedding size and the number of GRU units.
For the 5,000-post data set, training takes around 10 minutes with early stopping on my machine.
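The sketch below shows one way such a Sequential model can be put together; the layer sizes, optimiser, and training settings are illustrative rather than the notebook's exact configuration, and the pickle filename is a placeholder:

```python
import pickle
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense
from tensorflow.keras.callbacks import EarlyStopping

# Load a processed pickle (placeholder filename).
with open("processed_example.pickle", "rb") as f:
    data = pickle.load(f)

sequences = np.array(data["sequences"])
vocab_size = len(data["tokenizer"].word_index) + 1

# Split each window into SEQ_LENGTH feature words and one target word.
X, y = sequences[:, :-1], sequences[:, -1]

# A small Sequential model: increasing the embedding size and the number of
# GRU units lets the model capture more of the data's complexity, at the
# cost of longer training.
model = Sequential([
    Embedding(vocab_size, 50),
    GRU(100),
    Dense(vocab_size, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam",
              metrics=["accuracy"])

early_stop = EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)
model.fit(X, y, validation_split=0.1, epochs=100, batch_size=128,
          callbacks=[early_stop])
```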
I was interested in the limit of prediction ability with this data set. Using the processed data, some analysis of unique feature-target pairs is conducted. For 5,000 posts, approximately 25% have multiple possible feature-target pairs. This increases with corpus size.
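One way to quantify this ambiguity, assuming the processed windows described above, is sketched below; this is not necessarily the notebook's exact method, and the pickle filename is again a placeholder:

```python
import pickle
from collections import defaultdict

# Load the processed feature-target windows (placeholder filename).
with open("processed_example.pickle", "rb") as f:
    sequences = pickle.load(f)["sequences"]

# Group windows by their feature part and collect the distinct target words
# each feature is followed by in the corpus.
targets_per_feature = defaultdict(set)
for window in sequences:
    targets_per_feature[tuple(window[:-1])].add(window[-1])

# Fraction of unique feature sequences that map to more than one target.
ambiguous = sum(1 for targets in targets_per_feature.values() if len(targets) > 1)
print(f"{ambiguous / len(targets_per_feature):.1%} of unique features have multiple possible targets")
```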