Primary language: Jupyter Notebook. License: Apache License 2.0 (Apache-2.0).

Predict Subreddit

An NLP model that predicts subreddit based on the title of a post.

Play with it on HuggingFace Space

Post on r/MachineLearning

Data Collection

The model was trained using the titles of the top 1000 posts from the top 250 subreddits scraped using PRAW.

Dataset hosted on HuggingFace

Steps to create the dataset using the dataset.py script:

  • Install the requirements using pip install -r requirements.txt
  • Create a .env file containing your Reddit authentication info, like this:
ID = <YOUR_ID>
SECRET = <YOUR_SECRET>
AGENT = <YOUR_AGENT>
  • Now run the script to create the dataset, like this:
python3 dataset.py <npage> <dfilename>

npage is the number of pages of top subreddits to scrape from redditlist.com (1 page => 125 subreddits, so 2 pages cover the top 250), and dfilename is the name of the CSV file to save the dataset to.

  • After the above steps are run, a CSV file will be created under the given filename, consisting of title and subreddit pairs.
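
The steps above can be sketched roughly as follows. This is a minimal illustration and not the actual contents of dataset.py: the helper names (load_env, make_client, collect_pairs, write_pairs), the choice of all-time top posts, and the CSV column names are assumptions. Only the ID, SECRET and AGENT keys and the title/subreddit pairing come from the description above.

```python
import csv
from pathlib import Path


def load_env(path=".env"):
    """Parse KEY = VALUE lines from a .env file into a dict."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def make_client(env):
    """Build a PRAW client from the ID, SECRET and AGENT entries."""
    import praw  # imported lazily so the other helpers work without PRAW

    return praw.Reddit(
        client_id=env["ID"],
        client_secret=env["SECRET"],
        user_agent=env["AGENT"],
    )


def collect_pairs(reddit, subreddit_names, limit=1000):
    """Yield (title, subreddit) pairs from each subreddit's top posts."""
    for name in subreddit_names:
        # Assumption: "top" means the all-time top listing.
        for post in reddit.subreddit(name).top(time_filter="all", limit=limit):
            yield post.title, name


def write_pairs(pairs, filename):
    """Write the pairs to a CSV file with a header row (assumed column names)."""
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "subreddit"])
        writer.writerows(pairs)
```

With real credentials, something like write_pairs(collect_pairs(make_client(load_env()), names), "dataset.csv") would produce the file, where names is the list of subreddit names obtained from redditlist.com (that scraping step is not shown here).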

Modelling

DistilBERT, from HuggingFace Transformers, is fine-tuned on the dataset of post titles labelled with their respective subreddits.

For the steps to build the model, check out the model notebook in the repo or open it in Colab.
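
As a rough idea of what setting up such a fine-tuning run involves, the sketch below maps subreddit names to integer class ids and loads DistilBERT with a classification head of matching size. It is not taken from the notebook: the helper names and the "distilbert-base-uncased" checkpoint are assumptions, and the training loop itself is omitted.

```python
def build_label_maps(subreddits):
    """Map each distinct subreddit name to an integer id and back."""
    names = sorted(set(subreddits))
    label2id = {name: i for i, name in enumerate(names)}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label


def load_model(label2id, id2label, checkpoint="distilbert-base-uncased"):
    """Load a tokenizer and a DistilBERT classifier with one output per subreddit.

    The checkpoint name is an assumption; the README only says "DistilBERT".
    """
    # Imported lazily so build_label_maps works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint,
        num_labels=len(label2id),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

Passing id2label and label2id to the model stores the subreddit names in its config, so predictions can be reported as names instead of raw class indices.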

Model hosted on HuggingFace

Examples

Limitations and bias

  • Since the model is trained on the top 250 subreddits (for reference), it can only categorise posts within those subreddits.
  • Some subreddits have a specific format for their post titles, like r/todayilearned, where titles start with "TIL", so the model becomes biased towards "TIL" --> r/todayilearned. This bias can be reduced by cleaning the dataset of these specific terms.
  • In some subreddits like r/gifs, the title of the post doesn't matter much, so the model performs poorly on them.
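
The cleaning idea mentioned for format-specific titles could look something like the sketch below, which strips a leading "TIL" marker before training. The FORMAT_PREFIXES tuple and the function name are hypothetical, and this step is not part of the published dataset.

```python
import re

# Hypothetical list of subreddit-specific title prefixes to remove.
FORMAT_PREFIXES = ("TIL",)

# Match a listed prefix at the start of the title, followed by a word
# boundary and any trailing separators (spaces, colons, commas, hyphens).
_PREFIX_RE = re.compile(
    r"^\s*(?:" + "|".join(re.escape(p) for p in FORMAT_PREFIXES) + r")\b[\s:,\-]*"
)


def strip_format_prefix(title):
    """Remove a leading format marker such as "TIL" from a post title."""
    return _PREFIX_RE.sub("", title, count=1)
```

The match is case-sensitive and anchored to the start of the title, so words that merely begin with the same letters (for example "Tilted") are left alone.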

Contributing

If you want to contribute code, simply create a pull request. If you have an idea, create an issue and the developers will look into it!