Overview • Data Collection • Data Preprocessing • Model Training • Compression • Model Deployment • Web Deployment • Build from Source • Contact
An end-to-end multi-label text classification project, from data collection to model training and deployment. The model can classify 160 different types of question tags from https://scifi.stackexchange.com. The keys of `tag_types_encoded.json` show the list of question tags.
Data was collected from https://scifi.stackexchange.com/questions, a part of the Stack Exchange network of Q&A sites dedicated to science fiction and fantasy topics. Some popular question tags include Star Wars, Harry Potter, Marvel, DC Comics, Star Trek, Lord of the Rings, and Game of Thrones. The data collection was divided into two steps:
- Question URL Scraping: The scifi question URLs were scraped with `question_url_scraper.py`, and the URLs are stored along with the question titles in the `question_urls.csv` file. Scroll to this section for details.
- Fetching Question Details: For each question URL in `question_urls.csv`, the question details (title, URL, description, tags) were fetched with `fetch_question_detail.py`. The question details are stored in the `question_details.csv` file. Alternatively, `question_detail_scraper.py` could be used to scrape the question details. Scroll to this section for details.
In total, 30,000 scifi question details were collected.
Initially, there were 2,095 different question tags in the dataset. After analysis, 1,935 of them turned out to be rare tags (tags that appeared in less than 0.2% of the questions), so the rare tags were removed. As a result, a small portion of the questions were left without any tag at all, so those question rows were removed as well. Finally, the dataset had 160 different tags across 27,493 questions.
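For reference, this kind of rare-tag filtering takes only a few lines of pandas. The sketch below assumes the tags of each question are stored as a delimiter-separated string in a `tags` column; the column name and the `|` delimiter are assumptions, not the repo's actual schema:

```python
import pandas as pd

# Assumed schema: a "tags" column holding strings like "star-wars|harry-potter"
df = pd.read_csv("question_details.csv")
df["tags"] = df["tags"].str.split("|")

# Count how often each tag occurs across all questions
tag_counts = df["tags"].explode().value_counts()

# Keep only tags that appear in at least 0.2% of the questions
min_count = int(0.002 * len(df))
frequent_tags = set(tag_counts[tag_counts >= min_count].index)

# Drop rare tags from each question, then drop questions left with no tags
df["tags"] = df["tags"].apply(lambda ts: [t for t in ts if t in frequent_tags])
df = df[df["tags"].map(len) > 0].reset_index(drop=True)

print(f"{len(frequent_tags)} tags kept across {len(df)} questions")
```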
Three different models from HuggingFace Transformers (distilroberta-base, roberta-base, and bert-base-uncased) were fine-tuned using Fastai and Blurr. All of the models achieved 99%+ accuracy.
The model training notebooks can be viewed here.
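The notebooks use Fastai and Blurr; for orientation, here is what the same multi-label objective looks like in plain HuggingFace Transformers. This is an illustrative sketch, not the repo's training code; the `text`/`labels` feature names and the hyperparameters are assumptions:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=160,                             # one output per kept tag
    problem_type="multi_label_classification",  # sigmoid + BCE loss per tag
)

def encode(batch):
    # Hypothetical features: "text" holds title + description,
    # "labels" is a 160-dim multi-hot vector per question
    enc = tokenizer(batch["text"], truncation=True, max_length=256,
                    padding="max_length")
    enc["labels"] = [[float(x) for x in l] for l in batch["labels"]]
    return enc

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=16)
# With train_ds/valid_ds as encoded datasets.Dataset objects:
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=valid_ds)
# trainer.train()
```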
The trained models each required between 300 and 500 MB of storage. So, the models were compressed using ONNX quantization, which reduced their sizes to between 80 and 120 MB (a sketch of the quantization step follows the table). Following are the key performance metrics for each of the models and their compressed versions:
| | distilroberta-base | distilroberta-base (quantized) | roberta-base | roberta-base (quantized) | bert-base-uncased | bert-base-uncased (quantized) |
|---|---|---|---|---|---|---|
| F1 Score (Micro) | 0.745 | 0.745 | 0.783 | 0.782 | 0.715 | 0.717 |
| F1 Score (Macro) | 0.278 | 0.277 | 0.521 | 0.518 | 0.146 | 0.150 |
| Size | 314 MB | 79 MB | 476 MB | 120 MB | 419 MB | 105 MB |
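The compression step itself is small: after exporting a fine-tuned model to ONNX, ONNX Runtime's dynamic quantization converts the weights to int8. A minimal sketch, with placeholder file names:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic (post-training) quantization: weights are stored as int8 and
# activations are quantized on the fly at inference time, cutting the
# on-disk size roughly 4x with little accuracy loss.
quantize_dynamic(
    model_input="model.onnx",             # exported model (placeholder name)
    model_output="model-quantized.onnx",  # compressed output (placeholder name)
    weight_type=QuantType.QInt8,
)
```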
`distilroberta-base` (quantized), with 99.37% accuracy, was the final compressed model deployed to a HuggingFace Spaces Gradio app. The implementation can be found in the deployment folder or here.
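A Gradio tagging demo along these lines fits in a few lines of code. This is a sketch with assumed file names, ONNX input names, and a placeholder label list, not the repo's actual deployment code (see the deployment folder for that):

```python
import gradio as gr
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
session = ort.InferenceSession("model-quantized.onnx")  # assumed file name

# Placeholder tag names; the real ones are the keys of tag_types_encoded.json
labels = [f"tag_{i}" for i in range(160)]

def classify(question: str) -> dict:
    enc = tokenizer(question, truncation=True, max_length=256,
                    return_tensors="np")
    # Input names depend on how the model was exported to ONNX
    logits = session.run(None, {"input_ids": enc["input_ids"],
                                "attention_mask": enc["attention_mask"]})[0][0]
    probs = 1 / (1 + np.exp(-logits))  # sigmoid: tags are independent
    return {label: float(p) for label, p in zip(labels, probs)}

demo = gr.Interface(fn=classify,
                    inputs=gr.Textbox(label="Scifi/fantasy question"),
                    outputs=gr.Label(num_top_classes=5))
# demo.launch()
```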
A Flask webapp was developed and deployed to Vercel. It takes scifi and fantasy questions as input and classifies the relevant tags associated with the question via the HuggingFace API; a sketch of the API call follows the list below. The webapp is live here.
- The webapp takes scifi and fantasy questions as input:
- It utilizes HuggingFace API to classify the relevant tags:
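On the server side, the call to the hosted HuggingFace Inference API boils down to a single POST request. A sketch, where the model id and the `HF_API_TOKEN` environment variable are placeholders:

```python
import os
import requests

# Placeholder model id; substitute the actual deployed model
API_URL = "https://api-inference.huggingface.co/models/<user>/<model>"
HEADERS = {"Authorization": f"Bearer {os.environ['HF_API_TOKEN']}"}

def classify_tags(question: str):
    resp = requests.post(API_URL, headers=HEADERS, json={"inputs": question})
    resp.raise_for_status()
    return resp.json()  # label/score pairs for the predicted tags

# classify_tags("Why didn't Gandalf fly the eagles to Mordor?")
```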
- Clone the repo
git clone https://github.com/zzarif/Multilabel-Scifi-Tags-Classifier.git
cd Multilabel-Scifi-Tags-Classifier/
- Initialize and activate virtual environment
virtualenv venv
source venv/Scripts/activate
- Install dependencies
pip install -r requirements.txt
Note: Select the virtual environment interpreter from the Ctrl+Shift+P command palette.
python scraper/question_url_scraper.py
Wait for the script to finish (it might take a few hours depending on your network bandwidth). When complete, this will generate the question_urls.csv file, which contains 30,000 StackExchange scifi question titles and URLs (a rough sketch of this kind of scraping step is shown below).
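For orientation, here is a condensed sketch of what scraping the question list involves: walking the paginated question pages and collecting each title and link. The CSS selector and pagination details are assumptions about the Stack Exchange markup, not the repo's actual code:

```python
import csv
import time

import requests
from bs4 import BeautifulSoup

BASE = "https://scifi.stackexchange.com"
rows = []
for page in range(1, 4):  # the real script walks many more pages
    html = requests.get(f"{BASE}/questions",
                        params={"tab": "newest", "page": page}).text
    soup = BeautifulSoup(html, "html.parser")
    # Selector is an assumption about the current SE question-list markup
    for a in soup.select("a.s-link[href^='/questions/']"):
        rows.append({"title": a.get_text(strip=True), "url": BASE + a["href"]})
    time.sleep(1)  # be polite between requests

with open("question_urls.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)
```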
Now, we need to fetch the question details (description, tags, etc.) for each of these question URLs. Fetching details from 30,000 question URLs is a rather resource-intensive task. To fetch the details efficiently, we can either request the question details via the Stack API (recommended) or scrape the details with a Selenium scraper. (There are other ways too, as mentioned here.)
Method 1: Request question details via Stack API
This method explains how we can utilize StackExchange REST APIs to request 30,000 question details. To do so:
- Register your v2.0 application at Stackapps to get an API key.
- `deactivate` your active virtual environment.
- Open your `venv/Scripts/activate` file and add this line at the end of the file (replace `<your_api_key>` with the API key from Stackapps):
export STACK_API_KEY="<your_api_key>"
- Activate the virtual environment again:
source venv/Scripts/activate
- Now, fetch the question details via Stack API:
python stackapi/fetch_question_detail.py
Wait for the script to finish (it might take a few hours depending on your network bandwidth). The script might get interrupted midway, because the Stack API is throttled to a maximum of 10,000 calls per day for registered apps and also allows only a limited number of calls within a given timeframe. In that case, simply wait and re-run the script the next day, and it will resume from where it got interrupted. When complete, this will generate the question_details.csv file, which has the details (title, URL, description, tags) of 30,000 scifi and fantasy questions from StackExchange.
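Under the hood, a fetcher like this batches question ids against the public `/questions/{ids}` endpoint (up to 100 ids per call) and honors the API's `backoff` signal. A simplified sketch, not the repo's exact script:

```python
import os
import re
import time

import pandas as pd
import requests

API = "https://api.stackexchange.com/2.3/questions/{ids}"
KEY = os.environ.get("STACK_API_KEY")  # set in venv/Scripts/activate above

urls = pd.read_csv("question_urls.csv")["url"]
ids = [re.search(r"/questions/(\d+)", u).group(1) for u in urls]

items = []
for i in range(0, len(ids), 100):  # the API accepts up to 100 ids per call
    batch = ";".join(ids[i:i + 100])
    resp = requests.get(API.format(ids=batch),
                        params={"site": "scifi", "filter": "withbody",
                                "key": KEY})
    data = resp.json()
    for q in data.get("items", []):
        items.append({"title": q["title"], "url": q["link"],
                      "description": q["body"], "tags": "|".join(q["tags"])})
    if data.get("backoff"):  # the API asks clients to pause between calls
        time.sleep(data["backoff"])

pd.DataFrame(items).to_csv("question_details.csv", index=False)
```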
Method 2: Scrape question details with Selenium

python scraper/question_detail_scraper.py
This method scrapes question details using `selenium` and `multiprocessing`. This is how the scraping is done:
- The script reads the question_urls.csv file containing the question URLs.
- It divides the URLs into chunks based on the number of CPU cores available.
- For each chunk, it creates a separate process to scrape the question details concurrently.
- Each process uses `selenium` to navigate to each question URL, scrape the title, description, and tags, and store the data in a list.
- After scraping all the questions in a chunk, the process saves the data as a CSV file specific to that chunk.
- The script waits for all processes to finish before terminating.
When complete, we have to merge all the chunk-specific CSV files into one question_details.csv file (a condensed sketch of the whole scheme follows). By utilizing `multiprocessing`, the script can scrape multiple question details simultaneously, improving the overall efficiency of the scraping process. However, this method is often unreliable due to SE's screen-scraping guidelines, as mentioned here, and poses the potential risk of an IP range ban.
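To make the scheme above concrete, here is a condensed sketch of the chunked Selenium + multiprocessing approach, including the final merge. CSS selectors and file names are assumptions, not the repo's exact code:

```python
import glob
import os
from multiprocessing import Pool

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_chunk(args):
    chunk_id, urls = args
    driver = webdriver.Chrome()  # one browser instance per worker process
    rows = []
    for url in urls:
        driver.get(url)
        # Selectors are assumptions about the Stack Exchange question markup
        title = driver.find_element(By.CSS_SELECTOR, "#question-header h1 a").text
        body = driver.find_element(By.CSS_SELECTOR, ".s-prose").text
        tags = [t.text for t in driver.find_elements(By.CSS_SELECTOR, ".post-tag")]
        rows.append({"title": title, "url": url,
                     "description": body, "tags": "|".join(tags)})
    driver.quit()
    pd.DataFrame(rows).to_csv(f"details_chunk_{chunk_id}.csv", index=False)

if __name__ == "__main__":
    urls = pd.read_csv("question_urls.csv")["url"].tolist()
    n = os.cpu_count()
    chunks = [(i, urls[i::n]) for i in range(n)]  # one chunk per CPU core
    with Pool(n) as pool:
        pool.map(scrape_chunk, chunks)
    # Merge the per-chunk files into the final question_details.csv
    merged = pd.concat(map(pd.read_csv, glob.glob("details_chunk_*.csv")))
    merged.to_csv("question_details.csv", index=False)
```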