Overview • Data Collection • Data Preprocessing • Model Training • Compression • Model Deployment • Web Deployment • Build from Source • Contact
An end-to-end multi-label text classification project, from data collection to model training and deployment. The model can classify 160 different types of question tags from https://scifi.stackexchange.com. The keys of `tag_types_encoded.json` show the list of question tags.
Data was collected from https://scifi.stackexchange.com/questions, a part of the Stack Exchange network of Q&A sites dedicated to science fiction and fantasy topics. Some popular question tags include Star Wars, Harry Potter, Marvel, DC Comics, Star Trek, Lord of the Rings, and Game of Thrones. The data collection was divided into two steps:
- Question URL Scraping: The scifi question URLs were scraped with `question_url_scraper.py`, and the URLs are stored along with the question titles in the `question_urls.csv` file. Scroll to this section for details.
- Fetching Question Details: For each question URL in `question_urls.csv`, the question details (title, URL, description, tags) were fetched with `fetch_question_detail.py`. The question details are stored in the `question_details.csv` file. Alternatively, `question_detail_scraper.py` could be used to scrape the question details. Scroll to this section for details.
In total, 30,000 scifi question details were collected.
Initially, there were 2,095 different question tags in the dataset. After analysis, 1,935 of them turned out to be rare tags (tags that appeared in less than 0.2% of the questions), so the rare tags were removed. As a result, a small portion of the questions were left without any tag at all, so those question rows were removed as well. Finally, the dataset had 160 different tags across 27,493 questions.
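For reference, this kind of rare-tag filtering takes only a few lines of pandas. The sketch below assumes the tags of each question are stored as a delimiter-separated string in a `tags` column; the column name and the `|` delimiter are assumptions, not the repo's actual schema:

```python
import pandas as pd

# Assumed schema: a "tags" column holding strings like "star-wars|harry-potter"
df = pd.read_csv("question_details.csv")
df["tags"] = df["tags"].str.split("|")

# Count how often each tag occurs across all questions
tag_counts = df["tags"].explode().value_counts()

# Keep only tags that appear in at least 0.2% of the questions
min_count = int(0.002 * len(df))
frequent_tags = set(tag_counts[tag_counts >= min_count].index)

# Drop rare tags from each question, then drop questions left with no tags
df["tags"] = df["tags"].apply(lambda ts: [t for t in ts if t in frequent_tags])
df = df[df["tags"].map(len) > 0].reset_index(drop=True)

print(f"{len(frequent_tags)} tags kept across {len(df)} questions")
```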
Three different models from HuggingFace Transformers (distilroberta-base, roberta-base, and bert-base-uncased) were fine-tuned using Fastai and Blurr. All of the models achieved 99%+ accuracy.
The model training notebooks can be viewed here.
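The notebooks use Fastai and Blurr; for orientation, here is what the same multi-label objective looks like in plain HuggingFace Transformers. This is an illustrative sketch, not the repo's training code; the `text`/`labels` feature names and the hyperparameters are assumptions:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=160,                             # one output per kept tag
    problem_type="multi_label_classification",  # sigmoid + BCE loss per tag
)

def encode(batch):
    # Hypothetical features: "text" holds title + description,
    # "labels" is a 160-dim multi-hot vector per question
    enc = tokenizer(batch["text"], truncation=True, max_length=256,
                    padding="max_length")
    enc["labels"] = [[float(x) for x in l] for l in batch["labels"]]
    return enc

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=16)
# With train_ds/valid_ds as encoded datasets.Dataset objects:
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=valid_ds)
# trainer.train()
```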
The trained models each required between 300 and 500 MB of storage. So, the models were compressed using ONNX quantization, which reduced their sizes to between 80 and 120 MB (a sketch of the quantization step follows the table). Following are the key performance metrics for each of the models and their compressed versions:
| | distilroberta-base | distilroberta-base (quantized) | roberta-base | roberta-base (quantized) | bert-base-uncased | bert-base-uncased (quantized) |
|---|---|---|---|---|---|---|
| F1 Score (Micro) | 0.745 | 0.745 | 0.783 | 0.782 | 0.715 | 0.717 |
| F1 Score (Macro) | 0.278 | 0.277 | 0.521 | 0.518 | 0.146 | 0.150 |
| Size | 314 MB | 79 MB | 476 MB | 120 MB | 419 MB | 105 MB |
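The compression step itself is small: after exporting a fine-tuned model to ONNX, ONNX Runtime's dynamic quantization converts the weights to int8. A minimal sketch, with placeholder file names:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic (post-training) quantization: weights are stored as int8 and
# activations are quantized on the fly at inference time, cutting the
# on-disk size roughly 4x with little accuracy loss.
quantize_dynamic(
    model_input="model.onnx",             # exported model (placeholder name)
    model_output="model-quantized.onnx",  # compressed output (placeholder name)
    weight_type=QuantType.QInt8,
)
```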
`distilroberta-base` (quantized), with 99.37% accuracy, was the final compressed model deployed to a HuggingFace Spaces Gradio app. The implementation can be found in the deployment folder or here.
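A Gradio tagging demo along these lines fits in a few lines of code. This is a sketch with assumed file names, ONNX input names, and a placeholder label list, not the repo's actual deployment code (see the deployment folder for that):

```python
import gradio as gr
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
session = ort.InferenceSession("model-quantized.onnx")  # assumed file name

# Placeholder tag names; the real ones are the keys of tag_types_encoded.json
labels = [f"tag_{i}" for i in range(160)]

def classify(question: str) -> dict:
    enc = tokenizer(question, truncation=True, max_length=256,
                    return_tensors="np")
    # Input names depend on how the model was exported to ONNX
    logits = session.run(None, {"input_ids": enc["input_ids"],
                                "attention_mask": enc["attention_mask"]})[0][0]
    probs = 1 / (1 + np.exp(-logits))  # sigmoid: tags are independent
    return {label: float(p) for label, p in zip(labels, probs)}

demo = gr.Interface(fn=classify,
                    inputs=gr.Textbox(label="Scifi/fantasy question"),
                    outputs=gr.Label(num_top_classes=5))
# demo.launch()
```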
A Flask webapp was developed and deployed to Vercel. It takes scifi and fantasy questions as input and classifies the relevant tags associated with the question via the HuggingFace API; a sketch of the API call follows the list below. The webapp is live here.
- The webapp takes scifi and fantasy questions as input:
- It utilizes HuggingFace API to classify the relevant tags:
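On the server side, the call to the hosted HuggingFace Inference API boils down to a single POST request. A sketch, where the model id and the `HF_API_TOKEN` environment variable are placeholders:

```python
import os
import requests

# Placeholder model id; substitute the actual deployed model
API_URL = "https://api-inference.huggingface.co/models/<user>/<model>"
HEADERS = {"Authorization": f"Bearer {os.environ['HF_API_TOKEN']}"}

def classify_tags(question: str):
    resp = requests.post(API_URL, headers=HEADERS, json={"inputs": question})
    resp.raise_for_status()
    return resp.json()  # label/score pairs for the predicted tags

# classify_tags("Why didn't Gandalf fly the eagles to Mordor?")
```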
- Clone the repo
git clone https://github.com/zzarif/Multilabel-Scifi-Tags-Classifier.git
cd Multilabel-Scifi-Tags-Classifier/
- Initialize and activate virtual environment
virtualenv venv
source venv/Scripts/activate
- Install dependencies
pip install -r requirements.txt
Note: Select the virtual environment interpreter from the Ctrl+Shift+P command palette.
python scraper/question_url_scraper.py
Wait for the script to finish (it might take a few hours depending on your network bandwidth). When complete, this will generate the question_urls.csv file, which contains 30,000 StackExchange scifi question titles and URLs (a rough sketch of this kind of scraping step is shown below).
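For orientation, here is a condensed sketch of what scraping the question list involves: walking the paginated question pages and collecting each title and link. The CSS selector and pagination details are assumptions about the Stack Exchange markup, not the repo's actual code:

```python
import csv
import time

import requests
from bs4 import BeautifulSoup

BASE = "https://scifi.stackexchange.com"
rows = []
for page in range(1, 4):  # the real script walks many more pages
    html = requests.get(f"{BASE}/questions",
                        params={"tab": "newest", "page": page}).text
    soup = BeautifulSoup(html, "html.parser")
    # Selector is an assumption about the current SE question-list markup
    for a in soup.select("a.s-link[href^='/questions/']"):
        rows.append({"title": a.get_text(strip=True), "url": BASE + a["href"]})
    time.sleep(1)  # be polite between requests

with open("question_urls.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "url"])
    writer.writeheader()
    writer.writerows(rows)
```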
Now, we need to fetch the question details (description, tags, etc.) for each of these question URLs. Fetching details from 30,000 question URLs is a rather resource-intensive task. To fetch the details efficiently, we can either request the question details via the Stack API (recommended) or scrape the details with a Selenium scraper. (There are other ways too, as mentioned here.)
Method 1: Request question details via Stack API
This method explains how we can utilize StackExchange REST APIs to request 30,000 question details. To do so:
- Register your v2.0 application at Stackapps to get an API key.
- `deactivate` your active virtual environment.
- Open your `venv/Scripts/activate` file and add this line at the end of the file (replace `<your_api_key>` with the API key from Stackapps):
export STACK_API_KEY="<your_api_key>"
- Activate the virtual environment again:
source venv/Scripts/activate
- Now, fetch the question details via Stack API:
python stackapi/fetch_question_detail.py
Wait for the script to finish (it might take a few hours depending on your network bandwidth). The script might get interrupted midway, because the Stack API is throttled to a maximum of 10,000 calls per day for registered apps and also allows only a limited number of calls within a given timeframe. In that case, simply wait and re-run the script the next day, and it will resume from where it got interrupted. When complete, this will generate the question_details.csv file, which has the details (title, URL, description, tags) of 30,000 scifi and fantasy questions from StackExchange.
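Under the hood, a fetcher like this batches question ids against the public `/questions/{ids}` endpoint (up to 100 ids per call) and honors the API's `backoff` signal. A simplified sketch, not the repo's exact script:

```python
import os
import re
import time

import pandas as pd
import requests

API = "https://api.stackexchange.com/2.3/questions/{ids}"
KEY = os.environ.get("STACK_API_KEY")  # set in venv/Scripts/activate above

urls = pd.read_csv("question_urls.csv")["url"]
ids = [re.search(r"/questions/(\d+)", u).group(1) for u in urls]

items = []
for i in range(0, len(ids), 100):  # the API accepts up to 100 ids per call
    batch = ";".join(ids[i:i + 100])
    resp = requests.get(API.format(ids=batch),
                        params={"site": "scifi", "filter": "withbody",
                                "key": KEY})
    data = resp.json()
    for q in data.get("items", []):
        items.append({"title": q["title"], "url": q["link"],
                      "description": q["body"], "tags": "|".join(q["tags"])})
    if data.get("backoff"):  # the API asks clients to pause between calls
        time.sleep(data["backoff"])

pd.DataFrame(items).to_csv("question_details.csv", index=False)
```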
Method 2: Scrape question details with Selenium

python scraper/question_detail_scraper.py
This method scrapes question details using `selenium` and `multiprocessing`. This is how the scraping is done:
- The script reads the question_urls.csv file containing the question URLs.
- It divides the URLs into chunks based on the number of CPU cores available.
- For each chunk, it creates a separate process to scrape the question details concurrently.
- Each process uses `selenium` to navigate to each question URL, scrape the title, description, and tags, and store the data in a list.
- After scraping all the questions in a chunk, the process saves the data as a CSV file specific to that chunk.
- The script waits for all processes to finish before terminating.
When complete, we have to merge all the chunk-specific CSV files into one question_details.csv file (a condensed sketch of the whole scheme follows). By utilizing `multiprocessing`, the script can scrape multiple question details simultaneously, improving the overall efficiency of the scraping process. However, this method is often unreliable due to SE's screen-scraping guidelines, as mentioned here, and poses the potential risk of an IP range ban.
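To make the scheme above concrete, here is a condensed sketch of the chunked Selenium + multiprocessing approach, including the final merge. CSS selectors and file names are assumptions, not the repo's exact code:

```python
import glob
import os
from multiprocessing import Pool

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_chunk(args):
    chunk_id, urls = args
    driver = webdriver.Chrome()  # one browser instance per worker process
    rows = []
    for url in urls:
        driver.get(url)
        # Selectors are assumptions about the Stack Exchange question markup
        title = driver.find_element(By.CSS_SELECTOR, "#question-header h1 a").text
        body = driver.find_element(By.CSS_SELECTOR, ".s-prose").text
        tags = [t.text for t in driver.find_elements(By.CSS_SELECTOR, ".post-tag")]
        rows.append({"title": title, "url": url,
                     "description": body, "tags": "|".join(tags)})
    driver.quit()
    pd.DataFrame(rows).to_csv(f"details_chunk_{chunk_id}.csv", index=False)

if __name__ == "__main__":
    urls = pd.read_csv("question_urls.csv")["url"].tolist()
    n = os.cpu_count()
    chunks = [(i, urls[i::n]) for i in range(n)]  # one chunk per CPU core
    with Pool(n) as pool:
        pool.map(scrape_chunk, chunks)
    # Merge the per-chunk files into the final question_details.csv
    merged = pd.concat(map(pd.read_csv, glob.glob("details_chunk_*.csv")))
    merged.to_csv("question_details.csv", index=False)
```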