
Objective: The goal of this assignment is to build a text classification model using the Hugging Face library to classify a dataset of text into one of multiple categories. Use a pre-trained model such as BERT or GPT-2 as a starting point and fine-tune it on the classification task.


NLP Engineer Assignment

Assignment: Text Classification using Hugging Face

Definition & Scope of the problem statement

Goal: Build a text classification model using the Hugging Face library to classify each document into one of multiple categories.
Dataset: text documents
Model: fine-tuned from a pre-trained language model such as BERT or GPT-2
Evaluation: held-out test set

Experimental setup

Dataset: IMDb dataset for sentiment analysis, containing 50,000 movie reviews labeled as either positive or negative.
Steps:
    Data preprocessing
    Fine-tuning a pre-trained language model
    Model evaluation

Explanation of the measurements/metrics used

Accuracy: Proportion of correctly classified instances over the total number of instances
Precision: Proportion of true positives over the total number of instances classified as positive
Recall: Proportion of true positives over the total number of actual positive instances
F1 score: Harmonic mean of precision and recall
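
The four definitions above can be sketched in plain Python for the binary case. This is a minimal illustration, not the notebook's code: the function name `classification_metrics`, the `positive` argument, and the convention of returning 0.0 when a denominator is zero are all assumptions made for the example.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for binary labels (hypothetical helper)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels: 1 = positive review, 0 = negative review.
print(classification_metrics([1, 1, 1, 0, 0, 0, 1, 0],
                             [1, 1, 0, 0, 0, 1, 1, 0]))
```

With these toy labels there are 3 true positives, 1 false positive and 1 false negative, so all four metrics come out to 0.75.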

Flowchart/workflow of the experiments

Load and preprocess data
    Load text data and labels
    Clean and tokenize text data
    Convert labels into numerical form
Fine-tune pre-trained language model
    Load pre-trained language model
    Define and compile classification model
    Train classification model on preprocessed data
Evaluate model performance
    Load held-out test set
    Predict labels for test set using trained model
    Compute accuracy, precision, recall, and F1 score
    Test with a pipeline example
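
The workflow above could be sketched with the `datasets` and `transformers` libraries roughly as follows. This is an illustration under stated assumptions, not the notebook's actual code: the helper names (`build_label_maps`, `load_and_tokenize`, `fine_tune`, `classify`), the output directory, and every hyperparameter value are hypothetical. Library imports are kept inside the functions so that nothing is downloaded until a function is called.

```python
MODEL_NAME = "distilbert-base-uncased"  # checkpoint named in this README


def build_label_maps(label_names):
    """Convert labels into numerical form, and keep the reverse mapping."""
    label2id = {name: idx for idx, name in enumerate(label_names)}
    id2label = {idx: name for name, idx in label2id.items()}
    return label2id, id2label


def load_and_tokenize(max_length=256):  # max_length is an assumed value
    """Load the IMDb reviews and tokenize the text column."""
    from datasets import load_dataset
    from transformers import AutoTokenizer

    dataset = load_dataset("imdb")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=max_length)

    return dataset.map(tokenize, batched=True), tokenizer


def fine_tune(tokenized, tokenizer, label_names=("negative", "positive")):
    """Attach a classification head to the pre-trained model and train it."""
    from transformers import (AutoModelForSequenceClassification,
                              DataCollatorWithPadding, Trainer,
                              TrainingArguments)

    label2id, id2label = build_label_maps(label_names)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=len(label_names),
        id2label=id2label, label2id=label2id)
    args = TrainingArguments(
        output_dir="distilbert-imdb",    # hypothetical output path
        num_train_epochs=2,              # assumed hyperparameters
        per_device_train_batch_size=16,
        learning_rate=2e-5)
    trainer = Trainer(
        model=model, args=args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["test"],  # held-out test split
        data_collator=DataCollatorWithPadding(tokenizer))
    trainer.train()
    return trainer


def classify(trainer, tokenizer, text):
    """Run one example through a text-classification pipeline."""
    from transformers import pipeline

    clf = pipeline("text-classification", model=trainer.model,
                   tokenizer=tokenizer)
    return clf(text)
```

A run would chain the steps: `tokenized, tok = load_and_tokenize()`, then `trainer = fine_tune(tokenized, tok)`, then `classify(trainer, tok, "A wonderful film.")`.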


Pre-trained language model: distilbert-base-uncased
Architecture:
    6 layers
    768 hidden units
    12 self-attention heads
Pre-training: distilled from BERT base on English Wikipedia and BooksCorpus, using a masked language modeling (MLM) objective together with distillation losses.
Uncased variant: trained on lowercased text, so uppercase and lowercase letters are treated as the same.
Advantage of a pre-trained language model: it significantly reduces the amount of labeled data required to fine-tune a text classification model.
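
As a small check on the architecture figures listed above, the sketch below derives the per-head width from the hidden size and head count. The names `ARCHITECTURE`, `head_dim` and `matches_library_defaults` are hypothetical. The last helper assumes `transformers` is installed and compares the listed values against the defaults of `DistilBertConfig`.

```python
# Values as stated in this README for distilbert-base-uncased.
ARCHITECTURE = {"n_layers": 6, "dim": 768, "n_heads": 12}


def head_dim(dim, n_heads):
    """Per-head width: the hidden size is split evenly across attention heads."""
    if dim % n_heads != 0:
        raise ValueError("hidden size must be divisible by the number of heads")
    return dim // n_heads


def matches_library_defaults():
    """Compare the listed values with DistilBertConfig defaults (needs transformers)."""
    from transformers import DistilBertConfig

    config = DistilBertConfig()
    return all(getattr(config, key) == value
               for key, value in ARCHITECTURE.items())


print(head_dim(ARCHITECTURE["dim"], ARCHITECTURE["n_heads"]))  # 64
```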


Evaluation metric: Accuracy
Accuracy measures the proportion of correctly classified instances over the total number of instances in a dataset.
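
One way to report accuracy during evaluation is a metrics callback of the kind accepted by the `compute_metrics` argument of the Hugging Face `Trainer`. The sketch below is an assumption about how that could look, not the notebook's code. It is written without NumPy so it also works on plain lists, and it takes the predicted class to be the index of the largest logit in each row.

```python
def compute_metrics(eval_pred):
    """Accuracy from (logits, labels); usable as a Trainer metrics callback."""
    logits, labels = eval_pred
    # Predicted class = index of the largest logit in each row.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(1 for p, t in zip(predictions, labels) if p == t)
    return {"accuracy": correct / len(predictions)}


# Three examples, two classes: predictions are [1, 0, 0], labels are [1, 0, 1].
print(compute_metrics(([[0.1, 0.9], [2.0, -1.0], [0.3, 0.2]], [1, 0, 1])))
```

In this toy call two of the three predictions match their labels, so the reported accuracy is 2/3.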