Fine-tuning involves adapting a pre-trained model to a particular use case through additional training.
Pre-trained models are developed via unsupervised learning, which removes the need for large-scale labeled datasets. A fine-tuned model can then exploit the pre-trained model's representations to significantly reduce training costs and improve performance compared to training from scratch.
BERT, short for Bidirectional Encoder Representations from Transformers, is a machine learning (ML) framework for natural language processing. In 2018, Google developed this algorithm to improve contextual understanding of unlabeled text across a broad range of tasks by learning to predict text that might come before and after (bi-directional) other text.
Use cases:
- Sentiment analysis
- Question answering in chatbots
- Text prediction when writing emails
- Quick summarization of long legal contracts
- Disambiguating words with multiple meanings based on the surrounding text
BERT
- Bidirectional: can process text left-to-right and right-to-left. BERT uses the encoder segment of a transformer model.
- Applied in Google Docs, Gmail Smart Compose, enhanced search, voice assistants, customer-review analysis, and so on.
- GLUE score of 80.4% and 93.3% accuracy on the SQuAD dataset.
- Pre-trained with two unsupervised tasks: masked language modeling (fill in the blanks) and next sentence prediction (e.g. does sentence B follow sentence A?); see the short example after this list.
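A minimal sketch of the masked-language-modeling idea using the Transformers fill-mask pipeline with bert-base-uncased; the example sentence is made up purely for illustration:

```python
from transformers import pipeline

# BERT's masked-language-modeling objective: fill in the [MASK] token
# using context from both the left and the right of the mask.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The contract must be signed before the [MASK]."):
    print(prediction["token_str"], prediction["score"])
```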
GPT
- Autoregressive and unidirectional: text is processed in one direction. GPT uses the decoder segment of a transformer model.
- Applied in application building, generating ML code and websites, writing articles and podcasts, drafting legal documents, and so on.
- 64.3% accuracy on the TriviaQA benchmark and 76.2% accuracy on LAMBADA, with zero-shot learning.
- Straightforward text generation using autoregressive language modeling (see the generation example below).
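A small sketch of autoregressive generation, using the freely available gpt2 checkpoint as a stand-in for GPT-style models:

```python
from transformers import pipeline

# Autoregressive generation: each new token is predicted left-to-right,
# conditioned only on the tokens that came before it.
generator = pipeline("text-generation", model="gpt2")

output = generator("Fine-tuning a language model is", max_new_tokens=25)
print(output[0]["generated_text"])
```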
- If you don't have a Hugging Face account, create one on the HF Hub.
- Create a new access token in your account settings.
- If you are using Google Colab, store the token under Secrets (see the login sketch below); otherwise, keep it somewhere your environment can read it.
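A minimal login sketch for Colab; the secret name HF_TOKEN is an assumption, so use whatever name you stored the token under:

```python
from google.colab import userdata
from huggingface_hub import login

# Read the token stored in Colab Secrets (the name "HF_TOKEN" is an assumption)
# and authenticate against the Hugging Face Hub.
login(token=userdata.get("HF_TOKEN"))
```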
Dataset Loading:
Loaded the phishing site classification dataset from Hugging Face using load_dataset.
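A sketch of this step; the repository id below is a placeholder, so substitute the phishing-site classification dataset you are actually using:

```python
from datasets import load_dataset

# Placeholder repo id; replace it with the phishing-site classification
# dataset you are working with on the Hugging Face Hub.
dataset_dict = load_dataset("your-username/phishing-site-classification")
print(dataset_dict)
```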
Model Initialization:
Defined the model path (bert-base-uncased) and set up a binary classification model using AutoModelForSequenceClassification with two labels, "Safe" and "Not Safe", and froze all base model parameters except for the pooling layers so that only those are fine-tuned.
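One way this step can look; the label-to-id mapping below is an assumption about how the dataset encodes its classes:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_path = "bert-base-uncased"
# Label mapping is an assumption: 0 = Safe, 1 = Not Safe.
id2label = {0: "Safe", 1: "Not Safe"}
label2id = {"Safe": 0, "Not Safe": 1}

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(
    model_path, num_labels=2, id2label=id2label, label2id=label2id
)

# Freeze all base-model parameters, then unfreeze only the pooling layers,
# so the pooler and the classification head are the only trainable parts.
for name, param in model.base_model.named_parameters():
    param.requires_grad = False
for name, param in model.base_model.named_parameters():
    if "pooler" in name:
        param.requires_grad = True
```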
Preprocessing:
Tokenized the dataset using a preprocess_function, which applies truncation to the text inputs. The tokenized dataset was created using dataset_dict.map.
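A sketch of the tokenization step; the column name "text" is an assumption about the dataset schema:

```python
def preprocess_function(examples):
    # Truncate each input to the model's maximum sequence length.
    # The "text" column name is an assumption about the dataset schema.
    return tokenizer(examples["text"], truncation=True)

tokenized_data = dataset_dict.map(preprocess_function, batched=True)
```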
Data Collation:
Used DataCollatorWithPadding to handle padding while batching the data for training and evaluation.
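Dynamic padding pads each batch only to the length of its longest example, which is cheaper than padding everything to the model maximum:

```python
from transformers import DataCollatorWithPadding

# Pads each batch to the length of its longest sequence at collation time.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```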
Metrics Loading:
Loaded accuracy and ROC AUC metrics using the evaluate library.
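The two metrics can be loaded once and reused inside the metric function defined in the next step:

```python
import evaluate

# Accuracy scores hard class predictions; ROC AUC scores class probabilities.
accuracy = evaluate.load("accuracy")
auc_score = evaluate.load("roc_auc")
```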
Metric Computation:
In the compute_metrics function, computed the ROC AUC and accuracy after applying softmax to the predictions to obtain class probabilities.
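A sketch of such a function; it assumes binary labels where index 1 is the positive ("Not Safe") class:

```python
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    # Softmax over the logits to turn them into class probabilities.
    probabilities = np.exp(predictions) / np.exp(predictions).sum(-1, keepdims=True)
    positive_class_probs = probabilities[:, 1]

    # ROC AUC uses the positive-class probability; accuracy uses the argmax class.
    auc = auc_score.compute(prediction_scores=positive_class_probs,
                            references=labels)["roc_auc"]
    predicted_classes = np.argmax(predictions, axis=1)
    acc = accuracy.compute(predictions=predicted_classes,
                           references=labels)["accuracy"]

    return {"Accuracy": round(acc, 3), "AUC": round(auc, 3)}
```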
Training Arguments:
Defined hyperparameters including a learning rate of 2e-4, a batch size of 8, and 10 training epochs. Strategies for logging, evaluation, saving checkpoints, and loading the best model were also set.
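One possible configuration with those hyperparameters; the output directory name is a placeholder, and on older transformers versions the eval_strategy argument is spelled evaluation_strategy:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-phishing-classifier",  # placeholder directory name
    learning_rate=2e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=10,
    logging_strategy="epoch",
    eval_strategy="epoch",  # "evaluation_strategy" on older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,
)
```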
Model Training:
The Trainer class was initialized with the model, tokenizer, datasets, and metrics. The training process was executed using the train() method.
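A sketch of the Trainer setup; the split names "train" and "test" are assumptions about how the dataset is organized:

```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data["train"],  # split names are an assumption
    eval_dataset=tokenized_data["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
```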
Validation:
After training, ran the model on the validation set and computed the accuracy and ROC AUC scores using the compute_metrics function.
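A sketch of that evaluation, assuming the dataset has a split named "validation":

```python
# Run the fine-tuned model over the held-out split ("validation" is an
# assumption about the split name) and score it with compute_metrics.
predictions = trainer.predict(tokenized_data["validation"])

metrics = compute_metrics((predictions.predictions, predictions.label_ids))
print(metrics)
```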
Saving the Model:
Saved the model to Google Drive for future use.
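A minimal saving sketch; the path below is a placeholder for a mounted Google Drive folder in Colab:

```python
# Placeholder path; in Colab this would typically point at a mounted Drive folder.
save_path = "/content/drive/MyDrive/bert-phishing-classifier"

trainer.save_model(save_path)          # saves the model weights and config
tokenizer.save_pretrained(save_path)   # saves the tokenizer alongside the model
```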