This repository contains a Jupyter Notebook that demonstrates the quantization of a BERT model using PyTorch. The notebook reduces the model size and enhances computational efficiency by converting weights to lower precision (4-bit quantization). This process optimizes the BERT model for deployment in resource-constrained environments.
Quantization is a technique that reduces the memory and computational requirements of deep learning models by using lower precision for weights and activations. This notebook focuses on quantizing a BERT model to 4-bit precision, significantly reducing its memory footprint while maintaining its performance.
- Quantizes linear layers in a BERT model using a custom PyTorch module.
- Reduces model size and memory usage after quantization.
- Verifies the functionality of the quantized model with a forward pass example.
BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art language representation model developed by Google. It is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right contexts in all layers. This makes BERT particularly effective for a wide range of natural language understanding tasks.
BERT is based on the Transformer architecture and consists of the following key components:
- Input Embedding Layer:
  - Converts input tokens (words or sub-words) into dense vector representations.
  - The input embeddings are the sum of:
    - Token embeddings: Represent each token in the vocabulary.
    - Position embeddings: Capture the position of each token in the sequence.
    - Segment embeddings: Distinguish different sentences or segments in tasks like sentence pair classification.
- Multiple Encoder Layers: BERT uses a stack of Transformer encoder layers. Each encoder layer consists of:
  - Self-Attention Mechanism:
    - Allows each token to attend to every other token in the input sequence, capturing contextual relationships.
    - Computes a weighted sum of the values, with the weights determined by similarity scores (typically scaled dot-product attention; a minimal sketch appears after the figure below).
  - Feed-Forward Neural Network:
    - Applies a fully connected feed-forward network to the output of the self-attention layer.
    - Typically consists of two linear transformations with a non-linear activation in between (GELU in BERT).
  - Layer Normalization and Residual Connections:
    - Ensure stability and improve training by normalizing the output of each sub-layer and adding residual connections.
- Output Layer:
  - The output of BERT is a sequence of hidden states corresponding to each input token.
  - These hidden states can be used for various NLP tasks such as text classification, named entity recognition, and question answering.
Figure 1: BERT Architecture showing input embeddings, multi-layer encoders, and output representations
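For reference, the core of the self-attention mechanism described above can be written in a few lines of PyTorch. This is a minimal, single-head sketch for illustration only; it is not taken from the notebook, and the tensor names (`query`, `key`, `value`) are generic placeholders.

```python
import math
import torch

def scaled_dot_product_attention(query, key, value, mask=None):
    """Minimal single-head attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = query.size(-1)
    # Similarity scores between every pair of tokens, scaled by sqrt(d_k).
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 (e.g. padding) are excluded from attention.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    # Each token's output is a weighted sum of all value vectors.
    return torch.matmul(weights, value)

# Example: a batch of 2 sequences, 8 tokens each, hidden size 64.
q = k = v = torch.randn(2, 8, 64)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 64])
```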
Quantization is the process of mapping input values from a large set (like floating-point numbers) to output values in a smaller set (like integers). In deep learning, quantization reduces the precision of the weights and activations of a model to reduce its size and computational requirements, making it suitable for deployment in environments with limited resources.
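As a concrete illustration of this mapping, the snippet below performs simple per-tensor symmetric quantization of a float tensor to 8-bit integers and dequantizes it back. It is a generic example of the idea, not the notebook's 4-bit scheme.

```python
import torch

def quantize_symmetric(x: torch.Tensor, bits: int = 8):
    """Map float values to signed integers in [-(2**(bits-1)-1), 2**(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max() / qmax          # one scale for the whole tensor
    q = torch.clamp(torch.round(x / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    """Approximate reconstruction of the original float values."""
    return q.float() * scale

x = torch.randn(4, 4)
q, scale = quantize_symmetric(x)
x_hat = dequantize(q, scale)
print((x - x_hat).abs().max())  # small quantization error
```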
There are two main approaches to quantization:

- Post-Training Quantization: Converts a pre-trained model's weights to lower precision without additional training.
- Quantization-Aware Training: Simulates quantization during training to improve the accuracy of the quantized model (see the sketch below).
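The difference between the two approaches can be illustrated with a "fake quantization" step: quantization-aware training keeps the weights in floating point but passes them through a quantize-dequantize round trip in the forward pass, so the network learns to tolerate the rounding error. The module below is a generic, hypothetical sketch of that idea using a straight-through estimator; it is not code from the notebook.

```python
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Linear):
    """Linear layer that simulates low-precision weights during training."""

    def __init__(self, in_features, out_features, bits=4):
        super().__init__(in_features, out_features)
        self.qmax = 2 ** (bits - 1) - 1

    def forward(self, x):
        scale = self.weight.abs().max() / self.qmax
        # Quantize-dequantize round trip; detach() acts as a straight-through
        # estimator so gradients still flow to the underlying float weights.
        w_q = torch.clamp(torch.round(self.weight / scale), -self.qmax, self.qmax) * scale
        w = self.weight + (w_q - self.weight).detach()
        return nn.functional.linear(x, w, self.bias)

layer = FakeQuantLinear(16, 8, bits=4)
out = layer(torch.randn(2, 16))
out.sum().backward()  # gradients reach layer.weight despite the rounding
```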
Benefits of quantization include:

- Reduced Model Size: Lowers the memory footprint, making the model easier to deploy on edge devices.
- Faster Inference: Reduces the computational complexity, resulting in faster inference times.
- Energy Efficiency: Decreases power consumption, which is beneficial for mobile and embedded applications.
The quantization process involves the following steps:

- Initialize Quantized Weights: Initialize weights in a lower precision format.
- Quantize Weights: Convert original weights to a 4-bit format using a custom quantization method (sketched after this list).
- Pack and Unpack Weights: Efficiently pack weights to further reduce memory usage.
- Replace Standard Layers with Quantized Layers: Replace the standard linear layers in the BERT model with quantized layers.
- Calculate and Compare Model Size: Measure the model size before and after quantization.
- Run Forward Pass: Verify the quantized model by running a forward pass with a sample input.
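The quantize/pack steps can be sketched as follows. This is an illustrative reimplementation of the general technique (per-tensor symmetric quantization to the signed 4-bit range [-8, 7], with two 4-bit values packed per byte); the notebook's `QuantizedLinearLayer` may differ in its details.

```python
import torch

def quantize_to_4bit(w: torch.Tensor):
    """Quantize a float weight tensor to integers in [-8, 7] plus a scale."""
    scale = w.abs().max() / 7
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale

def pack_4bit(q: torch.Tensor):
    """Pack pairs of 4-bit values into single uint8 bytes (halves storage)."""
    flat = (q.flatten() & 0x0F).to(torch.uint8)   # keep the low nibble (two's complement)
    if flat.numel() % 2:                          # pad to an even count
        flat = torch.cat([flat, flat.new_zeros(1)])
    return (flat[0::2] << 4) | flat[1::2]

def unpack_4bit(packed: torch.Tensor, numel: int):
    """Recover signed 4-bit integers from the packed bytes."""
    high = (packed >> 4).to(torch.int8)
    low = (packed & 0x0F).to(torch.int8)
    q = torch.stack([high, low], dim=1).flatten()[:numel]
    return torch.where(q > 7, q - 16, q)          # restore the sign

w = torch.randn(768, 768)
q, scale = quantize_to_4bit(w)
packed = pack_4bit(q)
w_hat = unpack_4bit(packed, w.numel()).reshape(w.shape).float() * scale
print(packed.numel(), "bytes packed vs", w.numel() * 4, "bytes in float32")
```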
To run the notebook, you need Jupyter Notebook (or Jupyter Lab) and the required Python libraries.
Clone the repository:

```bash
git clone https://github.com/yourusername/bert-quantization.git
cd bert-quantization
```
Install the necessary dependencies using pip:
```bash
pip install torch transformers jupyter
```
Launch Jupyter Notebook or Jupyter Lab:
```bash
jupyter notebook
```
The notebook is structured as follows:
- Setup and Imports: Import necessary libraries and set up the device (GPU or CPU) for model training and inference.
- Load Pre-trained BERT Model: Load the BERT model (`bert-base-uncased`) and its tokenizer using the HuggingFace Transformers library.
- Define Custom Quantized Linear Layer: Define a `QuantizedLinearLayer` class that quantizes the weights of linear layers to 4-bit precision.
- Replace Linear Layers with Quantized Layers: Use the `replace_linearlayer` function to replace all linear layers in the BERT model with quantized versions (a sketch of this pattern follows the list).
- Calculate Model Size and Memory Footprint: Calculate the model size before and after quantization to demonstrate the memory reduction achieved.
- Verify Quantized Model Functionality: Perform a forward pass using the quantized model to ensure it functions correctly.
- Results and Observations: Display and analyze the reduction in model size and memory footprint.
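The layer-replacement step follows a common pattern: walk the module tree recursively and swap every `nn.Linear` for its quantized counterpart, then run a forward pass to confirm the model still works. The sketch below illustrates this pattern with a placeholder `QuantizedLinearLayer` that simply wraps the original layer; the notebook's class implements the actual 4-bit quantization and packing, and its `replace_linearlayer` signature may differ.

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class QuantizedLinearLayer(nn.Module):
    """Stand-in for the notebook's 4-bit quantized linear layer."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        # A real implementation would quantize and pack linear.weight here;
        # this placeholder just keeps the original layer for illustration.
        self.inner = linear

    def forward(self, x):
        return self.inner(x)

def replace_linearlayer(module: nn.Module):
    """Recursively replace nn.Linear children with quantized versions."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, QuantizedLinearLayer(child))
        else:
            replace_linearlayer(child)

model = BertModel.from_pretrained("bert-base-uncased")
replace_linearlayer(model)
print(model.encoder.layer[0].attention.self.query)  # now a QuantizedLinearLayer

# Verify the modified model with a forward pass on a sample input.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
inputs = tokenizer("Quantization keeps BERT small.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])
```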
The quantization achieves significant memory reduction:
- Original Model Size: 417.64 MB
- Quantized Model Size: 91.39 MB
| Model State          | Size (MB) | Memory Usage (GB) |
|----------------------|-----------|-------------------|
| Original BERT Model  | 417.64    | 0.4379            |
| Quantized BERT Model | 91.39     | 0.1389            |
Quantization reduces the model size by approximately 78%, demonstrating significant memory savings.
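For reference, model size figures of this kind are typically computed by summing the storage of all parameters and buffers, as in the helper below. This is a generic sketch, not the notebook's exact code.

```python
import torch
from transformers import BertModel

def model_size_mb(model: torch.nn.Module) -> float:
    """Total storage of parameters and buffers in megabytes."""
    param_bytes = sum(p.nelement() * p.element_size() for p in model.parameters())
    buffer_bytes = sum(b.nelement() * b.element_size() for b in model.buffers())
    return (param_bytes + buffer_bytes) / 1024 ** 2

model = BertModel.from_pretrained("bert-base-uncased")
print(f"{model_size_mb(model):.2f} MB")  # roughly 418 MB in float32
```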