/Llama3-8B_Emotion_Text_Classification_LoRA

Emotion text classification using Llama3-8b with LoRA and FlashAttention. Based on LLaMA-Factory.

Primary LanguagePythonApache License 2.0Apache-2.0

Emotion Text Classification Using Llama3-8b and LoRA

Introduction

This project explores emotion text classification using the Llama3-8b model, enhanced with LoRA and FlashAttention techniques. The model is optimized for identifying six emotion categories: joy, sadness, anger, fear, love, and surprise. The Llama3-8b model demonstrates superior performance with an accuracy of 0.9262, surpassing other transformer models such as Bert-Base, Bert-Large, Roberta-Base, and Roberta-Large.

Background

Natural Language Processing (NLP) has become a key focus area for sentiment analysis, also known as sentiment classification or sentiment detection. This technology helps businesses understand consumer emotions and opinions, enhancing customer satisfaction and product development. The vast amount of data in large companies makes manual analysis impractical, leading to the adoption of AI and NLP algorithms.

Key Features

  • Model: Llama3-8b, fine-tuned using supervised learning.
  • Techniques: Utilized LoRA for efficient parameter tuning and FlashAttention for optimized attention computation.
  • Dataset: Emotion text dataset with six categories.
  • Performance: Achieved an accuracy of 0.9262, surpassing other NLP models.

Methods

Architecture of Llama3-8b
Figure 1: Architecture of Llama3-8b

Llama3-8b Model

The Llama3-8b model, developed by Meta AI, is a large language model optimized for dialogue use cases. It contains 8 billion parameters and features significant improvements over previous models. The Llama3 series incorporates a multi-phase training process that includes pretraining, supervised fine-tuning, and iterative refinement using reinforcement learning with human feedback (RLHF). This process ensures that the model aligns closely with human preferences for helpfulness and safety.

The architectural advancements in Llama3 include the implementation of Grouped-Query Attention (GQA). GQA clusters queries to share key-value pairs, thus reducing memory and computational costs while maintaining high performance. This method significantly enhances the efficiency of attention calculations, particularly in large-scale models.

Llama3-8b is pretrained on a diverse dataset comprising more than 15 trillion tokens from publicly available data, with the model's knowledge cutoff set at March 2023. The fine-tuning phase utilized publicly available instruction datasets and over 10 million human-annotated examples, ensuring a robust understanding of various language tasks.

Table 1: Llama3-8b Model Details
Feature Specification
Training Data Publicly available data
Parameters 8B
Context Length 8k
GQA Yes
Token Count 15T+
Knowledge Cutoff March 2023

Instruction Fine-Tuning

Instruction fine-tuning enhances the model's zero-shot learning capabilities across diverse tasks. This technique involves training the model on datasets specifically designed to improve its ability to follow instructions. For example, models trained on datasets like Alpaca-7B can exhibit behaviors similar to OpenAI's text-davinci-003 in understanding and executing instructions.

LoRA Method for Training

LoRA (Low-Rank Adaptation) is a technique used to integrate trainable rank decomposition matrices into each layer of the Transformer architecture. This method significantly reduces the number of trainable parameters while adapting large language models to specific tasks or domains. Unlike full fine-tuning, LoRA keeps the pretrained model weights unchanged, updating only the low-rank matrices during the adaptation process. This approach enhances training efficiency, reduces storage needs, and does not increase inference latency compared to fully fine-tuned models.

LoRA Training Method
Figure 2: LoRA Training Method

Flash Attention V2

FlashAttention V2 is an optimization technique designed to accelerate the attention mechanism in Transformer models. It focuses on improving computational efficiency and reducing memory usage during training. FlashAttention achieves this by breaking down attention computation into smaller, more manageable chunks, thereby enhancing cache utilization and reducing memory access. Additionally, it employs sparse matrix operations to leverage the sparsity in attention mechanisms, which helps bypass unnecessary computations. Pipelined operations enable parallel execution of different computation stages, further minimizing processing time.

Experimentation

Emotion Text Label Distribution
Figure 3: Emotion Text Label Distribution

Data Analysis

The dataset used for training the model consists of text labeled with six emotions: joy, sadness, anger, fear, love, and surprise. The distribution of the dataset is relatively balanced, with "Joy" being the most common emotion and "Surprise" the least. This balanced distribution provides a strong foundation for the model to accurately classify emotions without bias towards any particular category.

Experiment Settings

The Llama3-8b model's hyperparameters are set as follows:

Table 2: Experiment Settings for Llama3-8b
Parameter Setting
Optimizer Adam
Learning Rate 5e-5
Batch Size 5
Epochs 3
LoRA Rank 8
Gradient Accumulation Steps 4
Max Length 512

The model is trained using the Adam optimizer, known for its adaptive learning rate capabilities. A cosine learning rate schedule is employed to adjust the learning rate during training. The batch size is set to 5, with gradient accumulation over 4 steps to optimize memory usage. The model is trained for 3 epochs, with the FP16 precision format used to save GPU memory while maintaining performance. The LoRA rank of 8 indicates the order of the low-rank matrix used in the adaptation process.

Evaluation Metrics

The primary metric used to evaluate the model's performance is accuracy. This metric measures the proportion of correct predictions made by the model out of all predictions. The formula for accuracy is:

$$ \text{Accuracy} = \frac{\text{TP} + \text{FN}}{\text{TP} + \text{FP} + \text{FN} + \text{TN}} $$

Where:

  • TP = True Positive
  • FP = False Positive
  • FN = False Negative
  • TN = True Negative

Experiment Analysis

The model's performance is compared against other popular NLP models, such as Bert-Base, Bert-Large, Roberta-Base, and Roberta-Large. The Llama3-8b model achieves the highest accuracy of 0.9262, demonstrating the effectiveness of instruction fine-tuning and the model's large parameter set. The superior performance of Llama3-8b in this task underscores the advantages of large language models in achieving high accuracy across diverse and challenging text classification tasks.

Table 3: Accuracy Results for Different Models
Model Accuracy
Bert-Base 0.9063
Bert-Large 0.9086
Roberta-Base 0.9125
Roberta-Large 0.9189
Llama3-8b 0.9262

Conclusion

This project demonstrates the potential of large language models, such as Llama3-8b, in domain-specific tasks like emotion text classification. The model's performance, boosted by specialized techniques like LoRA and FlashAttention, underscores the effectiveness of large models in achieving high accuracy in NLP applications.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

This project is based on modifications to the original work available under LLaMA-Factory, which is licensed under the Apache License 2.0.

Contact

For any questions or issues, please contact Daoyuan Li at lidaoyuan2816@gmail.com.