Primary language: Jupyter Notebook. License: MIT.

NLP-LLM

This repository is an ongoing project that explores state-of-the-art (SOTA) methods in NLP. It will continue to grow as more practical algorithms and approaches are added.

The repository covers several Natural Language Processing (NLP) and Large Language Model (LLM) projects: Combined Topic Modeling (CTM), fine-tuning Llama 2 for sentiment analysis, a multimodal RAG pipeline, sentence sentiment classification with DistilBERT, sentence-to-sentence translation, and RAG built with Llama 2, LangChain, and ChromaDB. Together they range from uncovering latent topics in large text corpora and adapting models for sentiment analysis to translating sentences and augmenting LLMs with retrieval for better query responses. Each project shows a practical application of a current NLP technique and how several tools and frameworks are combined to solve the task.

Working with LLMs and NLP on large unstructured datasets involves cleaning, preparing, and transforming text into a structured format that models can learn from. LLMs can then be used to derive insights, make predictions, and generate text.

Below is a summary of each method.

1. Introduction to Combined Topic Modeling (CTM) with Kaggle Dataset

For text analysis, Combined Topic Modeling (CTM) fuses Bag of Words (BoW) with Sentence-BERT (SBERT) embeddings to discern underlying themes in large text corpora. BoW supplies simple surface-level features, while the SBERT embeddings add contextual information. This makes the method well suited to diverse datasets such as the Kaggle Wikipedia abstracts, where it captures both explicit and implicit topic information.

Enhanced Topic Discovery through CTM

By combining BoW's count-based representation with SBERT's context-aware embeddings, CTM balances computational efficiency against depth of analysis. This dual approach extracts coherent, meaningful topics even from complex and varied sources such as Wikipedia, and it can surface patterns and relationships in the text that word counts alone may miss.
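The idea of feeding a topic model both representations can be sketched in a few lines. This is only an illustration of joining count-based and embedding-based features for one document, not the notebook's code: `toy_embedding` is a hypothetical stand-in for an SBERT encoder (it hashes the text instead of running a model), and the two sentences are invented.

```python
import hashlib
from collections import Counter

def bow_vector(tokens, vocab):
    """Count-based Bag-of-Words vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in vocab]

def toy_embedding(text, dim=8):
    """Hypothetical stand-in for an SBERT sentence embedding (NOT a real model).

    Hashes the text into `dim` deterministic floats in [0, 1].
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]

def combined_input(text, vocab, dim=8):
    """Concatenate the BoW counts with the (stand-in) contextual embedding."""
    tokens = text.lower().split()
    return bow_vector(tokens, vocab) + toy_embedding(text, dim)

# Invented example documents.
docs = ["the cat sat on the mat", "stock markets fell on monday"]
vocab = sorted({word for doc in docs for word in doc.split()})
features = [combined_input(doc, vocab) for doc in docs]
print(len(vocab), len(features[0]))
```

In a real setup the stand-in would be replaced by an actual sentence encoder, and the combined features would be passed to the topic model rather than printed.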

2. Fine-tune Llama 2 for Sentiment Analysis

Fine-tuning Llama 2 for sentiment analysis on financial and economic text uses the FinancialPhraseBank dataset, which contains roughly 5,000 human-annotated sentences from financial news, each labeled positive, neutral, or negative. The task matters for extracting market insights, managing risk, and making informed investment decisions. The process starts with installing libraries such as accelerate, peft, bitsandbytes, transformers, and trl, which support efficient model training and adaptation. Data preparation includes reading the dataset, creating stratified splits, and generating sentiment-analysis prompts for Llama 2.
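The data preparation steps (stratified splits and prompt generation) might look roughly like the sketch below. It uses only the standard library; the prompt wording, the function names, and the placeholder rows are assumptions made for illustration and are not taken from the notebook.

```python
import random
from collections import defaultdict

def stratified_split(rows, test_fraction=0.2, seed=0):
    """Split (text, label) rows so every label keeps the same train/test ratio."""
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

def make_prompt(text, label=None):
    """Hypothetical prompt template; the label is appended only for training rows."""
    prompt = (
        "Analyze the sentiment of the news headline enclosed in square brackets "
        "and answer with one of: positive, neutral, negative.\n"
        f"[{text}] = "
    )
    return prompt + (label if label is not None else "")

# Placeholder rows standing in for the real dataset.
labels = ("positive", "neutral", "negative")
rows = [(f"headline {i}", label) for label in labels for i in range(10)]
train, test = stratified_split(rows)
train_prompts = [make_prompt(text, label) for text, label in train]
test_prompts = [make_prompt(text) for text, _ in test]
print(len(train_prompts), len(test_prompts))
```

Leaving the label off the test prompts lets the fine-tuned model complete the answer, which can then be compared with the held-out label.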

The training phase entails loading and quantizing the Llama 2 model for efficient computation, followed by setting up the fine-tuning configuration using LoraConfig and TrainingArguments to specify training parameters. The SFTTrainer class orchestrates the fine-tuning process, focusing on parameter-efficient training to reduce computational costs. Evaluation is conducted by predicting sentiments for the test set and assessing the model's performance using standard classification metrics. This fine-tuning approach enhances Llama 2's capabilities for sentiment analysis in financial texts, offering valuable insights into the sentiments embedded in financial news headlines.
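For the evaluation step, the standard classification metrics can be computed directly from the true and predicted labels. The following is a minimal sketch in plain Python; the function name `sentiment_report` and the four example labels are made up, and a library such as scikit-learn would normally be used instead.

```python
def sentiment_report(y_true, y_pred, labels):
    """Return overall accuracy plus per-label precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    report = {"accuracy": sum(t == p for t, p in pairs) / len(pairs)}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report

# Invented labels, only to exercise the function.
y_true = ["positive", "neutral", "negative", "neutral"]
y_pred = ["positive", "neutral", "neutral", "neutral"]
report = sentiment_report(y_true, y_pred, ["positive", "neutral", "negative"])
print(report["accuracy"])
```

Per-label scores are worth reporting alongside accuracy because sentiment datasets are often imbalanced, so a model can score well overall while missing one class.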

3. Simple Multimodal RAG Pipeline with LlamaParse (JSON Mode) and LlamaIndex

To create a simple multimodal RAG pipeline using LlamaParse in JSON mode and LlamaIndex, start by setting up the environment, including the installation of necessary libraries like llama-index, llama-parse, and llama-index-llms-anthropic. Define environment variables for API access, ensuring secure connections to LlamaCloud and Anthropic services. Utilize LlamaParse to read and parse documents in JSON mode, extracting structured data including both text and images from sources such as PDFs.

For image extraction and indexing, download the images identified by LlamaParse and employ a multimodal model to interpret and generate descriptive text for each image. This process enriches the content by adding a layer of textual information to visual data, making it searchable and analyzable alongside traditional text. Integrate these text descriptions with the original textual content extracted from the document, building a comprehensive index that spans both modalities.
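The merging of page text and image descriptions into one set of indexable entries can be sketched as follows. The dictionary layout of `pages`, the function `build_text_nodes`, and the `describe_image` callback are all assumptions for illustration: the real parser output and the multimodal model call will differ, and the sample pages are invented.

```python
def build_text_nodes(pages, describe_image):
    """Flatten parsed pages into text entries, one per page text and per image.

    `pages` is assumed to be a list of dicts with "page", "text", and "images"
    keys (a hypothetical shape). `describe_image` stands in for a call to a
    multimodal model that returns a textual description of an image file.
    """
    nodes = []
    for page in pages:
        page_no = page["page"]
        if page.get("text"):
            nodes.append({"page": page_no, "kind": "text", "content": page["text"]})
        for image in page.get("images", []):
            nodes.append({
                "page": page_no,
                "kind": "image",
                "content": describe_image(image["path"]),
                "source": image["path"],
            })
    return nodes

def fake_describe(path):
    """Placeholder for the multimodal model; returns a dummy description."""
    return f"(description of {path})"

# Invented parser output.
pages = [
    {"page": 1, "text": "Quarterly revenue grew.", "images": [{"path": "img_p1_1.png"}]},
    {"page": 2, "text": "Outlook section.", "images": []},
]
nodes = build_text_nodes(pages, fake_describe)
print([node["kind"] for node in nodes])
```

Keeping the page number and image path on each entry means a retrieved description can be traced back to the original figure.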

This multimodal RAG pipeline leverages the strengths of LlamaParse for content extraction and the capabilities of multimodal models to understand and describe images, creating a rich, searchable index. This approach is particularly valuable for applications requiring nuanced understanding and retrieval of information from complex documents that contain a mix of text and visual elements.

4. Sentence Sentiment Classification: DistilBERT

In this Python-based NLP project, the initial steps involve utilizing a Kaggle Python environment, pre-loaded with analytical libraries, to handle data processing tasks like reading CSV files using Pandas and performing linear algebra operations with NumPy. The project leverages a Docker image specified by Kaggle, which includes various helpful packages. Data is sourced from the Kaggle input directory, and the environment allows for substantial data manipulation, with the capacity to write up to 20GB to the current directory, preserving outputs for future reference.

The core of the project revolves around sentence sentiment classification using DistilBERT, a lighter version of BERT optimized for faster performance without significantly compromising on effectiveness. This pre-trained model from HuggingFace is adept at generating sentence embeddings that serve as inputs for a Logistic Regression model, essentially capturing the essence of sentences in vector form. The methodology involves tokenizing sentences, padding them to uniform length, and creating attention masks to focus on meaningful tokens. The logistic regression model, trained on these embeddings, aims to classify sentence sentiments accurately, demonstrating an interplay between cutting-edge NLP techniques and traditional machine learning models for effective sentiment analysis.
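The padding and attention-mask step can be shown without loading a model. In this sketch the token ids are illustrative only (101 and 102 are used as the start and end markers, as in the BERT uncased vocabulary; the ids in between are arbitrary), and `pad_and_mask` is a hypothetical helper name.

```python
def pad_and_mask(token_id_lists, pad_id=0):
    """Pad every sequence to the longest length and build matching attention masks.

    The mask holds 1 for real tokens and 0 for padding, so the model can
    ignore padded positions.
    """
    max_len = max(len(ids) for ids in token_id_lists)
    padded, masks = [], []
    for ids in token_id_lists:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks

# Illustrative token ids for two tokenized sentences of different length.
batch = [
    [101, 2023, 3185, 2001, 2307, 102],
    [101, 9202, 102],
]
padded, masks = pad_and_mask(batch)
print(padded[1])
print(masks[1])
```

The padded ids and masks would then be passed to DistilBERT together, and the resulting sentence vectors used as features for the logistic regression classifier.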

5. Sentence-to-Sentence Translation

This repository contains a project that utilizes a Python 3 environment with various analytical libraries installed, tailored for data processing and linear algebra operations using Pandas and NumPy respectively. The setup is based on the Kaggle Python Docker image, designed for seamless data analysis and machine learning tasks. The project focuses on parsing and processing input data from the "/kaggle/input/" directory, showcasing file exploration with Python's os module.

The core of the project is a sequence-to-sequence (seq2seq) model for sentence translation, leveraging the PyTorch library for implementing neural networks. The model architecture includes an encoder-decoder framework, where the encoder processes the input sentence into a context vector, and the decoder generates the translated sentence. The implementation demonstrates the use of GRU units, attention mechanisms, and various data preprocessing steps like tokenization, normalization, and padding. This project serves as an illustrative example of advanced NLP techniques applied to sentence translation, with the potential for adaptation to other seq2seq tasks.
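The preprocessing steps named above (normalization, tokenization, and padding) could be sketched as below. The exact rules in the notebook may differ; the special-token ids, the function names, and the sample sentence are assumptions for illustration.

```python
import re
import unicodedata

# Assumed ids for the special tokens.
PAD, SOS, EOS = 0, 1, 2

def normalize(sentence):
    """Lowercase, strip accents, split off punctuation, and drop other symbols."""
    s = unicodedata.normalize("NFD", sentence.lower().strip())
    s = "".join(c for c in s if unicodedata.category(c) != "Mn")
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-z.!?]+", " ", s)
    return s.strip()

def build_vocab(sentences):
    """Map every token to an integer id, reserving ids for the special tokens."""
    vocab = {"<pad>": PAD, "<sos>": SOS, "<eos>": EOS}
    for sentence in sentences:
        for token in normalize(sentence).split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(sentence, vocab, max_len):
    """Convert a sentence to ids, append EOS, and pad to a fixed length."""
    ids = [vocab[token] for token in normalize(sentence).split()][: max_len - 1]
    ids.append(EOS)
    return ids + [PAD] * (max_len - len(ids))

vocab = build_vocab(["Je suis prêt!"])
print(normalize("Je suis prêt!"))
print(encode("Je suis prêt!", vocab, max_len=6))
```

The fixed-length id sequences are what the encoder consumes; the decoder would start from the SOS id and stop once it emits EOS.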

6. Design RAG using Llama 2, LangChain, and ChromaDB

This project develops a Retrieval Augmented Generation (RAG) system with Llama 2, LangChain, and ChromaDB to improve query responses on documents that were not part of a Large Language Model's (LLM's) training data. The system first retrieves relevant documents from a vector database where the data is indexed, then generates a response, combining the LLM with external information sources. Llama 2, an LLM released by Meta, serves as the generator, while ChromaDB, a vector database, handles document retrieval. LangChain, a framework designed to simplify LLM application development, orchestrates the integration. Because relevant external data is pulled in at generation time, the LLM can give more informed and accurate responses even on topics it was not explicitly trained on.

The implementation sets up Llama 2 with a specific model variant and uses LangChain for orchestration. ChromaDB stores and retrieves document embeddings, which are used to fetch the information relevant to each user query. Incorporating up-to-date information from the indexed documents helps the model give detailed, contextually relevant answers. The project demonstrates how a retriever can improve the usefulness and accuracy of LLM responses.
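The retrieve-then-generate flow can be illustrated with a toy in-memory example. Nothing here uses ChromaDB, LangChain, or Llama 2: `ToyVectorStore` is a hypothetical stand-in for the vector database, `embed` is a bag-of-words substitute for a real embedding model, and `build_prompt` only assembles the text that would be sent to the generator. The documents are invented.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    """Hypothetical in-memory replacement for a vector database."""

    def __init__(self):
        self.items = []  # list of (document, embedding) pairs

    def add(self, documents):
        self.items.extend((doc, embed(doc)) for doc in documents)

    def query(self, text, k=2):
        query_vec = embed(text)
        ranked = sorted(self.items, key=lambda item: cosine(query_vec, item[1]),
                        reverse=True)
        return [doc for doc, _ in ranked[:k]]

def build_prompt(question, contexts):
    """Assemble the prompt that would be passed to the generator model."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return (f"Answer using only the context below.\n"
            f"Context:\n{context_block}\nQuestion: {question}\nAnswer:")

store = ToyVectorStore()
store.add([
    "ChromaDB stores document embeddings for similarity search.",
    "LangChain orchestrates the retriever and the language model.",
    "Llama 2 generates the final answer from the retrieved context.",
])
question = "Which component stores embeddings?"
top_docs = store.query(question, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)
```

In the actual system, the store would be a ChromaDB collection, the embeddings would come from a trained model, and the prompt would be sent to Llama 2 through LangChain.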

7. Introduction to RAG

The project explores Retrieval Augmented Generation (RAG) and the complexities that arise beyond the initial setup. Becoming proficient with RAG takes more than loading documents into a vector database and placing a language model on top. That basic setup sometimes suffices, but it often fails to use the RAG architecture effectively: how the components, such as document retrieval and language model generation, are integrated and tuned largely determines the quality of the results. The project works through these issues using ChromaDB for vector database management, Sentence Transformers for embedding generation, LangChain for the RAG framework, and Cohere models for text generation.

The implementation covers installing the required libraries, preparing the dataset, creating the vector database, and building a QA chain with memory using LangChain. The dataset is first mapped into a structured format suitable for vector database storage, and document embeddings are then created with a designated embedding model. These embeddings are stored in a vector database, which retrieves documents by vector similarity. The LangChain QA chain uses memory to maintain context across interactions, which keeps the conversation coherent and relevant. The project also covers reranking, which refines retrieval by scoring document relevance with a second-stage model and thereby improves the quality and precision of the generated responses.
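The two-stage idea behind reranking can be sketched as follows: a first stage returns candidate documents, and a second-stage scorer reorders them before they reach the generator. The `overlap_score` function is a deliberately simple stand-in for a real reranking model, and the names and sample documents are invented for illustration.

```python
def overlap_score(query, document):
    """Stand-in relevance model: fraction of query terms found in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms) / len(query_terms) if query_terms else 0.0

def rerank(query, candidates, score_fn, top_n=2):
    """Re-score first-stage candidates and keep the `top_n` most relevant."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    # Python's sort is stable, so ties keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

# Candidates in the order a first-stage vector search might return them.
candidates = [
    "vector databases store embeddings",
    "reranking reorders retrieved documents by relevance",
    "memory keeps earlier turns of the conversation",
]
query = "how does reranking order retrieved documents"
reranked = rerank(query, candidates, overlap_score)
print(reranked)
```

Because the second-stage scorer only sees a handful of candidates, it can afford to be a slower and more accurate model than the one used for the initial vector search.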