PoRAG

A fully configurable Retrieval-Augmented Generation (RAG) pipeline for Bengali-language applications. Supports both local and Hugging Face models. Built with LangChain.


PoRAG (পরাগ), Bangla Retrieval-Augmented Generation (RAG) Pipeline



Welcome to the Bangla Retrieval-Augmented Generation (RAG) Pipeline! This repository provides a pipeline for interacting with Bengali text data using natural language.

Use Cases

  • Interact with your Bengali data in Bengali.
  • Ask questions about your Bengali text and get answers.

How It Works

Configurability

  • Customizable LLM Integration: Supports Hugging Face or local LLMs compatible with Transformers.
  • Flexible Embedding: Supports embedding models compatible with Sentence Transformers (embedding dimension: 768).
  • Hyperparameter Control: Adjust max_new_tokens, top_p, top_k, temperature, chunk_size, chunk_overlap, and k.
  • Quantization Toggle: Pass the --quantization argument to switch between model variants, including LoRA-adapted and 4-bit quantized versions; see the sketch after this list.
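
Below is a minimal sketch of what the quantization toggle typically amounts to with Transformers and bitsandbytes. The model name is the pipeline's default chat model, but the `use_quantization` variable and the exact wiring are illustrative; main.py may differ.

# Illustrative sketch of the --quantization toggle; not the exact main.py code.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

use_quantization = True  # i.e., the --quantization flag was passed
model = AutoModelForCausalLM.from_pretrained(
    "hassanaliemon/bn_rag_llama3-8b",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True) if use_quantization else None,
    device_map="auto",  # requires accelerate
)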

Installation

  1. Install Python: Download and install Python from python.org.
  2. Clone the Repository:
    git clone https://github.com/Bangla-RAG/PoRAG.git
    cd PoRAG
  3. Install Required Libraries:
    pip install -r requirements.txt
Example `requirements.txt`:
transformers
bitsandbytes 
peft 
accelerate 
chromadb
langchain 
langchain-community
sentence_transformers
argparse
rich

Running the Pipeline

  1. Prepare Your Bangla Text Corpus: Create a text file (e.g., test.txt) with the Bengali text you want to use.
  2. Run the RAG Pipeline:
    python main.py --text_path test.txt
  3. Interact with the System: Type your question and press Enter to get a response based on the retrieved information.
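
Conceptually, steps 2-3 boil down to the sketch below: split the corpus, index it in Chroma, retrieve the top-k chunks for each question, and generate an answer with the chat model. This is an illustrative reconstruction using the default models and hyperparameters listed under Configuration, not the exact code in main.py; the prompt format in particular is an assumption.

# Illustrative end-to-end sketch of the pipeline (not the exact main.py code).
from transformers import pipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# 1. Split the Bengali corpus into overlapping chunks and index them in Chroma.
text = open("test.txt", encoding="utf-8").read()
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=150).split_text(text)
embeddings = HuggingFaceEmbeddings(model_name="l3cube-pune/bengali-sentence-similarity-sbert")
retriever = Chroma.from_texts(chunks, embeddings).as_retriever(search_kwargs={"k": 4})

# 2. Load the chat model as a text-generation pipeline with the default sampling settings.
generator = pipeline(
    "text-generation",
    model="hassanaliemon/bn_rag_llama3-8b",
    max_new_tokens=1024, do_sample=True, top_k=2, top_p=0.6, temperature=0.6,
)

# 3. Interactive loop: retrieve context for each question, then generate an answer.
while True:
    question = input("আপনার প্রশ্ন: ").strip()
    if not question:
        break
    context = "\n".join(doc.page_content for doc in retriever.invoke(question))
    prompt = f"প্রসঙ্গ:\n{context}\n\nপ্রশ্ন: {question}\nউত্তর:"  # assumed prompt format
    output = generator(prompt)[0]["generated_text"]
    print("উত্তর:", output[len(prompt):].strip())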

Example

আপনার প্রশ্ন: রবীন্দ্রনাথ ঠাকুরের জন্মস্থান কোথায়?
উত্তর: রবীন্দ্রনাথ ঠাকুরের জন্মস্থান কলকাতার জোড়াসাঁকোর 'ঠাকুরবাড়ি'তে।

(Your question: Where was Rabindranath Tagore born? Answer: Rabindranath Tagore's birthplace is the 'Thakurbari' at Jorasanko, Kolkata.)

Configuration (Default)

  • Default Chat Model: hassanaliemon/bn_rag_llama3-8b
  • Default Embedding Model: l3cube-pune/bengali-sentence-similarity-sbert
  • Default k: 4 (number of documents to retrieve)
  • Default top_k: 2 (for chat model)
  • Default top_p: 0.6 (for chat model)
  • Default temperature: 0.6 (for chat model)
  • Default chunk_size: 500 (for text splitting)
  • Default chunk_overlap: 150 (for text splitting)
  • Default max_new_tokens: 1024 (maximum number of new tokens generated per response)
  • Default quantization: False (controls the load_in_4bit flag)

You can change these values in the main.py script.
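
Collected in one place, those defaults amount to a configuration like the following. This is only a reference sketch of the values listed above; main.py may organize them as argparse defaults or module-level constants instead.

# The defaults listed above, gathered for reference (organization is hypothetical).
DEFAULTS = {
    "chat_model": "hassanaliemon/bn_rag_llama3-8b",
    "embed_model": "l3cube-pune/bengali-sentence-similarity-sbert",
    "k": 4,                  # documents retrieved per query
    "top_k": 2,              # chat model sampling
    "top_p": 0.6,            # chat model sampling
    "temperature": 0.6,      # chat model sampling
    "chunk_size": 500,       # text splitting
    "chunk_overlap": 150,    # text splitting
    "max_new_tokens": 1024,  # response length cap
    "quantization": False,   # toggles load_in_4bit
}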

Key Milestones

  • Default LLM: Trained a LLaMA-3 8B model hassanaliemon/bn_rag_llama3-8b for context-based QA.
  • Embedding Model: Tested sagorsarker/bangla-bert-base, csebuetnlp/banglabert, and found l3cube-pune/bengali-sentence-similarity-sbert to be most effective.
  • Retrieval Pipeline: Implemented a LangChain retrieval pipeline and tested it with our fine-tuned LLM and embedding model.
  • Ingestion System: Settled on text files after testing several PDF parsing solutions.
  • Question Answering Chat Loop: Developed a multi-turn chat system for terminal testing.
  • Generation Configuration Control: Attempted to use generation config in the LLM pipeline.
  • Model Testing: Tested with the following models (quantized and LoRA versions):
    1. asif00/bangla-llama
    2. hassanaliemon/bn_rag_llama3-8b
    3. asif00/mistral-bangla
    4. KillerShoaib/llama-3-8b-bangla-4bit

Limitations

  • PDF Parsing: Currently, only text (.txt) files are supported due to the lack of reliable Bengali PDF parsing tools.
  • Quality of Answers: Answer quality depends heavily on the quality of your chosen LLM, embedding model, and Bengali text corpus.
  • Scarcity of Pre-trained Models: There is currently no high-fidelity pre-trained Bengali LLM suited to QA tasks, which makes it difficult to achieve impressive RAG performance. Overall performance may vary depending on the model used.

Future Steps

  • PDF Parsing: Develop a reliable Bengali-specific PDF parser.
  • User Interface: Design a chat-like UI for easier interaction.
  • Chat History Management: Implement a system for maintaining and accessing chat history.

Contribution and Feedback

We welcome contributions! If you have suggestions, bug reports, or enhancements, please open an issue or submit a pull request.

Top Contributors

Abdullah Al Asif

Hasan Ali Emon

Disclaimer

This is a work-in-progress and may require further refinement. The results depend on the quality of your Bengali text corpus and the chosen models.

References

  1. Transformers
  2. Langchain
  3. ChromaDB
  4. Sentence Transformers
  5. hassanaliemon/bn_rag_llama3-8b
  6. l3cube-pune/bengali-sentence-similarity-sbert
  7. sagorsarker/bangla-bert-base
  8. csebuetnlp/banglabert
  9. asif00/bangla-llama
  10. KillerShoaib/llama-3-8b-bangla-4bit
  11. asif00/mistral-bangla