Welcome to the Bangla Retrieval-Augmented Generation (RAG) Pipeline! This repository provides a pipeline for interacting with Bengali text data using natural language.
- Interact with your Bengali data in Bengali.
- Ask questions about your Bengali text and get answers.
- LLM Framework: Transformers
- RAG Framework: LangChain
- Chunking: Recursive Character Split
- Vector Store: ChromaDB
- Data Ingestion: Currently supports text (.txt) files only due to the lack of reliable Bengali PDF parsing tools.
- Customizable LLM Integration: Supports Hugging Face or local LLMs compatible with Transformers.
- Flexible Embedding: Supports embedding models compatible with Sentence Transformers (embedding dimension: 768).
- Hyperparameter Control: Adjust `max_new_tokens`, `top_p`, `top_k`, `temperature`, `chunk_size`, `chunk_overlap`, and `k`.
- Toggle Quantization Mode: Pass the `--quantization` argument to toggle quantized model loading, including LoRA and 4-bit quantization.
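The chunking step above (recursive character splitting controlled by `chunk_size` and `chunk_overlap`) can be illustrated with a dependency-free sketch. The real pipeline uses LangChain's `RecursiveCharacterTextSplitter`, which additionally prefers paragraph and sentence boundaries, so this is only an approximation of the idea:

```python
# Simplified illustration of fixed-size chunking with overlap.
# The actual pipeline uses LangChain's RecursiveCharacterTextSplitter,
# which also respects paragraph/sentence boundaries.
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 150) -> list[str]:
    step = chunk_size - chunk_overlap  # how far the window advances per chunk
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - chunk_overlap, 1), step)
    ]

# A 1000-character text yields three overlapping 500-character windows.
chunks = chunk_text("x" * 1000)
```

Overlapping chunks help a retrieved passage carry enough surrounding context to answer a question, at the cost of some duplicated text in the vector store.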
- Install Python: Download and install Python from python.org.
- Clone the Repository:

```shell
git clone https://github.com/Bangla-RAG/PoRAG.git
cd PoRAG
```
- Install Required Libraries:

```shell
pip install -r requirements.txt
```
Click to view example `requirements.txt`
```
transformers
bitsandbytes
peft
accelerate
chromadb
langchain
langchain-community
sentence_transformers
argparse
rich
```
- Prepare Your Bangla Text Corpus: Create a text file (e.g., `text.txt`) with the Bengali text you want to use.
- Run the RAG Pipeline:

```shell
python main.py --text_path text.txt
```
- Interact with the System: Type your question and press Enter to get a response based on the retrieved information.
```
আপনার প্রশ্ন: রবীন্দ্রনাথ ঠাকুরের জন্মস্থান কোথায়?
উত্তর: রবীন্দ্রনাথ ঠাকুরের জন্মস্থান কলকাতার জোড়াসাঁকোর 'ঠাকুরবাড়ি'তে।
```

(Translation: "Your question: Where was Rabindranath Tagore born?" / "Answer: Rabindranath Tagore was born at the 'Thakurbari' in Jorasanko, Kolkata.")
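Under the hood, answering a question starts with retrieval: the question is embedded and the `k` chunks with the most similar embeddings are fetched from the vector store. A minimal sketch of that similarity search using toy 2-D vectors (the actual pipeline uses ChromaDB and 768-dimensional SBERT embeddings, but the principle is the same):

```python
import numpy as np

def top_k_chunks(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 4) -> list[int]:
    """Return indices of the k chunk embeddings most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = c @ q  # cosine similarity of every chunk to the query
    return [int(i) for i in np.argsort(-sims)[:k]]

# Toy example: chunks 0 and 2 point in (almost) the same direction as the query.
vecs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
print(top_k_chunks(np.array([1.0, 0.0]), vecs, k=2))  # [0, 2]
```

The retrieved chunks are then passed to the chat model as context, which is why both the embedding model and the `k` setting matter for answer quality.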
- Default Chat Model: `hassanaliemon/bn_rag_llama3-8b`
- Default Embedding Model: `l3cube-pune/bengali-sentence-similarity-sbert`
- Default `k`: `4` (number of documents to retrieve)
- Default `top_k`: `2` (for chat model)
- Default `top_p`: `0.6` (for chat model)
- Default `temperature`: `0.6` (for chat model)
- Default `chunk_size`: `500` (for text splitting)
- Default `chunk_overlap`: `150` (for text splitting)
- Default `max_new_tokens`: `1024` (maximum length of the response messages)
- Default `quantization`: `False` (sets the `load_in_4bit` boolean)

You can change these values in the `main.py` script.
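Since `argparse` appears in `requirements.txt`, these defaults are plausibly exposed as command-line flags. A hedged sketch of what that could look like; the flag names mirror the parameter names above but are assumptions, not the repository's exact interface:

```python
import argparse

# Illustrative CLI mirroring the documented defaults; flag names are assumed.
parser = argparse.ArgumentParser(description="Bangla RAG pipeline (illustrative flags)")
parser.add_argument("--text_path", required=True, help="path to the Bengali .txt corpus")
parser.add_argument("--k", type=int, default=4, help="documents to retrieve")
parser.add_argument("--top_k", type=int, default=2)
parser.add_argument("--top_p", type=float, default=0.6)
parser.add_argument("--temperature", type=float, default=0.6)
parser.add_argument("--chunk_size", type=int, default=500)
parser.add_argument("--chunk_overlap", type=int, default=150)
parser.add_argument("--max_new_tokens", type=int, default=1024)
parser.add_argument("--quantization", action="store_true", help="enable 4-bit loading")

args = parser.parse_args(["--text_path", "text.txt"])  # simulate a CLI call
print(args.k, args.quantization)  # 4 False
```

Using `action="store_true"` makes `--quantization` a simple toggle, matching its description as a boolean switch.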
- Default LLM: Trained a LLaMA-3 8B model, `hassanaliemon/bn_rag_llama3-8b`, for context-based QA.
- Embedding Model: Tested `sagorsarker/bangla-bert-base` and `csebuetnlp/banglabert`, and found `l3cube-pune/bengali-sentence-similarity-sbert` to be the most effective.
- Retrieval Pipeline: Implemented a LangChain retrieval pipeline and tested it with our fine-tuned LLM and embedding model.
- Ingestion System: Settled on text files after testing several PDF parsing solutions.
- Question Answering Chat Loop: Developed a multi-turn chat system for terminal testing.
- Generation Configuration Control: Attempted to use a generation config in the LLM pipeline.
- Model Testing: Tested with the following models (quantized and LoRA versions):
- PDF Parsing: Currently, only text (.txt) files are supported due to the lack of reliable Bengali PDF parsing tools.
- Quality of Answers: The quality of answers depends heavily on the quality of your chosen LLM, embedding model, and Bengali text corpus.
- Scarcity of Pre-trained Models: There are currently no high-fidelity Bengali LLMs pre-trained for QA tasks, which makes it difficult to achieve impressive RAG performance. Overall performance may vary depending on the models used.
- PDF Parsing: Develop a reliable Bengali-specific PDF parser.
- User Interface: Design a chat-like UI for easier interaction.
- Chat History Management: Implement a system for maintaining and accessing chat history.
We welcome contributions! If you have suggestions, bug reports, or enhancements, please open an issue or submit a pull request.
This is a work-in-progress and may require further refinement. The results depend on the quality of your Bengali text corpus and the chosen models.