Welcome to the LLM Fine-Tuning for Amharic Language repository, initiated by AIQEM, an African startup specializing in AI and blockchain solutions. Our goal is to increase the impact of technological innovation on the Ethiopian and African business landscape. Our latest flagship project, Adbar, is an AI-based Telegram ad solution that uses data analysis and bots to place ads optimally across Telegram channels.
As Telegram grows, AIQEM adapts its advertising strategy to align with this platform. Our focus is on improving ad effectiveness by integrating powerful AI for Amharic text manipulation. We aim to create an Amharic RAG pipeline that generates creative text ads for Telegram channels based on campaign details, including brand and product information.
Success means ads that are catchy and relevant to the Telegram community. To achieve this, we need high-quality Amharic text embedding and generation capabilities, which involves fine-tuning open-source Large Language Models (LLMs) such as Mistral, LLaMA 2, Falcon, or StableLM 2 to meet our business objectives.
| Model         | Parameters |
|---------------|------------|
| Microsoft Phi | 2B         |
| StableLM      | 2B         |
| LLaMA 2       | 7B         |
| Mistral       | 7B         |
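Any of these candidates can be loaded for a quick inference smoke test. A minimal sketch with `transformers` (the model id and generation settings here are illustrative assumptions, not the project's final choice):

```python
# Minimal inference smoke test for one candidate model; the model id and
# sampling settings are illustrative assumptions, not project configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # any candidate from the table above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("ሰላም", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```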
We trained a custom BPE (byte-pair encoding) tokenizer with a 100k-token vocabulary for Amharic. It is available for inference on Hugging Face at the following link:
https://huggingface.co/BiniyamAjaw/amharic_tokenizer/blob/main/README.md
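If the repo is stored in a `transformers`-compatible format (an assumption; check the tokenizer card), it can be loaded directly from the Hub:

```python
# Load the custom Amharic BPE tokenizer from the Hub (assumes the repo files
# are in a transformers-compatible format).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BiniyamAjaw/amharic_tokenizer")

ids = tokenizer.encode("ሰላም ለዓለም")  # roughly "hello, world" in Amharic
print(ids)
print(tokenizer.decode(ids))
```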
The dataset was gathered from public Telegram channels in the following categories:
- News
- Sports
- Literature
- E-commerce
A data preparation and preprocessing pipeline was built to assemble a large Amharic corpus for pretraining and fine-tuning. The dataset is available on Hugging Face at the following link:
https://huggingface.co/datasets/BiniyamAjaw/amharic_dataset_v2
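The corpus can be pulled with the `datasets` library; the split name and the `text` column below are assumptions, so check the dataset card for the exact schema:

```python
# Load the Amharic corpus from the Hub and apply a toy cleaning filter.
# The "train" split and "text" column are assumptions; see the dataset card.
from datasets import load_dataset

ds = load_dataset("BiniyamAjaw/amharic_dataset_v2", split="train")
print(ds)

# Example preprocessing step: drop very short records before fine-tuning.
ds = ds.filter(lambda ex: len(ex["text"]) > 20)
print(ds[0])
```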
The project is divided into the following tasks:

- Literature Review & Huggingface Ecosystem:
  - Understand LLMs and explore Huggingface tooling for fine-tuning.
- Load an LLM and Use It for Inference:
  - Set up the environment and test model inference (see the inference sketch under the model table above).
- Data Preprocessing and Preparation:
  - Clean the Telegram data for fine-tuning.
- Fine-Tuning the LLM:
  - Train and fine-tune LLMs for Amharic text (a fine-tuning sketch follows this list).
- Build a RAG Pipeline for Amharic Ad Generation:
  - Implement RAG techniques for ad content generation (a retrieval sketch follows this list).
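For the fine-tuning task, a minimal parameter-efficient sketch with `transformers` and `peft` (LoRA) could look like the following. The base model, hyperparameters, and `text` column are illustrative assumptions rather than the project's actual training configuration:

```python
# Minimal LoRA fine-tuning sketch (assumptions: mistralai/Mistral-7B-v0.1 as
# the base model, a "text" column in the dataset, illustrative hyperparameters).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# Wrap the base model with low-rank adapters so only a small fraction
# of the weights is trained.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM"))

ds = load_dataset("BiniyamAjaw/amharic_dataset_v2", split="train")
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
            batched=True, remove_columns=ds.column_names)

Trainer(
    model=model,
    args=TrainingArguments("amharic-lora", per_device_train_batch_size=2,
                           gradient_accumulation_steps=8, num_train_epochs=1,
                           learning_rate=2e-4, logging_steps=50),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

LoRA keeps the base weights frozen and trains only small adapter matrices, which is what makes fine-tuning 7B-scale models on modest hardware feasible.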
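For the RAG task, one possible shape of the retrieval step is sketched below. The multilingual embedding model, example documents, and prompt format are placeholders, not the project's actual components:

```python
# Minimal RAG retrieval sketch: embed campaign/brand snippets, retrieve the
# closest ones, and prepend them to the generation prompt. The embedding model
# below is an illustrative choice, not necessarily what the project uses.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# In the real pipeline these would be Amharic brand/campaign details.
docs = [
    "Brand X sells handmade leather shoes in Addis Ababa.",
    "Campaign goal: promote the new year discount on Telegram.",
]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q
    return [docs[i] for i in np.argsort(-scores)[:k]]

context = "\n".join(retrieve("write an Amharic ad for the discount"))
prompt = f"Context:\n{context}\n\nWrite a catchy Amharic Telegram ad:"
# `prompt` would then be passed to the fine-tuned Amharic LLM for generation.
print(prompt)
```

The retrieved snippets are concatenated into the prompt so the fine-tuned LLM generates ads grounded in the brand and product details rather than from scratch.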
The repository is organized as follows:

- `pretraining`: scripts for pretraining on the corpus.
- `notebooks`: Jupyter notebooks for analysis.
- `utils`: helper functions.
- `backend`: FastAPI backend.
- `frontend`: React frontend.
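As a rough illustration of how the backend could expose the pipeline, here is a hypothetical FastAPI endpoint; the route name, request schema, and response are invented for illustration and may not match the actual `backend/main.py`:

```python
# Hypothetical sketch of an ad-generation endpoint; the real backend may
# differ. Route name, schema, and the placeholder response are illustrative.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AdRequest(BaseModel):
    brand: str
    product: str
    campaign_details: str

@app.post("/generate-ad")
def generate_ad(req: AdRequest) -> dict:
    # In the real backend this would call the RAG pipeline + fine-tuned LLM.
    prompt = f"{req.brand} | {req.product} | {req.campaign_details}"
    return {"ad_text": f"(generated Amharic ad for: {prompt})"}
```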
To run the project locally:

- Clone the repository:
  - `git clone https://github.com/biniyam69/Amharic-LLM-Finetuning.git`
- Set up the backend:
  - `cd backend`
  - `pip install -r requirements.txt`
  - `uvicorn main:app --reload`
- Set up the frontend:
  - `cd ../frontend`
  - `npm install`
  - `npm start`
- Access the RAG Ad Builder:
  - Open your browser and go to http://localhost:3000.
This project is licensed under the MIT License.
Feel free to contribute, provide feedback, or use the code as needed!
- Back-End: `backend/` folder
- Front-End: `frontend/` folder