hinglish-training

Code related to dataset curation, training/fine-tuning, evaluation and inference of Hindi/Hinglish instruct/chat models.

The tasks are outlined in the spreadsheet.

Dataset Curation

This section lists resources that can help with this project: relevant research papers, existing datasets, models, chatbots, and APIs.

Collection of the Datasets on 🤗 Hub: Hindi/Hinglish Instruct/Chat Datasets

Collection of the models on 🤗 Hub: Hindi/Hinglish Models

sarvamai/OpenHathi-7B-Hi-v0.1-Base: The first model in the OpenHathi series to be released by Sarvam AI. It is a 7B-parameter model based on Llama 2 and trained on Hindi, English, and Hinglish. More details about the model, its training procedure, and evaluations can be found here.
Hi-NOLIN-9B: A checkpoint taken at 600B tokens on a new training dataset that contains Hindi. Its authors report that the bilingual model also generalizes to code-mixed English-Hindi, the informal mixed language known as Hinglish, which they describe as currently spoken by over 350 million people.
Krutrim, OpenAI, Gemini Pro on Bard