/open-hinglish-model

Code related to training/fine-tuning Hindi/Hinglish models.

hinglish-training

Code related to dataset curation, training/fine-tuning, evaluation and inference of Hindi/Hinglish instruct/chat models.

The tasks are outlined in the Spreadsheet

Dataset Curation

Model training

Evaluation

Inference/Chatbot

Resources

This section lists all resources such as relevant research papers, existing datasets, models, chatbots, APIs that can help with this project.

Papers

Datasets

Collection of the Datasets on 🤗 Hub: Hindi/Hinglish Instruct/Chat Datasets

Models/APIs/Chatbots

Collection of the models on 🤗 Hub: Hindi/Hinglish Models

  1. sarvamai/OpenHathi-7B-Hi-v0.1-Base: The first model in the OpenHathi series to be released by Sarvam AI. It is a 7B-parameter model based on Llama 2 and trained on Hindi, English, and Hinglish. More details about the model, its training procedure, and evaluations can be found here.
  2. Hi-NOLIN-9B: A checkpoint taken at 600B tokens of training on the new dataset that contains Hindi. Its authors also report that the bilingual model generalizes to Hinglish, the informal code-mixed English-Hindi language that they describe as spoken by over 350 million people.
  3. Krutrim, OpenAI, Gemini Pro on Bard

Evaluation benchmarks