
Primary language: Jupyter Notebook. License: MIT.

NLP for Hinglish (Code-mixed Hindi+English)

This repository contains a language model for code-mixed Hinglish (Hindi and English), which is spoken in the Indian subcontinent.

The methodology followed in this repo is detailed in this paper, accepted at Dravidian-Codemix-HASOC2020@FIRE2020.

Dataset

  1. Synthetically Generated Hinglish Dataset from Wikipedia Articles

Results

Language Model Perplexity (on validation set)

| Architecture/Dataset | Synthetically Generated Wikipedia Articles Dataset |
|---|---|
| ULMFiT | 86.48 |
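
For readers unfamiliar with the metric, here is a minimal sketch of how a perplexity figure like the one above is usually derived. It assumes the common convention that perplexity is the exponential of the mean per-token cross-entropy loss measured in nats. The repository does not state which convention it uses, and the `perplexity` helper below is hypothetical, not part of this repo.

```python
import math


def perplexity(token_nlls):
    """Return exp(mean negative log-likelihood per token).

    `token_nlls` holds one negative log-likelihood (in nats) per
    validation token. This helper is illustrative only.
    """
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))


# Sanity check: a model that spreads probability uniformly over 4 choices
# assigns each token an NLL of ln(4), so its perplexity is (about) 4.
uniform_losses = [math.log(4)] * 10
print(round(perplexity(uniform_losses), 6))

# Under the same assumption, the reported perplexity can be mapped back
# to a mean validation loss: ln(86.48) is roughly 4.46 nats per token.
print(round(math.log(86.48), 2))
```

Lower is better: a perplexity of 86.48 can be read as the model being, on average, about as uncertain as if it were choosing uniformly among roughly 86 tokens at each step.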

Visualizations

Word Embeddings
| Architecture | Visualization |
|---|---|
| ULMFiT | Embeddings projection |

Pretrained Models

Language Models

Download the pretrained ULMFiT language model from here

Tokenizer

The tokenizer was trained using Google's SentencePiece.

Download the trained model and vocabulary from here
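
Once the tokenizer model is downloaded, loading it might look like the sketch below. It uses the `sentencepiece` Python package (`pip install sentencepiece`). The file name `hinglish_tokenizer.model`, the `tokenize_hinglish` helper, and the sample sentence are all assumptions made for illustration; substitute the actual path of the file you downloaded.

```python
from pathlib import Path


def tokenize_hinglish(text, model_path):
    """Split `text` into subword pieces with a trained SentencePiece model.

    Hypothetical helper: `model_path` must point to the downloaded
    .model file, whose real name may differ from the one used below.
    """
    import sentencepiece as spm  # third-party dependency

    processor = spm.SentencePieceProcessor(model_file=str(model_path))
    return processor.encode(text, out_type=str)


if __name__ == "__main__":
    model_path = Path("hinglish_tokenizer.model")  # assumed file name
    if model_path.exists():
        print(tokenize_hinglish("mujhe yeh movie bahut pasand aayi", model_path))
    else:
        print(f"Tokenizer model not found at {model_path}; download it first.")
```

Passing `out_type=int` instead of `out_type=str` returns vocabulary ids rather than subword strings, which is the form a language model typically consumes.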