- Implemented a computational linguistics pipeline for Hindi, covering Unicode correction, corpus cleaning, tokenization, syllable analysis, and byte-pair encoding (BPE).
- Applied BPE with varying vocabulary sizes, calculated unigram and bigram frequencies, and evaluated the precision, recall, and F-score of BPE-generated tokens.
- Extracted lemmas and surface forms from tagged files and plotted frequency distributions of different linguistic elements to assess how closely they follow Zipf's law.
- Fine-tuned a pre-trained BERT model for UPOS (universal part-of-speech) tag prediction in Hindi, achieving an F1-score of 94.49%.
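
The precision, recall, and F-score evaluation of BPE-generated tokens listed above could be computed along the following lines. This is a minimal sketch, not the project's actual code: the function name `token_prf`, the multiset-overlap definition of a "correct" token, and the sample tokens are all assumptions made for illustration.

```python
from collections import Counter


def token_prf(predicted, gold):
    """Return (precision, recall, F1) for predicted tokens against gold tokens.

    Assumption: a predicted token counts as correct when it also appears in
    the gold segmentation, with repeated tokens matched as a multiset.
    """
    pred_counts = Counter(predicted)
    gold_counts = Counter(gold)
    # Multiset intersection keeps the minimum count of each shared token.
    overlap = sum((pred_counts & gold_counts).values())

    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: four BPE pieces compared with three reference morphemes.
predicted_tokens = ["लड़क", "ियाँ", "खेल", "ती"]
gold_tokens = ["लड़क", "ियाँ", "खेलती"]
precision, recall, f1 = token_prf(predicted_tokens, gold_tokens)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Running the same function over outputs produced with different vocabulary sizes would give one score triple per size, which makes the settings directly comparable.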