Computational-Linguistics-for-Indian-Languages


  • Implemented computational linguistics techniques for Hindi, including Unicode correction, corpus cleaning, tokenization, syllable analysis, and byte-pair encoding (BPE).
  • Applied BPE with varying vocabulary sizes, calculated unigram and bigram frequencies, and evaluated the precision, recall, and F-scores of the BPE-generated tokens.
  • Extracted lemmas and surface forms from tagged files, plotted frequency distributions of different linguistic elements, and assessed how closely they follow a Zipfian distribution.
  • Fine-tuned a pre-trained BERT model for universal part-of-speech (UPOS) tagging in Hindi, achieving an F1 score of 94.49%.
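
The BPE step in the list above can be illustrated with a minimal sketch. It is not the repository's actual code: the function names (`learn_bpe`, `merge_pair`, `segment`, `prf`), the toy word counts, and the reference token set are all hypothetical, and the scoring assumes that BPE tokens are compared as a set against some reference segmentation, which the summary does not specify.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} mapping."""
    # Each word starts as a sequence of single Unicode code points.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(s, best): f for s, f in vocab.items()}
    return merges


def segment(word, merges):
    """Segment a word by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def prf(predicted, reference):
    """Set-based precision, recall, and F1 of predicted tokens against a reference set."""
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up counts for illustration only; not taken from the project's corpus.
    toy_counts = {"लड़का": 4, "लड़की": 3, "लड़के": 2}
    rules = learn_bpe(toy_counts, num_merges=3)
    tokens = segment("लड़का", rules)
    print(rules)
    print(tokens)
    # Hypothetical reference segmentation used only to exercise the metric.
    print(prf(tokens, ["लड़", "का"]))
```

Raising `num_merges` corresponds to growing the BPE vocabulary, so running the same evaluation at several values is one way to compare vocabulary sizes. Because the sketch splits on code points, Devanagari vowel signs start out as separate symbols; a syllable-aware initial split would be a reasonable alternative.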