Group:
- Pratyaksh Gautam (2020114002)
- Nukit Tailor (2020114012)
The original code is under the directory code_release/
The Facebook, Twitter and WhatsApp data were all downloaded from: http://amitavadas.com/Code-Mixing.html
- The English word list "resources/EN.words.txt" was downloaded from: http://wordlist.aspell.net/
- The Hindi transliteration word list "resources/HI.trans.fire2013.txt" was downloaded from: https://web.archive.org/web/20160312153954/http://cse.iitkgp.ac.in/resgrp/cnerg/qa/fire13translit/
- The Hindi word list was compiled by Gupta et al. (2012): http://www.lrec-conf.org/proceedings/lrec2012/pdf/365_Paper.pdf
The main annotation script is "process.py". It should be run as follows:
python3 process.py <src_file> [-top_n int] -out <out_file>
Where <src_file> is the input text file in CoNLL-format (1 token per line), and <out_file> is the name of the output file that will be generated. The -top_n flag controls how much of the manually created word list will be used to classify tokens. By default, it uses the whole word list.
To check the scores, run the following command:
python3 scorer.py -hyp <out_file> -ref <ref_file> [-v]
Where <out_file> is the output file generated by process.py, and <ref_file> is the reference file. The -v flag is optional and will print the scores.
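The two commands above can also be assembled from a small driver script. The following is a minimal sketch, not part of the original code: the helper names build_process_cmd and build_scorer_cmd and the example file names are hypothetical, and only the flags documented above are used.

```python
import shlex
from typing import List, Optional


def build_process_cmd(src_file: str, out_file: str,
                      top_n: Optional[int] = None) -> List[str]:
    """Build the argument list for process.py (hypothetical helper)."""
    cmd = ["python3", "process.py", src_file]
    if top_n is not None:
        # -top_n is optional; leaving it out uses the whole word list.
        cmd += ["-top_n", str(top_n)]
    cmd += ["-out", out_file]
    return cmd


def build_scorer_cmd(out_file: str, ref_file: str,
                     verbose: bool = False) -> List[str]:
    """Build the argument list for scorer.py (hypothetical helper)."""
    cmd = ["python3", "scorer.py", "-hyp", out_file, "-ref", ref_file]
    if verbose:
        cmd.append("-v")
    return cmd


# Hypothetical file names, for illustration only.
annotate = build_process_cmd("input.conll", "output.conll", top_n=1000)
score = build_scorer_cmd("output.conll", "reference.conll", verbose=True)
print(shlex.join(annotate))
print(shlex.join(score))
```

Each list can then be passed to subprocess.run(cmd, check=True) from the directory that contains process.py and scorer.py.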
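With our modifications to the source, the F-scores for en and hi improved slightly over the original code on all three datasets. The univ F-score improved on Facebook and dropped slightly on WhatsApp and Twitter. The confusion matrices and scores are as follows: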
--------
WHATSAPP
--------
          en      hi    univ
en       294     420      30
hi        32    1988      37
univ      37     131     249

Old-scores                          New-scores
CLASS   P       R       F1          CLASS   P       R       F1
en      39.516  80.992  53.117      en      39.783  80.992  53.358
hi      96.646  78.299  86.510      hi      96.650  78.417  86.584
univ    59.712  78.797  67.940      univ    59.427  78.797  67.750
--------
FACEBOOK
--------
          en      hi    univ
en     12997     397     530
hi       127    2446     173
univ      90      14    3841

Old-scores                          New-scores
CLASS   P       R       F1          CLASS   P       R       F1
en      93.335  98.350  95.777      en      93.342  98.358  95.785
hi      89.043  85.614  87.295      hi      89.075  85.614  87.310
univ    97.363  84.507  90.481      univ    97.364  84.529  90.494
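The per-class scores can be recomputed from a confusion matrix. The sketch below is not taken from scorer.py; it assumes that rows are the labels assigned by process.py and columns are the reference labels. That orientation is inferred from the numbers rather than stated anywhere in the code, and under it the Facebook matrix reproduces the Facebook New-scores columns. The function name per_class_scores is hypothetical.

```python
from typing import Dict, List, Tuple

LABELS = ["en", "hi", "univ"]

# Facebook confusion matrix copied from the table above.
# Assumption: rows = predicted labels, columns = reference labels.
FACEBOOK = [
    [12997, 397, 530],
    [127, 2446, 173],
    [90, 14, 3841],
]


def per_class_scores(matrix: List[List[int]],
                     labels: List[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return {label: (precision, recall, F1)} as percentages."""
    scores = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        predicted = sum(matrix[i])              # row sum: tokens tagged as label
        actual = sum(row[i] for row in matrix)  # column sum: tokens truly label
        p = 100.0 * tp / predicted if predicted else 0.0
        r = 100.0 * tp / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        scores[label] = (p, r, f1)
    return scores


scores = per_class_scores(FACEBOOK, LABELS)
for label in LABELS:
    p, r, f1 = scores[label]
    print(f"{label:<5} {p:7.3f} {r:7.3f} {f1:7.3f}")
```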
-------
TWITTER
-------
          en      hi    univ
en      3038    1047     227
hi       575    8034     243
univ     119     698    3330

Old-scores                          New-scores
CLASS   P       R       F1          CLASS   P       R       F1
en      70.255  81.324  75.385      en      70.455  81.404  75.535
hi      90.721  82.084  86.187      hi      90.759  82.156  86.243
univ    80.352  87.605  83.822      univ    80.299  87.632  83.805