(alsii) Automated Language detection in Social Interactions on the Internet

Language and Society Project

Group:

  • Pratyaksh Gautam (2020114002)
  • Nukit Tailor (2020114012)

The original code is under the directory code_release/

Data

The Facebook, Twitter and WhatsApp data were all downloaded from: http://amitavadas.com/Code-Mixing.html

Resources

  1. The English word list "resources/EN.words.txt" was downloaded from: http://wordlist.aspell.net/
  2. The Hindi transliteration word list "resources/HI.trans.fire2013.txt" was downloaded from: https://web.archive.org/web/20160312153954/http://cse.iitkgp.ac.in/resgrp/cnerg/qa/fire13translit/
  3. The Hindi word list was compiled by Gupta et al. (2012): http://www.lrec-conf.org/proceedings/lrec2012/pdf/365_Paper.pdf

Running the code

The main annotation script is "process.py". It should be run as follows:

    python3 process.py <src_file> [-top_n int] -out <out_file>

where <src_file> is the input text file in CoNLL format (1 token per line), and <out_file> is the name of the output file that will be generated. The -top_n flag controls how much of the manually created word list is used to classify tokens. By default, the whole word list is used.

To check the scores, run the following command:

    python3 scorer.py -hyp <out_file> -ref <ref_file> [-v]

where <out_file> is the output file generated by process.py, and <ref_file> is the reference file. The optional -v flag prints the scores.
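The two steps can also be chained from Python. The sketch below only assembles the command lines documented above and hands them to subprocess; the helper names (build_process_cmd, build_scorer_cmd, annotate_and_score) and any file names passed to them are hypothetical and are not part of this repository.

```python
import subprocess


def build_process_cmd(src_file, out_file, top_n=None):
    """Assemble the process.py command line described above."""
    cmd = ["python3", "process.py", src_file]
    if top_n is not None:
        # -top_n is optional; leaving it out uses the whole word list.
        cmd += ["-top_n", str(top_n)]
    cmd += ["-out", out_file]
    return cmd


def build_scorer_cmd(hyp_file, ref_file, verbose=False):
    """Assemble the scorer.py command line described above."""
    cmd = ["python3", "scorer.py", "-hyp", hyp_file, "-ref", ref_file]
    if verbose:
        cmd.append("-v")
    return cmd


def annotate_and_score(src_file, out_file, ref_file, top_n=None):
    """Run annotation, then scoring; raises CalledProcessError on failure."""
    subprocess.run(build_process_cmd(src_file, out_file, top_n), check=True)
    subprocess.run(build_scorer_cmd(out_file, ref_file, verbose=True), check=True)
```

Calling annotate_and_score("input.txt", "output.txt", "reference.txt", top_n=500) from the directory that contains both scripts would run them in sequence; the three file names and the value 500 are placeholders.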

Results

With our modifications to the source, the F-scores for the en and hi classes improved slightly over the original code on all three datasets. The univ F-score improved on Facebook but dropped marginally on WhatsApp and Twitter:

--------
WHATSAPP                          
--------
        en      hi      univ
en      294     420     30        
hi      32      1988    37  
univ    37      131     249         

        Old-scores                              New-scores

CLASS   P       R       F1              CLASS   P       R       F1
en      39.516  80.992  53.117          en      39.783  80.992  53.358
hi      96.646  78.299  86.51           hi      96.65   78.417  86.584
univ    59.712  78.797  67.94           univ    59.427  78.797  67.75                       

--------
FACEBOOK
--------
        en      hi      univ
en      12997   397     530
hi      127     2446    173
univ    90      14      3841

        Old-scores                              New-scores

CLASS   P       R       F1              CLASS   P       R       F1
en      93.335  98.35   95.777          en      93.342  98.358  95.785
hi      89.043  85.614  87.295          hi      89.075  85.614  87.31
univ    97.363  84.507  90.481          univ    97.364  84.529  90.494   

--------
TWITTER
--------
        en      hi      univ
en      3038    1047    227
hi      575     8034    243
univ    119     698     3330

        Old-scores                              New-scores

CLASS   P       R       F1              CLASS   P       R       F1
en      70.255  81.324  75.385          en      70.455  81.404  75.535
hi      90.721  82.084  86.187          hi      90.759  82.156  86.243
univ    80.352  87.605  83.822          univ    80.299  87.632  83.805
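
The per-class scores can be re-derived from a confusion matrix. The sketch below does this for the Facebook matrix above, assuming that rows are the predicted labels and columns are the reference labels. That orientation is inferred from the numbers and is not stated by scorer.py, and the function name per_class_scores is hypothetical. Under this assumption, precision is the diagonal count divided by its row sum, recall is the diagonal count divided by its column sum, and F1 is their harmonic mean.

```python
LABELS = ["en", "hi", "univ"]

# Facebook confusion matrix copied from the table above.
# Assumed orientation: rows = predicted label, columns = reference label.
FACEBOOK = [
    [12997, 397, 530],
    [127, 2446, 173],
    [90, 14, 3841],
]


def per_class_scores(matrix, labels):
    """Return {label: (precision, recall, f1)} as percentages."""
    scores = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted = sum(matrix[i])                 # row sum
        actual = sum(row[i] for row in matrix)     # column sum
        p = 100.0 * true_pos / predicted if predicted else 0.0
        r = 100.0 * true_pos / actual if actual else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        scores[label] = (round(p, 3), round(r, 3), round(f1, 3))
    return scores


for label, (p, r, f1) in per_class_scores(FACEBOOK, LABELS).items():
    print(f"{label}\t{p}\t{r}\t{f1}")
```

With this orientation, the values computed for Facebook agree with the New-scores columns of the Facebook table up to rounding.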