Language-Identification

Language identification at the sentence and word level in both monolingual and code-mixed bilingual texts.


Task-1

  • Our task was to develop a model for a language identification service that returns the language code of the language in which a text is written. We had datasets for three languages, Spanish (ES), Portuguese (PT) and English (EN), and had to predict the language of sample texts written in these three languages.

Data

  • We had 3 datasets (en, es, pt) with 2.6 million lines in each file.
  • Files of this size could not be processed at once on our system, so we drew a random sample of 20,000 lines from each of the three files. (command: shuf -n 20000 input_file > output_file)
  • We assigned the language IDs {En: 0, Es: 1, Pt: 2}.
  • Please contact us for access to the data.
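
On systems without shuf, the same kind of uniform sample can be sketched in Python with reservoir sampling, which reads the file once and keeps only k lines in memory. The function name sample_lines, the seed and the file path in the usage comment are illustrative assumptions and are not part of our scripts.

```python
import random

def sample_lines(lines, k, seed=0):
    """Return up to k lines chosen uniformly at random from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Keep the new line with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return reservoir

# Usage with a file (hypothetical path):
# with open("data.en", encoding="utf-8") as f:
#     sample = sample_lines(f, 20000)
```

If the input has fewer than k lines, all of them are returned.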

Pre-Processing

  • All text was converted to lower case.
  • Punctuation was removed.
  • All digits were removed from the sentences.
  • Runs of contiguous white space were replaced by a single space.
  • Hyperlinks were removed.
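
The steps above can be sketched as a single function. This is a minimal sketch and not the exact code from our notebook: the URL pattern is an assumption, string.punctuation only covers ASCII punctuation (so characters such as "¿" or "¡" are left in place), and links are removed first so that stripping punctuation does not break them apart.

```python
import re
import string

# Assumed URL pattern: anything starting with http(s):// or www.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(text):
    text = text.lower()                       # lower-case everything
    text = URL_RE.sub(" ", text)              # remove hyperlinks
    text = text.translate(PUNCT_TABLE)        # remove ASCII punctuation
    text = re.sub(r"\d+", " ", text)          # remove digits
    text = re.sub(r"\s+", " ", text).strip()  # collapse white space
    return text

print(preprocess("Hello, World! 123 see https://example.com/page  now"))
# prints: hello world see now
```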

Representation

  • We used TfidfVectorizer to represent the texts in our corpus.

Data split

  • We split the data 80/20 into training and test sets, while keeping a similar number of instances for each of the three languages in the test set.
  • Train dataset: {En: 14197, Es: 15279, Pt: 15550}
  • Test dataset: {En: 4178, Es: 3503, Pt: 3576}
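
One way to keep the classes balanced in both sets is a stratified split, where each language is shuffled and split separately. The sketch below only illustrates that idea and is not necessarily how our notebook produced the counts above; the function name stratified_split and the seed are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(texts, labels, test_fraction=0.2, seed=0):
    """Split (text, label) pairs so each label keeps the same test fraction."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    train, test = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        n_test = round(len(items) * test_fraction)
        test += [(text, label) for text in items[:n_test]]
        train += [(text, label) for text in items[n_test:]]

    # Shuffle again so the classes are interleaved.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test
```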

ML Model

  • The models were imported from sklearn.
  • We used the LogisticRegression algorithm for classification, with solver='lbfgs'.
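
A minimal sketch of how the representation and the classifier fit together is given below. Only TfidfVectorizer, LogisticRegression and solver='lbfgs' come from our setup; the toy sentences, the use of make_pipeline and max_iter=1000 are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data (made up); labels follow {En: 0, Es: 1, Pt: 2}.
train_texts = [
    "the cat is on the table", "where is the train station",
    "el gato está en la mesa", "dónde está la estación de tren",
    "o gato está em cima da mesa", "onde fica a estação de trem",
]
train_labels = [0, 0, 1, 1, 2, 2]

model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(solver="lbfgs", max_iter=1000),  # max_iter is an assumption
)
model.fit(train_texts, train_labels)

# Predict numeric language IDs for new sentences.
predicted = model.predict(["where is the cat", "onde fica a mesa"])
print(predicted)
```

Wrapping both steps in a pipeline ensures the vocabulary is learned from the training texts only and then reused unchanged at prediction time.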

Results

  • Our system achieved an accuracy of 93%.

  • The confusion matrix for the result is shown below (rows: true label, columns: predicted label):

              0     1     2
    0      3620   299   259
    1        38  3388    77
    2        22    71  3483

  • The classification report is shown below:

           precision  recall  f1-score  support
    En          0.98    0.87      0.92     4178
    Es          0.90    0.97      0.93     3503
    Pt          0.91    0.97      0.94     3576
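
The figures in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes them in plain Python, treating rows as true labels and columns as predicted labels (the row sums match the support column); the helper names are illustrative.

```python
LABELS = ["En", "Es", "Pt"]
CONFUSION = [
    [3620, 299, 259],
    [38, 3388, 77],
    [22, 71, 3483],
]

def per_class_metrics(matrix):
    """Return (precision, recall, f1, support) for each class."""
    n = len(matrix)
    metrics = []
    for k in range(n):
        tp = matrix[k][k]
        predicted_k = sum(matrix[r][k] for r in range(n))  # column sum
        actual_k = sum(matrix[k])                          # row sum (support)
        precision = tp / predicted_k
        recall = tp / actual_k
        f1 = 2 * precision * recall / (precision + recall)
        metrics.append((precision, recall, f1, actual_k))
    return metrics

def accuracy(matrix):
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

for label, (p, r, f1, support) in zip(LABELS, per_class_metrics(CONFUSION)):
    print(f"{label}  {p:.2f}  {r:.2f}  {f1:.2f}  {support}")
print(f"accuracy  {accuracy(CONFUSION):.3f}")
```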

Running the script and results

  • To run our model, use the following command:
  • python3 script_task1.py data.en data.es data.pt langid.test
  • In the above command, the 1st argument should be the English data file, the 2nd the Spanish data file, the 3rd the Portuguese data file and the 4th the test file.
  • test_results.txt will contain the language labels predicted by our model for the langid.test file.
  • The test_results.txt in this repository contains the output for the langid.test file that was provided to us for testing.
  • The tags are numerical: 0: en, 1: es, 2: pt

Task-2

  • Our task was to develop a model that distinguishes between language variants, here European Portuguese (PT-PT) and Brazilian Portuguese (PT-BR).

Data

  • We had 2 datasets (pt-pt and pt-br) with 1.9 million and 1.5 million lines.
  • Because our system was not powerful enough to process such large files at once, we drew random samples of 65,000 lines from pt-br and 50,000 lines from pt-pt. (command: shuf -n N input_file > output_file)
  • We assigned the language IDs {pt-br: 0, pt-pt: 1}.

Pre-Processing

  • All text was converted to lower case.
  • Punctuation was removed.
  • All digits were removed from the sentences.
  • Runs of contiguous white space were replaced by a single space.
  • Hyperlinks were removed.

Representation

  • We used TfidfVectorizer to represent the texts in our corpus.
  • Only the top 6000 words were kept for representing the sentences.
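
With TfidfVectorizer, a vocabulary limit of this kind is usually expressed through the max_features parameter, which keeps the terms with the highest frequency across the corpus. The sketch below assumes that is how the limit was applied and uses a tiny made-up corpus with max_features=4 in place of 6000.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences; the real corpus is the sampled pt-br and pt-pt data.
corpus = [
    "o ônibus chegou atrasado hoje",
    "o autocarro chegou atrasado hoje",
    "você viu o trem passar",
]

# Keep only the 4 most frequent terms (6000 in our setting).
vectorizer = TfidfVectorizer(max_features=4)
matrix = vectorizer.fit_transform(corpus)

print(matrix.shape)                    # (number of sentences, number of kept terms)
print(sorted(vectorizer.vocabulary_))  # the terms that were kept
```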

Data split for model training and testing

  • We split the data 80/20 into training and test sets, while keeping a similar number of instances for both variants in the test set.
  • Total pt-br: 47554 and pt-pt: 50000.
  • Train dataset: {pt-br: 38052, pt-pt: 39991}
  • Test dataset: {pt-br: 9502, pt-pt: 10009}

ML Model

  • The models were imported from sklearn.
  • We used the LogisticRegression algorithm for classification, with solver='lbfgs'.

Results

  • Our system achieved an accuracy of 81.9%.

  • The confusion matrix for the result is shown below (rows: true label, columns: predicted label):

              0     1
    0      7735  1767
    1      1761  8248

  • The classification report is shown below:

           precision  recall  f1-score  support
    pt-br       0.81    0.81      0.81     9502
    pt-pt       0.82    0.82      0.82    10009

Running the script and results

  • To run our model, use the following command:
  • python3 script_task2.py data.pt-br data.pt-pt langid-variants.test
  • In the above command, the 1st argument should be the Brazilian Portuguese (pt-br) file, the 2nd the European Portuguese (pt-pt) file and the 3rd the test file.
  • test_result.txt will contain the language labels predicted by our model for the langid-variants.test file.
  • The test_results.txt in this repository contains the output for the test file that was provided to us.
  • The tags are numerical: 0: pt-br, 1: pt-pt

Task-3

  • Implement a deep learning model (recommended: a BiLSTM tagger) to detect code switching (language mixture) and return both a list of tokens and a list with one language label per token.
  • To simplify the task, our work focused on English and Spanish, so we only needed to return 'en', 'es' or 'other' for each token.

Data

  • For code switching we focus on Spanish and English. The data provided is derived from http://www.care4lang.seas.gwu.edu/cs2/call.html.

  • The data is a collection of tweets. There are three files for the training set and three for the validation set:

  • offsets_mod.tsv: the id information about the tweets, together with the token positions and the gold labels. Columns: {tweet_id, user_id, start, end, gold label}

  • tweets.tsv: the ids and the actual tweet text. Columns: {tweet_id, user_id, tweet text}

  • data.tsv: the combination of the previous two files, with the tokens of each sentence and the associated gold labels. Columns: {tweet_id, user_id, start, end, token, gold label}

The gold labels can be one of three:

  • en
  • es
  • other

For this task, we were required to implement a BiLSTM tagger.


Approach

  • We implemented a BiLSTM model with character embeddings to see how it performs on this task.
  • To encode character-level information, we used character embeddings and an LSTM to encode every word as a vector.
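
Before a character embedding layer can be applied, every token has to be turned into a fixed-length sequence of character indices. The sketch below shows one common way to do this in plain Python. The reserved indices, the helper names and max_len=6 are assumptions for illustration; the embedding and LSTM layers themselves are not shown.

```python
PAD, UNK = 0, 1  # reserved indices for padding and unseen characters

def build_char_index(tweets):
    """Map every character seen in the training tokens to an integer id."""
    chars = sorted({ch for tweet in tweets for token in tweet for ch in token})
    return {ch: i + 2 for i, ch in enumerate(chars)}

def encode_token(token, char_index, max_len):
    """Truncate or pad a token to exactly max_len character ids."""
    ids = [char_index.get(ch, UNK) for ch in token[:max_len]]
    return ids + [PAD] * (max_len - len(ids))

# Made-up tweets, already split into tokens.
tweets = [["I", "love", "tacos"], ["vamos", "al", "cine"]]
char_index = build_char_index(tweets)
encoded = [
    [encode_token(token, char_index, max_len=6) for token in tweet]
    for tweet in tweets
]
print(encoded[1][1])  # the ids for "al", padded with zeros
```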

Data processing

  • Some lines in the *_data.tsv files had " as the token, which prevented the pandas read_csv function from reading the whole file.
  • Hence we removed all lines containing " from both the train and dev data files.
  • We kept the tokens of each tweet together, since this adds context that the model may be able to learn from.
  • We created a list of lists of tuples: each word/token is a tuple with its tag, and each inner list contains all the tuples for the words of a single tweet.
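
The list-of-lists structure can be sketched with the standard csv module. Note that this sketch differs from our script in one respect: it sets quoting=csv.QUOTE_NONE so that " is read as an ordinary token instead of dropping those lines. It also assumes the column order given for data.tsv, no header row, and that the rows of a tweet are contiguous; the sample rows are made up.

```python
import csv
import io
from itertools import groupby

def read_tweets(handle):
    """Group rows by tweet_id into lists of (token, label) tuples."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Expected columns: tweet_id, user_id, start, end, token, gold label
    rows = [row for row in reader if len(row) == 6]
    tweets = []
    for _, group in groupby(rows, key=lambda row: row[0]):
        tweets.append([(row[4], row[5]) for row in group])
    return tweets

# Made-up sample in the data.tsv layout, including a double-quote token.
sample = io.StringIO(
    '1\tu1\t0\t0\tI\ten\n'
    '1\tu1\t2\t4\tamo\tes\n'
    '1\tu1\t6\t6\t"\tother\n'
    '2\tu2\t0\t3\thola\tes\n'
)
tweets = read_tweets(sample)
print(tweets)
```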

Results

  • Our system achieved an accuracy of 96.2% when trained and tested on the train_data.tsv file only.

  • The confusion matrix for the result is shown below (rows: true label, columns: predicted label; 0: Other, 1: En, 2: Es):

              0     1     2
    0      1816    45    26
    1        96  4651    70
    2        65    51  2545

  • The classification report is shown below:

           precision  recall  f1-score  support
    Other       0.92    0.96      0.94     1887
    En          0.98    0.97      0.97     4817
    Es          0.96    0.96      0.96     2661

Final test result

  • Our system achieved an accuracy of 96.5% when trained on the train_data.tsv file and tested on dev_data.tsv file.

  • The confusion matrix for the result is shown below (rows: true label, columns: predicted label; 0: Other, 1: En, 2: Es):

               0      1      2
    0      17929    156    230
    1        876  45412    618
    2        796    469  24715

  • The classification report is shown below:

           precision  recall  f1-score  support
    Other       0.91    0.98      0.95    18315
    En          0.99    0.97      0.98    46906
    Es          0.97    0.95      0.96    25980

Running the system

  • Keep the train and test datasets, in the same format as train_data.tsv, in the same directory as script_task3.py.
  • Run the command python3 script_task3.py train_data.tsv test_data.tsv
  • It will show two images: 1. the training loss and validation loss during training; 2. the confusion matrix.
  • Finally, it prints the confusion matrix and the classification report, along with the accuracy of the model.