The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.
We will use this dataset for language detection across the 21 languages it contains. This can be thought of as a multilingual text classification problem, and we will use character-level features for the task.
For each data unit, we have taken the following pre-processing steps:
- Remove `<>` tags and the content of `()` brackets
- Split the remaining text into sentences on `\n`
The result is then saved to a DataFrame `lang_df` for further processing and modelling.
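A minimal sketch of these steps, assuming the raw Europarl files have already been read into a hypothetical list `raw_corpus` of (document_text, language_code) pairs; the helper name `clean_text` and the column names are illustrative:

```python
import re
import pandas as pd

def clean_text(raw):
    """Strip <...> tags and (...) bracketed content, then split on newlines."""
    text = re.sub(r"<[^>]*>", " ", raw)      # remove tags such as <SPEAKER ...>
    text = re.sub(r"\([^)]*\)", " ", text)   # remove content inside () brackets
    return [line.strip() for line in text.split("\n") if line.strip()]

# raw_corpus is assumed: a list of (document_text, language_code) pairs
rows = [(sentence, lang)
        for doc, lang in raw_corpus
        for sentence in clean_text(doc)]
lang_df = pd.DataFrame(rows, columns=["text", "language"])
```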
Since we are using this dataset for language detection, character-level features are the natural choice: the corpus is multilingual, and we do not need much local domain information (such as sub-level or category classification of the text). Character-level features suit the task much better and yield a unified vocabulary with far less diversification than word-level features.
We will use sklearn's train_test_split to split the data into training and test sets for model validation, CountVectorizer with a char analyzer for the character-level features (X), and LabelEncoder for the language labels (y).
CountVectorizer builds a character-level vocabulary over the whole text data and then uses it to represent each sentence (each unit of pre-processed data) as a vector of character counts.
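A sketch of this feature pipeline, assuming `lang_df` from the pre-processing step; the split ratio and random_state are illustrative:

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# Character-level count features (X): the vocabulary is built over the whole text data
count_vectorizer = CountVectorizer(analyzer="char")
X = count_vectorizer.fit_transform(lang_df["text"])

# Integer-encode the language labels (y)
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(lang_df["language"])

# Hold out a test set for model validation (80/20 split is illustrative)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```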
The Multinomial Naive Bayes classifier is suitable for classification with discrete features, which makes it a good fit for these count-based text features. We fit the model on the generated training features with no hyperparameter tuning. The full flow is: pre-process the data with the functions above, generate the features, train the model, and evaluate it on the test set, as in the sketch below.
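Training and evaluating the MNB model on those features, keeping sklearn's defaults (no hyperparameter tuning):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

mnb = MultinomialNB()        # default smoothing (alpha=1.0), no tuning
mnb.fit(X_train, y_train)

y_pred = mnb.predict(X_test)
print("MNB test accuracy:", accuracy_score(y_test, y_pred))
```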
We also use a Bi-RNN with LSTM cells for language detection.
* 75x2-unit bi-directional LSTM with concatenation of the forward and backward outputs
* Softmax activation
* Cross-entropy loss
* Adam optimizer
* Manual batch training
* learning_rate = 0.01
* n_epoch = 10
* vocab_size = 322
* num_classes = 21
* hidden_dim = 75
* timesteps = 1
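A minimal Keras sketch of this architecture under the hyperparameters above, assuming each sample is its 322-dimensional character-count vector fed at a single timestep (timesteps = 1) and that labels are the integers produced by LabelEncoder; the data reshaping shown in the closing comments is an assumption about the input layout:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Bidirectional, LSTM, Dense
from tensorflow.keras.optimizers import Adam

learning_rate, n_epoch = 0.01, 10
vocab_size, num_classes, hidden_dim, timesteps = 322, 21, 75, 1

model = Sequential([
    # 75 units per direction; merge_mode="concat" gives the 75x2 = 150-dim output
    Bidirectional(LSTM(hidden_dim), merge_mode="concat",
                  input_shape=(timesteps, vocab_size)),
    # Softmax over the 21 language classes
    Dense(num_classes, activation="softmax"),
])
# The sparse cross-entropy variant works directly with integer labels from LabelEncoder
model.compile(optimizer=Adam(learning_rate=learning_rate),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()

# Assumed input layout: reshape count vectors to (n_samples, timesteps, vocab_size)
# X_rnn = X_train.toarray().reshape(-1, timesteps, vocab_size)
# model.fit(X_rnn, y_train, epochs=n_epoch)
```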
Evaluation of all three models on the Fellowship.ai custom dataset:
* MNB (sklearn) accuracy: 0.9707093359216391
* Keras Bi-RNN accuracy: 0.9718167850969849
* TF Bi-RNN accuracy: 0.9703284

* MNB (sklearn) accuracy: 0.9707093359216391
* Keras Bi-RNN accuracy: 0.9720171125440312
* TF Bi-RNN accuracy: 0.9705131168450581