The aim of this project is to predict someone's English proficiency based on a text input.
Goal: Build a machine learning model that predicts each participant's SST score from their transcript (the interviewer's turns are not used).
Database: The NICT JLE Corpus, available here: https://alaginrc.nict.go.jp/nict_jle/index_E.html
The corpus consists of transcripts of audio-recorded speech samples from 1,281 participants (1.2 million words, 300 hours in total) of an English oral proficiency interview test. Each participant received an SST (Standard Speaking Test) score between 1 (low proficiency) and 9 (high proficiency) based on this test.
Our target is to beat the majority-class baseline (SST score 4): 37%.
The preprocessing is done in the extract_words() function:
- Using the HTMLParser class, we extract the participant's transcript (all participant tags).
- We read the participant's SST score from the SST_level tag.
- Inside the participant transcript, we remove all remaining tags, punctuation and names with a regex.
- We keep only English words and remove all stop words (words that are very common but useless for classification).
- Then comes a stemming or lemmatization phase (we started with stemming and later switched to a lemmatizer, which gave the best results).
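The steps above can be sketched as follows. This is a minimal, illustrative version: it assumes the participant's turns sit in `<B>` tags and the score in an `<SST_level>` tag, uses a tiny stand-in stop-word list instead of the full one, and omits the lemmatization phase for brevity:

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word list; the project uses a full English
# stop-word list plus a lemmatizer on top of this.
STOP_WORDS = {"i", "a", "the", "to", "and", "of", "is", "it", "uh"}

class ParticipantParser(HTMLParser):
    """Collects text inside <B>...</B> tags (assumed to be the participant's
    turns) and the score inside the <SST_level> tag."""
    def __init__(self):
        super().__init__()
        self.in_participant = False
        self.in_sst = False
        self.chunks = []
        self.sst_score = None

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self.in_participant = True
        elif tag == "sst_level":
            self.in_sst = True

    def handle_endtag(self, tag):
        if tag == "b":
            self.in_participant = False
        elif tag == "sst_level":
            self.in_sst = False

    def handle_data(self, data):
        if self.in_participant:
            self.chunks.append(data)
        elif self.in_sst:
            self.sst_score = int(data.strip())

def extract_words(transcript):
    """Return (sst_score, cleaned participant tokens) for one transcript."""
    parser = ParticipantParser()
    parser.feed(transcript)
    raw = " ".join(parser.chunks)
    # Keep only alphabetic tokens, lower-cased; this drops punctuation.
    tokens = re.findall(r"[a-zA-Z]+", raw.lower())
    return parser.sst_score, [t for t in tokens if t not in STOP_WORDS]
```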
- Extract features with hesitation words (by hand): hesitation.ipynb -> bad results
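A hand-built hesitation feature could look like the sketch below. The marker list here is an illustrative assumption; the actual hand-picked list lives in hesitation.ipynb:

```python
# Hypothetical hesitation markers; the real hand-picked list is in
# hesitation.ipynb.
HESITATIONS = {"uh", "um", "er", "mmm", "ah", "well"}

def hesitation_features(tokens):
    """Return the ratio of hesitation words to total tokens as a
    one-element feature vector."""
    if not tokens:
        return [0.0]
    count = sum(1 for t in tokens if t in HESITATIONS)
    return [count / len(tokens)]
```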
- Extract features with the Bag of Words technique:
  - By hand: file handmade_bow.py
  - With gensim and DecisionTreeClassifier (sklearn): file gensim_bow.py -> bad results
  - With scikit-learn's TfidfVectorizer and CountVectorizer: file nltk_vectorizer.ipynb
    - Only with SST score classes 4, 5, 6 and 7:
      - SVC on tf-idf vectorizer: accuracy 0.5262
      - SVC on count vectorizer: accuracy 0.5669 (best score yet)
    - With all SST score classes:
      - SVC on tf-idf vectorizer: accuracy 0.4609
      - SVC on count vectorizer: accuracy 0.4811
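The best-performing setup (count-vector features fed to an SVC) can be sketched as below. The toy transcripts and scores are illustrative stand-ins; the real feature extraction, train/test split and evaluation live in nltk_vectorizer.ipynb:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-ins for cleaned participant transcripts and their SST scores.
texts = [
    "go park yesterday",
    "yesterday went park friends talked long time",
    "go school",
    "studying english literature many years enjoy reading novels",
]
scores = [4, 6, 4, 6]

# CountVectorizer turns each transcript into a word-count vector,
# which the SVC then classifies into an SST score.
model = make_pipeline(CountVectorizer(), SVC(kernel="linear"))
model.fit(texts, scores)

pred = model.predict(["went park yesterday"])
```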
- Extract features with the GloVe technique: glove.ipynb
  - We create a 100-dimensional embedding matrix using GloVe co-occurrence-based vectors for our unique words.
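The embedding-matrix construction can be sketched like this; the function and variable names are illustrative (the real code is in glove.ipynb), and words missing from GloVe keep a zero row:

```python
import numpy as np

def build_embedding_matrix(word_index, glove_vectors, dim=100):
    """Build a (vocab_size + 1) x dim matrix whose row i holds the GloVe
    vector of the word with index i. Row 0 is reserved for padding, and
    words absent from the GloVe vocabulary stay all-zero."""
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, i in word_index.items():
        vec = glove_vectors.get(word)
        if vec is not None:
            matrix[i] = vec
    return matrix
```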
  - Using our embedding matrix, we create and train two neural networks (Keras library):
    - Model 1: LSTM + Dense layer (sigmoid activation)
      - val_accuracy: 0.42
    - Model 2: LSTM, GlobalMaxPool1D, Dropout, Dense (relu activation), Dropout (relu activation)
      - val_accuracy: 0.48
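A minimal sketch of Model 2, under assumptions about hyper-parameters (vocabulary size, sequence length, layer widths, dropout rates) that are not stated here; the real architecture is in glove.ipynb:

```python
import numpy as np
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Dense, Dropout, Embedding, GlobalMaxPool1D, LSTM
from tensorflow.keras.models import Sequential

# Illustrative sizes; 9 output classes, one per SST score 1-9
# (labels would be shifted to 0-8 for sparse categorical cross-entropy).
VOCAB_SIZE, EMBED_DIM, MAX_LEN, NUM_CLASSES = 1000, 100, 20, 9

# Stand-in for the GloVe embedding matrix built in the previous step.
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM,
              embeddings_initializer=Constant(embedding_matrix),
              trainable=False),             # frozen GloVe vectors
    LSTM(64, return_sequences=True),        # sequence output for pooling
    GlobalMaxPool1D(),
    Dropout(0.3),
    Dense(64, activation="relu"),
    Dropout(0.3),
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```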