Language Classifier
Every language has its own grammatical patterns, and writers sometimes carry the grammar of another language over into the text they write in English. We would like to use this idea, together with machine learning and natural language processing, to find out whether a writer's language of origin can be predicted from their English text.
Country Code | Country | Writers/Essays | # of Tokens |
---|---|---|---|
ENS* | USA, UK, CAN, AUS, NZ | 200/ 400 | 88,792 |
HKG** | Hong Kong | 100/ 200 | 46,111 |
PAK** | Pakistan | 200/ 400 | 93,100 |
PHL** | Philippines | 200/ 400 | 96,586 |
SIN** | Singapore | 200/ 400 | 96,733 |
CHN*** | China | 400/ 800 | 194,613 |
IDN*** | Indonesia | 200/ 400 | 92,316 |
JPN*** | Japan | 400/ 800 | 176,537 |
KOR*** | Korea | 300/ 600 | 130,626 |
THA*** | Thailand | 400/ 800 | 176,936 |
TWN*** | Taiwan | 200/ 400 | 89,736 |
Total | --- | 2,800/ 5,600 | 1,282,086* |
*Inner Circle
**Outer Circle
***Expanding Circle
Field | Values |
---|---|
S/W | S: Speech; W: Writing |
Country | CHN, ENS, HKG, IDN, JPN, KOR, PAK, PHL, SIN, THA, TWN |
Topic | PTJ: part-time job; SMK: non-smoking |
Trial | 0: Essay; 1: Speech (Trial 1); 2: Speech (Trial 2) |
Serial | 001-999 |
CEFR (for NNS) | A2_0: A2; B1_1: B1 Lower; B1_2: B1 Upper; B2_0: B2+ |
CEFR (for NS) | XX_1: Students; XX_2: Teachers; XX_3: Others |
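
The file-code fields above can be split apart programmatically. The following is a minimal sketch, not the corpus's official tooling: it assumes the fields are joined by underscores in the order S/W, Country, Topic/Trial, Serial, CEFR, and the sample code `W_CHN_PTJ0_001_B1_1` is a made-up illustration. `FileCode` and `parse_file_code` are hypothetical names.

```python
from typing import NamedTuple


class FileCode(NamedTuple):
    mode: str     # "S" (speech) or "W" (writing)
    country: str  # e.g. "CHN", "ENS"
    topic: str    # "PTJ" or "SMK"
    trial: str    # "0" essay, "1"/"2" speech trials
    serial: str   # "001" to "999"
    level: str    # e.g. "A2_0", "B1_1", "XX_1"


def parse_file_code(code: str) -> FileCode:
    # ASSUMPTION: underscore-separated fields. The level itself contains an
    # underscore, so split at most four times and keep the rest as the level.
    mode, country, topic_trial, serial, level = code.split("_", 4)
    # The topic is the first three letters; the trial digit follows it.
    return FileCode(mode, country, topic_trial[:3], topic_trial[3:], serial, level)


if __name__ == "__main__":
    print(parse_file_code("W_CHN_PTJ0_001_B1_1"))
```

Keeping the level as a single field (`B1_1`, `XX_1`) makes it easy to filter essays by proficiency band later on.
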
Term | Meaning |
---|---|
Code | File code |
PTJ (wds) | The number of words in the essay or speech on the part-time job (PTJ) topic |
SMK (wds) | The number of words in the essay or speech on the non-smoking (SMK) topic |
Term | Meaning |
---|---|
Country | Participant's country or area |
Sex | Participant's sex |
Age | Participant's age |
Grade | Participant's school grade (1, 2, 3, 4 |
Major (Occupation) | In case of students, their major at colleges; in case of employed people, their job. |
Academic Genres | Only for students: Humanities, Social Sciences, Science and Technology, and Life Science |
Term | Meaning |
---|---|
Proficiency Test | Test name such as TOEIC or TOEFL |
Score | Score in the test above |
VST | Score in the vocabulary size test (maximum score: 50). This test measures participants' L2 lexical knowledge with a ceiling of 5,000 words. |
CEFR | CEFR levels: A2, B1_1, B1_2, B2+. Estimated from participants' scores in the proficiency test or in the vocabulary size test |
Term | Meaning |
---|---|
INTM | Integrative Motivation Score |
INSM | Instrumental Motivation Score |
INTM+INSM | Strength of Motivation |
INTM-INSM | Integrative Motivation Orientation Score |
Term | Meaning |
---|---|
Primary | How much a participant studies English in their primary school days (1 to 6 points) |
Secondary | How much a participant studies English in their secondary school days (1 to 6 points) |
College | How much a participant studies English in their college days (1 to 6 points) |
Inschool | How much a participant studies English in class (1 to 6 points) |
Outschool | How much a participant studies English outside class, for example at home or in the community (1 to 6 points) |
Listening | How much a participant studies listening (1 to 6 points) |
Reading | How much a participant studies reading (1 to 6 points) |
Speaking | How much a participant studies speaking (1 to 6 points) |
Writing | How much a participant studies writing (1 to 6 points) |
NS | How much a participant has been taught by native English speakers (1 to 6 points) |
Pronunciation | How much a participant has been taught pronunciation (1 to 6 points) |
Presentation | How much a participant has been taught presentation (1 to 6 points) |
Essay Writing | How much a participant has been taught essay writing (1 to 6 points) |
- Average Sentence Length
- Grammar Check
- Spelling Check
- Word Count
- Function Words Count
- POS Bigrams and Trigrams
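
Three of the surface features listed above (word count, average sentence length, function word count) can be computed with the standard library alone. This is a rough sketch under stated assumptions: the sentence splitter is naive, `FUNCTION_WORDS` is a tiny illustrative list rather than the one used in the project, and `surface_features` is a hypothetical helper name. The grammar check, spelling check, and POS n-grams need external tools and are left out.

```python
import re

# ASSUMPTION: a very small illustrative function-word list, not the project's list.
FUNCTION_WORDS = {
    "a", "an", "the", "of", "in", "on", "to", "and", "but", "or",
    "is", "are", "was", "were", "that", "this",
}


def surface_features(text: str) -> dict:
    # Naive sentence split: runs of . ! ? followed by whitespace or end of text.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    # Lowercased alphabetic tokens (apostrophes kept for contractions).
    words = re.findall(r"[a-z']+", text.lower())
    word_count = len(words)
    return {
        "word_count": word_count,
        "avg_sentence_length": word_count / len(sentences) if sentences else 0.0,
        "function_word_count": sum(w in FUNCTION_WORDS for w in words),
    }


if __name__ == "__main__":
    print(surface_features("This is a paper. It is short!"))
```

Dividing the function word count by the word count would make the feature comparable across essays of different lengths.
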
- Select more and different features. Use n-grams, tf-idf, etc.
- Apply dimensionality reduction
- Check again on classifiers
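
To make the n-gram and tf-idf item above concrete, here is a small standard-library sketch. It uses the plain `tf * log(N / df)` weighting as an assumption; libraries typically apply a smoothed variant, so the numbers will not match theirs exactly. `word_ngrams` and `tfidf` are hypothetical helper names.

```python
import math
from collections import Counter


def word_ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, n_values=(1, 2)):
    """docs: list of token lists. Returns one {ngram: weight} dict per document."""
    counts = [Counter(g for n in n_values for g in word_ngrams(doc, n)) for doc in docs]
    # Document frequency: in how many documents each n-gram appears.
    df = Counter(g for c in counts for g in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values())
        # ASSUMPTION: unsmoothed weighting, relative frequency times log(N / df).
        vectors.append({g: (k / total) * math.log(n_docs / df[g]) for g, k in c.items()})
    return vectors


if __name__ == "__main__":
    essays = [["i", "like", "tea"], ["i", "like", "coffee"]]
    for vec in tfidf(essays):
        print(vec)
```

With scikit-learn, `TfidfVectorizer(ngram_range=(1, 2))` followed by `TruncatedSVD` would be one way to cover the n-gram, tf-idf, and dimensionality-reduction items in a single pipeline.
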
- We first convert each essay in our training data to a list of part-of-speech tags using Stanford's part-of-speech tagger [5]. For example, the sentence "This is a paper" would be converted to (determiner, third person verb, determiner, singular noun). We then take consecutive 2-sequences (bigrams) of part-of-speech tags and count the frequency of each bigram across all training essays for a language of origin. Thus, each language has its own model of part-of-speech bigram frequencies. Then, for each essay in our test data, we compute the likelihood of that essay's part-of-speech sequence under each language's model. The prediction is the language with the highest likelihood.
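
The procedure described above can be sketched as follows. The code assumes the essays have already been tagged (Penn Treebank style tags such as `DT`, `VBZ`, `NN` are used for illustration) and treats each bigram's relative frequency as its probability. The add-one smoothing is an assumption added here so that unseen bigrams do not produce a zero likelihood; the description above does not say how that case is handled. All function names are hypothetical.

```python
import math
from collections import Counter


def bigrams(tags):
    """Consecutive pairs of POS tags."""
    return list(zip(tags, tags[1:]))


def train(tagged_essays_by_language):
    """Map each language to a Counter of POS bigram counts over its essays."""
    return {
        lang: Counter(bg for tags in essays for bg in bigrams(tags))
        for lang, essays in tagged_essays_by_language.items()
    }


def log_likelihood(tags, counts, vocab_size):
    total = sum(counts.values())
    # ASSUMPTION: add-one smoothing over the bigram vocabulary.
    return sum(math.log((counts[bg] + 1) / (total + vocab_size)) for bg in bigrams(tags))


def predict(tags, models):
    """Return the language whose bigram model gives the highest log-likelihood."""
    vocab = {bg for counts in models.values() for bg in counts}
    vocab_size = len(vocab) + 1  # one extra slot for bigrams never seen in training
    return max(models, key=lambda lang: log_likelihood(tags, models[lang], vocab_size))


if __name__ == "__main__":
    # Toy data with made-up language labels "A" and "B".
    models = train({
        "A": [["DT", "VBZ", "DT", "NN"], ["DT", "NN", "VBZ"]],
        "B": [["NN", "DT", "VBZ"], ["VBZ", "NN", "DT"]],
    })
    print(predict(["DT", "NN", "VBZ"], models))
```

Working in log space avoids numerical underflow, since the likelihood of a full essay is a product of many small probabilities.
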
- We also need to find the sentence structures (subject-verb-object kind of structures). Still need to look up how to do that!
- Install nltk, textblob, sklearn, pandas
- First output: Logistic Regression gives a 44.9% F1 measure! LOSER! After removing just some features (--removed ones), the F1 measure comes down to 36.36%!
- Removed commented-out features: 0.48087431694
- Information Gain with best splits ready! Either it is not working, or I removed good features. Well, it does work!
- Studying principal component analysis.
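
For reference, the information-gain step mentioned in the notes above can be illustrated with a generic sketch. This is not the project's implementation, which is not shown here: it scores binary splits of a single numeric feature by the reduction in label entropy, trying the midpoints between consecutive distinct values as thresholds. The feature values and labels in the example are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, information_gain) for the best binary split of one feature."""
    base = entropy(labels)
    best = (None, 0.0)
    points = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - remainder
        if gain > best[1]:
            best = (t, gain)
    return best


if __name__ == "__main__":
    # Made-up average sentence lengths for four essays from two countries.
    print(best_split([10, 12, 20, 22], ["JPN", "JPN", "KOR", "KOR"]))
```

Ranking features by their best-split gain gives a simple filter: a feature with a gain near zero separates the classes poorly on its own and is a candidate for removal.
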