NLP Project: Estimate Difficulty Level of Kids Stories

Project Report

1. Business Domain

  • Freadom is an adaptive mobile reading platform that helps children aged 3 - 10, together with their parents, learn to read in English by instilling a daily reading habit.

  • It provides parents and teachers with curated, levelled stories, activities, quizzes and daily positive news to enjoy with the child.

2. Business Problem

  • The stories in the Freadom app need to be simple and easy to understand. To ensure that students gain the most from each reading session, it is imperative that the stories the app recommends are at the student's level of proficiency.

  • If there is a mismatch between a student's proficiency level and the difficulty level of the recommended story, the session will not lead to any learning or improvement in reading proficiency.

3. Analytical Overview

- Target Definition:

  1. Develop a model that estimates the relative difficulty level of the stories based only on their text.

  2. To determine whether a story is at the appropriate level, multiple factors are taken into account, such as the overall length of the story, the average sentence length, and whether the words used are simple or complex.

- Inputs:

  1. 556 text files extracted from stories used in the Freadom app.

  2. The text files are of variable length (a minimal loading sketch follows).
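The sketch below shows one way the raw inputs might be loaded for downstream processing. It assumes the story excerpts live in the "Story text files" directory referenced in Section 5; the helper name load_stories is illustrative and not part of the project code.

```python
from pathlib import Path

import pandas as pd


def load_stories(story_dir: str = "Story text files") -> pd.DataFrame:
    """Read every *.txt story excerpt into a single DataFrame.

    Assumes one excerpt per file; file names are kept so that labels
    and features can be joined back to the source story later.
    """
    rows = []
    for path in sorted(Path(story_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore").strip()
        rows.append({"file_name": path.name, "text": text, "n_chars": len(text)})
    return pd.DataFrame(rows)


if __name__ == "__main__":
    stories = load_stories()
    print(stories.shape)   # expected: 556 rows for the full input set
    print(stories.head())
```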

4. Solution Description

- Lexile Text Measure

  1. A Lexile reader measure represents a person’s reading ability on the Lexile scale. A Lexile text measure represents a text’s difficulty level on the Lexile scale.

  2. The reader’s score on the test is reported as a Lexile measure from a low of 0L to a high of 2000L. When readers score at or below 0L, a BR (Beginning Reader) code is displayed on their report.

- Initial Approach:

  1. The initial approach was to use the Lexile Titles Database™ to train the model and then use the model to predict the Lexile text measure for the given stories. However, this dataset is available only on request.

  2. The Lexile Analyzer tool is based on the Lexile Titles Database™ and provides a Lexile score for the entered text.

  3. This tool was used to acquire the Lexile scores for the given input text files.

- Data Labelling:

  1. Each text file was labelled with its respective Lexile score; the process was automated using the Selenium WebDriver.

  2. The Lexile Analyzer tool requires at least 2 sentences in the entered text and sufficient text length for the calculation [up to 1000 words].

  3. The final labelled dataset consisted of Lexile score ranges for 548 text files.

  4. The Lexile scores for the text files were then mapped to grade levels, on the basis of the mapping defined here, as follows (a minimal mapping sketch is shown after this list):

    1. Lexile Score Range: BR190L - 400L -> Grade < 2

    2. Lexile Score Range: 410L - 1000L -> Grade 2 - Grade 4

    3. Lexile Score Range: 1010L - 1400L -> Grade > 4
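To make the labelling step concrete, below is a minimal sketch of the grade mapping, assuming the Lexile range for each file has already been reduced to a single numeric value; the function name lexile_to_grade and the convention of encoding BR scores as negative numbers are illustrative choices, not part of the project code.

```python
def lexile_to_grade(lexile_score: int) -> str:
    """Map a numeric Lexile score to the grade bands used for labelling.

    BR (Beginning Reader) scores are assumed to be encoded as negative
    numbers (e.g. BR190L -> -190), so they fall into the lowest band.
    """
    if lexile_score <= 400:        # BR190L - 400L
        return "Grade < 2"
    elif lexile_score <= 1000:     # 410L - 1000L
        return "Grade 2 - Grade 4"
    else:                          # 1010L - 1400L
        return "Grade > 4"


# Example usage on a few representative scores
for score in (-190, 350, 560, 1200):
    print(score, "->", lexile_to_grade(score))
```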

- Feature Engineering:

  1. A total of 42 text-based features were created for the 548 input text files, grouped as follows (a sketch showing how representative features can be computed appears after the POS tag list below):

1. Text Analysis Based features:

  1. Number of Words

  2. Number of Sentences

  3. Average Word Length

  4. Average number of words per sentence

  5. Average syllable count per word

  6. Number of Complex Words

  7. Number of Common Words

  8. Average Number of Complex Words Per Sentence

  9. Average Number of Simple Words Per Sentence

  10. Ratio of complex words/common words

  11. Number of Easy words

  12. Number of Difficult words

  13. Average Number of Easy Words Per Sentence

  14. Average Number of Difficult Words Per Sentence

  15. Ratio of Difficult words/Easy words

2. Text Readability Score:

  1. Automated Readability Index

  2. Flesch Reading Ease

  3. Flesch-Kincaid Grade Level

  4. Coleman Liau Index

  5. Gunning Fog Index

  6. SMOG Index

  7. Linsear Write

  8. Dale Chall Readability

3. POS Tag Distribution: frequency of the following part-of-speech tags in each text file

  1. ADJ

  2. ADP

  3. ADV

  4. AUX

  5. CCONJ

  6. CONJ

  7. DET

  8. INTJ

  9. NOUN

  10. NUM

  11. PART

  12. PRON

  13. PROPN

  14. PUNCT

  15. SCONJ

  16. SPACE

  17. SYM

  18. VERB

  19. X
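As a concrete illustration of the three feature groups above, the sketch below computes a handful of representative features for one text using the textstat and spaCy libraries; these particular libraries and the helper name extract_features are assumptions, since the report does not state which packages were used.

```python
from collections import Counter

import spacy
import textstat

nlp = spacy.load("en_core_web_sm")  # small English pipeline for POS tagging


def extract_features(text: str) -> dict:
    """Compute a few representative features from each feature group."""
    doc = nlp(text)
    words = [t for t in doc if t.is_alpha]
    sentences = list(doc.sents)
    pos_counts = Counter(t.pos_ for t in doc)

    features = {
        # 1. Text-analysis-based features
        "Word_Count": len(words),
        "Sentence_Count": len(sentences),
        "Avg_Word_Length": sum(len(t.text) for t in words) / max(len(words), 1),
        "Avg_Words_Per_Sentence": len(words) / max(len(sentences), 1),
        "Avg_Syllables_Per_Word": textstat.syllable_count(text) / max(len(words), 1),
        "No_Difficult_Words": textstat.difficult_words(text),
        # 2. Readability scores
        "Automated_Readability_Index": textstat.automated_readability_index(text),
        "Flesch_Reading_Ease": textstat.flesch_reading_ease(text),
        "FleschKincaid_Grade_Level": textstat.flesch_kincaid_grade(text),
        "Coleman_Liau_Index": textstat.coleman_liau_index(text),
        "Gunning_Fog_Index": textstat.gunning_fog(text),
        "SMOG_Index": textstat.smog_index(text),
        "Dale_Chall_Readability": textstat.dale_chall_readability_score(text),
    }
    # 3. POS tag distribution (relative frequency of each tag)
    total_tokens = max(len(doc), 1)
    for tag in ("ADJ", "ADV", "NOUN", "NUM", "PRON", "SCONJ", "VERB"):
        features[tag] = pos_counts.get(tag, 0) / total_tokens
    return features
```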

- Algorithms and Feature Selection:

1. Baseline Model:

  • Algorithm: Random Forest Classifier

  • Number of Features: 42

  • Evaluation: Cross-Validation [4 fold]

  • Scoring Metric: Accuracy

  • Results: 88.5 %

2. Model 1:

  • Algorithm: Random Forest Classifier

  • Feature Selection Method: SelectFromModel (feature importance based method)

  • Number of Selected Features: 16

  • Evaluation: Cross-Validation [4 fold]

  • Scoring Metric: Accuracy

  • Results: 88.87 %

3. Model 2:

  • Algorithm: Random Forest Classifier

  • Feature Selection Method: SelectFromModel (feature importance based method)

  • Number of Selected Features: 16

  • Parameter Tuning: RandomizedSearchCV

  • Tuned Parameters: n_estimators= 600, min_samples_split= 12, min_samples_leaf= 4, max_features= sqrt, max_depth= 4, bootstrap= False

  • Evaluation: Cross-Validation [4 fold]

  • Scoring Metric: Accuracy

  • Results: 89.42 %

4. Model 3:

  • Algorithm: Support Vector Machine

  • Number of Features: 42

  • Evaluation: Cross-Validation [4 fold]

  • Scoring Metric: Accuracy

  • Results: 83.21 %

5. Model 4:

  • Algorithm: XGBoost Classifier

  • Number of Features: 42

  • Evaluation: Cross-Validation [4 fold]

  • Scoring Metric: Accuracy

  • Results: 87.96 %

6. Model 5:

  • Algorithm: XGBoost Classifier

  • Feature Selection Method: SelectFromModel (feature importance based method)

  • Number of Selected Features: 11

  • Evaluation: Cross-Validation [4 fold]

  • Scoring Metric: Accuracy

  • Results: 88.69 %

7. Model 6:

  • Algorithm: XGBoost Classifier

  • Feature Selection Method: SelectFromModel (feature importance based method)

  • Number of Selected Features: 11

  • Parameter Tuning: RandomizedSearchCV

  • Tuned Parameters: subsample= 0.8, n_estimators= 200, min_child_weight= 8, max_depth= 12, learning_rate= 0.01, colsample_bytree= 0.8

  • Evaluation: Cross-Validation [4 fold]

  • Scoring Metric: Accuracy

  • Results: 89.78 %

8. Model Results: Summary

| Model | Algorithm | Meta Description | Accuracy |
| --- | --- | --- | --- |
| Baseline Model | Random Forest Classifier | Baseline model with all 42 features | 88.5 % |
| Model 1 | Random Forest Classifier | 16 features selected using SelectFromModel (feature-importance-based method) | 88.87 % |
| Model 2 | Random Forest Classifier | 16 features selected using SelectFromModel (feature-importance-based method) and parameter tuning using RandomizedSearchCV | 89.42 % |
| Model 3 | Support Vector Machine | All 42 features | 83.21 % |
| Model 4 | XGBoost Classifier | All 42 features | 87.96 % |
| Model 5 | XGBoost Classifier | 11 features selected using SelectFromModel (feature-importance-based method) | 88.69 % |
| Model 6 | XGBoost Classifier | 11 features selected using SelectFromModel (feature-importance-based method) and parameter tuning using RandomizedSearchCV | 89.78 % |

A minimal sketch of the shared feature-selection, tuning and cross-validation workflow is given below.
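The sketch assumes the engineered features and grade labels are available in dataset_modelling.xlsx; the label column name "Grade_Level", the parameter ranges searched, and the random seeds are illustrative, not the exact values used in the project.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

# Assumed layout: one row per story, 42 feature columns plus a label column.
data = pd.read_excel("dataset_modelling.xlsx")
X = data.drop(columns=["Grade_Level"])   # illustrative label column name
y = data["Grade_Level"]

# Baseline: Random Forest on all features, 4-fold CV accuracy
baseline = RandomForestClassifier(random_state=42)
print("Baseline:", cross_val_score(baseline, X, y, cv=4, scoring="accuracy").mean())

# Model 1: keep only the most important features via SelectFromModel
selector = SelectFromModel(RandomForestClassifier(random_state=42)).fit(X, y)
X_sel = X.loc[:, selector.get_support()]
print("Selected features:", list(X_sel.columns))

# Model 2: tune the Random Forest on the selected features
param_dist = {
    "n_estimators": [200, 400, 600, 800],
    "min_samples_split": [2, 6, 12],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "max_depth": [4, 8, 12, None],
    "bootstrap": [True, False],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=50, cv=4, scoring="accuracy", random_state=42,
)
search.fit(X_sel, y)
print("Best params:", search.best_params_, "CV accuracy:", search.best_score_)
```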

- Results:

1. Final Model Description:

  • Algorithm: XGBoost Classifier

  • Number of Selected Features: 11

  • Selected Features and Importance:

    | Feature | Importance |
    | --- | --- |
    | NOUN | 0.254066 |
    | FleschKincaid_Grade_Level | 0.143781 |
    | Automated_Readability_Index | 0.133037 |
    | Avg_No_Complex_Words_Per_Sentence | 0.112143 |
    | Word_Count | 0.072617 |
    | No_Difficulty_Words | 0.063537 |
    | SCONJ | 0.062241 |
    | PRON | 0.053438 |
    | NUM | 0.045221 |
    | Avg_Word_Length | 0.030411 |
    | Coleman_Liau_Index | 0.029509 |

  • Evaluation: Cross-Validation [4 fold]

  • Scoring Metric: Accuracy

  • Results: 89.78 %
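To make the final configuration concrete, below is a minimal sketch of training and evaluating the chosen XGBoost model with the reported hyperparameters on the 11 selected features. The feature list is copied from the importance table above; the file name dataset_modelling.xlsx comes from Section 5, while the label column name "Grade_Level" and the label encoding step are assumptions.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

data = pd.read_excel("dataset_modelling.xlsx")

selected_features = [
    "NOUN", "FleschKincaid_Grade_Level", "Automated_Readability_Index",
    "Avg_No_Complex_Words_Per_Sentence", "Word_Count", "No_Difficulty_Words",
    "SCONJ", "PRON", "NUM", "Avg_Word_Length", "Coleman_Liau_Index",
]
X = data[selected_features]
# XGBoost expects numeric class labels, so the grade bands are encoded here.
y = LabelEncoder().fit_transform(data["Grade_Level"])  # illustrative column name

# Hyperparameters as reported for Model 6 (found via RandomizedSearchCV)
final_model = XGBClassifier(
    subsample=0.8,
    n_estimators=200,
    min_child_weight=8,
    max_depth=12,
    learning_rate=0.01,
    colsample_bytree=0.8,
)

scores = cross_val_score(final_model, X, y, cv=4, scoring="accuracy")
print("4-fold CV accuracy:", scores.mean())   # reported: 89.78 %
```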

- Sources of Error:

  1. The text data for each file is an excerpt from the story. Since the model's features are derived from the text itself (e.g. number of sentences, number of words per sentence), the choice of excerpt has a significant effect on the predicted label. The predicted label may therefore change for the same story if a different excerpt is used.

  2. Due to the small size of the dataset, there is a class imbalance:

    | Grade Level | Number of Text Files |
    | --- | --- |
    | Grade < 2 | 62 |
    | Grade 2 - Grade 4 | 456 |
    | Grade > 4 | 30 |

    For the minority classes, the model does not have enough examples to learn discriminative features that generalise.

5. Code Files and Description:

| Code File | Language | Description | Input | Output |
| --- | --- | --- | --- | --- |
| browser_automation.py | Python | Automates the data-labelling process for the input text files through Selenium WebDriver | Text files [Story text files/*txt] | dataset.xlsx |
| feature_engineering.py | Python | Feature engineering from the text data | dataset.xlsx, text files [Story text files/*txt] | dataset_modelling.xlsx |
| Feature Selection and Modelling.ipynb | Python | Feature selection and modelling | dataset_modelling.xlsx | |

A hedged sketch of the browser-automation step is shown below.
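For reference, below is a hedged sketch of what browser_automation.py conceptually does: submit each excerpt to the online Lexile Analyzer and record the returned score. The analyzer URL and all element locators here are placeholders, since the real page structure is not documented in this report.

```python
from pathlib import Path

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

ANALYZER_URL = "https://example.com/lexile-analyzer"   # placeholder URL
driver = webdriver.Chrome()

records = []
for path in sorted(Path("Story text files").glob("*.txt")):
    # Analyzer accepts up to 1000 words, so longer excerpts are truncated.
    text = " ".join(path.read_text(encoding="utf-8").split()[:1000])
    driver.get(ANALYZER_URL)

    # Placeholder locators -- the real IDs on the analyzer page will differ.
    driver.find_element(By.ID, "text-input").send_keys(text)
    driver.find_element(By.ID, "analyze-button").click()

    result = WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.ID, "lexile-result"))
    )
    records.append({"file_name": path.name, "lexile_score": result.text})

driver.quit()
pd.DataFrame(records).to_excel("dataset.xlsx", index=False)
```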

6. References