Rate the complexity of literary passages for grades 3-12 classroom use https://www.kaggle.com/c/commonlitreadabilityprize

Primary language: Jupyter Notebook. License: MIT.

CommonLit Readability Prize

Introduction

Intro: Can machine learning identify the appropriate reading level of a passage of text, and help inspire learning? Reading is an essential skill for academic success. When students have access to engaging passages offering the right level of challenge, they naturally develop reading skills. Currently, most educational texts are matched to readers using traditional readability methods or commercially available formulas. However, each has its issues. Tools like Flesch-Kincaid Grade Level are based on weak proxies of text decoding (i.e., characters or syllables per word) and syntactic complexity (i.e., number of words per sentence). As a result, they lack construct and theoretical validity. At the same time, commercially available formulas, such as Lexile, can be cost-prohibitive, lack suitable validation studies, and suffer from transparency issues when the formula's features aren't publicly available.
Data: The dataset consists of Train (12000 rows), Test (1200 rows), Sample_Submission, and Nigerian_State_LGA_Name.
Metrics: F1 score, used to evaluate the algorithm.
ML Task: Binary classification.
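To make the Intro's point about weak proxies concrete, here is a minimal sketch of the Flesch-Kincaid Grade Level formula, which uses only words per sentence and syllables per word. The syllable counter is a crude vowel-group heuristic chosen for illustration (an assumption, not part of this repository), so the scores are approximate.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count runs of consecutive vowels (heuristic)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level from words/sentence and syllables/word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


easy = flesch_kincaid_grade("The cat sat. The dog ran.")
hard = flesch_kincaid_grade(
    "Commercially available readability formulas frequently demonstrate insufficient transparency."
)
print(f"easy passage: {easy:.2f}, hard passage: {hard:.2f}")
```

The formula never looks at meaning, vocabulary familiarity, or text structure, which is the limitation the Intro describes.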

Problems

  1. id - unique ID for the excerpt.
  2. url_legal - URL of the source; blank in the test set.
  3. license - license of the source material; blank in the test set.
  4. excerpt - text whose reading ease is to be predicted.
  5. target - reading ease.
  6. standard_error - measure of the spread of scores among multiple raters for each excerpt; not included in the test data.
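The column layout above can be sketched with the standard library alone. The rows below are made up for illustration and only mirror the field names in the list; the sketch also assumes that the test file omits target as well as standard_error, since target is the value to predict.

```python
import csv
import io

# Made-up rows that only mirror the column layout described above.
TRAIN_CSV = """id,url_legal,license,excerpt,target,standard_error
a1,https://example.org/story,CC BY 4.0,"A short, easy passage.",0.25,0.47
b2,,,"A denser passage with longer sentences.",-1.80,0.52
"""
TEST_CSV = """id,url_legal,license,excerpt
c3,,,"An unseen passage to score."
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


train_rows = load_rows(TRAIN_CSV)
test_rows = load_rows(TEST_CSV)

# Numeric fields arrive as strings and need converting.
for row in train_rows:
    row["target"] = float(row["target"])
    row["standard_error"] = float(row["standard_error"])

# Columns present in train but absent from test.
train_only = set(train_rows[0]) - set(test_rows[0])
print(sorted(train_only))
```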

Solved

  1. Used the RandomOverSampler algorithm to oversample the minority class.
  2. Tried imputing NaNs with Iterative-Imputer and KNN-Imputer.
  3. Took the absolute value of Age to fix negative values.
  4. Deleting duplicated rows lowered the F1 score on the public LB, so I kept them. The private LB later showed that I should have deleted them.
  5. Interestingly, I used the Nigerian_State_LGA_Name dataset to correct the LGA and State names.
  6. Likewise, I did not fix duplicated rows that have different targets.
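Several of the steps above can be sketched on a toy table. The data is made up, the column names Age, State, and target are assumptions taken from the wording of the list, and the oversampling uses plain pandas resampling as a stand-in for imbalanced-learn's RandomOverSampler rather than the notebook's actual code.

```python
import pandas as pd

# Made-up toy data; column names follow the wording of the list above.
df = pd.DataFrame({
    "Age":    [25, -40, 31, 31, 58, 47, 62, 62],
    "State":  ["Lagos", "Kano", "Oyo", "Oyo", "Kano", "Lagos", "Abia", "Abia"],
    "target": [0, 0, 1, 0, 1, 0, 0, 0],
})

# Item 3: treat negative ages as sign errors.
df["Age"] = df["Age"].abs()

# Items 4 and 6: rows that repeat in every feature column.
feature_cols = [c for c in df.columns if c != "target"]
is_dup = df.duplicated(subset=feature_cols, keep=False)

# Duplicated feature rows whose targets disagree.
n_targets = df.groupby(feature_cols)["target"].transform("nunique")
conflicting = df[is_dup & (n_targets > 1)]

# One possible fix (an assumption): drop conflicting rows, then exact duplicates.
clean = df[n_targets == 1].drop_duplicates()


def random_oversample(frame, label="target", seed=0):
    """Resample smaller classes with replacement up to the largest class size."""
    counts = frame[label].value_counts()
    largest = counts.max()
    parts = []
    for cls, n in counts.items():
        part = frame[frame[label] == cls]
        if n < largest:
            extra = part.sample(n=largest - n, replace=True, random_state=seed)
            part = pd.concat([part, extra])
        parts.append(part)
    return pd.concat(parts).reset_index(drop=True)


# Item 1: balance the classes.
balanced = random_oversample(clean)
print(balanced["target"].value_counts())
```

Oversampling is normally applied to the training split only, after any validation split is made, so that copies of the same row do not end up on both sides.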

Unsolved

  1. Did not pay attention to scaling, transforming, or feature selection, which led to overfitting.
  2. Rather than following sound ML practice, I followed what the public LB told me about duplicated rows.
  3. Did not use stacking or boosting ensembles effectively.
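The following is a minimal sketch of how these gaps might be addressed, not what the notebook does: scaling placed inside a pipeline, a small stacking ensemble, and local stratified cross-validation on F1 as an alternative to relying on the public LB. The data is synthetic, and the choice of base models is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the competition data).
X, y = make_classification(
    n_samples=600, n_features=12, weights=[0.8, 0.2], random_state=0
)

# Scaling lives inside the pipeline, so it is re-fit on each training fold only.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# A small stacking ensemble: base models feed a logistic-regression meta-model.
stacking = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

models = {"scaled_logreg": scaled_logreg, "stacking": stacking}

# Stratified folds keep the class ratio stable across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {
    name: cross_val_score(model, X, y, scoring="f1", cv=cv)
    for name, model in models.items()
}

for name, scores in results.items():
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Comparing a change such as removing duplicated rows on these local folds gives a signal that does not depend on the public LB subset.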

Algorithms Used

  1. CatBoost for binary classification.
  2. Iterative-Imputer with ExtraTrees for imputing missing values, after label-encoding the categorical columns.
  3. RandomOverSampler for oversampling the minority class.
  4. Others.
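The imputation step in item 2 could look roughly like the sketch below. It assumes scikit-learn's IterativeImputer with an ExtraTreesRegressor estimator and uses pandas category codes as the label encoding; the toy columns (Age, State, Visits) and all values are made up, and the notebook's actual settings may differ.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (unlocks IterativeImputer)
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.impute import IterativeImputer, KNNImputer

# Made-up toy data with missing values in a numeric and a categorical column.
df = pd.DataFrame({
    "Age":    [25.0, np.nan, 31.0, 58.0, 47.0, np.nan, 36.0, 52.0],
    "State":  ["Lagos", "Kano", np.nan, "Kano", "Lagos", "Oyo", "Oyo", np.nan],
    "Visits": [3.0, 1.0, 4.0, 2.0, 5.0, 2.0, 3.0, 1.0],
})

# Label-encode the categorical column; pandas marks missing values with code -1.
cat = df["State"].astype("category")
codes = cat.cat.codes.astype("float64")
codes = codes.where(codes >= 0)  # turn -1 back into NaN so it gets imputed

numeric = pd.DataFrame({"Age": df["Age"], "State": codes, "Visits": df["Visits"]})

# Each column with gaps is modelled from the other columns, round-robin.
imputer = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=50, random_state=0),
    max_iter=10,
    random_state=0,
)
filled = pd.DataFrame(imputer.fit_transform(numeric), columns=numeric.columns)

# Imputed codes are continuous: round and clip them to valid category indices.
n_categories = len(cat.cat.categories)
state_codes = filled["State"].round().clip(0, n_categories - 1).astype(int)
filled["State"] = cat.cat.categories.to_numpy()[state_codes.to_numpy()]

# KNN-based alternative on the same encoded table.
knn_filled = KNNImputer(n_neighbors=3).fit_transform(numeric)

print(filled)
```

Treating encoded categories as numbers imposes an arbitrary ordering on them, which is a known trade-off of this approach.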

🛠  Tech Tools

  • 👾 Python

  • ⚙️ GitHub Markdown

  • 💻 Windows