osslt-pass-classifier

Predicts whether a student will pass the Ontario standardized literacy test on their first try using logistic regression


Introduction

A logistic regression classifier, trained by gradient descent, that predicts whether a student will pass the Ontario Secondary School Literacy Test (OSSLT) on their first try.

The logistic regression algorithm itself only requires NumPy to run; scikit-learn is also required for its confusion matrix so that the F1 score can be calculated.
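For reference, the general technique looks something like the sketch below. This is a minimal NumPy illustration of logistic regression trained by batch gradient descent, not the code in logistic_regression.py:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, learning_rate=0.1, epochs=1000):
    """Fit weights by batch gradient descent on the log-loss."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a bias column
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = sigmoid(X @ w)                             # predicted probabilities
        w -= learning_rate * X.T @ (p - y) / len(y)    # step along the log-loss gradient
    return w

def predict_proba(X, w):
    return sigmoid(np.hstack([np.ones((X.shape[0], 1)), X]) @ w)

def predict(X, w, threshold=0.5):
    return (predict_proba(X, w) >= threshold).astype(int)
```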

Usage

If you just want to play around, all of the work has already been done! logistic_regression.py is set up to run a k-fold cross-validation test on the current data set, oversampled_output_normalized.csv.
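If you want to reproduce a similar run yourself, the sketch below shows one way a k-fold test could look. The class name LogisticRegression and the column layout of the CSV are assumptions here; the train/predict API is described in the next two paragraphs.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from logistic_regression import LogisticRegression  # assumed class name

data = np.genfromtxt("oversampled_output_normalized.csv", delimiter=",", skip_header=1)
X, y = data[:, 2:], data[:, 1]  # assumes column 0 is Database ID and column 1 is the target

k = 5
folds = np.array_split(np.random.permutation(len(y)), k)
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    lr = LogisticRegression()
    lr.train(X[train_idx], y[train_idx])
    tn, fp, fn, tp = confusion_matrix(y[test_idx], lr.predict(X[test_idx])).ravel()
    print(f"fold {i}: F1 = {2 * tp / (2 * tp + fp + fn):.3f}")  # F1 from the confusion matrix
```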

Training the classifier on new data is as simple as lr.train(X, y), where lr is the logistic regression object, X is the training set, and y is the target set.

Using the classifier is just lr.predict(data), where data is a 2D NumPy array containing all the instances to predict. Alternatively, lr.predict_proba(data) also works if you want to see the confidence of the prediction for each instance.
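Side by side, the two calls might be used like this (a sketch continuing from the cross-validation example above; the exact return shapes are assumptions):

```python
# Continuing from the cross-validation sketch above, with lr already trained:
new_students = X[:3]                     # any 2D NumPy array of instances works
labels = lr.predict(new_students)        # hard 0/1 pass-on-first-try predictions
scores = lr.predict_proba(new_students)  # confidence of the prediction for each instance
```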

Dataset

In oversampled_output_normalized.csv, the data is formatted as follows:

Database ID | Passed on First Try | Sex | English as Native Language | Grade 9 English Average | Academic | Applied | Locally Developed | IB

Passed on First Try is used as the target set (1 for passing on the first try, 0 for failing).

In Ontario high schools, Academic, Applied, and Locally Developed are different difficulty levels of the courses that students take, from hardest to easiest respectively. IB is the International Baccalaureate programme that some students also have the option of taking (a worldwide programme considered to be more rigorous than the other three). A student's level is determined by analyzing the course code of their grade 9 English course. These level columns are one-hot encoded: 1 for the level the student is enrolled in, with all other levels marked as 0.

Sex is 1 for male, 0 for female.

Grade 9 English Average is the student's mark, originally on a 0 to 100 scale, scaled down to the range -1 to 1.
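To make the encoding concrete, a single made-up student row might look like the sketch below, with the Database ID dropped and the target kept separate; the feature order is assumed to follow the header above:

```python
import numpy as np

# Assumed feature order (Database ID dropped, Passed on First Try kept as the target):
# Sex, English as Native Language, Grade 9 English Average, Academic, Applied, Locally Developed, IB
student = np.array([[
    1,     # male
    0,     # English is not their native language
    0.4,   # grade 9 English average, rescaled from 0-100 down to -1 to 1
    1, 0, 0, 0,  # enrolled in Academic, so Applied, Locally Developed, and IB are 0
]])

# For a trained classifier lr, lr.predict(student) would give the hard pass/fail
# prediction and lr.predict_proba(student) the confidence.
```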

If you would like to try feature engineering on your own, the raw data has been included as .txt files; dataset_decoder.py may give you a starting point.