UIUC Course GPA Predictor

Created By: Michael Shea

Goal

The goal of this project is to predict the average GPAs of current and future courses at UIUC using previous GPA data and machine learning.

Desired Outcome

I hope students will be able to use these predictions to help decide which classes to take in an upcoming semester, much as they already do with visualizations of previous GPA data. This requires the predicted data to be visualized as well, which I will work on once the model and predictions are complete.

How to Reproduce My Results

  1. Fork the project. Make sure all Python libraries imported in the code are installed on your local machine. Node.js is needed to run the JavaScript files.
  2. Open and run data_cleanup.py. It should create a file called filteredComplete.csv. This file contains all the training and testing data from the previous semesters for which data is available.
  3. Open and run GetMajorData.js, which is located in the future courses folder. This should create the file MajorData.json, which holds the JSON response from the Course Explorer API describing all the majors that courses will be offered for. This data is needed for the next step.
  4. Open and run GetCoursesByMajor.js, which is located in the future courses folder. This will take a few minutes to run. Sometimes the server will time out and you will get an error. If that happens, rerun the script until it runs all the way through. For each major found in MajorData.json, it saves all of the course info for that major for the semester specified in the code. This data is stored in JSON format in the folder MajorsData.
  5. Open and run remove_bad_majors.py, which is located in the future courses folder. Some majors have no data at all and would cause an error in the next step, so this script deletes their files.
  6. Open and run FutureCourses.js, which is located in the future courses folder. This will create a file called course_teacher.csv. This file contains all of the course data for the semester specified in the code, in the same format as the data created in step 2.
  7. Open and run NextYearData.py. This will create a file called course_teacher_full.csv. This is the same as course_teacher.csv, except that the teachers' names are in the correct format. Unknown teachers will have '-1' as their value instead.
  8. Now it is time to create a model using classifier.py. This uses the fastai library to train a deep neural network with three hidden layers of sizes 1000, 1000, and 500 and dropout rates of 0.001, 0.01, and 0.02. The tanh activation function is used at every layer except the last one, which uses softmax. These are the settings that produced the best results in my experimentation.

     I used a technique called categorical embeddings on the input features. Traditionally, one-hot encodings (OHE) are used on categorical features such as the ones used here. But OHE cannot capture relationships between categories: every encoded vector is all zeros except for a single one, so every category looks equally different from every other. With embeddings, which are inspired by word embeddings in NLP, the relationships between the categories are learned by the model. In effect, the first layer of the neural network is an embedding matrix.

     After running this script, two files will be created. The first one is viz_prediction.csv, which contains three columns. The first column is the GPAs from the validation set, which is defined in the code to be the last semester on record. The second column is the predicted GPAs, and the third column is the difference between the first two. The second file is {Semester}{Year}Predictions.csv, which is in the same format as filteredComplete.csv. Here, the data is for the future semester specified in the code, and the GPA column contains the predictions.
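
To make the embedding idea in step 8 more concrete, here is a minimal sketch in plain Python. It is not the code in classifier.py and does not use fastai. The teacher names, the embedding size, and the helper names (one_hot, embed, one_hot_times_matrix) are all assumptions made up for illustration. It only shows that an embedding lookup gives the same result as multiplying a one-hot vector by a matrix of learnable weights, which is why the embedding can be thought of as the first layer of the network.

```python
import random

# Hypothetical category values; '-1' stands for an unknown teacher as in step 7.
teachers = ["-1", "Smith, J", "Lee, K", "Garcia, M"]
index_of = {name: i for i, name in enumerate(teachers)}

EMBEDDING_DIM = 3  # illustrative size only, not the project's setting

random.seed(0)
# One row of weights per category. In a real model, training updates these rows
# so that similar categories end up with similar vectors.
embedding_matrix = [
    [random.uniform(-0.1, 0.1) for _ in range(EMBEDDING_DIM)]
    for _ in teachers
]


def one_hot(name):
    """Return the one-hot encoding: all zeros except a single one."""
    vec = [0.0] * len(teachers)
    vec[index_of[name]] = 1.0
    return vec


def embed(name):
    """Look up the dense vector for a category (a row of the matrix)."""
    return embedding_matrix[index_of[name]]


def one_hot_times_matrix(name):
    """Multiply the one-hot vector by the embedding matrix."""
    encoded = one_hot(name)
    return [
        sum(encoded[i] * embedding_matrix[i][d] for i in range(len(teachers)))
        for d in range(EMBEDDING_DIM)
    ]


if __name__ == "__main__":
    print(one_hot("Lee, K"))
    print(embed("Lee, K"))
    print(one_hot_times_matrix("Lee, K"))
```

The last two printed vectors match, so the lookup is just a cheaper way of doing the matrix multiplication. The one-hot vector itself never changes, while the rows of the matrix are what the model learns.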
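
The three-column layout of viz_prediction.csv described in step 8 can be sketched as follows. This is an illustration under stated assumptions, not the code in classifier.py: the column names, the helper names (build_rows, to_csv_text), and the GPA values are made up, and I am assuming the difference is computed as actual minus predicted, which the description above does not specify.

```python
import csv
import io


def build_rows(actual_gpas, predicted_gpas):
    """Pair each validation GPA with its prediction and their difference."""
    if len(actual_gpas) != len(predicted_gpas):
        raise ValueError("actual and predicted lists must have the same length")
    # Assumption: difference = actual - predicted (the sign is not documented).
    return [(a, p, round(a - p, 4)) for a, p in zip(actual_gpas, predicted_gpas)]


def to_csv_text(rows):
    """Render the rows as CSV text with a hypothetical header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["actual_gpa", "predicted_gpa", "difference"])
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up values purely for illustration; these are not project results.
rows = build_rows([3.5, 2.75], [3.25, 3.0])
csv_text = to_csv_text(rows)

if __name__ == "__main__":
    print(csv_text)
```

Averaging the absolute values of the third column would give a quick sense of how far off the predictions are on the validation semester.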