HackRush'22 | Team PRAY Submission
Problem Statement
In an alternate universe, due to the unbalanced workload among the faculty of Stanford University, the university is suffering from high faculty attrition and has become an object of mockery among the public. There has also been a petition to rename it to Standard University due to this mismanagement.
To combat the inevitable backlash, the university aims to build a system that can predict the number of students that will enrol in a course in a given academic year. Such a system will not only allow the university's stakeholders to recruit faculty smartly to balance faculty workload, but also gauge the students' interest in a given course to decide whether it should be offered at all.
Now, the question is: how do we build such a system? That's where you come in!
In this challenge, you will develop a model that tries to forecast the future total student enrolment for courses offered at the university based on the historic enrolment trend of the last 200 years.
Our approach
Feature Engineering
- Timestep
  - dtype is object; we converted it to numerical
  - Made a new column named "Year" containing the first academic year
  - e.g. "AY1810-AY1811" (dtype object) converted into 1810 (dtype int)
- Course and Faculty
  - dtype is object; we converted them to numerical
  - Used one-hot encoding
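The two transformations above can be sketched with pandas (the sample frame below is hypothetical; only the column names `Timestep`, `Course`, and `Faculty` come from the dataset):

```python
import pandas as pd

# Hypothetical sample mirroring the dataset's columns
df = pd.DataFrame({
    "Timestep": ["AY1810-AY1811", "AY1811-AY1812"],
    "Course": ["CS101", "MA101"],
    "Faculty": ["Prof. A", "Prof. B"],
})

# "AY1810-AY1811" (object) -> 1810 (int): keep only the first academic year
df["Year"] = df["Timestep"].str.slice(2, 6).astype(int)

# One-hot encode the categorical Course and Faculty columns
df = pd.get_dummies(df, columns=["Course", "Faculty"])
```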
Feature Selection
Dropped the following columns: 'Id', 'Timestep', 'Course', 'Faculty'
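With `Year` and the one-hot columns in place, the identifier and raw categorical columns carry no extra signal, so they can be dropped (a minimal sketch; the toy frame only reproduces the column structure):

```python
import pandas as pd

# Hypothetical frame after feature engineering (only the structure matters here)
df = pd.DataFrame({
    "Id": [0, 1],
    "Timestep": ["AY1810-AY1811", "AY1811-AY1812"],
    "Course": ["CS101", "CS101"],
    "Faculty": ["Prof. A", "Prof. B"],
    "Year": [1810, 1811],
})

# Drop columns that are now redundant or non-numeric
X = df.drop(columns=["Id", "Timestep", "Course", "Faculty"])
```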
Model Building
Used CatBoostRegressor with the following hyperparameters:
- learning_rate = 0.75
- depth = 8
- n_estimators = 2000
Added a constant bias of 25 to the predictions.
Models we tried
- Linear Regression (scikit-learn)
- Random Forest Regression (scikit-learn)
- CatBoostRegressor (CatBoost)
- Sequential model (TensorFlow)
Tuning Hyperparameters
- n_estimators: number of trees in the forest
- depth: depth of each tree
- learning_rate: determines the step size at each iteration while moving toward a minimum of the loss function
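A simple grid search over these three hyperparameters can be sketched as below. The grid values are illustrative, not the ones we searched, and the sketch uses scikit-learn's GradientBoostingRegressor (where `depth` is called `max_depth`) so it runs without CatBoost installed:

```python
from itertools import product

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for the enrolment features/targets
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] * 5 + X[:, 1] ** 2 + rng.normal(size=300)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Illustrative grid over the three hyperparameters described above
grid = {
    "n_estimators": [50, 200],
    "max_depth": [4, 8],           # "depth" in CatBoost terms
    "learning_rate": [0.1, 0.75],
}
best = None
for n, d, lr in product(*grid.values()):
    m = GradientBoostingRegressor(
        n_estimators=n, max_depth=d, learning_rate=lr, random_state=0
    )
    m.fit(X_tr, y_tr)
    mse = mean_squared_error(y_val, m.predict(X_val))
    # Keep the combination with the lowest validation error
    if best is None or mse < best[0]:
        best = (mse, {"n_estimators": n, "max_depth": d, "learning_rate": lr})
```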
Challenges we faced!
- Faculty and Course are given as labels, but machine learning models require numerical input
- Normalizing inputs/outputs
- Splitting data for training and testing
- Finding good hyperparameters for our model
- High training times
- Overfitting
- Presence of outliers and missing entries
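Several of these issues have standard scikit-learn/pandas remedies; a minimal sketch (the toy frame, the median imputation, and the 25% split are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame with a missing entry, standing in for the real enrolment data
df = pd.DataFrame({
    "Year": [1810, 1811, 1812, 1813],
    "Enrolment": [30.0, np.nan, 42.0, 50.0],
})

# Missing entries: fill with the column median (robust to outliers)
df["Enrolment"] = df["Enrolment"].fillna(df["Enrolment"].median())

# Normalise inputs so features share a common scale
X = StandardScaler().fit_transform(df[["Year"]])
y = df["Enrolment"].to_numpy()

# Hold out a test split for honest evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```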
References
- sklearn.ensemble.RandomForestRegressor (scikit-learn documentation)
- CatBoostRegressor (CatBoost documentation)
- tf.keras.Sequential (TensorFlow documentation)
- Stack Overflow
- Towards Data Science
Contributors
Yash Meshram
Anupam Kumar
Pradeep Saini
Robin Kumar