HackRush'22 | Team PRAY Submission

Problem Statement

In an alternate universe, due to the unbalanced workload among the faculty of Stanford University, the university is suffering from high faculty attrition and has become an object of mockery among the public. There has also been a petition to rename it to Standard University due to this mismanagement.

To combat the inevitable backlash, the university aims to build a system that can predict the number of students who will enrol in a course in a given academic year. Such a system will not only allow the university's stakeholders to recruit faculty smartly and balance faculty workload, but also gauge students' interest in a given course to decide whether it should be offered.

Now the question is: how do we build such a system? That's where you come in!

In this challenge, you will develop a model that tries to forecast the future total student enrolment for courses offered at the university based on the historic enrolment trend of the last 200 years.

Our approach

Feature Engineering

  • Timestep
    • The dtype is object, so we converted it to a numerical value
    • Created a new column named "Year" containing the first academic year
    • e.g. "AY1810-AY1811" (dtype object) is converted to 1810 (dtype int)
  • Course and Faculty
    • The dtype is object, so we converted them to numerical values
    • Used one-hot encoding
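
The two steps above can be sketched with pandas roughly as follows. This is a minimal illustration on made-up rows, not the team's actual notebook code: the sample values, the regular expression, and the `encoded` variable name are assumptions. The dummy columns are appended with `pd.concat` so that the original 'Course' and 'Faculty' columns are still present to be dropped during feature selection.

```python
import pandas as pd

# Toy rows shaped like the competition data; the values are made up.
df = pd.DataFrame({
    "Timestep": ["AY1810-AY1811", "AY1811-AY1812", "AY1812-AY1813"],
    "Course": ["C1", "C2", "C1"],
    "Faculty": ["F1", "F1", "F2"],
})

# "AY1810-AY1811" -> 1810: capture the four digits after the first "AY".
df["Year"] = df["Timestep"].str.extract(r"AY(\d{4})", expand=False).astype(int)

# One-hot encode the label columns and append the result.
# The original object columns are kept here and can be dropped later.
dummies = pd.get_dummies(df[["Course", "Faculty"]], dtype=int)
encoded = pd.concat([df, dummies], axis=1)
```

After this, `encoded` holds an integer "Year" column plus one 0/1 column per distinct course and faculty label (for example `Course_C1`).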

Feature Selection

Dropped the following columns: 'Id', 'Timestep', 'Course', 'Faculty'

Model Building

Used CatBoostRegressor with the following hyperparameters:

  • learning_rate = 0.75
  • depth = 8
  • n_estimators = 2000

Added a bias of 25
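
A minimal sketch of this setup is shown below. It assumes the bias is a constant added to the model's predictions, which the text above does not spell out. The helper names `fit_and_predict` and `add_bias`, the `verbose=0` setting, and the lazy import are all illustrative choices, not the team's code.

```python
import numpy as np

# Hyperparameters as listed above.
PARAMS = {"learning_rate": 0.75, "depth": 8, "n_estimators": 2000}
BIAS = 25  # assumption: applied as a constant offset on the predictions


def add_bias(preds, bias=BIAS):
    """Shift every prediction by a constant offset."""
    return np.asarray(preds, dtype=float) + bias


def fit_and_predict(X_train, y_train, X_test):
    """Train CatBoost with the listed hyperparameters and return biased predictions."""
    # Imported here so add_bias can be used without catboost installed.
    from catboost import CatBoostRegressor

    model = CatBoostRegressor(**PARAMS, verbose=0)
    model.fit(X_train, y_train)
    return add_bias(model.predict(X_test))
```

For example, `add_bias([100.0, 0.0])` shifts both values up by 25.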

Models we used

  • Linear Regression (scikit-learn)
  • Random Forest Regression (scikit-learn)
  • CatBoostRegressor (CatBoost)
  • Sequential model (TensorFlow)
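
One way to compare candidates like these is to fit each on the same training split and score it on a held-out split. The sketch below does this for the two scikit-learn models only, on synthetic data. The data, the 80/20 split, the RMSE metric, and the forest size are assumptions made for illustration, since the write-up does not state how the models were compared.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the engineered features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 50 * X[:, 0] + 10 * X[:, 1] ** 2 + rng.normal(0, 1, size=200)

# Hold out 20% of the rows for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each candidate and record its validation RMSE (lower is better).
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    scores[name] = float(np.sqrt(mean_squared_error(y_val, preds)))
```

The `scores` dictionary then maps each model name to its validation RMSE, which makes it easy to add further candidates to the same loop.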

Tuning Hyperparameters

  • n_estimators: number of trees in the ensemble
  • depth: maximum depth of each tree
  • learning_rate: the step size taken at each iteration while moving toward a minimum of the loss function
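
A simple way to search over these three hyperparameters is an exhaustive grid search. The sketch below is generic and hypothetical: `grid_search` and `toy_evaluate` are illustrative names, the grid values are examples, and the scoring function is an artificial stand-in. In practice `evaluate` would train a model with the given parameters and return its validation error.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every combination in `grid`; return (best_params, best_score).

    `evaluate` maps a parameter dict to a score where lower is better.
    """
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Example grid; these candidate values are assumptions.
grid = {
    "n_estimators": [500, 1000, 2000],
    "depth": [6, 8, 10],
    "learning_rate": [0.1, 0.5, 0.75],
}


def toy_evaluate(params):
    # Artificial score so the example runs without training anything.
    # It says nothing about real validation error.
    return (
        abs(params["n_estimators"] - 1000) / 1000
        + abs(params["depth"] - 6)
        + abs(params["learning_rate"] - 0.1)
    )


best_params, best_score = grid_search(toy_evaluate, grid)
```

With 3 values per parameter this evaluates 27 combinations, so the cost grows quickly as the grid widens, which ties in with the long training times noted among the challenges.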

Challenges we faced!

  • Faculty and Course are given as labels, but machine learning models require numerical data
  • Normalizing inputs and outputs
  • Splitting the data for training and testing
  • Finding good hyperparameters for our model
  • Long training times
  • Overfitting
  • Presence of outliers and missing entries
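
For the missing entries, outliers, and normalization mentioned above, one common recipe is median imputation, clipping to the 1.5 × IQR fences, and min-max scaling. The sketch below shows that recipe on a made-up column. It is an assumption about how such issues could be handled, not a record of what the team did, and the column name "Enrolment" is hypothetical.

```python
import pandas as pd

# Made-up column with one missing entry and one obvious outlier.
s = pd.Series([120.0, 135.0, None, 128.0, 5000.0, 131.0], name="Enrolment")

# 1. Fill missing entries with the median, which is robust to the outlier.
filled = s.fillna(s.median())

# 2. Clip values outside the 1.5 * IQR fences.
q1, q3 = filled.quantile(0.25), filled.quantile(0.75)
iqr = q3 - q1
clipped = filled.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# 3. Min-max normalize to the range [0, 1].
scaled = (clipped - clipped.min()) / (clipped.max() - clipped.min())
```

If the target is scaled this way, predictions have to be mapped back to the original range before they are reported.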

Contributors

Yash Meshram
Anupam Kumar
Pradeep Saini
Robin Kumar