HackRush'22 | Team PRAY Submission
Problem Statement
In an alternate universe, due to the unbalanced workload among the faculty of Stanford University, the university is suffering from high faculty attrition and has become an object of mockery among the public. There has also been a petition to rename it to Standard University due to this mismanagement.
To combat the inevitable backlash, the university aims to build a system that can predict the number of students that will enrol in a course in a given academic year. Such a system will not only allow the university's stakeholders to recruit faculty smartly to balance faculty workload, but also gauge the students' interest in a given course to decide whether it should be offered at all.
Now, the question is: how do we build such a system? That's where you come in!
In this challenge, you will develop a model that tries to forecast the future total student enrolment for courses offered at the university based on the historic enrolment trend of the last 200 years.
Our approach
Feature Engineering
- Timestep
  - dtype is object; we converted it to numerical
  - Made a new column named "Year" containing the first academic year
  - e.g. "AY1810-AY1811" (dtype object) converted into 1810 (dtype int)
- Course and Faculty
  - dtype is object; we converted them to numerical
  - Used one-hot encoding
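The two transformations above can be sketched with pandas (the sample frame below is hypothetical; only the column names `Timestep`, `Course`, and `Faculty` come from the dataset):

```python
import pandas as pd

# Hypothetical sample mirroring the dataset's columns
df = pd.DataFrame({
    "Timestep": ["AY1810-AY1811", "AY1811-AY1812"],
    "Course": ["CS101", "MA101"],
    "Faculty": ["Prof. A", "Prof. B"],
})

# "AY1810-AY1811" (object) -> 1810 (int): keep only the first academic year
df["Year"] = df["Timestep"].str.slice(2, 6).astype(int)

# One-hot encode the categorical Course and Faculty columns
df = pd.get_dummies(df, columns=["Course", "Faculty"])
```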
Feature Selection
Dropped the following columns: 'Id', 'Timestep', 'Course', 'Faculty'
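With `Year` and the one-hot columns in place, the identifier and raw categorical columns carry no extra signal, so they can be dropped (a minimal sketch; the toy frame only reproduces the column structure):

```python
import pandas as pd

# Hypothetical frame after feature engineering (only the structure matters here)
df = pd.DataFrame({
    "Id": [0, 1],
    "Timestep": ["AY1810-AY1811", "AY1811-AY1812"],
    "Course": ["CS101", "CS101"],
    "Faculty": ["Prof. A", "Prof. B"],
    "Year": [1810, 1811],
})

# Drop columns that are now redundant or non-numeric
X = df.drop(columns=["Id", "Timestep", "Course", "Faculty"])
```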
Model Building
Used CatBoostRegressor with the following hyperparameters:
- learning_rate = 0.75
- depth = 8
- n_estimators = 2000
Added a constant bias of 25 to the predictions.
Models we tried
- Linear Regression (scikit-learn)
- Random Forest Regression (scikit-learn)
- CatBoostRegressor (CatBoost)
- Sequential model (TensorFlow)
Tuning Hyperparameters
- n_estimators: number of trees in the forest
- depth: depth of each tree
- learning_rate: determines the step size at each iteration while moving toward a minimum of the loss function
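A simple grid search over these three hyperparameters can be sketched as below. The grid values are illustrative, not the ones we searched, and the sketch uses scikit-learn's GradientBoostingRegressor (where `depth` is called `max_depth`) so it runs without CatBoost installed:

```python
from itertools import product

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for the enrolment features/targets
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] * 5 + X[:, 1] ** 2 + rng.normal(size=300)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Illustrative grid over the three hyperparameters described above
grid = {
    "n_estimators": [50, 200],
    "max_depth": [4, 8],           # "depth" in CatBoost terms
    "learning_rate": [0.1, 0.75],
}
best = None
for n, d, lr in product(*grid.values()):
    m = GradientBoostingRegressor(
        n_estimators=n, max_depth=d, learning_rate=lr, random_state=0
    )
    m.fit(X_tr, y_tr)
    mse = mean_squared_error(y_val, m.predict(X_val))
    # Keep the combination with the lowest validation error
    if best is None or mse < best[0]:
        best = (mse, {"n_estimators": n, "max_depth": d, "learning_rate": lr})
```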
Challenges we faced!
- Faculty and Course are given as labels, but machine learning models require numerical input
- Normalizing inputs/outputs
- Splitting data for training and testing
- Finding good hyperparameters for our model
- High training times
- Overfitting
- Presence of outliers and missing entries
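Several of these issues have standard scikit-learn/pandas remedies; a minimal sketch (the toy frame, the median imputation, and the 25% split are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame with a missing entry, standing in for the real enrolment data
df = pd.DataFrame({
    "Year": [1810, 1811, 1812, 1813],
    "Enrolment": [30.0, np.nan, 42.0, 50.0],
})

# Missing entries: fill with the column median (robust to outliers)
df["Enrolment"] = df["Enrolment"].fillna(df["Enrolment"].median())

# Normalise inputs so features share a common scale
X = StandardScaler().fit_transform(df[["Year"]])
y = df["Enrolment"].to_numpy()

# Hold out a test split for honest evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```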
References
- sklearn.ensemble.RandomForestRegressor (scikit-learn documentation)
- CatBoostRegressor (CatBoost documentation)
- tf.keras.Sequential (TensorFlow documentation)
- Stack Overflow
- Towards Data Science
Contributors
Yash Meshram
Anupam Kumar
Pradeep Saini
Robin Kumar