Correlation matrix of the independent variables with the dependent variable
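As a rough illustration of how such a correlation matrix can be reduced to the column that matters (each independent variable against the dependent variable), here is a minimal sketch with pandas. The data is synthetic, and every column name except `ArrDelay` is a hypothetical stand-in rather than a column confirmed by this project.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the flight data (assumption: DepDelay and
# Distance are hypothetical feature names, ArrDelay is the target).
rng = np.random.default_rng(0)
n = 500
dep_delay = rng.normal(10, 20, n)
df = pd.DataFrame({
    "DepDelay": dep_delay,
    "Distance": rng.uniform(100, 2500, n),
    "ArrDelay": dep_delay + rng.normal(0, 5, n),
})

# Pearson correlation of every independent variable with the target,
# strongest positive correlation first.
corr_with_target = (
    df.corr()["ArrDelay"]
    .drop("ArrDelay")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Dropping the target's own row avoids the trivial self-correlation of 1.0 dominating the ranking.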
- Random Forest Regressor
- Random Forest Classifier
- Naive Bayes Gaussian Classifier
Significance of the variables in predicting the Arrival Delay as identified by the Random Forest model:
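One common way to obtain such a ranking is the impurity-based `feature_importances_` attribute of a fitted scikit-learn Random Forest. The sketch below assumes that approach and uses synthetic data with hypothetical feature names; it is not the project's actual feature set or result.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features (assumption: all three names are hypothetical).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "DepDelay": rng.normal(10, 20, n),
    "Distance": rng.uniform(100, 2500, n),
    "TaxiOut": rng.uniform(5, 40, n),
})
# Synthetic arrival delay driven mostly by the departure delay.
y = X["DepDelay"] + 0.2 * X["TaxiOut"] + rng.normal(0, 3, n)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)

# Impurity-based importances sum to 1 across all features.
importances = (
    pd.Series(model.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
print(importances)
```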
Hyper-Parameter Tuning: The following are the Random Forest parameters that we are going to experiment with to find the best combination.
- n_estimators = number of trees in the forest
- max_features = max number of features considered for splitting a node
- max_depth = max number of levels in each decision tree
- min_samples_split = min number of data points placed in a node before the node is split
- min_samples_leaf = min number of data points allowed in a leaf node
- bootstrap = method for sampling data points (with or without replacement)
Best parameters found:
- n_estimators : 400
- min_samples_split : 10
- min_samples_leaf : 1
- max_features : 'auto'
- max_depth : 20
- bootstrap : True
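A randomized search over the parameters described above can be sketched with scikit-learn's `RandomizedSearchCV`. The candidate values, `n_iter`, `cv`, and the synthetic data are illustrative assumptions, kept small so the example runs quickly; they are not the grid used in this project. Recent scikit-learn releases no longer accept `max_features='auto'`, so the sketch uses `'sqrt'` and `None` instead.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for the flight features.
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Candidate values per parameter (assumption: illustrative, not the
# project's actual search space).
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_features": ["sqrt", None],
    "max_depth": [5, 10, 20, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

# Sample 5 random combinations and score each with 3-fold CV.
search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X, y)

best_params = search.best_params_
print(best_params)
```

A randomized search evaluates only a sample of the combinations, which keeps the cost manageable when the full grid would be too large to try exhaustively.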
- ArrDelay categories for Random Forest Classifier Model:
- Less than -30 minutes : Very Early
- -30 minutes to less than 0 minutes : Early
- 0 minutes to 5 minutes : Ontime
- 5 minutes to 30 minutes : Late
- Greater than 30 minutes : Very late
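The categories above can be derived from the `ArrDelay` column roughly as follows. This sketch uses `numpy.select` rather than `pandas.cut` so that each boundary can be made open or closed individually. The list leaves it open whether a delay of exactly 5 minutes is Ontime or Late; treating it as Ontime is an assumption, as is the output column name `ArrDelayCategory`.

```python
import numpy as np
import pandas as pd

# Small example frame; the values are made up to hit every boundary.
df = pd.DataFrame({"ArrDelay": [-45, -30, -1, 0, 5, 12, 30, 31]})
d = df["ArrDelay"]

# Conditions are checked in order; the first match wins.
conditions = [
    d < -30,   # Very Early
    d < 0,     # Early: -30 <= d < 0
    d <= 5,    # Ontime: 0 <= d <= 5 (assumption: 5 counts as Ontime)
    d <= 30,   # Late: 5 < d <= 30
]
labels = ["Very Early", "Early", "Ontime", "Late"]

# Anything above 30 minutes falls through to the default label.
df["ArrDelayCategory"] = np.select(conditions, labels, default="Very late")
print(df)
```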
The accuracy of our Random Forest classifier is 81%, while our Naive Bayes classifier reaches only 70%. We obtained the best Random Forest classifier through hyper-parameter tuning. However, we could not tune the Naive Bayes classifier in the same way: Gaussian Naive Bayes exposes few tunable parameters beyond the class priors, which we do not know.
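A comparison of this kind can be set up as in the sketch below, assuming scikit-learn's `RandomForestClassifier` and `GaussianNB`. The data is synthetic with five classes to mirror the five delay categories, so the printed accuracies will not match the 81% and 70% reported here. The last line lists the parameters that `GaussianNB` accepts: the class priors and a variance-smoothing term.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic five-class problem (assumption: stands in for the five
# ArrDelay categories; not the project's data).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
nb = GaussianNB().fit(X_train, y_train)

# Accuracy on the held-out split for both classifiers.
rf_acc = accuracy_score(y_test, rf.predict(X_test))
nb_acc = accuracy_score(y_test, nb.predict(X_test))
print(f"Random Forest accuracy: {rf_acc:.2f}")
print(f"Gaussian Naive Bayes accuracy: {nb_acc:.2f}")

# The full set of constructor parameters GaussianNB exposes.
nb_params = nb.get_params()
print(nb_params)
```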