Code and writing for submissions made to the Kaggle Titanic competition, as part of a course project for STAT5302: Applied Regression Analysis.
Tasks:
- Clean the "ticket" column to create a factor for the ticket type and a continuous (?) column for the number
- Do some kind of feature selection process to include only relevant variables (e.g. via AIC)
- Do some kind of search for second-order and higher-order interaction effects between variables
- Search for non-linear relationships between the variables and the response (e.g. polynomial terms, log/power transforms)
- Consider some kind of feature derived from passenger names
- Handle missing data in continuous columns
- Handle missing data in factor columns
- Do something with the Cabin information
- Create cross-validation split for use by team (?)
- Try other model approaches:
  - Decision tree
  - xgboost (gradient boosted decision trees)
  - Python sklearn for ensemble model
  - Look up how to do ensemble models in R?
  - Neural approach?
- Write report
  - Abstract
  - Short description of preliminary data study
  - Detailed reasoning of the model/method chosen
  - Explanations of other models and why they were inferior
  - Short conclusion
  - Supplemental document
    - References
    - Code
    - Screenshot of score in leaderboard
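
As a starting point for the ticket and name tasks above, here is a minimal Python sketch of one possible approach. It is not the team's chosen method. The function names `split_ticket` and `extract_title` are hypothetical, and the code assumes tickets look like an optional text prefix followed by a number (e.g. `A/5 21171`) and names look like `Surname, Title. Given names`. Tickets with no prefix get the placeholder level `NONE`, and tickets with no trailing number get `None`, which would then need the same missing-data handling as other columns.

```python
import re


def split_ticket(ticket):
    """Split a raw ticket string into (prefix, number).

    Assumption: the number, when present, is the last whitespace-separated
    token and consists only of digits. Everything before it is the prefix.
    """
    parts = ticket.strip().split()
    number = None
    if parts and parts[-1].isdigit():
        number = int(parts.pop())
    # Drop punctuation so that variants such as "A/5" and "A.5" share a level.
    prefix = re.sub(r"[./\s]", "", "".join(parts)).upper() or "NONE"
    return prefix, number


def extract_title(name):
    """Return the honorific between the comma and the first period.

    Assumption: names follow the pattern "Surname, Title. Given names".
    Returns "Unknown" when the pattern does not match.
    """
    match = re.search(r",\s*([^.]+)\.", name)
    return match.group(1).strip() if match else "Unknown"


if __name__ == "__main__":
    print(split_ticket("A/5 21171"))
    print(split_ticket("113803"))
    print(extract_title("Braund, Mr. Owen Harris"))
```

The prefix would typically be treated as a factor (possibly after merging rare levels), while whether the ticket number is meaningful as a continuous predictor remains an open question to check during feature selection.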
Finished tasks:
- Create Kaggle team
- Create a GitHub account and clone the repository to your local computer