Classifying whether a passenger will survive on the titanic. This will be my first Kaggle submission.
I'm beginning to keep a log of my progress each day.
I then ran several classification algorithms (naive bayes, SVC, random forest), which I thought would perform well on a small dataset. The all gave varying results... ranging from 0.40 - 0.77 on the test sets accuracy.
I believe that one issue that I'm running into is how to handle missing data. The only two approaches I've tried were setting NaN values to 0 or setting them to their columns mean. After further inspection, I realized that 20% of the test set contains NaN values so I believe that is where I'm running into problems.
What is my solution? Well from researching online, I've found that I can use linear regression, KNN, and other means to predict the values of missing fields.
After testing with KNN to impute missing values on the test set, I was able to achieve... a whopping 74%! -3% than my best score... I guess I'll keep on trying.
After exhaustively testing various machine learning algorithms with the currently formatted data that I have, my highest score that I was able to achieve was ~77% from the SVC classifier.
Therefore, I am going to try to experiment with that further by tweaking the hyper parameters and try to push to a higher score.
After experimenting with setting various hyper parameters, I was not able able to get anything higher than my 77%. I finally caved in and looked up a solution. Following this tutorial I realized the importance of understanding the data I'm working with. They gave a much more in depth analysis of the data before using any sort of machine learning models. Despsite following the tutorial, I was only able to get to a 78% on kaggle.
In regards to data preprocessing, I selected several columns that I felt would provide meaningful data to our classification algorithm.
I selected the following features
- Pclass
- Sex
- Age
- Fare_Per_Person
Using almost all of the columns
Prior to performing PCA, the data was normalized.
These were my results, (1 - Survivors, 0 - Non Survivors)
Attempt 1. Using a random forest classifier, I was able to achieve 96% accuracy on the train-split dataset with a 75/25 split. Sadly, I could only get between 0.70 - 0.75 on the test set. It could have been due to the way I preprocessed data and am handling missing values.
Attempt 2. Tried with more features and received worse results.
As expected, this performed significantly worse than the random forest classifier. I ended up with results are ~62%.
With a support vector classifier, I was able to get a slight increase of +2% on my kaggle prediction at 77%. I trained it on the training set with NaN values set to their columns mean and the same columns were dropped, as described above.
Tested with bernoulli naive bayes and the results were around 76% on kaggle so I achieved about the same as the random forest classifier/SVC.
With nearest neighbors set to 5, I received poor results at ~40%.
As I had expected, the results achieved from the NN were significantly worse than the alternative solutions. From my understanding, I believe this is from the lack of data and that it's not linearly separable.