# Titanic Data Science Solutions

This notebook is a solution to the Kaggle competition *Titanic: Machine Learning from Disaster*.

## Workflow stages

The competition solution workflow goes through seven stages, described in the *Data Science Solutions* book:

1. Question or problem definition.
2. Acquire training and testing data.
3. Wrangle, prepare, and cleanse the data.
4. Analyze, identify patterns, and explore the data.
5. Model, predict, and solve the problem.
6. Visualize, report, and present the problem-solving steps and the final solution.
7. Submit the results.

## Problem Statement

Competition sites like Kaggle define the problem to solve or the questions to ask, while providing the datasets for training your data science model and testing the model's results against a test dataset. The question or problem definition for the Titanic Survival competition is described on the Kaggle competition page.

## Workflow goals

The data science solutions workflow solves for seven major goals:

- **Classifying.** We may want to classify or categorize our samples. We may also want to understand the implications or correlation of different classes with our solution goal (a baseline model is sketched after this list).
- **Correlating.** One can approach the problem based on the available features within the training dataset. Which features within the dataset contribute significantly to our solution goal? Statistically speaking, is there a correlation between a feature and the solution goal? As the feature values change, does the solution state change as well, and vice versa? This can be tested for both numerical and categorical features in the given dataset. We may also want to determine correlations among features other than survival for subsequent goals and workflow stages. Correlating certain features may help in creating, completing, or correcting features (sketched below).
- **Converting.** For the modeling stage, one needs to prepare the data. Depending on the choice of model algorithm, all features may need to be converted to numerical equivalent values, for instance converting text categorical values to numeric values (sketched below).
- **Completing.** Data preparation may also require us to estimate any missing values within a feature. Model algorithms may work best when there are no missing values (sketched below).
- **Correcting.** We may also analyze the given training dataset for errors or possibly inaccurate values within features, and try to correct these values or exclude the samples containing the errors. One way to do this is to detect any outliers among our samples or features. We may also completely discard a feature if it is not contributing to the analysis or may significantly skew the results (sketched below).
- **Creating.** Can we create new features based on an existing feature or a set of features, such that the new feature follows the correlation, conversion, and completeness goals? (Sketched below.)
- **Charting.** How to select the right visualization plots and charts depending on the nature of the data and the solution goals (sketched below).
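As a minimal illustration of the **correlating** goal, the sketch below pivots survival rate by passenger class. It assumes the standard Kaggle Titanic schema and a local `train.csv` path; the later sketches in this README continue from this cell, as in a notebook.

```python
import pandas as pd

# Load the Kaggle training data (the local path "train.csv" is an assumption).
train_df = pd.read_csv("train.csv")

# Correlating: average survival rate per passenger class.
# A clear monotonic trend suggests Pclass correlates with the solution goal.
print(
    train_df[["Pclass", "Survived"]]
    .groupby("Pclass", as_index=False)
    .mean()
    .sort_values("Survived", ascending=False)
)
```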
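For the **converting** and **completing** goals, a sketch continuing from the cell above. Mapping `Sex` to 0/1 and imputing with the median and mode are common choices for this dataset, not the only ones.

```python
# Converting: turn text categorical values into numeric equivalents.
train_df["Sex"] = train_df["Sex"].map({"male": 0, "female": 1}).astype(int)

# Completing: estimate missing values so model algorithms can run.
train_df["Age"] = train_df["Age"].fillna(train_df["Age"].median())
train_df["Embarked"] = train_df["Embarked"].fillna(train_df["Embarked"].mode()[0])
```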
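For the **correcting** goal, a sketch continuing from above. Dropping `Ticket` and `Cabin` and using a three-standard-deviation outlier check are illustrative assumptions, not prescriptions.

```python
# Correcting: discard features that contribute little or may skew results.
train_df = train_df.drop(columns=["Ticket", "Cabin"])

# Flag samples whose Fare lies more than three standard deviations
# from the mean as potential outliers for manual review.
fare_mean = train_df["Fare"].mean()
fare_std = train_df["Fare"].std()
outliers = train_df[(train_df["Fare"] - fare_mean).abs() > 3 * fare_std]
print(f"{len(outliers)} potential Fare outliers")
```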
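For the **creating** goal, a sketch of two widely used engineered features for this dataset. The regex for `Title` assumes the standard `Name` format ("Last, Title. First").

```python
# Creating: derive new features from existing ones.
train_df["Title"] = train_df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
train_df["FamilySize"] = train_df["SibSp"] + train_df["Parch"] + 1
train_df["IsAlone"] = (train_df["FamilySize"] == 1).astype(int)
```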
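For the **charting** goal, one common first plot compares the age distribution of survivors and non-survivors; the use of seaborn here is an assumption about the plotting library.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Charting: compare Age distributions for survivors vs. non-survivors.
g = sns.FacetGrid(train_df, col="Survived")
g.map(plt.hist, "Age", bins=20)
plt.show()
```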
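Finally, for the **classifying** goal and the "model, predict, and solve" stage, a baseline sketch with scikit-learn. Logistic regression and the chosen feature list are assumptions for illustration, not the notebook's final model.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Classifying: a simple baseline on the numeric features built above.
features = ["Pclass", "Sex", "Age", "Fare", "FamilySize", "IsAlone"]
X_train, X_val, y_train, y_val = train_test_split(
    train_df[features], train_df["Survived"], test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"Validation accuracy: {model.score(X_val, y_val):.3f}")
```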