Author: Aya Tamura
Date: August 25, 2024
This project focuses on analyzing the relationships between pairs of variables in the Titanic dataset. The aim is to practice bivariate analysis techniques, considering the data types involved (categorical, ordinal, and continuous). The notebook demonstrates different methods for visualizing and testing associations between these variables, providing a comprehensive overview of bivariate analysis in Python.
The dataset used is the well-known Titanic dataset, which includes information about passengers on the RMS Titanic. Key variables include:
- Survived: Whether the passenger survived (0 = No, 1 = Yes)
- Sex: The gender of the passenger
- Fare: The ticket fare paid by the passenger
- Parch: The number of parents or children aboard
The dataset is publicly available, for example on Kaggle.
The analysis in the notebook is divided into the following questions:
- Variables: Sex vs. Survived
- Data Type: Categorical vs. Categorical
- Methods Used: Bar plots, Chi-Square test
- Key Insight: A significant association between sex and survival was found, with women having a higher survival rate.
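The chi-square step for this question might look like the following minimal sketch. It assumes the column names Sex and Survived listed earlier; the helper name `sex_survival_chi2` is hypothetical, and the small DataFrame is made-up toy data, not the Titanic records, so the printed numbers say nothing about the real dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def sex_survival_chi2(df: pd.DataFrame):
    """Cross-tabulate Sex against Survived and test for independence."""
    # Contingency table: rows are Sex categories, columns are Survived (0/1).
    table = pd.crosstab(df["Sex"], df["Survived"])
    # For a 2x2 table, chi2_contingency applies Yates' correction by default.
    chi2, p_value, dof, _expected = chi2_contingency(table.to_numpy())
    return table, chi2, p_value, dof


# Toy data for illustration only (NOT the real Titanic records).
toy = pd.DataFrame({
    "Sex": ["female"] * 10 + ["male"] * 10,
    "Survived": [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8,
})

table, chi2, p_value, dof = sex_survival_chi2(toy)
print(table)
print(f"chi2={chi2:.3f}, p={p_value:.4f}, dof={dof}")
```

A small p-value suggests that the two variables are not independent. A bar plot of the same contingency table is the natural visual companion to the test.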
- Variables: Survived vs. Fare
- Data Type: Categorical vs. Continuous
- Methods Used: Box plots, Violin plots, Mann-Whitney U test
- Key Insight: Passengers who survived generally paid higher fares, with the difference being statistically significant.
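As a rough illustration of the Mann-Whitney U step, the sketch below splits Fare by Survived and compares the two groups. The function name `fare_by_survival_test` is hypothetical, the two-sided alternative is an assumption about how the test was run, and the fares are invented toy values rather than real passenger data.

```python
import pandas as pd
from scipy.stats import mannwhitneyu


def fare_by_survival_test(df: pd.DataFrame):
    """Compare the Fare distributions of survivors and non-survivors."""
    survived = df.loc[df["Survived"] == 1, "Fare"].dropna()
    not_survived = df.loc[df["Survived"] == 0, "Fare"].dropna()
    # Non-parametric test: makes no normality assumption about Fare.
    stat, p_value = mannwhitneyu(survived, not_survived, alternative="two-sided")
    return stat, p_value, survived.median(), not_survived.median()


# Toy data for illustration only (NOT the real Titanic records).
toy_fares = pd.DataFrame({
    "Survived": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "Fare": [30.0, 52.5, 26.0, 80.0, 71.3, 7.25, 8.05, 13.0, 7.9, 10.5],
})

u_stat, u_p, median_survived, median_not_survived = fare_by_survival_test(toy_fares)
print(f"U={u_stat}, p={u_p:.4f}")
print(f"median fare: survived={median_survived}, not survived={median_not_survived}")
```

A rank-based test is a reasonable choice here because fares are typically right-skewed, which is also why box and violin plots are more informative than a comparison of means.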
- Variables: Survived vs. Parch
- Data Type: Categorical vs. Ordinal (or Continuous)
- Methods Used: Count plots, Box plots, Kruskal-Wallis test
- Key Insight: Survival rates varied significantly depending on the number of parents/children aboard, with Parch treated as an ordinal variable.
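One way the Kruskal-Wallis step could be set up is sketched below. It assumes that the Parch values are grouped by Survived and that the groups are then compared; the notebook may group the data differently. The helper name `parch_by_survival_kruskal` is hypothetical and the data are toy values, not Titanic records.

```python
import pandas as pd
from scipy.stats import kruskal


def parch_by_survival_kruskal(df: pd.DataFrame):
    """Run a Kruskal-Wallis test on Parch across the Survived groups."""
    # One array of Parch values per survival outcome (assumed grouping).
    groups = [g["Parch"].dropna().to_numpy() for _, g in df.groupby("Survived")]
    stat, p_value = kruskal(*groups)
    return stat, p_value, len(groups)


# Toy data for illustration only (NOT the real Titanic records).
toy_parch = pd.DataFrame({
    "Survived": [0] * 8 + [1] * 8,
    "Parch": [0, 0, 0, 0, 0, 1, 0, 0, 1, 2, 1, 0, 2, 1, 0, 3],
})

h_stat, h_p, n_groups = parch_by_survival_kruskal(toy_parch)
print(f"H={h_stat:.3f}, p={h_p:.4f}, groups={n_groups}")
```

Because the test works on ranks, it only needs Parch to be ordered, which matches treating it as an ordinal variable. With just two groups it gives essentially the same answer as a Mann-Whitney U test.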
- Data Type: Continuous vs. Continuous
- Explanation: While this dataset did not provide opportunities to explore continuous vs. continuous relationships in this analysis, the methods that would be used if such data were available include scatter plots, correlation analysis, and linear regression.
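For a pair of continuous variables, the correlation and regression steps could be sketched as follows. The column names `x` and `y`, the helper name `continuous_pair_summary`, and the numbers are all hypothetical placeholders; nothing here comes from the notebook or the Titanic data. Plotting is omitted to keep the example short.

```python
import pandas as pd
from scipy.stats import linregress, pearsonr, spearmanr


def continuous_pair_summary(df: pd.DataFrame, x_col: str, y_col: str):
    """Return Pearson r, Spearman rho, and a simple linear fit for two columns."""
    pair = df[[x_col, y_col]].dropna()
    pearson_r, _pearson_p = pearsonr(pair[x_col], pair[y_col])
    spearman_rho, _spearman_p = spearmanr(pair[x_col], pair[y_col])
    fit = linregress(pair[x_col], pair[y_col])  # ordinary least squares, y ~ x
    return pearson_r, spearman_rho, fit.slope, fit.intercept


# Hypothetical toy data, roughly following y = 2x.
toy_xy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2.1, 3.9, 6.2, 7.8, 10.1, 12.0],
})

r, rho, slope, intercept = continuous_pair_summary(toy_xy, "x", "y")
print(f"Pearson r={r:.3f}, Spearman rho={rho:.3f}")
print(f"fit: y = {slope:.3f} * x + {intercept:.3f}")
```

Pearson's r measures linear association and Spearman's rho measures monotonic association, so reporting both is a cheap check on whether a straight-line fit is appropriate.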
- Python
- Pandas: For data manipulation and analysis
- Matplotlib and Seaborn: For data visualization
- SciPy: For statistical testing
- Clone the repository:
git clone https://github.com/yourusername/titanic-bivariate-analysis.git
- Install the necessary dependencies:
pip install -r requirements.txt
- Run the Jupyter Notebook:
jupyter notebook
- Open the titanic_bivariate_analysis.ipynb file.
This project demonstrates how to approach bivariate analysis for different combinations of categorical, ordinal, and continuous variables. The Titanic dataset serves as a versatile example for practicing key concepts in data analysis, visualization, and hypothesis testing.
Potential future improvements include:
- Exploring additional variables, such as Age or Embarked.
- Conducting continuous vs. continuous analysis when applicable datasets are available.
- Extending the analysis to multivariate relationships.
This analysis is based on the Titanic dataset, which is publicly available and widely used for data science practice. Special thanks to Kaggle and other open-source communities for providing the data and resources.