This repository contains an advanced exploratory data analysis (EDA) and heart disease prediction project. The project aims to provide a comprehensive understanding of the dataset related to heart disease and build predictive models to classify whether a patient is likely to have heart disease based on various attributes.
The dataset used in this project is sourced from https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset , containing a collection of clinical parameters for individuals, including demographic information, medical test results, and lifestyle factors. It comprises both numerical and categorical features, which are crucial in understanding the factors influencing heart disease.
The project is structured as follows:
- data: This directory contains the dataset used for analysis and modeling.
- notebooks: This directory contains Jupyter notebooks outlining the exploratory data analysis process, feature engineering, model development, and evaluation.
- src: This directory contains any source code used in the project, such as utility functions or custom modules.
- reports: This directory contains any reports generated from the analysis, including visualizations, summary statistics, and model evaluation metrics.
The EDA process involves thorough examination and visualization of the dataset to gain insights into the relationships between variables, identify patterns, anomalies, and potential predictive features related to heart disease. The notebooks in the notebooks
directory detail the steps taken in the EDA process.
Key components of the EDA process include:
- Descriptive statistics
- Data cleaning and preprocessing
- Univariate and multivariate analysis
- Visualization of distributions, correlations, and trends
- Identification of outliers and missing values
- Feature engineering and selection
Following the EDA, predictive models are developed to classify whether an individual is likely to have heart disease based on the available features. Various machine learning algorithms such as logistic regression, decision trees, random forests, support vector machines, or neural networks may be explored and evaluated for their predictive performance.
The prediction task involves:
- Data splitting into training and testing sets
- Model training using different algorithms
- Hyperparameter tuning for optimizing model performance
- Model evaluation using appropriate metrics such as accuracy, precision, recall, F1-score, and ROC-AUC curve
To get started with this project, follow these steps:
- Clone this repository to your local machine.
- Navigate to the
notebooks
directory to explore the Jupyter notebooks containing the analysis and modeling process. - Ensure you have the necessary dependencies installed. You can install them via pip using the
requirements.txt
file.
pip install -r requirements.txt
- Run the notebooks sequentially to replicate the analysis and understand the steps involved.
The project requires Python 3.11 along with various libraries such as pandas, numpy, matplotlib, seaborn, scikit-learn, etc. These dependencies are listed in the requirements.txt
file.
Contributions to this project are welcome. If you find any issues or have suggestions for improvements, please feel free to open an issue or submit a pull request.
Feel free to customize this README according to your project's specifics and preferences.