The purpose of this analysis is to predict the charges that health insurance companies bill to individuals, based on six attributes: age, sex, BMI, number of children, smoking status, and region.
- Supervised Machine Learning
- Inferential Statistics
- Descriptive Statistics
- Machine Learning
- Data Visualization
- Predictive Modeling
- Regression Analysis
- Factor Analysis
- Random Forest
- Python
- R
- Jupyter Notebook
- Pandas
- NumPy
- Matplotlib
- Scikit-learn
- Graphviz
- Seaborn
- Yellowbrick
- Pydot
- Data exploration/descriptive statistics
- Data processing/cleaning
- Statistical modeling
- Writeup/reporting
Kaggle: https://www.kaggle.com/mirichoi0218/insurance
- Age: Age of the beneficiary in years.
- Sex: Whether the beneficiary is male or female.
- BMI: Body mass index, derived from an individual's weight and height (kg/m²). A BMI from 18.5 to 24.9 is generally considered healthy.
- Children: Number of dependents covered by health insurance.
- Smoker: Whether or not the beneficiary smokes.
- Region: The beneficiary's residential area in the US. The categories are northeast, southeast, southwest, northwest.
- Charges: The price the beneficiary pays the health insurance companies in USD.
**Note:** The individual paying for the health insurance is referred to as the "beneficiary" in the definitions.
To be usable in practice, the model should conform to the assumptions of linear regression. To confirm this, we examined the data set and checked the following:
- The regression model is linear in parameters
- The mean of residuals is zero
- Homoscedasticity (equal variance) of residuals
- Normality of residuals
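
The residual checks listed above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the analysis performed on the Kaggle data set: the coefficients, noise level, and the simple split-half spread comparison are all assumptions chosen for the example, and the fit uses NumPy's ordinary least squares rather than any specific modeling library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the Kaggle data set): two numeric predictors
# and a response that is linear in the parameters plus Gaussian noise.
n = 500
age = rng.uniform(18, 64, size=n)
bmi = rng.normal(30, 6, size=n)
noise = rng.normal(0, 1000, size=n)
charges = 2000 + 250 * age + 300 * bmi + noise

# Design matrix with an intercept column, fitted by ordinary least squares.
X = np.column_stack([np.ones(n), age, bmi])
coef, *_ = np.linalg.lstsq(X, charges, rcond=None)
fitted = X @ coef
residuals = charges - fitted

# 1. Mean of residuals: zero up to rounding when an intercept is fitted.
mean_resid = residuals.mean()

# 2. Homoscedasticity: compare residual spread between the lower and upper
#    halves of the fitted values; a ratio near 1 suggests equal variance.
order = np.argsort(fitted)
low, high = residuals[order[: n // 2]], residuals[order[n // 2 :]]
spread_ratio = high.std() / low.std()

# 3. Normality: sample skewness and excess kurtosis should both be near 0.
z = (residuals - mean_resid) / residuals.std()
skewness = np.mean(z**3)
excess_kurtosis = np.mean(z**4) - 3

print(f"mean residual:   {mean_resid:.6f}")
print(f"spread ratio:    {spread_ratio:.3f}")
print(f"skewness:        {skewness:.3f}")
print(f"excess kurtosis: {excess_kurtosis:.3f}")
```

On real data these summary numbers are usually paired with a residuals-versus-fitted plot and a Q-Q plot, since a single ratio or moment can hide structure that a plot makes obvious.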
- Multiple linear regression (supervised learning)
- Use `pandas.crosstab` on the categorical variables (sex, smoker, region) to confirm their values
- Check for typos
- Round dollar amounts (charges) to a consistent number of decimals
- Range of age
- Incorrect entries
- Data validation = exploratory data analysis
- Data validation = cleaning the data
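
A few of the validation steps above could look roughly like this in pandas. The rows below are hypothetical values in the data set's column layout (with one deliberate typo in `region`), and the age bounds and two-decimal rounding are assumptions made for illustration, not rules taken from the project.

```python
import pandas as pd

# Hypothetical sample rows (illustrative values only, not real records).
df = pd.DataFrame({
    "age": [20, 45, 33, 60],
    "sex": ["female", "male", "male", "female"],
    "bmi": [28.0, 34.0, 23.0, 26.0],
    "children": [0, 1, 0, 0],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "sotheast"],
    "charges": [1234.567, 2000.004, 3500.5, 9999.999],
})

# Cross-tabulate two categorical columns to confirm the observed values.
sex_by_smoker = pd.crosstab(df["sex"], df["smoker"])
print(sex_by_smoker)

# Flag category labels outside the expected sets (catches typos).
expected = {
    "sex": {"female", "male"},
    "smoker": {"yes", "no"},
    "region": {"northeast", "southeast", "southwest", "northwest"},
}
unexpected = {
    col: sorted(set(df[col]) - allowed) for col, allowed in expected.items()
}
print(unexpected)

# Range check on age; these bounds are an assumption for the example.
age_out_of_range = df[(df["age"] < 18) | (df["age"] > 100)]
print(len(age_out_of_range), "rows with age outside the assumed range")

# Round dollar amounts to cents (two decimals assumed here).
df["charges"] = df["charges"].round(2)
```

Keeping the expected category sets in one dictionary makes it easy to extend the same check to other columns without repeating the comparison logic.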