
After two weeks of the Data Science Bootcamp @neuefische, we started our first EDA project.

Primary Language: Jupyter Notebook

Analysis of King County Data


This is my first project during the bootcamp. Here I'm working with the King County House Sales dataset. The focus is on EDA, used to walk through an entire data science lifecycle. The project is divided into the following steps:

  • Business Understanding
  • Data Mining
  • Data Cleaning
  • Data Exploration / Analysis
  • Feature Engineering
  • Predictive Modelling
  • Data Visualization

The data

The dataset can be found in the file "King_County_House_prices_dataset.csv", in this folder. The description of the column names can be found in the column_names.md file in this repository.
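As a first look at the data, a minimal sketch along the following lines may help. It assumes pandas is installed; the helper names `load_house_sales` and `summarize` are made up for illustration and are not taken from the notebook.

```python
import pandas as pd

# File name as given in this README; adjust the path if the file lives elsewhere.
CSV_PATH = "King_County_House_prices_dataset.csv"


def load_house_sales(csv_source):
    """Read the house sales CSV (a path or file-like object) into a DataFrame."""
    return pd.read_csv(csv_source)


def summarize(df):
    """Return one row per column: its dtype, missing-value count, and unique-value count."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing": df.isna().sum(),
            "unique": df.nunique(),
        }
    )
```

Calling `summarize(load_house_sales(CSV_PATH))` from the repository folder gives a quick overview of which columns need cleaning before the analysis starts.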


Through statistical analysis/EDA, please come up with AT LEAST 3 recommendations for home sellers and/or buyers in King County (you can definitely get bonus points for more than 3). Then model this dataset with a multivariate linear regression to predict the sale price of houses as accurately as possible. Acceptable R squared values: 0.7 to 0.9. Optional: split the dataset into a train and a test set, use Root Mean Squared Error (RMSE) as your metric of success, and try to minimize this score on your test data.
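The modelling part of the task can be sketched roughly as follows. This is not the code from the notebook: it is a minimal illustration that fits a linear regression by ordinary least squares with NumPy, splits the rows into train and test sets, and reports R squared and RMSE. All function names are hypothetical, and the demo runs on synthetic data rather than on the King County file.

```python
import numpy as np


def train_test_split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices and return (train_idx, test_idx)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_linear_regression(X, y):
    """Ordinary least squares; returns [intercept, coef_1, ..., coef_k]."""
    X_design = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply the fitted intercept and slopes to a feature matrix."""
    return coef[0] + X @ coef[1:]


def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rmse(y_true, y_pred):
    """Root Mean Squared Error, in the same units as the target."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


if __name__ == "__main__":
    # Synthetic stand-in for the real features and sale prices (assumption).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = 100.0 + X @ np.array([50.0, -20.0, 10.0]) + rng.normal(scale=5.0, size=500)

    train_idx, test_idx = train_test_split_indices(len(X))
    coef = fit_linear_regression(X[train_idx], y[train_idx])
    test_pred = predict(coef, X[test_idx])

    print("R squared (test):", r_squared(y[test_idx], test_pred))
    print("RMSE (test):", rmse(y[test_idx], test_pred))
```

Evaluating both metrics on the held-out test rows, rather than on the training rows, is what makes the RMSE a fair measure of how well the model predicts prices it has not seen.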


The results of the project can be found in the attached Jupyter notebook and in the slides, which are attached as well.