CaliforniaHousingPricePrediction

The goal of this project is to build a predictive model for California housing prices. The dataset is from the 1990 Census. The final data contains 20,640 observations on 9 variables. The dependent variable is ln(median house value). House values matter a great deal to investors, and predicting them well can help maximize profits. Is it all location, location, location? Or are there other variables to consider? We will find out!
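
To make the dependent variable concrete, here is a minimal sketch of the log transform and its inverse. The dollar amounts are made-up illustrative values, not rows from the dataset, and the variable names are assumptions.

```python
import math

# Hypothetical median house values in dollars (made-up numbers, not from the dataset).
median_house_value = [85_000.0, 150_000.0, 240_000.0, 500_000.0]

# Dependent variable as described above: the natural log of median house value.
log_value = [math.log(v) for v in median_house_value]

# A model fit on the log scale predicts log-dollars; exponentiate to get dollars back.
recovered = [math.exp(y) for y in log_value]

for original, logged, back in zip(median_house_value, log_value, recovered):
    print(f"{original:>10.0f}  ln = {logged:.3f}  exp(ln) = {back:.0f}")
```

Modeling the log compresses the long right tail of prices, so errors are measured in roughly relative rather than absolute terms.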

Primary Language: Jupyter Notebook

Overview

After lots of hard work, you are almost done with this data science bootcamp! Now you will put many of the pieces together and create a showcase project for your portfolio. The project will involve elements of data exploration, preprocessing, and machine learning.

The high-level goal of this project is straightforward – build a predictive model. You’ll be given some guidelines (see below), but we’ve left a lot of room for flexibility. Be creative!

The Dataset

For this project you get to choose one of four data sets. Two are for regression and two are for classification.

• California Housing – Predict median house value (regression)

Guidelines

You need to produce an R or Python notebook that covers the full scope of the data science courses, from exploring data to optimizing machine learning model performance. Throughout each stage of the process, thoroughly explain your thought process. For example, perhaps you chose to ignore a certain variable because it is too closely related to another feature, or because regularization indicated it was not useful.

• Exploratory Data Analysis: Summarize variables, visualize distributions and relationships. Generate a few interesting questions about the data and explore them with some visualizations.
• Research Methods: Calculate the sample correlation between at least one pair of variables. Come up with a hypothesis and calculate the p-value.
• Data Cleaning and Preparation: Apply any appropriate preprocessing steps, such as removing duplicates, handling missing values and outliers, and scaling data as appropriate (note that the choice of model(s) may determine whether scaling is necessary).
• Feature Engineering: Create new features or transform existing ones to improve performance. Even if you decide not to use these features (e.g., they don’t affect performance or make it worse), keep the code and an explanation of what you tried in your notebook.
• Model Selection: Try various models (at least 3), showing your evaluation process. Clearly indicate which metrics you used and the performance of each model. Be sure to address any imbalance in the data and to use an appropriate train/test data split.
• Performance Optimization: Use regularization, hyperparameter tuning, or other techniques to further optimize your chosen model and/or help select the best model.
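
The Model Selection and Performance Optimization steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses scikit-learn, synthetic data from `make_regression` in place of the housing features, and three arbitrarily chosen candidate models. RMSE is used as the metric; none of these choices is prescribed by the guidelines.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic stand-in data (assumption): swap in the prepared housing features and target.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Hold out a test set that is used only once, for the final check.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Three candidate models (an arbitrary illustrative choice).
candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Compare candidates by 5-fold cross-validated RMSE on the training data only.
cv_rmse = {}
for name, model in candidates.items():
    scores = cross_val_score(
        model, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error"
    )
    cv_rmse[name] = -scores.mean()  # scorer returns negated RMSE
    print(f"{name}: CV RMSE = {cv_rmse[name]:.2f}")

# Tune the Ridge regularization strength with a small grid search.
search = GridSearchCV(
    Ridge(),
    {"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

# Final evaluation of the tuned model on the held-out test set.
test_rmse = mean_squared_error(y_test, search.predict(X_test)) ** 0.5
print(f"best alpha = {search.best_params_['alpha']}, test RMSE = {test_rmse:.2f}")
```

Keeping the test set out of both the comparison and the grid search means the final RMSE is an honest estimate rather than one tuned against the data it is measured on.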

At the end of your notebook, provide a brief summary (one paragraph) of your model – what it is, what preprocessing, feature engineering, and optimization you did, and the final accuracy (or another appropriate metric). Finally, briefly provide three ideas that could improve the model, which may include collecting additional variables.

Tips for Success

Note that the order of these is not strict – you may perform some feature engineering before model selection, but once you’ve chosen a model you may want to perform more feature engineering specifically for that model!

Do not overlook the exploratory and preprocessing steps -- you should spend plenty of time on exploratory data analysis and preparing data! This will make the other machine learning stages easier.

Remember: this is a showcase of your data science skills. Your creativity is important here.