Since the Industrial Revolution, the struggle between human and machine has been ongoing.
Data Scientists vs Pure Processing Power
While much of data science is being automated, there will still be a need for the insights that come from intuitive thinking, something computers have yet to match.
Using publicly available housing data from Ames, Iowa, I will run three different scenarios showing that the brute processing power of a machine cannot match the sly intuitiveness of a human.
Data on housing sales in Ames, Iowa was used to compare regressions across several scenarios.
Exploratory data analysis paired with outside research led to the selection of several features for the user-selected models, as well as the encoding of categorical features into dummy columns.
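As a rough illustration, the dummy-column step might look like the sketch below; the file name and the specific categorical columns (`Neighborhood`, `Exter Qual`) are illustrative assumptions, not the project's exact setup.

```python
import pandas as pd

# Illustrative sketch of the dummy-encoding step; the file path and the
# chosen categorical columns are assumptions, not the project's exact setup.
ames = pd.read_csv('ames_housing.csv')
cats = ['Neighborhood', 'Exter Qual']
dummies = pd.get_dummies(ames[cats], drop_first=True)
ames = pd.concat([ames.drop(columns=cats), dummies], axis=1)
```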
All models are fit with a linear regression and then scored on adjusted R². This metric was chosen because it penalizes models that carry a large number of uninformative features.
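The scoring function itself is not shown in this write-up, but a minimal sketch, assuming a simple train/test split, could look like the following; `adjusted_r2` and `score_model` are names introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def adjusted_r2(r2, n_rows, n_features):
    # Adjusted R^2 discounts R^2 for every extra feature in the model.
    return 1 - (1 - r2) * (n_rows - 1) / (n_rows - n_features - 1)

def score_model(X, y):
    # Fit a plain linear regression and score it on a hold-out set.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    lr = LinearRegression().fit(X_train, y_train)
    return adjusted_r2(lr.score(X_test, y_test), X_test.shape[0], X_test.shape[1])
```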
For the user model, all features are selected through exploratory data analysis and outside research.
For the random model, a randomly sized set of randomly selected features is run through the regression. The number of features is the square root of the total feature count, plus or minus 10 percent.
The model is run 1,000,000 times and the highest adjusted R² that is less than one is kept.
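A minimal sketch of this random search, assuming the `score_model` helper above and a pandas DataFrame `X` of candidate features:

```python
import numpy as np

def random_model(X, y, n_iterations=1_000_000):
    # Sample feature subsets at random and keep the best adjusted R^2 below one.
    base = np.sqrt(X.shape[1])
    low, high = int(base * 0.9), int(base * 1.1)   # sqrt of feature count +/- 10%
    best_score, best_features = -np.inf, None
    for _ in range(n_iterations):
        n_feats = np.random.randint(low, high + 1)
        feats = list(np.random.choice(X.columns, size=n_feats, replace=False))
        score = score_model(X[feats], y)
        if best_score < score < 1:
            best_score, best_features = score, feats
    return best_score, best_features
```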
For the combination model, the user-selected features are combined with a randomly sized set of additional features chosen with the same criteria as in the random model. The selection function is modified to prevent a feature from being assigned twice.
This model is also run 1,000,000 times.
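The combination search could follow the same pattern, again assuming the helpers above; the main change is that the random draw is taken only from columns the user did not already pick, which prevents double assignment.

```python
def combination_model(X, y, user_features, n_iterations=1_000_000):
    # Draw the random extras only from columns the user did not already pick.
    remaining = [c for c in X.columns if c not in user_features]
    base = np.sqrt(X.shape[1])
    low, high = int(base * 0.9), int(base * 1.1)
    best_score, best_features = -np.inf, None
    for _ in range(n_iterations):
        n_feats = np.random.randint(low, high + 1)
        extra = list(np.random.choice(remaining, size=n_feats, replace=False))
        feats = list(user_features) + extra
        score = score_model(X[feats], y)
        if best_score < score < 1:
            best_score, best_features = score, feats
    return best_score, best_features
```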
To test whether pure processing power can create a better model, a randomly sized set of randomly selected features is run through a linear regression, iterated 1,000,000 times.
SCORES
Looking at the data and interpreting it with the help of outside research, a selection of features was chosen and run through a linear regression; the resulting scores are compared below.
Model | RMSE | Adjusted R² | Number of Features |
---|---|---|---|
User | 30818.38 | 0.8449 | 6 |
Random | 35770.18 | 0.7882 | 24 |
Combination | 31190.64 | 0.8567 | 21 |
To compare the model scenarios we use adjusted R², as it penalizes models for carrying a large number of features.
As we can see, both scenarios with user-selected input achieve an adjusted R² more than five percentage points higher than the purely random model.
This significant improvement shows the value of human input.
The highest score coming from the model that combines human insight with machine processing power points us toward further integrating the two.
- Compare adjusted R² across different numbers of features
- Improve data cleaning with normalization and standardization
- Develop more feature engineering
- Extend the random model to include all related dummy columns when one of them is selected
- Redevelop the models with ElasticNet to account for large numerical values and the numerous dummy columns (see the sketch after this list)
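For the ElasticNet item, one possible starting point is a scikit-learn pipeline that standardizes the inputs before fitting, so the penalty treats large-valued numeric columns and 0/1 dummy columns on a comparable scale; the `l1_ratio` grid below is only an assumption.

```python
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize, then cross-validate the ElasticNet mixing and penalty strength.
enet = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5, random_state=42),
)
# enet.fit(X[features], y)  # features could come from any scenario above
```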