machine_learning

Data analytics

Analyze a Data Set

This Jupyter Notebook was based on a dataset found at: https://archive.ics.uci.edu/ml/datasets/Housing

The data set concerns housing values in suburbs of Boston, and contains the following attributes.

Attribute Information:

CRIM: per capita crime rate by town
ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS: proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX: nitric oxides concentration (parts per 10 million)
RM: average number of rooms per dwelling
AGE: proportion of owner-occupied units built prior to 1940
DIS: weighted distances to five Boston employment centres
RAD: index of accessibility to radial highways
TAX: full-value property-tax rate per $10,000
PTRATIO: pupil-teacher ratio by town
B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT: % lower status of the population
MEDV: Median value of owner-occupied homes in $1000's

Data Analytics

The analysis looks at:

The relationship between median housing values and the home age
The correlation between crime rate and pupil-teacher ratio at local schools
The proportion of black citizens within a suburb and the distance from that suburb to employment centers
The connection between home values, nitric oxides concentration, and proximity to industrial centers
The relationship between crime rate and proportion of non-retail businesses
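One way to quantify relationships like the ones listed above is a pairwise Pearson correlation. The sketch below is only an illustration: the four rows are made-up values, not records from the housing data set, and the notebook's actual method may differ. Only the column names (CRIM, PTRATIO, AGE, MEDV) come from the attribute list.

```python
import pandas as pd

# Made-up illustrative values, NOT taken from the housing data set.
housing = pd.DataFrame({
    "CRIM": [0.1, 0.5, 2.0, 8.0],         # per capita crime rate
    "PTRATIO": [15.0, 17.0, 19.0, 21.0],  # pupil-teacher ratio
    "AGE": [30.0, 55.0, 80.0, 95.0],      # % of units built before 1940
    "MEDV": [35.0, 28.0, 20.0, 12.0],     # median home value in $1000's
})

# Pairwise Pearson correlation between every pair of columns
corr = housing.corr(method="pearson")

# Look up a single pair, e.g. crime rate vs. pupil-teacher ratio
crim_vs_ptratio = corr.loc["CRIM", "PTRATIO"]
print(corr.round(2))
```

A correlation coefficient only measures linear association, so a scatter plot of each pair is a useful companion check.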

Working with Imperfect Data

This Jupyter Notebook was based on a dataset from the Seattle Police Department, outlining Incident Reports in the city of Seattle, as of January 28, 2017.

The dataset can be found here: https://data.seattle.gov/Public-Safety/Seattle-Police-Department-Police-Report-Incident/7ais-f98f

Attribute Information:

CAD CDW ID
CAD Event Number
General Offense Number
Event Clearance Code
Event Clearance Description
Event Clearance SubGroup
Event Clearance Group
Event Clearance Date
Hundred Block Location
District/Sector
Zone/Beat
Census Tract
Longitude
Latitude
Incident Location
Initial Type Description
Initial Type Subgroup
Initial Type Group
At Scene Time

Process

The raw data is a comma-separated file: each row holds comma-separated values, and the first row holds the column names. To load it, I read the file into a pandas DataFrame, specifying the comma as the delimiter and the first row as the header. I set low_memory to False because the data set includes columns with mixed data types.
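The loading step described above might look roughly like the following. This is a minimal sketch: the two data rows are a hypothetical stand-in for the real file, and in the notebook the first argument would be the path to the downloaded CSV rather than an in-memory buffer.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same comma-separated layout.
# The second row leaves District/Sector empty to mimic imperfect data.
sample_csv = io.StringIO(
    "CAD CDW ID,Event Clearance Description,District/Sector,Longitude,Latitude\n"
    "1001,EXAMPLE DESCRIPTION A,M,-122.33,47.61\n"
    "1002,EXAMPLE DESCRIPTION B,,-122.31,47.62\n"
)

incidents = pd.read_csv(
    sample_csv,
    sep=",",           # comma as the delimiter
    header=0,          # first row holds the column names
    low_memory=False,  # read the file in one pass so mixed-type columns get a single dtype
)
print(incidents.dtypes)
```

With low_memory=False, pandas infers each column's type from the whole file at once instead of chunk by chunk, which avoids mixed-type warnings at the cost of more memory.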

Because the column names were multi-word strings, often containing spaces, I renamed them to short, simple strings (e.g., "DESC"). This made the columns easier to reference.
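A sketch of the renaming step, using a small hypothetical frame. Only "DESC" is a short name mentioned in the text; "SECTOR" and "BEAT" are assumed here for illustration, as is the mapping from each original column to its short name.

```python
import pandas as pd

# Hypothetical frame using a few of the original column names.
incidents = pd.DataFrame({
    "Event Clearance Description": ["EXAMPLE A", "EXAMPLE B"],
    "District/Sector": ["M", "K"],
    "Zone/Beat": ["M1", "K2"],
})

# Mapping from original names to short names (assumed, except "DESC").
short_names = {
    "Event Clearance Description": "DESC",
    "District/Sector": "SECTOR",
    "Zone/Beat": "BEAT",
}
incidents = incidents.rename(columns=short_names)

# Short names without spaces work with attribute access as well as brackets
print(incidents.DESC.tolist())
```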
