Notes for Udemy course on Machine Learning A-Z
- Download Anaconda
- https://www.continuum.io/downloads
- Anaconda is a Python distribution that bundles Python, common packages, and the Spyder IDE
- Launch Spyder
- In the Window Panes you want the Editor, the Interactive Python console and the Variable Explorer with Help
- In editor
> print("Hello World")
- Highlight and press CTRL-Enter and see it appear in the interactive console.
- We need to start out with Data Preprocessing to get to the fun parts later
- This involves downloading a lot of datasets and processing them.
- Go to: https://www.superdatascience.com/machine-learning/
- Unzip both files
- Place the Preprocessing files in the Template folder structure
- First dataset is Data.csv - the first 3 columns are the independent variables, the last column is the dependent variable
- We need to create a file for the Data Preprocessing Template - data_processing_template.py
- We need to import 3 basic libraries
import numpy as np
import matplotlib.pyplot as plt
- to plot math charts, anytime you want to plot something in Python
import pandas as pd
- best library to import and manage datasets
- Highlight this code and hit CTRL-Enter to execute to make sure it is imported correctly.
- Note: in R you don't have to separately load the packages.
dataset = pd.read_csv('Data.csv')
- Add this to import the dataset
- In the Variable Explorer you can see the dataset
- Change the salary column format from scientific notation: from %.3g to %.0f
- Let's start creating our matrix of features
- Add new code for the data:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
- First, we take all the rows (left of the comma) and then all but the last column (right of the comma)
- Execute that line and type X in the console. This is our matrix of independent variables.
- y is going to be the last column (the dependent variable)
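To see what the `iloc` slicing does, here is a minimal sketch on a tiny made-up stand-in for Data.csv (the values are invented; only the column layout matches the course's file):

```python
import pandas as pd

# A tiny stand-in for Data.csv (made-up values, same column layout)
dataset = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany'],
    'Age': [44.0, 27.0, 30.0],
    'Salary': [72000.0, 48000.0, 54000.0],
    'Purchased': ['No', 'Yes', 'No'],
})

X = dataset.iloc[:, :-1].values  # all rows, all columns except the last
y = dataset.iloc[:, 3].values    # all rows, column index 3 (Purchased)

print(X.shape)  # (3, 3)
print(list(y))  # ['No', 'Yes', 'No']
```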
- Now we are going to deal with missing data in the dataset.
- We're missing data in columns for the Spain and Germany rows.
- One idea is to remove the line -- but we can't afford to lose data.
- Most common: take the mean of the column.
from sklearn.preprocessing import Imputer
- This imports the Imputer class, which allows us to handle missing data
- Now we need to create an object
imputer = Imputer(missing_values = 'NaN')
- We're switching out NaN - reason is if you look in "Variable Explorer" at Data.csv in DataFrame mode you will see NaN in missing blanks.
- Now we set the strategy to mean:
imputer = Imputer(missing_values = 'NaN',strategy = 'mean')
- Now we set axis=0 for columns
imputer = Imputer(missing_values = 'NaN',strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:,1:3])
- we're taking columns 1 and 2 but not 3 (1:3 means indices 1 and 2 but not 3)
- Run the imputer part of the code
- in console:
X
and this should output all the rows
- if the rows are truncated, you may need to also input into the console:
np.set_printoptions(threshold=100)
- Check Data.csv in a spreadsheet, get the avg. salary:
=AVERAGE(C1:C11)
- Output: 63777.7777777778
- Note: for strategies you can also take the "median" and "most_frequent" values
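The notes fit the imputer but don't show the transform step that actually fills in the values. A minimal sketch on made-up numbers, using SimpleImputer (in scikit-learn 0.20+ the old `sklearn.preprocessing.Imputer` was replaced by `sklearn.impute.SimpleImputer`):

```python
import numpy as np
from sklearn.impute import SimpleImputer  # replaces Imputer in sklearn >= 0.20

# Made-up Age/Salary columns with one missing value in each
X = np.array([[44.0, 72000.0],
              [27.0, np.nan],
              [np.nan, 54000.0],
              [38.0, 61000.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X)
X = imputer.transform(X)  # the fitted column means replace the NaNs

print(X[2, 0])  # mean of the observed ages: (44 + 27 + 38) / 3
```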
- The Country and Purchased columns are called categorical columns (Germany/France/Spain, Yes/No)
- We have to get the text out of the machine learning equations
- We need to encode the text into numbers.
from sklearn.preprocessing import LabelEncoder
- Then we have to create an object
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
- Run in console.
- Unfortunately at this point we have higher and lower numbers for each country which could make one seem greater than another.
- So instead we'll break them into 3 columns of 1 or 0
- To do this we need to import OneHotEncoder
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
- INFO: To get info on an object, go to Help
sklearn.preprocessing.OneHotEncoder
- Add in the following code:
#Encoding Category data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
- Run. Now check in the Variable Explorer - double-click X - you should see 3 columns prepended with 1s and 0s
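Note that in recent scikit-learn the `categorical_features` argument was removed from OneHotEncoder; a ColumnTransformer selects the columns to encode instead. A minimal sketch of the same step on made-up data, assuming a recent scikit-learn:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up matrix: column 0 is categorical, the rest numeric
X = np.array([['France', 44.0, 72000.0],
              ['Spain', 27.0, 48000.0],
              ['Germany', 30.0, 54000.0]], dtype=object)

# ColumnTransformer one-hot encodes column 0 and passes the rest through
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])],
                       remainder='passthrough')
X = ct.fit_transform(X)

print(X.shape)  # (3, 5): three dummy columns prepended, then Age and Salary
```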
- Next we'll take care of the purchased Column
- Copy paste this part:
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
- change to y
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
- Now check in the Variable Explorer - double-click y - you should see 1 column with 1s and 0s
- We have to split the Dataset into a Training and a Test set.
- The test set will have slightly different data.
- The test set is used to test the performance of how well we trained the ML model.
- We are testing the adaptation of the learned rules to a new set of data.
- We expect there should not be much difference in performance.
- It's very simple, takes 2 lines:
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
- These are our dependent and independent variables by each set: X_train, X_test, y_train, y_test
- in train_test_split we pass X and y, which together are the whole dataset. test_size 0.2 is 20%
- We have 10 observations total: 8 in the train set, 2 in the test set
- random_state fixes the random seed so the sampling is reproducible.
- Select these lines and Run
- See in Variable explorer the new datasets
- Note that for X we have 8 observations in train and 2 in test.
- What is feature scaling and why do we need to do it?
- The Euclidean distance will be dominated by Salary, because its scale is much larger than Age's (compare the max and min of each column).
- We need to transform the variables to the same scale.
- see graphic: 14-Standardization-Normalization
- import scaling library:
from sklearn.preprocessing import StandardScaler
- Then we fit_transform the training set X_train - and only transform the test set (so it is scaled with the training set's parameters)
- Code:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
- Run, see result graphic: 14-X-Standardization
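A minimal sketch of what standardization does, on made-up Age/Salary values: after fit_transform, each training column has mean 0 and standard deviation 1, and the test set reuses the training set's mean/std:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up Age/Salary training and test data
X_train = np.array([[44.0, 72000.0],
                    [27.0, 48000.0],
                    [30.0, 54000.0],
                    [38.0, 61000.0]])
X_test = np.array([[35.0, 58000.0]])

sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)  # learn mean/std from the training set
X_test = sc_X.transform(X_test)        # reuse the SAME mean/std on the test set

print(X_train.mean(axis=0).round(6))  # ~[0. 0.]
print(X_train.std(axis=0).round(6))   # ~[1. 1.]
```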
- This is all that is required to preprocess data
- We only include libraries we need.
- See: Preprocessing Template graphic
- For the template we'll remove some of what we did so far
- REMOVE or COMMENT OUT - Taking care of missing data
- REMOVE or COMMENT OUT - Encoding Category data
- COMMENT OUT - Feature scaling
- Every time we start a machine learning model we will copy/paste this template
Quiz 1: Data Preprocessing
We're going to handle this next:
- Simple Linear Regression
- Multiple Linear Regression
- Polynomial Regression
- Support Vector Regression (SVR)
- Decision Tree Regression
- Random Forest Regression
- Download dataset: https://www.superdatascience.com/machine-learning/
- Data: Simple Linear Regression/Salary_Data.csv
- What is the correlation between salary and years of experience?
- What is the business value add? Given the current model, what salary should we apply?
- Linear Regression: y = b(0) + b(1) * x
- Image: 21-Simple-Linear-Regression
- Image: 21-Simple-Linear-Regression-Dependent-Variable
- Image: 21-Simple-Linear-Regression-Independent-Variable
- Image: 21-Simple-Linear-Regression-Coefficient
Example:
- So we start with an x (Experience) and y (Salary) axis
- We plot the Observations on the x and y axes
- Linear Regression: Salary = b(0) + b(1) * Experience
- Linear regression means the fitted line; b(1) gives the slope
- Image: 21-Simple-Linear-Regression-FULL-EXAMPLE.png
- See ordinary least squares image. 22-Ordinary-Least-Squares.png
- 22-Ordinary-Least-Squares-2-difference.png - y-i (red) and y-i-hat (green)
- This is the difference between what is observed and what the model predicts
- Take that difference and take the sum of the squares: SUM((y_i - ŷ_i)^2) -> min
- So it takes the gaps, sums their squares, and picks the line that has the minimal sum of squares possible.
- See image: 22-Ordinary-Least-Squares-3-SUM.png
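The minimization above has a closed-form solution for one predictor. A sketch on made-up points that lie exactly on y = 1 + 2x, so the OLS line recovers b(0) = 1 and b(1) = 2:

```python
import numpy as np

# Made-up points lying exactly on y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Closed-form ordinary least squares for one predictor:
# b1 = sum((x - x̄)(y - ȳ)) / sum((x - x̄)^2),  b0 = ȳ - b1 * x̄
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(b0, b1)  # 1.0 2.0
```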
- Setup Simple Linear Regression script in Spyder
- First thing we need to do is use our Data Processing template (last file made, previous section) to get started. Copy paste
- Update the csv to Salary_Data.csv and import; inspect it in the Variable Explorer
- We have 30 observations (30 employees)
- We want to train and establish a correlation between experience and salary.
- We have to SPLIT the data out first.
- X is the matrix of features (independent variables)
- Independent variable is the Years of Experience
- Dependent variable is the Salary
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values
- X removes the last column
- y will be column 1 because that is the dependent variable column
- Run the code for X and you should get X with one column
- Run the code for y and you should get y with one column
- At this point we have split the original dataset. Now, we have to split it into (1) Train and (2) Test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)
- We want the test size to be less than a half -- let's do 1/3 for a round number of 10 (1/3 of 30)
- Execute this code. It divides the dataset again. See img: 23-Train-Test-Sets.png
- We're using X_train and y_train to get the correlations, and then we will use the result on the Test groups
- Next step is FEATURE SCALING & FITTING the algorithm to our Dataset
- Feature Scaling we'll leave commented out for now.
- Our data has been preprocessed. Now we have to fit the algorithm.
- We need to import the Linear Regression class
from sklearn.linear_model import LinearRegression
- Out of this we are going to make an object that will be our Linear Regressor
- The Regressor object will use the fit method to fit to the training data.
# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
- Check help for info on the LinearRegression class.
- Now this code can be executed.
- Result:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Out[13]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
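After fitting, the learned coefficients are available on the regressor. A sketch on made-up noise-free data where salary = 30000 + 9000 * years, so the fitted intercept and slope recover those values (the numbers are invented, not from Salary_Data.csv):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up noise-free data: salary = 30000 + 9000 * years of experience
X_train = np.array([[1.0], [2.0], [3.0], [5.0], [10.0]])
y_train = 30000.0 + 9000.0 * X_train.ravel()

regressor = LinearRegression()
regressor.fit(X_train, y_train)

print(regressor.intercept_)  # b(0), about 30000
print(regressor.coef_)       # b(1), about [9000]
```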
- That is it for the most basic Linear Regression Machine Learning Model
- In the next section we'll use it to predict some new observations, which will be the test set observations.
- First step: was to preprocess the data.
- Second step: create linear regression model
- Next we'll predict the Test set results
- We'll create a vector with the predicted test set salaries called y_pred
y_pred = regressor.predict(X_test)
- y_pred is always the vector of predictions for the Dependent variable
- predict is a method of the LinearRegression class
- check help for info about predict
- Execute the code
- New y_pred row - See result: 25-1-Result-of-y_pred.png
- Open y_pred and y_test datasets
- What is the difference?
- y_test is the real salaries observed
- y_pred is the predicted salaries
- Compare the two datasets - test and predicted - they are not perfect, some are close some aren't.
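The whole train/predict/compare loop can be sketched end to end on made-up data. Here the salaries are noise-free, so the predictions match the held-out truth exactly; with real data like Salary_Data.csv the gaps the notes mention appear:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Made-up noise-free salary data: salary = 30000 + 9000 * years of experience
X = np.arange(1.0, 11.0).reshape(-1, 1)
y = 30000.0 + 9000.0 * X.ravel()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3,
                                                    random_state=0)
regressor = LinearRegression().fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# With noise-free data the predicted and real salaries agree exactly
print(np.abs(y_pred - y_test).max())
```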
Quiz 2: Simple Linear Regression
- General instructions about getting dataset.
- venture capital dataset
- 5 columns
- 50 companies
- View CSV - 50_Startups.csv
- Fields: R&D Spend, Administration, Marketing Spend, State, Profit
- We need to create a model to decide which types of companies are best to invest in based on Profit.
- Dependent variable (DV): Profit. Other variables are independent variables (IV).
- They need to find out which companies do better on various factors.
- see image: 33-Multiple-Regression-Formula.jpg
- Multiple Regression Formula:
y = b(0) + b(1)*x(1) + b(2)*x(2) etc.
- See image for the full description of the formula: 33-2-Multiple-Regression-Formula--FULL-Descriptions.png
- Quick heads up -- there is a Caveat about Linear Regressions.
- Linear Regressions have assumptions.
- See image: 34-1-Linear-Regressions-Assumptions.jpg
- Linearity, Homoscedasticity, Multivariate normality, Independence of errors, Lack of multicollinearity
- Always make sure your assumptions are correct when building a Linear Regression.
- see image: 34-1-Dummy-Variables.png
y = b(0) + b(1)*x(1) + b(2)*x(2) + b(3)*x(3) + ???
- Keep in mind the last one is State, which is a categorical variable (not numeric like the others)
- Remember what we do: for each category you need to create a new column with 0 or 1.
- So in this case you have new columns for New York and California.
y = b(0) + b(1)*x(1) + b(2)*x(2) + b(3)*x(3) + b(4)*D(1)
- NOTE: We only need to include the New York column, since if it's 0 we know the state is California.
- So essentially CA will be included as a constant in the coefficient b(0)
- see image: 34-5-Dummy-Variables.png
- Dummy Variable image
- Remember-- you CANNOT include 2 dummy variables at the same time.
- Multicollinearity: D(2) = 1 - D(1)
- Whenever building a model omit one Dummy variable
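The trap and its fix can be seen with pandas' get_dummies (an alternative to the course's sklearn encoders, shown here only to illustrate the idea): with one column per state the dummies are redundant, and drop_first omits one:

```python
import pandas as pd

states = pd.Series(['New York', 'California', 'New York', 'California'])

full = pd.get_dummies(states)                   # one column per state: redundant
safe = pd.get_dummies(states, drop_first=True)  # drop one dummy to avoid the trap

print(list(full.columns))  # ['California', 'New York']
print(list(safe.columns))  # ['New York']
```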
- Step by Step Building a Model
- THE OLD DAYS: We had x(1) -> y .... One independent variable and one dependent variable
- This was just a simple Linear Regression to build.
- Now those easy days are gone as we have multiple independent variables which could all be predictors
- There are so many we need to decide which ones to Keep
- See 37-2-Variables.png
- Why throw out variables? (1) If you put garbage in you get garbage out, (2) You have to explain the variable correlations
- (1) All-in - throw in all your variables: (a) if you have prior knowledge of the factors, (b) you have to, such as when required by your company, (c) preparing for Backward Elimination
- (2) Backward Elimination - (a) Select a significance level, (b) Fit the full model with all predictors, (c) Consider the predictor with the highest p-value; if P > SL go to the next step, else FIN, (d) Remove that predictor, (e) Fit the model without this variable, (f) Go back to (c)
- (3) Forward Selection - (a) Select a significance level, (b) Fit all simple regression models, select the one with the lowest p-value, (c) Keep this variable and fit all possible models with 1 extra predictor added to what you have, (d) Take the predictor with the lowest p-value; if P < SL go to (c), else FIN
- (4) Bidirectional Elimination (Stepwise Regression) - see 37-8-Bidirectional-Elimination.png
- (5) Score Comparison (All possible models)
- #2, 3, 4 are Stepwise Regressions; usually #4 is what "stepwise regression" implies.
- We're going to concentrate on Backward Elimination because it's the fastest and you still get to see it step by step
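The Backward Elimination loop (b)-(f) can be sketched directly. The course later uses a statistics library for the p-values; here is a self-contained numpy/scipy version on made-up data, where the p-values come from the standard two-sided t-test on each OLS coefficient:

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Two-sided t-test p-values for each OLS coefficient."""
    n, p = X.shape
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)                        # residual variance
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))  # coefficient std errors
    return 2 * stats.t.sf(np.abs(beta / se), df=n - p)

def backward_elimination(X, y, sl=0.05):
    """Drop the predictor with the highest p-value while it exceeds sl."""
    cols = list(range(X.shape[1]))
    while True:
        pvals = ols_pvalues(X[:, cols], y)
        worst = int(np.argmax(pvals))
        if pvals[worst] <= sl:
            return cols   # all remaining predictors are significant: FIN
        del cols[worst]   # remove the least significant predictor, refit

# Made-up data: column 0 is the intercept; y depends on columns 1 and 2 only
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n),
                     rng.normal(size=n),
                     rng.normal(size=n),
                     rng.normal(size=n)])   # column 3 is pure noise
y = 2.0 + 3.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

kept = backward_elimination(X, y)
print(kept)  # columns 0, 1, 2 survive; the noise column is usually eliminated
```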
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
# Splitting the dataset into the Training set and Test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
- Highlight the dataset line, run to look at the dataset
- 50 observations of startups
- We're going to see if there are some linear dependencies between the independent variables
- Dependent variable is Profit, the variable we are trying to predict
- The matrix of independent variables will be X; the dependent variable vector will be y
- In this course the spreadsheet has all independent variables first and the dependent variable last. We may have to change X and y
- In X we use :-1 to drop the last column (Profit, the dependent variable), so X holds just the independent variables
- Change y to 4 (the last column, counting from 0)
- Run X and y
- Dummy variables - Next we have to jump back to our Categorical data code we did in Part 1 - this is to get rid of relational order.
- Go to the File Explorer and get it: categorical_data.py
- Copy and paste this code directly BEFORE the splitting of the data (train_test_split):
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
- Note: we do not need the part encoding the Dependent variable, only the independent one
- change X[:, 0] to X[:, 3]:
X[:, 3] = labelencoder_X.fit_transform(X[:, 3])
- onehotencoder can only be used on the numbered (label-encoded) variables, so we also need to change categorical_features to [3]:
X[:, 3] = labelencoder_X.fit_transform(X[:, 3])
onehotencoder = OneHotEncoder(categorical_features = [3])
- Run.
- The last column (State) was replaced with 3 columns (dummy variables, needed to turn it into numbers)
- There is a new column for each state, with a 0 or 1
- Add one more line, "Avoiding the Dummy Variable Trap":
# Avoiding the Dummy Variable Trap
X = X[:, 1:]
- This removes the first column from X
- Next we have to split into a training set and a test set
- Let's see if we have to change the test size
- We have 50 observations, so a good test size for training would be 10 (0.2), which is already there
test_size = 0.2,
- Run that section