MACHINE LEARNING FOUNDATIONS
https://www.coursera.org/specializations/machine-learning
THE MACHINE LEARNING PIPELINE
Data -> ML Method -> Intelligence
CASE STUDY APPROACH
Use various ML Methods in different case studies
Regression Case Study 1: Predicting house prices
Classification Case Study 2: Sentiment Analysis
Clustering Case Study 3: Document retrieval
Matrix Factorization Case Study 4: Product recommendation
Deep Learning Case Study 5: Visual product recommender
Overview
Wk 2 Regression
Case Study: Predicting house prices
Models
linear regression
Regularization: Ridge (L2), Lasso (L1)
Algorithms
Gradient descent
Coordinate descent
Concepts
Loss functions
bias-variance tradeoff
cross-validation
sparsity
overfitting
model selection
Wk 3 Classification
Case study: Analyzing sentiment
Models:
Linear classifiers (logistic regression, SVMs, perceptron)
Kernels
Decision trees
Algorithms
Stochastic gradient descent
Boosting
Concepts
Decision boundaries
MLE (maximum likelihood estimation), ensemble methods
random forests
CART
online learning
Wk 4 Clustering & Retrieval
Case study: Finding documents
Models
Nearest neighbors
Clustering, mixtures of Gaussians
Latent Dirichlet allocation (LDA)
Algorithms
KD-trees, locality-sensitive hashing (LSH)
K-means
Expectation-maximization (EM)
Concepts
Distance metrics
approximation algorithms
hashing
sampling algorithms
scaling up with map-reduce
Wk 5 Matrix Factorization & Dimensionality Reduction
Case study: Recommending Products
Models:
Collaborative filtering
Matrix factorization
PCA
Algorithms
Coordinate descent
Eigen decomposition
SVD Algorithms
Concepts
Matrix completion
eigenvalues
random projections
cold-start problem
diversity
scaling up
Wk 6 Capstone: An intelligent application using deep learning
Install Anaconda, GraphLab Create, and IPython Notebook
source activate gl-env
conda update pip
jupyter-notebook --no-browser
[First Notebook](week_1/Getting started with iPython Notebook.html)
[source](week_1/Getting started with iPython Notebook.ipynb)
Basic types
i = 4 # int
f = 4.1
b = True
type(i)
Advanced types
l = [3, 1, 2] # list
d = {'foo':1, 'bar':2.3, 's':'string'} # dictionary
print d['foo']
n = None # null
type(n) # NoneType
Advanced printing
print "Our float value is %s. Our int value is %s." % (f,i)
Conditional statements
if i == 1 and f > 4:
    print "i == 1 and f > 4"
elif i > 4 or f > 4:
    print "(i or f) > 4"
else:
    print "(i and f) <= 4"
Loops
print l
for e in l:
    print e
counter = 6
while counter < 10:
    print counter
    counter += 1
functions
def add2(x):
    return x + 2
lambdas
square = lambda x: x*x
square(3)
GETTING STARTED WITH SFRAME AND GRAPHLAB CREATE
iPython Notebook
Machine learning library scikit-learn
Data manipulation tool Pandas
The tools above have a learning curve. This course uses GraphLab Create, which includes SFrame
Load a tabular data set sf = graphlab.SFrame('people-example.csv')
view end of the table sf.tail()
visualizes any data structure in GraphLab Create sf.show()
Categorical view sf['age'].show(view='Categorical')
Some simple columnar operations
sf['age'].mean()
sf['age'].max()
Create new columns in our SFrame sf['Full Name'] = sf['First Name'] + ' ' + sf['Last Name']
Use the apply function to do an advanced transformation of our data
def transform_country(country):
    if country == 'USA':
        return 'United States'
    else:
        return country
transform_country('USA')
sf['Country'].apply(transform_country)
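To keep the transformed values, assign the result back to the column (a small sketch; the column name follows the example above):
sf['Country'] = sf['Country'].apply(transform_country)
An equivalent inline version uses a lambda:
sf['Country'] = sf['Country'].apply(lambda c: 'United States' if c == 'USA' else c)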
Introduction: Linear regression modeling
This week you will build your first intelligent application that makes predictions of house prices from data.
Create models that predict a continuous value (price) from input features (square footage, number of bedrooms and bathrooms,...). Predicting house prices: A case study in regression
Want to list my house
Compare houses in neighborhood
Focus on similar houses: sq ft, bedrooms, etc
Plot a graph
X = sq. ft.
Y = Price
Terminology:
x = feature, covariate, or predictor
y = observation or response
Note that no house on the graph will have exactly the same sq.ft. as yours. Also, if you only include similar houses, you're discarding the rest of the data on the graph
Fit a line through the data: f(x) = w0 + w1 x
w0 = intercept
w1 = slope
Linear Regression Model
f(x) is parameterized by (w0, w1)
RSS := Residual sum of squares
draw a line and sum the squared distances of the points from the line
RSS(w0,w1) = ($house1 - [w0 + w1 (sq.ft. house1)])^2 + ($house2 - [w0 + w1 (sq.ft. house2)])^2 + ($house3 - [w0 + w1 (sq.ft. house3)])^2 + ...
RSS
best line is the one that minimizes the cost over all possible w0,w1
Ŵ = (ŵ0, ŵ1), "W hat": the parameters of the best line
fŵ(x) = ŵ0 + ŵ1 x
Best guess of your house price: ŷ = ŵ0 + ŵ1 (sq.ft. of your house)
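A minimal sketch in plain Python (hypothetical data): the closed-form least-squares fit minimizes RSS for the simple model f(x) = w0 + w1 x.
sqft = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0]  # hypothetical sq.ft.
price = [300e3, 380e3, 470e3, 560e3, 640e3]      # hypothetical sale prices
n = len(sqft)
x_mean = sum(sqft) / n
y_mean = sum(price) / n
# least-squares slope and intercept minimize RSS(w0, w1)
w1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(sqft, price))
      / sum((x - x_mean) ** 2 for x in sqft))
w0 = y_mean - w1 * x_mean
rss = sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(sqft, price))
print(w0, w1, rss)
print(w0 + w1 * 2200)  # best guess for a 2200 sq.ft. house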
Adding higher order effects
But what if the relationship isn't linear? It could be quadratic.
Fw(x) = w0 + w1 x + w2 x^2
note we square x, but not w: the model is still linear in the parameters
We can apply even higher order polynomials to reduce RSS further
Quadratic
An example of an even higher order polynomial that may not be what you want :)
Higher order polynomial
Evaluating regression models
Evaluating overfitting via training/test split
Based on the last example, we can overfit to the point that it's not generalizable to new data
Want good predictions but can't observe future. We can simulate prediction
test set: remove some houses
training set: fit model on remaining houses
predict test set
Training error (w) = ($train1 - fw(sq.ft. train1))^2 + ($train2 - fw(sq.ft. train2))^2 + ($train3 - fw(sq.ft. train3))^2 + ...
Training error
Test error (ŵ) = ($test1 - fw(sq.ft. test1))^2
+ ($test2 - fw(sq.ft. test2))^2
+ ($test3 - fw(sq.ft. test3))^2
+ ...
Test error
Training/test curves
Training error (ŵ) decreases with increasing model order
Test error decreases up to a point but then starts increasing
Training/Test curves
Adding other features
What if we need to add additional variables, e.g. # bathrooms?
Each new variable is a new dimension, so adding bathrooms turns the plot into a 3-d graph
fit a hyperplane through the data instead of a line
More features
Other regression examples
Salary after ML specialization: ŷ = ŵ0 + ŵ1 performance + ŵ2 capstone + ŵ3 forum
Stock prediction depends on recent prices, news event, related commodities
tweet popularity: # followers, # followers of followers, popularity of hashtag
Summary of regression
Regression Summary
Describe the input features and output real-valued predictions of a regression model
Calculate a goodness-of-fit metric (e.g., RSS)
Estimate model parameters by minimizing RSS (algorithms to come...)
Exploit the estimated model to form predictions
Perform a training/test split of the data
Analyze performance of various regression models in terms of test error
Use test error to avoid overfitting when selecting amongst candidate models
Describe a regression model using multiple features
Quiz: Regression
1
2
3
3
x 2,3 : 4
4
x 2,3 : 4
3
2
See Explore the Quadratic Equation to see the effect of the coefficients
Predicting house prices: IPython Notebook
iPython Notebook
Generate a scatter plot sales.show(view="Scatter Plot", x="sqft_living", y="price")
can hover over individual points to explore further
Splitting the data into training and test sets
train_data,test_data = sales.random_split(.8,seed=0)
Use random_split to split training and test data
0.8 => 80% training and 20% test
seed is set to 0 here so results are reproducible; in practice, use a random seed or let GraphLab pick one for you
Learning a simple regression model to predict house prices from house size
sqft_model = graphlab.linear_regression.create(train_data,
target='price',
features=['sqft_living'],
validation_set=None)
note the default algorithm used is Newton's Method
Evaluating error (RMSE) of the simple model
print sqft_model.evaluate(test_data)
max_error reflects the single worst prediction (the outlier)
Also shows RMSE
Visualizing predictions of simple model with Matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(test_data['sqft_living'], test_data['price'], '.',
test_data['sqft_living'], sqft_model.predict(test_data), '-')
blue . is the actual data scatter plot
green - as a line is the predicted value based on the test_data
Inspecting the model coefficients learned
sqft_model.get('coefficients')
(intercept) = where the line crosses the y axis
sqft_living ~= the average cost of a house per sq ft according to this regression model
View other features of a house we might be interested in
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']
sales[my_features].show()
sales.show(view='BoxWhisker Plot', x='zipcode', y='price')
BoxWhisker Plot to view the set split by feature (zipcode in this case)
Based on the data we see other features such as zip code, and # bedrooms makes a difference in the estimated price of a home
my_features_model = graphlab.linear_regression.create(train_data,
target='price',
features=my_features,
validation_set=None)
print my_features to view what features are included
Compare the original model to the expanded features model
print sqft_model.evaluate(test_data)
print my_features_model.evaluate(test_data)
Note that the RMSE is lower after adding the additional features
Applying learned models to predict price of an average house
house1 = sales[sales['id']=='5309101200'] # find a particular house by id
house1 # to view the data for this house
print house1['price'] to view the actual price based on the data
print sqft_model.predict(house1) to check the price our model predicted
print my_features_model.predict(house1) to predict price based on expanded features
:caution: the prediction model based on sqft was more accurate than the expanded feature model in this case
Applying learned models to predict price of two fancy houses
house2 = sales[sales['id']=='1925069082']
this is an example of a house that, due to an uncaptured feature ("on the waterfront"), was not predicted very well by our model; the expanded-feature prediction is closer than the original sqft model's
Final example is Bill Gates' house. We don't have data on what the actual price is but the prediction gives us a price, that is probably on the low end.
Quiz: Programming assignment
Selection and summary statistics: In the notebook we covered in the module, we discovered which neighborhood (zip code) of Seattle had the highest average house sale price. Now, take the sales data, select only the houses with this zip code, and compute the average price.
Ans:
houses = sales[sales['zipcode']=='98039']
houses['price'].mean()
2160606.5999999996
Filtering data: One of the key features we used in our model was the number of square feet of living space (sqft_living) in the house. For this part, we are going to use the idea of filtering (selecting) data.
In particular, we are going to use logical filters to select rows of an SFrame. You can find more info in the Logical Filter section of this documentation. Using such filters, first select the houses that have sqft_living higher than 2000 sqft but no larger than 4000 sqft. What fraction of the all houses have sqft_living in this range? Save this result to answer the quiz at the end.
Ans:
100.0 * sales[(sales['sqft_living'] > 2000) & (sales['sqft_living'] <= 4000)].num_rows() / sales.num_rows()
42.187572294452416
Building a regression model with several more features: In the sample notebook, we built two regression models to predict house prices, one using just sqft_living and the other one using a few more features, we called this set
my_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode']
Now, going back to the original dataset, you will build a model using the following features:
advanced_features = [
'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'zipcode',
'condition', # condition of house
'grade', # measure of quality of construction
'waterfront', # waterfront property
'view', # type of view
'sqft_above', # square feet above ground
'sqft_basement', # square feet in basement
'yr_built', # the year built
'yr_renovated', # the year renovated
'lat', 'long', # the lat-long of the parcel
'sqft_living15', # average sq.ft. of 15 nearest neighbors
'sqft_lot15', # average lot size of 15 nearest neighbors
]
Compute the RMSE (root mean squared error) on the test_data for the model using just my_features, and for the one using advanced_features.
Ans:
# my_features and advanced_features as defined above
train_data,test_data = sales.random_split(.8,seed=0)
f_model = graphlab.linear_regression.create(train_data, target='price', features=my_features, validation_set=None)
af_model = graphlab.linear_regression.create(train_data, target='price', features=advanced_features, validation_set=None)
print "RMSE diff = " + str(f_model.evaluate(test_data)['rmse'] - af_model.evaluate(test_data)['rmse'])
RMSE diff = 22711.3165108
CLASSIFICATION MODELING
Analyzing the sentiment of reviews
Rating with stars is too simple
Understand aspects of restaurant review
build a restaurant review app
categories for review e.g. experience, ramen, sushi
Break all reviews into sentences in a Sentence Sentiment Classifier
Average the predictions
Display the most positive or negative reviews
Classification Applications
Input (sentence) -> Classifier -> Output (predicted rating)
Examples
Webpage classification by category e.g. education, finance, technology
Spam filtering: checks sender, text, ipaddress, etc
Image classification
Personalized medical diagnosis
Reading your mind via fMRI
simple threshold classifier
list of +ve words: great, awesome, etc
list of -ve words: bad, terrible, etc
if # +ve words > # -ve words => ŷ = +ve else ŷ = -ve
e.g. "Great sushi, awesome food, but terrible service" => +2, -1
Problems with threshold classifier
populating initial +ve and -ve word lists
words have degrees of sentiment: e.g. "great" > "good"
single words are not enough: "good" vs "not good"
The first two can be addressed by learning a classifier
The 3rd issue can be addressed by more elaborate features
| Word | Weight |
| ------------------------------- | ------ |
| great | 1.5 |
| awesome | 1.2 |
| bad | -1.0 |
| terrible | -2.1 |
| awful | -3.3 |
| restaurant, the, we, where, ... | 0.0 |
e.g. "Sushi was great, the food was awesome, but the service was terrible"
score = 1.5 (great) + 1.2 (awesome) - 2.1 (terrible) = 0.6
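A minimal sketch of this weighted word-count scorer in plain Python (weights taken from the table above; every other word defaults to 0):
weights = {'great': 1.5, 'awesome': 1.2, 'bad': -1.0, 'terrible': -2.1, 'awful': -3.3}
def score(sentence):
    # tokenize crudely: lowercase, drop commas, split on whitespace
    words = sentence.lower().replace(',', ' ').split()
    return sum(weights.get(w, 0.0) for w in words)
s = score("Sushi was great, the food was awesome, but the service was terrible")
print(s)  # 0.6
print('+ve' if s > 0 else '-ve')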
Decision Boundaries
Can graph scores. e.g. awesome = 1.0 and awful = -1.5
(plot: # awesome on the x-axis vs # awful on the y-axis; each review is a point, e.g. "this ambiance is awesome but service is awful and food is awful" sits at (1, 2); the decision boundary is a line through this plane)
For linear classifiers
when 2 weights are non-zero: line
when 3 weights are non-zero: plane
when many weights are non-zero: hyperplane
Evaluating classification models
Training and Evaluating a Classifier
Training a classifier = Learning the weights
split data into training and test sets
the training set is used to learn the word weights of the classifier
test set is evaluated by error & accuracy
Test example "Sushi was great"
Feed the sentence ^ to the learned classifier
predict ŷ is +ve
pass multiple test cases and compare correct vs mistakes
Classification error & accuracy
Error = (# mistakes) / (total # sentences)
best possible value = 0.0
Accuracy = (# correct) / (total # sentences)
best possible value = 1.0
error = 1 - accuracy
What's a good accuracy?
Purely random guessing on a binary classification = 0.5 accuracy
For k classes, accuracy = 1/k
we should at least beat random guessing
This can be counterintuitive e.g.
"90% email is spam"
if we predict 100% email is spam we get 90% accuracy
Questions to ask:
Is there class imbalance?
How does it compare to a simple, baseline approach?
Random guessing
Majority class
...
Most importantly: what accuracy does my application need?
What is good enough for my user's experience?
What is the impact of the mistakes we make?
False positives, false negatives, and confusion matrices
A confusion matrix
Predicted Label
| True Label | +ve | -ve |
| ---------- | --------- | --------- |
| +ve | True +ve | False -ve |
| -ve | False +ve | True -ve |
True +ve & True -ve are good: we got it right
FN (False -ve) & FP have different impacts
Example of domains
| | Spam filtering | Medical Diagnosis |
| --------- | -------------- | ------------------ |
| False -ve | Annoying | Disease untreated |
| False +ve | Email lost | Wasteful treatment |
Given 100 test examples, a possible confusion matrix for spam filtering
Predicted Label
| True Label | +ve | -ve |
| ---------- | ---- | ---- |
| +ve | (50) | 10 |
| -ve | 5 | (35) |
Accuracy = 85/100 = 0.85; note there are more false -ves (10) than false +ves (5)
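A quick check of these numbers in plain Python (counts taken from the matrix above):
tp, fn = 50, 10  # true +ve row
fp, tn = 5, 35   # true -ve row
accuracy = (tp + tn) / float(tp + fn + fp + tn)
print(accuracy)      # 0.85
print(1 - accuracy)  # classification error = 0.15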
Multiclass classification example with 100 test examples
Predicted Label
| True Label | Healthy | Cold | Flu |
| ------------ | ------- | ---- | --- |
| Healthy (70) | (60) | 8 | 2 |
| Cold (20) | 4 | (12) | 4 |
| Flu (10) | 2 | 0 | (8) |
Accuracy = (60 + 12 + 8)/100 = 80/100 = 0.8
Learning Curves: How much data do I need?
More data is good but data quality is more important
Theoretical techniques sometimes bound how much data is needed
provide guidance but not as practical
In practice
more complex models require more data
empirical analysis can provide guidance
Learning curves
Generally, the more data we have, the lower the test error
(plot: test error on the y-axis falls as the amount of training data on the x-axis grows, flattening toward a nonzero floor)
Bias of the model: even with infinite data, the curve never reaches 0
More complex models tend to have less bias
sentiment classifier on single words does ok
But some are just too hard e.g. "The sushi was not good"
we can then score on pairs of words (a bigram model)
Even bigram models have bias. The curve looks better (approaches 0 faster) but never hits 0
Class Probabilities
Classifiers provide a confidence level P(y|x)
e.g. "the sushi & everything else were awesome" P(y=+|x) = 0.99
"The sushi was good, the service was OK" P(y=+|x) = 0.55
less confident that this is a +ve review
Summary of classification
Classification ML block diagram
Classification ML Block Diagram
Training data: sentences labeled with sentiment
x = word counts
ŷ = predicted sentiment
ŵ = weights for each word
y = sentiment label
compare y with ŷ to get a quality metric that is fed into ML Algorithm
ML algorithm updates ŵ
Summary
Identify a classification problem and some common applications
Describe decision boundaries and linear classifiers
Train a classifier
Measure its error
Some rules of thumb for good accuracy
Interpret the types of error associated with classification
Describe the tradeoffs between model bias and data set size
Use class probability to express degree of confidence in prediction
Quiz: Classification
1 The simple threshold classifier for sentiment analysis described in the video (check all that apply):
Must have pre-defined positive and negative attributes
Must either count attributes equally or pre-define weights on attributes
Defines a possibly non-linear decision boundary
1
2 For a linear classifier classifying between positive and negative sentiment in a review x, Score(x) = 0 implies (check all that apply):
The review is very clearly negative
We are uncertain whether the review is positive or negative
We need to retrain our classifier because an error has occurred
2
3 For which of the following datasets would a linear classifier perform perfectly?
x 1,2,3
4 True or false: High classification accuracy always indicates a good classifier.
false
5 True or false: For a classifier classifying between 5 classes, there always exists a classifier with accuracy greater than 0.18.
true
6 True or false: A false negative is always worse than a false positive.
false
7 Which of the following statements are true? (Check all that apply)
Test error tends to decrease with more training data until a point, and then does not change (i.e., curve flattens out)
Test error always goes to 0 with an unboundedly large training dataset
Test error is never a function of the amount of training data
1
Analyzing sentiment: IPython Notebook
iPython Notebook
Analyze text products['word_count'] = graphlab.text_analytics.count_words(products['review'])
Train classifier using a logistic classifier
train_data,test_data = products.random_split(.8, seed=0)
sentiment_model = graphlab.logistic_classifier.create(train_data,
target='sentiment',
features=['word_count'],
validation_set=test_data)
Evaluate the sentiment model using ROC Curve
sentiment_model.evaluate(test_data, metric='roc_curve')
sentiment_model.show(view='Evaluation')
Programming assignment
Analyzing product sentiment assignment
Quiz: Analyzing product sentiment
Algorithms for retrieval and measuring similarity of documents
Document retrieval: A case study in clustering and measuring similarity
Groups of related articles: clusters
What is the document retrieval task?
Given an article, how to find similar articles?
measure similarity
search over similar articles
Word count representation for measuring similarity
Most popular model: bag of words model
ignore order of words
count # of instances of each word in vocabulary
Create a count vector, e.g. for the document "Carlos calls the sport futbol. Emily calls the sport soccer."
| word | count |
| ------ | ----- |
| carlos | 1 |
| the | 2 |
| tree | 0 |
| calls | 2 |
| sport | 2 |
| cat | 0 |
| futbol | 1 |
| dog | 0 |
| soccer | 1 |
| emily | 1 |
Measuring similarity: compare the following 2
article on messi: 1 0 0 0 5 3 0 0 1 0 0 0 0
article on pele: 3 0 0 0 2 0 0 1 0 1 0 0 0
similarity = (1 * 3) + (5 * 2) = 13
Another example
article on messi: 1 0 0 0 5 3 0 0 1 0 0 0 0
article on africa: 0 0 1 0 0 0 9 0 0 6 0 4 0
similarity = 0
Issue, doc length: the longer the documents, the higher the similarity score
messi: 1 0 0 0 5 3 0 0 1 0 0 0 0 doubled 2 0 0 0 10 6 0 0 2 0 0 0 0
pele: 3 0 0 0 2 0 0 1 0 1 0 0 0 doubled 6 0 0 0 4 0 0 2 0 2 0 0 0
original similarity = 13 vs doubled = 52
Solution = normalize: divide each vector by its norm, sqrt(sum(counts^2))
messi: 1 0 0 0 5 3 0 0 1 0 0 0 0
norm: sqrt(1^2 + 5^2 + 3^2 + 1^2) = sqrt(36) = 6
messi normalized = [1/6, 0, 0, 0, 5/6, 3/6, 0, 0, 1/6, 0, 0, 0, 0]
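A minimal sketch of normalized similarity in plain Python (vectors taken from the example above):
import math
def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]
def similarity(a, b):
    return sum(x * y for x, y in zip(a, b))
messi = [1, 0, 0, 0, 5, 3, 0, 0, 1, 0, 0, 0, 0]
pele = [3, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0]
print(similarity(normalize(messi), normalize(pele)))
# doubling both documents now leaves the similarity unchanged
print(similarity(normalize([2 * x for x in messi]), normalize([2 * x for x in pele])))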
Prioritizing important words with tf-idf
Common words in doc: the, player, field, goal
^ these dominate rare words like: futbol, Messi
discount word w based on # of docs containing w in corpus
But we don't want only rare words to dominate
What characterizes an important word ?
Appears frequently in the document (common locally)
Appears rarely in the corpus (rare globally)
Trade off between local frequency and global rarity
TF-IDF := Term frequency * Inverse document frequency
tf: # of times the word appears in this document
idf: log(# docs / (1 + # docs using word))
word in many docs: log(large / (1 + large)) ~= log 1 = 0
rare word: log(large # / (1 + small #)) => large #
Calculating tf-idf vectors
e.g. doc with "the" showing up 100 times and "messi" showing up 5
assuming 64 docs and "the" shows up in 63 of them
assuming "messi" shows up in 3 docs
TF: {the: 1000, Messi: 5}
IDF: { the: log(64/(1+63)) = 0, messi: log(64/(1+3)) = log2(16) = 4 }
TF-IDF (TF * IDF) "the" = 1000 * 0 = 0
TF-IDF (TF * IDF) "messi" = 5 * 4 = 20
Retrieving similar documents using nearest neighbor search
query article is the current article
corpus = entire library
specify: distance metric
output: set of most similar articles
Algorithm
search over each article in corpus
compute s = similarity
if s > best_s, remember this article as the new best
k-nearest neighbor
return list of k similar articles
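A minimal sketch of this linear scan (hypothetical corpus dict mapping titles to count vectors; cosine-style similarity):
def nearest_neighbor(query_vec, corpus):
    def sim(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb)
    # scan every article, keeping the one with the highest similarity
    return max(corpus.items(), key=lambda item: sim(query_vec, item[1]))[0]
corpus = {'messi': [1, 0, 5, 3], 'pele': [3, 0, 2, 0], 'africa': [0, 1, 0, 0]}
print(nearest_neighbor([1, 0, 4, 2], corpus))  # 'messi'
For k-nearest neighbors, sort all articles by similarity and return the top k instead of the single best.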
Clustering models and algorithms
Clustering documents task overview
Discover groups (clusters) of related articles
Structure documents by topic: e.g. sports, world news, etc
assuming we are provided with labels
Multiclass classification Problem
how do i classify a new article based on clusters?
example of a supervised learning problem
Clustering documents: An unsupervised learning task
Clustering: no labels provided & want to uncover cluster structure
Input: docs as vectors [w1, w2]
Output: cluster labels
once articles are clustered, you can post-facto provide a meaningful label e.g. "sport"
Clustering
Defining a cluster
center
shape or spread
To assign a new article check the shape of the cluster and center
Another approach is to only look at distance from center
Cluster Center
K-means: A clustering algorithm
Assuming similarity metric is distance from center
k-means 1
0 Initialize cluster centers
k-means 2
1 Assign observations to closest cluster centers (Voronoi tessellation)
Assign regions
k-means 3
2 Revise cluster centers as mean of assigned observations
Iterate on the process
k-means 4
3 Repeat 1,2 until convergence
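A minimal sketch of this loop in plain Python (2-d points; step 0's initial centers assumed given):
def kmeans(points, centers, iters=100):
    clusters = []
    for _ in range(iters):
        # 1. assign each observation to its closest cluster center
        clusters = [[] for _ in centers]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # 2. revise each center as the mean of its assigned observations
        new_centers = [tuple(sum(xs) / float(len(cl)) for xs in zip(*cl)) if cl else c
                       for cl, c in zip(clusters, centers)]
        # 3. repeat until the centers stop moving
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
print(kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)]))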
k-means 5
Other examples of clustering
Clustering images: by ocean, dog, clouds, etc
Clustering patients by medical condition: by subpopulations and diseases
e.g. patients with seizures
record seizure activity over time
cluster the different types of seizures
Seizure recording
Seizure clustering
Clustering products on amazon:
discover product categories from purchase histories
discover groups of users
e.g. person buying crib also purchased baby seat => crib is "baby product", not "furniture"
Structuring web search results
e.g. "cardinal" can have multiple meanings: bird, baseball team, catholic cardinal
Use clustering to structure the output
Discovering similar neighborhoods: estimate price at a small region
cluster regions with similar sales trends
Discovering similar neighborhoods: forecast violent crimes
cluster regions
Summary of clustering and similarity
Training data: doc id, document text
x: tf-idf
ML model: clustering
ŷ: estimated cluster label
y: does not exist because this is unsupervised learning
quality metric: a function of x & ŵ
i.e. distances of observations to assigned centers
ŵ: cluster centers
ML algorithm: k-means
Clustering and similarity ML block diagram
Summary
Describe ways to represent a document (e.g., raw word counts, tf-idf ,...)
Measure the similarity between two documents
Discuss issues related to using raw word counts
Normalize counts to adjust for document length
Emphasize important words using tf-idf
Implement a nearest neighbor search for document retrieval
Describe the input (unlabeled observations) and output (labels) of a clustering algorithm
Determine whether a task is supervised or unsupervised
Cluster documents using k-means (algorithmic details to come...)
Describe other applications of clustering
Quiz: Clustering and Similarity
1 A country, called Simpleland, has a language with a small vocabulary of just the, on, and, go, round, bus, and wheels. For a word count vector with indices ordered as the words appear above, what is the word count vector for a document that simply says "the wheels on the bus go round and round".
Please enter the vector of counts as follows: If the counts were ["the"=1, on=3, "and"=2, "go"=1, "round"=2, "bus"=1, "wheels"=1], enter 1321211.
the: 2, on: 1, and: 1, go: 1, round: 2, bus: 1, wheels: 1
Answer: 2111211
2 In Simpleland, a reader is enjoying a document with a representation: [1 3 2 1 2 1 1]. Which of the following articles would you recommend to this reader next?
[7 0 2 1 0 0 1] * [1 3 2 1 2 1 1] = [7 0 4 1 0 0 1] = 13
[1 7 0 0 2 0 1] * [1 3 2 1 2 1 1] = [1 21 0 0 4 0 1] = 27
[1 0 0 0 7 1 2] * [1 3 2 1 2 1 1] = [1 0 0 0 14 1 2] = 18
[0 2 0 0 7 1 1] * [1 3 2 1 2 1 1] = [0 6 0 0 14 1 1] = 22
=> recommend the second article (highest similarity: 27)
3 A corpus in Simpleland has 99 articles. If you pick one article and perform 1-nearest neighbor search to find the closest article to this query article, how many times must you compute the similarity between two articles?
Y 98
98 * 2 = 196
98/2 = 49
(98)^2
99
4 For the TF-IDF representation, does the relative importance of words in a document depend on the base of the logarithm used? For example, take the words "bus" and "wheels" in a particular document. Is the ratio between the TF-IDF values for "bus" and "wheels" different when computed using log base 2 versus log base 10?
No
5 Which of the following statements are true? (Check all that apply):
Y Deciding whether an email is spam or not spam using the text of the email and some spam / not spam labels is a supervised learning problem.
Dividing emails into two groups based on the text of each email is a supervised learning problem.
Y If we are performing clustering, we typically assume we either do not have or do not use class labels in training the model.
6 Which of the following pictures represents the best k-means solution? (Squares represent observations, plus signs are cluster centers, and colors indicate assignments of observations to cluster centers.)
2
Document retrieval: IPython Notebook
iPython Notebook
Loading & exploring Wikipedia data
Using text_analytics
obama = people[people['name'] == 'Barack Obama']
obama['word_count'] = graphlab.text_analytics.count_words(obama['text'])
print obama['word_count']
Exploring word counts
Extract words and counts using stack
obama_word_count_table = obama[['word_count']] \
.stack('word_count',
new_column_name = ['word','count']) \
.sort('count', ascending=False)
Computing & exploring TF-IDFs
people['word_count'] = graphlab.text_analytics.count_words(people['text'])
people['tfidf'] = graphlab.text_analytics.tf_idf(people['word_count'])
obama = people[people['name'] == 'Barack Obama']
obama[['tfidf']].stack('tfidf',new_column_name=['word','tfidf']).sort('tfidf',ascending=False)
Computing distances between Wikipedia articles
Manually computing distance
clinton = people[people['name'] == 'Bill Clinton']
graphlab.distances.cosine(obama['tfidf'][0],clinton['tfidf'][0])
Building & exploring a nearest neighbors model for Wikipedia articles
Using nearest_neighbors
knn_model = graphlab.nearest_neighbors \
.create(people, features=['tfidf'], label='name')
knn_model.query(obama, radius=0.84, k=10)
Programming assignment: Retrieving Wikipedia articles assignment
Quiz: Retrieving Wikipedia articles
Recommender systems
Use ML to leverage the past purchase history of you and others
Where we see recommender systems in action
youtube, netflix movies
amazon products: include global and session interests e.g. last yr was buying newborn products
facebook friend recommendations
drug target interactions: drug d has been studied for target t. Can we discover new applications of d?
Building a recommender system via classification
Solution 0: Popularity
rank by global popularity
con: no personalization
Solution 1: Classification Model
User info, purchase history, product info, other info -> Classifier -> Yes/No (will the user buy this product?)
Features may not be available, e.g. age, gender
doesn't perform as well as collaborative filtering
Solution 2: Co-occurrence matrices for collaborative filtering
e.g. a person buying diapers might also be interested in baby wipes because others bought both
Matrix C: (# items x # items), where C[i,j] = # people who bought both items i and j
symmetric matrix: C[i,j] = C[j,i]
when buying diapers, look at the row for diapers and find other products that people bought frequently when buying diapers
row for diapers: [ ... 0 (dvd) ... 4 (pacifier) ... 100 (baby wipes) ... ]
Effect of popular items
Sophie the Giraffe is a popular baby product
Can weight my purchased items to refine e.g.
Score(me, baby wipes) = 1/2 [S(baby wipes, diapers) + S(baby wipes, milk)], averaging over my purchase history (diapers, milk)
Cons: does not utilize
context: e.g. time of day
user features: e.g. age
product features: e.g. baby vs electronics
Cold start problem: what if this is a new product
row for Sophie: [ ... 0 (dvd) ... 4 (pacifier) ... 100 (baby wipes) ... 1M (diapers) ... ]
Diapers is so popular that it drowns out everything else in the row
Normalizing co-occurrence matrices and leveraging purchase histories
Normalize with Jaccard similarity
similar in spirit to tf-idf
(# who purchased both i and j) / (# who purchased i or j)
^ = |i ∩ j| / |i ∪ j| (set notation)
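A minimal sketch of Jaccard-normalized co-occurrence in plain Python (hypothetical purchase sets):
purchases = {
    'alice': {'diapers', 'baby wipes', 'milk'},
    'bob': {'diapers', 'baby wipes'},
    'carol': {'dvd', 'milk'},
}
def jaccard(item_i, item_j):
    bought_i = set(u for u, items in purchases.items() if item_i in items)
    bought_j = set(u for u, items in purchases.items() if item_j in items)
    union = bought_i | bought_j
    return len(bought_i & bought_j) / float(len(union)) if union else 0.0
print(jaccard('diapers', 'baby wipes'))  # 2/2 = 1.0
print(jaccard('diapers', 'dvd'))         # 0/3 = 0.0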
An alternative similarity metric is cosine similarity
Cons
does not utilize context (e.g. time of day), user features (e.g. age), product features (e.g. baby vs electronics)
Cold start problem: no history for a new product or user
Matrix factorization
Matrix completion task
| User | Movie | Rating |
| ---- | ----- | ------ |
| U1 | M1 | 3 |
| U1 | M2 | 5 |
| U1 | M3 | 2 |
| U2 | M4 | 2 |
| U2 | M5 | 4 |
| U3 | M2 | 4 |
| U3 | M4 | 4 |
| U3 | M5 | 5 |
| U3 | M6 | 4 |
(matrix: rows are users, columns are movies; X marks cells where Rating(u,v) is known, blanks are unknown)
Rating(u,v) known for X cells
Rating(u,v) unknown for blank cells
Goal is to fill in the missing data
Recommendations from known user/item features
Describe Movie v with topics Rv
action, romance, drama, ... = [0.3, 0.01, 1.5, ...]
Describe User u with topics Lu
action, romance, drama ... = [2.5, 0, 0.8, ...]
Estimate ^Rating(u,v) for an unrated movie = <^Lu, ^Rv>
[0.3, 0.01, 1.5, ...] * [2.5, 0, 0.8, ...] = 0.75 + 0 + 1.2 = 1.95
^ this estimate is not what the user would actually score the movie, but the higher it is, the more likely they are to like it
Predictions in matrix form
Rating(u,v) = <Lu, Rv>
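A minimal sketch of this dot-product prediction (topic vectors taken from the example above):
Rv = [0.3, 0.01, 1.5]  # movie topics: action, romance, drama
Lu = [2.5, 0.0, 0.8]   # user topics:  action, romance, drama
rating_hat = sum(l * r for l, r in zip(Lu, Rv))
print(rating_hat)  # 0.75 + 0 + 1.2 = 1.95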
The full ratings matrix can be approximated by collecting all the Lu into L, all the Rv into R, and multiplying: Rating ~= L R
Discovering hidden structure by matrix factorization
Use residual sum of squares, RSS (introduced earlier)
RSS(L,R) = SUM over observed (u,v) of [(Rating(u,v) - <Lu,Rv>)^2]
The matrix factorization model approximates the ratings matrix with this factorization
Limitations:
cold start problem: new movie or user
Bringing it all together: Featurized matrix factorization
Features capture context
matrix factorization captures group of users
We can combine the 2 above to mitigate cold-start problem
rating for a new user from features only
As we gather user data we shift to matrix factorization topics
Netflix Challenge 2006-2009
winning team blended over 100 models
Performance metrics for recommender systems
A performance metric for recommender systems
Don't want to use classification accuracy (liked vs not liked)
not interested in what a person does not like
want to quickly discover relatively few liked items
^ imbalanced class problem
Users have short attention span, so we want to recommend fewer items
higher cost to missing liked item
Recall = (# liked & shown) / (# liked)
e.g 3/5
Precision = (# liked & shown) / (# shown)
e.g. 3/11
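A minimal sketch of both metrics in plain Python (hypothetical liked/shown sets matching the 3/5 and 3/11 examples):
liked = {'A', 'B', 'C', 'D', 'E'}
shown = {'A', 'B', 'C', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M'}  # 11 recommendations
recall = len(liked & shown) / float(len(liked))     # 3/5
precision = len(liked & shown) / float(len(shown))  # 3/11
print(recall, precision)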
Optimal recommenders
How to maximize recall? recommend everything: every liked item is then shown
but precision is then tiny, e.g. 1000 total products of which I like 6 => precision = 6/1000
The optimal recommender recommends all the items I like and nothing else
precision & recall = 1
Precision-recall curves
Input: specific recommender system
Output: algorithm-specific precision-recall curve
Precision Recall Curve
A possible metric is AUC (Area under Curve). e.g. compare the blue curve to green curve below
Precision Recall Algorithm
Another metric is to set desired recall and maximize precision at k e.g.
only want to display x items to fit on a single page
assuming users will ignore anything more than x items
Summary of recommender systems
Recommender systems ML block diagram
Training data = user, products, ratings
x: user_id, product_id
ML Model: matrix factorization
ŵ: { ^Lu, ^Rv } + other weighted parameters
ŷ: predicted rating
y = actual ratings (training data set)
quality metric: RSS as well as others
ML Block Diagram
Quiz: Recommender Systems
1 Recommending items based on global popularity can (check all that apply):
provide personalization
capture context (e.g., time of day)
none of the above
2 Recommending items using a classification approach can (check all that apply):
provide personalization
capture context (e.g., time of day)
none of the above
answer:-
3 Recommending items using a simple count based co-occurrence matrix can (check all that apply):
provide personalization
capture context (e.g., time of day)
none of the above
4 Recommending items using featurized matrix factorization can (check all that apply):
provide personalization
capture context (e.g., time of day)
none of the above
5 Normalizing co-occurrence matrices is used primarily to account for:
people who purchased many items
items purchased by many people
eliminating rare products
none of the above
6 A store has 3 customers and 3 products. Below are the learned feature vectors for each user and product. Based on this estimated model, which product would you recommend most highly to User #2?
| User ID | Feature vector |
| ------- | ------------------ |
| 1 | (1.73, 0.01, 5.22) |
| 2 | (0.03, 4.41, 2.05) |
| 3 | (1.13, 0.89, 3.76) |

| Product ID | Feature vector |
| ---------- | ------------------ |
| 1 | (3.29, 3.44, 3.67) |
| 2 | (0.82, 9.71, 3.88) |
| 3 | (8.34, 1.72, 0.02) |
Product #1
Product #2
Product #3
7 For the liked and recommended items displayed below, calculate the recall and round to 2 decimal points. (As in the lesson, green squares indicate recommended items, magenta squares are liked items. Items not recommended are grayed out for clarity.) Note: enter your answer in American decimal format (e.g. enter 0.98, not 0,98)
8 For the liked and recommended items displayed below, calculate the precision and round to 2 decimal points. (As in the lesson, green squares indicate recommended items, magenta squares are liked items. Items not recommended are grayed out for clarity.) Note: enter your answer in American decimal format (e.g. enter 0.98, not 0,98)
9 Based on the precision-recall curves in the figure below, which recommender would you use?
RecSys #1
RecSys #2
RecSys #3
Answers:
1: 3
2: 1,2
3: 1
4: 1,2
5: 2
6: 2
7: 1/3 = 0.33
8: 1/4 = 0.25
9: 1
Song recommender: IPython Notebook
iPython Notebook
Loading and exploring song data
Song data = [user_id, song_id, listen_count, title, artist, song]
song_data['song'].show()
Show unique users: users = song_data['user_id'].unique()
Creating & evaluating a popularity-based song recommender
Set up train/test data train_data,test_data = song_data.random_split(.8,seed=0)
Popularity based recommender popularity_recommender
popularity_model = graphlab.popularity_recommender \
    .create(train_data,
            user_id='user_id',
            item_id='song')
Note: we're using song as the item_id for the recommender above
The recommendations for all users should be the same
popularity_model.recommend(users=[users[0]])
popularity_model.recommend(users=[users[1]])
Creating & evaluating a personalized song recommender
Personalized recommender item_similarity_recommender
personalized_model = graphlab.item_similarity_recommender \
    .create(train_data,
            user_id='user_id',
            item_id='song')
Can recommend by user personalized_model.recommend(users=[users[0]])
Or by song: personalized_model.get_similar_items(['With Or Without You - U2'])
Using precision-recall to compare recommender models
Can compare() the 2 models on test_data and generate a graph
model_performance = graphlab.compare(test_data, [popularity_model, personalized_model], user_sample=0.05)
graphlab.show_comparison(model_performance,[popularity_model, personalized_model])
Quiz: Recommending songs
Deep Learning: Searching for Images
Neural networks: Learning very non-linear features
Slides presented in this module
Annotated PDF
Some useful papers on computer vision:
SIFT - Lowe '99
Spin Images - Johnson & Herbert '99
Textons - Malik et al. '99
RIFT - Lazebnik '04
GLOH - Mikolajczyk & Schmid '05
HoG - Dalal & Triggs '05
SURF - Bay et al. '06
ImageNet - Krizhevsky '12
What is a visual product recommender?
Want to search for a product but not sure what keywords to use e.g. dress
"dress floral" shows varied options
Can click on "similar items" once we find something we like
Learning very non-linear features with neural networks
Features are key to machine learning
Deep learning based on neural networks
learning very non-linear features
Linear classifiers create a boundary between +ve and -ve scores
With neural networks, we use graphs to score
Score(x) = w0 + Sum(wi * xi)
Graph Classifier
QN: If the perceptron takes an input of exactly 0, what should it output?
ANS: Implementation defined. An input of zero is an edge case.
What can a linear classifier represent?
x1 OR x2
x1 AND x2
What can a linear classifier represent?
What can't a linear classifier represent? XOR
What can't a linear classifier represent?
^ representing XOR requires adding a layer: this is a simple neural network
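A minimal sketch (hypothetical weights): XOR built from a 2-layer network of linear threshold units.
def unit(w0, w1, w2, x1, x2):
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else 0
def xor(x1, x2):
    z1 = unit(-0.5, 1, -1, x1, x2)   # fires for x1 AND NOT x2
    z2 = unit(-0.5, -1, 1, x1, x2)   # fires for x2 AND NOT x1
    return unit(-0.5, 1, 1, z1, z2)  # fires for z1 OR z2
for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))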
Neural Networks
require lots of data
require high performance systems
Deep learning & deep features
Application of deep learning to computer vision
Image features are combined to make a prediction.
e.g. nose, eye, eye, mouth => face
^ in practice the detectors are much lower level: they don't find a whole eye, but respond to various local patterns
^ they find locally "unique" features
Example algorithm is SIFT
SIFT
Instead of hand-crafting features like SIFT, neural networks implicitly learn them
Deep learning performance
Examples of image recognition using deep neural networks
99.5% accuracy on German traffic sign recognition
97.8% accuracy on house number recognition
ImageNet 2012 competition
SuperVision team got huge gain over competitors with new algorithms and GPU implementation
Demo of deep learning model on ImageNet data
ImageNet
Other examples of deep learning in computer vision
Scene parsing: label parts of an image
scene parsing
Challenges of deep learning
Pros
Enables learning of features rather than hand tuning
Impressive performance gains
Computer vision
Speech recognition
Some text analysis
Potential for more impact
Cons
Requires a lot of data for high accuracy
Computationally really expensive
Extremely hard to tune
Choice of architecture
Parameter types
Hyperparameters
Learning algorithm
layers
etc
Computationally expensive and hard to tune
Deep Features
Deep features allow us to build good classifiers even with small training data, by reusing a network trained on a bigger dataset
Deep Features
In a neural net
the early layers are more generic and can be reused as a feature extractor
the later layers are task-specific
By doing:
keep the weights fixed for the early layers
train a simple classifier on the later layers for the new task
Tasks
Workflow looks like
workflow
Real world use: Compology adds cameras to trash cans to detect how full the trashcan is
Summary of deep learning
training data: image, label
x: deep features
ML model: logistic regression
ŷ: predicted labels
y: true labels
quality metric: classification accuracy
ŵ: weights of features
ML Block Diagram
Summary
Describe multi-layer neural network models
Interpret the role of features as local detectors in computer vision
Relate neural networks to hand-crafted image features
Describe some settings where deep learning achieves significant performance boosts
State the pros & cons of deep learning model
Apply the notion of transfer learning
Use neural network models trained in one domain as features for building a model in another domain
Build an image retrieval tool using deep features
Quiz: Deep Learning 6 questions
1 Which of the following statements are true? (Check all that apply)
Linear classifiers are never useful, because they cannot represent XOR.
Linear classifiers are useful, because, with enough data, they can represent anything.
Having good non-linear features can allow us to learn very accurate linear classifiers.
none of the above
answer:- 3
2 A simple linear classifier can represent which of the following functions? (Check all that apply) Hint: If you are stuck, see https://www.coursera.org/learn/ml-foundations/module/nqC1t/discussions/AAIUurrtEeWGphLhfbPAyQ
x1 OR x2 OR NOT x3
x1 AND x2 AND NOT x3
x1 OR (x2 AND NOT x3)
none of the above
answer:- 1,2,3
Generated tables
| a | b | c | a OR b OR NOT c |
| - | - | - | --------------- |
| T | T | T | T |
| T | T | F | T |
| T | F | T | T |
| T | F | F | T |
| F | T | T | T |
| F | T | F | T |
| F | F | T | F |
| F | F | F | T |

| a | b | c | a AND b AND NOT c |
| - | - | - | ----------------- |
| T | T | T | F |
| T | T | F | T |
| T | F | T | F |
| T | F | F | F |
| F | T | T | F |
| F | T | F | F |
| F | F | T | F |
| F | F | F | F |

| a | b | c | a OR (b AND NOT c) |
| - | - | - | ------------------ |
| T | T | T | T |
| T | T | F | T |
| T | F | T | T |
| T | F | F | T |
| F | T | T | F |
| F | T | F | T |
| F | F | T | F |
| F | F | F | F |
3 Which of the following neural networks can represent the following function? Select all that apply. (x1 AND x2) OR (NOT x1 AND NOT x2)
Hint: If you are stuck, see https://www.coursera.org/learn/ml-foundations/module/nqC1t/discussions/AAIUurrtEeWGphLhfbPAyQ
answer:- 4
4 Which of the following statements is true? (Check all that apply)
1) Features in computer vision act like local detectors.
2) Deep learning has had impact in computer vision, because it's used to combine all the different hand-created features that already exist.
3) By learning non-linear features, neural networks have allowed us to automatically learn detectors for computer vision.
4) none of the above
answer:- 1,3
5 If you have lots of images of different types of plankton labeled with their species name, and lots of computational resources, what would you expect to perform better predictions:
1) a deep neural network trained on this data.
2) a simple classifier trained on this data, using deep features as input, which were trained using ImageNet data.
answer:- 1
6 If you have a few images of different types of plankton labeled with their species name, what would you expect to perform better predictions:
1). a deep neural network trained on this data.
2). a simple classifier trained on this data, using deep features as input, which were trained using ImageNet data.
answer:- 2
Deep features for image classification: iPython Notebook
iPython Notebook
Deep features for image retrieval: iPython Notebook
iPython Notebook
Programming assignment
bird
6
4
35 - 37
37-39
cat
60-70
Deploying machine learning as a service
What is Production?
Deployment: Serve live predictions
Evaluation: Measuring quality of deployed models
Management: Choosing between deployed models
Monitoring: Tracking model quality & operations
Deployment ---> Evaluation
^ |
| |
| v
Management <--- Monitoring
Deployment System
(diagram: batch training fits a model on historical data; the deployed system serves real-time predictions on live data, and feedback from those predictions flows back into the historical data)
After Deployment: Evaluation, Management, Monitoring
Use feedback to learn a Model 2
run both and compare to Model 1
continuous evaluation and testing
Evaluation = predictions + metrics
track data collected from users
what metrics being used to evaluate
Offline evaluation: when do we update the model?
e.g. SSE (sum squared error)
Online evaluation: choosing between models
e.g. User engagement
Can use A/B Testing to choose between ML Models
Other production considerations
A/B testing caveats
versioning
provenance
dashboards
reports
etc
Machine learning challenges and future directions
Open challenges in ML
Choice of Model to use
Lots of models are available
e.g. classifier vs matrix factorization
Data representation
e.g. how do I represent text: bag-of-words counts or tf-idf?
Scaling: Data is getting big
e.g. too many social networks to get data from
e.g. wearable devices are gathering a lot of data
e.g. IoT
Need methods that scale up to the amount of data available
Scaling: models are getting big
e.g. clustering models applied to brain activity get complicated
CPUs stopped getting faster
only marginal increases in the last decade
Parallel architectures
can solve for CPU limitations e.g. using GPUs or clusters
concurrent system design and implementation is hard
Where is ML going?
Self-driving cars
specialized medicine
combining real-time data gathering: e.g. localization, cameras, etc
scaling up with amount of data
What's ahead in the specialization