In this lesson, we will look at another matrix factorization technique called Alternating Least Squares (ALS). This method can prove to be much more effective and robust than the SVD we saw earlier. ALS allows you to apply regularization and minimize a loss function while optimizing the model parameter $k$. We will briefly look at the math behind this approach in this lesson, before putting it into practice with Spark.
You will be able to:
- Explain how bias terms can be used to create more accurate embedding matrices
- Describe how ALS is related to matrix decomposition and why it can be parallelized so well
In the past few lessons, we learned that matrix factorization works with the following assumptions:
- Each user can be described by $k$ attributes or features. For example, feature 1 could be a number that says how much each user likes sci-fi movies, feature 2 how much they like horror movies, and so on.
- Each item can be described by an analogous set of $k$ attributes or features. For our MovieLens example, feature 1 for a chosen movie might be a number that says how close the movie is to pure sci-fi.
By multiplying the features of the user by the features of each item and adding these products together, we can approximate what a user would rate a particular item.

The simplicity of matrix factorization is that we do not have to know what these features are, nor how many ($k$) of them are relevant; we simply pick a value for $k$ and learn the feature values from the data.

How do we integrate learning into this problem? By minimizing a loss function, of course!
We can use matrix factorization to represent each user and each item by a $k$-dimensional latent vector. Letting user $u$ be represented by the vector $p_u$ and item $i$ by the vector $q_i$, an individual predicted rating is the dot product of the two vectors, and for the entire matrix we can write:

$$ R = PQ^T $$

where:
- $R$ is the full user-item rating matrix
- $P$ is a matrix that contains the users and the $k$ factors, represented as (user, factor)
- $Q^T$ is a matrix that contains the items and the $k$ factors, represented as (factor, item)
- $\hat{r}_{u,i}$ represents our prediction for the true rating $r_{u,i}$. In order to get an individual rating, you take the dot product of a row of $P$ and a column of $Q^T$
These user and item vectors are called latent vectors.
The image below is a representation of how a matrix is decomposed into two separate matrices:
If we wanted to calculate a single rating, say for the user with latent vector [-1.03, 1.62, 0.21] and the item with latent vector [-0.78, 0.89, -1.47], we would take the dot product of the two vectors. Let's calculate these values in NumPy:
```python
import numpy as np

# users X factors
P = np.array([[-0.63274434,  1.33686735, -1.55128517],
              [-2.23813661,  0.5123861 ,  0.14087293],
              [-1.0289794 ,  1.62052691,  0.21027516],
              [-0.06422255,  1.62892864,  0.33350709]])

# items X factors
Q = np.array([[-2.09507374,  0.52351075,  0.01826269],
              [-0.45078775, -0.07334991,  0.18731052],
              [-0.34161766,  2.46215058, -0.18942263],
              [-1.0925736 ,  1.04664756,  0.69963111],
              [-0.78152923,  0.89189076, -1.47144019]])

# the original ratings matrix (np.nan marks missing ratings)
R = np.array([[2, np.nan, np.nan, 1, 4],
              [5, 1, 2, np.nan, 2],
              [3, np.nan, np.nan, 3, np.nan],
              [1, np.nan, 4, 2, 1]])

print(P[2])
# [-1.0289794   1.62052691  0.21027516]

print(Q.T[:, 4])
# [-0.78152923  0.89189076 -1.47144019]

print(P[2].dot(Q.T[:, 4]))
# 1.9401031341455333
```
Now we can do the calculation for the entire ratings matrix. You can see that the values in the predicted matrix are very close to the actual ratings for those that are present in the original rating array. The other values are new!
```python
P.dot(Q.T)
# array([[ 1.99717984, -0.10339773,  3.80157388,  1.00522135,  3.96947118],
#        [ 4.95987359,  0.99772807,  1.9994742 ,  3.08017572,  1.99887552],
#        [ 3.00799117,  0.38437256,  4.30166793,  2.96747131,  1.94010313],
#        [ 0.99340337, -0.02806164,  3.96943336,  2.00841398,  1.01228247]])
```
This should remind you of how things were calculated for the SVD array, so let's see what is different. We want our predictions to be as close to the truth as possible. In order to calculate these matrices, we establish a loss function to minimize. To avoid overfitting, the loss function also includes a regularization parameter, $\lambda$.
The loss function is:

$$ L = \sum_{u,i \in \kappa}(r_{u,i} - p_u^T q_i)^2 + \lambda(||q_i||^2 + ||p_u||^2) $$

where $\kappa$ is the set of (user, item) pairs for which the rating $r_{u,i}$ is known and $\lambda$ is the regularization parameter.
In the equation, there are two L2 regularization terms to prevent overfitting of the user and item vectors. As always, our goal is to minimize this loss function. This could be done with something like gradient descent of course, but due to the massive size of sparse matrices so frequently associated with recommendation system datasets, this is not always feasible. That's where Alternating Least Squares comes into play.
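To make the loss function above concrete, here is a minimal sketch that evaluates it on the small `R`, `P`, and `Q` arrays defined earlier. The regularization strength `lam` is an illustrative value, not one prescribed by the lesson.

```python
import numpy as np

def factorization_loss(R, P, Q, lam=0.1):
    """Regularized squared error over the observed (non-NaN) ratings."""
    preds = P.dot(Q.T)            # predicted rating for every (user, item) pair
    observed = ~np.isnan(R)       # kappa: the set of known ratings
    squared_error = np.sum((R[observed] - preds[observed]) ** 2)
    regularization = lam * (np.sum(P ** 2) + np.sum(Q ** 2))
    return squared_error + regularization

print(factorization_loss(R, P, Q))
```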
For ALS minimization, we hold one set of latent vectors constant. Essentially, ALS alternates between holding the item vectors constant while solving for the user vectors, and holding the user vectors constant while solving for the item vectors.
If we assume either the user factors or the item factors are fixed, each subproblem becomes an ordinary regularized least squares problem. Let's look at these least squares equations written out mathematically.
First, assuming the item vectors are fixed, we solve for the user vectors:
$$p_u = \left(\sum_{r_{u,i} \in r_{u*}} q_i q_i^T + \lambda I_k\right)^{-1} \sum_{r_{u,i} \in r_{u*}} r_{u,i} q_i$$
Then we hold the user vectors constant and solve for the item vectors:
$$q_i = \left(\sum_{r_{u,i} \in r_{i*}} p_u p_u^T + \lambda I_k\right)^{-1} \sum_{r_{u,i} \in r_{i*}} r_{u,i} p_u$$
These two steps are repeated until convergence or some other stopping criterion is reached; the literature on ALS suggests that about 10 iterations are typically enough for good results.
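The closed-form updates above translate almost directly into NumPy. Below is a minimal, illustrative sketch of the alternating solves, run on the small `R` matrix from earlier; the regularization strength, number of iterations, and random initialization are illustrative choices rather than anything prescribed by the lesson.

```python
import numpy as np

def als(R, k=3, lam=0.1, n_iters=10, seed=42):
    """Plain NumPy ALS on a ratings matrix where np.nan marks missing entries."""
    n_users, n_items = R.shape
    rng = np.random.default_rng(seed)
    P = rng.normal(size=(n_users, k))     # user factors
    Q = rng.normal(size=(n_items, k))     # item factors
    observed = ~np.isnan(R)

    for _ in range(n_iters):
        # Hold the item vectors fixed and solve a regularized least squares problem per user
        for u in range(n_users):
            idx = observed[u, :]          # items rated by user u
            A = Q[idx].T.dot(Q[idx]) + lam * np.eye(k)
            b = Q[idx].T.dot(R[u, idx])
            P[u] = np.linalg.solve(A, b)
        # Hold the user vectors fixed and solve per item
        for i in range(n_items):
            idx = observed[:, i]          # users who rated item i
            A = P[idx].T.dot(P[idx]) + lam * np.eye(k)
            b = P[idx].T.dot(R[idx, i])
            Q[i] = np.linalg.solve(A, b)
    return P, Q

P_hat, Q_hat = als(R)
print(np.round(P_hat.dot(Q_hat.T), 2))    # predictions for every user-item pair
```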
Although this can produce great results on its own, it doesn't take into account systematic tendencies of certain users, certain items, or the data as a whole. To account for this, more advanced implementations of ALS include bias terms to capture some of these aspects of the overall utility matrix. A common implementation uses three bias terms:
- $\mu$: the global average, i.e. the overall average rating across all items
- $b_{i}$: item bias, the deviation of item $i$ from the average
- $b_{u}$: user bias, the deviation of user $u$ from the average
Let's look at a basic example of how this would work. Imagine we're trying to calculate the rating of The Shawshank Redemption for user Matt. Assume the overall average rating is 3.5 and that The Shawshank Redemption tends to be rated 0.4 points higher than average. Matt is a generous rater and tends to rate 0.2 higher than the average. This would make his baseline rating for The Shawshank Redemption 3.5 + 0.4 + 0.2 = 4.1.

Putting these biases into the prediction equation gives:

$$ \hat{r}_{u,i} = \mu + b_{u} + b_{i} + p_u^T q_i $$
and the overall loss function becomes:
$$ L = \sum_{u,i \in \kappa}(r_{u,i} - \mu - b_{u} - b_{i} - p_u^T q_i)^2 + \lambda(||q_i||^2 + ||p_u||^2 + b_{u}^{2} + b_{i}^{2})$$
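As a quick sanity check on the worked example above, here is a tiny sketch of the baseline (bias-only) prediction using the made-up Shawshank Redemption numbers; the learned interaction term $p_u^T q_i$ would be added on top of this.

```python
mu = 3.5      # global average rating
b_i = 0.4     # The Shawshank Redemption is rated 0.4 above the average
b_u = 0.2     # Matt tends to rate 0.2 above the average

baseline = mu + b_u + b_i
print(baseline)   # 4.1
```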
ALS is generally less computationally efficient than directly computing the SVD solution, but it shines when you are dealing with giant, sparse matrices. SVD requires that all entries of the matrix be observed, and this is not a requirement with ALS. Because of ALS's "alternating" nature, it lends itself to performing computations in parallel. This can make it extremely beneficial to use distributed computing when using ALS. Do you remember anything that works in parallel? Spark!
As we will see in our next lab, Spark's machine learning library ml offers an implementation of the alternating least squares algorithm out of the box. It factors the user-item rating matrix into a user-factor matrix and an item-factor matrix, and it performs the alternating updates in a parallel, distributed fashion.
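As a small preview of what that looks like (the details come in the next lab), here is a minimal sketch using pyspark.ml.recommendation.ALS; the toy ratings, column names, and hyperparameter values are illustrative only.

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-preview").getOrCreate()

# A tiny, made-up ratings DataFrame: (userId, movieId, rating)
ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0), (2, 0, 1.0), (2, 2, 4.0)],
    ["userId", "movieId", "rating"],
)

als = ALS(
    rank=10,        # k, the number of latent factors
    maxIter=10,     # number of alternating iterations
    regParam=0.1,   # the regularization parameter lambda
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
)
model = als.fit(ratings)
model.transform(ratings).show()   # adds a 'prediction' column
```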
- A detailed explanation of ALS
- Great video on recommendation systems in Spark
- The math behind ALS
- Spark ALS for Kaggle Santander competition
In this lesson, we looked at another matrix factorization technique called Alternating Least Squares and learned how such a model is trained by minimizing a least squares loss function. Let's now try out what we have seen so far in the Spark environment in the final lab for this section.