Today I'm going to present a pretty neat way to make movie recommendations. It's based on a simplified version of what the team that won the 2009 Netflix Challenge used.

We're going to briefly cover the following topics:

  • Two popular recommendation engine paradigms
  • Matrix factorization for Latent Factor Models
    • Need to establish a loss function to train model
    • Some linear algebra
    • Regularization to combat overfitting
    • Adding biases
  • How to teach the model to learn
    • Stochastic Gradient Descent
  • Why making recommendations can be hard
    • How to determine if recommendations are good or bad
      • Need to establish a loss function to score model
    • Very subjective
  • Some examples of real movie recommendations

Recommendation engine paradigms

There are several ways to think about recommendation engines. One way is to create a profile of you, the user. Do you like action movies or comedies? Are you bored after 80 minutes or do you enjoy 3 hour foreign films? After you watch a bunch of movies, we can figure out what your user profile is and make recommendations to fit that profile. This type of model is called content filtering.

This method works, but it doesn't account for a lot of information that's out there. For example, maybe you like Jean Claude Van Damme movies from the 80's. Once you've seen them all, then what? We can recommend similar movies, but the universe of possible, useful recommendations is quite small. What if instead we looked at the viewing habits of other movie goers? Imagine if 90% of JCVD fans also enjoy 70's Westerns.

Chances are, so would you. That is, we assume that if someone likes what you like, then you probably like what they like too. Naturally, a venn diagram follows:

Of course, you might like everything another user likes, but they might not like some of the movies you like. It's not a two way street. Incorporating other users' behaviors is known as collaborative filtering. Some hybrid of these two approaches is what's usually used in practice, but today we'll focus on the latter.

Latent Factor Modeling

There are a few ways to implement a collaborative-filtering based model. One way is using a latent factor model (LFM). LFMs attempt to find hidden (i.e. latent) patterns and important features that help to explain observed data. We don't always know what the latent factors are, but they could be something as explicit as "action vs. comedy" or something more difficult to identify such as "diversity of cast" or "number of dad jokes". Sometimes, we can't identify what the factors represent, but they are there.

So how can we figure out what these latent factors are? Using the maths of course. Specifically, linear algebra. Essentially, we're taking a big matrix of movie ratings (users along the rows, movies along the columns) and trying to approximate it with two smaller matrices. Something like this:

Mathematically, we can express the above as follows:

k is the number of factors. Multiplicative model.

k is the number of factors. Multiplicative model.

Where is the rating given by the ith user for the jth movie and is our approximation. In other words, we want to minimize:
Want to find u and v such that the squared error is minimized.

Want to find u and v such that the squared error is minimized.

for each combination of user and movie. We call the above the loss function - it's the thing we want to make as small as possible by choosing certain values for u and v.

Quick digression on loss functions: Note that its squared because we don't care if the approximation is more or less than the true rating (think about the symmetry of a parabola), and calculations involving squares are much easier than those involving absolute values. Trust me on this.

In many cases, however, our loss function might not be symmetric. For example, it's probably better not to recommend the wrong movie vs. recommending the right movie. Sure, it sucks to miss out on a good recommendation, but it's much worse if you give someone a bad recommendation. Then they can't trust that you know what you're doing. This concept is similar to loss aversion - it's asymmetric. 

The challenge is to compute the mapping for each user and movie to the vectors u and v. High correspondence between movie and user factors leads to a higher rating and hence a recommendation.

Here's the rub: Recall the ratings matrix is a matrix of ratings - scored between 1 and 5 with 0's where the user hasn't seen a movie. The goal is to figure out what rating a user would have given a movie had they seen it. When that rating is high, make the recommendation, otherwise don't. So the crux is to approximate the empty elements with inferred predictions. When the prediction is close to 5/5, recommend the movie, otherwise don't. There are many ways to approximate each movie rating. Multiplying the user and movie vectors (like I show above) is one simple way. 

There are many ways to solve for the vectors u and v. However, remember that we only have non-zero numbers where someone has rated a movie. That means there are a lot of 0's everywhere. This also means that our model can only learn based on watched movies, and hence is biased towards those movies, compared to all the unwatched ones. This is a classic case of overfitting to the data, and can be countered by introducing a regularization term - a penalty which balances fitting the model to the watched movies and unwatched movies. Now our model looks like:

K is the number of latent factors. Lambda is the regularization parameter.

K is the number of latent factors. Lambda is the regularization parameter.

Where the stuff on the right hand side acts as a penalty on the whole thing. Remember, we're trying to minimize the entire quantity and we just added a positive term. This positive term is going make it harder to minimize the whole thing.

Finally, we can increase the sophistication of the model by accounting for people's biases. That is, some movies are more likely to get scored higher than other movies, all else equal, just because some movies have better marketing. Or perhaps there is a seasonality effect, i.e. Christmas movies are likelier to be rated higher than non-Christmas movies, around Christmas time. We can account for these movie and user biases by modeling the bias term too, in addition to the u and v vectors. We can extend the model like so:

Model biases and interactions

Model biases and interactions

Where is the overall rating mean, is the bias for user i and is the bias for movie j. Finally, the model we are trying to minimize is:
AMMI + regularization model

AMMI + regularization model

There exist other extensions, including time-based and confidence based extensions. But this model will suffice for our purpose. Next up: How does the model actually solve for u, v and b?

Teaching the model

The two most popular methods for solving equations of the above form, are alternating least squares and stochastic gradient descent (SGD). My model implements that latter.

SGD is an algorithm which takes steps in some direction and continues taking steps until the minimum is reached. The direction of the step is the gradient, or first derivative of the function you're trying to minimize. And the size of the step is something that we commonly find using trial and error - too big of a step means you might overstep the minimum; too small of a step and you'll take forever to finally get there. So, the gradient of our function, with respect to each variable being minimized, is:

First derivatives. Every step, update variable values.

First derivatives. Every step, update variable values.

And that's it! We now have everything we need to build a recommendation engine.

To recap:

  1. We first identify what exactly we are trying to find - an approximation to the true ratings matrix
    • This way we can fill in the 0's
    • When we predict a 5/5, make a recommendation
  2. Then we build a model that attempts to find the true rating
    • We used an additive main effects and multiplicative interaction model (AMMI)
      • This modeled user-movie interaction and biases
    • We also controlled for potential overfitting by adding a regularization penalty
  3. Then we used SGD to actually find the values of the user, movie and bias vectors which gave us the minimum squared error between the true ratings and the approximation

We should be able to come up with a predicted rating for each user-movie combination and make recommendations where appropriate. Of course, whether or not that recommendation is any good is another story.

Recommendations are hard

If you really like Die Hard, and the model recommends Independence Day 2 (because ID2 scores high in the model) is that a good or bad recommendation, and how good/bad is it? (Note: Independence Day 2 is awful).

In actuality, recommending the right movie (or anything else for that matter) is notoriously difficult to do accurately. It's highly subjective and real models are based on much more than our model above. As I mentioned, real models incorporate time, confidence of rating, and other variables that model our changing preferences. Of course, the more complex the model is the less interpretable the results. Occam's razor strikes again! 

Scoring recommendation ENGINES

So now that we have a model, how do we know if it's any good? This all depends on what our definition of good or bad is. Earlier I discussed the concept of a loss function. We used squared loss to teach the SGD algorithm where the minimum was. Now, given a few different models, how would you rate which of them is best? Once again, its common to use a squared loss metric, such as root mean squared error. For example, I could train a bunch of models on 50% of the ratings data, and then make predictions on the remaining 50% of the data, and calculate how close (as measured by RMSE) each model's prediction was to the true data. Whichever model is the closest, wins. Again though, how can we be sure that the recommendations that this model makes are any good? What if RMSE isn't the best loss function to use? And even if it is, I'm not guaranteed that the recommendations will be good.

All this is to say that making recommendations is HARD. Which is probably why Amazon, Netflix and Facebook has teams of people working on this problem.

An example of real movie recommendations

Notwithstanding the preceding paragraphs, I went ahead and implemented a Latent Factor Model  (with k = 5 factors) and tested it on real users to see how it performs. It's an additive main effect and multiplicative interaction model (AMMI), and the learning algorithm is stochastic gradient descent. For more details, see my paper here.

Using the MovieLens dataset, I can pick off a random user, take a look at movies they rated highly (5/5) and based on that, see what movies that they haven't seen that my model would recommend to them. The model recommends movies by minimizing the RMSE, but that doesn't guarantee that the movies make sense - it's just the most popular (albeit highly flawed) metric for these kinds of recommendations. Let's take a look at userid = 250:

This user rated the following movies 5 out of 5:

This user appears to enjoy rom coms with a soft sport for crappy action and sci-fi flicks. Whatever works. Running this user through my model, the top 5 recommendations are:

3 of these 5 are comedies and 4 of 5 are dramatic in some way. Nick of Time is also just a great movie all around.


In order to really know if these are good recommendations, however, we need to ask the user. If we could learn from the user what their preferences are, perhaps by having them vote up or down, the model could learn very quickly what to recommend. However, in lieu of that, we must rely on other user's behavior patterns to fill in the movie gaps. Sometimes the results are intuitive (such as recommending Bed of Roses to someone who enjoys Strawberry and Chocolate) and other times, not so much (recommending The Crossing Guard). Of course, the purpose of Latent Factor Modeling is that, beyond just genre, there could be some more interesting factors that lead someone to enjoy both The Brothers and Canadian Bacon. I couldn't tell you what that is, but perhaps there are so many layers of complexity to John Candy that only a model can suss out!

The Event Classifier

The Event Classifier