Recommender systems gained significant popularity with the advent of the Netflix Prize in 2009, although methods of automated recommending were being researched well before. In this post, we will be exploring the common setup of a recommendation problem as well as a couple of the most common techniques.

Caution!

Math ahead! If you do not have a background in linear algebra, proceed with caution.

Contents

Let's begin with the basic problem: you sit down to watch Netflix on a Saturday night, and are frankly sick of the shows you've watched over the past few months. You'd like to try something different, but there are so many choices! Luckily, Netflix breaks down their selection into categories and popularity, so you can get a good idea of what other users with your same interests like to watch. But how do they take thousands of shows and movies and select the few to show you?

While Netflix uses some advanced methods to make their decision, this sort of recommendation problem can be formulated fairly simply: there exists a set of users and a set of items. We can build a matrix where each row is a user and each column is an item:

In this matrix *M*, an entry *M[i][j]* for user *i* and item *j* being equal to one
means user *i* has interacted with item *j*, and zero if user *i* has not
interacted with item *j*. Now in English: if you have watched *The Office*,
then the position at the intersection of you and *The Office* has a one.
If you have not watched *Friends* then the corresponding position has a zero, etc.

In many forms of the recommendation problem, instead of user-item *interactions*
*M* contains information on user-item *ratings*, where each rating is
normalized between zero and one:

For example, if you rate *The Office* 4 of 5 stars, then instead of a one in
the corresponding position, there will be a 0.8 (\(4/5=0.8\), 4 of a maximum
of 5 stars were chosen). Now, how do we fill in the zeroes to predict
how a particular user might feel about a particular item, and thus decide if
we should recommend it?

## K-Nearest Neighbors

Consider if a friend of yours was asking for recommendations on the next movie he or she should watch. If you and your friend had very different tastes, you wouldn't likely recommend the movies you enjoy. However, if your friend and your dad have very similar tastes, then you are more likely to recommend movies your dad enjoys. This is the base thought process of the K-Nearest Neighbors algorithm.

Essentially, of all the users existing in your system, you want to pick the
top *k* users that are most similar to the target user, and recommend items
based on predicted ratings from these similar users.
How this "similarity" is measured wholly depends on the application
context. For a Netflix rating example, we might consider two users who rate *The
Office* 4 stars and 5 stars similar. But what if one of those two users rates
*Parks and Rec* 1 star and the other 4 stars? Are these two users still
considered similar?

Consider two vectors representing users' normalized ratings:

Where vector *w* represents one user, and *v* another. Each entry in these
vectors corresponds to a item, and each vector for each user has an entry
for every single item. Our goal is now to find if these two users are similar
based on these vectors, taking advantage of geometry. For a bit of background,
let's first define a vector.

### Vectors

If you are familiar with vectors, feel free to skip this section.

Vectors define two quantites: a *position* and an *orientation*. The position of
a vector is defined by its coordinates, and the orientation by its angle with
the x-axis. For example:

This vector's position is defined by the coordinates to which it points, in this case \((9,5)\), 9 units over and 5 units up. To get the orientation, we need to calculate the angle between the vector and the x-axis. In this case it is slightly over 29 degrees.

The example vector above is defined in two dimensions, however vectors can
be defined in infinite dimensions. That being said, I will be defining a few
vector similarity measures all of which hold for any *n*-dimensional vector.

### Similarity Measures

A similarity measure is a quantitative measurement of how much two vectors resemble one another, or how close they are to being the same. One of the most common similarity measures is the familiar Euclidean distance D(u,v):

Euclidean distance is the distance between two vector positions.
The relation to similarity measures is clear: if
the distance between two vectors is small, they are similar. If it is large, they
are not. The major weakness is this distance is *relative*. To construct a
definitive statement on how far two vectors are from each other, you need a
reference point.

For example, the distance between Dallas and Oklahoma
City compared to the distance between my mouse and keyboard makes Dallas to
Oklahoma City look unfathomably far, while Dallas to Oklahoma City compared to
the distance from Earth to the Sun makes Dallas to Oklahoma City seem
microscopic. Recall in K-Neareset Neighbors we wish to recommend the top *k*
similar users to a target user. There is an underlying assumption that there are many
users within this system with which to compare, meaning this drawback is not
really a drawback for the recommendation use case, while it can become an issue
for interpretability.

Another common similarity measure is the cosine similarity C(u,v):

Cosine similarty normalizes the magnitude of two vectors and measures the cosine of the angle between them, yielding a value between negative one and one. That is, instead of comparing distance cosine similarity is a comparison of orientation. Pretend there are two people standing on a street corner. One is facing north and another north-east. Were the person facing north to be standing ten miles away rather than on the same street corner, the similarity of orientation would not change, although their distance had. The person facing will face north no matter where he or she is, and the person facing north-east will face north-east no matter where he or she is.

Mathematically, cosine similarity has the benefit of providing a global reference point: the x-axis. This means for any two vectors I can always construct a definitive statement of their similarity. A cosine similarity of one means the two vectors are exactly the same, zero means they are not correlated, and negative one means the vectors are exactly opposite.

Of course both Euclidean distance
and cosine similarity are general measures of *vector similarity*, however, we
do not always have vector output from recommendation engines. In the case of
discrete sets, Jaccard Similarity is the more popular. There are many
more defined similarity measures for vectors and sets, which are not defined
here for the sake of time [1].

### Discussion

The largest downside to the K-Nearest Neighbors recommendation method is the computation time. To compute the recommendations for a single user requires measuring the similarity between all other users and sorting them, a total complexity of \(O(n^{2}logn)\). If one wishes to compute the recommendations for all users at once (a very common need) the complexity gets bumped up a degree: \(O(n^{3}logn)\) [2].

K-Nearest Neighbors is one method (among many others) for *collaborative
filtering*. That is, I am making recommendations based on that user's or other
users' behavior. Measuring similarity among other users is called user-based
recommendation. Alternately, you may want
to recommend similar items with which a user has already interacted. This is
called item-based recommendation and is done in the exact same way, but
instead of comparing user similarity we compare item similarity.

A major bonus of collaborative filtering methods is the explicit comparisons yield readily available reasoning for the recommendation. If I'm doing user-based recommendation, then I can show a message along the lines of "Other users similar to you also seemed to like this...", or in the case of item-based recommendation "Because you bought this, you may also like this other thing...". Other methods trade this explainability for performance.

## Singular Value Decomposition

Collaborative filtering suffers heavily from what's called the *sparsity
problem*: the user-item objective matrix has quite a few zeros, thus for larger
data sets recommendations can be very weak. *Singular value decomposition* (SVD)
handles this issue very well, with its own set of drawbacks that will be
discussed in just a minute. SVD is extremely abstract, however I will explain
in as much English as possible.

### Matrix Decomposition

Let's first define the singular value decomposition for a matrix *A*:

As the name implies, this defines an eigenspace breakdown of *A* (for those
who would want the detail), where the matrix *S* is a diagonal matrix composed
of the *singular values* of *A*, the columns of *U* are the *left singular
values* of *A*, and the columns of *V* are the *right singular values* of *A*.
In more concise terms, the matrix *S* gives how the vectors in *U* stretch into
the space of the vectors in *V*. So let's recall the user-item matrix *M*:

How can we use singular value decomposition to make recommendations with this
matrix? All we have to do is take the original decomposition equation and
replace *A* with *M*:

Actually computing the matrices above is a complex optimization problem, thus for the sake of time and for the benefit of the reader, it is left out.

### Computing Recommendations

How we get recommendations out of this is a bit tricky. First, we need to
translate the meaning of each matrix. The matrix *U* is representative of the
user space, and *V* is representative of the item space. Thus the singular
values in *S* measure how the vectors in the user space stretch into the vectors
of the item space, thus we end up with an abstract relation between the users
and items. But what do the singular values in *S* mean? This post describes
this really well. Essentially, each singular value is a weight on these *latent
factors* between the users and items. A latent factor in the context of
recommendations is an abstract property that users and items have. Therefore,
computing recommendations with an SVD model is a matching of abstract properties
between the input user vector (or user matrix) and the item matrix [3].

There is an interesting property here that needs to be addressed. Pretend there
are *m* users and *n* items. *M* is then an \(m \text{ x } n\) matrix. It follows
from equality that \(USV^{T}\) is required to also be \(m \text{ x } n\). Thus
the matrix *U* should be \(m \text{ x } r\) for some *r*, and \(V^{T}\) \(r
\text{ x } n\) for the same *r*. This means *S* is a sqaure \(r \text{ x } r\) matrix, where
*r* does not depend on the user-item matrix. This makes *r* configurable: a
user can choose how many of these latent factors (abstract user-item properties)
to use. This alleviates the aforementioned sparsity problem. If M is very
sparse, choosing a small *r* will not only greatly reduce the dimensionality,
but group together many related properties into a single latent factor. This is
the reason the winner of the Netflix Prize used SVD heavily in their final
solution.

### Discussion

Singular value decomposition has proven to be a very powerful recommender. The process outlined above relating users and items via these abstract properties implicitly maps patters in the underlying data set that are difficult or near impossible for humans to percieve. But with great power comes great responsibility, of which SVD takes none. Since these found properties are more than likely uninterpretable by humans, reasoning for why the recommender produced a given recommendation is not immediately available, or really available at all. Therefore, in cases such as online shopping where a user may want to know why a certain item was recommended, SVD should be used with care.

## Conclusion

K-Nearest Neighbors and singular value decomposition are two of the most common recommenders used in modern systems. There are plenty of pre-built libraries that make these models immediately avilable for use and provide utilities for evaluation. However, these general use models should be used only for exploration and testing. Production level recommenders should be tailored towards the particular system, and tested in the same manner [4]. Deep learning for recommendations is a newer research field being explored, however the concepts needed to express how these models perform are more complex and are thus saved for a later post.

[1] | Pearson's Correlation Coefficient is of notable mention, as it is also a very popular measure of vector similarity. |

[2] | There are have been many caching strategies, many of which are obvious, to improve this. However for the sake of making a point I have left them out. |

[3] | By "abstract property" I simply mean it is a property that the algorithm understands that may or may not be human interpretable. |

[4] | It is well known that the best way (though not the only) to test a recommendation system is to provide recommendations to real users and get their feedback. |