
ML Handy Tools

Recommender System Metrics — Clearly Explained

Understanding the evaluation metrics for recommender systems

Chaitanya Belhekar
5 min read · Aug 28, 2020


In this post, we will discuss the evaluation metrics used for recommender systems and try to explain them clearly. But first, a quick refresher on what a recommender system is.

A recommender system is an algorithm that suggests items to users based on their historical preferences and tastes. Recommender systems show up constantly in our everyday interactions with apps and sites: Amazon uses them to recommend products, Spotify to recommend music, YouTube to recommend videos, and Netflix to recommend movies.

The quality of recommendations depends on two things: how relevant they are to the user, and how interesting they are. Recommendations that are too obvious are mundane and not very useful. Relevance is measured with metrics like recall and precision, while the "interestingness" side is captured by metrics such as diversity, coverage, serendipity, and novelty. We will explore the relevance metrics here; for the serendipity-style metrics, please have a look at this post: Recommender Systems — It's Not All About the Accuracy.

Let's say we have some users and some items, like movies, songs, or products. Each user is interested in some of the items. We recommend a few items (say, k of them) to each user. Now, how do we judge whether our recommendations to each user were any good?

In a classification problem, we usually evaluate with precision and recall. Similarly, for recommender systems we use a metric that combines the two — Mean Average Precision (MAP), specifically MAP@k, where k recommendations are provided to each user.

Let's unpack MAP. The M is just the mean of the APs (average precisions) of all users. In other words, we take the mean of the average precision values, hence Mean Average Precision. If we have 1,000 users, we sum the AP of each user and divide by 1,000. That is the MAP.

So what is average precision? Before that, let's understand recall (R) and precision (P).

Precision = (number of recommended items that are relevant) / (number of recommended items)

Recall = (number of recommended items that are relevant) / (total number of relevant items)

There is usually an inverse relationship between precision and recall. Precision asks: of the recommendations we provided, how many are relevant? Recall asks: of all the relevant items, how many did we manage to recommend?

Now let's look at recall@k and precision@k. Assume we provide 5 recommendations in this order — 1 0 1 0 1, where 1 marks a relevant item and 0 an irrelevant one, and that there are 3 relevant items in total. Then precision@3 is 2/3, precision@4 is 2/4, and precision@5 is 3/5, while recall@3 is 2/3, recall@4 is 2/3, and recall@5 is 3/3.
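As a quick sanity check, here is a minimal Python sketch of precision@k and recall@k for this example (the function names are mine; the relevance list and the total of 3 relevant items are taken from the example above):

```python
def precision_at_k(recs, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(recs[:k]) / k


def recall_at_k(recs, k, total_relevant):
    """Fraction of all relevant items that appear in the top-k recommendations."""
    return sum(recs[:k]) / total_relevant


# The example above: 1 = relevant, 0 = irrelevant, 3 relevant items in total
recs = [1, 0, 1, 0, 1]
for k in (3, 4, 5):
    print(f"precision@{k} = {precision_at_k(recs, k):.2f}, "
          f"recall@{k} = {recall_at_k(recs, k, total_relevant=3):.2f}")
# precision@3 = 0.67, recall@3 = 0.67
# precision@4 = 0.50, recall@4 = 0.67
# precision@5 = 0.60, recall@5 = 1.00
```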

We don't strictly need to work through the math of average precision (AP) to use it, but we do need to know the following:

  • we can recommend at most k items for each user
  • it is better to submit all k recommendations because we are not penalized for bad guesses
  • order matters, so it’s better to submit more certain recommendations first, followed by recommendations we are less sure about

So basically we select the k best recommendations, in order, and that's it.

Here's another way to understand average precision. Think of it this way: you type something into Google and it shows you 10 results. It's best if all of them are relevant. But if only some are relevant, say five of them, then it's much better if the relevant ones are shown first. It would be bad if the first five were irrelevant and the good ones only started from the sixth, wouldn't it? The AP score reflects this. They should have named it "order-matters recall" instead of average precision.

If you do want the math, though, let's dive in. If we are asked to recommend N items and the number of relevant items in the full space of items is m, then:

AP@N = (1/m) × Σ (k = 1 to N) [ P(k) × rel(k) ]

where P(k) is the precision at the kth recommendation, and rel(k) is an indicator that equals 1 if the kth recommendation was relevant and 0 if it was not.

Suppose there are 5 relevant items (m = 5) and we make 10 recommendations (N = 10) as follows — 1 0 1 0 0 1 0 0 1 1. Let's calculate the average precision here.
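Working this out with the formula above (the relevant items land at positions 1, 3, 6, 9, and 10, and m = 5):

AP@10 = (1/5) × [P(1) + P(3) + P(6) + P(9) + P(10)]
      = (1/5) × [1/1 + 2/3 + 3/6 + 4/9 + 5/10]
      ≈ (1/5) × 3.11
      ≈ 0.62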

Compare this calculation with the formula above and the intuition should click. Now, to see that MAP really does care about order, as mentioned earlier, let's score another set of recommendations — 1 1 1 0 1 1 0 0 0 0 (the same relevant items, but recommended first).
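Here the relevant items sit at positions 1, 2, 3, 5, and 6, so:

AP@10 = (1/5) × [P(1) + P(2) + P(3) + P(5) + P(6)]
      = (1/5) × [1/1 + 2/2 + 3/3 + 4/5 + 5/6]
      ≈ (1/5) × 4.63
      ≈ 0.93

The same five relevant items, recommended earlier in the list, score roughly 0.93 instead of 0.62.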

MAP rewards front-loaded relevant recommendations.

To summarize, MAP is the mean of the Average Precision (AP) over all users of a recommendation system. AP takes a ranked list of k recommendations and compares it against the list of items that are relevant to the user. It rewards you for including many relevant recommendations in the list, and for putting the most relevant ones at the top.
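For reference, here is a minimal Python sketch of AP and MAP following the formula in this post (the function names and inputs are my own, not from any particular library; each user is represented by a 0/1 relevance list over their ranked recommendations plus their total number of relevant items m):

```python
def average_precision(rel_flags, m):
    """AP for one user: (1/m) * sum over positions k of P(k) * rel(k),
    where P(k) is precision at position k and rel(k) is 1 if the k-th
    recommendation is relevant, 0 otherwise."""
    hits = 0
    score = 0.0
    for position, rel in enumerate(rel_flags, start=1):
        if rel:
            hits += 1
            score += hits / position  # P(k), counted only at relevant positions
    return score / m if m else 0.0


def mean_average_precision(users):
    """MAP: the mean of the per-user APs. `users` is a list of
    (rel_flags, m) pairs, one per user."""
    return sum(average_precision(flags, m) for flags, m in users) / len(users)


# The two examples from this post, each with m = 5 relevant items:
print(round(average_precision([1, 0, 1, 0, 0, 1, 0, 0, 1, 1], 5), 2))  # 0.62
print(round(average_precision([1, 1, 1, 0, 1, 1, 0, 0, 0, 0], 5), 2))  # 0.93
print(round(mean_average_precision([([1, 0, 1, 0, 0, 1, 0, 0, 1, 1], 5),
                                    ([1, 1, 1, 0, 1, 1, 0, 0, 0, 0], 5)]), 2))  # 0.77
```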

