We tend to think of possible outcomes—in sports but also in many aspects of life—as black, white or perfect grey. Either the expected result is obvious and the alternative is unthinkable, or it’s just too close to call and it could go either way. We’re not going to help you solve all your life decision-making problems (maybe next time), but we want to help you have a more nuanced approach when trying to predict the result of a football game.

In the last couple of months we’ve seen France and Germany lose against Albania and Northern Ireland respectively. If we were to heed the calls of the “you can never tell” crowd, we would just throw our hands in the air, give up hope of ever making an accurate prediction again, and probably crawl under a rock.

The truth is that, though such events are rare, their likelihood can be estimated rather accurately. One way to do so is to devise a coherent rating system which can then be used to estimate the odds of any outcome. Today we introduce our version of such a rating system, built on the template of Elo ratings and optimized for international football.

In the next couple of weeks we will delve deeper into the inner workings of our rating system and use it to predict goal differences, study the importance of friendlies and more. We will also provide updated ratings for all national teams on a regular basis, present historical ratings since 1872 as well as tools to obtain the odds for any potential match-ups between national teams. But for now let’s focus on the basics.

Elo Ratings, what are they good for?

Let’s start by introducing very briefly the rating system we’ll be using to make our predictions. We’ll go into a lot more detail in upcoming articles; meanwhile you can also take a look at our previous article on the Women’s Football FIFA index and how we applied their rating system to predict the World Cup pretty successfully. For now, let’s talk about two key ingredients of any Elo rating.

First, rating systems such as the one first devised by Elo attempt to be predictive. The definition of such a system includes a formula which takes the difference in ratings between two teams about to face off, plus a possible correction for home field advantage, and predicts the odds of either team winning or of the game ending in a draw. As an example, before Northern Ireland-Germany, our model gave Germany a 62% chance of winning and Northern Ireland 14%, with a 24% chance of a draw. Fourteen is not equal to zero, and Germany learned it the hard way.
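For readers who want to see the mechanics, here is a minimal sketch of the classic Elo expected-score formula. The 400-point scale, the function names and the way home advantage is added to the rating difference are assumptions for illustration; the tuned formula behind the ratings in this article is not given here and may differ. The second helper simply folds three-way odds into a single expected score by counting a draw as half a win.

```python
def expected_score(rating_a: float, rating_b: float,
                   home_advantage: float = 0.0, scale: float = 400.0) -> float:
    """Classic Elo expected score for team A (win = 1, draw = 0.5, loss = 0).

    `scale` and `home_advantage` are illustrative assumptions, not the
    article's fitted parameters.
    """
    diff = rating_a + home_advantage - rating_b
    return 1.0 / (1.0 + 10.0 ** (-diff / scale))


def score_from_odds(p_win: float, p_draw: float) -> float:
    """Fold win/draw probabilities into one expected score (draw = half a win)."""
    return p_win + 0.5 * p_draw


# Two equally rated teams on neutral ground: each expects half a point.
print(expected_score(1500, 1500))

# Germany's quoted odds (62% win, 24% draw) expressed as an expected score.
print(round(score_from_odds(0.62, 0.24), 2))
```

Note that the formula only yields an expected score; splitting it into separate win, draw and loss probabilities requires an extra modelling step that this sketch does not attempt.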

Second, Elo ratings are self-correcting: after each game, we compare the actual outcome with the probabilities given beforehand and transfer points from one team to the other accordingly. The losing team sees its rating drop, while the winning team’s rating increases by the same amount. The end result is always a zero-sum game. However, the magnitude of the variation is dictated by how unlikely the outcome was. Germany and Northern Ireland for example saw a large variation in their respective ratings following the surprising result.
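The self-correcting step can be sketched as follows, again using the textbook Elo update rather than the exact rule behind our ratings. The K-factor of 20, the 400-point scale and the function name are hypothetical choices made for the example, and home advantage is left out for brevity.

```python
def update_ratings(rating_a: float, rating_b: float, score_a: float,
                   k: float = 20.0, scale: float = 400.0) -> tuple[float, float]:
    """Return the new ratings of teams A and B after one game.

    score_a is 1 for an A win, 0.5 for a draw, 0 for an A loss.
    k and scale are illustrative assumptions.
    """
    expected_a = 1.0 / (1.0 + 10.0 ** (-(rating_a - rating_b) / scale))
    delta = k * (score_a - expected_a)  # points moved: large when the result is a surprise
    return rating_a + delta, rating_b - delta


favorite, underdog = 1800.0, 1600.0

# Expected result: the favorite wins and only a few points change hands.
print(update_ratings(favorite, underdog, score_a=1.0))

# Upset: the favorite loses and the transfer is roughly three times larger.
print(update_ratings(favorite, underdog, score_a=0.0))
```

In both cases the sum of the two ratings is unchanged, which is the zero-sum property described above.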

Müller discovers that fourteen is not equal to zero

That is the theory; it now needs to be adapted to our context. We compiled the results of over 35,000 games starting with the very first recorded official game, Scotland versus England on November 30th 1872, up to games on October 1st of this year. We computed historical ratings for every national team recognized by FIFA, optimizing our model by comparing predictions and actual results for the years 1945 to 2009. The more than 5,000 games that have been played since 2010 were then used to verify the accuracy of our model. The average rating for all teams was set (arbitrarily) to 1,500, with most ratings hovering between 1,000 and 2,000 throughout the years.

The ranking and the outrage

Let’s not delay the inevitable any longer. Here is our top 20 as of October 1st 2015:

Yes, Brazil is seated at the top of our ranking and, yes, Argentina and Colombia come second and third, ahead of Germany. Unacceptable? Maybe, or maybe not. If you have been following events since the last World Cup, at the end of which Germany was obviously first in our ranking, you know that Brazil has been playing extremely well against teams outside of its confederation. Recently, they have won away 3–1 against France and 4–1 against the US, and then won at home 2–0 against Mexico and 1–0 against Honduras. They have also been trading punches with other South American teams, liberally giving away some of their hard-earned points to Colombia and Chile. These points have then diffused throughout South America, helping make it the strongest confederation by far at this point.

At the same time, Germany lost against Northern Ireland. France lost against Albania. And the Netherlands didn’t even qualify for Euro 2016. The Dutch are now ranked 26th, behind Romania, Costa Rica… and Iran. This might seem a bit brutal, but one needs to understand some specific aspects of this rating.

Going back to the idea of being self-correcting, Elo ratings are reactive in nature. Today’s performance is a good indicator of tomorrow’s result. If a team underperforms or overperforms, this is instantly reflected in its rating and affects its odds for the next games. Maybe it ends up being a fluke, a one-time event, but, on average, it should be considered a sign of an evolution in the team’s strength, at least in the short term.

The short term, the next game, is really what every Elo rating is about. It does not necessarily reflect historical strengths. After the last World Cup, the Netherlands were rightly considered one of the best teams in the world. Their constant competitiveness throughout the years is a sign of strategic, historical and systemic advantages—plus Robben’s inside cut minus Van Persie’s own goal magic. These edges can’t be wiped out after a couple of bad outings, and we should expect to see them back to their true level in the coming years. For now though—for reasons we could analyze more deeply but which probably boil down to the ebbs and flows of a team under construction—they’re just not that good.

We want proof!

Alright, we already see the torches and pitchforks on the horizon, so we’ll try to give you some indications as to why our rating works well. We’ll start with a small sleight of hand: for what comes next, a draw will count as half a win, so that wins and half-weighted draws add up to a single winning percentage. The story of how to predict the likelihood of a draw is extremely interesting but we’ll have to defer it to another time.

With this small caveat, here is a plot that compares winning probabilities with actual outcomes for all the games played between January 2010 and September 2015:

More precisely, here is what we did: we took the 5,200 games we had in our database between January 1st 2010 and September 6th 2015 and ordered them by the difference in ratings between the two opponents. We then grouped them in batches of 200 games following this order, so that the games in a batch have a similar difference in ratings. The (average) rating difference for each batch gives us an (average) win probability for all the games in it, computed using our model. Each of the 26 points on the graph then compares that probability with the actual winning percentage for a specific batch of 200 games.
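The batching procedure just described can be sketched in a few lines. This is an illustration under stated assumptions, not the code that produced the plot: the input format (a list of rating-difference and score pairs), the function name and the classic 400-point Elo curve used for the predicted probability are all hypothetical stand-ins for our own data and model.

```python
from statistics import mean


def calibration_points(games, batch_size=200, scale=400.0):
    """Group games by rating difference and compare prediction with outcome.

    games: iterable of (rating_diff, score) pairs, where score is 1 for a
    win, 0.5 for a draw and 0 for a loss, seen from the same side as
    rating_diff. Returns one (avg_diff, predicted, actual) tuple per batch.
    """
    ordered = sorted(games, key=lambda game: game[0])
    points = []
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        avg_diff = mean(diff for diff, _ in batch)
        # Assumed prediction curve: textbook Elo, not the article's tuned model.
        predicted = 1.0 / (1.0 + 10.0 ** (-avg_diff / scale))
        actual = mean(score for _, score in batch)  # draws count as half a win
        points.append((avg_diff, predicted, actual))
    return points


# Tiny made-up input, only to show the shape of the result.
toy_games = [(-100, 0), (-100, 0.5), (100, 1), (100, 0.5)]
for avg_diff, predicted, actual in calibration_points(toy_games, batch_size=2):
    print(avg_diff, round(predicted, 3), actual)
```

With 5,200 games and batches of 200, the function returns 26 tuples, one per point on the graph.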

The result is satisfying, especially since it seems to be working for all possible rating differences. Squarely in the middle, we have a point consisting of 200 games whose average rating difference is only about –4, which corresponds to games that were as close to being even as possible. Our model predicts that the win probability for such games should be 49.3%. The actual winning percentage: 50.75%. That’s an error of about 3 wins over 200 games. Not too shabby. At the other end of the spectrum, the rightmost point in our graph corresponds to games whose rating difference is +396 on average, among the most lopsided match-ups in the past five years. Our model predicts that the favorite in these games should win 92.2% of the time. The actual winning percentage turns out to be 92.75%, which is about one more win than expected. We don’t know about you, but we feel pretty good about these results.
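The "about 3 wins" and "about one more win" figures follow directly from the percentages quoted above. A small check, with a helper name chosen only for this example:

```python
def win_gap(predicted_pct: float, actual_pct: float, games: int = 200) -> float:
    """Difference between actual and predicted wins over a batch of games."""
    return (actual_pct - predicted_pct) / 100.0 * games


print(round(win_gap(49.3, 50.75), 1))   # 2.9 wins for the even match-ups
print(round(win_gap(92.2, 92.75), 1))   # 1.1 wins for the most lopsided ones
```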

A game of averages

A good predictive model doesn’t work in black and white. It must be able to assess the probability of all events, no matter how unlikely they might be. The fact that we “get it wrong” about fourteen percent of the time for games as seemingly one-sided as Northern Ireland-Germany or Albania-France is a feature of the model: it means the model works as intended. Similarly, saying that a match-up between two teams has a 50% chance of going either way is not a cop-out. It just means that on average such games split evenly, and we would be foolish to pick a favorite in such a case. Sometimes grey truly is grey and all that is left is to enjoy the show.
