Using the Elo System to rate the Premier League teams - 14/15 season
Over the past two weeks or so we looked at alternative methods of ranking the Premier League according to results from the 2014-2015 season, using a Markov Chain approach and an iterative approach to separately rank the offensive and defensive capabilities of a team. Today I'll look at a third technique: the Elo rating system, which is commonly used to rank chess players.
I'm not going to go through all of the technical details as they are well covered elsewhere, but basically the idea behind the Elo system is that each team/player's ability is a normally distributed variable, the mean of which changes slowly as time goes on. When two teams compete, the difference in their ratings should help predict the outcome of the game, and both teams' ratings are updated based on the outcome of every game played. If a lower rated team beats a highly rated team then they'll receive a significant ratings boost, whereas if a highly rated team beats a weaker team then their rating will be largely unaffected (given that they are expected to win these matches).
What this means is that the rating should converge on a 'true' measure of the strength of the team: if their rating is too low then they should perform better than the system would predict and gain points until the rating settles at a more accurate measure of where they lie with respect to competitors.
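To make the convergence argument a bit more concrete, here is a minimal sketch. It borrows the logistic expected score and K = 30 that are introduced later in this post, and it makes two simplifying assumptions of my own that are not part of the real calculation: the opponent's rating is pinned at 0, and the team earns the same score share (0.6) in every game. The names `expected_score` and `simulate_convergence` are hypothetical helpers for illustration only.

```python
import math

K = 30          # same K-factor used later in the post
SCALE = 1000    # same rating scale used later in the post


def expected_score(rating_diff):
    """Logistic expected score for a given rating difference."""
    return 1 / (1 + math.pow(10, -rating_diff / SCALE))


def simulate_convergence(actual_score, n_games, start=0.0):
    """Repeatedly update one team's rating against an opponent fixed at 0.

    Assumption for illustration: the team earns the same score share every game.
    """
    rating = start
    for _ in range(n_games):
        rating += K * (actual_score - expected_score(rating))
    return rating


# The rating stops moving once the expected score equals the actual score,
# i.e. at a rating difference of SCALE * log10(s / (1 - s)).
settled = SCALE * math.log10(0.6 / 0.4)
final = simulate_convergence(0.6, 2000)
print(round(final, 2), round(settled, 2))
```

Under these assumptions an underrated team keeps gaining points, with each gain smaller than the last, until the system's prediction matches what the team actually delivers.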
Okay - that is probably enough detail to begin so let's just jump right in. To begin with I set the strength of every team in the PL to a value of 0. I could have started with some initial ratings, perhaps taken from the league table of the 13/14 season, but to keep everything fair I'll begin with everyone equal at the start of the season:
import numpy as np
import pandas as pd
import csv
import random
import math
scores = np.zeros(20,)
teams = ['Aston Villa','Arsenal','Burnley','Chelsea','Crystal Palace','Everton',
'Hull City','Leicester City','Liverpool','Manchester City','Manchester United',
'Newcastle','Southampton','Stoke City','Sunderland','Swansea','Tottenham',
'West Ham', 'West Brom','Queens Park Rangers']
df = pd.DataFrame(scores, index=teams, columns=['Rating'])
df is a dataframe that will hold our ratings. Every team starts at 0, so it looks like:
df
Okay - so this part is where things get more technical but I'll try to explain what is going on.
The data is stored in a tab delimited file (data.txt), with the fields in the order 'date, home_team, away_team, home_goals, away_goals', e.g.:
May 24th 2015 Crystal Palace Swansea 1 0
May 24th 2015 Arsenal West Brom 4 1
May 24th 2015 Aston Villa Burnley 0 1
and so on, and this stores the games in the order they were played throughout the season.
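As a quick check of the format, the sketch below parses the three sample rows shown above. It assumes the fields really are separated by single tab characters, and it reads from an in-memory string (via `io.StringIO`) rather than from data.txt so that it runs on its own.

```python
import csv
import io

# The three sample rows from above, with fields separated by tabs
sample = (
    "May 24th 2015\tCrystal Palace\tSwansea\t1\t0\n"
    "May 24th 2015\tArsenal\tWest Brom\t4\t1\n"
    "May 24th 2015\tAston Villa\tBurnley\t0\t1\n"
)

games = []
reader = csv.reader(io.StringIO(sample), delimiter='\t')
for date, home_team, away_team, home_goals, away_goals in reader:
    # csv.reader yields strings, so the goal counts need converting
    games.append((date, home_team, away_team, int(home_goals), int(away_goals)))

for game in games:
    print(game)
```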
We can read this file and for each game we perform our Elo calculation and update the rating of each team.
The equations used to do this look like (feel free to skip the rest of this box if you don't care about how the ratings are implemented):
ri(new) = ri(old) + K(Sij - muij), and rj(new) = rj(old) + K(Sji - muji)
where i and j refer to the teams in the match, and ri(old) corresponds to the old rating for the team represented by i, for example.
Sij is a way of incorporating the scores into the equation and is given by:
Sij = (Gh + 1) / (Gh + Ga + 2)
where Gh is the goals for the home team (a = away, unsurprisingly). Sij has the nice property that it ranges between 0 and 1, and it can also be interpreted as the probability that team i beats team j.
The K-factor is used to balance the deviation between actual and expected scores against prior ratings - pick a K that is too large and you end up with volatile ratings, pick a K that is too small and the ratings are slow to pick up on a team whose performances are getting better or worse. I went with 30 but other choices could be reasonable too.
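To get a feel for Sij, here is a small sketch that evaluates the formula for a few scorelines. The helper name `score_share` is my own. Note that the +1 and +2 terms mean a goalless draw gives exactly 0.5, and even a win never quite reaches 1.

```python
def score_share(goals_for, goals_against):
    """Smoothed share of the goals: (G_for + 1) / (G_for + G_against + 2)."""
    return (goals_for + 1) / (goals_for + goals_against + 2)


for gf, ga in [(1, 0), (4, 1), (0, 0), (0, 1)]:
    s_ij = score_share(gf, ga)   # from the home team's point of view
    s_ji = score_share(ga, gf)   # from the away team's point of view
    print(f"{gf}-{ga}: S_ij = {s_ij:.3f}, S_ji = {s_ji:.3f}, sum = {s_ij + s_ji:.1f}")
```

The two values for a game always add up to 1, which is why the implementation can take the away side's value as one minus the home side's.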
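A rough way to see what K does is to look at how far a single result moves a rating for different values of K. The sketch below is only an illustration: `rating_change` is a hypothetical helper, the values 10 and 60 are arbitrary comparison points, and the expected score uses the logistic function with a scale of 1000 described in this post.

```python
import math


def rating_change(k, score_share, rating_diff, scale=1000):
    """Rating change K * (S - mu) for one team in one game."""
    expected = 1 / (1 + math.pow(10, -rating_diff / scale))
    return k * (score_share - expected)


# A 4-1 home win between two equally rated teams: S = 5/7, mu = 0.5
s_home = (4 + 1) / (4 + 1 + 2)
for k in (10, 30, 60):
    print(k, round(rating_change(k, s_home, rating_diff=0), 2))
```

The change scales linearly with K, so doubling K doubles how hard every single result pulls on the ratings.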
muij is a logistic function of the difference in ratings, and is given by:
muij = 1 / (1 + 10^(-(ri(old) - rj(old))/1000))
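Here is a short sketch of how the expected score behaves as the rating gap grows. The function name `expected_score` and the sample rating differences are my own choices for illustration.

```python
import math


def expected_score(r_i, r_j, scale=1000):
    """Logistic expected score of team i against team j."""
    return 1 / (1 + math.pow(10, -(r_i - r_j) / scale))


for diff in (0, 100, 500, 1000):
    mu_ij = expected_score(diff, 0)   # the higher rated team
    mu_ji = expected_score(0, diff)   # the lower rated team
    print(f"diff = {diff:4d}: mu_ij = {mu_ij:.3f}, mu_ji = {mu_ji:.3f}")
```

Two equally rated teams each get 0.5, and a team rated 1000 points above its opponent gets 10/11. As with Sij, the two expected scores for a game always sum to 1.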
Okay, enough detail. The implementation can be seen below:
K = 30
with open('data.txt', 'r') as f:
    reader = csv.reader(f, delimiter='\t')
    for date, home_team, away_team, home_goals, away_goals in reader:
        # Actual score share for each side (S_ij and S_ji)
        s_h = (float(home_goals) + 1) / (float(home_goals) + float(away_goals) + 2)
        s_a = 1 - s_h
        # Expected score for each side, from the ratings before this game
        diff_ij = df.loc[home_team, 'Rating'] - df.loc[away_team, 'Rating']
        mu_h = 1 / (1 + math.pow(10, -diff_ij / 1000))
        diff_ji = df.loc[away_team, 'Rating'] - df.loc[home_team, 'Rating']
        mu_a = 1 / (1 + math.pow(10, -diff_ji / 1000))
        # Use a single .loc[row, column] assignment so the update is written to df
        # itself (chained indexing like df.loc[row]['Rating'] = ... may update a copy)
        df.loc[home_team, 'Rating'] = df.loc[home_team, 'Rating'] + (K * s_h) - (K * mu_h)
        df.loc[away_team, 'Rating'] = df.loc[away_team, 'Rating'] + (K * s_a) - (K * mu_a)
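For anyone who wants to try the update without the data file, here is a hedged, self-contained sketch that applies the same calculation to the three sample games listed earlier. It stores ratings in a plain dict instead of the dataframe, and `update_ratings` is a hypothetical helper, not part of the original code. It also takes the away side's expected score as one minus the home side's, which is algebraically the same as evaluating the logistic function a second time.

```python
import math

K = 30
SCALE = 1000


def update_ratings(ratings, home_team, away_team, home_goals, away_goals):
    """Apply one game's Elo update in place; `ratings` maps team name to rating."""
    s_h = (home_goals + 1) / (home_goals + away_goals + 2)
    s_a = 1 - s_h
    diff = ratings[home_team] - ratings[away_team]
    mu_h = 1 / (1 + math.pow(10, -diff / SCALE))
    mu_a = 1 - mu_h  # equivalent to the logistic of the negated difference
    ratings[home_team] += K * (s_h - mu_h)
    ratings[away_team] += K * (s_a - mu_a)


# The three sample games listed earlier, every team starting at 0
games = [
    ("Crystal Palace", "Swansea", 1, 0),
    ("Arsenal", "West Brom", 4, 1),
    ("Aston Villa", "Burnley", 0, 1),
]
ratings = {team: 0.0 for game in games for team in game[:2]}
for home_team, away_team, home_goals, away_goals in games:
    update_ratings(ratings, home_team, away_team, home_goals, away_goals)

for team, rating in sorted(ratings.items(), key=lambda item: item[1], reverse=True):
    print(f"{team:15s} {rating:+.2f}")
print("total:", round(sum(ratings.values()), 10))
```

Whatever one team gains in a game the other loses, so the ratings across the league always sum to the starting total of 0.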
And that is it! So simple. All we have left to do is to sort our results by rating and take a look at the table (below). As you can see these results look pretty consistent with the actual final PL table from the season.
The top 4 are unchanged, although Tottenham drop from 5th in the PL to 8th in our rating and there are a few other teams moved around with respect to their PL standings. At the other end of the table Aston Villa end up relegated here and Hull survive the drop. I'm sure that'll be a huge relief to the club as they prepared for life in the Championship: "We didn't deserve to go down - look at our Elo rating!!"
result = df.sort_values('Rating', ascending=False)
result