A Markov Chain ranking of Premier League teams (14/15 season)
In this simple Markov Chain model I use the number of goals scored in each game of the 14-15 season to rank the Premier League teams.
A Markov Chain for this approach can be thought of as a graph with a node corresponding to each of the teams. The probability of transitioning from the node of team X to the node of team Y is related to the number of goals that Y scored against X. This can be thought of as each team 'voting' for how strong they consider each of the other teams in the league to be, where their votes are decided by the number of goals that they conceded. A great visual introduction to Markov Chains can be found here
I start by importing a few things we're going to use (including numpy and pandas), and setting up our empty dataframe to contain the information about goals scored and conceded:
import numpy as np
import pandas as pd
import csv
import random
import operator
scores = np.zeros((20,20,))
teams = ['Aston Villa','Arsenal','Burnley','Chelsea','Crystal Palace','Everton',
'Hull City','Leicester City','Liverpool','Manchester City','Manchester United',
'Newcastle','Southampton','Stoke City','Sunderland','Swansea','Tottenham',
'West Ham', 'West Brom','Queens Park Rangers']
df = pd.DataFrame(scores, index=teams, columns=teams)
'df' holds a 20x20 grid, with each row and column labelled with a team name. To see what this means, let's look at the upper left corner:
print(df.loc['Aston Villa':'Chelsea','Aston Villa':'Chelsea'])
We are going to fill this grid with the number of goals the row team concedes to the column team. So, once filled, the first row of the table will contain the number of goals that Aston Villa conceded when playing every other team in the league. This will be the combined total of the home and away games.
The data is stored in a tab-delimited file (data.txt), in the format 'date, home_team, away_team, home_goals, away_goals', i.e.:
May 24th 2015 Crystal Palace Swansea 1 0
May 24th 2015 Arsenal West Brom 4 1
May 24th 2015 Aston Villa Burnley 0 1
and so on...
We can read this file and populate our data frame with the results of all 380 games:
with open('data.txt','r') as f:
    reader = csv.reader(f, delimiter='\t')
    for date, home_team, away_team, home_goals, away_goals in reader:
        # Row = team that conceded, column = team that scored.
        df.loc[away_team, home_team] += int(home_goals)
        df.loc[home_team, away_team] += int(away_goals)
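As a cross-check on the loop above, the same grid can be built with pandas alone. This is only a sketch: it assumes Python 3, uses the three sample rows shown earlier instead of the full data.txt, and the names `games`, `long_form`, `grid`, `conceder` and `scorer` are made up for this illustration.

```python
import io
import pandas as pd

# Three sample rows in the same tab-delimited layout as data.txt.
sample = (
    "May 24th 2015\tCrystal Palace\tSwansea\t1\t0\n"
    "May 24th 2015\tArsenal\tWest Brom\t4\t1\n"
    "May 24th 2015\tAston Villa\tBurnley\t0\t1\n"
)
cols = ['date', 'home_team', 'away_team', 'home_goals', 'away_goals']
games = pd.read_csv(io.StringIO(sample), sep='\t', header=None, names=cols)

# One record per (conceding team, scoring team, goals) for each side of a game.
home_conceded = games.rename(columns={
    'home_team': 'conceder', 'away_team': 'scorer', 'away_goals': 'goals'})
away_conceded = games.rename(columns={
    'away_team': 'conceder', 'home_team': 'scorer', 'home_goals': 'goals'})
long_form = pd.concat([
    home_conceded[['conceder', 'scorer', 'goals']],
    away_conceded[['conceder', 'scorer', 'goals']],
], ignore_index=True)

# Rows are the conceding team, columns the scoring team.
grid = long_form.pivot_table(index='conceder', columns='scorer',
                             values='goals', aggfunc='sum', fill_value=0)
print(grid)
```

A useful sanity check is that the grid total equals the total number of goals in the file. On the full data set, `grid.reindex(index=teams, columns=teams, fill_value=0)` should put the rows and columns in the same order as `df`.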
Let us look now at what we have in df:
print(df.loc['Aston Villa':'Chelsea','Aston Villa':'Chelsea'])
It is useful to check that this is correct at this stage: Aston Villa lost 3-0 to Arsenal at home on September 20th and 5-0 away on February 1st, and our grid shows that they conceded 8 in total. Good.
OK, the next thing we need to do is to normalise these numbers by the total number conceded per row. Each normalised row then gives the transition probabilities out of that team's node in our Markov model:
transition_matrix = df.div(df.sum(axis=1), axis=0)
print(transition_matrix.loc['Aston Villa':'Chelsea','Aston Villa':'Chelsea'])
What this transition_matrix represents is the probability of transitioning from each state to the next. For example - if the model is looking at Aston Villa, it has a probability of around 14% of jumping to Arsenal, 3.5% to Burnley, 8.8% to Chelsea and so on.
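Before simulating, it can be worth confirming that the normalised grid really is a valid set of transition probabilities: no negative entries, no missing values, and every row summing to 1. The helper below is a hypothetical sketch (`check_transition_matrix` is not part of the original notebook), shown on a made-up three-team grid.

```python
import numpy as np
import pandas as pd

def check_transition_matrix(tm, tol=1e-9):
    """Return True if tm looks like a valid row-stochastic matrix."""
    values = tm.to_numpy(dtype=float)
    no_nans = not np.isnan(values).any()
    non_negative = bool((values[~np.isnan(values)] >= 0).all())
    rows_sum_to_one = bool(np.allclose(values.sum(axis=1), 1.0, atol=tol))
    return no_nans and non_negative and rows_sum_to_one

# Made-up goals-conceded grid for three hypothetical teams.
names = ['X', 'Y', 'Z']
toy = pd.DataFrame([[0, 3, 1], [1, 0, 1], [2, 2, 0]], index=names, columns=names)
toy_tm = toy.div(toy.sum(axis=1), axis=0)
print(check_transition_matrix(toy_tm))

# A team that conceded no goals at all gives a 0/0 row, which the check rejects.
clean_sheet = pd.DataFrame([[0, 0, 0], [1, 0, 1], [2, 2, 0]], index=names, columns=names)
bad_tm = clean_sheet.div(clean_sheet.sum(axis=1), axis=0)
print(check_transition_matrix(bad_tm))
```

On the real data the equivalent call would be `check_transition_matrix(transition_matrix)`. Over a full season every team concedes at least one goal, so the zero-row case should not arise there.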
I use the cumulative sum of the probabilities by row in order to separate the probabilities into 'bins' between 0 and 1. In this way we can use a single random number to see where the model jumps to next:
tsm = transition_matrix.cumsum(axis=1)
print(tsm.loc['Aston Villa':'Chelsea','Aston Villa':'Chelsea'])
Now we can draw a random number over and over and watch our model take a random walk of 100000 steps through the Markov graph. I keep count of how many steps it spends at each node, and this gives us our ranking for the performance of the teams:
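To make the 'bins' idea concrete, here is a small sketch of how a single random number picks the next state from one cumulative row. It uses `np.searchsorted` rather than an explicit loop; the function name `next_state` and the three-state probabilities are assumptions made up for this example.

```python
import numpy as np

def next_state(cumulative_row, u):
    """Index of the first bin whose upper edge is strictly greater than u."""
    idx = int(np.searchsorted(cumulative_row, u, side='right'))
    # Guard against a final cumulative value that rounds to just below 1.
    return min(idx, len(cumulative_row) - 1)

# Probabilities 0, 0.75 and 0.25 give bin edges 0, 0.75 and 1.0.
cumulative = np.cumsum([0.0, 0.75, 0.25])

for u in (0.0, 0.1, 0.75, 0.999):
    print(u, next_state(cumulative, u))
```

A state with probability zero has an empty bin, so it is never selected. Note one small difference from a plain "first value greater than the random number" loop: if rounding leaves the last edge slightly below 1, this sketch falls back to the last state instead of leaving the state unchanged.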
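random.seed(555)
# Start the walk at a randomly chosen team.
current_team = teams[random.randint(0, 19)]
# Visit counter for every team, initialised to zero.
count = dict(zip(teams, np.zeros(20)))
for j in range(100000):
    count[current_team] += 1
    ran = random.random()
    # Jump to the first team whose cumulative probability exceeds ran.
    for i, val in enumerate(tsm.loc[current_team].values):
        if ran < val:
            current_team = teams[i]
            break
# Rank teams by the number of visits, most visited first.
sorted_x = sorted(count.items(), key=operator.itemgetter(1), reverse=True)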
We now have our results, and can print the fraction of time that the model spent at each team's node:
for key, value in sorted_x:
    print('%s %s' % (key, value / 100000))
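The fractions printed above are a sampled estimate of the chain's stationary distribution. As an alternative to the random walk (not the method used in this post), the same quantity can be approximated deterministically by repeatedly multiplying a probability vector by the transition matrix. The sketch below assumes the chain is irreducible and aperiodic; `stationary_distribution` and the three-state matrix are hypothetical names and values for illustration.

```python
import numpy as np

def stationary_distribution(P, n_iter=10000, tol=1e-12):
    """Approximate pi with pi = pi.P by power iteration from a uniform start."""
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        new_pi = pi.dot(P)
        # Stop once an extra step no longer changes the vector.
        if np.abs(new_pi - pi).max() < tol:
            return new_pi
        pi = new_pi
    return pi

# Made-up three-state transition matrix; each row sums to 1.
P_toy = np.array([
    [0.0, 0.75, 0.25],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
])
pi_toy = stationary_distribution(P_toy)
print(pi_toy)
```

With the real data, `stationary_distribution(transition_matrix.values)` should give values close to the simulated fractions, in the order of `teams`, without the sampling noise of a finite walk. Either way, sorting the teams by these values gives the Markov Chain ranking.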
We can contrast this with the actual final table of the Premier League and see how different things look:
1 Chelsea
2 Manchester City
3 Arsenal
4 Manchester United
5 Tottenham
6 Liverpool
7 Southampton
8 Swansea
9 Stoke City
10 Crystal Palace
11 Everton
12 West Ham
13 West Brom
14 Leicester City
15 Newcastle
16 Sunderland
17 Aston Villa
18 Hull City
19 Burnley
20 Queens Park Rangers
So, in our Markov Chain league based only on the goals of the 2014-2015 season, Manchester City win, although it is reasonably close between the top two teams. Aston Villa and Sunderland end up being relegated, rather than Hull and QPR, with Hull only narrowly avoiding the drop. QPR finish a much improved 14th, perhaps due to a reasonable number of goals scored against top-six teams.