Saturday, July 11, 2015

Offensive - defensive ratings for the Premier League teams, 2014-2015


Offensive and defensive team ratings for the Premier League 2014-2015

Following the previous post, in which I used a simple Markov chain model to rank the teams in the Premier League based on the results of the games in the 2014-2015 season, today I'll use the same data to rank the teams on both offence and defence.

This approach follows the 'Offence-Defence' rating method described in Who's #1 by Langville and Meyer.

This method begins like the Markov chain did - I start by importing a few things we're going to use and setting up an empty DataFrame to hold the information about goals scored and conceded:

In [1]:
import numpy as np
import pandas as pd
import csv
import random
import operator

scores = np.zeros((20,20,))
teams = ['Aston Villa','Arsenal','Burnley','Chelsea','Crystal Palace','Everton',
         'Hull City','Leicester City','Liverpool','Manchester City','Manchester United',
         'Newcastle','Southampton','Stoke City','Sunderland','Swansea','Tottenham',
         'West Ham', 'West Brom','Queens Park Rangers']

df = pd.DataFrame(scores, index=teams, columns=teams)

'df' holds a 20x20 grid, with each row and column labelled with a team name. To explain what this means, let's look at the upper left corner quickly:

In [2]:
print df.loc['Aston Villa':'Chelsea','Aston Villa':'Chelsea']
             Aston Villa  Arsenal  Burnley  Chelsea
Aston Villa            0        0        0        0
Arsenal                0        0        0        0
Burnley                0        0        0        0
Chelsea                0        0        0        0

Exactly like last time around, we fill this grid with the number of goals the row team concedes to the column team. So, once filled, the first row of the table will contain the number of goals that Aston Villa conceded when playing every other team in the league. This will be the combined total of the home and away games.

The data is stored in a tab-delimited file (data.txt), in the format 'date, home_team, away_team, home_goals, away_goals', i.e.:

May 24th 2015 Crystal Palace Swansea 1 0
May 24th 2015 Arsenal West Brom 4 1
May 24th 2015 Aston Villa Burnley 0 1

and so on...

We can read this file and populate our data frame with the results of all 380 games:

In [3]:
with open('data.txt','r') as f:
    reader=csv.reader(f,delimiter='\t')
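    for date,home_team,away_team,home_goals,away_goals in reader:
        # Row = conceding team, column = scoring team. A single .loc call
        # avoids chained indexing, which may not update df in newer pandas.
        df.loc[away_team, home_team] += int(home_goals)
        df.loc[home_team, away_team] += int(away_goals)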

Let us look now at what we have in df:

In [4]:
print df.loc['Aston Villa':'Chelsea','Aston Villa':'Chelsea']
             Aston Villa  Arsenal  Burnley  Chelsea
Aston Villa            0        8        2        5
Arsenal                0        0        0        2
Burnley                1        4        0        4
Chelsea                1        0        2        0

It is useful to check this is correct at this stage - Aston Villa lost 3-0 to Arsenal (at home) on September 20th and 5-0 away on February 1st - our grid shows that they conceded 8 in total - good.
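Another way to sanity-check a grid like this is to look at its totals: each row should sum to the goals that team conceded, and each column to the goals it scored. Here is a minimal sketch on a made-up three-team league (the teams 'A', 'B', 'C' and their results are hypothetical, not taken from data.txt):

```python
import csv
import io

import numpy as np
import pandas as pd

# Hypothetical three-game sample in the same tab-delimited layout as data.txt.
sample = (
    "Day 1\tA\tB\t2\t1\n"
    "Day 2\tB\tC\t0\t0\n"
    "Day 3\tC\tA\t3\t1\n"
)

toy_teams = ['A', 'B', 'C']
toy = pd.DataFrame(np.zeros((3, 3)), index=toy_teams, columns=toy_teams)

reader = csv.reader(io.StringIO(sample), delimiter='\t')
for date, home_team, away_team, home_goals, away_goals in reader:
    # Row = conceding team, column = scoring team.
    toy.loc[away_team, home_team] += int(home_goals)
    toy.loc[home_team, away_team] += int(away_goals)

goals_conceded = toy.sum(axis=1)  # row totals
goals_scored = toy.sum(axis=0)    # column totals
```

The same two sums on the real 'df' could be compared against the goals for and against columns of the final league table.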

Here is where this approach diverges from the Markov chain. The basic idea here is to construct two ratings for each team - one to quantify their offensive strength and the other for their defensive strength.

The offensive rating for team j is defined as:

o_j = a_1j/d_1 + a_2j/d_2 + ... + a_mj/d_m

where a_ij is the number of goals team j scored against team i (row i, column j of our grid), d_i is the defensive rating for team i and m is the total number of teams (in our case, 20).

Similarly, the defensive rating for team i is defined as:

d_i = a_i1/o_1 + a_i2/o_2 + ... + a_im/o_m

You can think about these ratings and play around with the numbers if you want to develop some intuition for how this works, but essentially the point is that a large offensive rating represents a strong offence, whereas a large defensive rating represents a weak defence. This means that racking up lots of goals against teams with a weak defence won't boost your attacking rating as much as scoring a lot against a team with a stronger defence. Equally, conceding to a strong attacking side hurts a team's defensive rating less than conceding the same number of goals to a blunt attack.
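To make that intuition concrete, here is a small worked example of a single term of each sum. All the numbers are invented for illustration and are not ratings from this season:

```python
import numpy as np

# Hypothetical: one team's goals against three opponents.
goals_scored = np.array([3.0, 3.0, 0.0])
# Hypothetical defensive ratings of those opponents (larger = leakier defence).
opponent_defence = np.array([0.5, 1.5, 1.0])

# Each opponent contributes goals / defensive rating to the offensive rating.
contributions = goals_scored / opponent_defence
offensive = contributions.sum()

# The same idea on the defensive side: conceding 2 goals to a strong attack
# (hypothetical rating 80) adds less than conceding 2 to a weak one (rating 40).
conceded_vs_strong = 2.0 / 80.0
conceded_vs_weak = 2.0 / 40.0
```

Three goals against the tight defence (rating 0.5) are worth 6.0 here, while the same three goals against the leaky one (rating 1.5) are worth only 2.0.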

We'll now attempt to calculate these values for each of the PL teams, using the scores from the 2014-2015 games.

To begin with, let's initialize the vectors that we'll use to hold the output:

In [5]:
offensive_rating = np.zeros(20)
defensive_rating = np.ones(20)

offensive_rating_prev = np.zeros(20)
defensive_rating_prev = np.zeros(20)

The next block of code is where we perform our calculations. Given how we have defined our offensive and defensive ratings (see the equations above), we know they are not independent, so it is not possible to calculate one without the other. The approach is therefore to start with some arbitrary values for one and iterate back and forth, each time feeding the previous result into the next round of calculation.

We begin by assigning '1' as every team's defensive rating and use this to calculate the offensive ratings. The new offensive ratings are used to recalculate the defensive ratings and so on. We stop when the calculations converge (no rating changes by more than a tolerance of 0.001 between rounds):

In [6]:
counter = 0

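while True:
    # Offensive pass: column 'team' holds the goals that team scored against
    # each opponent, each divided by that opponent's current defensive rating.
    for i,team in enumerate(teams):
        for index,score in enumerate(df[team]):
            if (index == 0):
                offensive_rating[i] = score/defensive_rating[index]
            else:
                offensive_rating[i] += score/defensive_rating[index]

    # Defensive pass: row 'team' holds the goals that team conceded to each
    # opponent, each divided by that opponent's new offensive rating.
    for i,team in enumerate(teams):
        for index,score in enumerate(df.loc[team]):
            if (index == 0):
                defensive_rating[i] = score/offensive_rating[index]
            else:
                defensive_rating[i] += score/offensive_rating[index]

    # Converged once no rating has moved by more than the tolerance.
    convergence = True

    for i in range (0,20):
        if abs(offensive_rating[i]-offensive_rating_prev[i]) > 0.001:
            convergence = False
        if abs(defensive_rating[i]-defensive_rating_prev[i]) > 0.001:
            convergence = False

    if convergence:
        break

    # Remember this round's ratings for the next comparison.
    offensive_rating_prev[:] = offensive_rating
    defensive_rating_prev[:] = defensive_rating
    counter+=1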

print "Convergence after ", counter, " iterations:"
Convergence after  4  iterations:
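For comparison, the same back-and-forth update can be written with matrix products instead of nested loops. This is only a sketch under a few assumptions: the function name offence_defence and its parameters are my own, A is the grid as a plain array (row = conceding team, column = scoring team), and every team has scored and conceded at least one goal so that no rating is zero.

```python
import numpy as np

def offence_defence(A, tol=0.001, max_iter=1000):
    """Iterate offensive/defensive ratings for a grid A, where A[i, j] is
    the number of goals team i conceded to team j (hypothetical helper)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    d = np.ones(n)          # start with every defensive rating at 1
    o_prev = np.zeros(n)
    d_prev = np.zeros(n)
    for iteration in range(max_iter):
        o = A.T.dot(1.0 / d)   # o_j = sum over i of a_ij / d_i
        d = A.dot(1.0 / o)     # d_i = sum over j of a_ij / o_j
        converged = (np.all(np.abs(o - o_prev) <= tol) and
                     np.all(np.abs(d - d_prev) <= tol))
        if converged:
            # 'iteration' counts the completed rounds before this one,
            # like 'counter' in the loop above.
            return o, d, iteration
        o_prev, d_prev = o, d
    raise RuntimeError("ratings did not converge within max_iter rounds")

# Toy check: three hypothetical teams who all score once against each other.
o, d, rounds = offence_defence([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
```

With the grid built earlier, the call would presumably be offence_defence(df.values). The max_iter guard is an addition of mine: on some grids (for example, tiny ones with many zero entries) the ratings can keep growing rather than settle.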

The next step is simply to manipulate our results to attach each result to the corresponding team and sort them based on the values.

In case anyone is interested I print the raw results, but a fully formatted table can be found below if you don't want to pore over the individual ratings.

In [7]:
offensive_list = []
defensive_list = []

for i,team in enumerate(teams):
    o_rating = (team, offensive_rating[i])
    offensive_list.append(o_rating)

    d_rating = (team, defensive_rating[i])
    defensive_list.append(d_rating)
    
offensive_list.sort(key=lambda x: x[1],reverse=True)
defensive_list.sort(key=lambda x: x[1])

for i in range (0,20):
    print offensive_list[i]

print ""

for i in range (0,20):
    print defensive_list[i]
('Manchester City', 82.091087030339878)
('Chelsea', 72.380560195215864)
('Arsenal', 68.369009157329501)
('Manchester United', 60.904492165093409)
('Tottenham', 58.976640911527809)
('Liverpool', 53.398207486422919)
('Southampton', 51.706070092055349)
('Stoke City', 48.628818782554838)
('Everton', 47.487728530938085)
('Crystal Palace', 47.094233000939916)
('Swansea', 46.853158666512357)
('Leicester City', 46.175596414755439)
('West Ham', 43.761226383030348)
('Queens Park Rangers', 41.905555496584974)
('Newcastle', 40.487143739032931)
('West Brom', 38.502623853058033)
('Hull City', 33.003865242767439)
('Sunderland', 31.811732993084007)
('Aston Villa', 31.337140506696937)
('Burnley', 27.760922296035552)

('Southampton', 0.66995683304056308)
('Chelsea', 0.71817845787502954)
('Arsenal', 0.73744729935070585)
('Manchester United', 0.79234033985536279)
('Manchester City', 0.87012778428827653)
('Swansea', 0.91836359596205652)
('Liverpool', 0.92446442252052718)
('West Ham', 0.9436212471431098)
('Stoke City', 0.95157851238307645)
('Hull City', 1.0337599874405823)
('Everton', 1.0578813597729115)
('West Brom', 1.0581547985668494)
('Sunderland', 1.0736186324951256)
('Tottenham', 1.0844423863874746)
('Leicester City', 1.0957431365749941)
('Burnley', 1.1047238318420249)
('Aston Villa', 1.1107422557592943)
('Crystal Palace', 1.1242000238088854)
('Newcastle', 1.2686191064971544)
('Queens Park Rangers', 1.4684429285267184)

The full offensive and defensive tables are shown below. Manchester City come out on top for offensive rating but are down in 5th in the defensive table, whilst PL champions Chelsea come in 2nd in both. Southampton top the defensive table although they finished the PL in 7th place. This analysis suggests their problem was in attack, and looking at the results it seems pretty clear that a lack of goals when playing away is where this came from.

At the other end of the table, these results suggest QPR and Burnley had the weakest defence and attack in the league, respectively (pretty consistent with expectations). There aren't any huge revelations here but it is interesting to attempt to separate the teams in this way.

In [8]:
template = "{0:20}||{1:10}" # minimum column widths: 20, 10
print template.format("Offensive table", "Defensive table")
print template.format("---------------", "---------------")

for i in range (0,20):
    print template.format(offensive_list[i][0], defensive_list[i][0])
Offensive table     ||Defensive table
---------------     ||---------------
Manchester City     ||Southampton
Chelsea             ||Chelsea   
Arsenal             ||Arsenal   
Manchester United   ||Manchester United
Tottenham           ||Manchester City
Liverpool           ||Swansea   
Southampton         ||Liverpool 
Stoke City          ||West Ham  
Everton             ||Stoke City
Crystal Palace      ||Hull City 
Swansea             ||Everton   
Leicester City      ||West Brom 
West Ham            ||Sunderland
Queens Park Rangers ||Tottenham 
Newcastle           ||Leicester City
West Brom           ||Burnley   
Hull City           ||Aston Villa
Sunderland          ||Crystal Palace
Aston Villa         ||Newcastle 
Burnley             ||Queens Park Rangers

In order to check the similarity of the actual final PL table and our offensive and defensive rankings (thanks for the idea, return_0_!) we can calculate Kendall's tau. This will range between -1 and 1, with a value of 1 meaning that the lists are identical. From the calculation below we see that the offensive ranking is slightly closer to the final PL rankings, at least in terms of this metric, but they are extremely close (0.368 vs 0.347 for offensive and defensive respectively).

In [18]:
import scipy.stats as stats

PL_rankings = ['Chelsea','Manchester City','Arsenal','Manchester Utd','Tottenham','Liverpool','Southampton','Swansea','Stoke','Crystal Palace','Everton','West Ham','West Brom','Leicester','Newcastle','Sunderland','Aston Villa','Hull City','Burnley','Queens Park Rangers']
PL_offensive = []
PL_defensive = []

for i in range (0,20):
    PL_offensive.append(offensive_list[i][0])
    PL_defensive.append(defensive_list[i][0])

tau, p_value = stats.kendalltau(PL_rankings, PL_offensive)
print tau
tau, p_value = stats.kendalltau(PL_rankings, PL_defensive)
print tau
0.368421052632
0.347368421053
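One caveat on this calculation: as far as I understand it, stats.kendalltau treats its two arguments as paired observations, so handing it two lists of names compares the alphabetical order of the names found at each position rather than each team's place in the two tables. A possible alternative is to convert each table into positions first. The sketch below does that; the helper name ranking_tau is my own, and it assumes both lists contain exactly the same names (so 'Manchester Utd', 'Stoke' and 'Leicester' in PL_rankings would first need to match the spellings in teams). I have not rerun the notebook with it, so the values it gives may differ from those printed above.

```python
import scipy.stats as stats

def ranking_tau(reference, other):
    """Kendall's tau between two orderings of the same set of names
    (hypothetical helper, not part of scipy)."""
    # Place of every name in the second ordering (0 = top).
    place_in_other = {name: place for place, name in enumerate(other)}
    reference_places = list(range(len(reference)))
    other_places = [place_in_other[name] for name in reference]
    tau, p_value = stats.kendalltau(reference_places, other_places)
    return tau

# Toy check with made-up names: one adjacent swap among four items.
toy_tau = ranking_tau(['a', 'b', 'c', 'd'], ['a', 'c', 'b', 'd'])
```

With matching names, ranking_tau(PL_rankings, PL_offensive) and ranking_tau(PL_rankings, PL_defensive) would be the equivalent calls.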
