National team rankings using paired comparison models

S. E. Hallinan, "Paired comparison models for ranking national soccer teams", M.S. Thesis, Department of Mathematical Sciences, Worcester Polytechnic Institute, 2005. [PDF]

This thesis extends a Bradley-Terry statistical model to accommodate drawn results, home advantage, and neutral-site matches, and applies it to international results over the last ten years in order to develop an alternative to the FIFA national team rankings.

———-

Shawn Hallinan was a Master's student majoring in Applied Statistics at Worcester Polytechnic Institute, and is currently with IMS Health.  In his thesis, he developed an alternative ranking system to the FIFA world rankings for national teams that included dynamically-updated weights for match results in friendlies and official international matches.

By way of comparison, this Wikipedia link presents a good summary of the current FIFA World Rankings.  The methodology was revised extensively in 2006 and considers the following factors:

  1. Match result (3-1-0, with 2 points for win via penalties and 1 point for loss on penalties)
  2. Match status: friendly/qualifier/continental finals/World Cup finals
  3. Opposition strength: based on ranking position
  4. Regional strength
  5. Time: Last five years, and a decaying exponential relationship over time

Hallinan's thesis was published while the pre-2006 FIFA rankings were in effect, so it's only fair to describe the contributing factors here:

  1. Match scoreline
  2. Home/away advantage
  3. Match status
  4. Opposition strength: based on points difference
  5. Regional strength
  6. Time: last eight years, and an inverse linear relationship with time

The main point of Hallinan's thesis is that while the FIFA rankings model used more variables, the relative weighting of those variables to create the ranking score was arbitrary.  The thesis' contribution is the use of a particular type of paired comparison model — a Bradley-Terry model — that updates its weighting factors as it receives new match result data. 

A paired comparison model, as you might expect, is a statistical model that compares two entities based on certain traits.  The Bradley-Terry model (more accurately the Bradley-Terry-Luce model), as applied to sports teams, gives the probability that one team defeats the other in a match, accounting for their relative strengths:

$$p_{ij} = \frac{\pi_i}{\pi_i + \pi_j}$$

where $p_{ij}$ is the probability that team i defeats team j, and $\pi_i$ and $\pi_j$ are parameters that describe the strength of the competing teams.  The above expression can be derived by first assuming that the team score S follows an extreme value distribution (parametrized by the natural logarithm of the team strength variable), noting that the difference between the scores of teams i and j then follows a logistic distribution, and then writing the probability that $S_i > S_j$.
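As a quick numerical check of the derivation just described, the Python sketch below compares the closed-form Bradley-Terry probability with a simulation in which each team's score is a Gumbel (type-I extreme value) draw located at the log of its strength. The function names, the strengths 3 and 1, and the sample size are illustrative assumptions, not values from the thesis.

```python
import math
import random


def bt_win_prob(pi_i, pi_j):
    """Closed-form Bradley-Terry probability that team i beats team j."""
    return pi_i / (pi_i + pi_j)


def gumbel_score(strength, rng):
    """Draw a score from a standard Gumbel distribution shifted by log(strength)."""
    # If E ~ Exponential(1), then -log(E) follows a standard Gumbel distribution.
    e = max(rng.expovariate(1.0), 1e-300)  # guard against log(0)
    return math.log(strength) - math.log(e)


def simulated_win_prob(pi_i, pi_j, n=20000, seed=0):
    """Fraction of simulated matches in which S_i > S_j."""
    rng = random.Random(seed)
    wins = sum(
        gumbel_score(pi_i, rng) > gumbel_score(pi_j, rng) for _ in range(n)
    )
    return wins / n


if __name__ == "__main__":
    print(bt_win_prob(3.0, 1.0))         # closed form
    print(simulated_win_prob(3.0, 1.0))  # Monte Carlo estimate, should be close
```

With strengths 3 and 1 the closed form gives 0.75, and the simulated frequency should land near that value, up to sampling noise.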

Now, let's generalize this to a collection of teams that play against each other, for example all of the national teams that are members of FIFA.  The probability distribution of y, which expresses the number of times that team i defeats team j, becomes a multinomial distribution, and the strength parameter is now expressed as the vector $\pi = (\pi_1, \pi_2, \ldots, \pi_p)$.  All ranking systems attempt to answer the following questions: Given the match outcomes y among p teams, can we find an estimate of the relative strengths (i.e., the ratings) of those p teams?  And what is the likelihood that those estimated strengths are closest to the truth?  The estimate is expressed as

Sm20090522_eq02

where $y_i$ is the sum of all of the wins $y_{ij}$ by team i over its rivals j (summed from 1 to p), and the maximum likelihood of the estimate is represented by L*(π|y).
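To make the estimation step concrete, here is a minimal Python sketch that fits Bradley-Terry strengths to a table of win counts. It uses the classical iterative (Zermelo / minorization-maximization) update, which is a standard way to maximize the Bradley-Terry likelihood but is not necessarily the procedure used in the thesis. The function name and the three-team win table are made up for illustration.

```python
def fit_bradley_terry(wins, iters=500):
    """Estimate strengths pi from wins[i][j] = number of times i beat j.

    Assumes every team has at least one win and one loss, so that the
    maximum likelihood estimate exists.
    """
    p = len(wins)
    pi = [1.0] * p
    for _ in range(iters):
        new = []
        for i in range(p):
            y_i = sum(wins[i])  # total wins of team i
            denom = sum(
                (wins[i][j] + wins[j][i]) / (pi[i] + pi[j])
                for j in range(p)
                if j != i
            )
            new.append(y_i / denom)
        scale = p / sum(new)  # strengths are only defined up to a constant
        pi = [x * scale for x in new]
    return pi


if __name__ == "__main__":
    # Hypothetical results among three teams (row beat column).
    wins = [
        [0, 3, 4],
        [1, 0, 3],
        [1, 2, 0],
    ]
    print(fit_bradley_terry(wins))
```

At convergence, each team's observed win total matches the number of wins the fitted probabilities would predict, which is the defining property of the maximum likelihood estimate for this model.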

It's possible to express this model in terms of Bayesian analysis, which seeks to estimate parameters by looking at their statistical distribution and updating it as new data are received.  This is the approach that many estimation algorithms use (the Kalman filter and the particle filter, to give two examples) and it's also the approach that Hallinan takes in his thesis.  Hallinan defines a multivariate distribution of the strength parameter vector π, with mean μ and covariance matrix C.  Before matches are played, the parameters are assumed to have a normal distribution, and as matches are played, the parameters are estimated with the maximum likelihood estimate and those estimates are used to calculate likelihoods based on subsequent match results.  As you might expect, it's important to have an initial distribution and an initial covariance matrix that are close to reality: not exact, but close enough.
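The Bayesian idea can be illustrated with a deliberately small example: two teams, a normal prior on the log of their strength ratio, and a posterior computed on a grid. This is only a sketch of prior-times-likelihood updating under my own assumptions (a single parameter, a prior centered at zero with standard deviation 1, hypothetical win counts), not the multivariate model in the thesis.

```python
import math


def posterior_mean_log_ratio(wins, losses, prior_sd=1.0,
                             lo=-6.0, hi=6.0, steps=1201):
    """Posterior mean of d = log(pi_i / pi_j) under a N(0, prior_sd^2) prior."""
    grid = [lo + k * (hi - lo) / (steps - 1) for k in range(steps)]
    log_post = []
    for d in grid:
        log_prior = -0.5 * (d / prior_sd) ** 2
        # Bradley-Terry likelihood on the log scale: p_ij = 1 / (1 + exp(-d)).
        log_lik = (-wins * math.log1p(math.exp(-d))
                   - losses * math.log1p(math.exp(d)))
        log_post.append(log_prior + log_lik)
    peak = max(log_post)
    weights = [math.exp(v - peak) for v in log_post]  # unnormalized posterior
    return sum(d * w for d, w in zip(grid, weights)) / sum(weights)


if __name__ == "__main__":
    print(posterior_mean_log_ratio(8, 2))    # few matches: pulled toward the prior
    print(posterior_mean_log_ratio(80, 20))  # more matches: closer to log(4)
```

With few matches the estimate is pulled toward the prior mean of zero, and as results accumulate it moves toward the maximum likelihood value log(wins/losses), which is why a sensible starting distribution matters most early on.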

So now it's time to discuss the model that Hallinan used.  As I mentioned previously, it's based on a Bradley-Terry model that determines the probability that team i will defeat team j, given their relative strengths.  Hallinan incorporated extensions to the BT model that allowed for drawn matches, home advantage, and match type.  He only used a division between friendlies and official matches, with no further divisions for qualifiers or final tournaments.  Finally, to make the BT model evolve over time, the team ratings are marched forward in time by a first-order auto-regressive model:

  Sm20090522_eq03
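For readers unfamiliar with first-order auto-regressive updates, the Python sketch below shows the general idea: each rating at the next time step is a fraction of its current value plus random noise. The exact form and parameter values in the thesis are not reproduced here; applying the update to log-strengths, and the names `rho` and `sigma`, are my assumptions.

```python
import random


def ar1_step(log_strengths, rho, sigma, rng):
    """Advance each team's log-strength one period: x_next = rho * x + noise."""
    return [rho * x + rng.gauss(0.0, sigma) for x in log_strengths]


def simulate_path(start, rho=0.95, sigma=0.1, periods=10, seed=0):
    """Return the list of rating vectors from the start through `periods` steps."""
    rng = random.Random(seed)
    path = [list(start)]
    for _ in range(periods):
        path.append(ar1_step(path[-1], rho, sigma, rng))
    return path


if __name__ == "__main__":
    # Two hypothetical teams, one above and one below average strength.
    for ratings in simulate_path([0.5, -0.5], periods=5):
        print(ratings)
```

With `rho` below 1, ratings drift back toward the average unless new results keep pushing them away, while `sigma` controls how quickly a team's strength is allowed to change between periods.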

Hallinan used international match result data over a ten-year period.  He didn't say which ten-year period, but I'm assuming 1995-2005.  The model would accept result data as an input, and return a rating of the teams involved.  To assess the variation of those ratings, he employed a Markov Chain Monte Carlo simulation.  To assess the change in the ratings with the model, he used a general static BT model, progressively extended by terms for drawn matches, home advantage, neutral settings, and match types, and a dynamic BT model with the aforementioned terms included. 
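One common way to add draws and home advantage to a Bradley-Terry model is Davidson's tie extension combined with a multiplicative home factor. The Python sketch below uses that formulation purely for illustration; the thesis may parametrize these terms differently, and the names `nu` and `theta` are my own.

```python
import math


def match_probs(pi_home, pi_away, nu=0.0, theta=1.0, neutral=False):
    """Return (P(home win), P(draw), P(away win)) for one match.

    nu:    draw parameter (nu = 0 means draws never occur)
    theta: multiplicative home-advantage factor (ignored at a neutral site)
    """
    home = pi_home if neutral else theta * pi_home
    draw = nu * math.sqrt(home * pi_away)
    total = home + draw + pi_away
    return home / total, draw / total, pi_away / total


if __name__ == "__main__":
    print(match_probs(3.0, 1.0))                                   # plain Bradley-Terry
    print(match_probs(2.0, 2.0, nu=0.5, theta=1.5))                # home side favored
    print(match_probs(2.0, 2.0, nu=0.5, theta=1.5, neutral=True))  # neutral site
```

Setting `nu = 0` and `theta = 1` recovers the basic model, so each extension can be switched on one at a time, which mirrors the progressive comparison described above.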

When I looked at the results, it appeared that the Hallinan ratings model attached disproportionate weight to home advantage.  This benefited national teams such as Brazil, Mexico, and Italy, as well as Costa Rica, who appeared in the top 20 of the Hallinan ranking but was not in the top 20 of the FIFA ranking.  Hallinan doesn't subdivide the rating into its contributing terms, but it seems to me that it's the home form that influences the rating.  There is also no further subdivision of official matches into qualifiers, continental finals, and World Cups, which might have generated different results, but could also have caused the MCMC simulation to break down.

That issue with the simulation occurred when the model was extended so that the ratings evolved with time.  The ratings parameters failed to converge with all the terms included, and it was only when the match type parameter was removed and the time horizon extended that the ratings reached convergence.  This model produced Mexico as the top national team (it had been Brazil in the previous model versions).  That result alone should have cast doubt on the veracity of the model, as a side that had been knocked out of the 2002 World Cup at the round of 16 ended up outranking the 2002 World Cup winners, and the runners-up Germany as well. 

Hallinan presents a mean-square estimate of his rankings relative to the FIFA world rankings, and he shows that by incorporating draws, home advantage and neutral sites, the relative model accuracy improves, but when match type is included, the relative accuracy degrades.  Perhaps this is the reason match type is removed ultimately for the dynamic model, but I think the match type parameter should have been further divided among official matches.  The variation of the team rankings with time also draws questions.  Some teams stay consistent over time, others decline, and some rise and fall as "golden" generations arise and then retire.  But there should be some cyclical variation to account for the turnover of national teams with time (usually on an 8-10 year scale).  That kind of variation was not employed in this dynamic model.
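A mean-square comparison between two rankings can be computed in a few lines. The sketch below is my own simple version (mean squared difference in rank position over the teams both lists share); the thesis may define its measure differently, and the team labels are placeholders.

```python
def mean_square_rank_difference(ranks_a, ranks_b):
    """Mean squared difference in rank position over teams present in both."""
    teams = sorted(set(ranks_a) & set(ranks_b))
    if not teams:
        raise ValueError("the two rankings share no teams")
    return sum((ranks_a[t] - ranks_b[t]) ** 2 for t in teams) / len(teams)


if __name__ == "__main__":
    # Hypothetical rankings of three placeholder teams.
    model = {"A": 1, "B": 2, "C": 3}
    fifa = {"A": 2, "B": 1, "C": 3}
    print(mean_square_rank_difference(model, fifa))
```

A value of zero means the two rankings agree exactly, and larger values mean the model's ordering departs further from the reference ranking.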

In the end, Hallinan presents a statistical model to estimate team rankings, taking account of draws, home/neutral matches, and friendly/official matches.  The framework is useful and gives an underpinning for the various ranking systems, but in the end the results show the limitations of the model and its assumptions.  The results also indicate why FIFA made the extensive changes to the ranking in 2006.
