Goal scoring probability over the course of a football match

M. J. Dixon and M. E. Robinson, "A birth process model for association football matches", The Statistician, 47(3): 523-538, 1998.

How does the probability of the final score change with the relative strength of the two teams, home advantage, time elapsed, and the current score?  This publication describes a so-called "birth process" model and shows that it is useful for modeling not just the final score, but also the evolution of that score during the course of the match.  The model is useful for testing some of the common clichés heard in football and (potentially) making some money at the betting house.


This paper has been burning a hole in my hopper for the past six months, and I've read it off and on in the meantime, but today I've been busy reviewing papers for an upcoming conference.  So I've decided to review this one for this site and get it over with!  This publication by Mark Dixon and Michael Robinson — at the time, two statistics researchers at British universities — is a precursor to the research done by Bittner et al. on "football fever", which explains how the probability of scoring goals changes after a goal has been scored.  Dixon and Robinson go a little further in their paper.

There are a number of reasons to apply statistical analysis to sport, other than the fact that it's a cool way to introduce and test statistical methodologies and get one's name in the general media.  Other reasons are:

  • to assess current strategies and suggest strategic improvements to individuals and teams,
  • to assess the comparable market value of individual players, thus determining who might be "undervalued" or "overvalued" in the transfer or draft market,
  • to examine the fairness of the rules of the game or competition,
  • to predict future outcomes, for either media or betting purposes.

The Dixon/Robinson paper falls into the fourth category.  It also builds on results by Dixon/Coles in a 1997 publication [1] that presented a statistical model for full-time results.  This model (a multi-parameter Poisson distribution) considers the attack and defense qualities of the competing sides and a home advantage factor.  The parameters are estimated by fitting the probabilities that teams A and B will have a particular scoreline to the full-time score data.  The deficiencies of the model are that it assumes the performance rate of a team is constant throughout the tournament (a common failing of a lot of these static statistical models) and cannot model the evolution of a team's performance during a match.
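To make the static model described above a little more concrete, here is a minimal sketch of scoreline probabilities built from two independent Poisson goal counts.  The multiplicative form of the scoring rates (own attack times opposing defense, times a home factor for the home side) and every parameter value are my assumptions for illustration; they are not fitted values, and the sketch leaves out any refinements Dixon/Coles make to the basic Poisson form.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k goals when goals arrive at the given mean rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def scoreline_probs(home_attack, home_defense, away_attack, away_defense,
                    home_advantage, max_goals=10):
    """Return {(home_goals, away_goals): probability} for a single match.

    Assumed rates (illustrative only):
      home rate = home_attack * away_defense * home_advantage
      away rate = away_attack * home_defense
    Scorelines above max_goals per side are truncated.
    """
    home_rate = home_attack * away_defense * home_advantage
    away_rate = away_attack * home_defense
    return {
        (h, a): poisson_pmf(h, home_rate) * poisson_pmf(a, away_rate)
        for h in range(max_goals + 1)
        for a in range(max_goals + 1)
    }


def outcome_probs(probs):
    """Collapse scoreline probabilities into (home win, draw, away win)."""
    home = sum(p for (h, a), p in probs.items() if h > a)
    draw = sum(p for (h, a), p in probs.items() if h == a)
    away = sum(p for (h, a), p in probs.items() if h < a)
    return home, draw, away


# Hypothetical parameter values, chosen only to show the calling pattern.
probs = scoreline_probs(home_attack=1.2, home_defense=0.9,
                        away_attack=1.0, away_defense=1.1,
                        home_advantage=1.3)
home_win, draw, away_win = outcome_probs(probs)
```

In a real fit, the attack, defense, and home advantage parameters would be estimated from full-time results rather than typed in by hand.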

What Dixon/Robinson propose is the main result of the paper, the two-dimensional birth process model.  I still don't understand completely how it works, but I interpret it to be a way to consider the two competing scoring processes simultaneously.  The idea is that the scoring rate changes during the match, and the variation of this rate depends on the current score.  Dixon/Robinson show a two-dimensional chart that looks like a series of steps as the scoreline changes.  It's not clear to me how to implement such an algorithm — I really need some quiet time to understand everything — but it will be a difficult task.  First of all, one needs a TON of data, and full-time scorelines aren't enough.  I would need to know the goal times to properly use this model.  Dixon/Robinson used results from all four English professional divisions over three seasons.  That's over 4000 matches and almost 10,400 goal times.  This is the kind of research for which a good soccer result database is ideal.
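My reading of the idea can be sketched as a small simulation: two goal-scoring processes run side by side, and each side's rate is rescaled whenever the scoreline changes.  This is not the Dixon/Robinson model itself.  The function names, the multipliers in `example_score_effect`, and the per-minute base rates are all hypothetical, and I hold the rates constant between goals, so the gradual rise in scoring rate over the match is ignored.

```python
import random


def example_score_effect(home, away):
    """Hypothetical (home, away) rate multipliers for the current score."""
    if home > away:
        return 0.9, 1.2   # assumed: the trailing away side presses harder
    if away > home:
        return 1.2, 0.9   # assumed: the trailing home side presses harder
    return 1.0, 1.0


def simulate_match(home_rate, away_rate, score_effect, minutes=90.0, rng=None):
    """Simulate one match as a two-dimensional birth process.

    home_rate and away_rate are base goals per minute.  Between goals the
    rates are constant, so the wait for the next goal is exponential with
    the combined rate; the scorer is then picked in proportion to each rate.
    """
    rng = rng or random.Random()
    t, home, away = 0.0, 0, 0
    goals = []  # list of (minute, "home" or "away")
    while True:
        h_mult, a_mult = score_effect(home, away)
        lam = home_rate * h_mult
        mu = away_rate * a_mult
        total = lam + mu
        if total <= 0:
            break  # nobody can score any more
        t += rng.expovariate(total)
        if t >= minutes:
            break  # the next goal would fall after the final whistle
        if rng.random() < lam / total:
            home += 1
            goals.append((t, "home"))
        else:
            away += 1
            goals.append((t, "away"))
    return home, away, goals


# Hypothetical base rates: about 1.6 and 1.2 goals per 90 minutes.
final_home, final_away, goal_times = simulate_match(
    1.6 / 90, 1.2 / 90, example_score_effect, rng=random.Random(42))
```

Fitting a model like this is where the goal times come in: with only full-time scorelines there is no way to tell how the rates moved as the score changed.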

Dixon/Robinson are able to draw some conclusions from the model:

  1. The scoring rate for both teams generally increases during a match.  It goes way up at 45 and 90 minutes, but Dixon/Robinson lump all of the injury time goals into 45 or 90 minutes.  (With the recent FIFA goal time conventions it would be interesting to show how the scoring rate changes during the stoppage time period.)
  2. The attack and defense parameters in general tend to degrade from the Premiership to the lower divisions.  In other words, as you go down the divisions, strikers tend to score less by their own efforts than by opportunities given by the poorer defense of the lower division sides.
  3. The scoring rates of the home and away teams depend very much on the current score.

More controversially, Dixon/Robinson fail to see any evidence for the common football cliché that a team is never more vulnerable to being scored upon than immediately after a goal.  Bittner et al. found support for the opposite, that a team that scores a goal increases its probability of scoring subsequent goals.

In the final section, Dixon/Robinson show that their model could be used in spread betting situations, during which a betting house could use the model to set prices or a bettor could determine what kind of bet to make.  Spread betting refers to the then-growing practice of betting that a team will score more or less than a given result, and then winning (or losing) money given the type of bet being made.  It's very similar to making calls and puts in options trading.  (I'll leave it to the reader to think of all the similarities between most stock/commodities trading and gambling.)  Dixon/Robinson show that there might be some inefficiencies in the betting pricing, which would be of huge interest to the bettors, and definitely to the betting companies as well!
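For readers who haven't seen a spread bet, here is a minimal sketch of how the payoff works and how a model's probabilities could be set against a quoted price.  It assumes the usual convention that a "buy" pays the stake times (result minus price) and a "sell" pays the stake times (price minus result).  The quote and the goal distribution are made up for illustration; nothing here comes from the paper's data.

```python
def spread_bet_profit(outcome, price, stake, side):
    """Profit (negative for a loss) of one spread bet once the result is known.

    side="buy" backs the result to finish above the price;
    side="sell" backs it to finish below.
    """
    if side == "buy":
        return stake * (outcome - price)
    if side == "sell":
        return stake * (price - outcome)
    raise ValueError("side must be 'buy' or 'sell'")


def expected_profit(outcome_probs, price, stake, side):
    """Average profit under a model's {outcome: probability} distribution."""
    return sum(p * spread_bet_profit(k, price, stake, side)
               for k, p in outcome_probs.items())


# Hypothetical quote on total goals: sell at 2.7, buy at 3.0.
# Hypothetical model distribution for total goals (probabilities sum to 1).
model = {0: 0.08, 1: 0.20, 2: 0.26, 3: 0.22, 4: 0.14, 5: 0.10}
buy_value = expected_profit(model, price=3.0, stake=10, side="buy")
sell_value = expected_profit(model, price=2.7, stake=10, side="sell")
```

If the model's expected total sits outside the quoted spread, one of the two sides has positive expected profit, which is the kind of pricing inefficiency a bettor would be looking for.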

Well, for those of us in the USA that kind of sports betting isn't an issue, but the paper does present a more sophisticated statistical model accompanied by insight that might be useful in other situations.  It's good solid academic research, but outside of the gaming application and the multi-parameter Poisson distribution I'm not sure how useful it will be to soccermetricians.

[1] M. J. Dixon and S. G. Coles, "Modelling association football scores and inefficiencies in the football betting market", Applied Statistics, 46: 265-280, 1997.