A literature search of statistical applications to soccer

Benoit Emonet, "Revisiting Statistical Applications in Soccer", STS Report, Department of Mathematics, Swiss Federal Institute of Technology, 2000.  [PDF]

This paper was completed for an undergraduate class project in the Department of Mathematics at the Swiss Federal Institute of Technology (EPFL) in Lausanne.  It is a literature search of soccer-related papers published in statistical journals or presented at statistics conferences.  The paper starts with some irrelevant information on the history of the sport through the ages, some of it incorrect: the word "soccer" is NOT an American invention but an English one (for more information see this piece by Garry Archer; scroll down to 'SOCCER').

Despite this, the paper redeems itself with an excellent classification of the various competitions played in soccer and the research problems examined therein: round-robin competitions (most domestic leagues), knockout competitions (most domestic cups), and mixed league/knockout competitions (World Cups, continental championships, Champions Leagues).  The research problems fall into three groups: goalscoring models, seeding coefficient formulations, and result prediction models.  The ultimate objective of these studies is to predict the eventual winner of the competition, which makes them of interest to the betting community.

Below is a summary of the three competition classes and the statistical problems examined.

League competition:

  • Goalscoring probability distribution models
    • Poisson
    • Negative binomial
    • Effect of match events and conditions
      • home advantage
      • artificial pitch
      • red cards
  • Result prediction models
    • Static model
      • Maximum likelihood models
      • Generalized linear model
    • Dynamic model
      • Joint Poisson distribution
      • Paired comparisons w/ generalized Kalman filter
      • Markov chain Monte Carlo
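
To make the first two league items concrete, here is a minimal Python sketch of how a Poisson goalscoring model turns into a result prediction. It is not taken from the paper: the function names, the independence of the two teams' goal counts, and the example scoring rates are all assumptions made for illustration. Home advantage would enter simply as a larger rate for the home side.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k goals when goals follow a Poisson(rate) law."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def match_outcome_probs(home_rate, away_rate, max_goals=10):
    """Return (P(home win), P(draw), P(away win)).

    Assumes the two teams score independently of each other, which is a
    simplification; scorelines above max_goals are truncated and the three
    probabilities are renormalised to sum to 1.
    """
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, home_rate) * poisson_pmf(a, away_rate)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    total = home_win + draw + away_win
    return home_win / total, draw / total, away_win / total


# Hypothetical rates: the home side averages 1.5 goals, the visitors 1.1.
home, tie, away = match_outcome_probs(1.5, 1.1)
print(f"home {home:.3f}  draw {tie:.3f}  away {away:.3f}")
```

A negative binomial model would replace `poisson_pmf` with a distribution that allows more variance than the mean, and the static and dynamic models in the outline differ mainly in how the scoring rates are estimated and updated over a season.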

Knockout competition:

  • Seeding coefficients
    • Constant during competition
    • Varying during competition
    • Hybrid model
  • Result prediction model
    • Logistic regression model
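
As a rough illustration of the last item, a logistic regression for knockout ties can be sketched in a few lines of Python. The setup is an assumption, not the formulation used in any of the surveyed papers: each team has a single seeding coefficient, the probability of advancing depends only on the difference between the two coefficients, and the single slope `beta` is fitted by plain gradient ascent on the log-likelihood. All names here are hypothetical.

```python
import math


def advance_probability(coef_a, coef_b, beta):
    """P(team A advances), modelled as a logistic function of the
    difference in seeding coefficients (assumed one-parameter form)."""
    return 1.0 / (1.0 + math.exp(-beta * (coef_a - coef_b)))


def fit_beta(ties, steps=500, learning_rate=0.01):
    """Estimate beta from past ties by gradient ascent.

    `ties` is a list of (coef_a, coef_b, a_advanced) tuples, where
    a_advanced is 1 if team A went through and 0 otherwise.
    """
    beta = 0.0
    for _ in range(steps):
        gradient = 0.0
        for coef_a, coef_b, a_advanced in ties:
            p = advance_probability(coef_a, coef_b, beta)
            # Derivative of the log-likelihood with respect to beta.
            gradient += (a_advanced - p) * (coef_a - coef_b)
        beta += learning_rate * gradient / len(ties)
    return beta
```

Given a list of historical ties, `fit_beta(ties)` returns the slope, and `advance_probability(coef_a, coef_b, beta)` then gives the predicted chance for any new pairing. Whether the coefficients stay constant or are updated during the competition is exactly the distinction drawn in the outline above.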

Mixed league/knockout competition (the paper considered only World Cup finals):

  • Comparative quality of first-round groups
    • Paired comparison models
    • Monte Carlo simulation of results
  • Combined goalscoring distribution model with result prediction model
    • 1998 World Cup (asks the question "Was France's win a fluke?"; the answer was "No!")
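
The Monte Carlo idea for first-round groups can be sketched as follows. This is a simplified illustration under stated assumptions rather than the method of any surveyed paper: every team has one fixed Poisson scoring rate regardless of opponent, a win is worth 3 points, the top two teams advance, and ties in the table are broken by goal difference and then by a random draw. The team labels and rates in the usage line are made up.

```python
import itertools
import math
import random


def sample_poisson(rng, rate):
    """Draw one Poisson(rate) variate using Knuth's multiplication method."""
    threshold = math.exp(-rate)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def simulate_group(rates, n_sims=10000, seed=0):
    """Estimate each team's probability of finishing in the top two.

    `rates` maps a team name to its assumed average goals per match.
    """
    rng = random.Random(seed)
    teams = list(rates)
    advanced = {team: 0 for team in teams}
    for _ in range(n_sims):
        points = {team: 0 for team in teams}
        goal_diff = {team: 0 for team in teams}
        # Single round robin: every pair of teams meets once.
        for a, b in itertools.combinations(teams, 2):
            goals_a = sample_poisson(rng, rates[a])
            goals_b = sample_poisson(rng, rates[b])
            goal_diff[a] += goals_a - goals_b
            goal_diff[b] += goals_b - goals_a
            if goals_a > goals_b:
                points[a] += 3
            elif goals_b > goals_a:
                points[b] += 3
            else:
                points[a] += 1
                points[b] += 1
        # Rank by points, then goal difference, then a random draw.
        ranked = sorted(
            teams,
            key=lambda t: (points[t], goal_diff[t], rng.random()),
            reverse=True,
        )
        for team in ranked[:2]:
            advanced[team] += 1
    return {team: advanced[team] / n_sims for team in teams}


# Hypothetical four-team group with made-up scoring rates.
print(simulate_group({"A": 2.0, "B": 1.5, "C": 1.0, "D": 0.5}, n_sims=2000))
```

Running the same simulation for each group and comparing how hard it is for a team of a given strength to advance is one way to quantify the comparative quality of the groups.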

Keep in mind that the paper was written in 2000, so more research on these subjects has almost certainly appeared since.  At that time there were no studies of the UEFA Champions League, though some may have been published since then (I haven't checked yet).  Nevertheless, the survey paper does a good job of assembling the state of the art in statistical applications in soccer and, with 41 referenced works, gives a good starting point for experimenting with the various algorithms and distributions.