Well… not really.

I wrote about the idea of expected goals way back in my first Soccermetrics post in January 2009:

I believe that a useful statistic in soccer will ultimately contribute to what I call an “expected goal value” — for any action on the field in the course of a game, the probability that said action will create a goal. One might obtain certain types of data from actions associated with the various positions…—“Moneyball and soccer”, January 8, 2009

Of course, writing about an idea is one thing; actually implementing it is another. I did some internal work in 2011 on goal probabilities given certain features that are used in xG models, set it aside, and then got swamped by other priorities. In the years since, analysts such as Michael Caley, Sander Ijtsma, and the guys at American Soccer Analysis have taken the lead on developing and disseminating xG models and they deserve the credit and attention that they have gained.

It’s past time for me to get involved in expected goal models and their applications. Since 2011 I have accumulated a lot more practical knowledge about statistics, data modeling, and machine learning that can be used to build useful models to apply to interesting football questions. I don’t expect to write anything ground-breaking in this post; the objective here is to write my modeling methodology down so that I can refer to it and build from it later. If this post turns out to be useful to you, that’s great, too. I do plan on making my code publicly available in short order so that you can tell me how I’m building xG models wrong.

Expected goals models, like all statistical models, depend on data in order to hypothesize about the world. They require large and sufficiently rich data sets to describe and predict the outcomes of shots in a football game. There are sports data companies that provide finely-grained event data — not just temporal and spatial data, but descriptions of the play, the body part used, even the amount of defensive pressing. Other companies may not offer more than the time, spatial coordinates, and a few event flags (was it a penalty? an own goal? a free kick?). It takes hundreds of thousands of shots over multiple seasons to observe meaningful patterns, and obtaining years of such data for modeling is expensive and, for most independent analysts, infeasible.

Last year I asked this question on Twitter…

Is there a public repository of training/testing data that people can use to build and evaluate xG models?

— Soccermetrics (@soccermetrics) July 12, 2016

…and I got some interesting responses. I was referred to Chris Long’s soccer analytics repository on GitHub that contains a huge collection of data from competitions around the world. It’s a massive data set to be sure, not as rich as data from the major sport data companies, but it is one that can be used to build and benchmark expected goals models. So in order to exploit this data set, one has to revisit the question, *“At its core, what is an xG model made up of anyway?”*

An expected goals model is a conditional probability model that answers the question, *“Given a collection of parameters that describes a shot toward goal, what is the probability that a goal is scored?”*

Let’s say that \(\mathbf{x}\) represents this collection of parameters (the parameter vector), and \(G\) the goal event. Then we can write this conditional probability model as

\[

Pr(G|\mathbf{x}) = f(\mathbf{\beta}, \mathbf{x})

\]

where \(\mathbf{\beta}\) represents the model coefficients associated with the shot parameters.

Shots are binary events whose probability of success has to be between 0 and 1, which makes a logistic function a great representation. Now the model becomes

\[

Pr(G|\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{\beta}^T\mathbf{x}}}

\]

It’s the selection of the shot parameters that makes every analyst’s xG model unique as a product of observation, conjecture, and judgment. Some parameters are common to almost all models, but the more exotic ones depend on the richness of the data set and the willingness to search the entire possession chain to confirm that the shooter gained possession in the final third, received a through pass, and made a pirouette before curling his shot around the keeper, all within 10 seconds.

I’ve decided to keep the parameters simple for a couple of reasons: to accommodate the limitations of some of the data sets I’ve been working with, and to figure out which parameters are essential.

**Distance:** This is a no-brainer — it stands to reason that a shot closer to goal has a greater chance of being converted than one further away. Distance is measured from the shot coordinate to the center of the goal line and normalized by the distance \(r_{max}\) between that point and the far corner, so that the rescaled distance lies between 0 and 1.

**Angle:** Another obvious parameter. Again, it stands to reason that shots from in front of goal have a greater chance of being converted. I define this as the angle between the distance vector and the centerline intersecting the two goal lines, with positive angles denoting shots from the shooting team’s left flank. The angle is then divided by \(\frac{\pi}{2}\) so that it lies between -1 and 1.
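As a concrete sketch, here is how the two rescalings might be computed. The pitch dimensions and coordinate frame are assumptions of mine, not the post's:

```python
import math

# Hypothetical pitch dimensions in metres; the post does not specify a
# coordinate system, so this frame (origin at the defending team's corner,
# attacked goal centred on the line x = PITCH_LENGTH) is an assumption.
PITCH_LENGTH = 105.0
PITCH_WIDTH = 68.0
GOAL_X, GOAL_Y = PITCH_LENGTH, PITCH_WIDTH / 2.0

# r_max: distance from the goal-line centre to the far corner (0, 0).
R_MAX = math.hypot(PITCH_LENGTH, PITCH_WIDTH / 2.0)

def shot_distance(x, y):
    """Distance from the shot to the centre of the goal line, rescaled to [0, 1]."""
    return math.hypot(GOAL_X - x, GOAL_Y - y) / R_MAX

def shot_angle(x, y):
    """Angle between the distance vector and the centreline, divided by pi/2
    so it lies in [-1, 1]. Which sign corresponds to the shooter's left flank
    depends on the coordinate convention chosen above."""
    return math.atan2(GOAL_Y - y, GOAL_X - x) / (math.pi / 2.0)
```

A shot from the penalty spot, `shot_distance(94.0, 34.0)`, rescales to roughly 0.1, while a shot level with the goal line at the near post corner gives an angle of ±1.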

**Score/Match State:** This parameter was not so obvious to me when I worked on an xG model five years ago, but I’ve been impressed with how significant this parameter is in trained models. Match state is the score differential with respect to the shooter’s team, so if team A leads team B 2-0 and a player from team B makes a shot, the match state is -2. I’ve scaled the match state with a logistic function so that the result is between -1 and 1:

\[

\Delta' = \frac{2}{1 + e^{-\Delta}} - 1

\]
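In code, the rescaling \(\Delta' = 2/(1 + e^{-\Delta}) - 1\) is a one-liner; it is algebraically the same as \(\tanh(\Delta/2)\):

```python
import math

def scaled_match_state(delta):
    """Rescale the goal differential delta to (-1, 1) with a logistic curve;
    algebraically this equals tanh(delta / 2)."""
    return 2.0 / (1.0 + math.exp(-delta)) - 1.0
```

For the example above, a team trailing 2-0 has `scaled_match_state(-2)`, roughly -0.76.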

**Match Time:** This parameter was not obvious to me either, but I’ve seen it have an effect in other xG models. How much of an effect is worth discussing, but I’ve left it in the model. As you can tell, scaling and normalizing parameters is a big deal for me, and time is treated no differently. Shot times are scaled by the duration of the period in which they occur, so that halftime is always 0.5 and full time is always 1.0. I’m only considering 90-minute matches, but I guess that matches that go to extra time will see match time scaled to 2.0.

**Play Type:** There are certain events that are more likely to produce goals than others, but the usefulness of this feature depends on the richness of the event data. Some data sets describe shots as the result of set-pieces or throughballs or crosses. Other sets do no more than differentiate open play shots from penalties. The time between the originating play and the shot can be useful as its own parameter, but it may not always be available.

**Body Part:** The body part used to execute the shot (and no, hands do not count). I believe that body part is a proxy for shot velocity, and certain body parts are more likely to be used from specific plays (headers from crosses or corners, for example). Again, some data companies include this data but others don’t. I know of a few companies that only describe headed goals but not all headed shots. So this parameter is regrettably optional for some types of data.

**Competition/Country:** It’s easy to believe that “football is football” and that certain shots have a much better chance of going in no matter what the competition. But it seems clear from observation that shot conversion rates depend on the quality and characteristics of the competition. One can choose to capture this in a competition variable (which I believe Michael Caley does) or create individual models for each competition. The advantage of using a competition/country feature is that more training data is available, but the risk is smearing together the individual league characteristics.

I’ve written my own code to implement logistic regression models before, which has its own advantages, but it’s best to use off-the-shelf tools that have already been tested and optimized. To this end, I use the LogisticRegression class from scikit-learn to create my xG model.
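A minimal sketch of fitting such a model with scikit-learn's `LogisticRegression`, using synthetic shot features and labels (the feature layout and the goal-generating rule are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy shot matrix: columns stand in for the scaled features described above
# (distance, angle, match state, match time); the values are synthetic.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 4))

# Synthetic labels: shots with a smaller "distance" (column 0) score more often.
y = (rng.uniform(size=500) < 1.0 / (1.0 + np.exp(4.0 * X[:, 0]))).astype(int)

model = LogisticRegression()
model.fit(X, y)

# The predicted goal probability for a shot is its xG value.
xg = model.predict_proba(X[:1])[:, 1]
```

The fitted `model.coef_` plays the role of \(\mathbf{\beta}\) in the equation above, and `predict_proba` evaluates the logistic function for new shots.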

For a single competition, I train my model with shot data from matches across multiple seasons. I hold out data from one or two seasons as an out-of-sample set that the model will never see. I then partition the remaining data into a training data set (which yields the parameters) and a validation set (which evaluates the model’s performance) in a 75/25 split. The training/validation data are partitioned at a match level — all of the shots for a particular match are either in the training data set or the validation data set. Scikit-learn has some nice tools that will partition a large dataset into folds that preserve the distribution of successful and unsuccessful shots.
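A sketch of that partitioning scheme, with invented match identifiers: `GroupShuffleSplit` keeps all of a match's shots on one side of the 75/25 split, and `StratifiedKFold` is one of the scikit-learn tools that builds folds preserving the goal/no-goal ratio:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, StratifiedKFold

# Illustrative data: shot feature rows tagged with the match they came from.
rng = np.random.default_rng(1)
n_shots = 200
X = rng.uniform(-1, 1, size=(n_shots, 4))
y = (rng.uniform(size=n_shots) < 0.1).astype(int)  # roughly 10% of shots score
match_id = rng.integers(0, 20, size=n_shots)       # 20 hypothetical matches

# 75/25 split at the match level: every shot from a match lands on one side.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=1)
train_idx, valid_idx = next(splitter.split(X, y, groups=match_id))

# Stratified folds preserve the distribution of successful shots for tuning.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
```

With real data, `match_id` would come from the event feed rather than a random draw.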

I know that \(R^2\) is a popular metric for evaluating the quality of an xG model. I’m not convinced that \(R^2\) is the proper metric to use here, as the coefficient rises with the addition of more features and is ultimately misleading. Instead I’ve been drawn to the Brier score, which is a comparison between predicted probabilities and the actual outcome. I use the Brier score to tune models during cross-validation by comparing predicted goal probabilities to the actual outcomes of shot events. I did a quick search and was pleased to find out that I wasn’t the only one who thought of Brier scores in xG modeling — see SciSports and Pinnacle Sports (although you’re not supposed to use Brier scores for non-binary outcomes).
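scikit-learn ships the metric directly as `brier_score_loss`; a toy example with invented outcomes and forecasts:

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Invented shot outcomes (1 = goal) and a model's predicted goal probabilities.
outcomes = np.array([0, 0, 1, 0, 1])
predicted = np.array([0.1, 0.2, 0.7, 0.4, 0.6])

# Brier score: mean squared gap between forecast probability and outcome.
# 0 is a perfect forecast; forecasting 0.5 everywhere scores 0.25.
score = brier_score_loss(outcomes, predicted)
```

Lower is better, which makes the score a natural objective when tuning hyperparameters during cross-validation.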

I first demonstrate the xG model on Argentina’s Primera División over two seasons — a 30-match single round-robin season (plus a derby round) in 2015, and a “transitional” tournament in 2016 in which the teams were split into two groups and they played 15 intragroup matches and one intergroup match (the derby match, usually). I wanted to verify that I could observe the same things as everyone else who had been working with xG models.

This analysis uses a “simple” xG model — body part information is not included in the shot data, and the only plays that we have knowledge of are open play events and penalties.

The relationship between expected and actual goals scored is stronger in the 2015 season than the 2016 season. It could be because of the change in tournament format over the two seasons – half as many games are played in the 2016 transitional tournament. It’s something to explore at a later time.

One of the characteristics of the xG metric touted by its proponents is its stability over consecutive seasons. In other words, the ratio between expected goals scored and the sum of expected goals scored and allowed is a more robust indicator of a team’s strength than points per game. To test this I computed the points per game and expected goals ratio of the 28 Argentine clubs that competed in the 2015 and 2016 tournaments. The year-on-year relationship between xG ratios in consecutive seasons is a moderately strong one (and this could be a product of the squad and management volatility in the local clubs), but much stronger than a similar relationship between PPG in consecutive seasons.

Finally I take a page from Sander’s blog and recreate his predictive quality comparison of popular analytics ratios to the xG ratio. I compare the following metrics: points per game, goal ratio, Total Shot Ratio (shots toward goal), and xG ratio. The plots communicate the correlation between these metrics computed for each team before a given matchday and the same metric computed after the given matchday. In the 2015 tournament the predictive ability of expected goals is stronger than that of TSR, but is consistently below that of goal ratio and points per game in the first half of the season. The 2016 tournament has the really interesting results — the predictive performance of the xG metric lags behind all other metrics. I don’t have any idea why that might be the case (short tournament? group format?), but this definitely falls in the “hey, that’s weird” observation that spurs so much discovery.

There is so much more to say about expected goals, but this post is almost 2500 words so it’s best to wrap it up. Expected goals is a concept that is here to stay in football analytics. Furthermore, analysts have extended the concept to other plays in soccer, from assists to saves to passes and even defensive actions. I believe that expected goals are capturing some real underlying characteristics of team performance, but it’s not a bulletproof metric (none are) so understanding the idiosyncrasies of the competitions still matters to some degree. Expected goals does deserve its place in the toolbox of football analytics, and now that I’m caught up, I plan on using it more.


Bottom Line: Cold or sub zero temps do not guarantee below 2.5 goals in any one match, but over a spread of matches the colder temperatures do help reduce the expected goal tallies and when the sub zero matches are clumped together (the 6) the goal tallies tend to shrink further. Cold kills goals, FACT.

(Emphasis theirs.) I guess my hackles were raised by a statistical post that made such a strong statement on so little data (16 matches played on *one day* in the 2012-13 UEFA Europa League!!), but it did get me thinking: what is the effect of temperature on goal scoring in football? And is the effect significant enough to matter?

To start to answer these questions, I took kickoff temperature and goal scoring data from matches in the 2011-12 Premier League, which incidentally is available from the beta version of the Soccermetrics API, for a total of 380 observations. I consider the effect on goal scoring with the over/under metric, an informal standard for assessing whether a football match is high-scoring. The dependent variable is \(P(O/U>2.5)\), the proportion of matches in which the over/under is greater than 2.5 goals. The threshold is arbitrary, of course, but it is the standard number given by betting houses and just below the three-year Premier League average of 2.8 goals per game.

So what was the spread of temperature and total goals in Premier League matches in 2011-12? Something like this:

The distribution of matches with temperature follows a normal distribution, as does the raw number of matches in which the O/U exceeds 2.5. However, the proportion of matches that were high-scoring at each temperature reading doesn’t appear to follow any discernible pattern. It is accurate to say that there were fewer matches with more than two goals scored when the temperature dropped below 5 degrees Celsius. But then again, there were fewer matches being played under 5 degrees Celsius — at least in England.

So how sensitive is O/U to temperature? I perform a logistic regression analysis to examine that question. It’s a very simple (crudely simple) regression:

\[

\log \frac{\pi}{1-\pi} = \beta_0 + \beta_1 x_1

\]

where \(\pi\) represents \(P(O/U > 2.5)\). It’s a linear regression on the log odds of \(P(O/U > 2.5)\). The resulting coefficients are \(\beta_0 = 0.498 \pm 0.018\) and \(\beta_1 = -0.012 \pm 0.002\). For a single regressor it’s straightforward to display a graph, so here is one below.

So for every degree Celsius increase in playing temperature, the estimated odds of exceeding 2.5 goals *decrease* by 1.2%. Translated to probabilities, the difference in magnitudes at various temperatures is not that great. At zero degrees Celsius, the estimated \(P(O/U > 2.5)\) is 0.62, which of course means that \(P(O/U < 2.5)\) is 0.38. At 10 degrees Celsius, it is 0.59. Not a big difference.
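Plugging the fitted coefficients back into the logistic function reproduces those numbers:

```python
import math

# Coefficients reported above for the O/U regression.
beta0, beta1 = 0.498, -0.012

def p_over(temp_c):
    """Estimated P(O/U > 2.5) at a given kickoff temperature in Celsius."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * temp_c)))

# Each 1-degree rise multiplies the odds by exp(beta1), about 0.988,
# i.e. the 1.2% decrease quoted in the text.
odds_factor = math.exp(beta1)
```

Evaluating `p_over(0.0)` and `p_over(10.0)` gives 0.62 and 0.59 respectively, matching the probabilities above.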

There are enough caveats with this analysis to cover this post in yellow caution tape. For starters, we are only considering one regressor: kickoff temperature. Predominant weather condition might be worth adding (zero Celsius and clear is a different environment from zero Celsius and snowing), but that would add a categorical variable that I didn’t want to deal with at the time. There could be other meteorological variables that might be important — wind speed, humidity — but those aren’t always readily available. (The J-League reports temperature and humidity in their match reports for reasons that aren’t clear to me.)

The most significant caveat is the need for more data beyond a single season. I feel very uncomfortable making a strong conclusion on a season’s worth of data from a single domestic competition. What should one think of a categorical statement made with data from one night of a continental club competition?

I've had my doubts that you could describe goal distributions as Poisson, at least when it comes to deriving derivative expressions from them. A formulation of the soccer Pythagorean doesn't work with a one-parameter Poisson distribution. And besides, it's rare for the expected goals to be identical to the variance, which is the central assumption of the Poisson distribution. So the Poisson distribution of soccer doesn't fit — right?

Well, I have been looking at goal scoring data from the various European leagues that I used in my Pythagorean study, and I've been plotting means and variances of the goals scored over the course of a season. Below is such a plot from France's Ligue 1 (2009-10 season), with a V sketched to indicate the line where the mean and variance are identical — the Poisson distribution line. Positive means indicate goals scored, and negative means indicate goals allowed — variances are always positive.

The means and variances don't match perfectly to the Poissonian ideal in the actual goal distribution data. But they're close enough. From visual inspection it looks like roughly half of the teams have goal statistics on either side of the Poisson line, but I need to do a more formal analysis to make sure.
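The check behind the plot is straightforward to reproduce; the goals-per-match sequence below is invented for illustration:

```python
import statistics

# An invented goals-per-match sequence standing in for one team's season.
goals = [0, 2, 1, 1, 3, 0, 2, 1, 0, 1, 2, 4, 1, 0, 1, 2, 1, 0, 3, 1]

mean = statistics.mean(goals)
variance = statistics.pvariance(goals)

# A Poisson-distributed tally would have mean == variance; this ratio
# measures how far a team sits from the V-shaped line in the plot.
dispersion = variance / mean
```

A dispersion near 1 puts the team close to the Poisson line; values well above or below it are the departures discussed here.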

There are some more interesting results when one color-codes the circles for teams at the very top or bottom of the table. But that will wait for a later post.

The data also indicate that perhaps a two-parameter Poisson distribution would make for a better goalscoring model and be more tractable for other calculations. It would be worth studying.

Does there exist a formula that can predict the score of a soccer match between two teams? The answer's not that simple, but three physics researchers from Germany have used German league data from the past 20 seasons to develop an expression that predicts qualitatively the expected outcome of a soccer match as a function of the "fitness level" of both sides. They show that this fitness level remains constant over a season, establish that correlations between the two teams more often than not result in draws, and question the existence of "goal affirmation" once a team scores a goal. The most significant result is the existence of a Poisson distribution in the expected goal difference in a match.

For several months I've been working intermittently on an investigation of whether goalscoring distributions are Poisson in nature. My motivation was that whenever I presented my work on the soccer Pythagorean, invariably I would receive questions asking if I had considered a Poisson goalscoring distribution. There is some literature out there that questions the use of Poisson distributions for goals scored by a team over the course of a season, and when I attempted to derive the Pythagorean from a Poisson distribution I could never get the expression to work properly.

The paper that I will discuss touches on a number of topics in soccer goal distributions, from the existence of "self-affirmation" (the football fever that some researchers have mentioned) to the influence to team quality over random factors in football results. Most importantly, the paper will serve as a springboard for presenting some tests on Poisson distributions in the football data that I have.

The authors of this paper are researchers in physical and organic chemistry in a university in Münster, Germany. The lead writer is Dr. Andreas Heuer, whose research spans experimental and computational work in physical chemistry with an interesting side hobby in sport statistics. Naturally, his work in sport statistics has generated more media attention than his physics research! This particular paper was published in Europhysics Letters, which is a journal devoted to publishing brief papers on very recent results (Physical Review Letters in the USA is similar). It may be five pages long, but it is a very mathematically and technically dense paper — you'll need a cup of coffee, a legal pad, and at least two readings to make sense of it.

A little bit of information about the data used for this analysis is in order. The authors used German 1. Bundesliga data between the 1987-88 and 2007-08 seasons, inclusive except for 1991-92, which was the reunification season when the Bundesliga had 20 teams. They did not use data before 1987 because of what they described as a significant difference in the goal distributions before then, upon which they do not elaborate.

There are three major results in this paper, and I'll describe them briefly.

First, **team fitness levels remain constant over a season**. The authors characterize the team "fitness" (a better word might be "quality") as its average goal difference normalized by the number of matches played. This metric is an estimate of the true fitness level of the team which is attained after many matches played. By correlating a team's results with those of all of its rivals over the season (and their rivals as well), one arrives at a measure of how team quality changes over a season. It fluctuates over the course of a season, yet there exists a constant bias term. This constant term corresponds to the variance of team quality in a league. So there are some variations to team quality, but on a macro scale that quality remains constant.

It should be noted that the authors are making conclusions solely on same-season data. So it is very possible — no, it's very likely — that team fitness levels change over multiple seasons, due to either team turnover, or relegation, or promotion. It would be interesting to see how these fitness levels change for clubs who are either newly formed in a league (e.g. MLS) or recently promoted to a new league, and assess how different classes of clubs (one-year wonders, elevator clubs, consolidating teams) perform differently in terms of fitness level.

The second major result is that **fluctuations in fitness have short-term implications but matter little in the long run**. Now that result should make sense to most people, but those short-term fluctuations drive the increased number of draws and streaky runs (winning or losing) by teams. The authors developed a simple model to characterize the match result, which they described in terms of the expected outcome between both teams based on fitness levels, the systematic influence on the match, and random factors. The systematic influences on the match include external effects, such as injuries, suspensions, weather, or the occasion of the match, and intra-match effects, such as expulsions or goals scored. None of these effects can be estimated, of course, but the variance of these effects can, and the authors develop some expressions that do just that. (I am still not sure how they derived those expressions after a couple of readings; I might try again at some point in the future.) By fitting their dataset to the model, they found that the higher-order effects they were modeling fell out, and that the variance due to fitness fluctuations was much smaller than the variance of the expected outcome.

The third major result is that while goal distributions generally aren't Poisson, **goal scoring does appear to follow a Poisson process**. The distinction between "process" and "distribution" is the difference between describing how goals arrive within a match and describing the number of goals scored per match *over the course of a season*. The authors develop Poisson distributions from the expected number of goals of each team (from computing the estimated goal difference and estimated goal sums), and show that the distribution of the goal difference holds up very well to the actual data. The distribution of the actual data does not spread out for lopsided goal differences, which challenges the existence of the goal-affirmation phenomenon proposed by Bittner et al. The Poisson model of the goal difference fits well except for ties and minimum-goal differences, which make up a huge proportion of soccer match results. The issue is that the goals scored by home and away teams are slightly dependent, and it is that slight statistical dependence that accounts for the narrow results and draws. (Tied matches with more than six goals are more in line with the statistical independence assumption.)

In summary, the class of a team does come out over the course of a league season, and random variations in the teams account for a substantial number of results in soccer, in particular the narrow results and the 0-0, 1-1, and 2-2 draws. Because the number of goals (points) and scoring opportunities is so low in soccer compared to other sports, random effects are much more significant. The most interesting description of a soccer match that I've read comes at the end of this paper when the authors state:

"…a soccer match is equivalent to two teams throwing a dice. The number 6 means goal and the number of attempts of both teams is fixed already at the beginning of the match, reflecting their respective fitness in that season."

If you're willing to brave the highly concentrated mathematics, the abuse of notation, and the hand-waving (necessary in a five-page paper), the paper has plenty to chew on for those who like to think about how seemingly random a goal is in football. It also illustrates how much of a fool's errand it is to predict the exact score of a football match, but it doesn't stop millions from attempting to do so, for which the betting houses are grateful.

There is a fair amount of overlap between regions, but there are some trends that are apparent. First of all, there doesn't seem to be much of a correlation between offensive goal variances and average points won per match, except for perhaps the poorest teams. On the other hand, there does appear to be a correlation between defensive goal variances and average points won per match. The elite clubs — the ones averaging more than two points per match — had very low defensive variances, or to put it another way, a more consistent defensive unit.

So if your team's defense lets in few goals and does so consistently, more often than not your side will be at or near the top of the table. Inconsistent teams, naturally, have more inconsistent results and an uncertain outcome in the table.

To this end, I've written a script that allows me to convert a results matrix into goal scoring data per team in columnar form. Such a format would allow me to do a lot more things with the data, such as running the data through my mathematical and statistical packages to examine the goal distribution and obtain summary data. I doubt that I'm the first person to do something like this, but I haven't seen a similar code presented elsewhere. I am also sure that I'm not the only person who would find such a script useful, so I will share it here.

The code is called **ParseMatrix** and is written in Perl. It is best used from the command line with the following options:

**./ParseMatrix <matrix.file> <team.file>**

The matrix file is the collection of match results, with no team descriptors included. On the diagonal there must be placeholders for the scoreline (I use X-X, but they can be any non-numeric characters). The team file is a column list of the corresponding league team names. No spaces are allowed, so a name like "Manchester United" must be written as "Manchester_United".

The script reads the matrix into a two-dimensional array and for each team compiles the goals scored/allowed data — across columns for home matches, down rows for away matches. For each team, it saves the data to an output file **Goals_<TeamName>.dat**. The filename pattern can be changed to anything you wish, of course.

At this time the script only works for completed leagues or result matrices with all the placeholders included. You do have to insert the placeholders by hand; I haven't gotten around to automating that procedure. I'll leave that as an exercise to an enterprising reader.

Without further delay, here's the code. I hope you find it useful.
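The Perl listing itself doesn't survive in this copy of the post, but the row/column walk it describes can be sketched in a few lines of Python. The function name, placeholder default, and in-memory output format here are mine, not the original script's:

```python
def parse_matrix(matrix_lines, team_names, placeholder="X-X"):
    """Turn a results matrix (rows = home team, columns = away team,
    entries like '2-1', diagonal = placeholder) into per-team goal columns.

    Returns {team: {"scored": [...], "allowed": [...]}} mixing home and away.
    """
    goals = {t: {"scored": [], "allowed": []} for t in team_names}
    for i, line in enumerate(matrix_lines):
        for j, cell in enumerate(line.split()):
            if cell == placeholder:
                continue  # a team does not play itself
            home_goals, away_goals = (int(g) for g in cell.split("-"))
            # Row team is at home: read across the row for its home matches.
            goals[team_names[i]]["scored"].append(home_goals)
            goals[team_names[i]]["allowed"].append(away_goals)
            # Column team is away: the same cell read down the column.
            goals[team_names[j]]["scored"].append(away_goals)
            goals[team_names[j]]["allowed"].append(home_goals)
    return goals
```

Writing each team's columns out to a `Goals_<TeamName>.dat` file, as the original script does, is then a short loop over the returned dictionary.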

A couple of days ago I looked at Ajax Amsterdam's goalscoring distributions from this season and observed that their offensive distribution is skewed in the opposite direction from that of more typical goal distributions. More importantly, the offensive distribution was skewed in the opposite direction from the defensive goal distribution, which would make the curvefit of the underlying distribution very difficult. To find out if that was also the case for other teams with extremely lopsided goal statistics, I took a look at Barcelona's record in the Spanish Primera last season when they won The Treble. Below is a histogram and a smoothed probability density of their goal offense (horizontal axis is number of goals, vertical axis is probability from 0 to 1):

(As you can see, I'm starting to get the hang of using R.)

Here is the same type of plot with Barcelona's goal defense:

And finally, here's the final league table from the 2008-09 season with Pythagorean estimates:

| Team | GP | GF | GA | Pts | Pythag | +/- |
| --- | --- | --- | --- | --- | --- | --- |
| Barcelona | 38 | 105 | 35 | 87 | 76 | +11 |
| Real Madrid | 38 | 83 | 52 | 78 | 65 | +13 |
| Sevilla | 38 | 54 | 39 | 70 | 62 | +8 |
| Atlético Madrid | 38 | 80 | 57 | 67 | 61 | +6 |
| Villarreal | 38 | 61 | 54 | 65 | 56 | +9 |
| Valencia | 38 | 68 | 54 | 62 | 59 | +3 |
| Deportivo La Coruña | 38 | 48 | 47 | 58 | 52 | +6 |
| Málaga | 38 | 55 | 59 | 55 | 50 | +5 |
| Mallorca | 38 | 53 | 60 | 51 | 48 | +3 |
| Espanyol | 38 | 46 | 49 | 47 | 50 | -3 |
| Almería | 38 | 45 | 61 | 46 | 42 | +4 |
| Racing Santander | 38 | 49 | 48 | 46 | 52 | -6 |
| Athletic Bilbao | 38 | 47 | 62 | 44 | 43 | +1 |
| Sporting de Gijón | 38 | 47 | 79 | 43 | 35 | +8 |
| Osasuna | 38 | 41 | 47 | 43 | 47 | -4 |
| Valladolid | 38 | 46 | 58 | 43 | 44 | -1 |
| Getafe | 38 | 50 | 56 | 42 | 48 | -6 |
| Betis | 38 | 51 | 58 | 42 | 48 | -6 |
| Numancia | 38 | 38 | 69 | 35 | 33 | +2 |
| Recreativo | 38 | 34 | 57 | 33 | 36 | -3 |

Now, Barcelona's goal distributions are different from Ajax's in that they are both skewed in the same direction. This characteristic is typical of most teams. The difference in Barcelona's goal distribution is that a second peak pops up at six goals, which is known in the statistical parlance as a bimodal distribution. Last season's team scored six goals in a higher proportion of matches than they scored zero, four, or five. A Weibull curvefit would miss about half of that occurrence, which could explain the discrepancy in the final Pythagorean estimation.

Let's assume that the current curve fit estimates that Barcelona will score six goals in 5% of its matches, or 2 matches (.05*38=1.9). If Barcelona scores six goals in a match, the chances of them winning the match are very good, almost 100% in fact, so let's assume they take all points in those games. The difference between the curve fit and reality is about 7%, or about three matches (.07*38=2.66). So the failure to pick up the second mode in Barca's goalscoring distribution creates a discrepancy of nine points — just about the entire Pythagorean variation.

So it seems that a change in the distribution skewness doesn't have to be present to produce large changes in the Pythagorean estimate. Bimodal distributions also have the same effect.

**UPDATE (9 May)**: You know, maybe there's not much of a difference. I looked through my code and noticed that the win/draw probability calculations consider scenarios where a team has scored up to five goals in a game. That's usually sufficient for most leagues, but not in the Spanish league last season. I increased the upper limit to ten and recalculated, and this is what I got:

Team | GP | GF | GA | Pts | Pythag | +/-
---|---|---|---|---|---|---
Barcelona | 38 | 105 | 35 | 87 | 87 | 0
Real Madrid | 38 | 83 | 52 | 78 | 69 | +9
Sevilla | 38 | 54 | 39 | 70 | 62 | +8
Atlético Madrid | 38 | 80 | 57 | 67 | 65 | +2
Villarreal | 38 | 61 | 54 | 65 | 57 | +8
Valencia | 38 | 68 | 54 | 62 | 61 | +1
Deportivo La Coruña | 38 | 48 | 47 | 58 | 52 | +6
Málaga | 38 | 55 | 59 | 55 | 50 | +5
Mallorca | 38 | 53 | 60 | 51 | 48 | +3
Espanyol | 38 | 46 | 49 | 47 | 50 | -3
Almería | 38 | 45 | 61 | 46 | 42 | +4
Racing Santander | 38 | 49 | 48 | 46 | 53 | -7
Athletic Bilbao | 38 | 47 | 62 | 44 | 43 | +1
Sporting de Gijón | 38 | 47 | 79 | 43 | 35 | +8
Osasuna | 38 | 41 | 47 | 43 | 47 | -4
Valladolid | 38 | 46 | 58 | 43 | 44 | -1
Getafe | 38 | 50 | 56 | 42 | 48 | -6
Betis | 38 | 51 | 58 | 42 | 48 | -6
Numancia | 38 | 38 | 69 | 35 | 33 | +2
Recreativo | 38 | 34 | 57 | 33 | 36 | -3

Spot. On.
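The effect of that goal cap can be sketched with a toy stand-in. My code uses Weibull-based match probabilities, but an independent-Poisson model (with rates set to Barcelona's raw per-match averages) shows the same truncation behavior:

```python
from math import exp, factorial

def pois(k, lam):
    """Poisson probability mass function."""
    return lam ** k * exp(-lam) / factorial(k)

def expected_points(lam_for, lam_against, max_goals, matches=38):
    """Expected league points from a truncated double sum over scorelines."""
    p_win = p_draw = 0.0
    for a in range(max_goals + 1):
        for b in range(max_goals + 1):
            p = pois(a, lam_for) * pois(b, lam_against)
            if a > b:
                p_win += p
            elif a == b:
                p_draw += p
    return matches * (3 * p_win + p_draw)

lam_for, lam_against = 105 / 38, 35 / 38   # Barcelona 2008-09 averages
for cap in (5, 10):
    print(cap, round(expected_points(lam_for, lam_against, cap), 1))
```

Capping the sum at five goals silently drops exactly the scorelines where a team like Barcelona wins, so the estimate comes in low; raising the cap recovers those points.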

What was most incredible about Barcelona's season was that it obscured the fact that Real Madrid, Atletico Madrid, and Sevilla were also playing at a high level.

I still think it might be useful to look at common features of teams with large Pythagorean variances. I just don't think it applies in Barcelona's case.

Most league teams exhibit skewness, or asymmetry, in their goalscoring distribution, in almost all cases to the right. This is called positive skew in the literature. To give one example, here is Manchester United's goals scored distribution from last season's English Premier League:

The distribution is concentrated between one and two goals, with the higher-scoring events spread out to the right. It's fairly typical of most teams, including league winners like Man U last season.

For completeness, here's Man U's goals allowed distribution from last season:

Here the skewness is more pronounced, as the defense produced a majority of clean sheets and allowed very few goals in the other matches. (Man U's defense allowed three or more goals in only two league matches in 2008-09.) This is a typical observation in league winners; United's record is very good compared to most champions, but it's not unusual for a champion's defensive record to be so positively skewed.

When we develop the Pythagorean exponent, we come up with parameter estimates that fit both the offensive and defensive distributions simultaneously. The big assumption is that both probability distributions follow a Weibull distribution. Because we have to fit the two distributions simultaneously, we can't capture every feature of either one, but we can capture enough to make the estimate fairly accurate within an acceptable tolerance.
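The joint fit can be sketched in pure Python. This is not my estimation code: the per-match goal counts below are hypothetical, goals are treated as continuous data (itself part of the approximation), and a crude grid search stands in for a proper optimizer. The key feature is a single Weibull shape parameter shared by the offensive and defensive samples.

```python
from math import log

def weibull_logpdf(x, k, lam):
    # log of the Weibull density f(x) = (k/lam)*(x/lam)**(k-1)*exp(-(x/lam)**k)
    return log(k / lam) + (k - 1) * log(x / lam) - (x / lam) ** k

# Hypothetical per-match goal counts; the half-goal shift keeps the
# continuous Weibull away from the zero counts
gf = [g + 0.5 for g in [2, 1, 0, 3, 1, 2, 4, 1, 0, 2, 3, 1, 2, 0, 1, 2, 1, 3, 2, 1]]
ga = [g + 0.5 for g in [0, 1, 1, 0, 2, 0, 1, 0, 3, 1, 0, 2, 1, 1, 0, 1, 2, 0, 1, 0]]

def loglik(k, lam_f, lam_a):
    # One shared shape k fit jointly to both samples; separate scales
    # for the offensive and defensive distributions
    return (sum(weibull_logpdf(x, k, lam_f) for x in gf) +
            sum(weibull_logpdf(x, k, lam_a) for x in ga))

# Crude maximum-likelihood grid search over (shape, scale_for, scale_against)
best = max(((k / 100, lf / 10, la / 10)
            for k in range(80, 225, 5)
            for lf in range(10, 40, 2)
            for la in range(4, 30, 2)),
           key=lambda p: loglik(*p))
print(best)
```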

Now, here's Ajax's goals scored distribution:

Ajax's scoring record was quite different from what I've seen from most teams, including league winners. Their scoring distribution is strongly negatively skewed, meaning that they frequently scored goals in bunches during the season. (The bars in the histogram display the raw frequency of goals scored between the intervals on the goals axis; the blue line is the probability density.)

Their scoring defense record is even more amazing:

Much has been said about Ajax's scoring prowess this season, but not as much about their goal defense. They allowed just 20 goals in the league, and only four at home! There is very strong positive skewness in the distribution, along with a high level of peakedness (kurtosis). (I'm still learning R and haven't figured out how to smooth the density properly.)

I believe that this combination of scoring distributions causes the Pythagorean estimate to break down. The league Pythagorean exponent produces a probability distribution that can never correspond well to both the offensive and defensive distributions under these conditions. I am willing to guess that this happens during truly historic seasons by league winners, whether it's Barcelona in La Liga last season or perhaps Chelsea in the Premiership this season.

I am thinking that there needs to be an estimate of second-order points that would augment the initial Pythagorean estimate. In contrast to the second-order wins formulated for baseball, I seriously doubt that it can be determined by looking at the box score. One possible route is to consider the moments of the probability distribution about the mean and determine their effect on points won during a league season. That's one approach; I'm open to others.
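One way to start on that route, with invented counts for a bimodal, high-scoring season: compute the central moments of the goals-scored distribution. The skewness and excess kurtosis are the standardized third and fourth moments, the same quantities discussed above.

```python
def moments(xs):
    """Mean, variance, skewness, and excess kurtosis of a sample."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2 - 3.0   # excess kurtosis (normal = 0)
    return mean, m2, skew, kurt

# Hypothetical goals-scored counts for a high-scoring champion:
# note the second bump out at six goals
goals = [0] * 4 + [1] * 8 + [2] * 10 + [3] * 6 + [4] * 2 + [5] * 2 + [6] * 6
print(moments(goals))
```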

How does the probability of the final score change with the relative strength of the two teams, home advantage, time elapsed, and the current score? This publication describes what's called a "birth process" model and it is shown to be useful in modeling not just the final score, but also the evolution of that score during the course of the match. The model is useful for testing some of the common clichés heard in football and (potentially) making some money at the betting house.

---

This paper has been burning a hole in my hopper for the past six months, and I've read it off and on in the meantime, but today I've been busy reviewing papers for an upcoming conference. So I've decided to review this one for this site and get it over with! This publication by Mark Dixon and Michael Robinson — at the time, two statistics researchers at British universities — is a precursor to the research done by Bittner et al. on "football fever", which explains how the probability of scoring goals changes after a goal has been scored. Dixon and Robinson go a little further in their paper.

There are a number of reasons to apply statistical analysis to sport, beyond the fact that it's a cool way to introduce and test statistical methodologies and get one's name in the general media. Other reasons are:

- to assess current strategies and suggest strategic improvements to individuals and teams,
- to estimate the comparable market value of individual players, thus determining who might be "undervalued" or "overvalued" in the transfer or draft market,
- to examine the fairness of the rules of the game or competition,
- to predict future outcomes, for either media or betting purposes.

The Dixon/Robinson paper falls into the fourth category. It also builds on results by Dixon/Coles in a 1997 publication [1] that presented a statistical model for full-time results. This model (a multi-parameter Poisson distribution) considers the attack and defense qualities of the competing sides and a home advantage factor. The parameters are estimated by fitting the probabilities that teams A and B will achieve a particular scoreline to the full-time score data. The deficiencies of the model are that it assumes the performance rate of a team is constant throughout the tournament (a common failing of a lot of these static statistical models) and that it cannot model the evolution of a team's performance during a match.
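A bare-bones caricature of that full-time model, with made-up parameter values and without the low-score dependence correction Dixon/Coles add:

```python
from math import exp, factorial

def pois(k, lam):
    """Poisson probability mass function."""
    return lam ** k * exp(-lam) / factorial(k)

def score_prob(home_goals, away_goals, attack_h, defense_h,
               attack_a, defense_a, home_adv):
    """P(home scores h, away scores a) under independent Poissons.

    Each side's scoring rate combines its attack strength, the
    opponent's defensive weakness, and (for the host) a home-advantage
    factor, as in the Dixon/Coles formulation.
    """
    lam_home = attack_h * defense_a * home_adv
    lam_away = attack_a * defense_h
    return pois(home_goals, lam_home) * pois(away_goals, lam_away)

# Hypothetical parameters: a strong home side against a weaker visitor
p = score_prob(2, 0, attack_h=1.6, defense_h=0.8,
               attack_a=1.1, defense_a=1.0, home_adv=1.3)
print(round(p, 4))
```

In practice the attack, defense, and home-advantage parameters are estimated by maximum likelihood over a season's worth of full-time scores.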

What Dixon/Robinson propose is the main result of the paper, the two-dimensional birth process model. I still don't understand completely how it works, but I interpret it to be a way to consider the two competing scoring processes simultaneously. The idea is that the scoring rate changes during the match, and the variation of this rate depends on the current score. Dixon/Robinson show a two-dimensional chart that looks like a series of steps as the scoreline changes. It's not clear to me how to implement such an algorithm — I really need some quiet time to understand everything — but it will be a difficult task. First of all, one needs a TON of data, and full-time scorelines aren't enough. I would need to know the goal times to properly use this model. Dixon/Robinson used results from all four English professional divisions over three seasons. That's over 4000 matches and almost 10,400 goal times. This is the kind of research for which a good soccer result database is ideal.
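My reading of the birth-process idea, as a minute-by-minute simulation; the rate values and score-dependent adjustments below are invented for illustration, not taken from the paper.

```python
import random

def simulate_match(base_home=0.016, base_away=0.012, seed=None):
    """One match as two competing counting processes whose per-minute
    scoring intensities depend on the current score."""
    rng = random.Random(seed)
    home = away = 0
    for minute in range(90):
        lam_h, lam_a = base_home, base_away
        if home < away:          # trailing home side presses (assumed)
            lam_h *= 1.3
        elif away < home:        # trailing away side presses (assumed)
            lam_a *= 1.3
        if minute >= 75:         # rates rise late in the match (assumed)
            lam_h *= 1.2
            lam_a *= 1.2
        if rng.random() < lam_h:
            home += 1
        if rng.random() < lam_a:
            away += 1
    return home, away

results = [simulate_match(seed=s) for s in range(1000)]
print(sum(h + a for h, a in results) / len(results))  # mean total goals
```

Fitting such a model means estimating the intensity adjustments from observed goal times, which is why full-time scorelines alone aren't enough.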

Dixon/Robinson are able to draw some conclusions from the model:

- The scoring rate for both teams generally increases during a match. It jumps at 45 and 90 minutes, but Dixon/Robinson lump all of the injury-time goals into the 45th or 90th minute. (With the recent FIFA goal-time conventions it would be interesting to show how the scoring rate changes during the stoppage-time period.)
- The attack and defense parameters in general tend to degrade from the Premiership to the lower divisions. In other words, as you go down the divisions, strikers tend to score less by their own efforts than by opportunities given by the poorer defense of the lower division sides.
- The scoring rates of the home and away teams depend very much on the current score.

More controversially, Dixon/Robinson fail to find any evidence for the common football cliché that a team is never more vulnerable to being scored on than immediately after it scores a goal. Bittner et al. found support for the opposite: a team that scores a goal increases its probability of scoring subsequent goals.

In the final section, Dixon/Robinson show that their model could be used in spread betting situations, in which a betting house could use the model to set prices or a bettor could determine what kind of bet to make. Spread betting refers to the then-growing practice of betting that a team will score more or less than a given result, and then winning (or losing) money according to the type of bet made. It's very similar to buying calls and puts in options trading. (I'll leave it to the reader to think of all the similarities between most stock/commodities trading and gambling.) Dixon/Robinson show that there might be some inefficiencies in the betting prices, which would be of huge interest to bettors, and definitely to the betting companies as well!

Well, for those of us in the USA that kind of sports betting isn't an issue, but the paper does present a more sophisticated statistical model accompanied by insights that might be useful in other situations. It's good, solid academic research, but outside of the gaming application and the multi-parameter Poisson distribution I'm not sure how useful it will be to soccermetricians.

[1] M. J. Dixon and S. G. Coles, "Modelling association football scores and inefficiencies in the football betting market", *Applied Statistics*, 46: 265-280, 1997.

Club | Alpha_GF | Alpha_GA | Exponent
---|---|---|---
Arsenal | 2.7560 | 1.5805 | 1.5030
Aston Villa | 2.2955 | 2.1485 | 1.4849
Blackburn | 2.0192 | 2.6941 | 1.4604
Bolton | 2.0961 | 2.3750 | 1.3561
Chelsea | 2.6428 | 1.2474 | 1.4092
Everton | 2.3008 | 1.6886 | 1.4051
Fulham | 1.9156 | 1.7100 | 1.3650
Hull City | 1.9233 | 2.3686 | 1.8368
Liverpool | 2.9020 | 1.3778 | 1.4974
Manchester City | 2.4678 | 2.2673 | 1.6112
Manchester United | 2.4407 | 1.1003 | 1.6722
Middlesbrough | 1.6117 | 1.9444 | 1.6419
Newcastle United | 1.9951 | 2.4601 | 1.8405
Portsmouth | 1.9960 | 2.3075 | 1.4803
Stoke City | 1.8209 | 2.3877 | 1.5949
Sunderland | 1.7685 | 2.2380 | 1.5820
Tottenham | 1.9768 | 1.9093 | 1.6944
West Brom | 1.8442 | 2.9093 | 1.3816
West Ham | 2.0579 | 2.0006 | 1.5762
Wigan | 1.7025 | 1.9550 | 1.3951

The mean of the exponent is **1.5394** (median 1.5002), with a standard deviation of **0.1457**. The numbers that I've seen on the web for the exponent term are right in the 1-sigma range of this estimate. It was also close to my guess of 1.5, which was more of a gut instinct than anything else.
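The summary statistics can be reproduced directly from the exponent column of the table (sample standard deviation, with n−1 in the denominator):

```python
exponents = [1.5030, 1.4849, 1.4604, 1.3561, 1.4092, 1.4051, 1.3650,
             1.8368, 1.4974, 1.6112, 1.6722, 1.6419, 1.8405, 1.4803,
             1.5949, 1.5820, 1.6944, 1.3816, 1.5762, 1.3951]

n = len(exponents)
mean = sum(exponents) / n

s = sorted(exponents)
median = (s[n // 2 - 1] + s[n // 2]) / 2   # n is even

var = sum((x - mean) ** 2 for x in exponents) / (n - 1)  # sample variance
std = var ** 0.5

print(round(mean, 4), round(median, 4), round(std, 4))  # 1.5394 1.5002 0.1457
```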

I suppose that if I want to make a stronger claim that this curve-fit is the right one, I would perform a chi-square goodness-of-fit test, but I'll leave that for later or as an exercise for someone more enterprising.
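For the record, the test statistic itself is only a few lines. The observed and expected counts below are invented placeholders (binned matches by goals scored), with two degrees of freedom subtracted for the two fitted Weibull parameters:

```python
# Hypothetical observed matches with 0, 1, 2, 3, and 4+ goals scored
observed = [6, 12, 10, 6, 4]
# Hypothetical expected counts implied by the fitted curve
expected = [7.2, 11.1, 9.8, 6.3, 3.6]

# Pearson chi-square statistic: sum of (O - E)^2 / E over the bins
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
dof = len(observed) - 1 - 2   # bins minus 1, minus 2 fitted parameters
print(round(chi2, 3), dof)
```

The statistic would then be compared against the chi-square distribution with `dof` degrees of freedom to accept or reject the Weibull fit.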

My solution approach is described in this document. I implemented it using a script in Scilab. If you'd like a copy of the script I can send it to you, but you will have to download Scilab.
