I’ve looked at Pythagorean tables for the most recent seasons of the English Premier League and noticed that in four of the last five seasons the Premier League Manager of the Season has come from a club with one of the two or three highest Pythagorean residuals in the table. Alan Pardew (2011-12) and Alex Ferguson (2012-13) managed teams with the highest Pythagorean residuals in the competition, and Ferguson (2010-11) and José Mourinho (2014-15) managed clubs that were very close to the highest residual. But if you look at the list of Manager of the Season recipients since the 1999-2000 season, you see something different:

Since the end of the 20th century, six winners of the Manager of the Season award managed clubs that were the biggest overachievers in the competition, according to the soccer Pythagorean table. If you want to be generous and include those managers whose clubs were within a point of the highest residual, there are nine Managers of the Season whose teams over-performed the most. So only about 50% of the awards went to managers of the most over-performing teams in the Premier League. It’s possible that the Pythagorean residual isn’t very informative when it comes to assessing managers, and some baseball sabermetricians have been similarly pessimistic about its utility.

So it appears that the predictive power of the Pythagorean residual to identify overachieving managers may not be much stronger than flipping a coin. Could actual points won, together with the Pythagorean residual, predict the team coached by the Manager of the Season? Let’s build a support vector machine (SVM) classifier to find out.

Support vector machines were invented by Vladimir Vapnik and Alexey Chervonenkis in the 1960s and refined by Vapnik and collaborators in the 1990s. Support vector machines are used primarily to create boundaries in space that classify data points into one of two (or more) categories. You can use a logistic regression to classify data points as well, but support vector machines have twin advantages of being more robust and able to create nonlinear boundaries more easily.

For this classifier, the independent variables are the actual points won by a team in a given season, and the expected point total for that team. The dependent variable determines whether the team’s manager won the Manager of the Season award — yes=1, no=0. The SVM classifier uses a radial basis function kernel (C=1.0, γ=1.0) and is trained with 11 seasons (220 data points) of end-of-season point totals for Premier League teams. Five seasons of data, or 100 data points, are reserved to test the performance of the SVM.
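To make the kernel settings concrete, here is a minimal sketch of a radial basis function kernel in its usual form, K(x, z) = exp(-γ‖x - z‖²), with γ = 1.0 as quoted above. The function name and the sample points are illustrative assumptions rather than the code behind this post, and treating the features as raw (actual points, expected points) pairs is also an assumption.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * squared_distance)

# Hypothetical (actual points, expected points) pairs, for illustration only.
team_a = (70.0, 65.0)
team_b = (71.0, 65.0)
team_c = (50.0, 52.0)

print(rbf_kernel(team_a, team_a))  # identical points: maximum similarity
print(rbf_kernel(team_a, team_b))  # one point apart: exp(-1)
print(rbf_kernel(team_a, team_c))  # far apart: similarity is effectively zero
```

With γ = 1.0 on unscaled point totals, similarity falls away within a point or two, so the boundary can bend tightly around individual training examples.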

I ran the SVM at least ten times to assess the average performance of the classifier on the test set. Here it is:

|  | Predicted: No | Predicted: Yes |
|---|---|---|
| Truth: No | 82.7 | 12.3 |
| Truth: Yes | 1.4 | 3.6 |
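One way to read the confusion matrix above is through the usual summary rates. The short sketch below works them out from the four averaged cells; the variable names are mine, and the cells are treated as average counts over the 100 test points.

```python
# Average confusion-matrix cells from the table above (100 test points).
true_negative = 82.7   # truth No, predicted No
false_positive = 12.3  # truth No, predicted Yes
false_negative = 1.4   # truth Yes, predicted No
true_positive = 3.6    # truth Yes, predicted Yes

total = true_negative + false_positive + false_negative + true_positive

accuracy = (true_positive + true_negative) / total
recall = true_positive / (true_positive + false_negative)
precision = true_positive / (true_positive + false_positive)

print(f"accuracy:  {accuracy:.3f}")
print(f"recall:    {recall:.3f}")
print(f"precision: {precision:.3f}")
```

Read this way, the classifier finds roughly 72% of the actual winners, but only about 23% of the teams it flags are actual winners.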

Here is what a sample classifier looks like, overlaid with points from the test data set:

The classifier assigns those teams with leading point totals and Pythagorean residuals greater than zero to the Manager of the Season category. In most years, this is a reasonable thing to do, as almost all of the Premier League Managers of the Season have managed the top teams in the league and/or influenced them to perform much better than an average team with similar goal statistics. However, such a classifier will miss winners such as George Burley with Ipswich Town or Tony Pulis with Crystal Palace, whose sides overcame preseason expectations as opposed to statistical expectations from matches already played. Nonetheless, the classifier, as simplistic as it is, does a fairly good job of identifying possible winners of Manager of the Season, as long as those managers lead clubs near the top of the table. Does it provide insight that an observer couldn’t obtain by considering the champion or another team in the top three? No.

It remains to be seen if this classifier can do as good a job of predicting Manager of the Season for this season (2015-16). I’ll revisit this question after Spurs’ match later today.

**UPDATE** (4/25, 23:55): With the latest results, I reran the classifier and predicted the contenders for Manager of the Season for 2015-16. Out of the eight runs that I conducted, Leicester City was a positive hit every time, followed by Tottenham Hotspur and (rarely) Arsenal and Manchester United. It’s almost certain that Claudio Ranieri will win the Manager of the Season award, so it’s not like the classifier is providing any special insight.

To visualize the Pythagorean points that resulted from a goals scored and allowed combination, I created a contour plot. I considered a wide range of goals scored or allowed — from zero to 120 goals — to ensure that I would cover the theoretical range of league points. Each plot assumes a certain number of matches in a league season, and point totals are for an average team in the league competition. I suppose I could have used point averages from 0 to 3.0 in the contour levels, which would have reduced the plots to one, but in my opinion the contours are more understandable when they relate to numbers that people typically see in a league competition.

So here are the plots, starting with Pythagorean expectations for a 30-match league competition:

Next, a 38-match league competition:

And finally, a 46-match league competition:

I didn’t smooth the contours in the plot, which should explain the jagged look. But you can see hooks in the contours that become more pronounced at the highest expected point totals. I’m not sure what causes those hooks. The contours appear approximately parabolic at low point totals, straighten out to rays at higher point totals, and flatten out and start to lose coherence at extreme point totals. But there are ridges in the contours that also fall along lines in the goals scored/allowed plane. They’re very interesting plots once you start to study them.

The thought didn’t occur to me until I started writing this post, but it would have been useful to overlay straight lines that indicate constant goal difference onto the contour plot. I might go back and insert them in, but my curiosity has taken me far enough.

As you know, ResultsPage calculates various types of league tables on a round-by-round basis, one of them being the Pythagorean table. We’ve isolated the code that calculates the Pythagorean expectation and we now expose it to the outside world as its own API call.

In case you don’t know what I’m talking about, API is short for Application Programming Interface, and it is a collection of programming instructions that allow applications to talk to each other. It provides a means to access and interact with data and information from multiple sources in an automated and creative way.

Programs interact with APIs in a variety of ways. Some pass chunks of XML data back and forth. We decided to design our API on REST principles. The specifics aren’t that important — what is important is that you can access the Soccer Pythagorean through a URL address.

So here’s the Soccer Pythagorean API call:

http://fmrdlight.herokuapp.com/analytics/pythagorean?matches=XX&scored=XX&allowed=XX

All you have to do is add numbers where the XXs are for number of matches played, goals scored and allowed. **That’s it**.

What do you get in return? Here’s an example:

http://fmrdlight.herokuapp.com/analytics/pythagorean?matches=24&scored=50&allowed=25

returns the following:

{ "points": 47, "match": 24, "loss": 5, "allowed": 25, "win": 14, "draw": 5, "scored": 50 }

The expected point total is in the “points” field, while “win”, “draw”, and “loss” express one possible league record that would result in that point total (calculated from win/draw probabilities).
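As a quick sanity check on how those fields relate, the sketch below parses the sample response shown above and rebuilds the point total from the record, using three points for a win and one for a draw. The variable names are illustrative.

```python
import json

# Sample response copied from the example above.
response_text = (
    '{ "points": 47, "match": 24, "loss": 5, "allowed": 25, '
    '"win": 14, "draw": 5, "scored": 50 }'
)
record = json.loads(response_text)

# Three points for a win, one for a draw.
points_from_record = 3 * record["win"] + record["draw"]
matches_from_record = record["win"] + record["draw"] + record["loss"]

print(points_from_record, matches_from_record)
```

For this example the record accounts for all 24 matches and reproduces the value in the “points” field.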

Now all of this is work in progress, so it’s likely that the URL root (fmrdlight.herokuapp.com) will change as the API matures. For now, it’s an open API, so don’t be a jerk. We’ll add some helper functions for those who don’t want to type a URL into a browser. We can create one in Python; maybe others can create their own.
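In that spirit, here is a rough sketch of what a Python helper might look like, using only the standard library. The function names are hypothetical, the URL root is the one quoted in this post (which may change), and the response is assumed to be JSON shaped like the example above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# URL root taken from the post; the post warns that it may change.
API_ROOT = "http://fmrdlight.herokuapp.com/analytics/pythagorean"

def build_pythagorean_url(matches, scored, allowed, root=API_ROOT):
    """Return the query URL for matches played, goals scored and goals allowed."""
    query = urlencode({"matches": matches, "scored": scored, "allowed": allowed})
    return f"{root}?{query}"

def fetch_pythagorean(matches, scored, allowed, root=API_ROOT, timeout=10):
    """Call the API and return the decoded JSON response as a dict."""
    url = build_pythagorean_url(matches, scored, allowed, root)
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Building the URL needs no network access; fetch_pythagorean() does.
print(build_pythagorean_url(24, 50, 25))
```

Keeping the URL builder separate from the network call makes it easy to swap in a new root if the address moves.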

So play around with the Pythagorean API. We hope this feature encourages its wider use in the football analytics community.

Before the soccer Pythagorean was derived, a lot of people attempted to apply the baseball Pythagorean directly to soccer. I haven’t written on its shortcomings in much detail, but in this post I’m going to go into the reasons why the original formula doesn’t work for football.

Here is Bill James’ original Pythagorean formula:
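In standard notation, with RS for runs scored, RA for runs allowed, and x for the exponent (symbols chosen here for illustration), the formula is usually written as:

```latex
\[
\text{Win\%} = \frac{RS^{x}}{RS^{x} + RA^{x}},
\qquad x = 2 \text{ in James' original version.}
\]
```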

This equation (or to be really technical, a *model*) relates win percentage in baseball to runs scored and runs allowed. Now, Bill James originally used 2.0 as the exponent term that best fit the expected win percentages of all the teams in a league to their real values. Selecting 2.0 as the exponent is convenient because it makes the formula look like the famous Pythagorean formula and permits some visual insight (I’ll leave that as an exercise for the motivated reader). Later sabermetricians showed that the best exponent to use is approximately 1.8, which speaks very well of James’ intuition.

Say you want to apply this formula to soccer. Replace “runs” with “goals” and win percentage with points percentage (points earned / possible points), and then set the Pythagorean exponent to the Greek letter γ:
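Carrying out that substitution, with GF for goals scored and GA for goals allowed (symbol names assumed here), gives a sketch of the adapted formula and the expected points that follow from it:

```latex
\[
\text{Points\%} = \frac{GF^{\gamma}}{GF^{\gamma} + GA^{\gamma}},
\qquad
\text{Expected points} = 3 \, N_{\text{matches}} \times \text{Points\%}
\]
```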

Multiply points percentage by total possible points (number of league matches by three points) and you get expected points. It’s no surprise that so many have attempted to use James’ formula first. It’s simple and intuitive to use and understand. But when applied to football the baseball Pythagorean falls short in two ways:

- The baseball Pythagorean consistently overestimates point totals.
- The root-mean-square error of the estimation is very high.

To illustrate, let’s apply James’ Pythagorean to last season’s English Premier League.

To obtain the Pythagorean exponent that best fits the expectation to reality, we take the difference between each team’s actual point total and its expected total, square that amount, repeat the process for all of the teams in the league, and then average the values. This is the league mean-square error of the Pythagorean, and if you take the square root you get the league **root-mean-square error** or **RMSE**. We want to find the Pythagorean exponent that minimizes the league RMSE.
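The procedure can be sketched as a simple sweep over candidate exponents. The goals and points below are copied from the Premier League table in this post; the function names, the 38-match season length, and the sweep range of 0.50 to 3.00 are my own choices for illustration, and the James-style formula is applied to goals as described above.

```python
import math

# (goals for, goals against, actual points), copied from the
# Premier League table in this post.
TEAMS = [
    (93, 29, 89), (89, 33, 89), (74, 49, 70), (66, 41, 69), (56, 51, 65),
    (65, 46, 64), (50, 40, 56), (47, 40, 52), (48, 51, 52), (45, 52, 47),
    (44, 51, 47), (52, 66, 47), (45, 46, 45), (36, 53, 45), (42, 62, 43),
    (37, 53, 38), (43, 66, 37), (46, 77, 36), (48, 78, 31), (40, 82, 25),
]
MATCHES = 38  # matches per team in a Premier League season

def james_expected_points(goals_for, goals_against, gamma, matches=MATCHES):
    """James-style points percentage times the maximum possible points."""
    points_pct = goals_for ** gamma / (goals_for ** gamma + goals_against ** gamma)
    return 3 * matches * points_pct

def league_rmse(gamma, teams=TEAMS):
    """Root of the mean squared (actual - expected) points over the league."""
    squared_errors = [
        (pts - james_expected_points(gf, ga, gamma)) ** 2 for gf, ga, pts in teams
    ]
    return math.sqrt(sum(squared_errors) / len(teams))

# Sweep gamma from 0.50 to 3.00 in steps of 0.05 and keep the minimum.
candidates = [round(0.5 + 0.05 * k, 2) for k in range(51)]
best_gamma = min(candidates, key=league_rmse)
print(best_gamma, round(league_rmse(best_gamma), 3))
```

A grid sweep like this is crude but transparent; a numerical optimizer would give a finer answer at the cost of hiding the shape of the curve.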

We can find this exponent graphically by calculating the league RMSE over a range of values and plotting them. (There are more sophisticated ways to find the exponent, but the math is much more involved.) Such a plot looks like this:

The league RMSE bottoms out (*reaches a minimum*, to be precise) at around γ = 1.3, which we will call the **league Pythagorean exponent**. As a quick aside, how does the league exponent change for different leagues? Let’s look at last season’s Spanish La Liga:

Here the league RMSE reaches a minimum at γ = 1.2. In fact, the league Pythagorean exponents for Bill James’ formula are between 1.1 and 1.4, admittedly over a small sample of leagues (I looked at last season’s Big Five European leagues and the just-concluded MLS regular season).

But take a look at both figures again. The red line is the RMSE for the soccer Pythagorean that I derived, using the Pythagorean exponent that best fit expected points to reality over a large number of leagues. (It’s our ‘universal’ Pythagorean exponent of 1.70.) In both plots, the best-case RMSE from James’ Pythagorean formula applied to soccer is still larger than the RMSE of the soccer Pythagorean. Over the leagues that I’ve plotted this finding persists. Again, it’s a small sample space and I’ll leave an exhaustive study to someone else (I’ll happily link to the post that presents one), but I’m confident that even in the best case, the Jamesian Pythagorean has a consistently higher RMSE than my soccer Pythagorean.

So what do the estimated point totals look like? Once again, let’s look at the English Premier League. The table below presents goals scored/allowed by Premiership teams, with actual point totals, Pythagorean expectations (James’ and mine), and the resulting residuals. The James Pythagorean uses an exponent of 1.3, and the soccer Pythagorean uses an exponent of 1.7.

| Team | GF | GA | Pts | Pythag (BJ) | Pythag (HH) | Pts – PyBJ | Pts – PyHH |
|---|---|---|---|---|---|---|---|
| Manchester City | 93 | 29 | 89 | 93 | 88 | -4 | 1 |
| Manchester United | 89 | 33 | 89 | 89 | 82 | 0 | 7 |
| Arsenal | 74 | 49 | 70 | 72 | 68 | -2 | 2 |
| Tottenham Hotspur | 66 | 41 | 69 | 74 | 69 | -5 | 0 |
| Newcastle United | 56 | 51 | 65 | 60 | 55 | 5 | 10 |
| Chelsea | 65 | 46 | 64 | 70 | 63 | -6 | 1 |
| Everton | 50 | 40 | 56 | 65 | 59 | -9 | -3 |
| Liverpool | 47 | 40 | 52 | 63 | 56 | -11 | -4 |
| Fulham | 48 | 51 | 52 | 55 | 49 | -3 | 3 |
| West Bromwich Albion | 45 | 52 | 47 | 52 | 46 | -5 | 1 |
| Swansea City | 44 | 51 | 47 | 52 | 46 | -5 | 1 |
| Norwich City | 52 | 66 | 47 | 48 | 45 | -1 | 2 |
| Sunderland | 45 | 46 | 45 | 56 | 50 | -11 | -5 |
| Stoke City | 36 | 53 | 45 | 43 | 41 | 2 | 4 |
| Wigan Athletic | 42 | 62 | 43 | 43 | 40 | 0 | 3 |
| Aston Villa | 37 | 53 | 38 | 44 | 41 | -6 | -3 |
| Queens Park Rangers | 43 | 66 | 37 | 42 | 39 | -5 | -2 |
| Bolton Wanderers | 46 | 77 | 36 | 39 | 35 | -3 | 1 |
| Blackburn Rovers | 48 | 78 | 31 | 40 | 35 | -9 | -4 |
| Wolverhampton Wanderers | 40 | 82 | 25 | 32 | 29 | -7 | -4 |
| RMSE | | | | | | 5.808 | 3.814 |

The main feature of the Jamesian Pythagorean when applied to soccer is that it **consistently overestimates point totals**. There are some teams that persistently overperform in the Jamesian Pythagorean and the soccer Pythagorean, such as Newcastle, Stoke, and Wigan, and some teams identified as underperformers in the soccer Pythagorean have highly negative residuals in James’ Pythagorean. Manchester United’s performance — perceived as significantly overperforming according to the soccer Pythagorean — is in line with statistical expectations according to James’ expectation.

It’s possible to argue that both expectations do quite well at identifying the significant outliers in a league competition. My perspective is that the inability to estimate the probability of draws in the Jamesian Pythagorean yields an estimator that has a persistently high RMSE and high level of bias. Neither characteristic is found in the Jamesian Pythagorean when applied to baseball, and I would hypothesize that neither is present when it is applied to basketball, American football, or any other sport with no draws (or very few).
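The bias can be read straight off the residual columns of the table above. The sketch below computes the mean residual and the RMSE for both estimators from the rounded residuals as printed; the function names are mine, and because the inputs are rounded, the James RMSE comes out slightly different from the 5.808 in the table.

```python
import math

# Residuals (actual points minus expected points) copied from the table above,
# in table order. These are the rounded values as printed.
residuals_james = [-4, 0, -2, -5, 5, -6, -9, -11, -3, -5,
                   -5, -1, -11, 2, 0, -6, -5, -3, -9, -7]
residuals_soccer = [1, 7, 2, 0, 10, 1, -3, -4, 3, 1,
                    1, 2, -5, 4, 3, -3, -2, 1, -4, -4]

def bias(residuals):
    """Mean residual: negative means the estimator overshoots actual points."""
    return sum(residuals) / len(residuals)

def rmse(residuals):
    """Root of the mean squared residual."""
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

for name, residuals in [("James", residuals_james), ("soccer", residuals_soccer)]:
    print(f"{name}: bias {bias(residuals):+.2f}, RMSE {rmse(residuals):.3f}")
```

From the rounded residuals, the James estimate overshoots by a little over four points per team on average, while the soccer Pythagorean’s mean residual is about half a point.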

So to conclude, Bill James’ Pythagorean expectation provides a lot of insight in baseball and has been adapted to other sports, but it fails in soccer. We’ve addressed the underlying assumptions of the original formula and developed a new metric that adapts those assumptions to football competitions. We like the results and we’re developing more metrics like it to enlighten our understanding of this great game.

When I wrote my last big post on the soccer Pythagorean, I said that Pythagoreans are nothing more than win/loss/draw probability calculations given the expected averages of goals scored and goals allowed. The term “expected” is important: the means are adjusted by a translation parameter in the soccer Pythagorean (they’re also adjusted in the baseball Pythagorean that Steven Miller derived but not in Bill James’ original). They are also **joint probabilities** in that we are considering the probability that team X has scored *c* goals and team Y has scored fewer than *c* goals in the same match. (We use team X’s offensive and defensive goal averages to come up with both probabilities, and some sharp people will object that team Y has its own influence as well, but it’s an approximation that serves us well.)

What makes the soccer Pythagorean different from those for basketball, baseball, and American football is that we have to deal with drawn results, so we have to sum probabilities over the possible range of goals scored. Practically this means that for win probabilities we sum up to 10 and for draw probabilities we sum up to 6 (a 6-6 draw). If you come across a league where a team has scored more than 10 goals in a match, just increase the range of summation.

If you look at the partial sums of the Pythagorean, they look something like they do in the figures below. Here they are for Manchester City’s final Pythagorean estimate in the 2011-12 Premier League season.

Given City’s averages of 2.45 goals scored/match and 0.76 goals allowed/match, they would be expected to win 70.3% of their matches and draw 17.9%.

As you would expect, the cumulative draw probability flat-lines (hits its asymptote, to be way too technical) pretty quickly. Very few score draws beyond 3-3 were recorded in the Premier League (there was one 4-4 between Swansea and Wolves). It’s interesting to observe that scoring no goals gives City a 5% probability of a draw (and zero probability of a win, of course). The cumulative win probability maxes out around six or seven goals; each term is the probability that team X will score c goals **and** team Y will score fewer than that. The probability of the second event may be nearly 1.0, but the probability of the first could be very small, which would result in a very small product. So the soccer Pythagorean is computing the **cumulative joint probability** of a win and a draw.
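The structure of those partial sums can be sketched in a few lines. Note the swap: the code below uses independent Poisson goal counts purely as a stand-in, which is not the distribution behind the soccer Pythagorean, so its numbers will not match the 70.3% and 17.9% quoted above. The function names are mine; the summation limits of 10 and 6 and City’s averages come from the text.

```python
import math

def poisson_pmf(k, mean):
    """P(goals == k) for a Poisson variable (stand-in distribution)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def poisson_cdf_below(c, mean):
    """P(goals < c) for a Poisson variable."""
    return sum(poisson_pmf(k, mean) for k in range(c))

def cumulative_win_draw(mean_scored, mean_allowed, max_win_goals=10, max_draw_goals=6):
    """Return the partial sums of the win and draw probabilities."""
    win_partial, draw_partial = [], []
    win_total = draw_total = 0.0
    for c in range(max_win_goals + 1):
        # Joint probability: team X scores exactly c AND team Y scores fewer than c.
        win_total += poisson_pmf(c, mean_scored) * poisson_cdf_below(c, mean_allowed)
        win_partial.append(win_total)
        if c <= max_draw_goals:
            # Joint probability: both teams score exactly c.
            draw_total += poisson_pmf(c, mean_scored) * poisson_pmf(c, mean_allowed)
            draw_partial.append(draw_total)
    return win_partial, draw_partial

# Manchester City's averages quoted above.
wins, draws = cumulative_win_draw(2.45, 0.76)
for c, w in enumerate(wins):
    d = draws[c] if c < len(draws) else draws[-1]
    print(f"c={c:2d}  cumulative win={w:.3f}  cumulative draw={d:.3f}")
```

Even with the stand-in distribution, the same qualitative picture appears: the c = 0 term contributes to the draw probability but nothing to the win probability, and both running totals level off after a handful of goals.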

Pythagoreans in general calculate cumulative joint probabilities, but in sports that don’t have draws many of the terms are cancelled out. For sports that have lots of draws — like soccer — the Pythagorean doesn’t have that convenience. But it all works the same.

Thanks to all who helped bring this about, and it is my hope and expectation that this will be the first of many journal publications from Soccermetrics.

**UPDATE:** There is a Guest Access feature of the journal that allows readers to view the article at no charge. If you would rather not bother with that, send me an email to hhamilton -at- soccermetrics -dot- net and I will send you a copy. But I do suggest that you go through the JQAS website, as downloads from there are credited to my publication ranking.

**UPDATE #2:** I just found out that my paper is currently listed as the most popular JQAS publication so far in 2011. Still a long way to go in the year but I am amazed. Thank you so much!


I learned last night that my paper on the soccer Pythagorean estimation will appear in the *Journal of Quantitative Analysis in Sports*. I am very excited about the news for multiple reasons. First, this represents a culmination of over 18 months of work on the topic and I continue to be pleased with the interest that it has generated. Second, the JQAS is a well-respected journal in the sports analytics community and it's great news that one of my works will appear there. Third, it is a great starting point for Soccermetrics and an indication of the quality of work I want my business to achieve.

There were several people who I mentioned in the paper who helped me in the beginning, whether it was introducing me to the problem or sending me data in the initial stages of the project. I will send them a copy of the publication when the final draft is published. And if Berkeley Electronic Press gives me permission (authors usually retain copyright for journal publications), I will create a permanent link to the journal paper at this website.

As for publication date I'm not sure at this time, but I recall the editor saying that this article would be in a special edition on the 2010 NCSSORS, so it might appear in the spring.

Finally, thanks to all who have either helped with data collection, made comments on the formulation, or told your friends and colleagues about the soccer Pythagorean. I've felt for some time that this work makes a really good contribution to the area of soccer analytics, and the news from JQAS serves as confirmation of that.

It's helpful to take a look back at this project, from its start to the present time:

- I was cc'ed to an ongoing discussion in which someone was attempting to use the Pythagorean formula for soccer, but didn't know what value to use for the exponent. The first question in my mind was, "Okay, what is a Pythagorean?"
- I posed some questions about the Pythagorean and didn't like the answers that I was receiving, so I did some digging and found a very interesting paper by Steve Miller. It was the derivation from first statistical principles that I needed to get my work moving.
- Went to work, and three weeks later I had something. Because of the nature of the game, people are inevitably drawn to modeling goals in soccer with a Poisson distribution, but the mathematics don't work out (more on that in a future post). So I returned to a continuous distribution and developed an expression. Wrote a short paper on that which I later submitted to MIT's Sloan Sports Analytics Conference.
- The Pythagorean that I had developed for soccer was more complicated than the baseball or basketball versions, and it overpredicted point totals by 12-15 points. It did a fairly good job of predicting relative placings in a league, but that wasn't good enough. And if you added up the win/loss/draw probabilities, the totals would not sum to one, which was another problem.
- That paper that I submitted for the MIT SSAC? It got rejected. It turned out to be a blessing in disguise because of all the problems I found with the earlier derivation, but still no less disappointing.
- I didn't like the fact that the new Pythagorean was so darn complicated and scary-looking with all those exponential terms. It would scare away small children and math-phobic analysts. So I tried to apply some mathematical tools to simplify the expression. In the end, I gave up and made my peace with the complexity of the soccer Pythagorean.
- That above-mentioned work did have one positive side-effect: it made me realize that not only did I have to introduce a definition of a drawn match, I would also have to revise the definition of a win. This was the big breakthrough in my soccer Pythagorean work. The math derivation became a lot easier (the resulting equation was no less complicated, unfortunately), and all of the win/loss/draw probabilities summed correctly (i.e., to 1).
- I wrote a second paper that summarized the newest version of the soccer Pythagorean and posted it on this site.
- The formula started getting some attention from the press, but the reporters were suffering bad flashbacks to their days in high school math class. I attempted to put them at ease by writing an explanation of the Pythagorean formula with very little math content. Not sure if I succeeded, but they still read my blog, so that should count for something.
- Now that the hard work of developing the Pythagorean was over, I could concentrate on addressing the issue of a 'universal' Pythagorean exponent and the effect of Pythagorean goal variances on the quality of the estimate.
- And to wrap a bow around the whole effort, I presented my work at this year's NCSSORS.

So what were the lessons learned from this work? And what might it be good for?

- **Proceed from first principles.** Here I can't express how valuable it was to find Steve Miller's paper on the baseball Pythagorean. Seeing the work that he had done to derive that expression gave me the confidence that I could do the same for soccer with some tweaks and wrinkles. It's all about making some logical (and defensible) assumptions, expressing those assumptions in the language of mathematics, and then using the mathematics as a guide for answering the questions you have.
- **Stand on the shoulders of giants.** See above.
- **Understand your formula.** One of the most valuable pieces of advice that I received from my thesis advisor was that if you don't understand the model that you are using to do designs or analysis, you really shouldn't use it. Formulas are more defensible if you understand the meaning of every term and its influence on the final result. The Pythagorean is nothing more complicated than a win/draw probability calculation given the average scoring offense/defense. Then it's a matter of defining a win and a draw.
- **Pythagorean exponents do matter.** If the Pythagorean exponent is too small, it will underestimate the number of wins and draws. An exponent that is too large will also have the same effect. While I won't be able to prove it definitively, there does appear to be a 'universal' league Pythagorean exponent, or at least a range of exponents that work well for almost all teams in all domestic leagues. I have been using 1.70 as a universal exponent; the average league exponent from all the leagues that I've studied has been around 1.66. An exponent between 1.55 and 1.80 should work as well.
- **The soccer Pythagorean is a team-centered metric.** It assesses team performance using the most important metric of all: goals. Dean Oliver says in his book that an analyst should seek to understand the characteristics of the team before understanding the characteristics of the players who make up the team. The soccer Pythagorean points out which teams might be performing well outside the 3-5 point differential, which would motivate further study of those teams. In the same vein, the Pythagorean could be part of the package of coaching metrics.
- **The soccer Pythagorean is an assessor, not a predictor.** From the current goal statistics, the Pythagorean answers the question, "How is the team performing relative to expectations?" It does not answer "How will the team perform in the future given its current form?" One of the dangers of some predictive measurements is that they take current form and extrapolate way too far in the future. A good estimator would have to take into account future opponents, home/away matches, and updated offensive/defensive characteristics. The soccer Pythagorean can be part of an improved estimation framework, but it is not that by itself.
- **The soccer Pythagorean tends to overestimate the number of draws.** In most domestic leagues the percentage of draws is between 20% and 33%, and the soccer Pythagorean tends to estimate a similar number of draws unless the goal statistics are wildly different. It tends to compensate for overpredicting the number of draws by slightly underpredicting the number of wins, which still results in an estimated point total within 3-5 points of actual.
- **Consistently stingy defenses win championships.** One of the side studies that I did showed that there appears to be a much stronger correlation between defensive goal variances and average points won per match than offensive goal variances. If a team's defense allows few goals per game, the team is liable to win a lot of games. If a team allows few goals per game *consistently*, then it is likely to win a number of games that should have been draws.
- **Teams at the top or bottom of leagues deserve to be there.** What I have found out from looking at so many domestic leagues is that in general the teams that end up winning the league or get relegated deserve to be there. By that I mean that their point total is in line with the expectations from their goal statistics. You will find some exceptions, like FC Twente last year in the Netherlands (Pythagorean differential of +15!), or Wigan Athletic in England (should have been relegated but had a Pythagorean differential of +7). Teams at the very bottom not only deserve to be relegated, they also play much worse than their statistics would indicate, which could indicate some sort of on-field breakdown, which is not surprising to see from the worst team in the league.
- **Second or third place teams are the real overachievers.** A big season by the league winner tends to overshadow just how hard they were pushed to the title by other teams. Barcelona, in their Treble-winning season, had a fantastic goal scoring record and they deserved to be champions. But Real Madrid and Sevilla had very strong seasons and made the title race much closer than it should have been. In Ligue 1 last year, Marseille were deserved champions, but Auxerre and Montpellier greatly overachieved (Auxerre gaining entry into the Champions League because of it).

As you can see, there were a lot of insights to be drawn from one formula. It has been rewarding to examine the Pythagorean in the context of soccer, and I've picked up some techniques that can and will be used in other analytics problems. I've also developed a nice suite of tools to find solutions quickly. Now it's all about applying this formula to leagues and perhaps to other analytics problems.

The rationale was that if the variances are low, then the corresponding standard deviations are also low, which means that a team's goalscoring becomes more consistent. If two teams have identical goalscoring records, the team that scores more consistently (i.e. has a lower offensive variance) should have more league points than expected. But teams also have to play defense, so there is most likely some kind of nonlinear relationship.

I went back through all of the national leagues that I was evaluating — almost 40 in all — and extracted team goalscoring variances and Pythagorean residuals. Then I plotted multiple overlaid scatter plots of offensive and defensive variances, color-coded by the absolute value of the Pythagorean residual.
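For readers who want to try the same extraction, here is a minimal sketch of how offensive and defensive goal variances might be computed for one team. The match results are made up purely for illustration, and the use of the population variance (rather than the sample variance) is an assumption on my part.

```python
from statistics import mean, pvariance

# Hypothetical match results for one team as (goals scored, goals allowed);
# these numbers are made up for illustration and are not real data.
results = [(2, 0), (1, 1), (3, 1), (0, 0), (2, 1), (1, 0), (4, 2), (1, 1)]

scored = [gf for gf, _ in results]
allowed = [ga for _, ga in results]

# Population variance of goals per match on each side of the ball.
offensive_variance = pvariance(scored)
defensive_variance = pvariance(allowed)

print(f"mean scored {mean(scored):.2f}, offensive variance {offensive_variance:.2f}")
print(f"mean allowed {mean(allowed):.2f}, defensive variance {defensive_variance:.2f}")
```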

Below are two plots. The first one plots the offensive and defensive goal variances with Pythagorean residuals greater than zero. That is, all of these clubs played at or above their statistical expectations.

The second plot is of the offensive and defensive goal variances with Pythagorean residuals less than zero. That is, all of these clubs played at or below their statistical expectations.

From these two figures it is not apparent how the goal variances correspond with team performance relative to Pythagorean expectation. It doesn't look like there is much correlation present between the three sets of quantities. And when I think about it some more, that actually makes some sense. The Pythagorean is essentially an estimator of behavior in a league, and we're estimating the league Pythagorean exponent in the presence of uncertainties. If the estimate is good, then the resulting residuals should be uncorrelated noise with a few spikes that correspond to strong over- or under-performance.

It would be interesting to find out (a) whether the residuals really are uncorrelated, and (b) how the Pythagorean fits in within the realm of estimation theory. But those questions are worthy of at least a Master's thesis and/or a SIAM journal article, and I have no desire to do either.

Perhaps goalscoring variances and actual league points (averaged per game) would be a better thing to look at.

Risk and uncertainty are inherent elements in sport, finance, and games of chance, and I think some of the themes dovetail well with my work on the soccer Pythagorean.
