Soccer Pythagorean: Lessons Learned

Hopefully, this will be the last post I write on the Soccer Pythagorean.  Oh, I will continue to post tables for selected European leagues and MLS throughout the season, but development of the estimation formula has reached a level of maturity where I can now direct my focus to other things.

It's helpful to take a look back at this project, from its start to the present time:

  • I was cc'ed on an ongoing discussion in which someone was attempting to use the Pythagorean formula for soccer but didn't know what value to use for the exponent. The first question in my mind was, "Okay, what is a Pythagorean?"
  • I posed some questions about the Pythagorean and didn't like the answers that I was receiving, so I did some digging and found a very interesting paper by Steve Miller.  It was the derivation from first statistical principles that I needed to get my work moving.
  • Went to work, and three weeks later I had something.  Because of the nature of the game, people are inevitably drawn to modeling goals in soccer with a Poisson distribution, but the mathematics don't work out (more on that in a future post).  So I returned to a continuous distribution and developed an expression.  Wrote a short paper on that which I later submitted to MIT's Sloan Sports Analytics Conference.
  • The Pythagorean that I had developed for soccer was more complicated than the baseball or basketball versions, and it overpredicted point totals by 12-15 points.  It did a fairly good job of predicting relative placings in a league, but that wasn't good enough.  And if you added up the win/loss/draw probabilities, the totals would not sum to one, which was another problem.
  • That paper that I submitted for the MIT SSAC?  It got rejected.  It turned out to be a blessing in disguise because of all the problems I found with the earlier derivation, but still no less disappointing.
  • I didn't like the fact that the new Pythagorean was so darn complicated and scary-looking with all those exponential terms.  It would scare away small children and math-phobic analysts.  So I tried to apply some mathematical tools to simplify the expression.  In the end, I gave up and made my peace with the complexity of the soccer Pythagorean.
  • That above-mentioned work did have one positive side-effect: it made me realize that not only did I have to introduce a definition of a drawn match, I would also have to revise the definition of a win.  This was the big breakthrough in my soccer Pythagorean work.  The math derivation became a lot easier (the resulting equation was no less complicated, unfortunately), and all of the win/loss/draw probabilities summed correctly (i.e., to 1). 
  • I wrote a second paper that summarized the newest version of the soccer Pythagorean and posted it on this site.
  • The formula started getting some attention from the press, but the reporters were suffering bad flashbacks to their days in high school math class.  I attempted to put them at ease by writing an explanation of the Pythagorean formula with very little math content.  Not sure if I succeeded, but they still read my blog, so that should count for something.
  • Now that the hard work of developing the Pythagorean was over, I could concentrate on addressing the issue of a 'universal' Pythagorean exponent and the effect of Pythagorean goal variances on the quality of the estimate.
  • And to wrap a bow around the whole effort, I presented my work at this year's NCSSORS.
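
As a point of reference for the "what is a Pythagorean?" question above, here is a minimal sketch of the classic baseball form, which estimates a winning fraction from runs scored and runs allowed. This is not the soccer formula described in this post, which also has to account for draws and is considerably more involved. The function name, the default exponent of 2, and the sample numbers are assumptions chosen for illustration.

```python
def baseball_pythagorean(runs_scored, runs_allowed, exponent=2.0):
    """Classic baseball Pythagorean: estimated fraction of games won.

    Illustrative only. The soccer version discussed in this post is a
    different, more complicated expression that also models draws.
    """
    if runs_scored < 0 or runs_allowed < 0:
        raise ValueError("run totals must be non-negative")
    scored = runs_scored ** exponent
    allowed = runs_allowed ** exponent
    if scored + allowed == 0:
        raise ValueError("at least one run total must be positive")
    return scored / (scored + allowed)


if __name__ == "__main__":
    # Hypothetical season totals, not real data.
    print(f"{baseball_pythagorean(800, 700):.3f}")
```

The exponent is left as a parameter because, as in the soccer case, the choice of exponent is part of what has to be estimated.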

So what were the lessons learned from this work?  And what might it be good for?

  • Proceed from first principles.  Here I can't express how valuable it was to find Steve Miller's paper on the baseball Pythagorean.  Seeing the work that he had done to derive that expression gave me the confidence that I could do the same for soccer with some tweaks and wrinkles.  It's all about making some logical (and defensible) assumptions, expressing those assumptions in the language of mathematics, and then using the mathematics as a guide for answering the questions you have.
  • Stand on the shoulders of giants.  See above.
  • Understand your formula.  One of the most valuable pieces of advice that I received from my thesis advisor was that if you don't understand the model that you are using to do designs or analysis, you really shouldn't use it.  Formulas are more defensible if you understand the meaning of every term and its influence on the final result.  The Pythagorean is nothing more complicated than a win/draw probability calculation given the average scoring offense/defense.  Then it's a matter of defining a win and a draw.
  • Pythagorean exponents do matter.  If the Pythagorean exponent is too small, it will underestimate the number of wins and draws.  An exponent that is too large will have the same effect.  While I won't be able to prove it definitively, there does appear to be a 'universal' league Pythagorean exponent, or at least a range of exponents that work well for almost all teams in all domestic leagues.  I have been using 1.70 as a universal exponent; the average league exponent from all the leagues that I've studied has been around 1.66.  An exponent between 1.55 and 1.80 should work as well.
  • The soccer Pythagorean is a team-centered metric.  It assesses team performance using the most important metric of all — goals.  Dean Oliver says in his book that an analyst should seek to understand the characteristics of the team before understanding the characteristics of the players who make up the team.  The soccer Pythagorean points out which teams might be performing well outside the 3-5 point differential, which would motivate further study of those teams.  In the same vein, the Pythagorean could be part of the package of coaching metrics.
  • The soccer Pythagorean is an assessor, not a predictor.  From the current goal statistics, the Pythagorean answers the question, "How is the team performing relative to expectations?" It does not answer "How will the team perform in the future given its current form?"  One of the dangers of some predictive measurements is that they take current form and extrapolate way too far in the future.  A good estimator would have to take into account future opponents, home/away matches, and updated offensive/defensive characteristics.  The soccer Pythagorean can be part of an improved estimation framework, but it is not that by itself.
  • The soccer Pythagorean tends to overestimate the number of draws.  In most domestic leagues the percentage of draws is between 20% and 33%, and the soccer Pythagorean tends to estimate a similar number of draws unless the goal statistics are wildly different.  It tends to compensate for overpredicting the number of draws by slightly underpredicting the number of wins, which still results in an estimated point total within 3-5 points of actual.
  • Consistently stingy defenses win championships.  One of the side studies that I did showed that there appears to be a much stronger correlation between defensive goal variances and average points won per match than between offensive goal variances and average points won per match.  If a team's defense allows few goals per game, the team is likely to win a lot of games.  If a team allows few goals per game consistently, then it is likely to win a number of games that should have been draws.
  • Teams at the top or bottom of leagues deserve to be there.  What I have found from looking at so many domestic leagues is that, in general, the teams that end up winning the league or getting relegated deserve to be there.  By that I mean that their point total is in line with the expectations from their goal statistics.  You will find some exceptions, like FC Twente last year in the Netherlands (Pythagorean differential of +15!), or Wigan Athletic in England (should have been relegated but had a Pythagorean differential of +7).  Teams at the very bottom not only deserve to be relegated, they also play much worse than their statistics would indicate, which could point to some sort of on-field breakdown. That is not surprising to see from the worst team in the league.
  • Second or third place teams are the real overachievers.  A big season by the league winner tends to overshadow just how hard they were pushed to the title by other teams.  Barcelona, in their Treble-winning season, had a fantastic goal scoring record and they deserved to be champions.  But Real Madrid and Sevilla had very strong seasons and made the title race much closer than it should have been.  In Ligue 1 last year, Marseille were deserved champions, but Auxerre and Montpellier greatly overachieved (Auxerre gaining entry into the Champions League because of it). 
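
The Pythagorean differential that runs through several of these lessons can be sketched in a few lines. This is a minimal illustration, not the estimation formula itself: it takes the win and draw probabilities as given inputs (however they were estimated), assumes the usual league scoring of 3 points for a win and 1 for a draw, and defines the differential as actual points minus expected points, so that a positive value marks an overachiever. The class name, function names, 5-point tolerance default, and sample teams are all hypothetical.

```python
from dataclasses import dataclass

POINTS_FOR_WIN = 3   # assumed league scoring
POINTS_FOR_DRAW = 1  # assumed league scoring


@dataclass
class TeamEstimate:
    name: str
    matches: int
    actual_points: int
    p_win: float   # estimated per-match win probability (input, not derived here)
    p_draw: float  # estimated per-match draw probability (input, not derived here)


def expected_points(team):
    """Expected point total from per-match win and draw probabilities."""
    if team.p_win < 0 or team.p_draw < 0 or team.p_win + team.p_draw > 1:
        raise ValueError("win and draw probabilities must be valid")
    per_match = POINTS_FOR_WIN * team.p_win + POINTS_FOR_DRAW * team.p_draw
    return team.matches * per_match


def differential(team):
    """Actual minus expected points; positive means overachieving."""
    return team.actual_points - expected_points(team)


def flag_outliers(teams, tolerance=5.0):
    """Return (name, differential) for teams outside the tolerance band."""
    return [(t.name, differential(t)) for t in teams
            if abs(differential(t)) > tolerance]


if __name__ == "__main__":
    # Hypothetical teams with made-up numbers, not real league data.
    league = [
        TeamEstimate("Team A", 38, 70, 0.50, 0.25),
        TeamEstimate("Team B", 38, 60, 0.30, 0.30),
    ]
    for name, diff in flag_outliers(league):
        print(f"{name}: {diff:+.1f}")
```

Teams returned by `flag_outliers` are the ones that would merit a closer look, in the spirit of the Twente and Wigan examples above.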

As you can see, there were a lot of insights to be drawn from one formula.  It has been rewarding to examine the Pythagorean in the context of soccer, and I've picked up some techniques that can and will be used in other analytics problems.  I've also developed a nice suite of tools to find solutions quickly.  Now it's all about applying this formula to leagues and perhaps to other analytics problems.