Moneyball and soccer

This was a post I wrote on my primary blog, and I’m reproducing it here.


A couple of years ago I read Michael Lewis’ Moneyball, which is an excellent book on the 2002 Oakland Athletics and the behind-the-scenes maneuvering that assembled that team.  I was living in the Bay Area when the A’s went on their 20-game winning streak and I remember the excitement in the East Bay during that time, so the book appealed to me on that level.  But it’s more than just a baseball book.  It’s really a story of the use of statistical methods to assemble a team in a sport that, despite its reputation for being statistics-heavy, depends so heavily on scouting evaluations.  Now that the A’s are running the San José Earthquakes, and the architect of that approach (GM Billy Beane) is seeking to apply it to soccer, the following question has to be asked: Can a ‘Moneyball’ approach be successful in soccer?

While I believe that there are some very intriguing questions about the applications of statistics to soccer, in general I think the answer to the original question is no.

I believe that the objectives of a systematic statistical and mathematical approach to soccer are twofold:  one, to obtain insight into a particular game, player, team, or season, and two, to identify talent.  There’s a persistent desire to obtain information about the game beyond a subjective analysis of playing strategy or a player’s effectiveness in a particular formation.  It’s possible to generate a lot of data about any sporting event; it only makes sense to attempt to develop some metrics that enhance subjective understanding.  Unfortunately, the majority of conventional sports statistics are useless for obtaining an accurate and reliable understanding of what is going on.  (That doesn’t stop reporters and pundits from relying on them to make any argument, especially in the USA!)  This conclusion motivated Bill James and other sabermetricians to develop better statistical metrics in baseball.  In soccer these conventional statistics are even less useful for obtaining any kind of insight, as they are often incomplete, imprecise, or irrelevant.  For example, one would expect the team that wins a match to have more shots on goal or more corner kicks.  But soccer fans everywhere can recall many matches where the losing team created a lopsided number of shots or corners.  These are examples of statistics that are not reliable indicators of performance as currently applied.

And that brings us to the second objective, which is to identify talent.  We would like to develop and refine statistics in order to identify the elite players, undervalued players (those who might be better than conventional statistics indicate), and overvalued players (those whose performances are not as good as the conventional statistics would indicate).  The first category is important to demonstrate the efficacy of the statistic — the best as identified by the statistic should be the ones that you would expect.  To this end, a couple of Billy Beane’s soccer statistics have some promise.  The second and third categories are important from the standpoint of a head coach or a general manager who is seeking to retool a squad.  Are there potential gems in the player market who can be obtained for a discount?  Are there players who should be sold before they drop too much in market value?  Is it possible to find through objective statistical analysis players who — rightly or wrongly — traditional scouting and subjective approaches have missed?

I believe that a useful statistic in soccer will ultimately contribute to what I call an “expected goal value” — for any action on the field in the course of a game, the probability that said action will create a goal.  One might obtain certain types of data from actions associated with the various positions (what I have below is NOT exhaustive, I’m just starting the discussion):

  • Goalkeeper:  saves, goals conceded, penalty kicks conceded, corner kicks conceded, passing/distribution %, balls won
  • Defender: goal kicks forced, balls won/lost, corner kicks conceded, fouls conceded, penalty kicks conceded, passing %
  • Midfielder: passing %, balls won/lost, assists, fouls conceded, fouls won, corner kicks forced, shots, goals
  • Striker: shots, goals, passing %, balls won/lost, corner kicks forced, fouls won

By just looking at those acts, it should be apparent that some precision will be required.  For example, is a foul committed 70 meters from goal the same as one conceded just outside the penalty area? What’s a good shot on goal, or what is a “high-percentage” shot on goal?  What is a “good” pass?  One of the passages that I remember most from Moneyball was how much the sabermetricians had to rethink what they thought they knew about the game of baseball: what comprised a good hit, which statistic was the best indicator of future success in the majors, and so on.
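
To make the “expected goal value” idea a bit more concrete, here is a minimal Python sketch of one naive way it might be estimated: count how often a given type of action, taken from a given distance band, ends up leading to a goal.  Everything in it is an assumption for illustration only.  The event format (action, distance from goal in meters, whether it led to a goal), the distance thresholds, and the sample events are all made up, and a real metric would need far more context than this.

```python
from collections import defaultdict

def distance_band(distance_m):
    """Bucket distance from goal (meters) into coarse, hypothetical bands."""
    if distance_m < 20:
        return "near"
    if distance_m < 40:
        return "mid"
    return "far"

def expected_goal_values(events):
    """Estimate P(goal | action, distance band) as an empirical frequency.

    Each event is a tuple (action, distance_m, led_to_goal).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [goals, total occurrences]
    for action, distance_m, led_to_goal in events:
        key = (action, distance_band(distance_m))
        counts[key][1] += 1
        if led_to_goal:
            counts[key][0] += 1
    return {key: goals / total for key, (goals, total) in counts.items()}

# Made-up events for illustration; these are NOT real match data.
events = [
    ("foul_conceded", 70, False),
    ("foul_conceded", 65, False),
    ("foul_conceded", 18, True),
    ("foul_conceded", 19, False),
    ("shot", 10, True),
    ("shot", 12, False),
    ("shot", 35, False),
]

values = expected_goal_values(events)
for key, value in sorted(values.items()):
    print(key, round(value, 2))
```

Even this toy version treats a foul conceded 70 meters out differently from one at the edge of the penalty area, which is the kind of precision the questions above are asking for.  It also hints at the data problem: every extra distinction (distance band, game situation, player) splits the counts further, so many more observations are needed before the frequencies mean anything.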

So there are a lot of interesting questions in soccer that could be addressed with a systematic statistical approach.  But why am I skeptical about its success?  One reason is the sheer complexity of soccer.  Technically speaking, soccer is an example of a piecewise-continuous, highly nonlinear, stochastic dynamical system.  In layman’s terms, soccer is non-stop, freely flowing for the most part, and subject to a lot of improvisation within a team structure, which are the things that we love about the game.  But it’s not a game that will permit its secrets to be known with total precision, or even a high level of precision.  The problem arises when we use statistics without understanding either their purpose or their limitations, not just in sports but in other fields.  There are problems that will keep academicians occupied for the rest of their careers in the area of stochastic dynamical systems, or hybrid dynamical systems, or probability and statistics.  But in order to make individual problems tractable it’s necessary to make some approximations and simplifications, and by doing so the richness and unpredictability that you see in soccer is lost.  How would the resulting findings be applicable to understanding the real game?  I believe this issue comes down to a dilemma: capture the system dynamics of soccer at the risk of creating an intractable problem, or create a tractable problem that has no relevance to the actual game.

My second reservation lies in the sheer amount of data required to make an accurate and useful metric, which would limit its efficacy in some parts of the world.  This is the challenge of taking measurements from a nonlinear dynamical system.  Before my current job, I worked as a researcher in experimental and computational fluid dynamics, and I can assure you that obtaining measurements from a nonlinear system is not an easy task.  For soccer, one would need video analysis that includes player and ball tracking, for which one would need at least three cameras at each match and video processing software.  I was going to say that this would be a difficult task, but in July a group of Italian researchers presented such a system at an international conference in Canada.  (If they’re smart, they’ll license it and make a lot of money.)  I could see this system being feasible in North America, parts of Latin America, Europe, East Asia and Oceania, but not in much poorer parts of the world (Africa, some parts of South America and the Caribbean).  And those latter areas are where you are more likely to find undervalued players.

So in conclusion, while I believe that the effort to create a quantitative understanding of soccer offers a wealth of interesting academic and practical problems, in the end the game won’t permit that kind of understanding with sufficient precision to make such an approach feasible.  I’m willing to be proven wrong on the latter point, and I’m even willing to pursue some of these research problems myself, given funding from a soccer club, of course.