The tension between open-source and proprietary soccer analytics

My post in response to the Brian Phillips article in Slate touched a little on the issue of open-source versus proprietary approaches to soccer analytics, but this blog post at Hot Time in Old Town (a Chicago Fire fan’s site) has motivated me to write at greater length on the subject.  It’s something that I’ve thought about at varying levels of intensity since I started this blog.

There is considerable tension between open-source and proprietary soccer analytics or, more precisely, between open-source and proprietary sources of soccer data.  It is more acute in soccer than in most other sports because of the sparsity of data available in the public domain.  I came up with the following schematic to illustrate this point:


In a sporting event, I define the facts that are known before the start of the event and those facts revealed after the event as the “historical facts”.  These facts include the competition, the competing teams and their squads, the referees, the final scoreline, the scorers, and other significant events attributed to a player: fouls, assists (post-1994), bookings, substitutions, etc.  These events are in the public domain and available to everyone, and the ruling in a 2006 case between a fantasy sports company and Major League Baseball backs up this definition.

Now, there are micro-events that occur within the match that generate the historical events: player actions or movements, with and without the ball, that occur at certain spatial locations on the playing surface and at a certain time of the match.  (Let’s add the referee’s spatial movements and decisions during the match as well.)  It is easy to see that all of these events comprise an immense amount of data.  As best as I understand it, all of those data that explain how the historical events come to pass are the property of the competing teams and the league.

The division between open-source and proprietary soccer analytics originates not only from the data upon which these analytics act, but also from the sparsity of public-domain data in the sport.  It is possible to develop analytics based on nothing more than historical event data: lineups, scores, goalscorers, league tables, and so on.  I’ve done that, and plenty of other websites do it.  It is also possible to develop proprietary measurements based on those data, whether as the result of some regression analysis on huge amounts of time-series data or by other methods.  Here I don’t define proprietary analysis solely as a computer program that’s made closed-source; I refer also to those who report only the results of their analysis and don’t reveal their algorithms or code.  (It is also possible to create highly sensitive results from work on different pieces of freely available data, which is the OPSEC issue.)  When it comes to proprietary data, however, the only legal way to develop analytics is through the use of proprietary analysis tools.  One could develop (and some have developed!) analyses based on in-match video data that they compile themselves, but one does so in violation of the intellectual property rights of the league and the competing teams.  (These rights are even stronger for sporting events in Europe.)
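To make the first case concrete, here is a minimal sketch of an analysis built from nothing more than historical facts: a league table computed from final scorelines.  The team names and results are made up for illustration, and the three-points-for-a-win rule and the tiebreak order (goal difference, then goals scored) are assumptions, since competitions differ on both.

```python
from collections import defaultdict


def league_table(results):
    """Build a league table from (home, away, home_goals, away_goals) tuples.

    Assumes 3 points for a win and 1 for a draw; ties are broken by goal
    difference and then goals scored (real competitions vary).
    """
    table = defaultdict(lambda: {"played": 0, "won": 0, "drawn": 0,
                                 "lost": 0, "gf": 0, "ga": 0, "points": 0})
    for home, away, home_goals, away_goals in results:
        # Record the match once from each team's point of view.
        for team, gf, ga in ((home, home_goals, away_goals),
                             (away, away_goals, home_goals)):
            row = table[team]
            row["played"] += 1
            row["gf"] += gf
            row["ga"] += ga
            if gf > ga:
                row["won"] += 1
                row["points"] += 3
            elif gf == ga:
                row["drawn"] += 1
                row["points"] += 1
            else:
                row["lost"] += 1
    # Sort by points, then goal difference, then goals scored, best first.
    return sorted(
        table.items(),
        key=lambda item: (item[1]["points"],
                          item[1]["gf"] - item[1]["ga"],
                          item[1]["gf"]),
        reverse=True,
    )


# Hypothetical results: only teams and final scores, i.e. historical facts.
results = [("A", "B", 2, 1), ("B", "C", 0, 0), ("C", "A", 1, 3)]
for team, row in league_table(results):
    print(team, row["played"], row["points"], row["gf"] - row["ga"])
```

Nothing here touches in-match event or tracking data, which is why this kind of work can be published openly.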

So with all this in mind, what does this apparent division have to do with soccer analytics and the aforementioned article?  A lot, in my opinion, because the ability to develop open-source analytics and stay on the right side of intellectual property law depends on the type of data to which one has access.  Soccer’s problem, at least for statistical analysts, is that there are so few data freely available to use, and even the expanded set of statistics is less descriptive and relevant than that found in other sports.  So when Mark Rogers says that Major League Soccer has failed to develop statistics and claims that “many of the statistics that fans would be interested in seeing already exist. You’re just not allowed to see them”, that’s not quite accurate in my opinion.  In fact, Major League Soccer did produce very detailed statistics in its early days (as did NASL), and it was ridiculed in the international soccer media for doing so.  (I remember in the mid-1990s World Soccer magazine asking MLS for match details so that they could publish brief reports on the league, and receiving instead a barrage of statistics and box scores but not match details in the British sense of the phrase.)  MLS still collect those data, but they don’t call attention to them.

It is easy to make comparisons between the analytics work in soccer and the sabermetricians in baseball, especially when considering the transition from amateur analysts to the front offices of MLB teams.  However, it’s equally easy to forget that those amateur sabermetricians were working from historical facts.  Major League Baseball is not performing some grand gesture by providing these facts; it is following a convention for summarizing baseball games that has existed for over 120 years.  The events that generated those historical facts (ball velocity, ball spin, ball position over home plate, bat speed, ball landing point, and fielder positions, to name a few) are unavailable for games prior to the 1990s (or even 2000s), are protected by MLB and the teams, and aren’t released to fans.  The NBA behaves the same way, as does the NFL, and they’ve started to enforce their play-by-play data rights more aggressively.  The UEFA Champions League exercises control over all data associated with its matches.  Soccer analysts are working from fewer historical facts, especially for matches before the 1990s, and the information that they would really like to have is protected by the data collection companies, who almost certainly have had to pay a licensing fee to the leagues.

It’s not surprising that sports data companies would perform in-match data collection for their clients, and it’s not surprising that the leagues would not be collecting those data already.  The first reason is the cost involved in setting up cameras, specialized software, and data storage, and the second is the level of expertise required within the league to use that equipment.  Far better for the league to license those data collection rights to the sports data companies, who would then sign deals with the teams.  It is true that those data are proprietary, but they were never freely available in the first place.  Even if Prozone, Opta, and Match Analysis didn’t exist, those data would not be freely available.

I think what a lot of aspiring soccer analysts resent is the inability to interface with proprietary data, even if one has no interest in physically possessing those data.  This is the underside of the proprietary data companies: there is no incentive for them to divulge software interfaces to their information.  There are multiple challenges in developing analytics software: the first is developing interfaces to the dataset, and the second is developing the algorithm for the problem of interest.  Another issue is who owns the results generated from analysis on proprietary data.  My layman’s guess, without being totally aware of the law (lawyers who read the blog, feel free to correct me), is that that information still belongs to the company that holds those data, which can charge you a license fee if (especially if) it’s used commercially or displayed on a website that generates a lot of pageviews.

Do I think open-source soccer analytics have their place? Yes, absolutely. In the end, what will make an analytics tool open-source or proprietary is the nature of the dataset that it touches.