Looking back on five years in soccer analytics

[This is the first of a three-part series on my perspectives on Soccermetrics on its 5th anniversary.]

On 8 January 2009, I started Soccermetrics in order to “consider problems in applied mathematics and statistics that have applications toward a better understanding of soccer.”  This month will mark not just the fifth anniversary of the site, but also the 5000th follower of Soccermetrics on Twitter.

I’ve written fewer “high-level discussions” on soccer analytics over the last two and a half years, preferring to leave that to the usual biennial assessments that I’ve made.  In fact, I’ve grown less enamored of writing (and also reading) essays about how analytics are supposed to be; I’d much rather do analytics than talk about them.  That said, the five-year mark is a good excuse to pause and consider the growth in soccer analytics during that time, the evolution of Soccermetrics, and the lessons learned and challenges that remain.

I started Soccermetrics so that I would have a space to write about — and ultimately do — quantitative analysis problems in football.  “Soccermetrics” was a contraction of “Soccer Sabermetrics”, an attempt to replicate in soccer what had occurred in baseball over the previous three decades.  I definitely wasn’t the first person to present statistical analysis on soccer, and I wasn’t the first person with an engineering and computational background to write about soccer.  I do believe that the site was unique in that it took very seriously the statistical analysis — the mathematical assumptions (tempered by understanding of the game), the underlying statistical distributions and models, algorithms, and so on.  What is called “sport statistics” has not even the slightest thing in common with the mathematical field of statistics, and in football even more so.  I had looked at the quality of analytics research in basketball from researchers such as Dean Oliver, Aaron Barzilai, Dan Rosenbaum and asked myself why there couldn’t be work like that in soccer.  And why not this site?

Over five years Soccermetrics has evolved from a blog to something more organized, whether formal or less so.  These changes have occurred during a surge of interest and action in sports analytics, and in particular soccer analytics.  There is so much that I could write about regarding the small decisions I made and stands I took which gave Soccermetrics its identity, but if I wrote about all of that this would be a 10,000-word post.  So maybe I’ll save it for a book or something!  Instead I’ll write about three things: my initial expectations for Soccermetrics, lessons learned from attempting to create a technology startup, and perspectives on the growth in soccer analytics during that time.

People start blogs for multiple reasons: as a daily diary, an information source, or a creative and social outlet to the world.  The intended audience could be either the general public or a narrow community, and who you believe you’re writing for has an impact on what you choose to write about.  My primary audience for Soccermetrics has been and will always be me, and beyond that the soccer analytics research community.  I write about topics that interest me, I work on projects that interest me, and I use all of the software that I write.  To be sure, I’m happy that people follow the site and draw some insight from it, but I would continue writing about these things even if I had no readers.

Nevertheless, it was fascinating to learn who did visit the site.  When I started Soccermetrics I thought I would receive the bulk of my visits from academics who did research projects on football as a side distraction from their main work.  I did get some of that traffic, but I started seeing visits from students, general soccer fans, curious journalists, punters jonesing for tips, sport traders, and personnel at professional football clubs.  Oh yes, and investors, too. The majority of those readers came from the USA and the UK (something that has remained consistent over five years), with interesting spurts from central Europe and Asia at times.

I was impressed that such a wide range of people involved in football followed Soccermetrics because I learned very quickly that not everyone is as into mathematics and statistics (the real kind) as I am.  In fact, a lot of people are very, very intimidated by mathematics!  I had the romantic ideal of not only presenting my ideas but also describing mathematical and statistical concepts in a way that was accessible to the general public.  I was not as successful on that front as I would have liked, perhaps because I didn’t understand some of my own ideas thoroughly enough.  Even if one places a high priority on writing clear and understandable prose (and I keep a copy of Joseph Williams’ Style: Lessons in Clarity and Grace on my desk and George Orwell’s “Politics and the English Language” bookmarked), technical writing is still going to be intimidating because of its subject matter.  I am not a fan of introducing complicated concepts just for the sake of looking “smart”, but I believe very strongly in being precise with language, especially mathematical language.  I also believe strongly in acknowledging and respecting the intelligence of those who do read the site.  This site is not written at a 6th to 8th-grade level, and I completely reject the notion that publications must “dumb down” their content so as not to be too far ahead of their readers.

In the early days of Soccermetrics most of my posts were about what other people were doing, whether as academic researchers or “amateur” soccermetricians who were seeking insight on some aspect of the game.  These “Paper Discussions” turned out to be more my own reviews of the publications than a back-and-forth discussion in the comment section, but over time I started to receive papers from people who desired a mention on the site, so I guess people did read the posts and found them interesting.  The posts were useful in that they kept me aware of what was going on in the academic community, but after a while I wanted to present original work of my own.  I finally purchased my own Lenovo laptop running Ubuntu Linux (which I still have) and set out to do some statistical analyses of my own, starting with international competitions in CONCACAF.  And it was there that I realized that soccer analytics has two nontrivial problems: the data problem and the data analysis problem.  The data analysis problem is self-explanatory, but the data problem encompasses measurement, collection, pre-processing, and modeling.  Everyone who does work in soccer analytics will contend with the data problem no matter how simple or complicated their analysis is, and if your analysis is the slightest bit involved the pre-processing step will be so painful that you will resolve never to go through the experience again.

As with a number of other projects, I was not the first person to think of a soccer database; there are hundreds floating around the web at varying levels of detail.  I was more concerned with creating a software framework that would allow me to construct databases and analyses without having scores of unrelated scripts on my computer.  It turned out to be a huge effort (I had to teach myself relational database design and Python programming at the same time), and in retrospect I would have been better served by creating just enough database functionality to do the analyses I wanted to do.  (Sometimes being precise and meticulous can work against you.)  But over the long term I developed a lot of insight on data models for football and other sports, and the development experience I gained helped me land a few jobs.  I also became very sensitive to data integrity and consistency, which is something that too many football databases, and databases in general, lack.

These events were occurring against the backdrop of Soccermetrics’ transition from a blog to something more formal.  I’ll talk about that and the lessons learned in Part Two of this series.

CORRECTION: Thanks to a reader who noted that I had written Adam Barzilai instead of Aaron Barzilai.  I have corrected the text above.