MCFC Analytics ‘basic’ dataset: Not good enough

Like many of the visitors to this site, I eagerly awaited the release of the MCFC Analytics dataset. I don’t expect everyone to agree with me, but I’ll just get to the point and say it: the basic dataset that MCFC and Opta have released is simply not good enough.

The MCFC/Opta dataset is a table of summary statistics for players who participated in Premier League matches last season. Now, I recognize that many people don’t have the interest or technical chops to drill into the micro-events of a football match. Fine. Yet even so, this dataset is woefully incomplete.

It is possible to back out the starting lineups and substitutes for each match, but impossible to determine who was substituted and for whom. There are no timings for macro-level match events, such as goals, penalties, bookings, or substitutions. We don’t know who refereed any of the matches, nor do we know who managed either team in a match. We don’t have venue data, and we don’t know what the match attendances were. Granted, some of the fields that I mentioned have nothing to do with in-match analytics, but many of them do, and in any case I prefer to develop match data models for the widest possible range of analytical applications.
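To make the lineup point concrete, here is a minimal sketch in Python. It assumes one summary row per player per match with a flag saying whether the player started. The field names (`match_id`, `team`, `player`, `started`) and the sample rows are made up for illustration and are not the column names in the MCFC/Opta file.

```python
from collections import defaultdict

# Hypothetical per-player, per-match summary rows. The field names and
# values are illustrative only, not taken from the MCFC/Opta file.
rows = [
    {"match_id": 1, "team": "Home FC", "player": "A", "started": True},
    {"match_id": 1, "team": "Home FC", "player": "B", "started": True},
    {"match_id": 1, "team": "Home FC", "player": "C", "started": False},
    {"match_id": 1, "team": "Away FC", "player": "X", "started": True},
    {"match_id": 1, "team": "Away FC", "player": "Y", "started": False},
]


def back_out_lineups(rows):
    """Group players into starters and used substitutes per (match, team)."""
    lineups = defaultdict(lambda: {"starters": [], "substitutes": []})
    for row in rows:
        key = (row["match_id"], row["team"])
        # Every row is a player who took part, so a non-starter must have
        # come on as a substitute.
        role = "starters" if row["started"] else "substitutes"
        lineups[key][role].append(row["player"])
    return dict(lineups)


lineups = back_out_lineups(rows)
```

The grouping recovers who started and who came off the bench, but nothing in rows like these says which starter a given substitute replaced or in which minute, and that is exactly the gap.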

A number of people have sent us their files, asking that we convert them into one of our databases. We thank you for your interest and commitment, and we will honor our commitments to you. I’d also like to propose an initiative of my own.

I am creating a new database schema that is similar to FMRD but models summary statistics in soccer. I want to enrich the MCFC dataset with publicly available information on each match, such as referees, managers, top-level match data, and the timings of major events in a soccer match. I know that between the visitors to this site and me, we have this information. I’ll contribute the schema and some of the software I’ve written to build FMRD and FMRD-Light databases. I haven’t thought through all of the details (I only thought of this within the last hour), but I’ll reveal more later. I do want this to be an open-source and open-data effort with contributions from the soccer analytics community. It might take a little longer to develop the end result, but I believe that everyone, at whatever level of skill and sophistication, will be pleased with it.
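As a rough idea of the shape such a schema could take, here is a sketch using Python’s built-in `sqlite3` module. It is not the FMRD schema and not a finished design. Every table and column name below is a placeholder I made up to show where the match-level fields and event timings would sit next to the summary statistics.

```python
import sqlite3

# Hypothetical schema: all table and column names are placeholders,
# not the actual FMRD or FMRD-Light definitions.
SCHEMA = """
CREATE TABLE matches (
    match_id      INTEGER PRIMARY KEY,
    match_date    TEXT NOT NULL,
    home_team     TEXT NOT NULL,
    away_team     TEXT NOT NULL,
    venue         TEXT,
    attendance    INTEGER,
    referee       TEXT,
    home_manager  TEXT,
    away_manager  TEXT
);

-- One row per player per match, holding the summary statistics.
CREATE TABLE player_match_stats (
    match_id        INTEGER NOT NULL REFERENCES matches(match_id),
    player          TEXT NOT NULL,
    team            TEXT NOT NULL,
    started         INTEGER NOT NULL,
    minutes_played  INTEGER,
    PRIMARY KEY (match_id, player)
);

-- Timed macro-level events: goals, penalties, bookings, substitutions.
CREATE TABLE match_events (
    event_id         INTEGER PRIMARY KEY,
    match_id         INTEGER NOT NULL REFERENCES matches(match_id),
    minute           INTEGER NOT NULL,
    event_type       TEXT NOT NULL
        CHECK (event_type IN ('goal', 'penalty', 'booking', 'substitution')),
    player           TEXT NOT NULL,
    replaced_player  TEXT  -- only used for substitutions
);
"""


def create_db(path=":memory:"):
    """Create the sketch schema in a SQLite database and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

With a `match_events` table along these lines, a substitution becomes a single row that names the player coming on, the player going off, and the minute, which is the information the released table cannot supply.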

I’ll have more to say on this by the end of the weekend.


