Manchester City’s MCFC Analytics page is now online, and thousands of people are signing up to receive access to Opta’s on-ball data for the 2011-12 Premier League. But what will users be receiving? This post gives a hopefully not-too-technical description of the data.
Opta sends out their data feeds for all their sports in XML format. For football, a variety of data feeds are generated — players, referees, match results, and on-ball (touch-by-touch) data. (I’m sure there are others I’m not mentioning — Opta’s primary market is media, after all). I will discuss the on-ball data file since that is what advanced users will have.
Opta’s XML structure is quite simple:
The Games element is the root element and won’t have any data of interest, just the timestamp at which the file was created.
The Game element is where the excitement starts. At the top level is high-level information on the match:
- home team
- away team
- scheduled kickoff time
- actual start time of each period
Within each Game element are the Event elements, and together these Events make up the full description of what happened in the football match. There’s only one Games element and one Game element in the XML file, but many Events, each of which may have zero or more Qualifiers.
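To make the nesting concrete, here is a minimal Python sketch that walks a tiny hand-written file with the same Games, Game, Event, Qualifier hierarchy. The sample document is invented for illustration, and the tag name `Q` and attribute names such as `home_team_id`, `type_id`, and `qualifier_id` are my assumptions, not something confirmed by the released data, so check them against the actual feed.

```python
import xml.etree.ElementTree as ET

# Invented sample. Tag and attribute names are assumptions for illustration.
SAMPLE = """<Games timestamp="2012-08-20T10:00:00">
  <Game id="1" home_team_id="43" away_team_id="80">
    <Event id="101" type_id="1" team_id="43" player_id="9001"
           min="3" sec="12" x="45.2" y="60.1" outcome="1">
      <Q id="5001" qualifier_id="140" value="62.0"/>
      <Q id="5002" qualifier_id="141" value="55.5"/>
    </Event>
    <Event id="102" type_id="30" team_id="43"
           min="45" sec="0" x="0.0" y="0.0" outcome="1"/>
  </Game>
</Games>"""


def summarize(xml_text):
    """Return a small summary of the Games > Game > Event > Q hierarchy."""
    root = ET.fromstring(xml_text)   # the single Games root element
    game = root.find("Game")         # the single Game element
    events = game.findall("Event")   # many Events per Game
    return {
        "root_tag": root.tag,
        "home_team_id": game.get("home_team_id"),
        "away_team_id": game.get("away_team_id"),
        # Each Event may carry zero or more qualifiers.
        "qualifiers_per_event": [len(ev.findall("Q")) for ev in events],
    }


summary = summarize(SAMPLE)
print(summary)
```

In this sample the first Event carries two qualifiers and the second carries none, which is the "zero or more" case described above.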
Every Event is tagged with the match time, the (x,y) pitch location at which the event occurred, and the IDs of the player and his team. This location is normalized on a 100×100 pitch and is always described as if each team is playing left-to-right (0.0 is the defending goal line and 100.0 the attacking goal line). Events also have an outcome attribute that gives event-specific information on the result of an event, such as a pass. Some events, such as lineup information, substitutions, match stoppages, and end of periods, don’t occur at a spot on the field, so the numbers in their location attributes won’t make any sense (and are often set to zero).
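Because every team is recorded as playing left-to-right on a 100×100 grid, you will usually need to rescale or flip the coordinates before plotting both teams on one pitch. The sketch below shows one way to do that. The 105 m × 68 m pitch size is an assumption of mine (the data carries no venue dimensions), and the function names are hypothetical.

```python
# Assumed pitch dimensions in metres. The data has no venue information,
# so these are placeholders, not values taken from the feed.
PITCH_LENGTH_M = 105.0
PITCH_WIDTH_M = 68.0


def to_metres(x, y, length=PITCH_LENGTH_M, width=PITCH_WIDTH_M):
    """Rescale normalized (0-100) coordinates to metres."""
    return (x / 100.0 * length, y / 100.0 * width)


def to_common_frame(x, y, is_home):
    """Put both teams on one pitch, with the home team attacking left-to-right.

    Home-team events are left as they are. Away-team events are rotated by
    180 degrees, so their attacking goal line (x = 100) lands on x = 0.
    """
    if is_home:
        return (x, y)
    return (100.0 - x, 100.0 - y)


print(to_metres(50.0, 50.0))                  # centre spot in metres
print(to_common_frame(80.0, 30.0, is_home=False))
```

Remember to skip events with no real location (lineups, substitutions, stoppages) before applying either function, since their coordinates are placeholders.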
Opta records about 60 types of in-match events, which are described by at least 150 qualifiers (don’t be too intimidated: about 30-40 are relevant to most people). So here are some of the touch events that you’ll see:
All of these events have contextual data associated with them. Passes are associated with end (x,y) points, the receiving player, and the body part used. Unsuccessful shots are associated with (x,y,z) points, the type of shot, and the body part used. Goals have similar associations.
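This contextual data lives in the qualifiers attached to each event, so reading it amounts to a lookup by qualifier ID. Here is a rough sketch of pulling a pass end point out of one event. The sample XML is invented, and the qualifier IDs `140` and `141` for pass end x and y, like the `Q`, `qualifier_id`, and `value` names, are assumptions to be checked against Opta’s documentation.

```python
import xml.etree.ElementTree as ET

# Invented sample event. Names and qualifier IDs are assumptions.
EVENT_XML = """<Event id="101" type_id="1" x="45.2" y="60.1" outcome="1">
  <Q id="5001" qualifier_id="140" value="62.0"/>
  <Q id="5002" qualifier_id="141" value="55.5"/>
  <Q id="5003" qualifier_id="1"/>
</Event>"""

PASS_END_X = "140"  # assumed ID for the pass end x coordinate
PASS_END_Y = "141"  # assumed ID for the pass end y coordinate


def qualifiers(event):
    """Map qualifier ID to value; flag-style qualifiers map to None."""
    return {q.get("qualifier_id"): q.get("value") for q in event.findall("Q")}


def pass_end_point(event):
    """Return the (x, y) end point of a pass, or None if it is not recorded."""
    quals = qualifiers(event)
    if quals.get(PASS_END_X) is None or quals.get(PASS_END_Y) is None:
        return None
    return (float(quals[PASS_END_X]), float(quals[PASS_END_Y]))


event = ET.fromstring(EVENT_XML)
end_point = pass_end_point(event)
print(end_point)
```

Keeping the qualifiers in a dictionary keyed by ID means the same helper can be reused for shots and goals once you know which IDs carry their coordinates.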
Now, I admit to flying blind, but I’ll go ahead and say what will not be in the data. There are IDs to match referees in the pre-match events, but I don’t know if there will be a file that cross-references those IDs to referees. Managers associated with each team are not present. There are no venue data, nor even a match attendance field. Weather conditions aren’t captured, either, but you can capture changes in conditions.
I’ll just say that all of the aforementioned fields are captured in the Match Result and Match Event database schemas developed by Soccermetrics.
In the next series of posts, I will introduce the Football Match Event Database and discuss how we will fill in the gaps of the Opta dataset.