This is the third part of my week-long presentation of the Football Match Result Database project. Here are the other articles in the series:
Part I: Selected screenshots
Part II: Motivation, design and implementation histories
Part IV: Code release
Part V: Roadmap and call for contributors
Yesterday I wrote a long article on the origins of the FMRD and the various turns I took in developing the database design and its data-entry application. In this article I will discuss the functionality of the design and application: its purpose, its main features, and the philosophical rationale for the design and implementation decisions that I made.
Last year, when I was writing the specifications document for the database, I wrote the following mission statement:
The purpose of the Football Match Result Database is to maintain the
match result data needed to support football research activities for
the benefit of the football analytics community.
The key phrase in that statement is match result data. I define it as historical and publicly-available data associated with the match which include the following:
- Maintain data on matches that occur within football league competitions, whether among national teams or domestic clubs.
- Maintain complete top-level data on the football match, including match date, competition name and stage, participating teams, venues, and environmental conditions.
- Maintain complete data on macro-events that occur during a match, including goals, penalties, disciplinary incidents, and substitutions.
- Maintain complete data on participating personnel, such as players, managers, and match referees.
I focus on match result data because I want to make FMRD as backward-compatible as possible. Soccer is a game in which very few statistics have been collected historically, but the data that you can find just about anywhere at anytime is in those four bullets above.
The initial version of the FMRD models league competitions without a playoff. There are various competition formats in soccer that need to be modeled, but one has to start somewhere, and I chose to start with a model of the competition format most widely used in domestic and international football. The plan is to develop the league model first and then extend it to model knockout competitions and hybrid competitions (part league, part knockout) which are supersets of league competitions.
FMRD is Open Source
I had always planned on making the Football Match Result Database an open source project. There are a number of database designs for soccer out there that are open source, but I had own differences with them (which I won't go into). So I created the FMRD to teach myself database design and propose my own idea for a football database. Ultimately I would love to have the FMRD contribute toward a standard database design for storage of football match results, or at least a statement of best practices. I feel that the way for this to happen is for lots of people to use it and make comments, which is what I'm hoping for with this release.
Another reason for going open source was public relations. I state without apology that I formed Soccermetrics Research & Consulting with the intention of making money. But sports analytics started in the grassroots community, data collection and access have always been challenges, and in soccer analytics the data issue is even more of a challenge because of the sparseness of the data that exist in the public domain. I feel that the value added is more a result of the analysis than the retention of the data itself, and I want to help the soccer analytics community grow. So I have no problem with making this project open source.
Now, the design of the FMRD. I wanted to design a database that facilitated data integrity and relationship integrity, as well as a database that was extensible for future derivatives of the model. I start out by describing a football (league) match like this:
On a given day and time under certain weather conditions, a football match is played as the nth round of the league championship, at a given venue, between two teams with eleven starting players and a few substitutes coached by two managers, with a match referee in the middle.
Everything underlined is modeled. It's possible to drill down further, but we arrive fairly quickly at the atomic elements of the match. Once I have the atomic elements of the match, I combine them through relations to develop a model of the match data. I wanted to get away from tying a player or a manager to a team, or even a player to a position.
To give one example: Cristiano Ronaldo may play striker most of the time, but that doesn't mean that he couldn't play as a winger, or even a goalkeeper (perish the thought!). I'd like to be able to model that event if it happens, and not be tied to a player being pre-assigned a fixed position. For that reason i tied the player to a position only when it came time to place him in a match lineup. At the same time, it's a bit of a pain from an implementation standpoint to have to scroll through a list of positions and select the same one for a player, so I created a default position field in the Players table and used that default position to set the initial index in the match lineup entry form. Yes, it is one area of the database design that was made with implementation in mind, but it didn't cost me anything and the design remained atomic and relational.
Because I am so focused on getting down to the atomic elements of the match model, I end up with a lot of tables. In particular I have a lot of validation and linking tables. It made implementation a challenge, but the final result is a robust database that can be extended without breaking the original design. There is some data integrity enforced in the database, but most of it, as well as the business rules, are enforced by the data-entry application.
The FMRD is designed to build upon itself like an anvil. A wide base of validation and personnel tables support the Match table, which is tied to the Lineup table. The entries from the Lineup table, filtered by match, make connections with an wide "flange" of Match Events. (I need to include a cartoon later!) You can't enter a field in the Matches table without having records in the Teams, Venues, and Managers tables. And you can't make an entry in the Goals or other Match Events tables without making entries in the Lineups table first. Those conditions are enforced at the application level.
I got a little ahead of myself with the names of table groups in the previous paragraph, so here is a fuller explanation. The database tables are organized into five main groups. The Overview tables contain high-level data about the competition in which the football matches are played. The Personnel tables contain data on players, managers, and referees. The Match tables contain high-level data on the match itself and the time/weather conditions of the match. The Lineups table is in a group by itself as it links the player personnel to the match, and then links referenced players to the Match Event tables, which collect data on the main historical events of the match.
- Match Events
- Position Switches
There are also supporting tables that I use to enforce validation. These are also grouped in the above categories:
- Weather Conditions
- Field Positions
- Flank Positions
- Positions (composite)
- Match Events
- Goal Events
- Goal Strikes
- Penalty Outcomes
(Upon further review, the Weather Conditions table should be listed in the Match group because it is a validation table for Environments, but at the time I didn't want to have two one-member sub-menus. It is not much trouble to rearrange the menu and the subsequent groupings.)
Now I write about the features of this database design and application.
- Facilitate study of league matches in one competition or over multiple competitions.
- Model not just the match events, but also the geographical and environmental information associated with the match such as:
- Weather conditions (using NOAA/NWS standard terms)
- Collect latitude/longitude information on match venues
- Capture data on generic on-field positions, and field positions modified by flank.
- Decouple players from a specific match or position until the Lineups table.
- Decouple managers from a specific match until the Matches table.
- Capture elapsed time in each half of a league match (including stoppage time), even if a match has been suspended before the completion of regulation time.
- Capture data from the following match events:
- Goals scored in open play or set-pieces (not penalty kicks)
- All penalty kick events
- All events that resulted in a disciplinary card shown by the referee.
- All substitutions
- All events in which a field player switched positions, whether tactical or forced (e.g. goalkeeper expulsion)
I'm sure I've forgotten about a feature or two, but right now I need to go to bed. The next post will be the one that you all have been waiting for — the code release. Sometime in the next 24-48 hours as I add the last feature and complete some more documentation.