I haven’t written about my software projects in a while, so I’m taking a moment to do so here.
My analytics work is supported by data and by software written to produce analytical content from that data. The software, which I’ve called Marcotti (formerly Football Match Result Database), is a library that creates match databases, loads data into them, and queries them in order to train models, identify patterns, or visualize information.
Marcotti-Events is a superset of Marcotti in that it also models the finely grained event data in a football match. I’ve used Marcotti-Events to load and access data in my analysis of Argentina’s Primera División/Superliga competitions as well as national team competitions in the past. It’s matured a lot over the past year, and I’m rather proud of how far it has come. The current release is v0.6.2, which doesn’t sound like much progress, but I’ve been slow to bump up the version number.
I’ve given some thought to the roadmap for Marcotti-Events, and these are the major features that I want to add or see added in future releases leading up to a 1.0 release:
- Transition the codebase to Python 3.5+. The codebase currently targets Python 2.7, which is fine for those who want to keep using Python 2.7, but most of the world has moved on to Python 3 (and 3.7 is in development!) and the old objections to transitioning are fading. Moving to Python 3 would allow us to fully embrace Unicode and exploit some of the language features that are unique to Python 3. Python 2.7 will continue to be supported for the foreseeable future.
- Fix and expand the test suite. This is a task that will impact the codebases for Python 2.7 and 3+. The test suite is currently broken: I focused on updating the model code and didn’t bother to update the tests (bad practice, I know). There are zero tests for the ETL code. At the same time, the repo has been rearranged as a proper Python package and configuration files are set up by a user command, so the test suite has to work in that environment.
- Merge club and national team schemas. I’ve kept the data models for clubs and national teams separate in order to distinguish clubs from countries and avoid repetition, but the costs of that decision were a complicated schema and ultimately a complicated codebase. Having Clubs and Countries inherit from an Organizations data model might remove the need for this distinction, which would simplify the codebase and allow club and national team data to be incorporated in the same database.
- Redesign and expand the library module. The library module contains routines and data structures that produce analytical content. I’ve created a lot of scripts to analyze match data, and it’s time to port some of the routines into this library. This is going to be a massive design challenge.
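
To make the schema merge a little more concrete, here is a minimal sketch of the inheritance idea in plain Python dataclasses. It is not Marcotti-Events code: every name in it (`Organization`, `Club`, `NationalTeam`, `Match`, `team_names`) and every attribute is hypothetical, chosen only to show how a shared parent could let one match record hold either kind of team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Organization:
    """Hypothetical common parent for anything that can field a team."""
    name: str


@dataclass(frozen=True)
class Club(Organization):
    """A club side; `country` is an illustrative attribute."""
    country: str


@dataclass(frozen=True)
class NationalTeam(Organization):
    """A national side; `confederation` is an illustrative attribute."""
    confederation: str


@dataclass(frozen=True)
class Match:
    """A match only needs two Organizations, whatever their subtype."""
    home: Organization
    away: Organization


def team_names(matches):
    """Return (home, away) name pairs without branching on team type."""
    return [(m.home.name, m.away.name) for m in matches]


# Club and national team fixtures sit side by side in one collection.
matches = [
    Match(Club("River Plate", "Argentina"), Club("Boca Juniors", "Argentina")),
    Match(NationalTeam("Argentina", "CONMEBOL"), NationalTeam("Brazil", "CONMEBOL")),
]
print(team_names(matches))
```

In a database-backed version the same idea would map onto some form of table inheritance in the ORM layer. Which form fits best is an open design question that this sketch does not settle.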
I will almost certainly think of more features to be added. Thanks to the analytics community for your interest in this project. Most analysts tend to do their own thing when it comes to crunching numbers, so I appreciate those who have found this useful in some way. I’m going to work on these features throughout the season, but if you’d like to join me, come visit the Marcotti-Events repo on GitHub and browse the code and wiki.