Soccermetrics’ Marcotti repository has finally reached an important milestone. The transition of the schema from SQL scripts to Pythonic data models is complete, and a next-generation set of extraction, transformation, and loading (ETL) tools has been ported successfully from other projects. It’s not a final version, but I feel comfortable assigning a version number at this point.
The initial data set that I used to debug the schema and ETL code comes from the Enhanced Data Project for MCFC Analytics. I’ve used this data set to demonstrate and test other data modeling projects within Soccermetrics, so it’s only fitting to return to this rich data. I have decided to make this data available on the ProjectData repository so that you can download it and ingest it into a Marcotti-formatted database or one of your own. To create a Marcotti club database and load the data, consult the Creating Databases page on the Marcotti wiki.
A lot of work remains on the ETL tool, such as accommodating other types of club competitions as well as national team competitions. Currently the extraction, transformation, and loading tasks are mingled in a single method, and they need to be separated to make the ETL tool more flexible. More sophisticated error handling and reporting would also be beneficial.
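To give a rough idea of what separating those tasks could look like, here is a minimal sketch using only the Python standard library. It is not Marcotti’s actual code: the function names (`extract`, `transform`, `load`, `run_pipeline`), the `player_stats` table, the column names, and the sample rows are all hypothetical and chosen only for illustration.

```python
import csv
import io
import sqlite3
from typing import Iterable, Iterator, TextIO, Tuple

# Hypothetical sample input; columns and values are illustrative only.
SAMPLE_CSV = """player,team,goals
 Alice Example ,Team A,2
Bob Example,Team B,
"""


def extract(source: TextIO) -> Iterator[dict]:
    """Extraction: read raw rows from a CSV source without cleaning them."""
    yield from csv.DictReader(source)


def transform(rows: Iterable[dict]) -> Iterator[Tuple[str, str, int]]:
    """Transformation: trim whitespace, cast types, default missing goals to 0."""
    for row in rows:
        goals = row["goals"].strip()
        yield (row["player"].strip(), row["team"].strip(), int(goals) if goals else 0)


def load(conn: sqlite3.Connection, records: Iterable[Tuple[str, str, int]]) -> int:
    """Loading: write cleaned records to the database and return how many were written."""
    records = list(records)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS player_stats (player TEXT, team TEXT, goals INTEGER)"
    )
    conn.executemany("INSERT INTO player_stats VALUES (?, ?, ?)", records)
    conn.commit()
    return len(records)


def run_pipeline(source: TextIO, conn: sqlite3.Connection) -> int:
    """Compose the three stages; each one can be swapped or tested on its own."""
    return load(conn, transform(extract(source)))


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    count = run_pipeline(io.StringIO(SAMPLE_CSV), connection)
    print(f"Loaded {count} records")
```

With the stages split this way, supporting a new competition type would mostly mean writing a different `transform`, while error handling and reporting could be attached to each stage individually.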
For now, it’s possible to shift attention from data modeling to data analysis, and that will involve building a true football analytics library, which I’ve built in pieces but never assembled into a whole. Should be exciting!