This is the second part of my week-long presentation of the Football Match Result Database project. Here are the other articles:
Part I: Selected screenshots
Part III: Database objective, design, and features
Part IV: Code release
Part V: Roadmap and call for contributors
I've looked back at my previous posts on the subject and realized that I have been writing about this project for almost two years! There are new visitors to this site all the time, so I think it would be good for my readers and me to revisit the origins of this project and the reasoning behind the decisions I made, with links to earlier posts included.
At the risk of sounding oxymoronic, I am a lazy overachiever at heart. Well, the good kind of lazy, anyway. I believe that every analyst should go to the trouble of collecting data manually for at least one analysis project. I guarantee you that the data collection will be so painful that you will search desperately for ways to automate it, so that you enter data once and access it many times.
I found myself in this situation shortly after I started Soccermetrics, the website. I had, and still do have, some analysis ideas that I felt were really good, but the act of collecting and arranging data so that I could then write a script to act on them was extremely daunting. After a couple of analysis projects, I realized that my current process wasn't going to work, so I set out to make my task of data analysis easier.
I am also a bit of a control freak. I looked at the various online databases and did not find any with the functionality or completeness that I was looking for, so I decided to create my own. I also wanted to have control over the database design and format, which motivated me even further to create my own database. Now that Soccermetrics, the website, has become Soccermetrics, the company, the development of a high-quality database has become mandatory.
I decided early on in the project that the database would be a collection of the macro-events associated with a football match. These events would include the competition under which the match was played, the match venue and date, referee information, competing teams, participating players and managers, and the major match events — goals, penalties, disciplinary incidents, and substitutions. I did not model micro-events associated with a football match because it was much more difficult to collect those data and the data would often come from proprietary sources. (I write in more detail about micro- and macro-statistics in soccer here.)
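To give a concrete sense of what "macro-events" means at the schema level, here is a minimal sketch. The table and column names are my own illustration, not the ones in the actual database, and I use Python's built-in sqlite3 module purely for brevity:

```python
import sqlite3

# Illustrative macro-event schema: all names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE competitions (
        competition_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE teams (
        team_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    -- One row per match: the macro-context (competition, date, venue, teams).
    CREATE TABLE matches (
        match_id INTEGER PRIMARY KEY,
        competition_id INTEGER NOT NULL REFERENCES competitions,
        match_date TEXT NOT NULL,
        venue TEXT,
        home_team_id INTEGER NOT NULL REFERENCES teams,
        away_team_id INTEGER NOT NULL REFERENCES teams
    );
    -- Macro-events hang off the match; goals shown here, with penalties,
    -- cards, and substitutions following the same one-table-per-event style.
    CREATE TABLE goals (
        goal_id INTEGER PRIMARY KEY,
        match_id INTEGER NOT NULL REFERENCES matches,
        team_id INTEGER NOT NULL REFERENCES teams,
        scorer TEXT NOT NULL,
        minute INTEGER NOT NULL
    );
""")
```

A real design would of course also cover referees, managers, and player participation, but the pattern is the same: one table per entity or event type, tied back to the match by key.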
I had the goal of developing a database that was "flexible, searchable, and extensible." It needed to be thought out well enough that the atomic events related to a football match were modeled, yet flexible enough to account for the various types of competitions in the game (league, knockout, hybrid) and extensible enough to collect specialized data on certain atomic entities. Getting to that point, however, was a challenge for someone new to database design.
My first attempt at a database was laughably pathetic — it's only after looking back two years later that I realize just how bad it was. Simon H. (you know who you are) took mercy on me and gave me some very valuable suggestions via email, although at the time I didn't know what "normalizing" meant. I actually tried to implement my design before I had completed it — bad Howard, bad! — and made such a hash of it that I went quiet on the subject for almost six months. I knew that I needed to improve my expertise in database design, so I purchased what turned out to be a valuable book on conceptual database design. An excellent, excellent book. It helped me learn the design elements quickly so that I could create a detailed specification report for my database. I would not call myself a database pro, but I am much more comfortable with relational database design as a result of reading that book. I cannot recommend it enough.
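For readers who, like me back then, are fuzzy on what "normalizing" means: it is the discipline of moving repeated data into its own table so each fact is stored exactly once. A tiny example of my own (not from Simon's emails or the book):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Un-normalized: team names are repeated on every match row, so renaming
# a team means updating many rows and risking inconsistent spellings.
conn.execute("""
    CREATE TABLE matches_flat (
        match_id INTEGER PRIMARY KEY,
        home_team_name TEXT,
        away_team_name TEXT
    )
""")

# Normalized: each team name lives in exactly one row, and matches
# reference teams by key instead of repeating the name.
conn.executescript("""
    CREATE TABLE teams (team_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE matches (
        match_id INTEGER PRIMARY KEY,
        home_team_id INTEGER REFERENCES teams,
        away_team_id INTEGER REFERENCES teams
    );
""")
```

In the normalized version a rename is a one-row update to `teams`, and every match automatically reflects it.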
With a design for my database in place, I could progress to the next challenge: implementing it.
No matter how great the database design is, it's not worth much without a way to load data into it, so I wrote a data-entry application. Database design and the implementation of that design in an application are two distinct tasks. There are lessons learned during implementation that feed back into the database design, but it is not a good idea to carry out a design with a specific database management system or high-level language in mind, so I focused entirely on the design before shifting over to implementation.
The immediate question people will have is "Why didn't you develop a web-based application?" It's a very good question, and I doubt that my answer will convince people that I went about it the right way. Those who suggest a web-based application have a good argument: no extra installation required, self-contained within a web browser, remote data entry enabled, and so on. I have more experience developing desktop-based applications, so it ended up being a matter of sticking with what I was more familiar with. I was concerned about security, local testing, and look-and-feel, and I wanted to work out whatever issues I had in the desktop app before I tackled a web-based one. Of course, I could have addressed those issues in a web-based app as well. Some claim that a web-based application is easier to implement, but for this task I'm not sure it is that much easier than a desktop application. In the end it comes down to preference, and I preferred a desktop app, but a web-based app is very much on the development roadmap.
My problem was finding a development framework that gave me the functionality I desired combined with a professional look-and-feel. I wanted the finished product to be something that I would love using, and something that other users, analysts, and club employees would enjoy using as well. It was a long route to the final framework selection.
(I should also state in passing that the user interface was a third element of the project, and was intertwined with the implementation. I devoted a lot of sheets of paper to the design of the dialogs, using insights from my own experience and intuition on how a quality user interface should look.)
I started with OpenOffice Base because it was easily available — built into the package if you are using OpenOffice.org 3.0. I was able to implement the basic tables, but I was frustrated by the lack of documentation and by what was, frankly, an ugly user interface that would have taken far too much effort to make attractive. I also had to learn how to build and code forms at the same time. Eventually I became bogged down when I attempted to design subforms and forms with linking tables. I figured out how to get the subforms to work, but the complicated tables, such as the Matches and Lineups tables, were extremely difficult to implement in OpenOffice Base. I gave up after a few months and searched for something else.
I found an XML-based framework called Bond, developed by a New Zealand software company. Its backend was based on Postgres, which influenced my decision to implement my database there. I had not been wedded to one relational database over another, but I came to prefer Postgres for its native support of views and other relational database features. The GUI look-and-feel was closer to what I was looking for, but again I bogged down on subforms and linking tables, which were critical to the database. Coding in XML was VERY VERY painful, all the more so because the documentation was sparse for some critical API functions. Worst of all, the forms would crash with segmentation faults that I could not track down!
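Views are worth a quick illustration, since they were a big part of why I warmed to Postgres: a view stores a query as a virtual table, so a derived result you need often can be selected as if it were a base table. A sketch with made-up names, using SQLite in place of Postgres since both support basic views:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE matches (match_id INTEGER PRIMARY KEY, venue TEXT);
    CREATE TABLE goals (
        goal_id INTEGER PRIMARY KEY,
        match_id INTEGER REFERENCES matches,
        scorer TEXT
    );
    -- The view packages an aggregate query as a queryable virtual table.
    CREATE VIEW goals_per_match AS
        SELECT m.match_id, COUNT(g.goal_id) AS goal_count
        FROM matches m
        LEFT JOIN goals g ON g.match_id = m.match_id
        GROUP BY m.match_id;
""")
```

Once defined, `SELECT * FROM goals_per_match` behaves like any other table, which keeps reporting queries short and the aggregation logic in one place.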
I put the implementation on ice for a few months, wondered if I had gone down the wrong path, and got progressively more frustrated with the whole thing.
Out of desperation I made another internet search for a GUI database application and came across the Nokia Qt toolkit. It was clean, it was professional, it was extensive, it had excellent database functionality, and it was extremely well-documented and supported — but it was in C++, and I wasn't thrilled about the prospect of coding in C++. It was then that I found this book by Mark Summerfield on rapid GUI development using PyQt, a Python binding for Qt, and I realized that I had to buy it. Yes, I would have to learn Python, but that had been on my to-do list anyway, so there was no time like the present.
Designing and coding all of the modules and dialogs in Qt Designer and Python — at least 30 of them, not including error and information pop-up windows — took about 6-7 months, with a lot of learning involved. The look-and-feel of the dialogs was great, and implementation of the simple tables was a breeze. Linking-table implementation was a challenge because it isn't a built-in feature of Qt, but after some trial and error I arrived at a solution that worked very well. There were some modifications to the database due to lessons learned during implementation, but I now have a finished product that implements all of the features I envisioned and looks great.
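The linking tables that caused so much grief, both here and in the earlier frameworks, resolve many-to-many relationships: a match has many players and a player appears in many matches, so a table holds one row per (match, player) pair. Schematically, with illustrative names and plain SQL rather than my actual PyQt model code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE matches (match_id INTEGER PRIMARY KEY);
    CREATE TABLE players (player_id INTEGER PRIMARY KEY, name TEXT);
    -- Linking table: one row per player appearance in a match.
    CREATE TABLE lineups (
        match_id  INTEGER REFERENCES matches,
        player_id INTEGER REFERENCES players,
        starter   INTEGER NOT NULL,      -- 1 = starting XI, 0 = substitute
        PRIMARY KEY (match_id, player_id)
    );
""")
```

The SQL side is simple; the difficulty in a GUI framework is that a form editing one match must present and update rows in this third table behind the scenes, which is the part Qt left me to build myself.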
I did not plan on taking close to two years to develop this database and its implementation, but I learned a tremendous amount about database development, Python and Perl programming, and GUI development that will serve me well in future projects.
In the next post, I will discuss selected features of the database application and my development roadmap.