I'm using the 2008-09 English Premier League as a test case for the three-parameter Weibull distribution that underlies the Pythagorean formula. Below I present the exponent values that I obtained from the least-squares estimation, as well as a few plots that show how well the resulting Weibull distribution fits the score data.
I compiled match result data (goals scored and allowed) for all 20 teams in the English Premiership, resulting in 40 goal distributions. The highest number of goals any team scored in a match was six, so I formed a histogram with seven bins (0-6). (This is simple to do: Excel has commands that count the number of cells in a block equal to a given value.) Then I fed the resulting histogram into a Scilab script that I had written to implement the least-squares algorithm. The result was an alpha term, which is a scaling parameter, and a gamma term, which is the shape parameter. Both are important for matching the distribution to the data, but the gamma term is especially important for the Pythagorean formula. To make that formula work, we have to assume that the goalscoring distributions of all the teams involved have the same shape.
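For readers who want to try the histogram step without Excel or Scilab, here is a minimal Python sketch. It is not the Scilab script described above. The function names are made up for illustration, the match list is invented rather than taken from the spreadsheet, and the shift beta = -1/2 (which centers each bin on a whole number of goals) is an assumption borrowed from Miller's baseball paper.

```python
import math
from collections import Counter


def goal_histogram(goals_per_match, max_goals=6):
    """Count matches with 0, 1, ..., max_goals goals (seven bins by default)."""
    counts = Counter(goals_per_match)
    return [counts.get(k, 0) for k in range(max_goals + 1)]


def weibull_bin_probs(alpha, gamma, n_bins=7, beta=-0.5):
    """Probability that a shifted Weibull variable falls in each goal bin.

    Bin k covers [k - 1/2, k + 1/2). alpha is the scale, gamma the shape,
    and beta the shift (assumed to be -1/2 here).
    """
    def cdf(x):
        z = (x - beta) / alpha
        return 1.0 - math.exp(-(z ** gamma)) if z > 0 else 0.0

    return [cdf(k + 0.5) - cdf(k - 0.5) for k in range(n_bins)]


# Invented ten-match sample, NOT real Premier League data.
goals = [0, 1, 1, 2, 0, 3, 1, 6, 2, 1]
hist = goal_histogram(goals)               # [2, 4, 2, 1, 0, 0, 1]
freqs = [c / len(goals) for c in hist]     # relative frequencies

# Sum of squared errors for one trial pair (alpha, gamma); the
# least-squares fit searches for the pair that makes this smallest.
model = weibull_bin_probs(alpha=1.8, gamma=1.6)
sse = sum((m - f) ** 2 for m, f in zip(model, freqs))
```

The seven model probabilities add up to slightly less than 1, because the Weibull tail beyond six goals is not covered by any bin.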
(And in case you don't know what Scilab is – it is a numerical computation package developed by INRIA (the French national laboratory in informatics). It's a very good package, comparable to MATLAB (another high-quality numerical package, developed in the USA), but much less expensive. A MATLAB license costs close to US$10,000, while Scilab is free. I love MATLAB, but Scilab is almost as good if you don't have a lot of coin. End digression.)
So here is the spreadsheet that contains the raw data:
and here is my short explanation of the least-squares algorithm applied to this problem:
And here are the exponents for the twenty teams in the 2008-09 Premiership:
||GF Exponent||GA Exponent|
Here are a couple of examples that illustrate how the parameter estimates fit the goal distributions. Here is the goals scored distribution for Hull City:
Now, here's the goals scored distribution for Manchester United:
The fit works well at low goal totals, but falls apart at three goals or more. That is a feature of the Weibull distribution when it comes to modeling scores in the far tail of the distribution. Manchester United's goals allowed distribution has a similar problem:
This kind of strongly skewed distribution ended up breaking the least-squares algorithm. The Jacobian matrix that I used to update the parameter estimates was so ill-behaved that I had to take tiny steps just to ensure a stable iteration. It ended up requiring over 300 iterations just to achieve two-digit accuracy, and even then the results aren't that great. This type of distribution points to the need for a better probability distribution, like an extreme value distribution, and demonstrates just how freaky Manchester United's defensive record was last season (24 clean sheets!).
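To make the "tiny steps" idea concrete, here is a hedged Python sketch of a damped Gauss-Newton iteration for one goal distribution. It is not the Scilab code. The function names, the starting values and the `step` parameter are hypothetical, and it again assumes a Weibull shifted by beta = -1/2, so that the probability of k goals is S(k) - S(k+1) with S(x) = exp(-(x/alpha)^gamma). The demo fits synthetic frequencies generated from the model itself, not the Manchester United data.

```python
import math


def bin_probs_and_jacobian(alpha, gamma, n_bins=7):
    """Model probabilities p_k and their partials (dp/dalpha, dp/dgamma)."""
    S, dS_da, dS_dg = [1.0], [0.0], [0.0]          # values at x = 0
    for k in range(1, n_bins + 1):
        u = (k / alpha) ** gamma
        s = math.exp(-u)
        S.append(s)
        dS_da.append(s * u * gamma / alpha)
        dS_dg.append(-s * u * math.log(k / alpha))
    p = [S[k] - S[k + 1] for k in range(n_bins)]
    jac = [(dS_da[k] - dS_da[k + 1], dS_dg[k] - dS_dg[k + 1])
           for k in range(n_bins)]
    return p, jac


def fit_weibull(freqs, alpha=1.5, gamma=1.5, step=0.5,
                max_iter=500, tol=1e-10):
    """Damped Gauss-Newton on the residuals r_k = p_k - freqs[k].

    `step` scales each update; a badly conditioned Jacobian may need a
    much smaller value (and many more iterations) to stay stable.
    """
    n = len(freqs)
    for _ in range(max_iter):
        p, jac = bin_probs_and_jacobian(alpha, gamma, n)
        r = [p[k] - freqs[k] for k in range(n)]
        # Normal equations (J^T J) d = -J^T r, solved by Cramer's rule.
        a11 = sum(j[0] * j[0] for j in jac)
        a12 = sum(j[0] * j[1] for j in jac)
        a22 = sum(j[1] * j[1] for j in jac)
        b1 = -sum(j[0] * rk for j, rk in zip(jac, r))
        b2 = -sum(j[1] * rk for j, rk in zip(jac, r))
        det = a11 * a22 - a12 * a12
        d_alpha = (b1 * a22 - a12 * b2) / det
        d_gamma = (a11 * b2 - a12 * b1) / det
        # Shrink the step further if it would make a parameter non-positive.
        t = step
        while alpha + t * d_alpha <= 0 or gamma + t * d_gamma <= 0:
            t *= 0.5
        alpha += t * d_alpha
        gamma += t * d_gamma
        if abs(d_alpha) + abs(d_gamma) < tol:
            break
    return alpha, gamma


# Synthetic check: frequencies generated from known parameters.
target, _ = bin_probs_and_jacobian(1.8, 1.6)
alpha_hat, gamma_hat = fit_weibull(target)
```

On well-behaved data like this synthetic example, the determinant of J^T J stays comfortably away from zero. When it gets close to zero, the update can overshoot badly, which is the situation where shrinking `step` becomes necessary.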
Looking at the exponent results, I now realize that I did things a little differently from Steven Miller's paper. I ran a least-squares fit on each goal distribution (scored or allowed) separately; Miller performed least-squares on the sum of the runs scored distribution and the runs allowed distribution, resulting in two alpha terms and one exponent term that applied to both distributions. So I need to go back and figure out how to apply that kind of least-squares algorithm to the score data.
In the end, I've made some important steps forward with the goal distributions and fitting the Weibull distribution to them, but I don't have everything I need to compute the Pythagorean yet.
UPDATE: I figured out what I needed to do; it's just an expanded Jacobian matrix. I'll update my explanation of the algorithm to include that change. The funny thing was that I had made the modification to my code and was wondering why I was getting such weird results (probability curve not adding up to 1, huge gamma exponent). After banging my head on the desk for a few hours, I realized that I had changed the wrong variable name in my code! It happens to all of us sometimes!
Anyway, the code is running well and the exponents aren't varying as widely as they did before. But I haven't started to crunch the Manchester United histograms yet. It's too late to post anything else tonight, but I'll show my results tomorrow evening.
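For the curious, here is one way the expanded Jacobian for the joint fit could look, sketched in Python rather than Scilab. This is a guess at the general structure, not the actual code. The function names and starting values are hypothetical, the shift beta = -1/2 is assumed as in Miller's paper, and the demo uses synthetic frequencies. The goals-for and goals-against residuals are stacked into one vector, and each block of the Jacobian has a zero column for the other distribution's alpha and a shared gamma column.

```python
import math


def probs_and_derivs(alpha, gamma, n_bins=7):
    """p_k = S(k) - S(k+1), S(x) = exp(-(x/alpha)**gamma), with partials."""
    S, Sa, Sg = [1.0], [0.0], [0.0]
    for k in range(1, n_bins + 1):
        u = (k / alpha) ** gamma
        s = math.exp(-u)
        S.append(s)
        Sa.append(s * u * gamma / alpha)
        Sg.append(-s * u * math.log(k / alpha))
    p = [S[k] - S[k + 1] for k in range(n_bins)]
    da = [Sa[k] - Sa[k + 1] for k in range(n_bins)]
    dg = [Sg[k] - Sg[k + 1] for k in range(n_bins)]
    return p, da, dg


def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[piv] = M[piv], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for j in range(c, n + 1):
                M[i][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_joint(freq_for, freq_against, start=(1.5, 1.5, 1.5), step=0.5,
              max_iter=500, tol=1e-10):
    """Damped Gauss-Newton for (alpha_for, alpha_against, shared gamma)."""
    x = list(start)
    n = len(freq_for)
    for _ in range(max_iter):
        pf, daf, dgf = probs_and_derivs(x[0], x[2], n)
        pa, daa, dga = probs_and_derivs(x[1], x[2], n)
        # Stacked residuals: goals-for bins first, then goals-against bins.
        r = ([p - f for p, f in zip(pf, freq_for)]
             + [p - f for p, f in zip(pa, freq_against)])
        # Expanded Jacobian: columns are alpha_for, alpha_against, gamma.
        J = ([[daf[k], 0.0, dgf[k]] for k in range(n)]
             + [[0.0, daa[k], dga[k]] for k in range(n)])
        JtJ = [[sum(row[i] * row[j] for row in J) for j in range(3)]
               for i in range(3)]
        Jtr = [-sum(row[i] * rk for row, rk in zip(J, r)) for i in range(3)]
        d = solve(JtJ, Jtr)
        # Keep all three parameters positive.
        t = step
        while any(xi + t * di <= 0 for xi, di in zip(x, d)):
            t *= 0.5
        x = [xi + t * di for xi, di in zip(x, d)]
        if sum(abs(di) for di in d) < tol:
            break
    return tuple(x)


# Synthetic check: two distributions sharing one gamma.
pf_true, _, _ = probs_and_derivs(1.8, 1.6)
pa_true, _, _ = probs_and_derivs(1.4, 1.6)
alpha_for, alpha_against, gamma_shared = fit_joint(pf_true, pa_true)
```

Mixing up which column belongs to which parameter is an easy mistake in this layout, and it produces nonsense estimates without raising any error.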