A closer look at Ajax’s goal performance

I'm fascinated by Ajax's performance in the league this season, and intrigued by the large gap between their Pythagorean estimate and their actual point total.  To be sure, FC Twente also greatly outperformed their Pythagorean estimate.  But Ajax's record deserves closer scrutiny because their goalscoring record was so ridiculous.

Most league teams exhibit skewness, or asymmetry, in their goalscoring distribution, in almost all cases to the right.  This is called positive skew in the literature.  To give one example, here is Manchester United's goals scored distribution from last season's English Premier League:


The distribution is concentrated between one and two goals, with the higher-scoring matches spread out to the right.  It's fairly typical of most teams, including league winners like Man U last season.

For completeness, here's Man U's goals allowed distribution from last season:

[Figure: ManUtd0809_GoalsAgainst]

Here the skewness is more pronounced: the defense kept a clean sheet in a majority of matches and allowed very few goals in the rest.  (Man U's defense allowed three or more goals in only two league matches in 2008-09.)  This is typical of league winners; United's record is very good compared to most champions, but it's not unusual for a champion's defensive record to be so positively skewed.

When we develop the Pythagorean exponent, we come up with parameter estimates that fit the offensive and defensive distributions simultaneously.  The big assumption is that both goals scored and goals allowed follow a Weibull distribution.  Because we have to fit both distributions at once, we can't capture every feature of each one, but we can capture enough to keep the estimate within an acceptable tolerance.
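To make the Weibull assumption concrete, here is a minimal Python sketch.  It assumes the simplest continuous form of the model, the one from the baseball derivation, in which goals for and goals against are independent Weibull variables that share a single shape parameter `gamma`.  The function names and the scale values are illustrative only, and the soccer version of the formula that accounts for draws is not reproduced here.

```python
import random


def pythagorean_win_prob(scale_for, scale_against, gamma):
    """Closed-form P(X > Y) for independent Weibull variables X and Y
    that share the shape parameter gamma (assumed model, no draws)."""
    a = scale_for ** gamma
    b = scale_against ** gamma
    return a / (a + b)


def simulated_win_prob(scale_for, scale_against, gamma, trials=200_000, seed=1):
    """Monte Carlo check of the closed form using the standard library."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # random.weibullvariate(alpha, beta): alpha is the scale, beta the shape
        x = rng.weibullvariate(scale_for, gamma)
        y = rng.weibullvariate(scale_against, gamma)
        if x > y:
            wins += 1
    return wins / trials


if __name__ == "__main__":
    # Hypothetical scale and shape values, chosen only for illustration
    print(pythagorean_win_prob(2.0, 1.0, 1.5))
    print(simulated_win_prob(2.0, 1.0, 1.5))
```

The two printed numbers should agree closely, which is the sense in which a single exponent ties both distributions together in this simplified setting.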

Now, here's Ajax's goals scored distribution:

[Figure: Ajax_gf]

Ajax's scoring record was quite different from what I've seen from most teams, including league winners.  Their scoring distribution is strongly negatively skewed: the bulk of their matches sit toward the high end of the goals axis, with the tail stretching back toward the low-scoring matches.  In other words, they scored goals in bunches a lot during the season.  (The bars in the histogram display the raw frequency of goals scored between the intervals on the goals axis; the blue line is the probability density.)

Their defensive record is even more amazing:

[Figure: Ajax_ga]

Much was said about Ajax's scoring prowess this season, but not as much about their defense.  They allowed just 20 goals in the league, and only four at home!  The distribution shows very strong positive skewness and a high level of peakedness (kurtosis).  (I'm still learning R and haven't figured out how to smooth the density properly.)

I believe that this combination of scoring distributions causes the Pythagorean estimate to break down.  The league Pythagorean exponent produces a probability distribution that will never be able to correspond well to both the offensive and defensive distributions under these conditions.  I am willing to guess that this happens during the truly historic seasons by league winners, whether it's Barcelona last season in La Liga or perhaps Chelsea in the Premiership this season.
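A rough sketch of why this might happen, under one stated assumption: that the model gives the goals-for and goals-against distributions the same Weibull shape parameter, as in the baseball derivation.  The skewness of a Weibull distribution depends only on its shape, so two distributions sharing a shape must also share the sign of their skewness.  The function name below is illustrative.

```python
from math import gamma as gamma_fn


def weibull_skewness(shape):
    """Skewness of a Weibull distribution with the given shape parameter.
    The scale (and any location shift) cancels out, so it is omitted."""
    g1 = gamma_fn(1 + 1 / shape)
    g2 = gamma_fn(1 + 2 / shape)
    g3 = gamma_fn(1 + 3 / shape)
    variance = g2 - g1 ** 2
    third_central = g3 - 3 * g1 * g2 + 2 * g1 ** 3
    return third_central / variance ** 1.5


if __name__ == "__main__":
    # Shape values picked only to show the trend
    for shape in (1.0, 2.0, 3.0, 4.0, 5.0):
        print(shape, round(weibull_skewness(shape), 3))
```

The skewness is positive for small shape values and turns negative somewhere between a shape of 3 and 4.  If that shared-shape assumption holds, a negatively skewed attack paired with a positively skewed defense cannot both be matched by one exponent.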

I am thinking that there needs to be an estimate of second-order points that would be added to the initial Pythagorean estimate.  In contrast to the second-order wins formulated for baseball, I seriously doubt that it can be determined by looking at the box score.  One possible route is to consider the moments of the goals distributions about the mean (variance, skewness, kurtosis) and determine their effect on points won during a league season.  That's one approach; I'm open to others.
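As a starting point for the moments idea, here is a minimal Python sketch that computes sample skewness and excess kurtosis from a list of goals per match.  The function names are my own, and the goal lists are made up for illustration; they are not Ajax's or anyone else's real results.

```python
def central_moment(values, order):
    """Sample central moment of the given order (population form, divides by n)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Third standardized moment: positive means a tail to the right."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth standardized moment minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


if __name__ == "__main__":
    # Hypothetical goals-per-match samples, NOT real data
    right_tailed = [0, 0, 1, 1, 1, 1, 2, 2, 3, 5]
    left_tailed = [5 - g for g in right_tailed]  # mirror image of the first list

    for name, goals in (("right-tailed", right_tailed), ("left-tailed", left_tailed)):
        print(name, skewness(goals), excess_kurtosis(goals))
```

The mirrored list has the same variance and kurtosis as the first but the opposite sign of skewness, which is the kind of difference a second-order adjustment would need to pick up.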