# Are there common features of teams with large Pythagorean variances?

My last post has sparked a question in my soccermetric mind: Are there common features in the offensive and defensive goal distributions for teams with large Pythagorean variances?  The Soccer Pythagorean works well at assessing the level of team performance relative to expectations from their goal statistics.  It can even predict point totals within a relatively narrow margin (4-6 points).  The estimation falls short with teams that have extremely lopsided offensive goal statistics.

A couple of days ago I looked at Ajax Amsterdam's goalscoring distributions from this season and observed that their offensive distribution is skewed in the opposite direction from that of more typical goal distributions.  More importantly, the offensive distribution was skewed in the opposite direction from the defensive goal distribution, which would make the curvefit of the underlying distribution very difficult.  To find out if that was also the case for other teams with extremely lopsided goal statistics, I took a look at Barcelona's record in the Spanish Primera last season when they won The Treble.  Below is a histogram and a smoothed probability density of their goal offense (horizontal axis is number of goals, vertical axis is probability from 0 to 1):
(As you can see, I'm starting to get the hang of using R. 🙂 )

Here is the same type of plot with Barcelona's goal defense:

And finally, here's the final league table from the 2008-09 season with Pythagorean estimates:

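| Team | GP | GF | GA | Pts | Pythag | +/- |
|---|---|---|---|---|---|---|
| Barcelona | 38 | 105 | 35 | 87 | 76 | +11 |
| Real Madrid | 38 | 83 | 52 | 78 | 65 | +13 |
| Sevilla | 38 | 54 | 39 | 70 | 62 | +8 |
| Atlético Madrid | 38 | 80 | 57 | 67 | 61 | +6 |
| Villarreal | 38 | 61 | 54 | 65 | 56 | +9 |
| Valencia | 38 | 68 | 54 | 62 | 59 | +3 |
| Deportivo La Coruña | 38 | 48 | 47 | 58 | 52 | +6 |
| Málaga | 38 | 55 | 59 | 55 | 50 | +5 |
| Mallorca | 38 | 53 | 60 | 51 | 48 | +3 |
| Espanyol | 38 | 46 | 49 | 47 | 50 | -3 |
| Almería | 38 | 45 | 61 | 46 | 42 | +4 |
| Racing Santander | 38 | 49 | 48 | 46 | 52 | -6 |
| Athletic Bilbao | 38 | 47 | 62 | 44 | 43 | +1 |
| Sporting de Gijón | 38 | 47 | 79 | 43 | 35 | +8 |
| Osasuna | 38 | 41 | 47 | 43 | 47 | -4 |
| Valladolid | 38 | 46 | 58 | 43 | 44 | -1 |
| Getafe | 38 | 50 | 56 | 42 | 48 | -6 |
| Betis | 38 | 51 | 58 | 42 | 48 | -6 |
| Numancia | 38 | 38 | 69 | 35 | 33 | +2 |
| Recreativo | 38 | 34 | 57 | 33 | 36 | -3 |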

Now, Barcelona's goal distributions are different from Ajax's in that they are both skewed in the same direction, which is typical of most teams.  What sets Barcelona's offensive distribution apart is a second peak at six goals, which makes it what statisticians call a bimodal distribution.  Last season's team scored six goals in a higher proportion of matches than they scored zero, four, or five.  A Weibull curvefit would miss about half of that occurrence, which could explain the discrepancy in the final Pythagorean estimation.

Let's assume that the current curve fit estimates that Barcelona will score six goals in 5% of its matches, or about two matches (0.05 × 38 = 1.9).  If Barcelona scores six goals in a match, the chances of them winning the match are very good, almost 100% in fact, so let's assume they take all three points in those games.  The difference between the curve fit and reality is about 7%, or about three matches (0.07 × 38 = 2.66).  At three points per win, the failure to pick up the second mode in Barca's goalscoring distribution creates a discrepancy of about nine points, which is just about the entire Pythagorean variation.
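Here is that back-of-the-envelope arithmetic as a quick Python sketch.  The 5% and 7% shares are the rough figures from the paragraph above, and the variable names are just illustrative.  It also keeps the simplifying assumption that every six-goal match is a win worth three points, so treat the result as a rough upper bound rather than an exact accounting.

```python
MATCHES = 38          # matches in a Primera season
POINTS_PER_WIN = 3

# Rough figures from the text (assumptions, not fitted values):
fit_share = 0.05      # share of matches the curve fit assigns to six goals
missed_share = 0.07   # extra share of six-goal matches the fit does not capture

fit_matches = fit_share * MATCHES        # about 1.9 matches
missed_matches = missed_share * MATCHES  # about 2.66 matches

# Round to whole matches and assume each one is a win.
missed_points = round(missed_matches) * POINTS_PER_WIN

print(f"Six-goal matches per the fit: {fit_matches:.2f}")
print(f"Six-goal matches the fit misses: {missed_matches:.2f}")
print(f"Points unaccounted for: {missed_points}")
```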

So it seems that a change in the distribution skewness doesn't have to be present to produce large changes in the Pythagorean estimate.  Bimodal distributions can have the same effect.

UPDATE (9 May): You know, maybe there's not much of a difference.  I looked through my code and noticed that the win/draw probability calculations only consider scenarios where a team has scored up to five goals in a game.  That's usually sufficient for most leagues, but not in the Spanish league last season.  I increased the upper limit to ten and recalculated, and this is what I got:

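| Team | GP | GF | GA | Pts | Pythag | +/- |
|---|---|---|---|---|---|---|
| Barcelona | 38 | 105 | 35 | 87 | 87 | 0 |
| Real Madrid | 38 | 83 | 52 | 78 | 69 | +9 |
| Sevilla | 38 | 54 | 39 | 70 | 62 | +8 |
| Atlético Madrid | 38 | 80 | 57 | 67 | 65 | +2 |
| Villarreal | 38 | 61 | 54 | 65 | 57 | +8 |
| Valencia | 38 | 68 | 54 | 62 | 61 | +1 |
| Deportivo La Coruña | 38 | 48 | 47 | 58 | 52 | +6 |
| Málaga | 38 | 55 | 59 | 55 | 50 | +5 |
| Mallorca | 38 | 53 | 60 | 51 | 48 | +3 |
| Espanyol | 38 | 46 | 49 | 47 | 50 | -3 |
| Almería | 38 | 45 | 61 | 46 | 42 | +4 |
| Racing Santander | 38 | 49 | 48 | 46 | 53 | -7 |
| Athletic Bilbao | 38 | 47 | 62 | 44 | 43 | +1 |
| Sporting de Gijón | 38 | 47 | 79 | 43 | 35 | +8 |
| Osasuna | 38 | 41 | 47 | 43 | 47 | -4 |
| Valladolid | 38 | 46 | 58 | 43 | 44 | -1 |
| Getafe | 38 | 50 | 56 | 42 | 48 | -6 |
| Betis | 38 | 51 | 58 | 42 | 48 | -6 |
| Numancia | 38 | 38 | 69 | 35 | 33 | +2 |
| Recreativo | 38 | 34 | 57 | 33 | 36 | -3 |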

Spot.  On.
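To see why the goal cap matters so much for a high-scoring side, here is a minimal Python sketch of the truncation effect.  It is not my actual code: it swaps in independent Poisson goal distributions (an assumption made purely for illustration, in place of the Weibull fit) with means taken from Barcelona's GF and GA per game, and the function names are hypothetical.  The point is only that summing win/draw probabilities over scorelines up to five goals drops real probability mass for a team averaging nearly three goals a game.

```python
from math import exp, factorial


def poisson_pmf(k, mean):
    """Probability of exactly k goals under a Poisson model (illustrative stand-in)."""
    return exp(-mean) * mean ** k / factorial(k)


def expected_points(gf_per_game, ga_per_game, games, max_goals):
    """Expected points, summing scorelines with 0..max_goals goals per side."""
    win = draw = 0.0
    for scored in range(max_goals + 1):
        p_scored = poisson_pmf(scored, gf_per_game)
        for conceded in range(max_goals + 1):
            p = p_scored * poisson_pmf(conceded, ga_per_game)
            if scored > conceded:
                win += p
            elif scored == conceded:
                draw += p
    # Three points for a win, one for a draw; no renormalization after truncation.
    return games * (3 * win + draw)


# Barcelona 2008-09: 105 goals for and 35 against in 38 matches.
pts_cap_5 = expected_points(105 / 38, 35 / 38, 38, max_goals=5)
pts_cap_10 = expected_points(105 / 38, 35 / 38, 38, max_goals=10)

print(f"Expected points with a five-goal cap: {pts_cap_5:.1f}")
print(f"Expected points with a ten-goal cap:  {pts_cap_10:.1f}")
```

The absolute numbers from this toy model won't match the Pythagorean column, since the underlying distribution is different, but the gap between the two caps shows the same kind of undercount.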

What was most incredible about Barcelona's season was that it obscured the fact that Real Madrid, Atlético Madrid, and Sevilla were also playing at a high level.

I still think it might be useful to look at common features of teams with large Pythagorean variances.  I just don't think it applies in Barcelona's case.