More on soccer Pythagoreans

A few days ago, I wrote about an email discussion I was having on soccer Pythagoreans.  (Here's some more information on Pythagoreans in general in case you don't know what they are.)  The development of soccer Pythagoreans is an interesting problem, and an open problem in my opinion because the underlying premises of the Pythagorean formula don't apply very well to soccer.  

I read the paper by Steven Miller in which he derives the Pythagorean formula for baseball.  It's a neat paper and the proof is very nice.  However, I think the Pythagorean in its current form isn't all that applicable to soccer, and it illustrates the dangers of using a mathematical formula without recognizing its underlying assumptions.  Miller derives the Pythagorean assuming Weibull probability distributions for the runs scored and allowed.  The problem here isn't the Weibull itself; it is flexible enough to approximate a wide range of distributions, and perhaps one could find a set of parameters to describe the goal distribution of a soccer team.  The issue is that the Weibull is a continuous distribution, and the probability that two independent continuous random variables take the same value is zero.  (In fact, the probability that a continuous random variable takes any one specific value is zero, because that probability is the area under the density over a single point.  For that reason, probabilities for continuous variables are always stated over ranges.)  That property is fine for games that have binary outcomes, but it won't work for a game like soccer, where a nontrivial number of matches end in draws.
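To make the consequence concrete, here is a minimal sketch of the classic Pythagorean ratio applied to goals.  The function name, the goal totals, and the exponent are placeholders I made up for illustration, not fitted values.  The point is only that the "win" share and the "loss" share always add up to one, so nothing is left over for draws.

```python
def pythagorean_win_share(scored, allowed, exponent):
    """Classic Pythagorean ratio: scored^g / (scored^g + allowed^g)."""
    return scored**exponent / (scored**exponent + allowed**exponent)

# Hypothetical season totals and exponent, chosen only for illustration.
goals_for, goals_against, exponent = 60, 40, 1.7

win_share = pythagorean_win_share(goals_for, goals_against, exponent)
loss_share = pythagorean_win_share(goals_against, goals_for, exponent)

# The two shares exhaust all of the probability, so the implied draw share is zero.
implied_draw_share = 1.0 - win_share - loss_share
print(win_share, loss_share, implied_draw_share)
```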

That said, Miller does provide us with some good starting points for deriving a soccer-specific Pythagorean expression.  He has to make some assumptions in order to treat the runs scored and runs allowed as independent of each other (strictly speaking, they can't be, because a baseball game can't end in a tie).  In soccer we can make this assumption directly because all three results are possible, and most researchers do so in their work.  (We'll set aside the "football fever" phenomenon for the moment.)  We will have to use a discrete probability distribution to determine the probability of a drawn match, which, given independence, is the sum over every possible score of the product of the probabilities of scoring and allowing that same number of goals.  The exponent term corresponds to the shape parameter of the distribution, which must be the same for all teams in a league.  This value will have to be estimated using either least squares or maximum likelihood.
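As a rough sketch of the draw calculation, suppose goals scored and allowed are independent Poisson variables.  The Poisson choice is my assumption here, made only because it is the simplest discrete candidate; the function names, the truncation at 25 goals, and the means of 1.5 are likewise illustrative.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k goals under an assumed Poisson model."""
    return math.exp(-mean) * mean**k / math.factorial(k)

def draw_probability(mean_scored, mean_allowed, max_goals=25):
    """P(X = Y) for independent X and Y: sum over k of P(X = k) * P(Y = k).

    The sum is truncated at max_goals, beyond which the terms are negligible
    for realistic soccer scoring rates.
    """
    return sum(
        poisson_pmf(k, mean_scored) * poisson_pmf(k, mean_allowed)
        for k in range(max_goals + 1)
    )

# Hypothetical per-match scoring rates for two evenly matched sides.
print(draw_probability(1.5, 1.5))
```

Any other discrete distribution could be swapped in by replacing poisson_pmf; the sum-of-products structure stays the same.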

Let X be the number of goals scored, Y the number of goals allowed, and P() the probability of an event.  With three points for a win and one for a draw, the final Pythagorean formula, expressed as a share of the maximum points available, should have the following form:

Estimated Points = P(X > Y) + 1/3*P(X = Y)
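To show how the expression above could be evaluated numerically, here is a minimal sketch that again assumes independent Poisson goal counts.  That distribution, the function names, the truncation at 25 goals, and the example means are all my assumptions, not part of any derivation.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k goals under an assumed Poisson model."""
    return math.exp(-mean) * mean**k / math.factorial(k)

def result_probabilities(mean_scored, mean_allowed, max_goals=25):
    """Return (P(X > Y), P(X = Y), P(X < Y)) over a truncated score grid."""
    win = draw = loss = 0.0
    for x in range(max_goals + 1):
        p_x = poisson_pmf(x, mean_scored)
        for y in range(max_goals + 1):
            joint = p_x * poisson_pmf(y, mean_allowed)  # independence assumption
            if x > y:
                win += joint
            elif x == y:
                draw += joint
            else:
                loss += joint
    return win, draw, loss

def estimated_points_share(mean_scored, mean_allowed):
    """P(X > Y) + 1/3 * P(X = Y), the share of available points."""
    win, draw, _ = result_probabilities(mean_scored, mean_allowed)
    return win + draw / 3.0

# Hypothetical per-match scoring rates.
print(estimated_points_share(1.8, 1.1))
```

Multiplying the result by three times the number of matches would convert the share into an estimated points total for a season.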

I'm still not sure we'll end up with a neat and elegant formula; the use of a discrete probability distribution will most likely yield a derivation that is a lot messier than a continuous one.  It's worth trying out.