Two weeks ago I said that I would post the document containing the derivation of the Soccer Pythagorean formula, but I wanted to complete a modification before uploading the paper to the site. It had to do with the fact that the predicted win and draw probabilities together exceeded 1 in some cases, which is impossible, of course. It turns out that there is an overlap in the integration intervals that I used to calculate the win and draw probabilities, and I aimed to fix that by recalculating the interval in the expression for win probability. A side effect is to reduce the over-prediction of wins by the Soccer Pythagorean, which I estimate would lower the predicted point total by 15-20 points.
It turns out that the small change makes the integral extremely complicated to solve. For the one or two of you who care, here's why. Skip the next paragraph or two if you don't want to know.
The problem is that when I perform the integration with that modification to the intervals, I end up with a binomial series of the form
which is an infinite series because γ, more often than not, is not an integer, so I have to make an approximation. Then I end up with an integral that looks like this:
and if you perform an integration by parts you end up with another infinite series with an infinite chain of integrations.
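For anyone who wants to see why a non-integer exponent forces an approximation, here is a minimal sketch using the generic binomial series for (1 + x)^γ. It is not the exact series from the derivation; the function name `binomial_series` and the value γ = 1.7 are assumptions chosen only for illustration.

```python
def binomial_series(x, gamma, n_terms):
    """Partial sum of (1 + x)**gamma = sum_k C(gamma, k) * x**k, for |x| < 1."""
    total = 0.0
    coeff = 1.0  # generalized binomial coefficient C(gamma, 0)
    for k in range(n_terms):
        total += coeff * x**k
        # Recurrence: C(gamma, k+1) = C(gamma, k) * (gamma - k) / (k + 1)
        coeff *= (gamma - k) / (k + 1)
    return total

# Integer exponent: every coefficient past k = gamma is zero, so the sum is finite.
exact_int = binomial_series(0.5, 2, 10)

# Non-integer exponent (1.7 is a hypothetical value): the coefficients never
# vanish, so any finite number of terms is only an approximation.
approx = binomial_series(0.5, 1.7, 10)
```

With γ = 2 the sum stops after three terms and reproduces (1.5)^2 exactly, while with γ = 1.7 each extra term only shrinks the error, which is why the series has to be cut off somewhere.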
I've almost figured it out, but the bottom line is that (1) a simple modification produces an extremely complicated problem, and (2) the resulting formula is nowhere near as simple as the baseball Pythagorean. That's the price, I believe, of having a sport with three possible outcomes that occur quite often.
I want to include some results and look at some of the current European leagues using the formula, but I wanted to hold off until I had made the modifications. I can upload the original paper that I submitted to the conference tomorrow if there's sufficient interest. What do you think?
UPDATE: This derivation has gone in a number of directions, but the one constant is that the resulting integral can't be solved in closed form. (It's like a generalized form of a Gaussian integral, which shows up a lot in statistics and probability theory, so I guess that's kind of reassuring.) At some point you wonder if subtracting 12 or 15 from the predicted point total gets you close enough with much less effort.
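An integral without a closed form can still be evaluated numerically. The sketch below is a rough illustration and not the integral from the derivation: it assumes a stand-in integrand exp(-x^γ), which reduces to the ordinary Gaussian when γ = 2, and applies composite Simpson's rule. The function names, the cutoff at x = 10, and γ = 1.7 are all assumptions.

```python
import math

def simpson(f, a, b, n=2000):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Odd-indexed interior points get weight 4, even-indexed get weight 2.
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def generalized_gaussian(gamma, upper=10.0):
    """Approximate the integral of exp(-x**gamma) over [0, upper]."""
    return simpson(lambda x: math.exp(-x**gamma), 0.0, upper)

half_gaussian = generalized_gaussian(2.0)  # ordinary Gaussian case
other = generalized_gaussian(1.7)          # 1.7 is a hypothetical exponent
```

For this stand-in the full integral from 0 to infinity happens to equal Γ(1 + 1/γ), which gives a handy sanity check on the numerical result. The integral in the actual derivation has no such shortcut, but the same numerical approach would apply to it.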