Quantifying strategy effectiveness in football

L. Szczepanski, "Measuring the effectiveness of strategies and quantifying players' performance in football", International Journal of Performance Analysis in Sport, 8(2): 55-66, 2008.

This paper describes a procedure that estimates the probability of scoring and conceding goals in a match from open or set play in a given sector of the field, by a player who may or may not be pressured by an opponent.  The procedure is combined with a match analysis system to create quantitative measurements of the efficacy of a team strategy during a match or a player's impact on the outcome of a game.

For further commentary, read on.

Lukasz Szczepanski sent me this paper, which was published in 2008 (and was probably written 8-12 months before that).  He didn't tell me with whom he is affiliated, so I guess he's an independent contractor.  He sent the paper with a quotation from my Moneyball and soccer post, which he feels his publication addresses:

I believe that a useful statistic in soccer will ultimately contribute to
what I call an "expected goal value" — for any action on the field in
the course of a game, the probability that said action will create a

As Szczepanski writes in the first sentence of his paper, football is about scoring goals.  A manager wants to know which players contribute most to scoring goals and to preventing goals against their own team.  To produce such information, Szczepanski introduces the notion of a vector yield, which combines the goals scored and the goals conceded as the result of an action on the field.  In doing so, he extends the work of Pollard and Reep (1997), who developed the idea of a yield measure to evaluate the effectiveness of possessions in soccer.  Szczepanski's construction is richer than Pollard and Reep's in that it accounts for the specific type of action during a possession and separates the expected number of goals scored from the expected number of goals conceded.  The result is a measure that looks like this:

y = (y(1), y(2))

I think that the idea of a vector yield is a very elegant formulation.  It separates the expected goals scored and conceded into distinct elements, and the vector construction permits automation in a mathematical programming package such as Matlab or Scilab.

The vector yield is applied to the football match analysis problem in order to estimate the goals scored or conceded due to actions in a specific zone of the field.  Szczepanski divides the field into 18 sectors: six equally spaced zones from one endline to the other, and three flanks (equally spaced, though they need not be) across the width of the field.  (Zone 1 is own endline, Zone 6 is opposing endline; Flank 1 is left flank, Flank 2 is center, Flank 3 is right flank.)  Actions are further classified by whether they occur in open play or during a set piece, and by whether the player making the play was free or pressured by an opposing player.  These variables are used to index the yield quantities y_ijkl:

    i = longitudinal region on pitch (i=1…6)
    j = crosswise region on pitch (j=1…3)
    k = open play (k=1), set play (k=2)
    l = player on ball is pressed (l=1), player on ball is free (l=2);
         when k=2, l=2
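
To make the indexing concrete, here is a minimal Python sketch of how the yield quantities might be stored.  The names State, all_states and yields are my own and do not come from the paper; the only things taken from the description above are the index ranges and the rule that a set play (k=2) always has a free player (l=2).

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class State:
    """One (i, j, k, l) combination; field names are illustrative only."""
    zone: int      # i: 1 (own endline) .. 6 (opposing endline)
    flank: int     # j: 1 (left), 2 (center), 3 (right)
    play: int      # k: 1 (open play), 2 (set play)
    pressure: int  # l: 1 (pressed), 2 (free)

    def __post_init__(self):
        if not 1 <= self.zone <= 6:
            raise ValueError("zone must be in 1..6")
        if not 1 <= self.flank <= 3:
            raise ValueError("flank must be in 1..3")
        if self.play not in (1, 2) or self.pressure not in (1, 2):
            raise ValueError("play and pressure must be 1 or 2")
        # A set play is always taken by a free player: when k=2, l=2.
        if self.play == 2 and self.pressure != 2:
            raise ValueError("set play (k=2) requires a free player (l=2)")


def all_states():
    """Enumerate every valid (i, j, k, l) combination."""
    return [
        State(i, j, k, l)
        for i, j, k, l in product(range(1, 7), range(1, 4), (1, 2), (1, 2))
        if not (k == 2 and l == 1)
    ]


# Each state maps to a vector yield (expected goals scored, expected goals
# conceded), initialised to zero here as a placeholder.
yields = {state: (0.0, 0.0) for state in all_states()}
```

With 18 sectors and three valid play/pressure combinations per sector (open and pressed, open and free, set play and free), this gives 54 yield vectors to estimate.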

So we are looking for the goals scored or conceded as the result of an action of type kl in zone ij of the pitch.  How do we calculate this over the course of a match?

Szczepanski describes a procedure to estimate the yield quantities, but it is a little difficult to follow.  I think I understand the first few steps, which involve using match data to seed the goal-scoring probabilities from a variety of scenarios, and then using those probabilities to initialize the vector yields.  There is an iterative process to update the vector yields during play that relates the connectedness between actions in a possession to the vector yields due to actions kl in zones ij.  But the process is not very clear to me, even after writing out some examples on a notepad over a couple of hours.  If I had been a reviewer for the paper, I would have asked to see a better description of the algorithm with pseudocode.
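
For what it's worth, here is how I would sketch an iterative yield estimate in Python.  I want to be clear that this is not Szczepanski's algorithm, which I could not fully reconstruct; it is a generic fixed-point iteration of the kind used for absorbing Markov chains, written under my own assumptions.  The function name estimate_yields, the inputs direct and transitions, and the toy numbers are all hypothetical.  I assume that each state has a probability of a goal being scored or conceded before the team's next action, and a probability that the team's next action occurs in some other state.

```python
def estimate_yields(direct, transitions, tol=1e-10, max_iter=10_000):
    """Iterate y(s) = direct(s) + sum over t of P(s -> t) * y(t).

    direct[s]      -- (p_scored, p_conceded) before the team's next action
    transitions[s] -- {t: probability the team's next action is in state t}
    Row sums of transitions must be below 1 (possessions end eventually),
    otherwise the iteration is not guaranteed to converge.
    """
    y = dict(direct)  # initialise the vector yields from the direct terms
    for _ in range(max_iter):
        new_y = {}
        for s, (p_scored, p_conceded) in direct.items():
            nxt = transitions.get(s, {})
            scored = p_scored + sum(p * y[t][0] for t, p in nxt.items())
            conceded = p_conceded + sum(p * y[t][1] for t, p in nxt.items())
            new_y[s] = (scored, conceded)
        # Largest change in any component of any yield vector.
        delta = max(abs(new_y[s][c] - y[s][c]) for s in direct for c in (0, 1))
        y = new_y
        if delta < tol:
            break
    return y


# Toy two-state example with made-up probabilities.
direct = {"midfield": (0.0, 0.01), "box": (0.2, 0.0)}
transitions = {
    "midfield": {"box": 0.5, "midfield": 0.25},
    "box": {"midfield": 0.25},
}
toy_yields = estimate_yields(direct, transitions)
```

In this toy example the iteration settles at roughly (0.16, 0.016) for midfield and (0.24, 0.004) for the box, which is what you get by solving the two linear equations by hand.  A sketch like this also makes the sensitivity question easier to pose: if the iteration is a contraction, the starting values should not matter, and the estimated probabilities become the only source of bias.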

One item that caught my attention was the use of match data to initialize the estimation model.  Were these data also used to estimate the vector yields?  If so, that process would introduce biases into the model.  And how much match data would you need to come up with a proper initialization?  How sensitive is the estimation model to initial conditions?  It would be interesting to apply a rigorous analysis to this estimation procedure in order to determine the class of model to which it belongs, and then to assess the model's robustness to initial conditions.  That analysis would be useful in determining some practical approaches to initializing the model, but it is still an academic problem.

Szczepanski used this model to analyze official football matches (in this case, World Cup qualifiers) played by the Polish national team last year.  Most of the results would make intuitive sense to football fans, which is encouraging for the viability of the model:

  • Possessions from the defensive area to midfield had similar expected goal yields.
  • In the final third, certain possessions have very distinct goal yields.  The highest yield is for an unmarked player in the center of the penalty area, with lower yields for possessions on the flanks.
  • As you might expect, marking a player reduces the goal yield, and the reduction is much more pronounced when the player holds the ball in the center of the penalty area. 
  • The only type of possession with negative goal value is a player under pressure in the center of his own penalty area, such as a goalkeeper or a pressured defender.

The implication of this approach is that it is possible to evaluate playing strategies during the match and assess their effectiveness in different sectors of the field.  Because the expected goal value can be calculated as the result of an action, or a chain of actions, it is possible to extend the model in order to isolate the contribution of individual players.  I believe this approach would help define a "good" pass as one that not only maintains possession but also improves the probability of scoring a goal.  There are some higher-level tasks that can be addressed through these statistics, such as targeted and tailored team/individual training sessions, team selection, and valuation of players on the transfer market (or, in the case of MLS, the draft).
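
One way the player-contribution idea might be sketched, assuming the vector yields are already known: credit each action with the change in net expected goals (scored minus conceded) between the state where the player received the ball and the state where the next player receives it.  This crediting rule is my own illustration and not something the paper prescribes; the function names, state labels and yield values below are hypothetical.

```python
from collections import defaultdict


def net_value(vector_yield):
    """Collapse a vector yield (scored, conceded) into net expected goals."""
    scored, conceded = vector_yield
    return scored - conceded


def credit_players(chain, yields):
    """Credit each player with the change in net value their action produced.

    chain  -- ordered list of (player, state) pairs within one possession
    yields -- {state: (expected goals scored, expected goals conceded)}
    The last player in the chain gets no credit here, because the outcome
    of the final action is not part of this sketch.
    """
    credit = defaultdict(float)
    for (player, state), (_, next_state) in zip(chain, chain[1:]):
        credit[player] += net_value(yields[next_state]) - net_value(yields[state])
    return dict(credit)


# Made-up yields for three states and one three-action possession.
example_yields = {
    "own_box_pressed": (0.01, 0.03),
    "midfield_free": (0.05, 0.02),
    "opp_box_free": (0.30, 0.01),
}
possession = [
    ("A", "own_box_pressed"),
    ("B", "midfield_free"),
    ("C", "opp_box_free"),
]
player_credit = credit_players(possession, example_yields)
```

Here player A is credited with about 0.05 net goals for playing out of pressure, and player B with about 0.26 for the pass into the box.  Both passes maintain possession, but the second one improves the scoring probability far more, which is exactly the distinction a "good pass" statistic should capture.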

There are some additional issues that this work raises, but I'll just say that this paper presents an analysis framework that would lie on top of a match analysis system.  The data in this publication was obtained using a hand notation system, but there's no reason why this system can't be automated.  That's pretty much a necessity if the approach is going to have widespread use.  There are some issues with the clarity of the concepts presented, but it's quite clear that the data collection and analysis tools exist; now the next task is to create software to integrate the disparate systems and analyze and assess match performance.