For this analysis I consider all goalkeepers who appeared in a minimum of 630 minutes in the league, or seven full matches. Goalkeepers in general play far more minutes than outfield players, and all Superliga 18/19 clubs have at least one goalkeeper who has appeared for a minimum of 630 minutes.

The first table displays actual and expected saves and goals allowed per 90 minutes for every goalkeeper above the 630 minute cutoff. The table is sorted on actual goals allowed per 90 minutes.

This list is headed by Esteban Andrada, who transferred from Lanús to Boca Juniors for a large fee at the start of the season and supplanted Agustín Rossi. He had some excellent performances for Boca before he fractured his jaw in a Copa Libertadores match and became unavailable for a month. Accompanying him on the list is Alan Aguerre, who transferred from Vélez to Newell’s during the off-season, became the first-choice ‘keeper, and also distinguished himself with some excellent performances.

One question that jumps out upon seeing that table is, *how do Aguerre and Andrada have close to 1.0 xGA/90*? Neither ‘keeper has faced a high volume of shots this season (33 for Aguerre, 25 for Andrada), but both have multiple matches in which the number of expected goals allowed exceeds the actual total by more than 1.0 goals (four times for Aguerre, three for Andrada). I see it as a combination of an excellent save percentage and relatively few matches played. Of course, it’s also necessary to understand how both ‘keepers are positioned when facing a shot on goal, and with the current data set we’re just not able to know that.

The second table displays the goals allowed above expected (GAAx) along with shots on goal and actual/expected saves and goals allowed. The table is ordered on GAAx in ascending order. As in golf, red numbers are good.
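For readers who want the arithmetic behind GAAx and the per-90 figures, here is a minimal sketch. The data frame, its values, and the column names are all invented for illustration; they are not the real Superliga totals.

```python
import pandas as pd

# Hypothetical per-goalkeeper totals; the keeper names, minutes, and
# goal counts are invented for illustration.
df = pd.DataFrame({
    "keeper":  ["A", "B", "C"],
    "minutes": [1800, 900, 2430],
    "goals_allowed": [18, 8, 30],
    "expected_goals_allowed": [20.5, 10.2, 27.1],
})

# GAAx = actual minus expected goals allowed; negative (red) is good.
df["GAAx"] = df["goals_allowed"] - df["expected_goals_allowed"]

# Per-90 rates, as in the tables.
df["GA90"] = 90 * df["goals_allowed"] / df["minutes"]
df["xGA90"] = 90 * df["expected_goals_allowed"] / df["minutes"]

# Ascending order, so the over-performers appear first.
print(df.sort_values("GAAx"))
```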

Again, Aguerre and Andrada head the charts, which is what you would expect given that both also lead the league in GA/90. It is interesting that allowing significantly fewer goals than expected does not necessarily translate to higher positions in the league table. It could, however, make the difference between a current position and a much worse one. (David de Gea’s recent seasons at Manchester United should immediately come to mind.)

Allowing fewer goals than expected is a significant part of assessing goalkeeper quality, but it’s not everything. There are also positioning on set-pieces, ball distribution, and defensive interventions that prevent a shot from being attempted. My data set won’t be able to answer all of these questions, but in the near future I’ll present results from other models that attempt to explain a goalkeeper’s contribution to his team’s possession and chance creation.

I’ve written about expected goalkeeping metrics, such as expected saves and expected goals allowed, in a previous post. The first table presents saves, expected saves, and saves above expected per 90 minutes, as well as goals allowed, expected goals allowed, and goals allowed above expected per 90 minutes. The minimum cutoff for minutes played was 400 minutes, which is admittedly low, but I wanted to capture the performance of those goalkeepers who appeared over most of the group stage.

The second table expresses the number of goals saved by the goalkeeper relative to his expected total. Once again the minimum cutoff is 400 minutes.

Jaílson’s performance was right on the borderline because of his minutes played. He played for Palmeiras in the group stage until the final match, then was removed in favor of the more experienced Weverton for the remainder of the tournament. Jaílson’s 0.38 GA/90 was the best of those ‘keepers who appeared for more than 400 minutes, but bear in mind that he only appeared in the group stage, in which Palmeiras allowed just three goals anyway.

The most impressive of the Brazilian goalkeepers were Vanderlei (Santos) and Marcelo Grohe (Grêmio). Santos had the least impressive record of the group winners and were bounced out of the Copa thanks to an ineligible player, but Vanderlei carried his side through the tournament. He made the most saves of any goalkeeper with more than 400 minutes (admittedly, not necessarily a good thing) and allowed far fewer goals than expected, which was hugely important for a side that struggled to score. Grohe, in contrast, was behind arguably the best defensive back line in South America and didn’t have to make many saves. His 1.54 saves/90 were the fewest of any goalkeeper with more than 400 minutes played, and his 0.60 xGA/90 was the best of the first-choice goalkeepers in the last four. Despite being over 30, he looked the ‘keeper with the best chance of a foreign transfer, and so it has proved, as he’ll be plying his trade in Saudi Arabia at al-Ittihad FC.

So where were the finalist goalkeepers on this list? Franco Armani made slightly more saves per 90 minutes than Agustín Rossi, yet his GA/90 was a bit lower. I believe that many analysts in Argentina would say that Armani is the better ‘keeper, but not by much. Martín Campaña of Independiente had a reasonably good performance in the Libertadores. Juan Musso, Augusto Batalla, and Mariano Andújar were further behind.

And lastly, poor poor Alain Baroja Méndez. The goalkeeper for Venezuela’s Monagas FC seemed to be before a firing squad during every Libertadores match — which he was.

Armani joined River during the summer transfer period (January window in European football parlance) from Colombia’s Atlético Nacional and attracted attention with his impressive displays in goal. So impressive were they that observers of South American football started to consider him a dark horse for the Argentina national team and possibly a starting role in Russia.

So now Armani has made the initial cut for Argentina. Does he deserve to be there? According to the goalkeeper metrics that I’ve calculated for the Superliga, yes:

By virtue of having arrived in the second half of the Superliga season, Armani hadn’t appeared in as many minutes as most of his colleagues. Among the 33 goalkeepers who have appeared in more than 1000 league minutes, 18 played more than 2000 minutes of league play, and only eight keepers had fewer minutes than Armani. No one bettered Armani’s 0.47 goals allowed per 90 minutes.

It is true that Armani hasn’t faced as many shots as his counterparts, whether as an absolute number or per 90 minutes. It’s also true that as goalkeepers face more shots, they’ll eventually let more in. The chart below, which maps goals allowed per 90 minutes as a function of shots on goal per 90 minutes, gives an indication of that. Armani’s performance is highlighted in orange on the lower left. Armani almost certainly benefited from an improved defense in front of him (River had allowed 18 goals in the first 12 league matches of the season but only eight in the remaining 15), but he was one of the few goalkeepers whose goals allowed statistics were better than expected given the characteristics of the shots he faced.

The “goals saved” metric has gained some degree of traction in football analytics circles this year, in particular when used to describe the exceptional performance of goalkeepers such as David de Gea. I used a different approach to calculate expected goals allowed by goalkeepers, and I called the end result “goals allowed above expected” or GAAx. The results aren’t nearly as dramatic as those of other analysts, but seeing Armani’s name at the top of the list gave me some degree of satisfaction.

This analysis hasn’t delved into Armani’s distribution of the ball, and it’s not able to make comparisons with the other Argentina ‘keepers who play in Europe or Mexico. But the data shown here presents another dimension that adds to what Sampaoli saw, and provides some support for his decision to call for Armani. If he makes the final 23, I’ll take a look at his distribution and identify any tendencies therein.

This analysis takes DataFactory’s match event data from Primera División matches and calculates expected saves, and from those figures expected goals allowed, for every goalkeeper who participated in them. I’ve focused on the 33 ‘keepers who appeared for more than 80000 seconds, or 1333 minutes, in league matches this season.

The first chart shows the tabulated and calculated totals for minutes played, shots on goal, actual and expected saves, actual and expected goals allowed, and the goals allowed above the expected total (which I abbreviate GAAx). Own goals have been ignored. There may be discrepancies between the total shots on goal calculated by DataFactory and me.

Compared to the previous version of the model, in which the total expected goals allowed was about 50% of the actual goals allowed in Primera, the current version’s total expected goals allowed is about 75% of the actual total. The expected number of saves appears closer to realistic figures, and, most importantly, over- and under-performers are easier to spot in the predictions.

The table throws out some very interesting results. Atlético Rafaela’s primary ‘keeper Lucas Hoyos allowed 0.9 goals per 90 minutes and six fewer goals than expected of an average goalkeeper — the best GAAx of any goalkeeper in the Primera División regardless of minutes played. Martín Campaña of Independiente allowed 5.3 fewer goals than expected, which was second-best on the list. Of the ‘keepers who played more than 2250 minutes, or the equivalent of 25 full matches, only one — Gimnasia de La Plata’s Alexis Martín Arias — allowed fewer goals than expected.

For the most part the best and worst goalkeepers in terms of goals allowed above expected were similar to those predicted by the previous version of the model. Agustín Rossi of champions Boca was near the top of the list, as were the two Lanús ‘keepers Fernando Monetti and Esteban Andrada. At the foot of the GAAx table were César Rigamonti of Quilmes, who was absolutely bombarded in goal, but also Cristian Lucchetti, who faced 42 fewer shots than Sarmiento’s Julio Chiarini but let in just as many goals (30 vs 31).

The following chart displays actual and expected saves and goals allowed per 90 minutes, again with own goals excepted:

When ordered by actual goals allowed per 90 minutes, Fernando Monetti of Lanús heads the list. He had a very good first half of the season before he tore his ACL (which might explain his performance in 2017-18). All of the goalkeepers who allowed fewer goals than expected had GA/90 below 1.0. Augusto Batalla (River), Nereo Champagne (Olimpo), and Diego Rodríguez (Rosario Central) were the three keepers whose GA/90 fell below 1.0 and whose GAAx/90 was below -0.3. Toward the bottom of the list you see ‘keepers such as Rigamonti, Lucchetti, Luís Ardente (San Martín de San Juan) and Nelson Ibáñez (Tigre), whose GAAx/90 exceeded 0.5.

There are other researchers who use the xG values of the shots on goal to calculate the total expected goals allowed. It would be useful to compare and contrast the two methodologies, especially now with an improved expected saves model. It would also be useful to visualize the shots on goal and the goalkeeper response in a similar manner to the expected goal maps, in order to identify any tendencies or vulnerabilities in their performance. More tasks to do, for sure.

To remedy the problems that I observed with the model, I focused on two features: the use of an intercept in the logistic coefficients and the weighting of the save/no save events. The intercept affects the probability of an event in the absence of all the other parameters that describe a shot. Because of the way that I code plays, such an event is a penalty shot from right in front of the center of the goal line that goes straight. Penalty shots, of course, don’t occur from the goal line, so it’s possible to remove the intercept. I evaluated the model performance with and without the intercept and on the current Superliga data the intercept added 1.5 to the average expected saves total and shrank the expected goals allowed total by 1.5 goals.

The weighting of the save/no save shot events turned out to be much more significant. Compared to other unbalanced data sets, the ratio between shots saved and not saved isn’t huge at just over 2:1. But on a team-by-team level, the ratio varied from 2:1 to more than 4:1. Switching to a balanced weighting of classes in the model had a huge effect on the coefficients associated with the shot parameters and the expected saves and goals allowed. On average the number of expected saves shrank by 3.4 saves between a model with uniform weighting and a model with balanced weighting. But the latter model increased expected goals allowed by an average of 7.4 goals, which brought the totals much closer to the actual number of goals allowed.

In the end, I decided to implement the expected saves model with balanced class weighting and no intercept.
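In scikit-learn terms, that final configuration looks like the sketch below. The data is synthetic, and the feature layout and class balance are my illustrative assumptions, not the real Superliga shots.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the shot data: four scaled parameters per
# shot on goal, and a save/no-save label at roughly 2:1.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 4))          # e.g. distance, field angle, azimuth, elevation
y = (rng.uniform(size=500) < 0.7).astype(int)  # 1 = save, 0 = goal

# The chosen configuration: balanced class weighting, no intercept.
model = LogisticRegression(fit_intercept=False, class_weight="balanced")
model.fit(X, y)

# P(save) for each shot on goal; the sum is the expected saves total.
xS = model.predict_proba(X)[:, 1]
print(f"expected saves: {xS.sum():.1f}")
```

With `fit_intercept=False` the fitted intercept is pinned to zero, so the baseline shot (all parameters zero) gets probability 0.5 rather than whatever the intercept would imply.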

Now it’s time to show some results from this season’s Argentine Superliga. The event data has been supplied by DataFactory LatAm and analyzed to produce expected saves and expected goals allowed by each team. This analysis focuses on the 31 goalkeepers who have played at least 630 minutes in the league as of the end of Round 19.

Below are the tabulated and calculated totals for shots on goal, actual and expected saves, and actual and expected goals allowed. Own goals have been ignored. There may be discrepancies between the total shots on goal calculated by DataFactory and me.

The total number of goals allowed remains larger than the total number of expected goals allowed, but the discrepancy is not as wide as indicated by the previous expected saves model. More interestingly, some goalkeepers have allowed fewer goals than expected and others more goals than expected, which is what I was hoping to see when I created the model.

The above chart is sorted by the number of goals allowed above expected (which I call GAAx), and Guido Herrera of Talleres heads the list. You can also see other goalkeepers whom you might expect, such as Agustín Rossi of Boca Juniors, and perhaps some you might not, such as Jeremías Ledesma of Rosario Central and Marcos Díaz of Colón (Santa Fe). At the team level, Patronato have shown the greatest discrepancy between expected and actual goals allowed, which accounts for their much higher-than-expected league position. I was hoping that the goalkeeping statistics would provide some insight into this surprise, but first-choice ‘keeper Sebastián Bértoli’s performance has been just about in line with expectations.

So what do these figures look like over 90 minutes? Here is another chart, once again with own goals ignored.

I’ve sorted these figures by actual goals allowed over 90 minutes but also highlighted those goalkeepers whose GAAx/90 is negative. It’s very interesting to see those goalkeepers who have allowed a low number of goals and fewer than expected. In this light we see Herrera and Rossi but also Iván Arboleda of Banfield. It’s also interesting to see goalkeepers such as Nereo Fernández (Unión de Santa Fe) and Luís Unsaín (Defensa y Justicia), who have GA/90 below one but positive GAAx/90.

In a future post, I’ll reevaluate the goalkeeper statistics for last season’s Argentine top flight and take a further look into some of the surprising results that I have seen above.

This analysis takes DataFactory’s match event data from the tournament and calculates expected saves and eventually expected goals allowed for every goalkeeper who participated in the championship. I’ve focused on the 33 ‘keepers who appeared for more than 80000 seconds, or 1333 minutes, in league matches this season.

Below are the tabulated and calculated totals for shots on goal, actual and expected saves, and actual and expected goals allowed. Own goals have been ignored. There may be discrepancies between the total shots on goal calculated by DataFactory and me.

One thing that jumps out upon initial inspection is the high number of expected saves, which results in a low number of expected goals allowed. It’s possible that using saves as the response variable in the model is not the best choice; there will always be many more successful saves than unsuccessful ones, and classification models work best when they attempt to identify the “anomalies”, which in this case are the goals. It’s something to think about, and I’ll revisit this choice in a future post. The total number of goals allowed is actually much larger than that predicted by a complement of the xG model.

There are some interesting results revealed in this table. The goalkeeper with the fewest goals allowed was Fernando Monetti of Lanús, who was their #1 ‘keeper during the first half of the championship but managed to tear his ACL playing football-tennis during summer vacation. His replacement ‘keeper, Esteban Andrada, played very well, especially in their final league match. Agustín Rossi (Boca) and Gabriel Arias (Defensa y Justicia) were the first-choice ‘keepers for their respective teams but faced few shots and let in fewer goals. At the other end of the table, poor César Rigamonti of Quilmes was bombarded throughout the tournament, and his goalkeeping statistics were impacted as a result. Alan Aguerre of Vélez had the lowest expected goals allowed, but let in almost 19 more goals than expected.

So what do the gaps between expected and reality look like, and what are the averages over 90 minutes? The following chart displays those figures, own goals ignored.

Expressing the expected and actual amounts per 90 minutes reveals some more interesting results. We see Martín Campaña of Independiente at the top of the list with the smallest differences between actual and expected saves and goals allowed. Following him is Lucas Hoyos of relegated Atlético Rafaela, who allowed significantly more goals per 90 minutes than Campaña but not much more than expected. All of the goalkeepers made fewer saves than expected and allowed more goals than expected, but the better ones (or at least, some of the higher-rated ‘keepers in the media) such as Monetti, Rossi, and Gabriel Arias have low expected goals allowed and smaller gaps between expectation and reality. At the bottom of the list you see ‘keepers such as Aguerre and Lucchetti, who appeared to be good for a missed save and a preventable goal per game.

Raw data sourced from DataFactory Latinoamericana.

It’s easy to view the expected saves (xS) metric as a complement of the expected goal (xG) metric. The latter attempts to estimate the probability of a goal given the characteristics of the shot attempt (whether on goal or not), while the former estimates the probability of a goalkeeper save given the characteristics of the shot on goal. Not all shots become shots on goal, and in the same manner that few shots become goals, most shots on goal are saved.

Expected saves have been discussed in the football analytics research community for a while — Thom Lawrence (@deepxg) described results from (but did not present details of) an expected saves model in 2015, and Raven Beale (@sbourgenforcer) wrote a piece at Chance Analytics on expected saves that he defined as 1 – xG – xM (1.0 – probability of goal – probability of miss). Like expected goals, expected saves have started to break out into the mainstream sports media such as this article at the UK Telegraph.

Most of the authors who have discussed expected saves have kept the model details close to the vest and discussed solely the results. That’s understandable, as most readers don’t care how an expected goals or saves model was derived. Well, I’m not most people, and I do care how the model was derived. Others have coupled expected saves very strongly to expected goals by expressing the former as a complement (1.0 – xG, or something close). That may be correct, but I wanted to build an expected saves model from the ground up and then relate it to an expected goals model. So that’s what I will do.

In the rest of this post I’ll describe the parameters that go into my expected saves model as well as the procedures for training, validating, and testing it. I don’t expect to break new ground here; I just want to put my derivation in the open.

Like an expected goals model, an expected saves model is a conditional probability model. It seeks to answer the question, *“Given a collection of parameters that describes a shot on goal, what is the probability that the shot is saved by the goalkeeper?”*

Let’s say that \(\mathbf{x}\) represents this collection of parameters (the parameter vector), and \(S\) the save event. Then we can write this conditional probability model as

\[

Pr(S|\mathbf{x}) = f(\mathbf{\beta}, \mathbf{x})

\]

where \(\mathbf{\beta}\) represents the model coefficients associated with the shot parameters.

A save is a binary event with probability of success between 0 and 1, so we describe it with a logistic function. The resulting model is

\[

Pr(S|\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{\beta}^T\mathbf{x}}}

\]
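Written out directly, the link function is a one-liner. The coefficient and shot-parameter values below are made-up numbers for illustration, not fitted values.

```python
import numpy as np

def p_save(beta, x):
    """Probability of a save given coefficients beta and scaled shot parameters x."""
    return 1.0 / (1.0 + np.exp(-np.dot(beta, x)))

beta = np.array([-2.1, 0.4, 0.3, 1.2])  # illustrative coefficients
x = np.array([0.25, 0.1, -0.05, 0.2])   # illustrative scaled shot parameters
print(f"P(save) = {p_save(beta, x):.3f}")
```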

Here are the shot parameters that I believe have some relevance to the goalkeeper’s ability to save an attempt on goal.

**Distance:** A shot closer to goal has a greater chance of being converted, and a lower chance of being saved, than one further away, but the position of the goalkeeper matters. (And unless one has access to positional data, it won’t be possible to know goalkeeper position.) Distance is measured from shot coordinate to the center of the goal line and normalized by the distance \(r_{max}\) between that point and the far corner so that the rescaled distance is between 0 and 1.

**Field Angle:** It makes sense that the field angle θ of the shot has an effect on the likelihood of a save. I define this as the angle between the distance vector and the centerline intersecting the two goal lines, and positive angles are shots from the shooting team’s left flank. Then the angle is scaled by \(\frac{\pi}{2}\) so that it lies between -1 and 1.

The next two parameters are defined by the straight path between the shot attempt position and the three-dimensional position in the goalmouth plane that it intersects.

**Path Azimuth Angle:** The angle α of the shot path relative to an (invisible) line that starts at the shot location and intersects perpendicular to the goalline. A positive angle represents a shot toward the opponent’s right post. The angle is scaled by \(\frac{\pi}{2}\) so that it lies between -1 and 1.

**Path Elevation Angle:** The angle β of the shot path relative to the playing surface. This angle is almost always positive, but can be negative in situations where a ball is headed downwards. It’s scaled by \(\frac{\pi}{2}\) so that it lies between -1 and 1.

**Play Type:** There are certain events that are more likely to produce goals than others, but the usefulness of this feature depends on the richness of the event data. Some data sets describe shots as the result of set-pieces, throughballs, or crosses. Others do no more than differentiate open-play shots from penalties. The time elapsed between the preceding event and the shot can be useful as its own parameter, but it may not always be available.

**Body Type:** The body part used to execute the shot (and no, hands do not count). I believe that body part could be a proxy for shot velocity, and certain body parts are more likely to be used from specific plays (headers from crosses or corners, for example). Again, some data companies include this data but others don’t. I know of a few companies that describe headed goals but not all headed shots. So this parameter is regrettably optional for some types of data.

**Match Time:** This variable was present in the expected goals model but I left it out of the expected saves model. My expectation was that shot quality, and therefore shot difficulty to the goalkeeper, had little to do with the time of the match.

**Match State:** Another variable present in the expected goals model but not in the expected saves model. My expectation here is that a shot is no more difficult to the goalkeeper if the team is tied, up a goal, or down three. (There could be a psychological effect being captured here.)

**Other variables:** There are a few more variables that I’d like to incorporate if I had years of tracking data, such as goalkeeper position (distance and angle) relative to shot taker and position of closest defender relative to shot taker. I’d like to think that these variables would provide more predictive power to the model, but I’d have to test that.
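As a concrete sketch, here is how the four geometric parameters might be computed. The pitch dimensions, coordinate origin, and angle sign conventions are my assumptions for illustration, not those of any particular data provider.

```python
import math

# Assumed geometry: a 105 x 68 m pitch with the origin at one corner,
# the defended goal line at x = 105, and the goal centered at (105, 34).
GOAL_X, GOAL_Y = 105.0, 34.0
R_MAX = math.hypot(GOAL_X, GOAL_Y)  # goal center to the far corner at (0, 0)

def shot_features(x, y, goal_y, goal_z):
    """(x, y): shot location; (goal_y, goal_z): where the shot path
    crosses the goalmouth plane (goal_z is height off the ground).
    Returns the four scaled geometric parameters."""
    # Distance to the center of the goal line, normalized to 0..1.
    distance = math.hypot(GOAL_X - x, GOAL_Y - y) / R_MAX
    # Field angle relative to the centerline, scaled to -1..1.
    # (Which flank is positive depends on the coordinate convention.)
    field_angle = math.atan2(y - GOAL_Y, GOAL_X - x) / (math.pi / 2)
    # Path azimuth relative to a line perpendicular to the goal line.
    azimuth = math.atan2(goal_y - y, GOAL_X - x) / (math.pi / 2)
    # Path elevation relative to the playing surface.
    elevation = math.atan2(goal_z, math.hypot(GOAL_X - x, goal_y - y)) / (math.pi / 2)
    return distance, field_angle, azimuth, elevation

# A penalty-spot shot aimed straight at the center of the goal, along the ground:
print(shot_features(94.0, 34.0, 34.0, 0.0))
```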

Another way to estimate xS, which is what I believe most of the analytics community is doing already, is to calculate xG for each shot on goal and subtract the sum of those xG values from the total number of shots on goal. This guarantees that the expected saves model is a complement of the expected goals model for all shots on target.
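That complement approach is a two-line calculation. The xG values below are invented for illustration.

```python
import numpy as np

# xS as the complement of xG over shots on target: subtract the summed
# xG of the shots on goal from the count of shots on goal.
xg_on_target = np.array([0.76, 0.08, 0.12, 0.31, 0.05])  # invented per-shot xG values

expected_saves = len(xg_on_target) - xg_on_target.sum()
print(f"xS = {expected_saves:.2f}")
```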

Expected saves models, like expected goals models, are logistic regression models, so I train both the same way. I described my procedure in my post on expected goals, but the basic outline is this:

- Scikit-learn’s LogisticRegression class to create the model
- Training, validation, and testing data sets, partitioned by matches
- Brier score as my evaluation metric on the validation data set
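A minimal sketch of that outline on synthetic data follows. The match-based partitioning, feature layout, and class balance are illustrative assumptions, not the real data set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

# Synthetic shots tagged with a match identifier.
rng = np.random.default_rng(1)
n = 600
X = rng.uniform(-1, 1, size=(n, 4))
y = (rng.uniform(size=n) < 0.7).astype(int)   # 1 = save
match_id = rng.integers(0, 30, size=n)        # 30 synthetic matches

# Partition by matches rather than by individual shots, so shots from
# one match never straddle the train/validation boundary.
train = match_id < 20
val = (match_id >= 20) & (match_id < 25)      # remaining matches held out for testing

model = LogisticRegression(class_weight="balanced").fit(X[train], y[train])
p_save = model.predict_proba(X[val])[:, 1]

# Brier score: mean squared error between predicted probability and outcome.
print(f"Brier score: {brier_score_loss(y[val], p_save):.3f}")
```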

One thing that makes xS models more challenging than xG models is that there is much less data available for training. While xG models take into account all attempted shots, xS models only consider shots on goal, which reduces the number of data points by 30-40 percent. One can overcome this problem by collecting more data, but that’s not always an option if you don’t work for a data company. It would be useful to run a bias-variance study as a function of the number of data points in a trained xS model. Come to think of it, such a study would be useful for xG models as well.
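One way to run such a bias-variance study is scikit-learn's `learning_curve`, sketched here on synthetic, purely illustrative data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the shots-on-goal data set.
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(800, 4))
y = (rng.uniform(size=800) < 0.7).astype(int)

# Score the model on growing subsets of the training data.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(class_weight="balanced"), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    scoring="neg_brier_score",
)

# A train/validation gap that shrinks as n grows suggests variance,
# not bias, is the binding constraint at small sample sizes.
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train Brier={-tr:.3f}  val Brier={-va:.3f}")
```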

I would put some evaluation graphics here, but with the new season of Argentina’s Superliga fast approaching, I want to apply this model to last season’s match data first and run some model evaluations later. I will say that the expected saves model gives results that are optimistic — it inflates the number of saves — but are much more believable than an expected saves model based on xG. More about this later.
