So how do you make football players unique?

I haven’t written much about the software that we develop in order to do our analysis, although I have written about our football data models.  Hopefully this will be the start of a series of posts in which I unveil a little bit under the covers of what we do here.  This post is aimed more at the software developers than general football fans but do read if you’re interested.

Last night I made a rather stunning discovery and shared it with the world on Twitter:

Yes, it’s true.  In the 2011-12 English Premier League season there were two Paul Robinsons: one is the 33-year-old former England goalkeeper who played for Blackburn Rovers, and the other is the 34-year-old left back who played for Bolton Wanderers (and now captains Birmingham City).  (These were the only two players with identical names in that Premier League season, by the way.)  These were not little-used players either; they had 51 match appearances between them.

It’s not hard to find players with identical first and last names.  For example if you go to the site How Many of Me you can find out how many people in the USA share your name.  (Chris Anderson and I aren’t as special a snowflake as we like to think — there are 988 Chris Andersons and 262 Howard Hamiltons.  There are only nine Matthew Stigalls, however.)  With the Brazilian, Portuguese, and Spanish practice of using one name, the probability of two players having the same name is even greater.

This is a nightmare if you’re managing a football database.  You need to be able to ensure that every football player is unique so that you can guarantee that players linked to matches or match events are the correct ones.  If you can’t ensure that, your database is unreliable, which means that none of the information you derive from it will be reliable.

One place where this problem will manifest itself is if your database is ingesting data using data files.  For example we use a CSV delimited file to record match lineup data — who appeared for which team in a match and whether they started or captained the side:

Matchday,Home Team,Away Team,Player's Team,Player,Starting,Captain
 1,Blackburn Rovers,Wolverhampton Wanderers,Blackburn Rovers,Paul Robinson,1,1
 1,Blackburn Rovers,Wolverhampton Wanderers,Blackburn Rovers,Grant Hanley,1,0
 1,Blackburn Rovers,Wolverhampton Wanderers,Blackburn Rovers,Martin Olsson,1,0
 1,Queens Park Rangers,Bolton Wanderers,Bolton Wanderers,Gary Cahill,1,0
 1,Queens Park Rangers,Bolton Wanderers,Bolton Wanderers,Zat Knight,1,0
 1,Queens Park Rangers,Bolton Wanderers,Bolton Wanderers,Paul Robinson,1,0
 1,Queens Park Rangers,Bolton Wanderers,Bolton Wanderers,Grétar Steinsson,1,0

The home and away teams and player names need to be converted to unique primary keys, and we’ve written some software routines to retrieve the appropriate primary key number.  (We use PostgreSQL as the backend and SQLAlchemy as the object-relational mapper, if you’re curious.)
for keys in rows:
player_name = “{Player}”.format(**keys).strip()

playerID = Database.SearchTableRecords(“players_list”,
{“full_name”: player_name})[0].id
But how do you distinguish between the two Paul Robinsons in this file?  Well, you can’t if you’re blithely using the first return object.  And that’s a problem.

There are lots of workarounds that come to mind, and I’m not going to claim that my workaround is the right solution or the best one.  But it does work.

What we do is that we add date-of-birth to player’s name  in the lineup data file, separated by a colon delimiter:

1,Blackburn Rovers,Wolverhampton Wanderers,Blackburn Rovers,Paul Robinson:1979-10-15,1,1
1,Queens Park Rangers,Bolton Wanderers,Bolton Wanderers,Paul Robinson:1978-12-14,1,0

and then we separate the string into a list using Python’s split method:
for keys in rows:
player_name = “{Player}”.format(**keys).split(‘:’)
Then we can add a little logic to get the unique ID from the player_name variable:
if len(player_name) == 1:
(name,) = player_name
playerID = Database.SearchTableRecords(“players_list”,
{“full_name”: name})[0].id
(name,dob) = player_name
playerID = Database.SearchTableRecords(“players_list”,
{“full_name”: name, “birth_date”: dob})[0].id
As I said, this isn’t the only solution to the problem, and there is a possibility that multiple players share the same name and birth date.  You can add nationality or position, for example.  Any additional attribute would reduce the probability that two people will share both attributes (conjunctive probability).

So that’s how we dealt with an annoying but common problem in football data modeling.  We’ll have more to say about other aspects of the data models later.