UPDATE (21 January 2016): I added some additional content throughout the post, but especially in the caveats, what to learn and where to learn it, and breaking into sports analytics. I also added Dean Oliver’s tweet, which reminded me that you need to know your sport to do sports analytics.
UPDATE #2 (9 November 2016): Added a tweet from Andrew Ng, and emphasized the need to have humility.
UPDATE #3 (20 December 2016): Added thoughts on approaching data science from a social science background.
UPDATE #4 (19 December 2017): Made style/syntax edits throughout the post, and added a section on legal/ethical awareness.
Ever since I started Soccermetrics almost seven years ago, I have received emails from people seeking advice on getting started in the sports analytics field. At the beginning I gave what I thought were honest and tailored answers that drew upon my own background and experience. After a while I stopped, because drafting responses required a lot of time and effort, and it seemed that most people lost interest once they learned about all the math courses I took.
This post is an attempt to put all of my advice in one place so that I only have to write it once and can then direct others to it. The advice is applicable to those who want to enter data science, but I’ll focus more on sports analytics toward the end.
It would seem fairly obvious that what I’m about to write is my opinion, but for the uninitiated: yes, these writings are my own opinions, based on a worldview and experiences accumulated over 20 years of education and work in academic, government, and industrial settings. They are opinionated. They are general, but lean heavily toward the math and statistics side of analytics. You probably will not like what I have to say. You may believe that they don’t apply to your specific situation. You may even think that they are wrong. If you disagree, say so in the comments, write your own post, or chart your own path.
I am neither an admissions officer nor a guidance counselor, so I cannot give personalized career advice or suggest a specific university, academic program, or thesis advisor. It’s up to you to make an honest inventory of your skills and interests (more on that below) and — with assistance from people you trust — create a career plan that is the best fit for you.
The same advice also applies to those who seek employment with a sports team. I can’t make recommendations to specific organizations, and even if I could, having a (paid) position open up right at the time that you are looking requires providential timing and prodigious searching. (Sports teams will always accept people willing to work for free, but unpaid jobs are evil.)
While the advice that I’m about to write is my own, it is not created in a vacuum. I read and listened to similar posts and videos; I didn’t agree with everything presented, but I did use them to refine my thoughts. Here’s a sample of those articles:
- Advice from select data science writers on Quora
- So you want to be a data scientist, on the NatureJobs blog
- The Data Science Handbook, which interviews 25 high-profile data scientists
- How to become a data scientist [San Francisco Data Science Meetup]
- What is a data scientist? by Mike Gualtieri
Are you sure you want to do this?
It is true that data science is a hot field in terms of jobs and compensation today. Sports analytics (and the sports industry in general) is a very sexy industry, but the path to a job is extremely steep. The title “analyst” means very different things within the sports enterprise, and covers anything from glorified video analyst to talent scout to compiler of opposition reports to (at very few places) a proper data scientist. A lot of people want to work for sports teams and organizations, and the sports teams and organizations are fully aware of this. Perhaps it is because of this awareness that the compensation is nowhere close to what is possible in other data science jobs. This is especially true if you have an advanced degree, which a lot of data scientists have. There’s also a cultural resistance to analytics in professional sports, and especially soccer, but that attitude is slowly changing as clubs recognize the value of analytics and offer competitive compensation to their analysts.
Mark Cuban’s interview at the 2012 MIT Sloan Sports Analytics Conference, in which he strongly recommends against new college graduates working in the sports industry, was controversial, but it’s difficult to refute his arguments. Perhaps the operative term is “new college graduate”, because I know of graduates with more work experience who have pitched themselves to Cuban and are now working for him.
What kind of data analyst are you?
Analytics means a lot of things to different people. Some are more comfortable with data modeling, others with machine learning research or application, another group with data visualization, and others with translating business objectives into data problems. All of these areas require knowledge of a similar set of skills but in varying levels of depth, so it’s important to understand where your interests lie.
To help with this understanding, I recommend that you read Analyzing the Analyzers by Harlan Harris, Sean Patrick Murphy, and Marck Vaisman and available for free from O’Reilly Media. They survey several hundred data science practitioners on their skills, careers, and experiences in finding work, and they come up with some illuminating descriptions of the four types of careers in data science. It’s an easy read and it will help you clarify where you fit in data science.
What skills do you need to learn?
Now I get into the skills that you need to learn in order to get into analytics. Again, I write from the perspective of someone who has been working on the computational/algorithmic side of analytics for close to 20 years, but I argue that everyone involved in analytics needs to know material in these subjects. How much material will vary, of course.
There is really no getting around it — to do analytics, you must learn mathematics. If the thought of mathematics brings back repressed memories from high school, or you feel that you don’t have the talent, get over it. Practice. Even those who we think are math geniuses became that way by practicing their craft constantly.
So what kind of mathematics? At a minimum, I would learn linear algebra and probability. If you seek solutions to a system of equations (multiple equations with multiple unknowns), you need to learn linear algebra. Probability will teach you how to compute and measure the uncertainty that occurs in nature. If you only have time and mindspace to learn one math subject, learn probability. But learn both — both subjects are straightforward to learn.
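To make "multiple equations with multiple unknowns" concrete, here is a minimal sketch of solving a small linear system by Gaussian elimination using only the Python standard library. The function name `solve_linear_system` and the league-table numbers are made up for illustration; in practice you would more likely reach for a numerical library than write this yourself.

```python
from fractions import Fraction


def solve_linear_system(a, b):
    """Solve a x = b for a square matrix a by Gauss-Jordan elimination.

    Uses exact rational arithmetic, so it is only meant for small
    illustrative systems, not for large numerical problems.
    """
    n = len(a)
    # Build the augmented matrix [a | b] with exact fractions.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            raise ValueError("system has no unique solution")
        m[col], m[pivot] = m[pivot], m[col]
        # Scale the pivot row so the pivot entry becomes 1.
        pivot_value = m[col][col]
        m[col] = [v / pivot_value for v in m[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and m[r][col] != 0:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[-1] for row in m]


# Hypothetical example: a team went unbeaten in 30 matches and earned 70 points
# (3 per win, 1 per draw). How many wins and draws?
#   wins +   draws = 30
# 3*wins +   draws = 70
wins, draws = solve_linear_system([[1, 1], [3, 1]], [30, 70])
print(wins, draws)  # 20 10
```

The same idea scales to many equations and unknowns, which is where a linear algebra course (and a proper numerical library) earns its keep.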
Multivariate calculus is a good subject to learn and serves as a gateway to more advanced mathematics, but it’s not absolutely necessary. Numerical methods is another valuable subject as it involves solving systems of equations numerically; how much you need to learn will vary. There are other subjects that have been useful to me, such as linear dynamical systems, differential equations, topology, combinatorics, or convex optimization, but I learned those topics during graduate school work and it’s not necessary to know them to work in analytics. More tools in the toolbox do permit different angles from which to attack problems, but you also need to figure out when it is useful to use them.
Statistics is a subject that has a deep reach into society, but unfortunately, few people are proficient in it or even understand it. I agree with Nate Silver that in today’s society, it is more important to learn statistics than calculus in schools, but of course it is nice to know both. There are two branches that make up statistics: descriptive statistics, which is what sports people think of when they say “statistics” (tabulations, sums, means, and standard deviations), and inferential statistics, which attempts to draw conclusions from the data.
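As a small illustration of the difference, the sketch below computes a few descriptive statistics for a made-up sample and then takes one step toward inference by estimating the standard error of the mean. The goal counts are hypothetical, and the standard error is only a first taste of inferential thinking, not a full analysis.

```python
import math
import statistics

# Hypothetical data: goals scored by one team in five matches.
goals_per_match = [2, 0, 1, 3, 1]

# Descriptive statistics summarize the sample we actually observed.
total_goals = sum(goals_per_match)
sample_mean = statistics.mean(goals_per_match)
sample_sd = statistics.stdev(goals_per_match)  # sample standard deviation (divides by n - 1)

# A step toward inference: the standard error estimates how much the sample
# mean would vary if we could observe many different five-match samples.
standard_error = sample_sd / math.sqrt(len(goals_per_match))

print(f"total={total_goals} mean={sample_mean:.2f} "
      f"sd={sample_sd:.2f} se={standard_error:.2f}")
```

With only five matches the standard error is large relative to the mean, which is exactly the kind of caution that descriptive numbers alone do not convey.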
I would learn traditional (frequentist) statistics in order to understand the language of statistical inference. And once you appreciate the limitations of those techniques, learn Bayesian inference. Bayesian inference expresses uncertainty about an event in terms of probabilities and updates that uncertainty with data. It has not been as accessible as frequentist statistics, but there are attempts to make it more understandable to people from different backgrounds (this site aimed at hackers, for example).
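Here is a minimal sketch of what "updates that uncertainty with data" can look like, using the standard Beta-Binomial model. The scenario (a player's penalty conversion rate), the uniform prior, and the helper names `update_beta`, `beta_mean`, and `beta_variance` are assumptions chosen for illustration.

```python
def update_beta(alpha, beta, successes, failures):
    """Return Beta posterior parameters after observing binomial data.

    With a Beta(alpha, beta) prior on a success probability, observing
    `successes` and `failures` gives a Beta(alpha + successes, beta + failures)
    posterior.
    """
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def beta_variance(alpha, beta):
    """Variance of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return (alpha * beta) / (total ** 2 * (total + 1))


# Prior belief: Beta(1, 1), i.e. every conversion rate is equally plausible.
prior = (1.0, 1.0)

# Hypothetical data: the player scores 7 of 10 penalties.
posterior = update_beta(*prior, successes=7, failures=3)

print("prior mean:", beta_mean(*prior))          # 0.5
print("posterior mean:", beta_mean(*posterior))  # 8 / 12, about 0.667
print("uncertainty shrank:", beta_variance(*posterior) < beta_variance(*prior))
```

Note that the posterior mean (about 0.667) sits between the prior mean (0.5) and the raw conversion rate (0.7): the data pull the estimate, but ten penalties are not enough to override the prior completely.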
Machine learning draws from subjects in mathematics and statistics, and is practically its own subject (there are PhD programs in Machine Learning at a number of universities). I would be familiar at least with the major algorithms in machine learning — supervised algorithms such as linear and logistic regression, or unsupervised algorithms such as clustering — and then go deeper depending on what your goals are. As with all algorithms, know their uses, assumptions, and limitations. In particular, understand how to formulate machine learning problems that will yield reliable and meaningful results, from selecting the training/evaluation/testing data sets to deciding on the features to the metrics that you will use to evaluate the model performance.
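As one small piece of that formulation, the sketch below splits a data set into training, evaluation, and testing subsets with a fixed random seed so the split is reproducible. The function name `split_dataset`, the 60/20/20 proportions, and the integer "rows" are illustrative assumptions rather than a recommended recipe.

```python
import random


def split_dataset(rows, train_frac=0.6, eval_frac=0.2, seed=42):
    """Shuffle rows reproducibly and split them into train/eval/test lists.

    Whatever is left after the training and evaluation fractions
    becomes the test set.
    """
    if not (0 < train_frac < 1 and 0 <= eval_frac < 1 and train_frac + eval_frac < 1):
        raise ValueError("fractions must leave room for a non-empty test set")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # local RNG, so global state is untouched
    n_train = round(len(shuffled) * train_frac)
    n_eval = round(len(shuffled) * eval_frac)
    train = shuffled[:n_train]
    evaluation = shuffled[n_train:n_train + n_eval]
    test = shuffled[n_train + n_eval:]
    return train, evaluation, test


# Hypothetical data set: 100 row identifiers standing in for real records.
train, evaluation, test = split_dataset(range(100))
print(len(train), len(evaluation), len(test))  # 60 20 20
```

The test set should be touched only once, at the very end; tuning a model against it defeats its purpose.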
To be able to implement your analytics projects, you will have to do coding. The most popular programming languages in the data science world right now are R and Python, but some people also use Java, Scala, Julia, Clojure, or C/C++. As that article by Trey Causey says, it almost doesn’t matter which language you learn first, as long as you learn one and go deep. Write code that breaks, understand why it breaks, and understand the capabilities and limitations of the language, which will motivate you to learn one that meets more of your needs.
R has a long history in the statistics world and there is a large ecosystem of libraries for a wide range of statistical analyses, whether frequentist or Bayesian. Python has been playing catch-up in the data science department, but in the last four years there has been an explosion of tools in data processing, modeling, analysis, and visualization. Python also has the advantage of being a proper programming language and an easier language to write (and read).
While nothing is stopping you from learning both (and knowing both is a good idea), if you have time to learn just one, learn Python. While you are at it, learn data structures and the basics of object-oriented programming.
There’s not a good section in which to place this, but I should also mention what type of computer platform to use. I use Linux for almost all of my programming and analysis. I also have a Mac, which is basically Unix under the hood, but I don’t use it for number crunching. It comes down to personal preference, but you should know your way around a command line.
It is possible to do analytics in a single spreadsheet file. It’s easy and anyone can create one. Unfortunately, spreadsheets don’t scale to large projects.
Learn relational database design. It’s true that non-relational databases are a hot topic and carry some advantages for large-scale uses, but you can still get very far with relational databases. Sports data is full of relations between entities, which lends itself perfectly to a relational data design. Once you’ve learned relational database design — not before! — learn SQL.
As for databases, SQLite is a good starting database to learn — it doesn’t require a server, it’s small and it’s fast. If you need to build more production-quality databases, learn PostgreSQL or MySQL (or MariaDB).
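To show how sports entities map onto relations, here is a minimal sketch using the `sqlite3` module from the Python standard library and an in-memory database. The schema (teams, players, goals) and all of the names and numbers in it are invented for illustration; a real design would carry far more detail.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

# Each entity gets its own table; relations are expressed with foreign keys.
conn.executescript("""
CREATE TABLE teams (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE players (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    team_id INTEGER NOT NULL REFERENCES teams(id)
);
CREATE TABLE goals (
    id        INTEGER PRIMARY KEY,
    player_id INTEGER NOT NULL REFERENCES players(id),
    minute    INTEGER NOT NULL CHECK (minute >= 1)
);
""")

# Hypothetical rows.
conn.executemany("INSERT INTO teams (id, name) VALUES (?, ?)",
                 [(1, "Team A"), (2, "Team B")])
conn.executemany("INSERT INTO players (id, name, team_id) VALUES (?, ?, ?)",
                 [(1, "Player One", 1), (2, "Player Two", 1), (3, "Player Three", 2)])
conn.executemany("INSERT INTO goals (player_id, minute) VALUES (?, ?)",
                 [(1, 12), (1, 77), (3, 45)])
conn.commit()

# Goals per team: join across the relations instead of duplicating data.
rows = conn.execute("""
    SELECT t.name, COUNT(g.id) AS goals
    FROM teams AS t
    JOIN players AS p ON p.team_id = t.id
    LEFT JOIN goals AS g ON g.player_id = p.id
    GROUP BY t.id, t.name
    ORDER BY t.name
""").fetchall()
print(rows)  # [('Team A', 2), ('Team B', 1)]
```

Because each fact lives in exactly one place, renaming a team or moving a player is a single-row change, which is the main payoff of getting the design right before writing any SQL.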
And if you know Python, learn SQLAlchemy. It makes creating and accessing databases much easier. I also recommend buying the 2nd edition of Essential SQLAlchemy. It’s a fantastic book and, by way of disclosure, I know and worked for one of the authors (Rick Copeland).
Data visualization is part of the process of communicating our analytical findings to a broader audience. It is more art than science, and what you show and how you show it depends on your intent — are you trying to inform, or persuade?
Before you go too deep into a specific visualization tool or library, think about how to design your data visualization. This lecture by Noah Iliinsky and this document from the Interaction Design Foundation provide some ideas and examples. You can find examples of innovative visualization approaches here.
As for books, if you can get your hands on The Semiology of Graphics by Jacques Bertin, do so. First written in French in the 1960s, it is a classic book on information visualization that wasn’t widely known in the English-speaking world until relatively recently. You can buy the book at ESRI Press as well. I thank Kirk Goldsberry for mentioning it during his plenary talk at CASSIS 2016 in Vancouver. Edward Tufte’s The Visual Display of Quantitative Information is very good and recommended, but I would get that book only if you can’t get Bertin’s.
There are lots of free and open-source visualization tools out there for the language of your choice, but I won’t link to them because (a) it’s not that difficult to find them via a web search and (b) I want to focus more on design.
In this section I refer to written and spoken communication but the fundamental advice is this: learn to speak and write in the English language as correctly and succinctly as you can.
I write “the English language” intentionally because English is the international language of science and business and the overwhelming majority of analytics content is in English. Of course, it is important to be proficient in your native language or any other language in which you wish to communicate.
Learn to be succinct and communicate technical concepts as concisely as possible. Some concepts are complex, but most of the time complex writing hides muddled thinking on a subject. Just as with mathematics, to become a better writer you must practice often, and that means that you must write often. It’s important to read other people’s writing in order to identify good writing and emulate it, as well as to identify opaque and sloppy writing and avoid it. To this end, I recommend Style: Lessons in Clarity and Grace by Joseph Williams and Gregory Colomb. There are plenty of books on technical writing, but if you write well, you can write about technical topics well.
(And for the love of all that’s holy, know the difference between its and it’s. Please.)
Public speaking is hard. I’m not a naturally talented speaker, and I’m much more comfortable as a writer. But I am a better speaker for having practiced over and over again. Take a creative writing course, a public speaking course or even an acting and speaking class. Place yourself into situations where you will have to write and speak often. Like writing a blog, for instance.
Where can you acquire these skills?
Now that I’ve described the skills that you need to get into analytics, I’ll discuss where you can acquire them.
My undergraduate and graduate education was in engineering shortly after the end of the Cold War through the first dot-com boom (1990s and first half of 2000s) and as a result, I was exposed to much of the math and some of the statistics that I would see in data science assignments. I knew programming, but I was doing much of my computational work in C/C++ and Matlab and my data munging in Perl. Python was around, but it had yet to go mainstream. I didn’t know anything about databases.
I write about my own experience to say that while a traditional degree program in Mathematics, Statistics, Operations Research, Engineering, or Computer Science will provide you with most of what you need to learn in data science, you will still have to learn a number of subjects on your own. More universities are starting certificate or degree-granting programs in Data Science (or Big Data Analytics, or Predictive Analytics), which reminds me of the Financial Engineering fad ten years ago. These programs tend to be less than five years old, or even two years old, so they won’t have everything figured out. In general, the better Data Science programs happen to be at the universities with the best math, statistics, or AI departments — MIT, Stanford, Carnegie Mellon, NYU, Cambridge, Imperial College London, to name a few — but you really need to examine the course curricula.
It’s easy for someone like me, who is coming from a physical science background, to neglect the contributions of those in the social sciences to data science, but a few months ago I listened to an episode of the O’Reilly Data Show that challenged that mindset. There are advantages to coming to data science from fields such as political science, psychology, anthropology, or economics: people in those fields are already used to modeling non-physical systems (especially those with categorical data) and incorporating prior assumptions into their analysis, and they are familiar with having to present and defend those models to their non-technical peers. So a major in the social sciences with a strong analytical component can be very valuable if you wish to cross over into other fields.
In my opinion, it’s a poor strategy to pursue a degree in whatever field is popular right now. Financial Engineering and Quantitative Finance were super-hot fields in the mid-2000s, and universities were falling all over themselves to establish programs, and then the 2008 financial crisis happened. It’s always better to major in something you like but within reason — it’s stupid to go tens (or hundreds!) of thousands of dollars in debt to get a degree in something with few job prospects. But no matter what field you choose for your degree, remember that you will more than likely have to reinvent yourself within five to ten years.
The half-life of knowledge is decreasing. That’s why you need to keep learning your whole life, not only through college.
— Andrew Ng (@AndrewYNg) October 17, 2016
Which gets us to…
Everyone has to learn some topics in data science on their own — which topics will depend on the person’s previous education. The good news is that we are living in an era of freely accessible education in these subject areas.
MOOCs such as Coursera or edX contain a number of classes on the above topics that make up data science, and they’ve started to create certificate programs in data science. A really great resource is Open Source Data Masters which presents a curriculum for learning data science at a Masters degree level and links to the aforementioned MOOCs.
Perhaps the most important thing you will learn is humility — the recognition of how little you actually know. When you learn that, you’ll be motivated to learn more things.
Ethics and the Law
This topic doesn’t fall under any heading very well. Awareness of ethics and the law is not a prerequisite to enter data analytics to be sure, and it’s not a topic that many in the analytics community feel comfortable discussing. (I’m not sure I feel totally comfortable discussing this; I’m not perfect after all!) But I believe that everyone in the analytics enterprise needs to develop a system of personal and professional ethics as well as a familiarity with the law as it pertains to data usage. It is good and important to treat others equally, honestly, and respectfully, to conduct yourself appropriately and professionally in public meetings, to report your results accurately and honestly without deception, and to consider the legal and ethical dimensions of your work.
There are resources that will assist you in being aware of ethical considerations and dilemmas, such as online courses or courses at a local university, or professional societies related to data science (here is a Code of Conduct from one such society). As far as legal resources, you can find information on database rights (which is more relevant to residents of the United Kingdom and the European Union) and data protection laws around the world.
Breaking into Sports Analytics
I leave this section for last — how to break into sports analytics.
If you want to do sports analytics, know 1. The sport (very well), 2. At least one of data/databases, stats, programming, 3. communication.
— Dean Oliver (@DeanO_Lytics) January 8, 2016
For starters, you need to be knowledgeable about your sport of interest. That should be understood, but it does require emphasis. That knowledge can come from having played the sport at some level, but it can also come from being an active and engaged fan. You need to understand the rules and strategies of the sport very well, and if you never played the sport beyond rec level you must work even harder to gain that knowledge.
Learn the rules of the sports league very well. What are the rules of the competition? How are new players selected? What are the roster rules — are there different classes of contracts? Is there a salary cap?
You have to be in the same city as the sports team of interest. I used to think that that wasn’t necessary in an era of virtual meetings and instant communication, but sports teams do everything in-house and keep everything in-house. The road to trusted outside consultant is extremely long and steep. If you have to, move.
Communication is everything in sports analytics. You will be communicating to decisionmakers who know nothing about specific analytic models and may not know much about analytics beyond the Moneyball caricature (owners such as Matthew Benham and Vivek Ranadivé are the exceptions that prove the rule). To convey analytics results succinctly means that you have to understand them more deeply than you will communicate.
Work on bite-size projects that interest you and write about them. If they’re too small, you won’t go into enough depth and may be drawn into premature conclusions; too large, and it will take months for you to finish them. But consider a question that you want to answer, collect enough data to answer it, do the analysis, and present it. And then consider exceptions or wrinkles to the original problem, repeat the process, and write another article. Along the way, you’ll learn processes and acquire insight that will help you pursue more complicated problems.
Document your work thoroughly, from the data fields to the algorithms to the codes that generate the data visualizations. Your future self will thank you.
After you’ve worked on your projects, do a web search and find out who did the same thing. And yes, there will be other analysts who have done the same thing. Don’t forget to attribute them.
One excellent piece of advice that I heard from Aaron Schatz (Football Outsiders): become friends with the video analysts of your local sports team and ask to watch them work a game. You will learn a huge amount about what they are looking for, and you will also learn how demanding and labor-intensive their jobs are.
Be rigorous and precise, but also concise and understandable.
Develop a high standard for personal and professional integrity in your data collection, analysis, interpretation, and reporting, even if — no, especially if — your results go against the grain of conventional wisdom.
Get on Twitter. Twitter is where the sports analytics community congregates. Facebook and Instagram will reach different crowds, and Google+ doesn’t do anything for me.
Networking is important, and it’s good to be known, but it’s even better to be known as someone who does things or writes about interesting things.
Cultivate relationships and be generous with sharing contacts. It’s a small world.
Be nice and don’t be a jerk. It’s a small world.
Share as much of your work and data as you can, unless you have a defensible reason not to. (Refusing to share data because someone will challenge your results is not a defensible reason.) GitHub and Bitbucket are excellent resources in this regard.
Never lose humility. I mean it: Never. Lose. Humility.
Humility is often lacking in computational endeavors, and especially in analytics. Analytics has a lot of power, but many of its practitioners carry a lot of hubris, and reality has a way of knocking people hard on their rear ends. Recognize the limits of analytics, recognize and account for uncertainties in models and assumptions, and remember the human element that is always present.
Well, I’ve written a lot, and it’s more than enough advice. I hope it’s useful to someone.