Written for a beginner by a beginner.
What is R?
My previous post was the first time I produced visuals for a post using R, open-source software for data analysis and visualisation. R is a programming language, which means that rather than typing data into cells or clicking icons as you might in Excel, you type commands.
Why use R?
So, what is the point of learning R? Here I will defer to Rob Carroll (@thevideoanalyst on Twitter) who wrote in a post titled ‘Excel is dead long live…’ that ‘no matter what systems organisations seem to use everyone ends up in Excel. While this might work fine for basic tasks like auto sum and filtering data, the sheer quantity of data we are now producing is staggering and Excel cannot handle what’s coming’. One only has to think of the growth of tracking technologies/GPS or the future of wearables to see that Excel might not be the best tool to collect, analyse and present all this data.
Additionally, R has fantastic graphical capabilities. Not only are its built-in graphs great, but you can also download packages that extend R, allowing you to make new and different graphs. Now that I have outlined what R is and why you should use it, I will take you through the process of creating a graph.
R and RStudio
Firstly you will need to download R (another great thing about it is that it works on Windows, Mac and Linux). After this I recommend downloading RStudio (screenshot below).
The benefit of RStudio is its panel display: at the same time you can view your data, the help menu, the packages you have installed, all the plots you have previously made and the console where you enter your commands. In the screenshot above you can see I have imported some data – a .csv file titled ‘EHLnew’. Although you can manipulate your data in R, at the moment I prefer to prepare a file in Excel. When you import a dataset in RStudio (see screenshot below) you can choose your separator and check that your data frame (essentially a data table in a format that R can use) is correct.
For this post I will use the package ggplot2. ggplot2 is one of the most popular packages for graphing in R as its default aesthetics are much nicer than R’s base graphics. Follow the method below (code in blue) to install ggplot2. The library(ggplot2) command just tells R that I want to open and use ggplot2.
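As a minimal sketch of the install-and-load step described above: the check with requireNamespace() and the CRAN mirror address are my own additions so that the package is only downloaded when it is missing, and the install needs an internet connection.

```r
# Install ggplot2 only if it is not already on this machine
# (the mirror URL is an assumption; any CRAN mirror works)
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", repos = "https://cloud.r-project.org")
}

# Open ggplot2 so its functions can be used in this session
library(ggplot2)
```

You only need to install a package once, but library(ggplot2) has to be run in every new R session.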
Now that it’s ready to go, I will start graphing.
Creating a graph
I want to investigate the relationship between goals scored at home and away. Plotting these two in a scatterplot may help identify the relationship between these two variables (presumably teams which score lots at home will score lots away) but also any outliers (teams who score lots at home but few away, for instance).
To do this I entered the following code:
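A minimal sketch of that kind of call is below. The column names HomeGoals and AwayGoals are assumptions (use whatever your columns are called), and the small data frame is made up purely so the example runs without my .csv file.

```r
library(ggplot2)

# Made-up stand-in for the EHLnew data frame (not the real data)
EHLnew <- data.frame(
  HomeGoals = c(2.1, 3.4, 4.0, 1.5, 5.2, 2.8),
  AwayGoals = c(1.8, 2.9, 3.1, 1.2, 4.0, 3.5)
)

# Data first, then which columns go on each axis, then the type of graph
ggplot(EHLnew, aes(x = HomeGoals, y = AwayGoals)) + geom_point()
```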
Here, EHLnew is my data, x and y refer to the graph’s axes and + geom_point() tells ggplot2 that I want to make a scatterplot. Different graphs can be made by adding different code; for comparison, + geom_line() produces a line graph. The full range of graphs (or geoms) that can be made in ggplot2 (and their code) can be viewed here. The above code produces the plot below:
To save retyping the above code, I will assign it to the word ‘graph’ (like how in SportsCode you assign text/equations using the $) by using <- (below):
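The assignment might look something like this. As before, the column names and the made-up data frame are assumptions so the example runs on its own.

```r
library(ggplot2)

# Made-up stand-in for the EHLnew data frame (not the real data)
EHLnew <- data.frame(
  HomeGoals = c(2.1, 3.4, 4.0, 1.5, 5.2, 2.8),
  AwayGoals = c(1.8, 2.9, 3.1, 1.2, 4.0, 3.5)
)

# Store the plot under the name 'graph' instead of drawing it straight away
graph <- ggplot(EHLnew, aes(x = HomeGoals, y = AwayGoals)) + geom_point()

# Typing the name draws the stored plot
graph
```

From here on, new layers can be added with graph + ... rather than retyping the whole line.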
Editing and Exploring
Whilst the above graph is a good start, there are a lot of improvements that need to be made. Let’s start by editing the axes and adding a title and trend line:
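A sketch of these edits, under some assumptions: the axis names and title text are placeholders I have chosen for illustration, the column names are hypothetical, and the data frame is made up so the code runs on its own.

```r
library(ggplot2)

# Made-up stand-in for the EHLnew data frame (not the real data)
EHLnew <- data.frame(
  HomeGoals = c(2.1, 3.4, 4.0, 1.5, 5.2, 2.8),
  AwayGoals = c(1.8, 2.9, 3.1, 1.2, 4.0, 3.5)
)
graph <- ggplot(EHLnew, aes(x = HomeGoals, y = AwayGoals)) + geom_point()

# Each + adds one change on top of the stored scatterplot
graph_edited <- graph +
  scale_x_continuous(name = "Home goals", breaks = seq(0, 8, 1), limits = c(0, 8)) +
  scale_y_continuous(name = "Away goals", breaks = seq(0, 8, 1), limits = c(0, 8)) +
  ggtitle("Goals scored at home and away") +  # placeholder title
  geom_smooth(method = lm)                     # linear trend line

graph_edited
```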
The breaks=seq(0,8,1),limits=c(0,8) code tells R that I want to start the x axis at 0, end at 8, and have breaks at every 1 unit. + geom_smooth(method = lm) adds a linear regression line with a 95% confidence region. The code makes the following graph, which is much more presentable:
Now I will further explore my data. I want to continue to examine the relationship between home and away goals but see how or if it has changed over time or differs by league. I can do this easily using the + facet_grid() command and have provided 3 examples below:
Example 1: + facet_grid(. ~ League)
Example 2: + facet_grid(. ~ Year)
Example 3: + facet_grid(League ~ Year)
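The three examples can be sketched as follows. The League and Year columns match the names used in the examples, but HomeGoals and AwayGoals are assumed names and the numbers are randomly generated stand-ins, not my real data.

```r
library(ggplot2)

# Made-up stand-in for EHLnew: 10 teams x 2 leagues x 2 seasons
set.seed(1)
EHLnew <- expand.grid(Team = 1:10,
                      League = c("East", "West"),
                      Year = c("2010-11", "2014-15"))
EHLnew$HomeGoals <- runif(nrow(EHLnew), 1, 6)
EHLnew$AwayGoals <- 0.7 * EHLnew$HomeGoals + runif(nrow(EHLnew), 0, 1.5)

base <- ggplot(EHLnew, aes(x = HomeGoals, y = AwayGoals)) +
  geom_point() +
  geom_smooth(method = lm)

by_league <- base + facet_grid(. ~ League)       # Example 1: one column per league
by_year   <- base + facet_grid(. ~ Year)         # Example 2: one column per year
both      <- base + facet_grid(League ~ Year)    # Example 3: leagues as rows, years as columns

both
```

Whatever goes before the ~ is split into rows and whatever goes after it is split into columns; a . means "do not split in this direction".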
The range of relationships in the above graph is interesting: some years and leagues seem to have very strong relationships between scoring at home and scoring away (West 2014-15 for instance) whilst others (East 2010-11) do not. With only 10 teams per season per league, though, outliers have the potential to have a large effect on the strength of these relationships.
The above graph is a great example of what makes R so good, and how it can be used to do things that cannot be done in Excel. A little code goes a long way.
FiveThirtyEight style graphs
I enjoy reading FiveThirtyEight, created by Nate Silver, whose book The Signal and the Noise I thoroughly recommend. The site is dedicated to data-driven storytelling across a range of subjects. As a sports analysis blogger, I count their post on Lionel Messi as a favourite of mine and a must-read for football fans. I also like the ‘style’ of their graphs and want to use it on my graphs above. There are two ways of doing this: manually adding/editing layers (writing code) or installing a package that contains a FiveThirtyEight theme.
Method 1: Writing code
Firstly I will assign the code used to create the last graph to ‘graph2’ (though I have edited the x axis breaks to tidy it up a bit):
Now, I will add layers to recreate FiveThirtyEight’s style:
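A rough sketch of both steps is below. Everything specific in it is an assumption: the column names, the randomly generated data, the x axis breaks and especially the colour codes, which are simply light greys I picked to approximate the look rather than official FiveThirtyEight values.

```r
library(ggplot2)

# Made-up stand-in for EHLnew: 10 teams x 2 leagues x 2 seasons
set.seed(1)
EHLnew <- expand.grid(Team = 1:10,
                      League = c("East", "West"),
                      Year = c("2010-11", "2014-15"))
EHLnew$HomeGoals <- runif(nrow(EHLnew), 1, 6)
EHLnew$AwayGoals <- 0.7 * EHLnew$HomeGoals + runif(nrow(EHLnew), 0, 1.5)

# The faceted graph, stored as 'graph2', with fewer x axis breaks
graph2 <- ggplot(EHLnew, aes(x = HomeGoals, y = AwayGoals)) +
  geom_point() +
  geom_smooth(method = lm) +
  scale_x_continuous(breaks = seq(0, 8, 2), limits = c(0, 8)) +
  facet_grid(League ~ Year)

# Each line inside theme() edits one part of the graph's appearance
graph538 <- graph2 +
  theme(
    plot.background  = element_rect(fill = "#F0F0F0", colour = NA),  # area around the panels
    panel.background = element_rect(fill = "#F0F0F0", colour = NA),  # area behind the points
    panel.grid.major = element_line(colour = "#D0D0D0"),             # main grid lines
    panel.grid.minor = element_blank(),                              # remove minor grid lines
    axis.ticks       = element_blank(),                              # remove tick marks
    strip.background = element_blank(),                              # remove grey facet label blocks
    plot.title       = element_text(face = "bold")
  )

graph538
```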
From the above code it should be reasonably easy to understand what each layer edits. To find the appropriate colours and their code to use in R I recommend a site like this one. The code produces the following output:
I think this is a further improvement on the previous version of this graph. I felt there were too many lines, the white was too prominent and the x axis was too messy. The FiveThirtyEight style graph is much cleaner, and thus more effective.
Method 2: Package
The package ggthemes contains a range of graph styles that are based on well-known sites and styles such as The Economist and The Wall Street Journal. To install and open ggthemes we follow the same method as for ggplot2:
Now, I simply need to add + theme_fivethirtyeight() to apply the theme to my graph.
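Put together, the package route might look like the sketch below. The column names and generated data are assumptions as before, and the requireNamespace() check and mirror address are my additions so that ggthemes is only downloaded when it is missing.

```r
# Install ggthemes only if it is missing (needs an internet connection)
if (!requireNamespace("ggthemes", quietly = TRUE)) {
  install.packages("ggthemes", repos = "https://cloud.r-project.org")
}
library(ggplot2)
library(ggthemes)

# Made-up stand-in for EHLnew: 10 teams x 2 leagues x 2 seasons
set.seed(1)
EHLnew <- expand.grid(Team = 1:10,
                      League = c("East", "West"),
                      Year = c("2010-11", "2014-15"))
EHLnew$HomeGoals <- runif(nrow(EHLnew), 1, 6)
EHLnew$AwayGoals <- 0.7 * EHLnew$HomeGoals + runif(nrow(EHLnew), 0, 1.5)

graph2 <- ggplot(EHLnew, aes(x = HomeGoals, y = AwayGoals)) +
  geom_point() +
  geom_smooth(method = lm) +
  facet_grid(League ~ Year)

# One extra layer applies the whole theme
graph2_538 <- graph2 + theme_fivethirtyeight()
graph2_538
```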
It produces this graph:
Whilst it’s great that it edits large parts of the graph with minimal code, for this style of graph (using facet_grid) it’s hard to see where one plot begins and another ends, the axes aren’t labelled and there’s no title. Nevertheless, it is possible to edit the above graph (see below) and it does have the advantage of getting rid of the grey blocks around each year and league:
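One possible way of making those edits is sketched below; it is not necessarily the neatest. It relies on the fact that a theme() layer added after theme_fivethirtyeight() overrides the theme’s settings. The title, axis names, spacing value, column names and data are all assumptions for illustration.

```r
library(ggplot2)
library(ggthemes)

# Made-up stand-in for EHLnew: 10 teams x 2 leagues x 2 seasons
set.seed(1)
EHLnew <- expand.grid(Team = 1:10,
                      League = c("East", "West"),
                      Year = c("2010-11", "2014-15"))
EHLnew$HomeGoals <- runif(nrow(EHLnew), 1, 6)
EHLnew$AwayGoals <- 0.7 * EHLnew$HomeGoals + runif(nrow(EHLnew), 0, 1.5)

graph2 <- ggplot(EHLnew, aes(x = HomeGoals, y = AwayGoals)) +
  geom_point() +
  geom_smooth(method = lm) +
  facet_grid(League ~ Year)

graph2_edited <- graph2 +
  theme_fivethirtyeight() +
  theme(
    axis.title    = element_text(),    # bring back the axis labels the theme hides
    panel.spacing = unit(1.5, "lines") # more space between the individual plots
  ) +
  labs(x = "Home goals", y = "Away goals") +
  ggtitle("Goals scored at home and away")  # placeholder title

graph2_edited
```

The order matters here: if theme() came before theme_fivethirtyeight(), the package theme would overwrite my changes.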
I hope that this post has helped those of you who are beginning to use R or are thinking about using it. I have tried to show all the steps and code I used, as I have found that many online guides, even those aimed at beginners, can be tricky to follow.
I’m also aware that there is probably a quicker and/or better way to produce the graphs in this post, so if you are a more advanced R user feel free to leave a comment below.