Chapter 9 Some useful tools in R
Learning goals for this lesson
- Get to know some neat tools in R that can make coding more elegant - and easier
- Get introduced to the tidyverse
- Learn about loops
- Get to know the apply function family
9.1 An evolving language - and a lifelong learning process
The R universe is a very active space, with lots of improvements being made all the time in various places. Through these improvements, the language has evolved far beyond the relatively basic capabilities of base R. When I started learning R around 2010, I solved most of my problems with base R functions. This often resulted in convoluted code and ugly plots… I’d like to believe this was because the more advanced functions weren’t available yet, but the real reason is that my personal learning curve hadn’t caught up (and still hasn’t caught up) with the true state of the art in R programming.
Over the years, I have gradually come around to adopting some of these more modern tools and more elegant programming styles. Since we’ll be using some of these throughout the remaining chapters, it’s now time for an introduction. For all the tools in this chapter, there are much better and more comprehensive instruction materials elsewhere on the web (I’ll provide pointers), but I’ll try to give you the basics you need in order to follow the materials in this book.
9.2 The tidyverse
Many of the specific tools I want to introduce to you are part of the tidyverse, a set of packages developed by Hadley Wickham and his team. The whole collection is described here. I have only scratched the surface of this, but I encourage you to delve into this treasure chest to look for ways to improve your programming capabilities. Here, I’ll only highlight the functions that are used in this book. A nice feature of the tidyverse is that we only have to load a single package to access all the tidyverse capabilities: library(tidyverse) does the trick.
9.3 The ggplot2 package
We’ve already encountered ggplot2, so I’m just listing this here for completeness. Initially released in 2007 by Hadley Wickham, ggplot2 has become one of the most popular R packages, because it greatly facilitates making attractive figures. You can read up on the history of the package here.
A great introduction to ggplot2 and links to various tutorials etc. can be accessed here.
9.4 The tibble package
A tibble is an advanced version of a data.frame, which includes several improvements. These are described here. The most relevant improvement in my view is that tibbles don’t follow the classic data.frame habit of converting strings to factors at times when you don’t expect it. I’m fairly new to tibbles myself, but I’ll try to use them throughout the remainder of this book.
You can easily create a tibble from a normal data.frame (or a similar structure) by using the as_tibble command.
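As a minimal sketch, assuming a small data.frame named dat (the name and values are chosen here only to match the output below):

```r
library(tidyverse)

# dat is a hypothetical example data.frame
dat <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))

# convert it to a tibble and print it
d <- as_tibble(dat)
d
```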
## # A tibble: 3 × 2
## a b
## <dbl> <dbl>
## 1 1 4
## 2 2 5
## 3 3 6
9.5 The magrittr package - pipes
The main thing magrittr adds is a structure to organize workflows that are applied to the same dataset. A data structure such as a tibble can be subjected to one or multiple operations organized in a pipe. The notation for such a pipe is %>%.
For instance, we can calculate the sum of all numbers in the tibble d we created above by the following operation.
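A sketch of such a pipe, assuming d is the small 3 × 2 tibble holding the values 1 to 6 (recreated here so that the snippet runs on its own):

```r
library(tidyverse)

# stand-in for the tibble d used in this chapter
d <- tibble(a = c(1, 2, 3), b = c(4, 5, 6))

# d travels through the pipe and becomes the first argument of sum()
total <- d %>% sum()
total
```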
## [1] 21
Note that we didn’t have to pass d to the sum command as an input. After a pipe, the following function receives the object coming through the pipe as its first argument. You can add more commands by adding another pipe after the first one. We’ll get to some more complex - and more useful - examples below.
9.6 The tidyr package
tidyr provides useful functions for organizing your data. I’ll use the KA_weather dataset from chillR to demonstrate how some of these work.
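One way to get the first ten rows of this dataset as a tibble is sketched here. The object name KAw is an assumption of this example, and the chillR package needs to be installed:

```r
library(chillR)
library(tidyverse)

# KA_weather ships with chillR; keep only the first 10 rows
KAw <- as_tibble(KA_weather[1:10, ])
KAw
```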
## # A tibble: 10 × 5
## Year Month Day Tmax Tmin
## <int> <int> <int> <dbl> <dbl>
## 1 1998 1 1 8.2 5.1
## 2 1998 1 2 9.1 5
## 3 1998 1 3 10.4 3.3
## 4 1998 1 4 8.4 4.5
## 5 1998 1 5 7.7 4.5
## 6 1998 1 6 8.1 4.4
## 7 1998 1 7 12 6.9
## 8 1998 1 8 11.2 8.6
## 9 1998 1 9 13.9 8.5
## 10 1998 1 10 14.5 3.6
9.6.1 pivot_longer
We already encountered the pivot_longer function in the previous lesson. We can use this to transfer data from separate columns (e.g. Tmin and Tmax in this case) into distinct rows. In this example, we’ll have one row containing Tmin and one row for Tmax for each day of the record. We’ll often have to do this, for instance, when we want to use the ggplot2 package for plotting our data. Here’s how this works (with a pipe).
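A sketch of the call, assuming the ten-row tibble is stored as KAw (rebuilt by hand here from the values printed above, so that the snippet also runs without chillR):

```r
library(tidyverse)

# stand-in for the first 10 rows of KA_weather
KAw <- tibble(Year = 1998L, Month = 1L, Day = 1:10,
              Tmax = c(8.2, 9.1, 10.4, 8.4, 7.7, 8.1, 12, 11.2, 13.9, 14.5),
              Tmin = c(5.1, 5, 3.3, 4.5, 4.5, 4.4, 6.9, 8.6, 8.5, 3.6))

# stack the Tmax and Tmin columns into name/value pairs
KAwlong <- KAw %>% pivot_longer(cols = Tmax:Tmin)
KAwlong
```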
## # A tibble: 20 × 5
## Year Month Day name value
## <int> <int> <int> <chr> <dbl>
## 1 1998 1 1 Tmax 8.2
## 2 1998 1 1 Tmin 5.1
## 3 1998 1 2 Tmax 9.1
## 4 1998 1 2 Tmin 5
## 5 1998 1 3 Tmax 10.4
## 6 1998 1 3 Tmin 3.3
## 7 1998 1 4 Tmax 8.4
## 8 1998 1 4 Tmin 4.5
## 9 1998 1 5 Tmax 7.7
## 10 1998 1 5 Tmin 4.5
## 11 1998 1 6 Tmax 8.1
## 12 1998 1 6 Tmin 4.4
## 13 1998 1 7 Tmax 12
## 14 1998 1 7 Tmin 6.9
## 15 1998 1 8 Tmax 11.2
## 16 1998 1 8 Tmin 8.6
## 17 1998 1 9 Tmax 13.9
## 18 1998 1 9 Tmin 8.5
## 19 1998 1 10 Tmax 14.5
## 20 1998 1 10 Tmin 3.6
As you can see, we had to specify the columns that we wanted to stack up. Note that pivot_longer fulfills a similar function to the melt function of the reshape2 package, which I used until recently (and in earlier versions of this book). I find pivot_longer more intuitive, so I’ll be using this throughout the remaining chapters.
9.6.2 pivot_wider
We can also do an opposite conversion to the one implemented by pivot_longer by using the pivot_wider command.
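A sketch of the round trip, with KAw and KAwlong as assumed object names (KAw is rebuilt by hand from the printed values so that the snippet runs on its own):

```r
library(tidyverse)

# stand-in for the first 10 rows of KA_weather
KAw <- tibble(Year = 1998L, Month = 1L, Day = 1:10,
              Tmax = c(8.2, 9.1, 10.4, 8.4, 7.7, 8.1, 12, 11.2, 13.9, 14.5),
              Tmin = c(5.1, 5, 3.3, 4.5, 4.5, 4.4, 6.9, 8.6, 8.5, 3.6))
KAwlong <- KAw %>% pivot_longer(cols = Tmax:Tmin)

# spread the entries of the name column back into separate columns;
# the values are taken from the value column by default
KAwwide <- KAwlong %>% pivot_wider(names_from = name)
KAwwide
```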
## # A tibble: 10 × 5
## Year Month Day Tmax Tmin
## <int> <int> <int> <dbl> <dbl>
## 1 1998 1 1 8.2 5.1
## 2 1998 1 2 9.1 5
## 3 1998 1 3 10.4 3.3
## 4 1998 1 4 8.4 4.5
## 5 1998 1 5 7.7 4.5
## 6 1998 1 6 8.1 4.4
## 7 1998 1 7 12 6.9
## 8 1998 1 8 11.2 8.6
## 9 1998 1 9 13.9 8.5
## 10 1998 1 10 14.5 3.6
The names_from argument specified the column that contains the new column headers. In this example, the call would also have worked without this argument, but that may not always be the case.
9.6.3 select
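With the select function, we can pick out a subset of the columns of a data.frame or tibble. (Strictly speaking, select and the functions in the following subsections, filter, mutate and arrange, belong to the dplyr package rather than tidyr; both are loaded by library(tidyverse).)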
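A sketch that keeps three columns, assuming the ten-row tibble KAw (rebuilt by hand here from the printed values):

```r
library(tidyverse)

# stand-in for the first 10 rows of KA_weather
KAw <- tibble(Year = 1998L, Month = 1L, Day = 1:10,
              Tmax = c(8.2, 9.1, 10.4, 8.4, 7.7, 8.1, 12, 11.2, 13.9, 14.5),
              Tmin = c(5.1, 5, 3.3, 4.5, 4.5, 4.4, 6.9, 8.6, 8.5, 3.6))

# keep only the Month, Day and Tmax columns
KAw_sel <- KAw %>% select(Month, Day, Tmax)
KAw_sel
```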
## # A tibble: 10 × 3
## Month Day Tmax
## <int> <int> <dbl>
## 1 1 1 8.2
## 2 1 2 9.1
## 3 1 3 10.4
## 4 1 4 8.4
## 5 1 5 7.7
## 6 1 6 8.1
## 7 1 7 12
## 8 1 8 11.2
## 9 1 9 13.9
## 10 1 10 14.5
9.6.4 filter
The filter function reduces a data.frame or tibble to just the rows that fulfill certain conditions.
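A sketch with one condition; the threshold (Tmax above 10 °C) is inferred from the output below, and KAw is rebuilt by hand from the printed values:

```r
library(tidyverse)

# stand-in for the first 10 rows of KA_weather
KAw <- tibble(Year = 1998L, Month = 1L, Day = 1:10,
              Tmax = c(8.2, 9.1, 10.4, 8.4, 7.7, 8.1, 12, 11.2, 13.9, 14.5),
              Tmin = c(5.1, 5, 3.3, 4.5, 4.5, 4.4, 6.9, 8.6, 8.5, 3.6))

# keep only the days with a maximum temperature above 10 degrees
KAw_warm <- KAw %>% filter(Tmax > 10)
KAw_warm
```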
## # A tibble: 5 × 5
## Year Month Day Tmax Tmin
## <int> <int> <int> <dbl> <dbl>
## 1 1998 1 3 10.4 3.3
## 2 1998 1 7 12 6.9
## 3 1998 1 8 11.2 8.6
## 4 1998 1 9 13.9 8.5
## 5 1998 1 10 14.5 3.6
9.6.5 mutate
The mutate function is a workhorse for creating, modifying, and deleting columns of a data.frame or tibble.
Let’s first create new columns, e.g. two columns that contain Tmin and Tmax in Kelvin.
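A sketch of this step, assuming the temperatures are in degrees Celsius, so that adding 273.15 yields Kelvin (KAw is rebuilt by hand from the printed values; KAw_K is a name chosen for this example):

```r
library(tidyverse)

# stand-in for the first 10 rows of KA_weather
KAw <- tibble(Year = 1998L, Month = 1L, Day = 1:10,
              Tmax = c(8.2, 9.1, 10.4, 8.4, 7.7, 8.1, 12, 11.2, 13.9, 14.5),
              Tmin = c(5.1, 5, 3.3, 4.5, 4.5, 4.4, 6.9, 8.6, 8.5, 3.6))

# add two new columns with the temperatures in Kelvin
KAw_K <- KAw %>% mutate(Tmax_K = Tmax + 273.15, Tmin_K = Tmin + 273.15)
KAw_K
```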
## # A tibble: 10 × 7
## Year Month Day Tmax Tmin Tmax_K Tmin_K
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1998 1 1 8.2 5.1 281. 278.
## 2 1998 1 2 9.1 5 282. 278.
## 3 1998 1 3 10.4 3.3 284. 276.
## 4 1998 1 4 8.4 4.5 282. 278.
## 5 1998 1 5 7.7 4.5 281. 278.
## 6 1998 1 6 8.1 4.4 281. 278.
## 7 1998 1 7 12 6.9 285. 280.
## 8 1998 1 8 11.2 8.6 284. 282.
## 9 1998 1 9 13.9 8.5 287. 282.
## 10 1998 1 10 14.5 3.6 288. 277.
Now we delete these columns again, by setting them to NULL.
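A sketch of the deletion, which first recreates the Kelvin columns so that the snippet runs on its own (object names are assumptions of this example):

```r
library(tidyverse)

# stand-in for the first 10 rows of KA_weather, plus the Kelvin columns
KAw <- tibble(Year = 1998L, Month = 1L, Day = 1:10,
              Tmax = c(8.2, 9.1, 10.4, 8.4, 7.7, 8.1, 12, 11.2, 13.9, 14.5),
              Tmin = c(5.1, 5, 3.3, 4.5, 4.5, 4.4, 6.9, 8.6, 8.5, 3.6))
KAw_K <- KAw %>% mutate(Tmax_K = Tmax + 273.15, Tmin_K = Tmin + 273.15)

# setting a column to NULL inside mutate removes it
KAw_clean <- KAw_K %>% mutate(Tmax_K = NULL, Tmin_K = NULL)
KAw_clean
```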
## # A tibble: 10 × 5
## Year Month Day Tmax Tmin
## <int> <int> <int> <dbl> <dbl>
## 1 1998 1 1 8.2 5.1
## 2 1998 1 2 9.1 5
## 3 1998 1 3 10.4 3.3
## 4 1998 1 4 8.4 4.5
## 5 1998 1 5 7.7 4.5
## 6 1998 1 6 8.1 4.4
## 7 1998 1 7 12 6.9
## 8 1998 1 8 11.2 8.6
## 9 1998 1 9 13.9 8.5
## 10 1998 1 10 14.5 3.6
Now I’ll replace the original temperature values directly with the Kelvin values. The following code modifies these columns accordingly.
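The output below shows values around 280, i.e. temperatures in Kelvin, so this sketch overwrites both columns by adding 273.15 (KAw is rebuilt by hand from the printed values):

```r
library(tidyverse)

# stand-in for the first 10 rows of KA_weather
KAw <- tibble(Year = 1998L, Month = 1L, Day = 1:10,
              Tmax = c(8.2, 9.1, 10.4, 8.4, 7.7, 8.1, 12, 11.2, 13.9, 14.5),
              Tmin = c(5.1, 5, 3.3, 4.5, 4.5, 4.4, 6.9, 8.6, 8.5, 3.6))

# reusing an existing column name overwrites that column
KAw_mod <- KAw %>% mutate(Tmin = Tmin + 273.15, Tmax = Tmax + 273.15)
KAw_mod
```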
## # A tibble: 10 × 5
## Year Month Day Tmax Tmin
## <int> <int> <int> <dbl> <dbl>
## 1 1998 1 1 281. 278.
## 2 1998 1 2 282. 278.
## 3 1998 1 3 284. 276.
## 4 1998 1 4 282. 278.
## 5 1998 1 5 281. 278.
## 6 1998 1 6 281. 278.
## 7 1998 1 7 285. 280.
## 8 1998 1 8 284. 282.
## 9 1998 1 9 287. 282.
## 10 1998 1 10 288. 277.
There are many other interesting things you can do with mutate, so please check out the help file for more options.
9.6.6 arrange
arrange is a function to sort data in data.frames or tibbles.
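A sketch that sorts by Tmax in ascending order (the sort column is inferred from the output below; KAw is rebuilt by hand from the printed values):

```r
library(tidyverse)

# stand-in for the first 10 rows of KA_weather
KAw <- tibble(Year = 1998L, Month = 1L, Day = 1:10,
              Tmax = c(8.2, 9.1, 10.4, 8.4, 7.7, 8.1, 12, 11.2, 13.9, 14.5),
              Tmin = c(5.1, 5, 3.3, 4.5, 4.5, 4.4, 6.9, 8.6, 8.5, 3.6))

# sort the rows by Tmax, smallest value first
KAw_sorted <- KAw %>% arrange(Tmax)
KAw_sorted
```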
## # A tibble: 10 × 5
## Year Month Day Tmax Tmin
## <int> <int> <int> <dbl> <dbl>
## 1 1998 1 5 7.7 4.5
## 2 1998 1 6 8.1 4.4
## 3 1998 1 1 8.2 5.1
## 4 1998 1 4 8.4 4.5
## 5 1998 1 2 9.1 5
## 6 1998 1 3 10.4 3.3
## 7 1998 1 8 11.2 8.6
## 8 1998 1 7 12 6.9
## 9 1998 1 9 13.9 8.5
## 10 1998 1 10 14.5 3.6
You can also sort in descending order.
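For descending order, the sort column can be wrapped in desc, as in this sketch (again with a hand-built stand-in for KAw):

```r
library(tidyverse)

# stand-in for the first 10 rows of KA_weather
KAw <- tibble(Year = 1998L, Month = 1L, Day = 1:10,
              Tmax = c(8.2, 9.1, 10.4, 8.4, 7.7, 8.1, 12, 11.2, 13.9, 14.5),
              Tmin = c(5.1, 5, 3.3, 4.5, 4.5, 4.4, 6.9, 8.6, 8.5, 3.6))

# sort the rows by Tmax, largest value first
KAw_desc <- KAw %>% arrange(desc(Tmax))
KAw_desc
```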
## # A tibble: 10 × 5
## Year Month Day Tmax Tmin
## <int> <int> <int> <dbl> <dbl>
## 1 1998 1 10 14.5 3.6
## 2 1998 1 9 13.9 8.5
## 3 1998 1 7 12 6.9
## 4 1998 1 8 11.2 8.6
## 5 1998 1 3 10.4 3.3
## 6 1998 1 2 9.1 5
## 7 1998 1 4 8.4 4.5
## 8 1998 1 1 8.2 5.1
## 9 1998 1 6 8.1 4.4
## 10 1998 1 5 7.7 4.5
9.7 Loops
In addition to the tidyverse functions, we have to talk about an important code structure that will allow us to get a lot of work done in an efficient manner: loops. A loop allows us to repeat the same operation many times without having to explicitly retype (or copy and paste) the code. More importantly, it allows us to run the same code while introducing certain modifications in every run. You can read detailed explanations on loops here, but I’ll give you the basics in this chapter.
There are two basic types of loops: for loops and while loops. For both of them, we have to provide instructions that regulate the number of runs, as well as instructions on what to do in each of the runs.
9.7.1 For loops
In a for loop, we provide explicit instructions on how many times the code within the loop should be run. This is usually specified by providing a vector or list of elements and instructing R to run the code for each of these elements. This means that the number of times the code is run equals the number of elements in the vector or list. We need a counter (often called i but can also be any other variable name) to keep track of which run we’re in.
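A minimal sketch of such a loop, using the vector 1:3 so that the code runs three times:

```r
# i takes the values 1, 2 and 3 in turn; the body runs once per value
for (i in 1:3) print("Hello")
```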
## [1] "Hello"
## [1] "Hello"
## [1] "Hello"
This command ran the code three times, printing the same output each time. We can make this structure more complex by providing multiple lines of code within curly braces.
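A sketch of a loop with a multi-line body, assuming a variable called addition that starts at 1:

```r
addition <- 1

for (i in 1:3) {
  # increase the value by 1 in every run
  addition <- addition + 1
  # inside a loop, values are only shown if we print them explicitly
  print(addition)
}
```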
## [1] 2
## [1] 3
## [1] 4
The code in this loop added 1 to the variable addition (which had an initial value of 1) in each iteration and printed the resulting value. Note that inside a loop, you have to call print explicitly, because R doesn’t automatically display the values of expressions that are evaluated there.
We can add more flexibility to the operations by using the index i within the code block.
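A sketch in which the increment depends on the counter (addition again starts at 1):

```r
addition <- 1

for (i in 1:3) {
  # add the current value of the counter (1, then 2, then 3)
  addition <- addition + i
  print(addition)
}
```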
## [1] 2
## [1] 4
## [1] 7
Now we added the respective value of i to the addition element in each of the runs. We can also use i in more creative ways.
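For instance, the counter can serve as an index into a vector. In this sketch, the vector name people is an assumption of the example:

```r
people <- c("Paul", "Mary", "John")

for (i in 1:3) {
  # pick the i-th name and attach it to the greeting
  print(paste("Hello", people[i]))
}
```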
## [1] "Hello Paul"
## [1] "Hello Mary"
## [1] "Hello John"
The counter doesn’t have to be numeric; it can take many other forms, e.g. that of a string. We can therefore generate the same output as from the last code block by formulating this as follows:
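A sketch in which the loop runs directly over the strings:

```r
# i is now a string: first "Paul", then "Mary", then "John"
for (i in c("Paul", "Mary", "John")) {
  print(paste("Hello", i))
}
```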
## [1] "Hello Paul"
## [1] "Hello Mary"
## [1] "Hello John"
9.7.2 While loops
We can also control whether a loop keeps running with a while statement. The code is then run until the specified condition is no longer fulfilled. This only makes sense, of course, if the condition can change as a result of what happens inside the loop.
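A sketch of a while loop, assuming a variable called cond that starts at 5 and is reduced by 1 in each run:

```r
cond <- 5

while (cond > 0) {
  print(cond)
  # without this line the condition would stay TRUE forever
  cond <- cond - 1
}
```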
## [1] 5
## [1] 4
## [1] 3
## [1] 2
## [1] 1
As soon as cond reaches 0, the starting condition is no longer fulfilled, so that the code isn’t run again. Note that while loops can easily cause problems if the condition remains fulfilled regardless of what happens in the code block. Your code will then get hung up and needs to be cancelled manually.
9.8 apply functions
In addition to loops, R has another elegant method for applying certain operations to multiple elements at the same time. This is often a more concise, and sometimes also a faster, way of getting things done. Such operations are implemented by the functions from the apply family, including apply, lapply and sapply. The two central arguments that need to be provided to these functions are the collection of items to apply the operation to, and the operation itself.
9.8.1 sapply
When you just want to apply an operation to a vector of elements, the easiest function to use is sapply. It only needs two arguments: the vector (X), and the function to be applied (FUN). To illustrate this, I’ll create a simple function, func, which just adds 1 to an object.
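A sketch of this, applying func to the numbers 1 to 5 (the input vector is inferred from the output below):

```r
# a simple function that adds 1 to its input
func <- function(x) x + 1

# apply func to each element of the vector; the result is again a vector
result <- sapply(1:5, func)
result
```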
## [1] 2 3 4 5 6
As you can see, the output is a vector of numbers that are 1 greater than the input vector. If we wrap the input vector in a list, so that the list contains the whole vector as its single element, the output is a matrix (but the values are the same).
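A sketch of the list case, assuming the input is list(1:5), i.e. a list whose only element is the vector 1:5:

```r
func <- function(x) x + 1

# func is applied once, to the whole vector; sapply then simplifies
# the single length-5 result into a matrix with 5 rows and 1 column
result_matrix <- sapply(list(1:5), func)
result_matrix
```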
## [,1]
## [1,] 2
## [2,] 3
## [3,] 4
## [4,] 5
## [5,] 6
9.8.2 lapply
If we want the output to be a list, we can use the lapply function. It interprets the input element X as a list and returns a list with as many elements as were provided in that list, with each one containing the output of applying FUN to the respective element.
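A sketch with the same function and input as for sapply:

```r
func <- function(x) x + 1

# each of the five numbers is treated as one list element
result_list <- lapply(1:5, func)
result_list
```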
## [[1]]
## [1] 2
##
## [[2]]
## [1] 3
##
## [[3]]
## [1] 4
##
## [[4]]
## [1] 5
##
## [[5]]
## [1] 6
Note that if we wrap the input vector in a list, lapply treats the whole vector as one input element: FUN is applied to the entire vector, and the result is returned as a single list element. It may be easier to look at an example to understand this.
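A sketch of this case, assuming the input is list(1:5):

```r
func <- function(x) x + 1

# the list has one element (the vector 1:5), so the result has one element too
result_single <- lapply(list(1:5), func)
result_single
```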
## [[1]]
## [1] 2 3 4 5 6
9.8.3 apply
The basic apply function is for applying functions to arrays, where we can operate either on the rows (MARGIN = 1) or on the columns (MARGIN = 2) of the array. We probably won’t use this much, so here are just some simple examples of what this function does. Feel free to look through the help file (or search the web - there are lots of helpful materials out there) to learn more about this.
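A sketch that builds the small matrix shown below (the name mat is an assumption of this example) and sums over its rows (MARGIN = 1) and its columns (MARGIN = 2):

```r
# matrix() fills column by column, so every row reads 1 2 3
mat <- matrix(c(1, 1, 1, 2, 2, 2, 3, 3, 3), nrow = 3)
mat

# MARGIN = 1: apply sum to each row
row_sums <- apply(mat, MARGIN = 1, sum)
row_sums

# MARGIN = 2: apply sum to each column
col_sums <- apply(mat, MARGIN = 2, sum)
col_sums
```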
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 1 2 3
## [3,] 1 2 3
## [1] 6 6 6
## [1] 3 6 9
Exercises on useful R tools
Please document all results of the following assignments in your learning logbook.
- Based on the Winters_hours_gaps dataset, use magrittr pipes and functions of the tidyverse to accomplish the following:
  - Convert the dataset into a tibble
  - Select only the top 10 rows of the dataset
  - Convert the tibble to a long format, with separate rows for Temp_gaps and Temp
  - Use ggplot2 to plot Temp_gaps and Temp as facets (point or line plot)
  - Convert the dataset back to the wide format
  - Select only the following columns: Year, Month, Day and Temp
  - Sort the dataset by the Temp column, in descending order
- For the Winters_hours_gaps dataset, write a for loop to convert all temperatures (Temp column) to degrees Fahrenheit
- Execute the same operation with a function from the apply family
- Now use the tidyverse function mutate to achieve the same outcome
- Voluntary: consider taking a look at the instruction materials on all these functions, which I linked above, as well as at other sources on the internet. There’s a lot more to discover here, with lots of potential for making your coding more elegant and easier - and possibly even more fun!