R.2: introduction to Tidyverse

Laurent Modolo laurent.modolo@ens-lyon.fr; Hélène Polvèche hpolveche@istem.fr

2022

https://can.gitbiopages.ens-lyon.fr/R_basis/

1 Introduction

In the last session, we went through the basics of R. Instead of continuing to learn more about R programming, in this session we are going to jump directly to rendering plots.

We make this choice for three reasons:

  • Rendering nice plots is directly rewarding
  • You will be able to apply what you learn in this session to your own data (provided that it is correctly formatted)
  • We will come back to R programming later, when you have all the necessary tools to visualize your results.

The objectives of this session will be to:

  • Create basic plots with the ggplot2 library
  • Understand the tibble type
  • Learn the different aesthetics in R plots
  • Compose complex graphics

1.1 Tidyverse

The tidyverse package is a collection of R packages designed for data science, which includes ggplot2.

All packages share an underlying design philosophy, grammar, and data structures (plus the same shape of logo).

tidyverse is a meta-package, which can take a long time to install with the following command:

install.packages("tidyverse")

Luckily for you, tidyverse is preinstalled on your RStudio server, so you just have to load the library:

library("tidyverse")

1.2 Toy data set mpg

The mpg dataset, shipped with ggplot2, contains a subset of the fuel economy data that the EPA makes available on fueleconomy.gov. It contains only models which had a new release every year between 1999 and 2008.

You can use the ? command to learn more about this dataset.

?mpg

But instead of using a dataset included in an R package, you may want to be able to use any dataset with the same format. For that we are going to use the command read_csv, which is able to read a csv file.

This command also works with file URLs:

new_mpg <- read_csv(
  "https://can.gitbiopages.ens-lyon.fr/R_basis/session_2/mpg.csv"
)

You can check the number of rows and columns of the data with dim:

dim(new_mpg)
[1] 45006    12
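
The same workflow applies to a local file. As a minimal sketch, the code below writes a small made-up csv file to a temporary location and reads it back with read_csv. The file content and the variable names csv_path and small_cars are illustrative assumptions, not part of the course data.

```r
library(readr)

# Hypothetical toy file: a header line followed by three rows
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("model,displ,hwy", "a4,1.8,29", "civic,1.6,33", "mustang,4.6,21"),
  csv_path
)

# read_csv guesses the column types; show_col_types = FALSE silences the summary
small_cars <- read_csv(csv_path, show_col_types = FALSE)

# dim returns the number of rows, then the number of columns
dim(small_cars)
```

Here dim reports 3 rows and 3 columns, one row per data line of the file.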

To visualize the data in RStudio, you can use the View command:

View(new_mpg)

You can also simply call the variable. As with simple data types, calling a variable prints it, but complex data types like new_mpg can use more elaborate print functions:

new_mpg
# A tibble: 45,006 × 12
      id make  model        year class trans drive   cyl displ fuel    hwy   cty
   <dbl> <chr> <chr>       <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
 1 13309 Acura 2.2CL/3.0CL  1997 Comp… Auto… Fron…     4   2.2 Regu…    26    20
 2 13310 Acura 2.2CL/3.0CL  1997 Comp… Manu… Fron…     4   2.2 Regu…    28    22
 3 13311 Acura 2.2CL/3.0CL  1997 Comp… Auto… Fron…     6   3   Regu…    26    18
 4 14038 Acura 2.3CL/3.0CL  1998 Comp… Auto… Fron…     4   2.3 Regu…    27    19
 5 14039 Acura 2.3CL/3.0CL  1998 Comp… Manu… Fron…     4   2.3 Regu…    29    21
 6 14040 Acura 2.3CL/3.0CL  1998 Comp… Auto… Fron…     6   3   Regu…    26    17
 7 14834 Acura 2.3CL/3.0CL  1999 Comp… Auto… Fron…     4   2.3 Regu…    27    20
 8 14835 Acura 2.3CL/3.0CL  1999 Comp… Manu… Fron…     4   2.3 Regu…    29    21
 9 14836 Acura 2.3CL/3.0CL  1999 Comp… Auto… Fron…     6   3   Regu…    26    17
10 11789 Acura 2.5TL        1995 Comp… Auto… Fron…     5   2.5 Regu…    23    18
# … with 44,996 more rows

Here we can see that new_mpg is a tibble. We will come back to tibbles later.
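
If you want to check the type of a table yourself, here is a small sketch. It assumes the mpg table shipped with ggplot2 rather than new_mpg, so that it runs without a network connection.

```r
library(ggplot2)  # provides the built-in mpg table
library(tibble)

# A tibble is a data.frame with extra classes on top
class(mpg)

# is_tibble() answers the question directly
is_tibble(mpg)

# glimpse() prints one line per column, with its type and first values
glimpse(mpg)
```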

1.3 New script

Like in the last session, instead of typing your commands directly in the console, you are going to write them in an R script.

2 First plot with ggplot2

We are going to make the simplest plot possible to study the relationship between two variables: the scatterplot.

The following command generates a plot of fuel efficiency hwy against engine size displ, two columns present in the new_mpg tibble.

ggplot(data = new_mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

Are cars with bigger engines less fuel efficient?

ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

  • you begin a plot with the function ggplot()
  • you complete your graph by adding one or more layers
  • geom_point() adds a layer with a scatterplot
  • each geom function in ggplot2 takes a mapping argument
  • the mapping argument is always paired with aes()
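
To make the template concrete, here is one possible way to fill it in. This sketch assumes the mpg table shipped with ggplot2 (not new_mpg) and the name p_template is only illustrative.

```r
library(ggplot2)

# <DATA>          -> mpg
# <GEOM_FUNCTION> -> geom_point
# <MAPPINGS>      -> x = cty, y = hwy
p_template <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy))

# Calling the variable renders the plot
p_template
```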

What happens when you use only the command ggplot(data = new_mpg)?

Solution

ggplot(data = new_mpg) 

Make a scatterplot of hwy (fuel efficiency) vs. cyl (number of cylinders).

Solution

ggplot(data = new_mpg, mapping = aes(x = hwy, y = cyl)) + 
  geom_point()

What seems to be the problem?

Solution

Points with the same coordinates are drawn on top of each other, so many cars are hidden behind a single dot.
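
You can measure how many points are hidden by comparing the number of rows with the number of distinct (hwy, cyl) positions. The sketch below assumes the mpg table shipped with ggplot2. It also shows one possible remedy that is not covered in this session: geom_jitter, which adds a small random shift to each point. The width and height values are arbitrary choices.

```r
library(ggplot2)
library(dplyr)

# Number of cars versus number of distinct positions on the plot
n_points <- nrow(mpg)
n_positions <- nrow(distinct(mpg, hwy, cyl))
c(points = n_points, distinct_positions = n_positions)

# Same scatterplot, but each point is moved by a small random amount
p_jitter <- ggplot(data = mpg, mapping = aes(x = hwy, y = cyl)) +
  geom_jitter(width = 0.3, height = 0.3)
p_jitter
```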

3 Aesthetic mappings

When you map a variable to an aesthetic, ggplot2 automatically assigns a unique level of the aesthetic (for example a unique color) to each unique value of the variable, a process known as scaling. ggplot2 also adds a legend that explains which levels correspond to which values.

Try the following aesthetics:

  • size
  • alpha
  • shape

3.1 color mapping

ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) + 
  geom_point()

3.2 size mapping

ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, size = class)) + 
  geom_point()

3.3 alpha mapping

ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, alpha = class)) + 
  geom_point()

3.4 shape mapping

ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, shape = class)) + 
  geom_point()

You can also set the aesthetic properties of your geom manually. For example, we can draw all of the points in our plot as blue squares:

ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(color = "blue", shape = 0)

[Figure: the different point shapes available in R]

What’s gone wrong with this code? Why are the points not blue?

ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = "blue")) + 
  geom_point()

Solution

ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(color = "blue")
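
Why does this work? Inside aes(), "blue" is treated as data: a constant variable with a single value, which ggplot2 then scales to the first color of its default palette. Outside aes(), "blue" is used directly as the color. The sketch below, which assumes the mpg table shipped with ggplot2, uses layer_data() to look at the colors actually drawn by each version.

```r
library(ggplot2)

# Inside aes(): "blue" is a constant variable, scaled like any other variable
p_mapped <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = "blue")) +
  geom_point()

# Outside aes(): "blue" is the color of the points
p_fixed <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

# Colors computed by ggplot2 for the first layer of each plot
unique(layer_data(p_mapped)$colour)
unique(layer_data(p_fixed)$colour)
```

Only the second plot reports "blue". The first one reports a color code picked by the color scale.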

3.5 Mapping a continuous variable to a color

You can also map a continuous variable to a color:

ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = cyl)) + 
  geom_point()

What happens if you map an aesthetic to something other than a variable name, like color = displ < 5?

Solution

ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = displ < 5)) + 
  geom_point()
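
The expression displ < 5 is evaluated for every row and gives TRUE or FALSE, so the points get one of two colors. As a sketch of what happens, you can build the same variable yourself before plotting. This assumes the mpg table shipped with ggplot2 and uses mutate(), a dplyr function not yet covered in this course. The column name small_engine is an arbitrary choice.

```r
library(ggplot2)
library(dplyr)

# Add a logical column: TRUE when the engine is smaller than 5 litres
mpg_flagged <- mutate(mpg, small_engine = displ < 5)

# Mapping the new column gives the same two-color plot
p_logical <- ggplot(
  data = mpg_flagged,
  mapping = aes(x = displ, y = hwy, color = small_engine)
) +
  geom_point()
p_logical
```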

4 Facets

You can create multiple plots at once by faceting. For this you can use the command facet_wrap. This command takes a formula as input. We will come back to formulas in R later, for now, you have to know that formulas start with a ~ symbol.

To make a scatterplot of displ versus hwy per car class you can use the following code:

ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  facet_wrap(~class, nrow = 2)

Now try to facet your plot by fuel + class.

Solution

Formulas allow you to express complex relationships between variables in R!

ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ fuel + class, nrow = 2)
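
When you facet by two variables, facet_wrap draws one panel for each combination of values that is present in the data. The sketch below assumes the mpg table shipped with ggplot2, where the corresponding columns are named drv and cyl, and counts these combinations before plotting.

```r
library(ggplot2)
library(dplyr)

# Combinations of drv and cyl that actually occur in the data
combos <- distinct(mpg, drv, cyl)
nrow(combos)

# One panel per combination found above
p_facets <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ drv + cyl)
p_facets
```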

5 Composition

There are different ways to represent the information:

ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point()

ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth()

We can add as many layers as we want:

ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() +
  geom_smooth()


We can make a mapping layer-specific:

ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) +
  geom_smooth()

We can use different data (here the new_mpg and mpg tables) for different layers (you will learn more about filter() later):

ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) +
  geom_smooth(data = filter(mpg, class == "subcompact"))

6 Challenge!

6.1 First challenge

Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drive)) + 
  geom_point(show.legend = FALSE) + 
  geom_smooth(se = FALSE)

  • What does show.legend = FALSE do?
  • What does the se argument to geom_smooth() do?

Solution

ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = drive)) + 
  geom_point(show.legend = FALSE) + 
  geom_smooth(se = FALSE)
`geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
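
To answer the two questions: show.legend = FALSE keeps a layer out of the legend (the legend is still drawn here because the geom_smooth layer contributes to it), and se controls whether the confidence band around the smoothed line is displayed. For comparison, here is a sketch with both arguments left at their default values. It assumes the mpg table shipped with ggplot2, where the drive column is named drv.

```r
library(ggplot2)

p_defaults <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
  geom_point() +           # default: the points appear in the legend
  geom_smooth(se = TRUE)   # default: the confidence band is drawn
p_defaults
```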

6.2 Second challenge

How does being a Two Seaters car (class column) impact the relationship between engine size (displ column) and fuel efficiency (hwy column)?

  1. Make a plot of hwy as a function of displ
  2. Color the points of the Two Seaters class in another color
  3. Split this plot for each class

Solution 1

ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point()

Solution 2

ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() +
  geom_point(data = filter(new_mpg, class == "Two Seaters"), color = "red")

Solution 3

ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() +
  geom_point(data = filter(new_mpg, class == "Two Seaters"), color = "red") +
  facet_wrap(~class)

Write a function called plot_color_a_class that takes a class as argument and plots the same graph with this class highlighted.

Solution

plot_color_a_class <- function(my_class) {
  ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy)) + 
    geom_point() +
    geom_point(data = filter(new_mpg, class == my_class), color = "red") +
    facet_wrap(~class)
}
plot_color_a_class("Two Seaters")

plot_color_a_class("Compact Cars")
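
The function above always reads the new_mpg table from your session. As a possible variation, the sketch below also takes the table as an argument, so that the same function can be reused on other data with the same columns. The name plot_color_a_class_in is hypothetical, and the example assumes the mpg table shipped with ggplot2, where the class is spelled "2seater".

```r
library(ggplot2)
library(dplyr)

# Hypothetical variant: the table is passed as the first argument
plot_color_a_class_in <- function(data, my_class) {
  ggplot(data = data, mapping = aes(x = displ, y = hwy)) +
    geom_point() +
    # second layer: only the rows of the selected class, drawn in red
    geom_point(data = filter(data, class == my_class), color = "red") +
    facet_wrap(~class)
}

p_two_seaters <- plot_color_a_class_in(mpg, "2seater")
p_two_seaters
```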

6.3 Third challenge

Recreate the R code necessary to generate the following graph (see “linetype” option of “geom_smooth”)

Solution

ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = fuel)) +
  geom_point() +
  geom_smooth(linetype = "dashed", color = "black") +
  facet_wrap(~fuel)

7 To go further: publication ready plots

Once you have created the graph you need for your publication, you have to save it. You can do it with the ggsave function.

First save your plot in a variable:

p1 <- ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) + 
  geom_point()

Then save it in the wanted format:

ggsave("test_plot_1.png", p1, width = 12, height = 8, units = "cm")
ggsave("test_plot_1.pdf", p1, width = 12, height = 8, units = "cm")
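
If you want to check that the file was really written, here is a small sketch. It assumes the mpg table shipped with ggplot2 and saves the plot in the temporary directory of the R session. The file name test_plot_sketch.pdf is an arbitrary choice.

```r
library(ggplot2)

p_save <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

# Build a path in the temporary directory of the session
out_file <- file.path(tempdir(), "test_plot_sketch.pdf")

# The extension of the file name selects the output format
ggsave(out_file, p_save, width = 12, height = 8, units = "cm")

# TRUE if the file now exists
file.exists(out_file)
```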

You may also change the appearance of your plot by adding a theme layer to your plot:

p1 + theme_bw()

p1 + theme_minimal()
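
A publication-ready plot usually also needs readable axis and legend titles. One possible way to set them is the labs() function of ggplot2, which is not otherwise covered in this session. The sketch below assumes the mpg table shipped with ggplot2, and the label texts are our own wording.

```r
library(ggplot2)

p_pub <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  # labs() replaces the column names used as default titles
  labs(
    x = "Engine displacement (L)",
    y = "Highway fuel economy (miles per gallon)",
    color = "Car class"
  ) +
  theme_bw()
p_pub
```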

You may have to combine several plots. For that you can use the cowplot package, which is a ggplot2 extension. First install it:

install.packages("cowplot")

Then you can use the function plot_grid to combine plots in a publication-ready style:

library(cowplot)
p1 <- ggplot(data = new_mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))
p1

p2 <- ggplot(data = new_mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point()
p2

plot_grid(p1, p2, labels = c('A', 'B'), label_size = 12)

You can also save it in a file.

p_final <- plot_grid(p1, p2, labels = c('A', 'B'), label_size = 12)
ggsave("test_plot_1_and_2.png", p_final, width = 20, height = 8, units = "cm")

You can learn more about the features of cowplot at https://wilkelab.org/cowplot/articles/introduction.html.

Use the cowplot documentation to reproduce this plot and save it.

Solution

p1 <- ggplot(data = new_mpg, mapping = aes(x = displ, y = hwy, color = class)) + 
  geom_point() + theme_bw()

p2 <- ggplot(data = new_mpg, mapping = aes(x = cty, y = hwy, color = class)) + 
  geom_point() + theme_bw()

p_row <- plot_grid(
  p1 + theme(legend.position = "none"),
  p2 + theme(legend.position = "none"),
  labels = c('A', 'B'),
  label_size = 12
)
p_legend <- get_legend(p1 + theme(legend.position = "top"))

p_final <- plot_grid(p_row, p_legend, nrow = 2, rel_heights = c(1, 0.2))
p_final
ggsave("plot_1_2_and_legend.png", p_final, width = 20, height = 8, units = "cm")

There are a lot of other available ggplot2 extensions which can be useful (and also beautiful). You can take a look at them here: https://exts.ggplot2.tidyverse.org/gallery/