3  R.3: Transformations with ggplot2

Authors

Laurent Modolo laurent.modolo@ens-lyon.fr

Hélène Polvèche hpolveche@istem.fr

Published

January 1, 2022

3.1 Introduction

In the last session, we saw how to use ggplot2 and the Grammar of Graphics. The goal of this session is to practice more advanced features of ggplot2.

The objectives will be to:

  • learn about statistical transformations
  • practice position adjustments
  • change the coordinate systems

The first step is to load the tidyverse.

Solution

library("tidyverse")

As in the previous sessions, it’s good practice to create a new .R file to write your code instead of using the R terminal directly.

3.2 ggplot2 statistical transformations

In the previous session, we plotted the data as they are, using the variable values as x or y coordinates, color shade, size or transparency.
When dealing with categorical variables, also called factors, it can be interesting to perform some simple statistical transformations. For example, we may want to have coordinates on an axis proportional to the number of records for a given category.

We are going to use the diamonds data set included in ggplot2, which is loaded with the tidyverse.

Exercise
  • Use the help and View commands to explore this data set.
  • How many records does this dataset contain?
  • Try the str command. What information is displayed?

str(diamonds)
tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

3.2.1 Introduction to geom_bar

We have seen scatter plots (geom_point()) and smoothed lines (geom_smooth()). We can also use geom_bar() to draw bar plots:

ggplot(data = diamonds, mapping = aes(x = cut)) +
  geom_bar()

More diamonds are available with high-quality cuts.

On the x-axis, the chart displays cut, a variable from diamonds. On the y-axis, it displays count, but count is not a variable in diamonds!
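The count shown on the y-axis can be reproduced by hand. The following is a minimal sketch, assuming the dplyr package (part of the tidyverse) is installed; the name cut_counts is only illustrative.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)    # provides count()

# Count the number of rows (diamonds) in each cut category.
# The result has one row per cut and a new column n,
# which plays the role of the count variable drawn by geom_bar().
cut_counts <- count(diamonds, cut)
cut_counts
```

The column n is not in diamonds either: it is computed from the data, just like the count on the bar plot.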

3.2.2 geom and stat

The algorithm used to calculate new values for a graph is called a stat, short for statistical transformation. The figure below describes how this process works with geom_bar().

You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using stat_count() instead of geom_bar():

ggplot(data = diamonds, mapping = aes(x = cut)) +
  stat_count()

Every geom has a default stat, and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation. There are three main reasons you might need to use a stat explicitly; we discuss them in the next two sections.
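To see what a stat computes, you can inspect the data behind a layer. The sketch below writes the default stat and the default geom explicitly, then uses the ggplot2 function layer_data() to retrieve the computed values; the object names p_geom, p_stat and computed are only illustrative.

```r
library(ggplot2)

# The default stat of geom_bar(), written explicitly
p_geom <- ggplot(data = diamonds, mapping = aes(x = cut)) +
  geom_bar(stat = "count")

# The default geom of stat_count(), written explicitly
p_stat <- ggplot(data = diamonds, mapping = aes(x = cut)) +
  stat_count(geom = "bar")

# layer_data() returns the data frame used to draw the first layer,
# including the variables computed by the stat (here count and prop)
computed <- layer_data(p_geom)
computed[, c("x", "count", "prop")]
```

Both plots should be drawn from the same computed counts, which is why the two calls can be swapped.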

3.2.3 Why stat?

You might want to override the default stat.

For example, in the following demo dataset we already have a variable for the counts per cut.

demo <- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)

(Don’t worry that you haven’t seen tribble() before. You may be able to guess its meaning from the context, and you will learn exactly what it does soon!)

Exercise

Instead of using the default geom_bar() parameter stat = "count", try stat = "identity".

Solution

ggplot(data = demo, mapping = aes(x = cut, y = freq)) +
  geom_bar(stat = "identity")
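As a side note, ggplot2 also provides geom_col(), which draws bars from values already present in the data. The sketch below repeats the demo table so that it runs on its own, and compares the two approaches; p_identity and p_col are illustrative names.

```r
library(ggplot2)
library(tibble)

# Same demo table as above: counts per cut are already computed
demo <- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)

# Bar heights taken directly from freq, with an explicit identity stat
p_identity <- ggplot(data = demo, mapping = aes(x = cut, y = freq)) +
  geom_bar(stat = "identity")

# geom_col() uses the identity stat by default
p_col <- ggplot(data = demo, mapping = aes(x = cut, y = freq)) +
  geom_col()
p_col
```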

You might want to override the default mapping from transformed variables to aesthetics (e.g., proportion).

ggplot(data = diamonds, mapping = aes(x = cut, y = after_stat(prop), group = 1)) +
  geom_bar()

Exercise

In our proportion bar chart, we need to set group = 1. Why?

Solution

ggplot(data = diamonds, mapping = aes(x = cut, y = after_stat(prop))) +
  geom_bar()

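If group is not set, the data are grouped by the x variable, so each proportion is calculated within its own cut category and is always equal to 1 (100%). For instance, the proportion of ideal cuts among the ideal-cut diamonds is 1. Setting group = 1 puts all the records in a single group, so each proportion is calculated with respect to the whole data set.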
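The proportions over the whole data set can also be computed by hand, which makes the role of the denominator explicit. This is a minimal sketch assuming dplyr is installed; cut_counts and cut_props are illustrative names.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

# Number of diamonds in each cut category
cut_counts <- count(diamonds, cut)

# Proportion of each cut with respect to ALL diamonds:
# the denominator is the total number of records, not the size of each group
cut_props <- mutate(cut_counts, prop = n / sum(n))
cut_props
```

Dividing each count by its own group size instead would give 1 for every cut, which is what happens on the plot without group = 1.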

3.2.4 More details with stat_summary

You might want to draw greater attention to the statistical transformation in your code. You might use stat_summary(), which summarizes the y values for each unique x value, to draw attention to the summary that you are computing.

ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
  stat_summary()

Exercise

Set the fun.min, fun.max and fun to the min, max and median function respectively.

Solution

ggplot(data = diamonds, mapping = aes(x = cut, y = depth)) +
  stat_summary(
    fun.min = min,
    fun.max = max,
    fun = median
  )
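The values drawn by this stat_summary() call can be checked with a summary table. The sketch below assumes dplyr is installed; the column names ymin, y and ymax are chosen here to mirror the role of fun.min, fun and fun.max, and depth_summary is an illustrative name.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

# For each cut, compute the minimum, median and maximum of depth
depth_summary <- summarise(
  group_by(diamonds, cut),
  ymin = min(depth),    # lower end of the range (fun.min)
  y    = median(depth), # central point (fun)
  ymax = max(depth)     # upper end of the range (fun.max)
)
depth_summary
```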

3.3 Coloring area plots

You can color a bar chart using either the color aesthetic or, more usefully, fill.

Exercise

Try both approaches on a bar plot of cut.

Solution

ggplot(data = diamonds, mapping = aes(x = cut, color = cut)) +
  geom_bar()

ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) +
  geom_bar()

You can also use fill with another variable.

Exercise

Try to color by clarity. Is clarity a continuous or categorical variable?

Solution

ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar()

3.4 Position adjustments

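When fill is mapped to another variable, the bars of the different groups are stacked on top of each other. This stacking is performed by the position adjustment, which is set with the position parameter.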
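As a minimal sketch, the default behaviour of geom_bar() can be written explicitly with position = "stack", which is the documented default of this geom in ggplot2. The names p_default and p_stack are illustrative.

```r
library(ggplot2)

# Default position adjustment (nothing specified)
p_default <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar()

# The same plot with the position adjustment written explicitly
p_stack <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(position = "stack")
p_stack
```

Writing the default explicitly makes it easier to see what changes when another value is passed to position.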

Exercise

Try the following values of the position parameter for geom_bar(): "fill", "dodge" and "jitter".

Solution

ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(position = "fill")

ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(position = "dodge")

ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(position = "jitter")

jitter is often used for plotting points when they are stacked on top of each other.

Exercise

Compare geom_point() to geom_jitter(): plot cut versus depth and color by clarity.

Solution

ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
  geom_point()

ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
  geom_jitter()

Exercise

What parameters of geom_jitter control the amount of jittering?

Solution

ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
  geom_jitter(width = .1, height = .1)
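Because jittering adds random noise, the plot changes slightly each time it is drawn. A possible way to make it reproducible is to use position_jitter() with its seed argument, as in this sketch. The subset small and the seed value 42 are arbitrary choices made here only to keep the example short and fast.

```r
library(ggplot2)

# A small subset of diamonds, used only to keep the example fast
small <- diamonds[1:500, ]

# Same amount of jittering as above, but with a fixed random seed
p_jitter <- ggplot(data = small, mapping = aes(x = cut, y = depth, color = clarity)) +
  geom_point(position = position_jitter(width = .1, height = .1, seed = 42))

# A second plot built with the same seed gets the same point positions
p_jitter_again <- ggplot(data = small, mapping = aes(x = cut, y = depth, color = clarity)) +
  geom_point(position = position_jitter(width = .1, height = .1, seed = 42))
p_jitter
```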

In the geom_jitter plot that we made, we cannot really see the limits of the different clarity groups. A violin plot is often used to display their density.

Exercise

Use geom_violin instead of geom_jitter.

Solution

ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
  geom_violin()

3.5 Coordinate systems

A Cartesian coordinate system is a coordinate system where the x and y positions act independently to determine the location of each point. There are a number of other coordinate systems that are occasionally helpful.

ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
  geom_boxplot()

Exercise

Add the coord_flip() layer to the previous plot.

Solution

ggplot(data = diamonds, mapping = aes(x = cut, y = depth, color = clarity)) +
  geom_boxplot() +
  coord_flip()
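As an alternative sketch, assuming ggplot2 version 3.3.0 or later, horizontal boxplots can also be obtained by swapping the x and y aesthetics, because geom_boxplot() then detects the orientation from the mapping. The name p_horizontal is illustrative.

```r
library(ggplot2)

# The discrete variable is mapped to y and the continuous one to x,
# so the boxplots are drawn horizontally without coord_flip()
p_horizontal <- ggplot(data = diamonds, mapping = aes(x = depth, y = cut, color = clarity)) +
  geom_boxplot()
p_horizontal
```

coord_flip() remains useful when a geom does not support both orientations.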

Exercise

Add the coord_polar() layer to the following plot.

ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) +
  geom_bar(show.legend = FALSE, width = 1) +
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

Solution

ggplot(data = diamonds, mapping = aes(x = cut, fill = cut)) +
  geom_bar(show.legend = FALSE, width = 1) +
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL) +
  coord_polar()

By combining the right geom, coordinates and faceting functions, you can build a large number of different plots to present your results.

See you in R.4: data transformation

3.6 To go further: animated plots from xlsx files

To read data from an xlsx file, we will use the openxlsx package. To generate animations, we will use the gganimate package. The additional gifski package will allow R to save your animation in the GIF (Graphics Interchange Format) format.

install.packages(c("openxlsx", "gganimate", "gifski"))
library(openxlsx)
library(gganimate)
library(gifski)

Exercise

Use the openxlsx package to load the gapminder.xlsx file into the gapminder variable.

Solution

There are two solutions.

Use the URL directly:

gapminder <- read.xlsx("https://can.gitbiopages.ens-lyon.fr/R_basis/session_3/gapminder.xlsx")

Download the file, save it in the same directory as your script then use the local path:

gapminder <- read.xlsx("gapminder.xlsx")
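To check what read.xlsx() returns without depending on the course file, you can do a round trip on a temporary file. This is only a sketch: the table toy and its values are made up for the illustration and are not part of the gapminder data.

```r
library(openxlsx)

# Hypothetical toy table, NOT the course data
toy <- data.frame(
  country = c("A", "B"),
  year    = c(2000, 2005),
  pop     = c(1e6, 2e6)
)

# Write the table to a temporary xlsx file, then read it back
toy_file <- tempfile(fileext = ".xlsx")
write.xlsx(toy, toy_file)
toy_read <- read.xlsx(toy_file, sheet = 1)  # sheet = 1 reads the first sheet

# read.xlsx() returns a data.frame: inspect its columns and types
str(toy_read)
```

Running str() on the gapminder variable in the same way shows the columns available for plotting.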

This dataset contains 4 variables of interest for us to display per country:

  • gdpPercap the GDP per capita (US$, inflation-adjusted)
  • lifeExp the life expectancy at birth, in years
  • pop the population size
  • continent a factor with 5 levels

Exercise

Using ggplot2, build a scatterplot of the gdpPercap vs lifeExp. Add the pop and continent information to this plot.

Solution

ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
  geom_point()

Exercise

What’s wrong? You can use scale_x_log10() to display gdpPercap on a log10 scale.

Solution

ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
  geom_point() +
  scale_x_log10()

We would like to add the year information to the plot. We could use facet_wrap(), but instead we are going to use the gganimate package.

Exercise

Add a transition_time() layer, which takes year as its argument, to our plot.

Solution

ggplot(gapminder, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
  geom_point() +
  scale_x_log10() +
  transition_time(year) +
  labs(title = "Year: {as.integer(frame_time)}")
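To keep the animation as a GIF file, one possible approach is to render it with animate() and write it with anim_save(), both from gganimate. The sketch below uses a small made-up table (toy) in place of gapminder so that it runs on its own; the values of nframes and fps and the output path are arbitrary choices.

```r
library(ggplot2)
library(gganimate)
library(gifski)

# Hypothetical toy data standing in for the gapminder table
toy <- data.frame(
  year      = rep(c(2000, 2005, 2010), each = 2),
  country   = rep(c("A", "B"), times = 3),
  gdpPercap = c(1000, 20000, 1500, 25000, 2000, 30000),
  lifeExp   = c(50, 75, 55, 77, 60, 79)
)

p <- ggplot(toy, aes(gdpPercap, lifeExp, color = country)) +
  geom_point() +
  scale_x_log10() +
  transition_time(year) +
  labs(title = "Year: {as.integer(frame_time)}")

# Render the animation with the gifski renderer (few frames to stay fast)
anim <- animate(p, nframes = 20, fps = 5, renderer = gifski_renderer())

# Save the rendered animation to a GIF file (here a temporary file)
gif_file <- tempfile(fileext = ".gif")
anim_save(gif_file, animation = anim)
```

With the real data, replace toy with gapminder and gif_file with a path of your choice, such as a file name in your working directory.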
