4  R.4: data transformation

Authors

Laurent Modolo laurent.modolo@ens-lyon.fr

Hélène Polvèche hpolveche@istem.fr

Published

January 1, 2022

4.1 Introduction

The goal of this session is to practice data transformation with tidyverse. The objectives will be to:

  • Filter rows with filter()
  • Arrange rows with arrange()
  • Select columns with select()
  • Add new variables with mutate()

For this session, we are going to work with a new dataset included in the nycflights13 package.

Exercise

Install this package and load it. As usual you will also need the tidyverse library.

Solution

install.packages("nycflights13")
library("tidyverse")
library("nycflights13")

4.1.1 Data set : nycflights13

nycflights13::flights contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights

?flights

You can display the first rows of the dataset to have an overview of the data.

flights
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

You can use the function colnames(dataset) to get all the column names of a table:

colnames(flights)
 [1] "year"           "month"          "day"            "dep_time"      
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"      
[17] "hour"           "minute"         "time_hour"     

4.1.2 Data type

In programming languages, variables can have different types. When you display a tibble you can see the type of a column. Here is a list of common variable types that you will encounter:

  • int stands for integers.
  • dbl stands for doubles or real numbers.
  • chr stands for character vectors or strings.
  • dttm stands for date-times (a date + a time).
  • lgl stands for logical, vectors that contain only TRUE or FALSE.
  • fctr stands for factors, which R uses to represent categorical variables with fixed possible values.
  • date stands for dates.

It’s important for you to know about and understand the different types because certain operations are only allowed between certain types. For instance, you cannot add an int to a chr, but you can add an int to a dbl the results will be a dbl.

4.2 filter rows

Variable types are important to keep in mind for comparisons. The filter() function allows you to subset observations based on their values.

The good reflex to take when you meet a new function of a package is to look at the help with ?function_name to learn how to use it and to know the different arguments.

?filter

4.2.1 Use test to filter on a column

You can use the relational operators (<,>,==,<=,>=,!=) to make a test on a column and keep rows for which the results is TRUE.

filter(flights, air_time >= 680)
filter(flights, carrier == "HA")
filter(flights, origin != "JFK")

The operator %in% is very useful to test if a value is in a list.

filter(flights, carrier %in% c("OO", "AS"))
filter(flights, month %in% c(5, 6, 7, 12))

dplyr functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, <-.

Exercise

Save the flights longer than 680 minutes in a long_flights variable.

Solution

long_flights <- filter(flights, air_time >= 680)

4.2.2 Logical operators to filter on several columns

Multiple arguments to filter() are combined with AND: every expression must be TRUE in order for a row to be included in the output.

filter(flights, month == 12, day == 25)
# A tibble: 719 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    12    25      456            500        -4      649            651
 2  2013    12    25      524            515         9      805            814
 3  2013    12    25      542            540         2      832            850
 4  2013    12    25      546            550        -4     1022           1027
 5  2013    12    25      556            600        -4      730            745
 6  2013    12    25      557            600        -3      743            752
 7  2013    12    25      557            600        -3      818            831
 8  2013    12    25      559            600        -1      855            856
 9  2013    12    25      559            600        -1      849            855
10  2013    12    25      600            600         0      850            846
# ℹ 709 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

In R you can use the symbols & (and), | (or), ! (not) and the function xor() to build other kinds of tests.

Exercise

Display the long_flights variable and predict the results of the following operations.

filter(long_flights, day <= 15 & carrier == "HA")
filter(long_flights, day <= 15 | carrier == "HA")
filter(long_flights, (day <= 15 | carrier == "HA") & (!month > 2))
Solution

long_flights
# A tibble: 5 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     2     6      853            900        -7     1542           1540
2  2013     3    15     1001           1000         1     1551           1530
3  2013     3    16     1001           1000         1     1544           1530
4  2013     3    17     1006           1000         6     1607           1530
5  2013     3    17     1337           1335         2     1937           1836
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
filter(long_flights, day <= 15 & carrier == "HA")
# A tibble: 2 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     2     6      853            900        -7     1542           1540
2  2013     3    15     1001           1000         1     1551           1530
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
filter(long_flights, day <= 15 | carrier == "HA")
# A tibble: 4 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     2     6      853            900        -7     1542           1540
2  2013     3    15     1001           1000         1     1551           1530
3  2013     3    16     1001           1000         1     1544           1530
4  2013     3    17     1006           1000         6     1607           1530
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
filter(long_flights, (day <= 15 | carrier == "HA") & (!month > 2))
# A tibble: 1 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     2     6      853            900        -7     1542           1540
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Exercise

Test the following operations and translate them with words.

filter(flights, month == 11 | month == 12)
filter(flights, month %in% c(11, 12))
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120 & dep_delay <= 120)
filter(flights, arr_delay <= 120, dep_delay <= 120)
Tip

Combining logical operators is a powerful programmatic way to select subset of data. However, keep in mind that long logical expression can be hard to read and understand, so it may be easier to apply successive small filters instead of a long one.

R either prints out the results, or saves them to a variable.

Exercise

What happens when you put your variable assignment code between parenthesis ( ) ?

(dec25 <- filter(flights, month == 12, day == 25))

4.2.3 Missing values

One important feature of R that can make comparison tricky are missing values, or NAs for Not Availables. Indeed, each of the variable type can contain either a value of this type (i.e., 2 for an int) or nothing. The nothing recorded in a variable status is represented with the NA symbol.

As operations with NA values don’t make sense, if you have NA somewhere in your operation, the results will be NA:

NA > 5
[1] NA
10 == NA
[1] NA
NA + 10
[1] NA

However, you can test for NAs with the function is.na():

is.na(NA)
[1] TRUE

filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:

df <- tibble(
  x = c("A", "B", "C"),
  y = c(1, NA, 3)
)
df
# A tibble: 3 × 2
  x         y
  <chr> <dbl>
1 A         1
2 B        NA
3 C         3
filter(df, y > 1)
# A tibble: 1 × 2
  x         y
  <chr> <dbl>
1 C         3
filter(df, is.na(y) | y > 1)
# A tibble: 2 × 2
  x         y
  <chr> <dbl>
1 B        NA
2 C         3

4.2.4 Challenges

Find all flights that:

  • Had an arrival delay (arr_delay) of two or more hours (you can check ?flights)
  • Flew to Houston (IAH or HOU)
Solution

filter(flights, arr_delay >= 120 & dest %in% c("IAH", "HOU"))
# A tibble: 220 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1     1114            900       134     1447           1222
 2  2013     1    10     2137           1630       307       17           1925
 3  2013     1    15     1603           1446        77     1957           1757
 4  2013     1    16     1239           1043       116     1558           1340
 5  2013     1    20     2136           1700       276       27           2011
 6  2013     1    21     1708           1446       142     2032           1757
 7  2013     1    25     1409           1155       134     1710           1459
 8  2013     1    30     2312           2040       152      227           2345
 9  2013     1    31     1837           1635       122     2241           1945
10  2013     1    31     2123           1856       147       49           2204
# ℹ 210 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

How many flights have a missing dep_time ?

Solution

filter(flights, is.na(dep_time))
# A tibble: 8,255 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1       NA           1630        NA       NA           1815
 2  2013     1     1       NA           1935        NA       NA           2240
 3  2013     1     1       NA           1500        NA       NA           1825
 4  2013     1     1       NA            600        NA       NA            901
 5  2013     1     2       NA           1540        NA       NA           1747
 6  2013     1     2       NA           1620        NA       NA           1746
 7  2013     1     2       NA           1355        NA       NA           1459
 8  2013     1     2       NA           1420        NA       NA           1644
 9  2013     1     2       NA           1321        NA       NA           1536
10  2013     1     2       NA           1545        NA       NA           1910
# ℹ 8,245 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

Solution

NA^0 # ^ 0 is always 1 it's an arbitrary rule not a computation
[1] 1
NA | TRUE # if a member of a OR operation is TRUE the results is TRUE
[1] TRUE
FALSE & NA # if a member of a AND operation is FALSE the results is FALSE
[1] FALSE
NA * 0 # here we have a true computation
[1] NA

4.3 Arrange rows with arrange()

arrange() works similarly to filter() except that instead of selecting rows, it changes their order.

arrange(flights, distance, dep_delay)
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     7    27       NA            106        NA       NA            245
 2  2013     1    15     2122           2130        -8     2207           2225
 3  2013     3    30     1942           1950        -8     2026           2044
 4  2013     2     2     1610           1617        -7     1702           1722
 5  2013     2    12     2123           2130        -7     2211           2225
 6  2013     1    21     2123           2129        -6     2216           2224
 7  2013     1     5     1155           1200        -5     1241           1306
 8  2013     1     7     2124           2129        -5     2212           2224
 9  2013     1     6     2125           2129        -4     2224           2224
10  2013     1    12     1613           1617        -4     1708           1722
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

You can use desc() to reorder by a column in descending order:

arrange(flights, distance, desc(dep_delay))
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     7    27       NA            106        NA       NA            245
 2  2013     1     4     1829           1615       134     1937           1721
 3  2013     1    17     2318           2129       109     2358           2224
 4  2013     1    25     2305           2129        96     2357           2224
 5  2013     2    11     2305           2129        96        3           2224
 6  2013     1    30     2244           2129        75     2341           2224
 7  2013     1    13     2243           2129        74     2400           2224
 8  2013     1    24     2241           2129        72     2350           2224
 9  2013     1    18     2231           2129        62     2320           2224
10  2013     2     4     2231           2129        62     2333           2224
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

4.3.1 Missing values

Missing values are always sorted at the end:

df <- tibble(
  x = c("A", "B", "C"),
  y = c(1, NA, 3)
)
df
# A tibble: 3 × 2
  x         y
  <chr> <dbl>
1 A         1
2 B        NA
3 C         3
arrange(df, y)
# A tibble: 3 × 2
  x         y
  <chr> <dbl>
1 A         1
2 C         3
3 B        NA
arrange(df, desc(y))
# A tibble: 3 × 2
  x         y
  <chr> <dbl>
1 C         3
2 A         1
3 B        NA

4.3.2 Challenges

  • Find the most delayed flight at arrival (arr_delay).
  • Find the flight that left earliest (dep_delay).
  • How could you arrange all missing values to the start in the df tibble ?
Solution

Find the most delayed flight at arrival.

arrange(flights, desc(arr_delay))
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     9      641            900      1301     1242           1530
 2  2013     6    15     1432           1935      1137     1607           2120
 3  2013     1    10     1121           1635      1126     1239           1810
 4  2013     9    20     1139           1845      1014     1457           2210
 5  2013     7    22      845           1600      1005     1044           1815
 6  2013     4    10     1100           1900       960     1342           2211
 7  2013     3    17     2321            810       911      135           1020
 8  2013     7    22     2257            759       898      121           1026
 9  2013    12     5      756           1700       896     1058           2020
10  2013     5     3     1133           2055       878     1250           2215
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

Find the flight that left earliest.

arrange(flights, dep_delay)
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    12     7     2040           2123       -43       40           2352
 2  2013     2     3     2022           2055       -33     2240           2338
 3  2013    11    10     1408           1440       -32     1549           1559
 4  2013     1    11     1900           1930       -30     2233           2243
 5  2013     1    29     1703           1730       -27     1947           1957
 6  2013     8     9      729            755       -26     1002            955
 7  2013    10    23     1907           1932       -25     2143           2143
 8  2013     3    30     2030           2055       -25     2213           2250
 9  2013     3     2     1431           1455       -24     1601           1631
10  2013     5     5      934            958       -24     1225           1309
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

How could you arrange all missing values to the start in the df tibble ?

arrange(df, desc(is.na(y)))
# A tibble: 3 × 2
  x         y
  <chr> <dbl>
1 B        NA
2 A         1
3 C         3

4.4 Select columns with select()

select() lets you quickly zoom in on a useful subset using operations based on variable names.

You can select by column names:

select(flights, year, month, day)
# A tibble: 336,776 × 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# ℹ 336,766 more rows

By defining a range of columns:

select(flights, year:day)
# A tibble: 336,776 × 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# ℹ 336,766 more rows

Or, you can use a negative (-) to remove columns:

select(flights, -(year:day))
# A tibble: 336,776 × 16
   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
      <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
 1      517            515         2      830            819        11 UA     
 2      533            529         4      850            830        20 UA     
 3      542            540         2      923            850        33 AA     
 4      544            545        -1     1004           1022       -18 B6     
 5      554            600        -6      812            837       -25 DL     
 6      554            558        -4      740            728        12 UA     
 7      555            600        -5      913            854        19 B6     
 8      557            600        -3      709            723       -14 EV     
 9      557            600        -3      838            846        -8 B6     
10      558            600        -2      753            745         8 AA     
# ℹ 336,766 more rows
# ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

You can also rename column names on the fly:

select(flights, Y = year, M = month, D = day)
# A tibble: 336,776 × 3
       Y     M     D
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# ℹ 336,766 more rows

4.4.1 Helper functions

Here are a number of helper functions you can use within select():

  • starts_with("abc"): matches column names that begin with "abc".
  • ends_with("xyz"): matches column names that end with "xyz".
  • contains("ijk"): matches column names that contain "ijk".
  • num_range("x", 1:3): matches x1, x2 and x3.
  • where(test_function): selects columns for which the result is TRUE.

See ?select for more details.

4.4.2 Challenges

  • Brainstorm as many ways as possible to select only dep_time, dep_delay, arr_time, and arr_delay from flights. You can associate several selections arguments with | , & and !.

The simplest way to start:

df_dep_arr <- select(flights, dep_time, dep_delay, arr_time, arr_delay)
colnames(df_dep_arr)
[1] "dep_time"  "dep_delay" "arr_time"  "arr_delay"
Other solutions

select(flights, dep_time, dep_delay, arr_time, arr_delay)
select(flights, starts_with("dep"), starts_with("arr"))
select(flights, starts_with("dep") | starts_with("arr"))
select(flights, matches("^(dep|arr)"))
select(flights, dep_time:arr_delay & !starts_with("sched"))

  • What does the any_of() function do?
  • Why might it be helpful in conjunction with this vector? What is the difference with all_of() (hint : add “toto” to vars) ?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
Solution

select(flights, any_of(vars))
select(flights, all_of(vars))

From the help message (?all_of()) :

  • all_of() is for strict selection. If any of the variables in the character vector is missing, an error is thrown.
  • any_of() doesn’t check for missing variables. It is particularly useful with negative selections, when you would like to make sure a variable is removed.
vars <- c(vars, "toto")
select(flights, any_of(vars))
select(flights, all_of(vars))

  • Select all columns which contain character values ? numeric values ?
Solution

select(flights, where(is.character))
select(flights, where(is.numeric))

  • Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME"))
Solution

select(flights, contains("TIME", ignore.case = FALSE))

4.5 Add new variables with mutate()

It’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().

We will first create a thinner dataset flights_thin_toy to work on flights_thin that contains:

  • columns from year to day
  • columns that ends with delay
  • the distance and air_time columns
  • the dep_time and sched_dep_time columns

Then we will create an even smaller toy dataset flights_thin_toy2 to test our commands before using them on the larger one (It a good reflex to take). For that you can use the function head or sample_n for a random sampling alternative.

Exercise

Create both flights_thin_toy and flights_thin_toy2, select only 5 row for the latter.

Solution

(flights_thin <- select(flights, year:day, ends_with("delay"), distance, air_time, contains("dep_time")))
# A tibble: 336,776 × 9
    year month   day dep_delay arr_delay distance air_time dep_time
   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>    <int>
 1  2013     1     1         2        11     1400      227      517
 2  2013     1     1         4        20     1416      227      533
 3  2013     1     1         2        33     1089      160      542
 4  2013     1     1        -1       -18     1576      183      544
 5  2013     1     1        -6       -25      762      116      554
 6  2013     1     1        -4        12      719      150      554
 7  2013     1     1        -5        19     1065      158      555
 8  2013     1     1        -3       -14      229       53      557
 9  2013     1     1        -3        -8      944      140      557
10  2013     1     1        -2         8      733      138      558
# ℹ 336,766 more rows
# ℹ 1 more variable: sched_dep_time <int>
(flights_thin_toy <- head(flights_thin, n = 5))
# A tibble: 5 × 9
   year month   day dep_delay arr_delay distance air_time dep_time
  <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>    <int>
1  2013     1     1         2        11     1400      227      517
2  2013     1     1         4        20     1416      227      533
3  2013     1     1         2        33     1089      160      542
4  2013     1     1        -1       -18     1576      183      544
5  2013     1     1        -6       -25      762      116      554
# ℹ 1 more variable: sched_dep_time <int>
(flights_thin_toy2 <- sample_n(flights_thin, size = 5))
# A tibble: 5 × 9
   year month   day dep_delay arr_delay distance air_time dep_time
  <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>    <int>
1  2013    10    20        -3         2      937      145      822
2  2013    10     4        -6       -14     2565      343      824
3  2013    10    26         0       -24     2475      331     1815
4  2013     4    26        14         1      187       34     1304
5  2013     7    29        -6       -22      740      117      824
# ℹ 1 more variable: sched_dep_time <int>

4.5.1 mutate()

mutate(tbl, new_var_a = opperation_a, ..., new_var_n = opperation_n)

mutate() allows you to add new columns (new_var_a, … , new_var_n) and to fill them with the results of an operation.

We can create a gain column, which can be the difference between departure and arrival delays, to check whether the pilot has managed to compensate for his departure delay.

mutate(flights_thin_toy, gain = dep_delay - arr_delay)
# A tibble: 5 × 10
   year month   day dep_delay arr_delay distance air_time dep_time
  <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>    <int>
1  2013     1     1         2        11     1400      227      517
2  2013     1     1         4        20     1416      227      533
3  2013     1     1         2        33     1089      160      542
4  2013     1     1        -1       -18     1576      183      544
5  2013     1     1        -6       -25      762      116      554
# ℹ 2 more variables: sched_dep_time <int>, gain <dbl>
Exercise

Use mutate to add a new column gain and speed that contains the average speed of the plane to the flights_thin_toy tibble (speed = distance / time).

Solution

flights_thin_toy <- mutate(flights_thin_toy,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)
flights_thin_toy
# A tibble: 5 × 11
   year month   day dep_delay arr_delay distance air_time dep_time
  <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>    <int>
1  2013     1     1         2        11     1400      227      517
2  2013     1     1         4        20     1416      227      533
3  2013     1     1         2        33     1089      160      542
4  2013     1     1        -1       -18     1576      183      544
5  2013     1     1        -6       -25      762      116      554
# ℹ 3 more variables: sched_dep_time <int>, gain <dbl>, speed <dbl>

Currently dep_time and sched_dep_time are convenient to look at, but difficult to work with, as they’re not really continuous numbers (see the help to get more information on these columns).

Exercise

In the flight dataset, convert dep_time and sched_dep_time to a more convenient representation of the number of minutes since midnight.

Hints :

  • dep_time and sched_dep_time are in the HHMM format (see the help to get these information). So you have to first get the number of hours HH, convert them in minutes and then add the number of minutes MM.

  • For example: 20:03 will be display 2003, so to convert it in minutes you have to do 20 * 60 + 03 (= 1203).

  • To split the number HHMM in hours (HH) and minutes (MM) you have to use an euclidean division of HHMM by 100 to get the number of hours as the divisor and the number of minute as the remainder. For that, use the modulo operator %% to get the remainder and it’s friend %/% which returns the divisor.

HH <- 2003 %/% 100
HH
[1] 20
MM <- 2003 %% 100
MM
[1] 3
HH * 60 + MM
[1] 1203

It is always a good idea to decompose a problem in small parts. First, only start with dep_time. Build the HH and MM columns. Then, try to write both conversions in one row.

Partial solution

mutate(
  flights_thin_toy,
  HH = dep_time %/% 100,
  MM = dep_time %% 100,
  dep_time2 = HH * 60 + MM
)

Note: You can use the .after option to tell where to put the new columns,

mutate(
  flights_thin_toy,
  HH = dep_time %/% 100,
  MM = dep_time %% 100,
  dep_time2 = HH * 60 + MM,
  .after = "dep_time"
)
# A tibble: 5 × 14
   year month   day dep_delay arr_delay distance air_time dep_time    HH    MM
  <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>    <int> <dbl> <dbl>
1  2013     1     1         2        11     1400      227      517     5    17
2  2013     1     1         4        20     1416      227      533     5    33
3  2013     1     1         2        33     1089      160      542     5    42
4  2013     1     1        -1       -18     1576      183      544     5    44
5  2013     1     1        -6       -25      762      116      554     5    54
# ℹ 4 more variables: dep_time2 <dbl>, sched_dep_time <int>, gain <dbl>,
#   speed <dbl>

or .keep = "used" to keep only the columns used for the calculus which can be usefull for debugging,

mutate(
  flights_thin_toy,
  HH = dep_time %/% 100,
  MM = dep_time %% 100,
  dep_time2 = HH * 60 + MM,
  .keep = "used"
)
# A tibble: 5 × 4
  dep_time    HH    MM dep_time2
     <int> <dbl> <dbl>     <dbl>
1      517     5    17       317
2      533     5    33       333
3      542     5    42       342
4      544     5    44       344
5      554     5    54       354

In one row (or you can also remove columns HH and MM using select):

mutate(
  flights_thin_toy,
  dep_time2 = dep_time %/% 100 * 60 + dep_time %% 100,
  .after = "dep_time"
)

Note: You can also directly replace a column by the result of the mutate operation,

mutate(
  flights_thin_toy,
  dep_time = dep_time * 60 + dep_time
)

Final solution

mutate(
  flights,
  dep_time = (dep_time %/% 100) * 60 + dep_time %% 100,
  sched_dep_time = (sched_dep_time %/% 100) * 60 + sched_dep_time %% 100
)

4.5.2 Useful creation functions

  • Offsets: lead(x) and lag(x) allow you to refer to the previous or next values of the column x.
    This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)).
  • R provides functions for running cumulative sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means.
  • Logical comparisons, <, <=, >, >=, !=, and ==.
  • Ranking: there are a number of ranking functions, the most frequently used being min_rank(). They differ by the way ties are treated, etc. Try ?mutate, ?min_rank, ?rank, for more information.

See you in R.5: Pipping and grouping

4.6 To go further: Data transformation and color sets.

There are a number of color palettes available in R, thanks to different packages such as RColorBrewer, Viridis or Ghibli. We will use them here to decorate our graphs, either on data already studied in the training, mpg, or on more specialized data such as lists of differentially expressed genes ( GSE86356 )

install.packages(c("ghibli", "RColorBrewer", "viridis"))
library(tidyverse)

library(RColorBrewer)
library(ghibli)
library(viridis)
Loading required package: viridisLite

4.6.1 RColorBrewer & Ghibli

Using mpg and the ggplot2 package, reproduce the graph studied in Section 2.3.1. Modify the colors representing the class of cars with the palettes Dark2 of RColorBrewer, then MononokeMedium from Ghibli.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

Exercise

Go to the links to find the appropriate function: they are very similar between the two packages.

Solution

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  scale_color_brewer(palette = "Dark2")

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  scale_colour_ghibli_d("MononokeMedium")

The choice of colors is very important for the comprehension of a graphic. Some palettes are not suitable for everyone. For example, for people with color blindness, color gradients from green to red, or from yellow to blue should be avoided.

To display only Brewer palettes that are colorblind friendly, specify the option colorblindFriendly = TRUE as follows:

display.brewer.all(colorblindFriendly = TRUE)

4.6.2 Viridis

The viridis package provides a series of color maps that are designed to improve graph readability for readers with common forms of color blindness and/or color vision deficiency.

For the next part, we will use a real data set. Anterior tibial muscle tissue was collected from 20 patients, with or without confirmed myotonic dystrophy type 1 (DM1). Illumina RNAseq was performed on these samples and the sequencing data are available on GEO with the identifier GSE86356.

First, we will use the gene count table of these samples, formatted for use in ggplot2 ( pivot_longer() function ).

Exercise

Open the csv file using the read_csv2() function. The file is located at:
https://can.gitbiopages.ens-lyon.fr/R_basis/session_4/Expression_matrice_pivot_longer_DEGs_GSE86356.csv

Solution

Download the file “Expression_matrice_pivot_longer_DEGs_GSE86356.csv” and save it in your working directory. You may have to set you working directory using setwd()

expr_DM1 <- read_csv2("Expression_matrice_pivot_longer_DEGs_GSE86356.csv")
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
Rows: 2808 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (3): Genes, samples, condition
dbl (1): counts

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
expr_DM1
# A tibble: 2,808 × 4
   Genes samples counts condition
   <chr> <chr>    <dbl> <chr>    
 1 MYH3  DM1_41  52488. DM1      
 2 MYH3  DM1_36   3467. DM1      
 3 MYH3  DM1_53    955. DM1      
 4 MYH3  DM1_45   1945. DM1      
 5 MYH3  DM1_17   3464. DM1      
 6 MYH3  DM1_22   2995. DM1      
 7 MYH3  DM1_39   3235. DM1      
 8 MYH3  DM1_34   3001. DM1      
 9 MYH3  DM1_25   2576. DM1      
10 MYH3  DM1_33   2386. DM1      
# ℹ 2,798 more rows

or you can read it from the following url:

(expr_DM1 <- read_csv2("https://can.gitbiopages.ens-lyon.fr/R_basis/session_4/Expression_matrice_pivot_longer_DEGs_GSE86356.csv"))

Exercise

With this tibble, use ggplot2 and the geom_tile() function to make a heatmap. Fit the samples on the x-axis and the genes on the y-axis.

Tip: Transform the counts into log10(x + 1) for a better visualization.

Solution

(DM1_tile_base <-
  ggplot(expr_DM1, aes(samples, Genes, fill = log1p(counts))) +
  geom_tile() +
  labs(y = "Genes", x = "Samples") +
  theme(
    axis.text.y = element_text(size = 6),
    axis.text.x = element_text(size = 6, angle = 90)
  ))

Nota bene: The elements of the axes, and the theme in general, are modified in the theme() function.

With the default color gradient, even with the transformation, the heatmap is difficult to study.

R interprets a large number of colors, indicated in RGB, hexadecimal, or just by name. For example :

Exercise

With scale_fill_gradient2() function, change the colors of the gradient, taking “white” for the minimum value and “springgreen4” for the maximum value.

Solution

DM1_tile_base + scale_fill_gradient2(low = "white", high = "springgreen4")

It’s better, but still not perfect!

Exercise

Use the viridis color gradient for this graph.

Solution

DM1_tile_base + scale_fill_viridis_c()

4.6.3 Volcano Plot

For this last exercise, we will use the results of the differential gene expression analysis between DM1 vs WT conditions.

Exercise

Open the csv file using the read_csv2() function. The file is located at:
http://can.gitbiopages.ens-lyon.fr/R_basis/session_4/EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv

Solution

Download the file “EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv” and save it in your working directory.

tab <- read_csv2("EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv")
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
Rows: 15019 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (2): ENSEMBL_ID, gene_symbol
dbl (6): baseMean, log2FoldChange, lfcSE, stat, pvalue, padj

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tab
# A tibble: 15,019 × 8
   ENSEMBL_ID      gene_symbol baseMean log2FoldChange lfcSE  stat  pvalue  padj
   <chr>           <chr>          <dbl>          <dbl> <dbl> <dbl>   <dbl> <dbl>
 1 ENSG00000000003 TSPAN6          70.1         0.192  0.198  0    1       1    
 2 ENSG00000000005 TNMD            11.0         2.31   1.00   1.91 0.0568  1    
 3 ENSG00000000419 DPM1           267.         -0.0149 0.163  0    1       1    
 4 ENSG00000000457 SCYL3          257.         -0.0119 0.136  0    1       1    
 5 ENSG00000000460 C1orf112        86.0        -0.344  0.169  0    1       1    
 6 ENSG00000000938 FGR             29.7        -1.12   0.280 -2.56 0.0106  0.442
 7 ENSG00000000971 CFH           1231.          1.05   0.250  2.59 0.00965 0.421
 8 ENSG00000001036 FUCA2          103.         -0.191  0.169  0    1       1    
 9 ENSG00000001084 GCLC           275.         -0.268  0.143  0    1       1    
10 ENSG00000001167 NFYA           176.         -0.0748 0.148  0    1       1    
# ℹ 15,009 more rows
tab <- read_csv2("https://can.gitbiopages.ens-lyon.fr/R_basis/session_4/EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv")
tab

To make a Volcano plot, displaying different information on the significance of variation using colors, we will have to make a series of modifications on this table.

With mutate() and ifelse() fonctions, we will have to create:

  • a column sig: it indicates if the gene is significant ( TRUE or FALSE ).
    Thresholds: baseMean > 20 and padj < 0.05 and abs(log2FoldChange) >= 1.5

  • a column UpDown: it indicates if the gene is significantly up-regulated (Up), down-regulated (Down), or not significantly regulated (NO).

Exercise

Create the columns sig and UpDown.

Solution

(
  tab.sig <- mutate(
    tab,
    sig = baseMean > 20 & padj < 0.05 & abs(log2FoldChange) >= 1.5,
    UpDown = ifelse(sig, ### we can use in the same mutate a column created by a previous line
      ifelse(log2FoldChange > 0, "Up", "Down"), "NO"
    )
  )
)
# A tibble: 15,019 × 10
   ENSEMBL_ID      gene_symbol baseMean log2FoldChange lfcSE  stat  pvalue  padj
   <chr>           <chr>          <dbl>          <dbl> <dbl> <dbl>   <dbl> <dbl>
 1 ENSG00000000003 TSPAN6          70.1         0.192  0.198  0    1       1    
 2 ENSG00000000005 TNMD            11.0         2.31   1.00   1.91 0.0568  1    
 3 ENSG00000000419 DPM1           267.         -0.0149 0.163  0    1       1    
 4 ENSG00000000457 SCYL3          257.         -0.0119 0.136  0    1       1    
 5 ENSG00000000460 C1orf112        86.0        -0.344  0.169  0    1       1    
 6 ENSG00000000938 FGR             29.7        -1.12   0.280 -2.56 0.0106  0.442
 7 ENSG00000000971 CFH           1231.          1.05   0.250  2.59 0.00965 0.421
 8 ENSG00000001036 FUCA2          103.         -0.191  0.169  0    1       1    
 9 ENSG00000001084 GCLC           275.         -0.268  0.143  0    1       1    
10 ENSG00000001167 NFYA           176.         -0.0748 0.148  0    1       1    
# ℹ 15,009 more rows
# ℹ 2 more variables: sig <lgl>, UpDown <chr>

We want to see the top10 DEGs on the graph. For this, we will use the package ggrepel.

Exercise

Install and load the ggrepel package.

Solution

install.packages("ggrepel")
library(ggrepel)

Let’s filter out the table into a new variable, top10, to keep only the significant differentially expressed genes, those with the top 10 adjusted pvalue. The smaller the adjusted pvalue, the more significant the gene.

Exercise

Create the new variable top10.

Tip: You can use the function slice_min().

Solution

(top10 <- arrange(tab.sig, desc(sig), padj))
# A tibble: 15,019 × 10
   ENSEMBL_ID  gene_symbol baseMean log2FoldChange lfcSE  stat   pvalue     padj
   <chr>       <chr>          <dbl>          <dbl> <dbl> <dbl>    <dbl>    <dbl>
 1 ENSG000000… ERBB3          182.            4.33 0.510  7.71 1.31e-14 3.04e-10
 2 ENSG000001… NWD1            26.0           3.86 0.506  6.83 8.45e-12 9.84e- 8
 3 ENSG000002… PEG10          172.            2.04 0.269  6.11 9.76e-10 3.36e- 6
 4 ENSG000001… MYH8           832.            3.37 0.491  6.05 1.46e- 9 3.78e- 6
 5 ENSG000000… PKP2           283.            1.92 0.255  5.97 2.34e- 9 5.23e- 6
 6 ENSG000001… SLC14A2         20.9           3.53 0.525  5.96 2.47e- 9 5.23e- 6
 7 ENSG000001… MYH3          2741.            4.37 0.668  5.94 2.85e- 9 5.52e- 6
 8 ENSG000001… WBSCR17         95.3           3.17 0.497  5.58 2.34e- 8 3.41e- 5
 9 ENSG000001… HBA2           747.           -2.66 0.410 -5.52 3.33e- 8 4.31e- 5
10 ENSG000001… ATP1A4          38.6           2.01 0.300  5.37 7.66e- 8 7.76e- 5
# ℹ 15,009 more rows
# ℹ 2 more variables: sig <lgl>, UpDown <chr>
(top10 <- mutate(top10, row_N = row_number()))
# A tibble: 15,019 × 11
   ENSEMBL_ID  gene_symbol baseMean log2FoldChange lfcSE  stat   pvalue     padj
   <chr>       <chr>          <dbl>          <dbl> <dbl> <dbl>    <dbl>    <dbl>
 1 ENSG000000… ERBB3          182.            4.33 0.510  7.71 1.31e-14 3.04e-10
 2 ENSG000001… NWD1            26.0           3.86 0.506  6.83 8.45e-12 9.84e- 8
 3 ENSG000002… PEG10          172.            2.04 0.269  6.11 9.76e-10 3.36e- 6
 4 ENSG000001… MYH8           832.            3.37 0.491  6.05 1.46e- 9 3.78e- 6
 5 ENSG000000… PKP2           283.            1.92 0.255  5.97 2.34e- 9 5.23e- 6
 6 ENSG000001… SLC14A2         20.9           3.53 0.525  5.96 2.47e- 9 5.23e- 6
 7 ENSG000001… MYH3          2741.            4.37 0.668  5.94 2.85e- 9 5.52e- 6
 8 ENSG000001… WBSCR17         95.3           3.17 0.497  5.58 2.34e- 8 3.41e- 5
 9 ENSG000001… HBA2           747.           -2.66 0.410 -5.52 3.33e- 8 4.31e- 5
10 ENSG000001… ATP1A4          38.6           2.01 0.300  5.37 7.66e- 8 7.76e- 5
# ℹ 15,009 more rows
# ℹ 3 more variables: sig <lgl>, UpDown <chr>, row_N <int>
(top10 <- filter(top10, row_N <= 10))
# A tibble: 10 × 11
   ENSEMBL_ID  gene_symbol baseMean log2FoldChange lfcSE  stat   pvalue     padj
   <chr>       <chr>          <dbl>          <dbl> <dbl> <dbl>    <dbl>    <dbl>
 1 ENSG000000… ERBB3          182.            4.33 0.510  7.71 1.31e-14 3.04e-10
 2 ENSG000001… NWD1            26.0           3.86 0.506  6.83 8.45e-12 9.84e- 8
 3 ENSG000002… PEG10          172.            2.04 0.269  6.11 9.76e-10 3.36e- 6
 4 ENSG000001… MYH8           832.            3.37 0.491  6.05 1.46e- 9 3.78e- 6
 5 ENSG000000… PKP2           283.            1.92 0.255  5.97 2.34e- 9 5.23e- 6
 6 ENSG000001… SLC14A2         20.9           3.53 0.525  5.96 2.47e- 9 5.23e- 6
 7 ENSG000001… MYH3          2741.            4.37 0.668  5.94 2.85e- 9 5.52e- 6
 8 ENSG000001… WBSCR17         95.3           3.17 0.497  5.58 2.34e- 8 3.41e- 5
 9 ENSG000001… HBA2           747.           -2.66 0.410 -5.52 3.33e- 8 4.31e- 5
10 ENSG000001… ATP1A4          38.6           2.01 0.300  5.37 7.66e- 8 7.76e- 5
# ℹ 3 more variables: sig <lgl>, UpDown <chr>, row_N <int>
(top10 <- filter(tab.sig, sig == TRUE))
# A tibble: 52 × 10
   ENSEMBL_ID  gene_symbol baseMean log2FoldChange lfcSE  stat   pvalue     padj
   <chr>       <chr>          <dbl>          <dbl> <dbl> <dbl>    <dbl>    <dbl>
 1 ENSG000000… FAM184B         88.3           1.61 0.289  4.19 2.81e- 5 7.37e- 3
 2 ENSG000000… ALX4            48.3           3.14 0.613  4.47 7.74e- 6 3.16e- 3
 3 ENSG000000… PKP2           283.            1.92 0.255  5.97 2.34e- 9 5.23e- 6
 4 ENSG000000… ERBB3          182.            4.33 0.510  7.71 1.31e-14 3.04e-10
 5 ENSG000000… RPS6KA6         77.5           2.26 0.371  5.02 5.04e- 7 3.45e- 4
 6 ENSG000001… TUBB1           37.7          -2.54 0.410 -5.23 1.74e- 7 1.45e- 4
 7 ENSG000001… KLHL4           57.3           2.74 0.555  4.22 2.42e- 5 6.96e- 3
 8 ENSG000001… ITGB8           35.2           2.08 0.370  4.52 6.11e- 6 2.64e- 3
 9 ENSG000001… DNAH11         102.            2.93 0.606  4.17 3.01e- 5 7.62e- 3
10 ENSG000001… CRHR2           30.2           2.36 0.456  4.30 1.74e- 5 5.62e- 3
# ℹ 42 more rows
# ℹ 2 more variables: sig <lgl>, UpDown <chr>
(top10 <- slice_min(top10, padj, n = 10))
# A tibble: 10 × 10
   ENSEMBL_ID  gene_symbol baseMean log2FoldChange lfcSE  stat   pvalue     padj
   <chr>       <chr>          <dbl>          <dbl> <dbl> <dbl>    <dbl>    <dbl>
 1 ENSG000000… ERBB3          182.            4.33 0.510  7.71 1.31e-14 3.04e-10
 2 ENSG000001… NWD1            26.0           3.86 0.506  6.83 8.45e-12 9.84e- 8
 3 ENSG000002… PEG10          172.            2.04 0.269  6.11 9.76e-10 3.36e- 6
 4 ENSG000001… MYH8           832.            3.37 0.491  6.05 1.46e- 9 3.78e- 6
 5 ENSG000000… PKP2           283.            1.92 0.255  5.97 2.34e- 9 5.23e- 6
 6 ENSG000001… SLC14A2         20.9           3.53 0.525  5.96 2.47e- 9 5.23e- 6
 7 ENSG000001… MYH3          2741.            4.37 0.668  5.94 2.85e- 9 5.52e- 6
 8 ENSG000001… WBSCR17         95.3           3.17 0.497  5.58 2.34e- 8 3.41e- 5
 9 ENSG000001… HBA2           747.           -2.66 0.410 -5.52 3.33e- 8 4.31e- 5
10 ENSG000001… ATP1A4          38.6           2.01 0.300  5.37 7.66e- 8 7.76e- 5
# ℹ 2 more variables: sig <lgl>, UpDown <chr>

The data is ready to be used to make a volcano plot!

Exercise

To make the graph below, use ggplot2, the functions geom_point(), geom_hline(), geom_vline(), theme_minimal(), theme() (to remove the legend), geom_label_repel() and the function scale_color_manual() for the colors.

  • Tips 1: Don’t forget the transformation of the adjusted pvalue.
  • Tips 2: Feel free to search your favorite Web browser for help.
  • Tips 3: geom_label_repel() function needs a new parameter ‘data’ and ‘label’ in aes parameters.

Solution

ggplot(tab.sig, aes(x = log2FoldChange, y = -log10(padj), color = UpDown)) +
  geom_point() +
  scale_color_manual(values = c("steelblue", "lightgrey", "firebrick")) +
  geom_hline(yintercept = -log10(0.05), col = "black") +
  geom_vline(xintercept = c(-1.5, 1.5), col = "black") +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(y = "-log10(p-value)", x = "log2(FoldChange)") +
  geom_label_repel(data = top10, mapping = aes(label = gene_symbol))

License: FIXME.
Made with Quarto.