R.4: data transformation

Laurent Modolo laurent.modolo@ens-lyon.fr, Hélène Polvèche hpolveche@istem.fr

2022

https://can.gitbiopages.ens-lyon.fr/R_basis/

1 Introduction

The goal of this practical is to practice data transformation with tidyverse. The objectives of this session will be to:

  • Filter rows with filter()
  • Arrange rows with arrange()
  • Select columns with select()
  • Add new variables with mutate()

For this session we are going to work with a new dataset included in the nycflights13 package. Install this package and load it. As usual you will also need the tidyverse library.

Solution

install.packages("nycflights13")
library("tidyverse")
library("nycflights13")

1.1 Data set : nycflights13

nycflights13::flights Contains all 336,776 flights that departed from New York City in 2013. The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights

?flights

You can display the first rows of the dataset to have an overview of the data.

flights
# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      517        515       2     830     819      11 UA     
 2  2013     1     1      533        529       4     850     830      20 UA     
 3  2013     1     1      542        540       2     923     850      33 AA     
 4  2013     1     1      544        545      -1    1004    1022     -18 B6     
 5  2013     1     1      554        600      -6     812     837     -25 DL     
 6  2013     1     1      554        558      -4     740     728      12 UA     
 7  2013     1     1      555        600      -5     913     854      19 B6     
 8  2013     1     1      557        600      -3     709     723     -14 EV     
 9  2013     1     1      557        600      -3     838     846      -8 B6     
10  2013     1     1      558        600      -2     753     745       8 AA     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

To know all the colnames of a table you can use the function colnames(dataset)

colnames(flights)
 [1] "year"           "month"          "day"            "dep_time"      
 [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
 [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
[13] "origin"         "dest"           "air_time"       "distance"      
[17] "hour"           "minute"         "time_hour"     

1.2 Data type

In programming languages, all variables are not equal. When you display a tibble you can see the type of a column. Here is a list of common variable types that you will encounter

  • int stands for integers.
  • dbl stands for doubles or real numbers.
  • chr stands for character vectors or strings.
  • dttm stands for date-times (a date + a time).
  • lgl stands for logical, vectors that contain only TRUE or FALSE.
  • fctr stands for factors, which R uses to represent categorical variables with fixed possible values.
  • date stands for dates.

You cannot add an int to a chr, but you can add an int to a dbl the results will be a dbl.

2 filter rows

Variable types are important to keep in mind for comparisons. The filter() function allows you to subset observations based on their values.

The good reflex to take when you meet a new function of a package is to look at the help with ?function_name to learn how to use it and to know the different arguments.

?filter

2.1 Use test to filter on a column

You can use the relational operators (<,>,==,<=,>=,!=) to make a test on a column and keep rows for which the results is TRUE.

filter(flights, air_time >= 680)
filter(flights, carrier ==  "HA")
filter(flights, origin !=  "JFK")

The operator %in% is very usefull to test if a value is in a list.

filter(flights, carrier %in% c("OO","AS"))
filter(flights, month %in% c(5,6,7,12))

dplyr functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, <-

Save the flights longer than 680 minutes in a long_flights variable

Solution

long_flights <- filter(flights, air_time >= 680)

2.2 Logical operators to filter on several columns

Multiple arguments to filter() are combined with AND: every expression must be TRUE in order for a row to be included in the output.

filter(flights, month == 12, day == 25)
# A tibble: 719 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013    12    25      456        500      -4     649     651      -2 US     
 2  2013    12    25      524        515       9     805     814      -9 UA     
 3  2013    12    25      542        540       2     832     850     -18 AA     
 4  2013    12    25      546        550      -4    1022    1027      -5 B6     
 5  2013    12    25      556        600      -4     730     745     -15 AA     
 6  2013    12    25      557        600      -3     743     752      -9 DL     
 7  2013    12    25      557        600      -3     818     831     -13 DL     
 8  2013    12    25      559        600      -1     855     856      -1 B6     
 9  2013    12    25      559        600      -1     849     855      -6 B6     
10  2013    12    25      600        600       0     850     846       4 B6     
# … with 709 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

In R you can use the symbols & (and), | (or), ! (not) and the function xor() to build other kinds of tests.

Display the long_flights variable and predict the results of

filter(long_flights,  day <= 15 & carrier == "HA")
filter(long_flights,  day <= 15 | carrier == "HA")
filter(long_flights,  (day <= 15 | carrier == "HA") & (! month > 2))
Solution

long_flights
# A tibble: 5 × 19
   year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
  <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
1  2013     2     6      853         900      -7    1542    1540       2 HA     
2  2013     3    15     1001        1000       1    1551    1530      21 HA     
3  2013     3    16     1001        1000       1    1544    1530      14 HA     
4  2013     3    17     1006        1000       6    1607    1530      37 HA     
5  2013     3    17     1337        1335       2    1937    1836      61 UA     
# … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, and abbreviated variable names ¹​sched_dep_time,
#   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
filter(long_flights,  day <= 15 & carrier == "HA")
# A tibble: 2 × 19
   year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
  <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
1  2013     2     6      853         900      -7    1542    1540       2 HA     
2  2013     3    15     1001        1000       1    1551    1530      21 HA     
# … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, and abbreviated variable names ¹​sched_dep_time,
#   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
filter(long_flights,  day <= 15 | carrier == "HA")
# A tibble: 4 × 19
   year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
  <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
1  2013     2     6      853         900      -7    1542    1540       2 HA     
2  2013     3    15     1001        1000       1    1551    1530      21 HA     
3  2013     3    16     1001        1000       1    1544    1530      14 HA     
4  2013     3    17     1006        1000       6    1607    1530      37 HA     
# … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, and abbreviated variable names ¹​sched_dep_time,
#   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
filter(long_flights,  (day <= 15 | carrier == "HA") & (! month > 2))
# A tibble: 1 × 19
   year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
  <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
1  2013     2     6      853         900      -7    1542    1540       2 HA     
# … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, and abbreviated variable names ¹​sched_dep_time,
#   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Test the following operations and translate them with words

filter(flights, month == 11 | month == 12)
filter(flights, month %in% c(11, 12))
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120 & dep_delay <= 120)
filter(flights, arr_delay <= 120, dep_delay <= 120)

Combinations of logical operators is a powerful programmatic way to select subset of data. Keep in mind, however, that long logical expression can be hard to read and understand, so it may be easier to apply successive small filters instead of one long one.

R either prints out the results, or saves them to a variable. What happens when you put your variable assignment code between parenthesis ( ) ?

(dec25 <- filter(flights, month == 12, day == 25))

2.3 Missing values

One important feature of R that can make comparison tricky is missing values, or NAs for Not Availables. Indeed each of the variable type can contain either a value of this type (i.e., 2 for an int) or nothing. The nothing recorded in a variable status is represented with the NA symbol.

As operations with NA values don’t make sense, if you have NA somewhere in your operation, the results will be NA

NA > 5
[1] NA
10 == NA
[1] NA
NA + 10
[1] NA

However, you can test for NAs with the function is.na():

is.na(NA)
[1] TRUE

filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want to preserve missing values, ask for them explicitly:

df <- tibble( x = c("A","B","C"),
              y = c(1, NA, 3)
            )
df
# A tibble: 3 × 2
  x         y
  <chr> <dbl>
1 A         1
2 B        NA
3 C         3
filter(df, y > 1)
# A tibble: 1 × 2
  x         y
  <chr> <dbl>
1 C         3
filter(df, is.na(y) | y > 1)
# A tibble: 2 × 2
  x         y
  <chr> <dbl>
1 B        NA
2 C         3

2.4 Challenges

Find all flights that: - Had an arrival delay (arr_delay) of two or more hours (you can check ?flights) - Flew to Houston (IAH or HOU)

Solution

filter(flights, arr_delay >= 120 & dest %in% c("IAH", "HOU"))
# A tibble: 220 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1     1114        900     134    1447    1222     145 UA     
 2  2013     1    10     2137       1630     307      17    1925     292 UA     
 3  2013     1    15     1603       1446      77    1957    1757     120 UA     
 4  2013     1    16     1239       1043     116    1558    1340     138 UA     
 5  2013     1    20     2136       1700     276      27    2011     256 UA     
 6  2013     1    21     1708       1446     142    2032    1757     155 UA     
 7  2013     1    25     1409       1155     134    1710    1459     131 B6     
 8  2013     1    30     2312       2040     152     227    2345     162 B6     
 9  2013     1    31     1837       1635     122    2241    1945     176 WN     
10  2013     1    31     2123       1856     147      49    2204     165 UA     
# … with 210 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

How many flights have a missing dep_time ?

Solution

filter(flights, is.na(dep_time))
# A tibble: 8,255 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1       NA       1630      NA      NA    1815      NA EV     
 2  2013     1     1       NA       1935      NA      NA    2240      NA AA     
 3  2013     1     1       NA       1500      NA      NA    1825      NA AA     
 4  2013     1     1       NA        600      NA      NA     901      NA B6     
 5  2013     1     2       NA       1540      NA      NA    1747      NA EV     
 6  2013     1     2       NA       1620      NA      NA    1746      NA EV     
 7  2013     1     2       NA       1355      NA      NA    1459      NA EV     
 8  2013     1     2       NA       1420      NA      NA    1644      NA EV     
 9  2013     1     2       NA       1321      NA      NA    1536      NA EV     
10  2013     1     2       NA       1545      NA      NA    1910      NA AA     
# … with 8,245 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

Solution

NA ^ 0 # ^ 0 is always 1 it's an arbitrary rule not a computation
[1] 1
NA | TRUE # if a member of a OR operation is TRUE the results is TRUE
[1] TRUE
FALSE & NA # if a member of a AND operation is FALSE the results is FALSE
[1] FALSE
NA * 0 # here we have a true computation
[1] NA

3 Arrange rows with arrange()

arrange() works similarly to filter() except that instead of selecting rows, it changes their order.

arrange(flights, distance, dep_delay)
# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     7    27       NA        106      NA      NA     245      NA US     
 2  2013     1    15     2122       2130      -8    2207    2225     -18 EV     
 3  2013     3    30     1942       1950      -8    2026    2044     -18 EV     
 4  2013     2     2     1610       1617      -7    1702    1722     -20 EV     
 5  2013     2    12     2123       2130      -7    2211    2225     -14 EV     
 6  2013     1    21     2123       2129      -6    2216    2224      -8 EV     
 7  2013     1     5     1155       1200      -5    1241    1306     -25 EV     
 8  2013     1     7     2124       2129      -5    2212    2224     -12 EV     
 9  2013     1     6     2125       2129      -4    2224    2224       0 EV     
10  2013     1    12     1613       1617      -4    1708    1722     -14 EV     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

You can use desc() to reorder by a column in descending order:

arrange(flights, distance, desc(dep_delay))
# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     7    27       NA        106      NA      NA     245      NA US     
 2  2013     1     4     1829       1615     134    1937    1721     136 EV     
 3  2013     1    17     2318       2129     109    2358    2224      94 EV     
 4  2013     1    25     2305       2129      96    2357    2224      93 EV     
 5  2013     2    11     2305       2129      96       3    2224      99 EV     
 6  2013     1    30     2244       2129      75    2341    2224      77 EV     
 7  2013     1    13     2243       2129      74    2400    2224      96 EV     
 8  2013     1    24     2241       2129      72    2350    2224      86 EV     
 9  2013     1    18     2231       2129      62    2320    2224      56 EV     
10  2013     2     4     2231       2129      62    2333    2224      69 EV     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

3.1 Missing values

Missing values are always sorted at the end:

df <- tibble( x = c("A","B","C"),
              y = c(1, NA, 3)
            )
df
# A tibble: 3 × 2
  x         y
  <chr> <dbl>
1 A         1
2 B        NA
3 C         3
arrange(df, y)
# A tibble: 3 × 2
  x         y
  <chr> <dbl>
1 A         1
2 C         3
3 B        NA
arrange(df, desc(y))
# A tibble: 3 × 2
  x         y
  <chr> <dbl>
1 C         3
2 A         1
3 B        NA

3.2 Challenges

  • Find the most delayed flight at arrival (arr_delay).
  • Find the flight that left earliest (dep_delay).
  • How could you arrange all missing values to the start in the df tibble ?
Solution

Find the most delayed flight at arrival

arrange(flights, desc(arr_delay))
# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     9      641        900    1301    1242    1530    1272 HA     
 2  2013     6    15     1432       1935    1137    1607    2120    1127 MQ     
 3  2013     1    10     1121       1635    1126    1239    1810    1109 MQ     
 4  2013     9    20     1139       1845    1014    1457    2210    1007 AA     
 5  2013     7    22      845       1600    1005    1044    1815     989 MQ     
 6  2013     4    10     1100       1900     960    1342    2211     931 DL     
 7  2013     3    17     2321        810     911     135    1020     915 DL     
 8  2013     7    22     2257        759     898     121    1026     895 DL     
 9  2013    12     5      756       1700     896    1058    2020     878 AA     
10  2013     5     3     1133       2055     878    1250    2215     875 MQ     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Find the flight that left earliest.

arrange(flights, dep_delay)
# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013    12     7     2040       2123     -43      40    2352      48 B6     
 2  2013     2     3     2022       2055     -33    2240    2338     -58 DL     
 3  2013    11    10     1408       1440     -32    1549    1559     -10 EV     
 4  2013     1    11     1900       1930     -30    2233    2243     -10 DL     
 5  2013     1    29     1703       1730     -27    1947    1957     -10 F9     
 6  2013     8     9      729        755     -26    1002     955       7 MQ     
 7  2013    10    23     1907       1932     -25    2143    2143       0 EV     
 8  2013     3    30     2030       2055     -25    2213    2250     -37 MQ     
 9  2013     3     2     1431       1455     -24    1601    1631     -30 9E     
10  2013     5     5      934        958     -24    1225    1309     -44 B6     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

How could you arrange all missing values to the start in the df tibble ?

arrange(df, desc(is.na(y)))
# A tibble: 3 × 2
  x         y
  <chr> <dbl>
1 B        NA
2 A         1
3 C         3

4 Select columns with select()

select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.

You can select by column names

select(flights, year, month, day)
# A tibble: 336,776 × 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# … with 336,766 more rows

By defining a range of columns

select(flights, year:day)
# A tibble: 336,776 × 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# … with 336,766 more rows

Or, you can do a negative (-) to remove columns.

select(flights, -(year:day))
# A tibble: 336,776 × 16
   dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶ carrier flight tailnum origin
     <int>   <int>   <dbl>   <int>   <int>   <dbl> <chr>    <int> <chr>   <chr> 
 1     517     515       2     830     819      11 UA        1545 N14228  EWR   
 2     533     529       4     850     830      20 UA        1714 N24211  LGA   
 3     542     540       2     923     850      33 AA        1141 N619AA  JFK   
 4     544     545      -1    1004    1022     -18 B6         725 N804JB  JFK   
 5     554     600      -6     812     837     -25 DL         461 N668DN  LGA   
 6     554     558      -4     740     728      12 UA        1696 N39463  EWR   
 7     555     600      -5     913     854      19 B6         507 N516JB  EWR   
 8     557     600      -3     709     723     -14 EV        5708 N829AS  LGA   
 9     557     600      -3     838     846      -8 B6          79 N593JB  JFK   
10     558     600      -2     753     745       8 AA         301 N3ALAA  LGA   
# … with 336,766 more rows, 6 more variables: dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
#   variable names ¹​dep_time, ²​sched_dep_time, ³​dep_delay, ⁴​arr_time,
#   ⁵​sched_arr_time, ⁶​arr_delay

And, you can also rename column names on the fly.

select(flights, Y = year, M = month, D = day)
# A tibble: 336,776 × 3
       Y     M     D
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# … with 336,766 more rows

4.1 Helper functions

here are a number of helper functions you can use within select():

  • starts_with("abc"): matches column names that begin with "abc".
  • ends_with("xyz"): matches column names that end with "xyz".
  • contains("ijk"): matches column names that contain "ijk".
  • num_range("x", 1:3): matches x1, x2 and x3.
  • where(test_function): select columns for which the result is TRUE.

See ?select for more details.

4.2 Challenges

  • Brainstorm as many ways as possible to select only dep_time, dep_delay, arr_time, and arr_delay from flights. You can associate several selections arguments with | , & and !.

The simplest way to start:

df_dep_arr <- select(flights, dep_time, dep_delay, arr_time, arr_delay)
colnames(df_dep_arr)
[1] "dep_time"  "dep_delay" "arr_time"  "arr_delay"
Other solutions

select(flights, dep_time, dep_delay, arr_time, arr_delay)
select(flights, starts_with("dep"), starts_with("arr") )
select(flights, starts_with("dep") | starts_with("arr") )
select(flights, matches("^(dep|arr)") )
select(flights, dep_time : arr_delay & !starts_with("sched"))

  • What does the any_of() function do?
  • Why might it be helpful in conjunction with this vector? What is the difference with all_of() (hint : add “toto” to vars) ?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
Solution

select(flights, any_of(vars))
select(flights, all_of(vars))

From the help message (?all_of()) :

  • all_of() is for strict selection. If any of the variables in the character vector is missing, an error is thrown.
  • any_of() doesn’t check for missing variables. It is especially useful with negative selections, when you would like to make sure a variable is removed.
vars <- c(vars, "toto")
select(flights, any_of(vars))
select(flights, all_of(vars))

  • Select all columns wich contain character values ? numeric values ?
Solution

select(flights, where(is.character))
select(flights, where(is.numeric))

  • Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME"))
Solution

select(flights, contains("TIME", ignore.case = FALSE))

5 Add new variables with mutate()

It’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().

First let’s create a thiner dataset to work on flights_thin that contains

  • columns from year to day
  • columns that ends with delays
  • the distance and air_time columns
  • the dep_time and sched_dep_time columns

Then let’s create an even smaller dataset as toy dataset to test your commands before using them on the large dataset (It a good reflex to take). For that you can use the function head or sample_n for a more random sampling.

  • select only 5 rows
Solution

(flights_thin <- select(flights,  year:day, ends_with("delay"), distance, air_time, contains("dep_time")))
# A tibble: 336,776 × 9
    year month   day dep_delay arr_delay distance air_time dep_time sched_dep_…¹
   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>    <int>        <int>
 1  2013     1     1         2        11     1400      227      517          515
 2  2013     1     1         4        20     1416      227      533          529
 3  2013     1     1         2        33     1089      160      542          540
 4  2013     1     1        -1       -18     1576      183      544          545
 5  2013     1     1        -6       -25      762      116      554          600
 6  2013     1     1        -4        12      719      150      554          558
 7  2013     1     1        -5        19     1065      158      555          600
 8  2013     1     1        -3       -14      229       53      557          600
 9  2013     1     1        -3        -8      944      140      557          600
10  2013     1     1        -2         8      733      138      558          600
# … with 336,766 more rows, and abbreviated variable name ¹​sched_dep_time
(flights_thin_toy <- head(flights_thin, n=5))
# A tibble: 5 × 9
   year month   day dep_delay arr_delay distance air_time dep_time sched_dep_t…¹
  <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>    <int>         <int>
1  2013     1     1         2        11     1400      227      517           515
2  2013     1     1         4        20     1416      227      533           529
3  2013     1     1         2        33     1089      160      542           540
4  2013     1     1        -1       -18     1576      183      544           545
5  2013     1     1        -6       -25      762      116      554           600
# … with abbreviated variable name ¹​sched_dep_time
(flights_thin_toy2 <- sample_n(flights_thin, size=5))
# A tibble: 5 × 9
   year month   day dep_delay arr_delay distance air_time dep_time sched_dep_t…¹
  <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>    <int>         <int>
1  2013     6     9        48        48      762      112     1458          1410
2  2013    10    25        -7       -20     1372      191      748           755
3  2013     1     1        -5         2      488      108     1230          1235
4  2013     6    14         9         0      733      108     1409          1400
5  2013     8    24        -3       -22      748      102      641           644
# … with abbreviated variable name ¹​sched_dep_time

5.1 mutate()

mutate(tbl, new_var_a = opperation_a, ..., new_var_n = opperation_n)

mutate() allows you to add new columns (new_var_a, … , new_var_n) and to fill them with the results of an operation.

We can create a gain column whic can be the difference betwenn the delay at the departure and at the arrival to check if the pilot managed to compensate is departure delay.

mutate(flights_thin_toy, gain = dep_delay - arr_delay)
# A tibble: 5 × 10
   year month   day dep_delay arr_delay distance air_time dep_time sched…¹  gain
  <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>    <int>   <int> <dbl>
1  2013     1     1         2        11     1400      227      517     515    -9
2  2013     1     1         4        20     1416      227      533     529   -16
3  2013     1     1         2        33     1089      160      542     540   -31
4  2013     1     1        -1       -18     1576      183      544     545    17
5  2013     1     1        -6       -25      762      116      554     600    19
# … with abbreviated variable name ¹​sched_dep_time

Using mutate to add a new column gain and speed that contains the average speed of the plane to the flights_thin_toy tibble (speed = distance / time).

Solution

flights_thin_toy <- mutate(flights_thin_toy,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)
flights_thin_toy
# A tibble: 5 × 11
   year month   day dep_de…¹ arr_d…² dista…³ air_t…⁴ dep_t…⁵ sched…⁶  gain speed
  <int> <int> <int>    <dbl>   <dbl>   <dbl>   <dbl>   <int>   <int> <dbl> <dbl>
1  2013     1     1        2      11    1400     227     517     515    -9  370.
2  2013     1     1        4      20    1416     227     533     529   -16  374.
3  2013     1     1        2      33    1089     160     542     540   -31  408.
4  2013     1     1       -1     -18    1576     183     544     545    17  517.
5  2013     1     1       -6     -25     762     116     554     600    19  394.
# … with abbreviated variable names ¹​dep_delay, ²​arr_delay, ³​distance,
#   ⁴​air_time, ⁵​dep_time, ⁶​sched_dep_time

Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they’re not really continuous numbers. (see the help to get more information on these columns) In the flight dataset, convert them to a more convenient representation of the number of minutes since midnight.

Hints :

  • dep_time and sched_dep_time are in the HHMM format (see the help to get these information). So you have to first get the number of hours HH, convert them in minutes and then add the number of minutes MM.

  • For exemple : 20:03 will be display 2003, so to convert it in minutes you have to do 20 * 60 + 03 (= 1203).

  • To split the number HHMM in hours (HH) and minutes (MM) you have to use an eucledean division of HHMM by 100 to get the number of hours as the divisor and the number of minute as the remainder. For that use the modulo operator %% to get the remainder and it’s friend %/% which return the divisor.

HH <- 2003 %/% 100
HH
[1] 20
MM <- 2003 %% 100 
MM
[1] 3
HH * 60 + MM
[1] 1203

It is always a good idea to decompose a problem in small parts. First train you only on dep_time. Build the HH and MM columns. Then try to do the convertions in one row.

Partial solution

mutate(
  flights_thin_toy,
  HH = dep_time %/% 100,
  MM = dep_time %% 100,
  dep_time2 = HH * 60 + MM
)

** Note ** You can use the .after option to tell where to put the new columns

mutate(
  flights_thin_toy,
  HH = dep_time %/% 100,
  MM = dep_time %% 100,
  dep_time2 = HH * 60 + MM,
  .after = "dep_time" )
# A tibble: 5 × 14
   year month   day dep_de…¹ arr_d…² dista…³ air_t…⁴ dep_t…⁵    HH    MM dep_t…⁶
  <int> <int> <int>    <dbl>   <dbl>   <dbl>   <dbl>   <int> <dbl> <dbl>   <dbl>
1  2013     1     1        2      11    1400     227     517     5    17     317
2  2013     1     1        4      20    1416     227     533     5    33     333
3  2013     1     1        2      33    1089     160     542     5    42     342
4  2013     1     1       -1     -18    1576     183     544     5    44     344
5  2013     1     1       -6     -25     762     116     554     5    54     354
# … with 3 more variables: sched_dep_time <int>, gain <dbl>, speed <dbl>, and
#   abbreviated variable names ¹​dep_delay, ²​arr_delay, ³​distance, ⁴​air_time,
#   ⁵​dep_time, ⁶​dep_time2

or .keep = "used" to keep only the columns used for the calculus which can be usefull for debugging

mutate(
  flights_thin_toy,
  HH = dep_time %/% 100,
  MM = dep_time %% 100,
  dep_time2 = HH * 60 + MM,
  .keep = "used" )
# A tibble: 5 × 4
  dep_time    HH    MM dep_time2
     <int> <dbl> <dbl>     <dbl>
1      517     5    17       317
2      533     5    33       333
3      542     5    42       342
4      544     5    44       344
5      554     5    54       354

In one row (or you can also remove columns HH and MM using select):

mutate(
  flights_thin_toy,
  dep_time2 =  dep_time %/% 100 * 60 + dep_time %% 100,
  .after = "dep_time" )

** Note ** You can also directly replace a column by the result of the mutate operation.

mutate(
  flights_thin_toy,
  dep_time = dep_time * 60 + dep_time)

Final solution

mutate(
  flights,
  dep_time = (dep_time %/% 100) * 60 +
    dep_time %% 100,
  sched_dep_time = (sched_dep_time %/% 100) * 60 +
    sched_dep_time %% 100
)

5.2 Useful creation functions

  • Offsets: lead(x) and lag(x) allow you to refer to the previous or next values of the column x. This allows you to compute running differences (e.g. x - lag(x)) or find when values change (x != lag(x)).
  • R provides functions for running cumulative sums, products, mins and maxes: cumsum(), cumprod(), cummin(), cummax(); and dplyr provides cummean() for cumulative means.
  • Logical comparisons, <, <=, >, >=, !=, and ==
  • Ranking: there are a number of ranking functions, the most frequently used being min_rank(). They differ by the way ties are treated, etc. Try ?mutate, ?min_rank, ?rank, for more information.

5.3 See you in R#5: Pipping and grouping

6 To go further: Data transformation and color sets.

There are a number of color palettes available in R, thanks to different packages such as RColorBrewer, Viridis or Ghibli. We will use them here to decorate our graphs, either on data already studied in the training, mpg, or on more specialized data such as lists of differentially expressed genes ( GSE86356 )

install.packages(c("ghibli", "RColorBrewer", "viridis"))
library(tidyverse)

library(RColorBrewer)
library(ghibli)
library(viridis)
Loading required package: viridisLite

6.1 RColorBrewer & Ghibli

Using mpg and the ‘ggplot2’ package, reproduce the graph studied in session 2, 3.1: color mapping. Modify the colors representing the class of cars with the palettes Dark2 of RColorBrewer, then MononokeMedium from Ghibli.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + 
  geom_point()

Go to the links to find the appropriate function: they are very similar between the two packages.

Solution

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + 
  geom_point() +
  scale_color_brewer(palette = "Dark2")

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) + 
  geom_point() +
  scale_colour_ghibli_d("MononokeMedium")

The choice of colors is very important for the comprehension of a graphic. Some palettes are not suitable for everyone. For example, for people with color blindness, color gradients from green to red, or from yellow to blue should be avoided.

To display only Brewer palettes that are colorblind friendly, specify the option colorblindFriendly = TRUE as follows:

display.brewer.all(colorblindFriendly = TRUE)

6.2 Viridis

viridis package provide a series of color maps that are designed to improve graph readability for readers with common forms of color blindness and/or color vision deficiency.

For the next part, we will use a real data set. Anterior tibial muscle tissue was collected from 20 patients, with or without confirmed myotonic dystrophy type 1 (DM1). Illumina RNAseq was performed on these samples and the sequencing data are available on GEO with the identifier GSE86356.

First, we will use the gene count table of these samples, formatted for use in ggplot2 ( pivot_longer() function ).

Open the csv file using the read_csv2() function. The file is located at “https://can.gitbiopages.ens-lyon.fr/R_basis/session_4/Expression_matrice_pivot_longer_DEGs_GSE86356.csv”.

Solution

Download the Expression_matrice_pivot_longer_DEGs_GSE86356.csv file and save it in your working directory. You may have to set you working directory using setwd()

expr_DM1 <- read_csv2("Expression_matrice_pivot_longer_DEGs_GSE86356.csv")
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
Rows: 2808 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (3): Genes, samples, condition
dbl (1): counts

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
expr_DM1
# A tibble: 2,808 × 4
   Genes samples counts condition
   <chr> <chr>    <dbl> <chr>    
 1 MYH3  DM1_41  52488. DM1      
 2 MYH3  DM1_36   3467. DM1      
 3 MYH3  DM1_53    955. DM1      
 4 MYH3  DM1_45   1945. DM1      
 5 MYH3  DM1_17   3464. DM1      
 6 MYH3  DM1_22   2995. DM1      
 7 MYH3  DM1_39   3235. DM1      
 8 MYH3  DM1_34   3001. DM1      
 9 MYH3  DM1_25   2576. DM1      
10 MYH3  DM1_33   2386. DM1      
# … with 2,798 more rows

or you can read it from the url

(expr_DM1 <- read_csv2("https://can.gitbiopages.ens-lyon.fr/R_basis/session_4/Expression_matrice_pivot_longer_DEGs_GSE86356.csv"))

With this tibble, use ggplot2 and the geom_tile() function to make a heatmap. Fit the samples on the x-axis and the genes on the y-axis.

Tip: Transform the counts into log10(x + 1) for a better visualization.

Solution

(DM1_tile_base <-
  ggplot(expr_DM1, aes(samples, Genes, fill = log1p(counts))) +
  geom_tile() +
  labs(y="Genes", x = "Samples") +
  theme(
    axis.text.y = element_text(size= 6),
    axis.text.x = element_text(size = 6, angle = 90)
  ))
Nota bene : The elements of the axes, and the theme in general, are modified in the theme() function.

With the default color gradient, even with the transformation, the heatmap is difficult to study.

R interprets a large number of colors, indicated in RGB, hexadimal, or just by name. For example :

With scale_fill_gradient2() function, change the colors of the gradient, taking white for the minimum value and ‘springgreen4’ for the maximum value.

Solution

DM1_tile_base + scale_fill_gradient2(low = "white", high = "springgreen4")

It s better, but still not perfect! Now let s use the viridis color gradient for this graph.

Solution

DM1_tile_base + scale_fill_viridis_c()

6.3 Volcano Plot

For this last exercise, we will use the results of the differential gene expression analysis between DM1 vs WT conditions.

Open the csv file using the read_csv2() function. The file is located at “http://can.gitbiopages.ens-lyon.fr/R_basis/session_4/EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv”.

Solution

Download the “EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv” file and save it in your working directory.

tab <- read_csv2("EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv")
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
Rows: 15019 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (2): ENSEMBL_ID, gene_symbol
dbl (6): baseMean, log2FoldChange, lfcSE, stat, pvalue, padj

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tab
# A tibble: 15,019 × 8
   ENSEMBL_ID      gene_symbol baseMean log2FoldChange lfcSE  stat  pvalue  padj
   <chr>           <chr>          <dbl>          <dbl> <dbl> <dbl>   <dbl> <dbl>
 1 ENSG00000000003 TSPAN6          70.1         0.192  0.198  0    1       1    
 2 ENSG00000000005 TNMD            11.0         2.31   1.00   1.91 0.0568  1    
 3 ENSG00000000419 DPM1           267.         -0.0149 0.163  0    1       1    
 4 ENSG00000000457 SCYL3          257.         -0.0119 0.136  0    1       1    
 5 ENSG00000000460 C1orf112        86.0        -0.344  0.169  0    1       1    
 6 ENSG00000000938 FGR             29.7        -1.12   0.280 -2.56 0.0106  0.442
 7 ENSG00000000971 CFH           1231.          1.05   0.250  2.59 0.00965 0.421
 8 ENSG00000001036 FUCA2          103.         -0.191  0.169  0    1       1    
 9 ENSG00000001084 GCLC           275.         -0.268  0.143  0    1       1    
10 ENSG00000001167 NFYA           176.         -0.0748 0.148  0    1       1    
# … with 15,009 more rows
tab <- read_csv2("https://can.gitbiopages.ens-lyon.fr/R_basis/session_4/EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv")
tab

To make a Volcano plot, displaying different information about the significativity of the variation thanks to the colors, we will have to make a series of modifications on this table.

With mutate() and ifelse() fonctions, we will have to create :

  • a column ‘sig’ : it indicates if the gene is significant ( TRUE or FALSE ). Thresholds : baseMean > 20 and padj < 0.05 and abs(log2FoldChange) >= 1.5

  • a column ‘UpDown’ : it indicates if the gene is significantly up-regulated (Up), down-regulated (Down), or not significantly regulated (NO).

Solution

(tab.sig <- mutate(tab,
                   sig = baseMean > 20 & padj < 0.05 & abs(log2FoldChange) >= 1.5,
                   UpDown = ifelse(sig, ### we can use in the same mutate a column created by a previous line
                                  ifelse(log2FoldChange > 0, "Up", "Down"), "NO")
                   )
)
# A tibble: 15,019 × 10
   ENSEMBL_ID     gene_…¹ baseM…² log2F…³ lfcSE  stat  pvalue  padj sig   UpDown
   <chr>          <chr>     <dbl>   <dbl> <dbl> <dbl>   <dbl> <dbl> <lgl> <chr> 
 1 ENSG000000000… TSPAN6     70.1  0.192  0.198  0    1       1     FALSE NO    
 2 ENSG000000000… TNMD       11.0  2.31   1.00   1.91 0.0568  1     FALSE NO    
 3 ENSG000000004… DPM1      267.  -0.0149 0.163  0    1       1     FALSE NO    
 4 ENSG000000004… SCYL3     257.  -0.0119 0.136  0    1       1     FALSE NO    
 5 ENSG000000004… C1orf1…    86.0 -0.344  0.169  0    1       1     FALSE NO    
 6 ENSG000000009… FGR        29.7 -1.12   0.280 -2.56 0.0106  0.442 FALSE NO    
 7 ENSG000000009… CFH      1231.   1.05   0.250  2.59 0.00965 0.421 FALSE NO    
 8 ENSG000000010… FUCA2     103.  -0.191  0.169  0    1       1     FALSE NO    
 9 ENSG000000010… GCLC      275.  -0.268  0.143  0    1       1     FALSE NO    
10 ENSG000000011… NFYA      176.  -0.0748 0.148  0    1       1     FALSE NO    
# … with 15,009 more rows, and abbreviated variable names ¹​gene_symbol,
#   ²​baseMean, ³​log2FoldChange

We want to see the top10 DEGs on the graph. For this, we will use the package ggrepel. Install and load the ggrepel package.

Solution

install.packages("ggrepel")
library(ggrepel)

Let’s filter out table into a new variable, top10, to keep only the significant differentialy expressed genes with the top 10 adjusted pvalue. The smaller the adjusted pvalue, the more significant.

Tips : You can use the function slice_min()

Solution

(top10 <- arrange(tab.sig, desc(sig), padj))
# A tibble: 15,019 × 10
   ENSEMBL_ID gene_…¹ baseM…² log2F…³ lfcSE  stat   pvalue     padj sig   UpDown
   <chr>      <chr>     <dbl>   <dbl> <dbl> <dbl>    <dbl>    <dbl> <lgl> <chr> 
 1 ENSG00000… ERBB3     182.     4.33 0.510  7.71 1.31e-14 3.04e-10 TRUE  Up    
 2 ENSG00000… NWD1       26.0    3.86 0.506  6.83 8.45e-12 9.84e- 8 TRUE  Up    
 3 ENSG00000… PEG10     172.     2.04 0.269  6.11 9.76e-10 3.36e- 6 TRUE  Up    
 4 ENSG00000… MYH8      832.     3.37 0.491  6.05 1.46e- 9 3.78e- 6 TRUE  Up    
 5 ENSG00000… PKP2      283.     1.92 0.255  5.97 2.34e- 9 5.23e- 6 TRUE  Up    
 6 ENSG00000… SLC14A2    20.9    3.53 0.525  5.96 2.47e- 9 5.23e- 6 TRUE  Up    
 7 ENSG00000… MYH3     2741.     4.37 0.668  5.94 2.85e- 9 5.52e- 6 TRUE  Up    
 8 ENSG00000… WBSCR17    95.3    3.17 0.497  5.58 2.34e- 8 3.41e- 5 TRUE  Up    
 9 ENSG00000… HBA2      747.    -2.66 0.410 -5.52 3.33e- 8 4.31e- 5 TRUE  Down  
10 ENSG00000… ATP1A4     38.6    2.01 0.300  5.37 7.66e- 8 7.76e- 5 TRUE  Up    
# … with 15,009 more rows, and abbreviated variable names ¹​gene_symbol,
#   ²​baseMean, ³​log2FoldChange
(top10 <- mutate(top10, row_N = row_number()))
# A tibble: 15,019 × 11
   ENSEMBL_ID gene_…¹ baseM…² log2F…³ lfcSE  stat   pvalue     padj sig   UpDown
   <chr>      <chr>     <dbl>   <dbl> <dbl> <dbl>    <dbl>    <dbl> <lgl> <chr> 
 1 ENSG00000… ERBB3     182.     4.33 0.510  7.71 1.31e-14 3.04e-10 TRUE  Up    
 2 ENSG00000… NWD1       26.0    3.86 0.506  6.83 8.45e-12 9.84e- 8 TRUE  Up    
 3 ENSG00000… PEG10     172.     2.04 0.269  6.11 9.76e-10 3.36e- 6 TRUE  Up    
 4 ENSG00000… MYH8      832.     3.37 0.491  6.05 1.46e- 9 3.78e- 6 TRUE  Up    
 5 ENSG00000… PKP2      283.     1.92 0.255  5.97 2.34e- 9 5.23e- 6 TRUE  Up    
 6 ENSG00000… SLC14A2    20.9    3.53 0.525  5.96 2.47e- 9 5.23e- 6 TRUE  Up    
 7 ENSG00000… MYH3     2741.     4.37 0.668  5.94 2.85e- 9 5.52e- 6 TRUE  Up    
 8 ENSG00000… WBSCR17    95.3    3.17 0.497  5.58 2.34e- 8 3.41e- 5 TRUE  Up    
 9 ENSG00000… HBA2      747.    -2.66 0.410 -5.52 3.33e- 8 4.31e- 5 TRUE  Down  
10 ENSG00000… ATP1A4     38.6    2.01 0.300  5.37 7.66e- 8 7.76e- 5 TRUE  Up    
# … with 15,009 more rows, 1 more variable: row_N <int>, and abbreviated
#   variable names ¹​gene_symbol, ²​baseMean, ³​log2FoldChange
(top10 <- filter(top10, row_N <= 10))
# A tibble: 10 × 11
   ENSEMBL_ID gene_…¹ baseM…² log2F…³ lfcSE  stat   pvalue     padj sig   UpDown
   <chr>      <chr>     <dbl>   <dbl> <dbl> <dbl>    <dbl>    <dbl> <lgl> <chr> 
 1 ENSG00000… ERBB3     182.     4.33 0.510  7.71 1.31e-14 3.04e-10 TRUE  Up    
 2 ENSG00000… NWD1       26.0    3.86 0.506  6.83 8.45e-12 9.84e- 8 TRUE  Up    
 3 ENSG00000… PEG10     172.     2.04 0.269  6.11 9.76e-10 3.36e- 6 TRUE  Up    
 4 ENSG00000… MYH8      832.     3.37 0.491  6.05 1.46e- 9 3.78e- 6 TRUE  Up    
 5 ENSG00000… PKP2      283.     1.92 0.255  5.97 2.34e- 9 5.23e- 6 TRUE  Up    
 6 ENSG00000… SLC14A2    20.9    3.53 0.525  5.96 2.47e- 9 5.23e- 6 TRUE  Up    
 7 ENSG00000… MYH3     2741.     4.37 0.668  5.94 2.85e- 9 5.52e- 6 TRUE  Up    
 8 ENSG00000… WBSCR17    95.3    3.17 0.497  5.58 2.34e- 8 3.41e- 5 TRUE  Up    
 9 ENSG00000… HBA2      747.    -2.66 0.410 -5.52 3.33e- 8 4.31e- 5 TRUE  Down  
10 ENSG00000… ATP1A4     38.6    2.01 0.300  5.37 7.66e- 8 7.76e- 5 TRUE  Up    
# … with 1 more variable: row_N <int>, and abbreviated variable names
#   ¹​gene_symbol, ²​baseMean, ³​log2FoldChange
(top10 <- filter(tab.sig, sig == TRUE))
# A tibble: 52 × 10
   ENSEMBL_ID gene_…¹ baseM…² log2F…³ lfcSE  stat   pvalue     padj sig   UpDown
   <chr>      <chr>     <dbl>   <dbl> <dbl> <dbl>    <dbl>    <dbl> <lgl> <chr> 
 1 ENSG00000… FAM184B    88.3    1.61 0.289  4.19 2.81e- 5 7.37e- 3 TRUE  Up    
 2 ENSG00000… ALX4       48.3    3.14 0.613  4.47 7.74e- 6 3.16e- 3 TRUE  Up    
 3 ENSG00000… PKP2      283.     1.92 0.255  5.97 2.34e- 9 5.23e- 6 TRUE  Up    
 4 ENSG00000… ERBB3     182.     4.33 0.510  7.71 1.31e-14 3.04e-10 TRUE  Up    
 5 ENSG00000… RPS6KA6    77.5    2.26 0.371  5.02 5.04e- 7 3.45e- 4 TRUE  Up    
 6 ENSG00000… TUBB1      37.7   -2.54 0.410 -5.23 1.74e- 7 1.45e- 4 TRUE  Down  
 7 ENSG00000… KLHL4      57.3    2.74 0.555  4.22 2.42e- 5 6.96e- 3 TRUE  Up    
 8 ENSG00000… ITGB8      35.2    2.08 0.370  4.52 6.11e- 6 2.64e- 3 TRUE  Up    
 9 ENSG00000… DNAH11    102.     2.93 0.606  4.17 3.01e- 5 7.62e- 3 TRUE  Up    
10 ENSG00000… CRHR2      30.2    2.36 0.456  4.30 1.74e- 5 5.62e- 3 TRUE  Up    
# … with 42 more rows, and abbreviated variable names ¹​gene_symbol, ²​baseMean,
#   ³​log2FoldChange
(top10 <- slice_min(top10, padj, n = 10))
# A tibble: 10 × 10
   ENSEMBL_ID gene_…¹ baseM…² log2F…³ lfcSE  stat   pvalue     padj sig   UpDown
   <chr>      <chr>     <dbl>   <dbl> <dbl> <dbl>    <dbl>    <dbl> <lgl> <chr> 
 1 ENSG00000… ERBB3     182.     4.33 0.510  7.71 1.31e-14 3.04e-10 TRUE  Up    
 2 ENSG00000… NWD1       26.0    3.86 0.506  6.83 8.45e-12 9.84e- 8 TRUE  Up    
 3 ENSG00000… PEG10     172.     2.04 0.269  6.11 9.76e-10 3.36e- 6 TRUE  Up    
 4 ENSG00000… MYH8      832.     3.37 0.491  6.05 1.46e- 9 3.78e- 6 TRUE  Up    
 5 ENSG00000… PKP2      283.     1.92 0.255  5.97 2.34e- 9 5.23e- 6 TRUE  Up    
 6 ENSG00000… SLC14A2    20.9    3.53 0.525  5.96 2.47e- 9 5.23e- 6 TRUE  Up    
 7 ENSG00000… MYH3     2741.     4.37 0.668  5.94 2.85e- 9 5.52e- 6 TRUE  Up    
 8 ENSG00000… WBSCR17    95.3    3.17 0.497  5.58 2.34e- 8 3.41e- 5 TRUE  Up    
 9 ENSG00000… HBA2      747.    -2.66 0.410 -5.52 3.33e- 8 4.31e- 5 TRUE  Down  
10 ENSG00000… ATP1A4     38.6    2.01 0.300  5.37 7.66e- 8 7.76e- 5 TRUE  Up    
# … with abbreviated variable names ¹​gene_symbol, ²​baseMean, ³​log2FoldChange

The data is ready to be used to make a volcano plot!

To make the graph below, use ggplot2, the functions geom_point(), geom_hline(), geom_vline(), theme_minimal(), theme() (to remove the legend), geom_label_repel() and the function scale_color_manual() for the colors.

  • Tips 1 : Don t forget the transformation of the adjusted pvalue.
  • Tips 2 : Feel free to search your favorite Web browser for help.
  • Tips 3 : geom_label_repel() function needs a new parameter ‘data’ and ‘label’ in aes parameters.

Solution

ggplot(tab.sig, aes(x = log2FoldChange, y = -log10(padj), color = UpDown)) +
  geom_point() +
  scale_color_manual(values=c("steelblue", "lightgrey", "firebrick" )) +
  geom_hline(yintercept=-log10(0.05), col="black") +
  geom_vline(xintercept=c(-1.5, 1.5), col="black") +
  theme_minimal() +
  theme(legend.position="none") +
  labs(y="-log10(p-value)", x = "log2(FoldChange)") +
  geom_label_repel(data = top10, mapping = aes(label = gene_symbol))