https://can.gitbiopages.ens-lyon.fr/R_basis/
1 Introduction
The goal of this practical is to practice data transformation with tidyverse
.
The objectives of this session will be to:
- Filter rows with
filter()
- Arrange rows with
arrange()
- Select columns with
select()
- Add new variables with
mutate()
For this session we are going to work with a new dataset included in the nycflights13
package.
Install this package and load it.
As usual you will also need the tidyverse
library.
Solution
install.packages("nycflights13")
library("tidyverse")
library("nycflights13")
1.1 Data set : nycflights13
nycflights13::flights
Contains all 336,776 flights that departed from New York City in 2013.
The data comes from the US Bureau of Transportation Statistics, and is documented in ?flights
?flights
You can display the first rows of the dataset to have an overview of the data.
flights
# A tibble: 336,776 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 517 515 2 830 819 11 UA
2 2013 1 1 533 529 4 850 830 20 UA
3 2013 1 1 542 540 2 923 850 33 AA
4 2013 1 1 544 545 -1 1004 1022 -18 B6
5 2013 1 1 554 600 -6 812 837 -25 DL
6 2013 1 1 554 558 -4 740 728 12 UA
7 2013 1 1 555 600 -5 913 854 19 B6
8 2013 1 1 557 600 -3 709 723 -14 EV
9 2013 1 1 557 600 -3 838 846 -8 B6
10 2013 1 1 558 600 -2 753 745 8 AA
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
To know all the colnames of a table you can use the function colnames(dataset)
colnames(flights)
[1] "year" "month" "day" "dep_time"
[5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
[9] "arr_delay" "carrier" "flight" "tailnum"
[13] "origin" "dest" "air_time" "distance"
[17] "hour" "minute" "time_hour"
1.2 Data type
In programming languages, all variables are not equal.
When you display a tibble
you can see the type of a column.
Here is a list of common variable types that you will encounter
- int stands for integers.
- dbl stands for doubles or real numbers.
- chr stands for character vectors or strings.
- dttm stands for date-times (a date + a time).
- lgl stands for logical, vectors that contain only
TRUE
orFALSE
. - fctr stands for factors, which R uses to represent categorical variables with fixed possible values.
- date stands for dates.
You cannot add an int to a chr, but you can add an int to a dbl the results will be a dbl.
2 filter
rows
Variable types are important to keep in mind for comparisons.
The filter()
function allows you to subset observations based on their values.
The good reflex to take when you meet a new function of a package is to look at the help with ?function_name
to learn how to use it and to know the different arguments.
?filter
2.1 Use test to filter on a column
You can use the relational operators (<
,>
,==
,<=
,>=
,!=
) to make a test on a column and keep rows for which the results is TRUE
.
filter(flights, air_time >= 680)
filter(flights, carrier == "HA")
filter(flights, origin != "JFK")
The operator %in%
is very usefull to test if a value is in a list.
filter(flights, carrier %in% c("OO","AS"))
filter(flights, month %in% c(5,6,7,12))
dplyr
functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, <-
Save the flights longer than 680 minutes in a long_flights
variable
Solution
long_flights <- filter(flights, air_time >= 680)
2.2 Logical operators to filter on several columns
Multiple arguments to filter()
are combined with AND: every expression must be TRUE
in order for a row to be included in the output.
filter(flights, month == 12, day == 25)
# A tibble: 719 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 12 25 456 500 -4 649 651 -2 US
2 2013 12 25 524 515 9 805 814 -9 UA
3 2013 12 25 542 540 2 832 850 -18 AA
4 2013 12 25 546 550 -4 1022 1027 -5 B6
5 2013 12 25 556 600 -4 730 745 -15 AA
6 2013 12 25 557 600 -3 743 752 -9 DL
7 2013 12 25 557 600 -3 818 831 -13 DL
8 2013 12 25 559 600 -1 855 856 -1 B6
9 2013 12 25 559 600 -1 849 855 -6 B6
10 2013 12 25 600 600 0 850 846 4 B6
# … with 709 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
In R you can use the symbols &
(and), |
(or), !
(not) and the function xor()
to build other kinds of tests.
Display the long_flights
variable and predict the results of
filter(long_flights, day <= 15 & carrier == "HA")
filter(long_flights, day <= 15 | carrier == "HA")
filter(long_flights, (day <= 15 | carrier == "HA") & (! month > 2))
Solution
long_flights
# A tibble: 5 × 19
year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 2 6 853 900 -7 1542 1540 2 HA
2 2013 3 15 1001 1000 1 1551 1530 21 HA
3 2013 3 16 1001 1000 1 1544 1530 14 HA
4 2013 3 17 1006 1000 6 1607 1530 37 HA
5 2013 3 17 1337 1335 2 1937 1836 61 UA
# … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
# ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
filter(long_flights, day <= 15 & carrier == "HA")
# A tibble: 2 × 19
year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 2 6 853 900 -7 1542 1540 2 HA
2 2013 3 15 1001 1000 1 1551 1530 21 HA
# … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
# ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
filter(long_flights, day <= 15 | carrier == "HA")
# A tibble: 4 × 19
year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 2 6 853 900 -7 1542 1540 2 HA
2 2013 3 15 1001 1000 1 1551 1530 21 HA
3 2013 3 16 1001 1000 1 1544 1530 14 HA
4 2013 3 17 1006 1000 6 1607 1530 37 HA
# … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
# ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
filter(long_flights, (day <= 15 | carrier == "HA") & (! month > 2))
# A tibble: 1 × 19
year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 2 6 853 900 -7 1542 1540 2 HA
# … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
# ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
Test the following operations and translate them with words
filter(flights, month == 11 | month == 12)
filter(flights, month %in% c(11, 12))
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120 & dep_delay <= 120)
filter(flights, arr_delay <= 120, dep_delay <= 120)
Combinations of logical operators is a powerful programmatic way to select subset of data. Keep in mind, however, that long logical expression can be hard to read and understand, so it may be easier to apply successive small filters instead of one long one.
R either prints out the results, or saves them to a variable.
What happens when you put your variable assignment code between parenthesis (
)
?
(dec25 <- filter(flights, month == 12, day == 25))
2.3 Missing values
One important feature of R that can make comparison tricky is missing values, or NA
s for Not Availables.
Indeed each of the variable type can contain either a value of this type (i.e., 2
for an int) or nothing.
The nothing recorded in a variable status is represented with the NA
symbol.
As operations with NA
values don’t make sense, if you have NA
somewhere in your operation, the results will be NA
NA > 5
[1] NA
10 == NA
[1] NA
NA + 10
[1] NA
However, you can test for NA
s with the function is.na()
:
is.na(NA)
[1] TRUE
filter()
only includes rows where the condition is TRUE
; it excludes both FALSE
and NA
values. If you want to preserve missing values, ask for them explicitly:
df <- tibble( x = c("A","B","C"),
y = c(1, NA, 3)
)
df
# A tibble: 3 × 2
x y
<chr> <dbl>
1 A 1
2 B NA
3 C 3
filter(df, y > 1)
# A tibble: 1 × 2
x y
<chr> <dbl>
1 C 3
filter(df, is.na(y) | y > 1)
# A tibble: 2 × 2
x y
<chr> <dbl>
1 B NA
2 C 3
2.4 Challenges
Find all flights that:
- Had an arrival delay (arr_delay
) of two or more hours (you can check ?flights
)
- Flew to Houston (IAH or HOU)
Solution
filter(flights, arr_delay >= 120 & dest %in% c("IAH", "HOU"))
# A tibble: 220 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 1114 900 134 1447 1222 145 UA
2 2013 1 10 2137 1630 307 17 1925 292 UA
3 2013 1 15 1603 1446 77 1957 1757 120 UA
4 2013 1 16 1239 1043 116 1558 1340 138 UA
5 2013 1 20 2136 1700 276 27 2011 256 UA
6 2013 1 21 1708 1446 142 2032 1757 155 UA
7 2013 1 25 1409 1155 134 1710 1459 131 B6
8 2013 1 30 2312 2040 152 227 2345 162 B6
9 2013 1 31 1837 1635 122 2241 1945 176 WN
10 2013 1 31 2123 1856 147 49 2204 165 UA
# … with 210 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
How many flights have a missing dep_time
?
Solution
filter(flights, is.na(dep_time))
# A tibble: 8,255 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 1 NA 1630 NA NA 1815 NA EV
2 2013 1 1 NA 1935 NA NA 2240 NA AA
3 2013 1 1 NA 1500 NA NA 1825 NA AA
4 2013 1 1 NA 600 NA NA 901 NA B6
5 2013 1 2 NA 1540 NA NA 1747 NA EV
6 2013 1 2 NA 1620 NA NA 1746 NA EV
7 2013 1 2 NA 1355 NA NA 1459 NA EV
8 2013 1 2 NA 1420 NA NA 1644 NA EV
9 2013 1 2 NA 1321 NA NA 1536 NA EV
10 2013 1 2 NA 1545 NA NA 1910 NA AA
# … with 8,245 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
Why is NA ^ 0
not missing? Why is NA | TRUE
not missing? Why is FALSE & NA
not missing? Can you figure out the general rule? (NA * 0
is a tricky counterexample!)
Solution
NA ^ 0 # ^ 0 is always 1 it's an arbitrary rule not a computation
[1] 1
NA | TRUE # if a member of a OR operation is TRUE the results is TRUE
[1] TRUE
FALSE & NA # if a member of a AND operation is FALSE the results is FALSE
[1] FALSE
NA * 0 # here we have a true computation
[1] NA
3 Arrange rows with arrange()
arrange()
works similarly to filter()
except that instead of selecting rows, it changes their order.
arrange(flights, distance, dep_delay)
# A tibble: 336,776 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 7 27 NA 106 NA NA 245 NA US
2 2013 1 15 2122 2130 -8 2207 2225 -18 EV
3 2013 3 30 1942 1950 -8 2026 2044 -18 EV
4 2013 2 2 1610 1617 -7 1702 1722 -20 EV
5 2013 2 12 2123 2130 -7 2211 2225 -14 EV
6 2013 1 21 2123 2129 -6 2216 2224 -8 EV
7 2013 1 5 1155 1200 -5 1241 1306 -25 EV
8 2013 1 7 2124 2129 -5 2212 2224 -12 EV
9 2013 1 6 2125 2129 -4 2224 2224 0 EV
10 2013 1 12 1613 1617 -4 1708 1722 -14 EV
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
You can use desc()
to reorder by a column in descending order:
arrange(flights, distance, desc(dep_delay))
# A tibble: 336,776 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 7 27 NA 106 NA NA 245 NA US
2 2013 1 4 1829 1615 134 1937 1721 136 EV
3 2013 1 17 2318 2129 109 2358 2224 94 EV
4 2013 1 25 2305 2129 96 2357 2224 93 EV
5 2013 2 11 2305 2129 96 3 2224 99 EV
6 2013 1 30 2244 2129 75 2341 2224 77 EV
7 2013 1 13 2243 2129 74 2400 2224 96 EV
8 2013 1 24 2241 2129 72 2350 2224 86 EV
9 2013 1 18 2231 2129 62 2320 2224 56 EV
10 2013 2 4 2231 2129 62 2333 2224 69 EV
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
3.1 Missing values
Missing values are always sorted at the end:
df <- tibble( x = c("A","B","C"),
y = c(1, NA, 3)
)
df
# A tibble: 3 × 2
x y
<chr> <dbl>
1 A 1
2 B NA
3 C 3
arrange(df, y)
# A tibble: 3 × 2
x y
<chr> <dbl>
1 A 1
2 C 3
3 B NA
arrange(df, desc(y))
# A tibble: 3 × 2
x y
<chr> <dbl>
1 C 3
2 A 1
3 B NA
3.2 Challenges
- Find the most delayed flight at arrival (
arr_delay
). - Find the flight that left earliest (
dep_delay
). - How could you arrange all missing values to the start in the
df
tibble ?
Solution
Find the most delayed flight at arrival
arrange(flights, desc(arr_delay))
# A tibble: 336,776 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 1 9 641 900 1301 1242 1530 1272 HA
2 2013 6 15 1432 1935 1137 1607 2120 1127 MQ
3 2013 1 10 1121 1635 1126 1239 1810 1109 MQ
4 2013 9 20 1139 1845 1014 1457 2210 1007 AA
5 2013 7 22 845 1600 1005 1044 1815 989 MQ
6 2013 4 10 1100 1900 960 1342 2211 931 DL
7 2013 3 17 2321 810 911 135 1020 915 DL
8 2013 7 22 2257 759 898 121 1026 895 DL
9 2013 12 5 756 1700 896 1058 2020 878 AA
10 2013 5 3 1133 2055 878 1250 2215 875 MQ
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
Find the flight that left earliest.
arrange(flights, dep_delay)
# A tibble: 336,776 × 19
year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 12 7 2040 2123 -43 40 2352 48 B6
2 2013 2 3 2022 2055 -33 2240 2338 -58 DL
3 2013 11 10 1408 1440 -32 1549 1559 -10 EV
4 2013 1 11 1900 1930 -30 2233 2243 -10 DL
5 2013 1 29 1703 1730 -27 1947 1957 -10 F9
6 2013 8 9 729 755 -26 1002 955 7 MQ
7 2013 10 23 1907 1932 -25 2143 2143 0 EV
8 2013 3 30 2030 2055 -25 2213 2250 -37 MQ
9 2013 3 2 1431 1455 -24 1601 1631 -30 9E
10 2013 5 5 934 958 -24 1225 1309 -44 B6
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
# origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
# minute <dbl>, time_hour <dttm>, and abbreviated variable names
# ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
How could you arrange all missing values to the start in the df
tibble ?
arrange(df, desc(is.na(y)))
# A tibble: 3 × 2
x y
<chr> <dbl>
1 B NA
2 A 1
3 C 3
4 Select columns with select()
select()
allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
You can select by column names
select(flights, year, month, day)
# A tibble: 336,776 × 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# … with 336,766 more rows
By defining a range of columns
select(flights, year:day)
# A tibble: 336,776 × 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# … with 336,766 more rows
Or, you can do a negative (-
) to remove columns.
select(flights, -(year:day))
# A tibble: 336,776 × 16
dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶ carrier flight tailnum origin
<int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
1 517 515 2 830 819 11 UA 1545 N14228 EWR
2 533 529 4 850 830 20 UA 1714 N24211 LGA
3 542 540 2 923 850 33 AA 1141 N619AA JFK
4 544 545 -1 1004 1022 -18 B6 725 N804JB JFK
5 554 600 -6 812 837 -25 DL 461 N668DN LGA
6 554 558 -4 740 728 12 UA 1696 N39463 EWR
7 555 600 -5 913 854 19 B6 507 N516JB EWR
8 557 600 -3 709 723 -14 EV 5708 N829AS LGA
9 557 600 -3 838 846 -8 B6 79 N593JB JFK
10 558 600 -2 753 745 8 AA 301 N3ALAA LGA
# … with 336,766 more rows, 6 more variables: dest <chr>, air_time <dbl>,
# distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and abbreviated
# variable names ¹dep_time, ²sched_dep_time, ³dep_delay, ⁴arr_time,
# ⁵sched_arr_time, ⁶arr_delay
And, you can also rename column names on the fly.
select(flights, Y = year, M = month, D = day)
# A tibble: 336,776 × 3
Y M D
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# … with 336,766 more rows
4.1 Helper functions
here are a number of helper functions you can use within select()
:
starts_with("abc")
: matches column names that begin with"abc"
.ends_with("xyz")
: matches column names that end with"xyz"
.contains("ijk")
: matches column names that contain"ijk"
.num_range("x", 1:3)
: matchesx1
,x2
andx3
.where(test_function)
: select columns for which the result is TRUE.
See ?select
for more details.
4.2 Challenges
- Brainstorm as many ways as possible to select only
dep_time
,dep_delay
,arr_time
, andarr_delay
fromflights
. You can associate several selections arguments with|
,&
and!
.
The simplest way to start:
df_dep_arr <- select(flights, dep_time, dep_delay, arr_time, arr_delay)
colnames(df_dep_arr)
[1] "dep_time" "dep_delay" "arr_time" "arr_delay"
Other solutions
select(flights, dep_time, dep_delay, arr_time, arr_delay)
select(flights, starts_with("dep"), starts_with("arr") )
select(flights, starts_with("dep") | starts_with("arr") )
select(flights, matches("^(dep|arr)") )
select(flights, dep_time : arr_delay & !starts_with("sched"))
- What does the
any_of()
function do? - Why might it be helpful in conjunction with this vector? What is the difference with
all_of()
(hint : add “toto” to vars) ?
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
Solution
select(flights, any_of(vars))
select(flights, all_of(vars))
From the help message (?all_of()
) :
- all_of() is for strict selection. If any of the variables in the character vector is missing, an error is thrown.
- any_of() doesn’t check for missing variables. It is especially useful with negative selections, when you would like to make sure a variable is removed.
vars <- c(vars, "toto")
select(flights, any_of(vars))
select(flights, all_of(vars))
- Select all columns wich contain character values ? numeric values ?
Solution
select(flights, where(is.character))
select(flights, where(is.numeric))
- Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME"))
Solution
select(flights, contains("TIME", ignore.case = FALSE))
5 Add new variables with mutate()
It’s often useful to add new columns that are functions of existing columns. That’s the job of mutate()
.
First let’s create a thiner dataset to work on flights_thin
that contains
- columns from
year
today
- columns that ends with
delays
- the
distance
andair_time
columns - the
dep_time
andsched_dep_time
columns
Then let’s create an even smaller dataset as toy dataset to test your commands before using them on the large dataset (It a good reflex to take). For that you can use the function head
or sample_n
for a more random sampling.
- select only 5 rows
Solution
(flights_thin <- select(flights, year:day, ends_with("delay"), distance, air_time, contains("dep_time")))
# A tibble: 336,776 × 9
year month day dep_delay arr_delay distance air_time dep_time sched_dep_…¹
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 2013 1 1 2 11 1400 227 517 515
2 2013 1 1 4 20 1416 227 533 529
3 2013 1 1 2 33 1089 160 542 540
4 2013 1 1 -1 -18 1576 183 544 545
5 2013 1 1 -6 -25 762 116 554 600
6 2013 1 1 -4 12 719 150 554 558
7 2013 1 1 -5 19 1065 158 555 600
8 2013 1 1 -3 -14 229 53 557 600
9 2013 1 1 -3 -8 944 140 557 600
10 2013 1 1 -2 8 733 138 558 600
# … with 336,766 more rows, and abbreviated variable name ¹sched_dep_time
(flights_thin_toy <- head(flights_thin, n=5))
# A tibble: 5 × 9
year month day dep_delay arr_delay distance air_time dep_time sched_dep_t…¹
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 2013 1 1 2 11 1400 227 517 515
2 2013 1 1 4 20 1416 227 533 529
3 2013 1 1 2 33 1089 160 542 540
4 2013 1 1 -1 -18 1576 183 544 545
5 2013 1 1 -6 -25 762 116 554 600
# … with abbreviated variable name ¹sched_dep_time
(flights_thin_toy2 <- sample_n(flights_thin, size=5))
# A tibble: 5 × 9
year month day dep_delay arr_delay distance air_time dep_time sched_dep_t…¹
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 2013 6 9 48 48 762 112 1458 1410
2 2013 10 25 -7 -20 1372 191 748 755
3 2013 1 1 -5 2 488 108 1230 1235
4 2013 6 14 9 0 733 108 1409 1400
5 2013 8 24 -3 -22 748 102 641 644
# … with abbreviated variable name ¹sched_dep_time
5.1 mutate()
mutate(tbl, new_var_a = opperation_a, ..., new_var_n = opperation_n)
mutate()
allows you to add new columns (new_var_a
, … , new_var_n
) and to fill them with the results of an operation.
We can create a gain
column whic can be the difference betwenn the delay at the departure and at the arrival to check if the pilot managed to compensate is departure delay.
mutate(flights_thin_toy, gain = dep_delay - arr_delay)
# A tibble: 5 × 10
year month day dep_delay arr_delay distance air_time dep_time sched…¹ gain
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <int> <int> <dbl>
1 2013 1 1 2 11 1400 227 517 515 -9
2 2013 1 1 4 20 1416 227 533 529 -16
3 2013 1 1 2 33 1089 160 542 540 -31
4 2013 1 1 -1 -18 1576 183 544 545 17
5 2013 1 1 -6 -25 762 116 554 600 19
# … with abbreviated variable name ¹sched_dep_time
Using mutate
to add a new column gain
and speed
that contains the average speed of the plane to the flights_thin_toy
tibble (speed = distance / time).
Solution
flights_thin_toy <- mutate(flights_thin_toy,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
)
flights_thin_toy
# A tibble: 5 × 11
year month day dep_de…¹ arr_d…² dista…³ air_t…⁴ dep_t…⁵ sched…⁶ gain speed
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <int> <int> <dbl> <dbl>
1 2013 1 1 2 11 1400 227 517 515 -9 370.
2 2013 1 1 4 20 1416 227 533 529 -16 374.
3 2013 1 1 2 33 1089 160 542 540 -31 408.
4 2013 1 1 -1 -18 1576 183 544 545 17 517.
5 2013 1 1 -6 -25 762 116 554 600 19 394.
# … with abbreviated variable names ¹dep_delay, ²arr_delay, ³distance,
# ⁴air_time, ⁵dep_time, ⁶sched_dep_time
Currently dep_time
and sched_dep_time
are convenient to look at, but hard to compute with because they’re not really continuous numbers. (see the help to get more information on these columns) In the flight dataset, convert them to a more convenient representation of the number of minutes since midnight.
Hints :
dep_time
andsched_dep_time
are in the HHMM format (see the help to get these information). So you have to first get the number of hoursHH
, convert them in minutes and then add the number of minutesMM
.For exemple :
20:03
will be display2003
, so to convert it in minutes you have to do20 * 60 + 03 (= 1203)
.To split the number
HHMM
in hours (HH
) and minutes (MM
) you have to use an eucledean division of HHMM by 100 to get the number of hours as the divisor and the number of minute as the remainder. For that use the modulo operator%%
to get the remainder and it’s friend%/%
which return the divisor.
HH <- 2003 %/% 100
HH
[1] 20
MM <- 2003 %% 100
MM
[1] 3
HH * 60 + MM
[1] 1203
It is always a good idea to decompose a problem in small parts.
First train you only on dep_time
. Build the HH and MM columns. Then try to do the convertions in one row.
Partial solution
mutate(
flights_thin_toy,
HH = dep_time %/% 100,
MM = dep_time %% 100,
dep_time2 = HH * 60 + MM
)
** Note ** You can use the .after
option to tell where to put the new columns
mutate(
flights_thin_toy,
HH = dep_time %/% 100,
MM = dep_time %% 100,
dep_time2 = HH * 60 + MM,
.after = "dep_time" )
# A tibble: 5 × 14
year month day dep_de…¹ arr_d…² dista…³ air_t…⁴ dep_t…⁵ HH MM dep_t…⁶
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2013 1 1 2 11 1400 227 517 5 17 317
2 2013 1 1 4 20 1416 227 533 5 33 333
3 2013 1 1 2 33 1089 160 542 5 42 342
4 2013 1 1 -1 -18 1576 183 544 5 44 344
5 2013 1 1 -6 -25 762 116 554 5 54 354
# … with 3 more variables: sched_dep_time <int>, gain <dbl>, speed <dbl>, and
# abbreviated variable names ¹dep_delay, ²arr_delay, ³distance, ⁴air_time,
# ⁵dep_time, ⁶dep_time2
or .keep = "used"
to keep only the columns used for the calculus which can be usefull for debugging
mutate(
flights_thin_toy,
HH = dep_time %/% 100,
MM = dep_time %% 100,
dep_time2 = HH * 60 + MM,
.keep = "used" )
# A tibble: 5 × 4
dep_time HH MM dep_time2
<int> <dbl> <dbl> <dbl>
1 517 5 17 317
2 533 5 33 333
3 542 5 42 342
4 544 5 44 344
5 554 5 54 354
In one row (or you can also remove columns HH and MM using select):
mutate(
flights_thin_toy,
dep_time2 = dep_time %/% 100 * 60 + dep_time %% 100,
.after = "dep_time" )
** Note ** You can also directly replace a column by the result of the mutate operation.
mutate(
flights_thin_toy,
dep_time = dep_time * 60 + dep_time)
Final solution
mutate(
flights,
dep_time = (dep_time %/% 100) * 60 +
dep_time %% 100,
sched_dep_time = (sched_dep_time %/% 100) * 60 +
sched_dep_time %% 100
)
5.2 Useful creation functions
- Offsets: lead(x) and lag(x) allow you to refer to the previous or next values of the column x. This allows you to compute running differences (e.g.
x - lag(x)
) or find when values change (x != lag(x)
). - R provides functions for running cumulative sums, products, mins and maxes:
cumsum()
,cumprod()
,cummin()
,cummax()
; and dplyr providescummean()
for cumulative means. - Logical comparisons,
<
,<=
,>
,>=
,!=
, and==
- Ranking: there are a number of ranking functions, the most frequently used being min_rank(). They differ by the way ties are treated, etc. Try ?mutate, ?min_rank, ?rank, for more information.
5.3 See you in R#5: Pipping and grouping
6 To go further: Data transformation and color sets.
There are a number of color palettes available in R, thanks to different packages such as RColorBrewer
, Viridis
or Ghibli
.
We will use them here to decorate our graphs, either on data already studied in the training, mpg
, or on more specialized data such as lists of differentially expressed genes ( GSE86356 )
install.packages(c("ghibli", "RColorBrewer", "viridis"))
library(tidyverse)
library(RColorBrewer)
library(ghibli)
library(viridis)
Loading required package: viridisLite
6.1 RColorBrewer & Ghibli
Using mpg
and the ‘ggplot2’ package, reproduce the graph studied in session 2, 3.1: color mapping.
Modify the colors representing the class of cars with the palettes Dark2
of RColorBrewer, then MononokeMedium
from Ghibli.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point()
Go to the links to find the appropriate function: they are very similar between the two packages.
Solution
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point() +
scale_color_brewer(palette = "Dark2")
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = class)) +
geom_point() +
scale_colour_ghibli_d("MononokeMedium")
The choice of colors is very important for the comprehension of a graphic. Some palettes are not suitable for everyone. For example, for people with color blindness, color gradients from green to red, or from yellow to blue should be avoided.
To display only Brewer palettes that are colorblind friendly, specify the option colorblindFriendly = TRUE
as follows:
display.brewer.all(colorblindFriendly = TRUE)
6.2 Viridis
viridis
package provide a series of color maps that are designed to improve graph readability for readers with common forms of color blindness and/or color vision deficiency.
For the next part, we will use a real data set. Anterior tibial muscle tissue was collected from 20 patients, with or without confirmed myotonic dystrophy type 1 (DM1). Illumina RNAseq was performed on these samples and the sequencing data are available on GEO with the identifier GSE86356.
First, we will use the gene count table of these samples, formatted for use in ggplot2 ( pivot_longer()
function ).
Open the csv file using the read_csv2()
function. The file is located at “https://can.gitbiopages.ens-lyon.fr/R_basis/session_4/Expression_matrice_pivot_longer_DEGs_GSE86356.csv”.
Solution
Download the Expression_matrice_pivot_longer_DEGs_GSE86356.csv file and save it in your working directory.
You may have to set you working directory using setwd()
expr_DM1 <- read_csv2("Expression_matrice_pivot_longer_DEGs_GSE86356.csv")
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
Rows: 2808 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (3): Genes, samples, condition
dbl (1): counts
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
expr_DM1
# A tibble: 2,808 × 4
Genes samples counts condition
<chr> <chr> <dbl> <chr>
1 MYH3 DM1_41 52488. DM1
2 MYH3 DM1_36 3467. DM1
3 MYH3 DM1_53 955. DM1
4 MYH3 DM1_45 1945. DM1
5 MYH3 DM1_17 3464. DM1
6 MYH3 DM1_22 2995. DM1
7 MYH3 DM1_39 3235. DM1
8 MYH3 DM1_34 3001. DM1
9 MYH3 DM1_25 2576. DM1
10 MYH3 DM1_33 2386. DM1
# … with 2,798 more rows
or you can read it from the url
(expr_DM1 <- read_csv2("https://can.gitbiopages.ens-lyon.fr/R_basis/session_4/Expression_matrice_pivot_longer_DEGs_GSE86356.csv"))
With this tibble, use ggplot2
and the geom_tile()
function to make a heatmap.
Fit the samples on the x-axis and the genes on the y-axis.
Tip: Transform the counts into log10(x + 1) for a better visualization.
Solution
(DM1_tile_base <-
ggplot(expr_DM1, aes(samples, Genes, fill = log1p(counts))) +
geom_tile() +
labs(y="Genes", x = "Samples") +
theme(
axis.text.y = element_text(size= 6),
axis.text.x = element_text(size = 6, angle = 90)
))
Nota bene : The elements of the axes, and the theme in general, are modified in the theme()
function.
With the default color gradient, even with the transformation, the heatmap is difficult to study.
R interprets a large number of colors, indicated in RGB, hexadimal, or just by name. For example :
With scale_fill_gradient2()
function, change the colors of the gradient, taking white for the minimum value and ‘springgreen4’ for the maximum value.
Solution
DM1_tile_base + scale_fill_gradient2(low = "white", high = "springgreen4")
It s better, but still not perfect! Now let s use the viridis color gradient for this graph.
Solution
DM1_tile_base + scale_fill_viridis_c()
6.3 Volcano Plot
For this last exercise, we will use the results of the differential gene expression analysis between DM1 vs WT conditions.
Open the csv file using the read_csv2()
function. The file is located at “http://can.gitbiopages.ens-lyon.fr/R_basis/session_4/EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv”.
Solution
Download the “EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv” file and save it in your working directory.
tab <- read_csv2("EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv")
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
Rows: 15019 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (2): ENSEMBL_ID, gene_symbol
dbl (6): baseMean, log2FoldChange, lfcSE, stat, pvalue, padj
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tab
# A tibble: 15,019 × 8
ENSEMBL_ID gene_symbol baseMean log2FoldChange lfcSE stat pvalue padj
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ENSG00000000003 TSPAN6 70.1 0.192 0.198 0 1 1
2 ENSG00000000005 TNMD 11.0 2.31 1.00 1.91 0.0568 1
3 ENSG00000000419 DPM1 267. -0.0149 0.163 0 1 1
4 ENSG00000000457 SCYL3 257. -0.0119 0.136 0 1 1
5 ENSG00000000460 C1orf112 86.0 -0.344 0.169 0 1 1
6 ENSG00000000938 FGR 29.7 -1.12 0.280 -2.56 0.0106 0.442
7 ENSG00000000971 CFH 1231. 1.05 0.250 2.59 0.00965 0.421
8 ENSG00000001036 FUCA2 103. -0.191 0.169 0 1 1
9 ENSG00000001084 GCLC 275. -0.268 0.143 0 1 1
10 ENSG00000001167 NFYA 176. -0.0748 0.148 0 1 1
# … with 15,009 more rows
tab <- read_csv2("https://can.gitbiopages.ens-lyon.fr/R_basis/session_4/EWang_Tibialis_DEGs_GRCH37-87_GSE86356.csv")
tab
To make a Volcano plot, displaying different information about the significativity of the variation thanks to the colors, we will have to make a series of modifications on this table.
With mutate()
and ifelse()
fonctions, we will have to create :
a column ‘sig’ : it indicates if the gene is significant ( TRUE or FALSE ). Thresholds : baseMean > 20 and padj < 0.05 and abs(log2FoldChange) >= 1.5
a column ‘UpDown’ : it indicates if the gene is significantly up-regulated (Up), down-regulated (Down), or not significantly regulated (NO).
Solution
(tab.sig <- mutate(tab,
sig = baseMean > 20 & padj < 0.05 & abs(log2FoldChange) >= 1.5,
UpDown = ifelse(sig, ### we can use in the same mutate a column created by a previous line
ifelse(log2FoldChange > 0, "Up", "Down"), "NO")
)
)
# A tibble: 15,019 × 10
ENSEMBL_ID gene_…¹ baseM…² log2F…³ lfcSE stat pvalue padj sig UpDown
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <chr>
1 ENSG000000000… TSPAN6 70.1 0.192 0.198 0 1 1 FALSE NO
2 ENSG000000000… TNMD 11.0 2.31 1.00 1.91 0.0568 1 FALSE NO
3 ENSG000000004… DPM1 267. -0.0149 0.163 0 1 1 FALSE NO
4 ENSG000000004… SCYL3 257. -0.0119 0.136 0 1 1 FALSE NO
5 ENSG000000004… C1orf1… 86.0 -0.344 0.169 0 1 1 FALSE NO
6 ENSG000000009… FGR 29.7 -1.12 0.280 -2.56 0.0106 0.442 FALSE NO
7 ENSG000000009… CFH 1231. 1.05 0.250 2.59 0.00965 0.421 FALSE NO
8 ENSG000000010… FUCA2 103. -0.191 0.169 0 1 1 FALSE NO
9 ENSG000000010… GCLC 275. -0.268 0.143 0 1 1 FALSE NO
10 ENSG000000011… NFYA 176. -0.0748 0.148 0 1 1 FALSE NO
# … with 15,009 more rows, and abbreviated variable names ¹gene_symbol,
# ²baseMean, ³log2FoldChange
We want to see the top10 DEGs on the graph. For this, we will use the package ggrepel
.
Install and load the ggrepel
package.
Solution
install.packages("ggrepel")
library(ggrepel)
Let’s filter out table into a new variable, top10, to keep only the significant differentialy expressed genes with the top 10 adjusted pvalue. The smaller the adjusted pvalue, the more significant.
Tips : You can use the function slice_min()
Solution
(top10 <- arrange(tab.sig, desc(sig), padj))
# A tibble: 15,019 × 10
ENSEMBL_ID gene_…¹ baseM…² log2F…³ lfcSE stat pvalue padj sig UpDown
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <chr>
1 ENSG00000… ERBB3 182. 4.33 0.510 7.71 1.31e-14 3.04e-10 TRUE Up
2 ENSG00000… NWD1 26.0 3.86 0.506 6.83 8.45e-12 9.84e- 8 TRUE Up
3 ENSG00000… PEG10 172. 2.04 0.269 6.11 9.76e-10 3.36e- 6 TRUE Up
4 ENSG00000… MYH8 832. 3.37 0.491 6.05 1.46e- 9 3.78e- 6 TRUE Up
5 ENSG00000… PKP2 283. 1.92 0.255 5.97 2.34e- 9 5.23e- 6 TRUE Up
6 ENSG00000… SLC14A2 20.9 3.53 0.525 5.96 2.47e- 9 5.23e- 6 TRUE Up
7 ENSG00000… MYH3 2741. 4.37 0.668 5.94 2.85e- 9 5.52e- 6 TRUE Up
8 ENSG00000… WBSCR17 95.3 3.17 0.497 5.58 2.34e- 8 3.41e- 5 TRUE Up
9 ENSG00000… HBA2 747. -2.66 0.410 -5.52 3.33e- 8 4.31e- 5 TRUE Down
10 ENSG00000… ATP1A4 38.6 2.01 0.300 5.37 7.66e- 8 7.76e- 5 TRUE Up
# … with 15,009 more rows, and abbreviated variable names ¹gene_symbol,
# ²baseMean, ³log2FoldChange
(top10 <- mutate(top10, row_N = row_number()))
# A tibble: 15,019 × 11
ENSEMBL_ID gene_…¹ baseM…² log2F…³ lfcSE stat pvalue padj sig UpDown
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <chr>
1 ENSG00000… ERBB3 182. 4.33 0.510 7.71 1.31e-14 3.04e-10 TRUE Up
2 ENSG00000… NWD1 26.0 3.86 0.506 6.83 8.45e-12 9.84e- 8 TRUE Up
3 ENSG00000… PEG10 172. 2.04 0.269 6.11 9.76e-10 3.36e- 6 TRUE Up
4 ENSG00000… MYH8 832. 3.37 0.491 6.05 1.46e- 9 3.78e- 6 TRUE Up
5 ENSG00000… PKP2 283. 1.92 0.255 5.97 2.34e- 9 5.23e- 6 TRUE Up
6 ENSG00000… SLC14A2 20.9 3.53 0.525 5.96 2.47e- 9 5.23e- 6 TRUE Up
7 ENSG00000… MYH3 2741. 4.37 0.668 5.94 2.85e- 9 5.52e- 6 TRUE Up
8 ENSG00000… WBSCR17 95.3 3.17 0.497 5.58 2.34e- 8 3.41e- 5 TRUE Up
9 ENSG00000… HBA2 747. -2.66 0.410 -5.52 3.33e- 8 4.31e- 5 TRUE Down
10 ENSG00000… ATP1A4 38.6 2.01 0.300 5.37 7.66e- 8 7.76e- 5 TRUE Up
# … with 15,009 more rows, 1 more variable: row_N <int>, and abbreviated
# variable names ¹gene_symbol, ²baseMean, ³log2FoldChange
(top10 <- filter(top10, row_N <= 10))
# A tibble: 10 × 11
ENSEMBL_ID gene_…¹ baseM…² log2F…³ lfcSE stat pvalue padj sig UpDown
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <chr>
1 ENSG00000… ERBB3 182. 4.33 0.510 7.71 1.31e-14 3.04e-10 TRUE Up
2 ENSG00000… NWD1 26.0 3.86 0.506 6.83 8.45e-12 9.84e- 8 TRUE Up
3 ENSG00000… PEG10 172. 2.04 0.269 6.11 9.76e-10 3.36e- 6 TRUE Up
4 ENSG00000… MYH8 832. 3.37 0.491 6.05 1.46e- 9 3.78e- 6 TRUE Up
5 ENSG00000… PKP2 283. 1.92 0.255 5.97 2.34e- 9 5.23e- 6 TRUE Up
6 ENSG00000… SLC14A2 20.9 3.53 0.525 5.96 2.47e- 9 5.23e- 6 TRUE Up
7 ENSG00000… MYH3 2741. 4.37 0.668 5.94 2.85e- 9 5.52e- 6 TRUE Up
8 ENSG00000… WBSCR17 95.3 3.17 0.497 5.58 2.34e- 8 3.41e- 5 TRUE Up
9 ENSG00000… HBA2 747. -2.66 0.410 -5.52 3.33e- 8 4.31e- 5 TRUE Down
10 ENSG00000… ATP1A4 38.6 2.01 0.300 5.37 7.66e- 8 7.76e- 5 TRUE Up
# … with 1 more variable: row_N <int>, and abbreviated variable names
# ¹gene_symbol, ²baseMean, ³log2FoldChange
(top10 <- filter(tab.sig, sig == TRUE))
# A tibble: 52 × 10
ENSEMBL_ID gene_…¹ baseM…² log2F…³ lfcSE stat pvalue padj sig UpDown
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <chr>
1 ENSG00000… FAM184B 88.3 1.61 0.289 4.19 2.81e- 5 7.37e- 3 TRUE Up
2 ENSG00000… ALX4 48.3 3.14 0.613 4.47 7.74e- 6 3.16e- 3 TRUE Up
3 ENSG00000… PKP2 283. 1.92 0.255 5.97 2.34e- 9 5.23e- 6 TRUE Up
4 ENSG00000… ERBB3 182. 4.33 0.510 7.71 1.31e-14 3.04e-10 TRUE Up
5 ENSG00000… RPS6KA6 77.5 2.26 0.371 5.02 5.04e- 7 3.45e- 4 TRUE Up
6 ENSG00000… TUBB1 37.7 -2.54 0.410 -5.23 1.74e- 7 1.45e- 4 TRUE Down
7 ENSG00000… KLHL4 57.3 2.74 0.555 4.22 2.42e- 5 6.96e- 3 TRUE Up
8 ENSG00000… ITGB8 35.2 2.08 0.370 4.52 6.11e- 6 2.64e- 3 TRUE Up
9 ENSG00000… DNAH11 102. 2.93 0.606 4.17 3.01e- 5 7.62e- 3 TRUE Up
10 ENSG00000… CRHR2 30.2 2.36 0.456 4.30 1.74e- 5 5.62e- 3 TRUE Up
# … with 42 more rows, and abbreviated variable names ¹gene_symbol, ²baseMean,
# ³log2FoldChange
(top10 <- slice_min(top10, padj, n = 10))
# A tibble: 10 × 10
ENSEMBL_ID gene_…¹ baseM…² log2F…³ lfcSE stat pvalue padj sig UpDown
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <chr>
1 ENSG00000… ERBB3 182. 4.33 0.510 7.71 1.31e-14 3.04e-10 TRUE Up
2 ENSG00000… NWD1 26.0 3.86 0.506 6.83 8.45e-12 9.84e- 8 TRUE Up
3 ENSG00000… PEG10 172. 2.04 0.269 6.11 9.76e-10 3.36e- 6 TRUE Up
4 ENSG00000… MYH8 832. 3.37 0.491 6.05 1.46e- 9 3.78e- 6 TRUE Up
5 ENSG00000… PKP2 283. 1.92 0.255 5.97 2.34e- 9 5.23e- 6 TRUE Up
6 ENSG00000… SLC14A2 20.9 3.53 0.525 5.96 2.47e- 9 5.23e- 6 TRUE Up
7 ENSG00000… MYH3 2741. 4.37 0.668 5.94 2.85e- 9 5.52e- 6 TRUE Up
8 ENSG00000… WBSCR17 95.3 3.17 0.497 5.58 2.34e- 8 3.41e- 5 TRUE Up
9 ENSG00000… HBA2 747. -2.66 0.410 -5.52 3.33e- 8 4.31e- 5 TRUE Down
10 ENSG00000… ATP1A4 38.6 2.01 0.300 5.37 7.66e- 8 7.76e- 5 TRUE Up
# … with abbreviated variable names ¹gene_symbol, ²baseMean, ³log2FoldChange
The data is ready to be used to make a volcano plot!
To make the graph below, use ggplot2
, the functions geom_point()
, geom_hline()
, geom_vline()
, theme_minimal()
, theme()
(to remove the legend), geom_label_repel()
and the function scale_color_manual()
for the colors.
- Tips 1 : Don t forget the transformation of the adjusted pvalue.
- Tips 2 : Feel free to search your favorite Web browser for help.
- Tips 3 :
geom_label_repel()
function needs a new parameter ‘data’ and ‘label’ in aes parameters.
Solution
ggplot(tab.sig, aes(x = log2FoldChange, y = -log10(padj), color = UpDown)) +
geom_point() +
scale_color_manual(values=c("steelblue", "lightgrey", "firebrick" )) +
geom_hline(yintercept=-log10(0.05), col="black") +
geom_vline(xintercept=c(-1.5, 1.5), col="black") +
theme_minimal() +
theme(legend.position="none") +
labs(y="-log10(p-value)", x = "log2(FoldChange)") +
geom_label_repel(data = top10, mapping = aes(label = gene_symbol))