library("tidyverse")
library("nycflights13")5 R.5: Pipping and grouping
5.1 Introduction
The goal of this session is to practice combining data transformation with tidyverse. The objectives will be to:
- Combining multiple operations with the pipe
%>% - Work on subgroup of the data with
group_by
For this session, we are going to work with a new dataset included in the nycflights13 package.
Install this package and load it. As usual you will also need the tidyverse library.
Solution
5.2 Combining multiple operations with the pipe
Find the 10 most delayed flights using the ranking function min_rank().
Solution
flights_md <- mutate(
flights,
most_delay = min_rank(desc(dep_delay))
)
flights_md <- filter(flights_md, most_delay < 10)
flights_md <- arrange(flights_md, most_delay)We don’t want to create useless intermediate variables so we can use the pipe operator: %>% (or ctrl + shift + M).
Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z) and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.
Try to pipe operators to rewrite your precedent code with only one variable assignment.
Solution
flights_md2 <- flights %>%
mutate(most_delay = min_rank(desc(dep_delay))) %>%
filter(most_delay < 10) %>%
arrange(most_delay)Working with the pipe is one of the key criteria for belonging to the tidyverse. The only exception is ggplot2: it was written before the pipe was discovered and use + instead of %>%.
The pipe is a powerful tool, but it’s not the only tool at your disposal, and it doesn’t solve every problem! Pipes are most useful for rewriting a fairly short linear sequence of operations.
5.2.1 When not to use the pipe
You should reach for another tool when:
- Your pipes are longer than (say) ten steps. In that case, create intermediate functions with meaningful names. That will make debugging easier, because you can more easily check the intermediate results, and it makes it easier to understand your code, because the variable names can help communicate intent.
- You have multiple inputs or outputs. If there isn’t one primary object being transformed, but two or more objects being combined together, don’t use the pipe. You can create a function that combines or split the results.
5.3 Grouping variable
The summarise() function collapses a data frame to a single row. Check the difference between summarise() and mutate() with the following commands:
flights %>%
mutate(delay = mean(dep_delay, na.rm = TRUE))
flights %>%
summarise(delay = mean(dep_delay, na.rm = TRUE))Whereas mutate compute the mean of dep_delay row by row (which is not useful), summarise compute the mean of the whole dep_delay column.
5.3.1 The power of summarise() with group_by()
The group_by() function changes the unit of analysis from the complete dataset to individual groups. Individual groups are defined by categorical variable or factors. Then, when you use aggregation functions on the grouped data frame, they’ll be automatically applied by groups.
You can use the following code to compute the average delay per months across years.
flights_delay <- flights %>%
group_by(year, month) %>%
summarise(delay = mean(dep_delay, na.rm = TRUE), sd = sd(dep_delay, na.rm = TRUE)) %>%
arrange(month)`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
ggplot(data = flights_delay, mapping = aes(x = month, y = delay)) +
geom_bar(stat = "identity", color = "black", fill = "#619CFF") +
geom_errorbar(mapping = aes(ymin = 0, ymax = delay + sd)) +
theme(axis.text.x = element_blank())
Why did we group_by year and month and not only year ?
5.3.2 Missing values
You may have wondered about the na.rm argument we used above. What happens if we don’t set it?
flights %>%
group_by(dest) %>%
summarise(
dist = mean(distance),
delay = mean(arr_delay)
)# A tibble: 105 × 3
dest dist delay
<chr> <dbl> <dbl>
1 ABQ 1826 4.38
2 ACK 199 NA
3 ALB 143 NA
4 ANC 3370 -2.5
5 ATL 757. NA
6 AUS 1514. NA
7 AVL 584. NA
8 BDL 116 NA
9 BGR 378 NA
10 BHM 866. NA
# ℹ 95 more rows
Aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value.
5.3.3 Counts
Whenever you do any aggregation, it’s always a good idea to include a count (n()). This way, you can check that you’re not drawing conclusions based on very small amounts of data.

Imagine that we want to explore the relationship between the average distance (distance) and average delay (arr_delay) for each location (dest) and recreate the above figure.
Hints Here are the steps to prepare those data:
- Group flights by destination.
- Summarize to compute average distance (
avg_distance), average delay (avg_delay), and number of flights usingn()(n_flights). - Filter to remove Honolulu airport (“HNL”), which is almost twice as far away as the next closest airport.
- Filter to remove noisy points with delay superior to 40 or inferior to -20
- Create a
mappingonavg_distance,avg_delayandn_flightsassize. - Use the layer
geom_point()andgeom_smooth()(use method = lm) - We can hide the legend with the layer
theme(legend.position='none')
Solution
flights %>%
group_by(dest) %>%
summarise(
n_flights = n(),
avg_distance = mean(distance, na.rm = TRUE),
avg_delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(dest != "HNL") %>%
filter(avg_delay < 40 & avg_delay > -20) %>%
ggplot(mapping = aes(x = avg_distance, y = avg_delay, size = n_flights)) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
theme(legend.position = "none")5.3.4 Ungrouping
If you need to remove grouping, and return to operations on ungrouped data, use ungroup().
Try the following example.
flights %>%
group_by(year, month, day) %>%
ungroup() %>%
summarise(delay = mean(dep_delay, na.rm = TRUE))# A tibble: 1 × 1
delay
<dbl>
1 12.6
5.4 Grouping challenges
5.4.1 First challenge
Look at the number of canceled flights per day. Is there a pattern?
(A canceled flight is a flight where either the dep_time or the arr_time is NA)
Remember to always try to decompose complex questions into smaller and simple problems
- How can you create a
canceledflights variable which will be TRUE if the flight is canceled or FALSE if not? - We need to define the day of the week
wdayvariable (Monday, Tuesday, …). To do that, you can usestrftime(x,'%A')to get the name of the day of axdate in the POSIXct format as in thetime_hourcolumn, ex:strftime("2013-01-01 05:00:00 EST",'%A')return “Tuesday” ). - We can count the number of canceled flight (
cancel_day) by day of the week (wday). - We can pipe transformed and filtered tibble into a
ggplotfunction. - We can use
geom_colto have a barplot of the number ofcancel_dayfor each.wday - You can use the function
fct_reorder()to reorder thewdayby number ofcancel_dayand make the plot easier to read.
Solution
flights %>%
mutate(
canceled = is.na(dep_time) | is.na(arr_time)
) %>%
filter(canceled) %>%
mutate(wday = strftime(time_hour, "%A")) %>%
group_by(wday) %>%
summarise(
cancel_day = n()
) %>%
ggplot(mapping = aes(x = fct_reorder(wday, cancel_day), y = cancel_day)) +
geom_col()
5.4.2 Second challenge
Is the proportion of canceled flights by day of the week related to the average departure delay?
Solution
flights %>%
mutate(
canceled = is.na(dep_time) | is.na(arr_time)
) %>%
mutate(wday = strftime(time_hour, "%A")) %>%
group_by(wday) %>%
summarise(
prop_cancel_day = sum(canceled) / n(),
avg_delay = mean(dep_delay, na.rm = TRUE)
) %>%
ungroup() %>%
ggplot(mapping = aes(x = avg_delay, y = prop_cancel_day, color = wday)) +
geom_point()
We can add error bars to this plot to justify our decision. Brainstorm a way to construct the error bars.
Hints:
We can define the error bars with confidence intervals.
cancel_daycan be modeled as a Bernoulli random variable: \(X \sim \mathcal{B}(p)\). The corresponding \(\alpha=5\%\) two-sided confidence interval is defined by: \[ \left[ \ \hat{p} \pm q_{1-\frac{\alpha}{2}} \sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}} \ \right] \ , \] where \(\hat{p}\) is the proportion of canceled flights for a given day, \(n\) is the number of flights for a given day and \(q_{1-\frac{\alpha}{2}}\) is the quantile of order \(1-\frac{\alpha}{2}\) of the Gaussian distribution \(\mathcal{N}(0, 1)\).dep_delaycan be modeled as a Gaussian random variable: \(X \sim \mathcal{N}(\mu, \sigma^2)\). The corresponding \(\alpha=5\%\) two-sided confidence interval is defined by: \[ \left[ \ \hat{\mu} \pm t_{1-\frac{\alpha}{2}, n-1} \frac{\hat{\sigma}}{\sqrt{n}} \ \right] \ , \] where \(\hat{\mu}\) is the average departure delay for a given day, \(\hat{\sigma}\) is the standard deviation of the departure delay for a given day and \(t_{1-\frac{\alpha}{2}}\) is the quantile of order \(1-\frac{\alpha}{2}\) of the Student distribution with \(n-1\) degree of freedom \(t_{n-1}\).We can draw error bars with the functions
geom_errorbarandgeom_errorbarh.
Solution
alpha <- 0.05
flights %>%
mutate(
canceled = is.na(dep_time) | is.na(arr_time),
wday = strftime(time_hour, "%A")
) %>%
group_by(wday) %>%
summarize(
n_obs = n(),
prop_cancel_day = sum(canceled) / n_obs,
sd_cancel_day = sqrt(prop_cancel_day * (1 - prop_cancel_day)),
avg_delay = mean(dep_delay, na.rm = T),
sd_delay = sd(dep_delay, na.rm = T)
) %>%
ggplot(mapping = aes(x = avg_delay, y = prop_cancel_day, color = wday)) +
geom_point() +
geom_errorbarh(
mapping = aes(
xmin = avg_delay - qt(1 - alpha / 2, n_obs - 1) * sd_delay / sqrt(n_obs),
xmax = avg_delay + qt(1 - alpha / 2, n_obs - 1) * sd_delay / sqrt(n_obs)
)
) +
geom_errorbar(
mapping = aes(
ymin = prop_cancel_day - qnorm(1 - alpha / 2) * sd_cancel_day / sqrt(n_obs),
ymax = prop_cancel_day + qnorm(1 - alpha / 2) * sd_cancel_day / sqrt(n_obs)
)
) +
theme_linedraw()Warning: `geom_errobarh()` was deprecated in ggplot2 4.0.0.
ℹ Please use the `orientation` argument of `geom_errorbar()` instead.

Instead of redefining the confidence interval by hand, we could use the corresponding pre-defined test functions prop.test and t.test. However, this approach requires to define two custom wrappers around these two functions to retrieve the appropriate outputs and format them as a tibble object. See the Solution bis below.
Solution bis
# custom wrapper around the prop.test function
custom_prop_test <- function(x, n){
test <- prop.test(x, n)
return(
tibble(
prop_cancel_day = test$estimate,
ymin = test$conf.int[1],
ymax = test$conf.int[2]
)
)
}
# custom wrapper around the t.test function
custom_t_test <- function(x){
test <- t.test(x)
return(
tibble(
avg_delay = test$estimate,
xmin = test$conf.int[1],
xmax = test$conf.int[2]
)
)
}
flights %>%
mutate(
canceled = is.na(dep_time) | is.na(arr_time),
wday = strftime(time_hour, "%A")
) %>%
group_by(wday) %>%
summarize(
n_obs = n(),
custom_prop_test(sum(canceled), n_obs),
custom_t_test(dep_delay)
) %>%
ggplot(mapping = aes(x = avg_delay, y = prop_cancel_day, color = wday)) +
geom_point() +
geom_errorbarh(mapping = aes(xmin = xmin, xmax = xmax)) +
geom_errorbar(mapping = aes(ymin = ymin, ymax = ymax)) +
theme_linedraw()
Now that you are aware of the interest of using geom_errorbar, what hour of the day should you fly if you want to avoid delays as much as possible?
Solution
flights %>%
group_by(hour) %>%
summarise(
mean_delay = mean(arr_delay, na.rm = T),
sd_delay = sd(arr_delay, na.rm = T),
) %>%
ggplot() +
geom_errorbar(
mapping = aes(
x = hour,
ymax = mean_delay + sd_delay,
ymin = mean_delay - sd_delay
)
) +
geom_point(
mapping = aes(
x = hour,
y = mean_delay,
)
)
5.4.3 Third challenge
Which carrier has the worst delays?
Solution
flights %>%
group_by(carrier) %>%
summarise(
carrier_delay = mean(arr_delay, na.rm = T)
) %>%
mutate(carrier = fct_reorder(carrier, carrier_delay, .na_rm = T)) %>%
ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
geom_col(alpha = 0.5)Can you disentangle the effects of bad airports vs. bad carriers?
Hints:
- Think about
group_by(carrier, dest). - We can color points per airport destination with the function
geom_jitter. - We can label points per airport destination with the function
geom_text_repelfrom packageggrepel. - We can control the jitter randomness with the function
position_jitter.
Solution
require(ggrepel)
flights %>%
group_by(carrier, dest) %>%
summarise(
carrier_delay = mean(arr_delay, na.rm = T),
nflight = n()
) %>%
ungroup() %>%
mutate(carrier = fct_reorder(carrier, carrier_delay, .na_rm = T)) %>%
ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(
aes(color = dest), # color points per destination
position = position_jitter(
width = 0.2, # small horizontal jitter
height = 0, # no vertical jitter
seed = 1 # to be reproducible
),
show.legend = FALSE # remove legend
) +
geom_text_repel(
aes(label = dest, color = dest), # color label per destination
max.overlaps = 10, # allow more labels to be drawn
position = position_jitter(
width = 0.2,
height = 0,
seed = 1
),
show.legend = FALSE
) +
theme_linedraw()See you in R.6: tidydata
License: Creative Commons CC-BY-SA-4.0.
Made with Quarto.