R#5: Pipping and grouping

Laurent Modolo laurent.modolo@ens-lyon.fr

2022

https://can.gitbiopages.ens-lyon.fr/R_basis/

1 Introduction

The goal of this practical is to practice combining data transformation with tidyverse. The objectives of this session will be to:

  • Combining multiple operations with the pipe %>%
  • Work on subgroup of the data with group_by

For this session we are going to work with a new dataset included in the nycflights13 package. Install this package and load it. As usual you will also need the tidyverse library.

Solution

library("tidyverse")
library("nycflights13")

2 Combining multiple operations with the pipe

Find the 10 most delayed flights using a ranking function. min_rank()

Solution

flights_md <- mutate(flights,
                     most_delay = min_rank(desc(dep_delay)))
flights_md <- filter(flights_md, most_delay < 10)
flights_md <- arrange(flights_md, most_delay)

We don’t want to create useless intermediate variables so we can use the pipe operator: %>% (or ctrl + shift + M).

Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z) and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom.

Try to pipe operators to rewrite your precedent code with only one variable assignment.

Solution

flights_md2 <- flights %>%
    mutate(most_delay = min_rank(desc(dep_delay))) %>%
    filter(most_delay < 10) %>%
    arrange(most_delay)

Working with the pipe is one of the key criteria for belonging to the tidyverse. The only exception is ggplot2: it was written before the pipe was discovered and use + instead of %>%. Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn’t quite ready for prime time yet.

The pipe is a powerful tool, but it’s not the only tool at your disposal, and it doesn’t solve every problem! Pipes are most useful for rewriting a fairly short linear sequence of operations. I think you should reach for another tool when:

2.1 When not to use the pipe

  • Your pipes are longer than (say) ten steps. In that case, create intermediate functions with meaningful names. That will make debugging easier, because you can more easily check the intermediate results, and it makes it easier to understand your code, because the variable names can help communicate intent.
  • You have multiple inputs or outputs. If there isn’t one primary object being transformed, but two or more objects being combined together, don’t use the pipe. You can create a function that combines or split the results.

3 Grouping variable

The summarise() function collapses a data frame to a single row. Check the difference between summarise() and mutate() with the following commands:

flights %>% 
  mutate(delay = mean(dep_delay, na.rm = TRUE))
flights %>% 
  summarise(delay = mean(dep_delay, na.rm = TRUE))

Where mutate compute the mean of dep_delay row by row (which is not useful), summarise compute the mean of the whole dep_delay column.

3.1 The power of summarise() with group_by()

The group_by() function changes the unit of analysis from the complete dataset to individual groups. Individual groups are defined by categorial variable or factors. Then, when you use the function you already know on grouped data frame and they’ll be automatically applied by groups.

You can use the following code to compute the average delay per months across years.

flights_delay <- flights %>% 
  group_by(year, month) %>% 
  summarise(delay = mean(dep_delay, na.rm = TRUE), sd = sd(dep_delay, na.rm = TRUE)) %>% 
  arrange(month)
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
ggplot(data = flights_delay, mapping = aes(x = month, y = delay)) +
  geom_bar(stat="identity", color="black", fill = "#619CFF") +
  geom_errorbar(mapping = aes( ymin=0, ymax=delay+sd)) + 
  theme(axis.text.x = element_blank())

Why did we group_by year and month and not only year ?

3.2 Missing values

You may have wondered about the na.rm argument we used above. What happens if we don’t set it?

flights %>% 
  group_by(dest) %>% 
  summarise(
    dist = mean(distance),
    delay = mean(arr_delay)
  )
# A tibble: 105 × 3
   dest   dist delay
   <chr> <dbl> <dbl>
 1 ABQ   1826   4.38
 2 ACK    199  NA   
 3 ALB    143  NA   
 4 ANC   3370  -2.5 
 5 ATL    757. NA   
 6 AUS   1514. NA   
 7 AVL    584. NA   
 8 BDL    116  NA   
 9 BGR    378  NA   
10 BHM    866. NA   
# … with 95 more rows

Aggregation functions obey the usual rule of missing values: if there’s any missing value in the input, the output will be a missing value.

3.3 Counts

Whenever you do any aggregation, it’s always a good idea to include either a count (n()). That way you can check that you’re not drawing conclusions based on very small amounts of data.

Imagine that we want to explore the relationship between the average distance (distance) and average delay (arr_delay) for each location (dest) and recreate the above figure. here are three steps to prepare this data:

  1. Group flights by destination.
  2. Summarize to compute average distance (avg_distance), average delay (avg_delay), and number of flights using n() (n_flights).
  3. Filter to remove Honolulu airport, which is almost twice as far away as the next closest airport.
  4. Filter to remove noisy points with delay superior to 40 or inferior to -20
  5. Create a mapping on avg_distance, avg_delay and n_flights as size.
  6. Use the layer geom_point() and geom_smooth() (use method = lm)
  7. We can hide the legend with the layer theme(legend.position='none')
Solution

flights %>% 
  group_by(dest) %>% 
  summarise(
    n_flights = n(),
    avg_distance = mean(distance, na.rm = TRUE),
    avg_delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(dest != "HNL") %>% 
  filter(avg_delay < 40 & avg_delay > -20) %>% 
  ggplot(mapping = aes(x = avg_distance, y = avg_delay, size = n_flights)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  theme(legend.position='none')

3.4 Ungrouping

If you need to remove grouping, and return to operations on ungrouped data, use ungroup().

Try the following example

flights %>% 
  group_by(year, month, day) %>% 
  ungroup() %>%
  summarise(delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 1 × 1
  delay
  <dbl>
1  12.6

4 Grouping challenges

4.1 First challenge

Look at the number of canceled flights per day. Is there a pattern?

(A canceled flight is a flight where the dep_time or the arr_time is NA)

Remember to always try to decompose complex questions into smaller and simple problems

  • How can you create a canceled flights variable which will be TRUE if the flight is canceled or FALSE if not?
  • We need to define the day of the week wday variable (Monday, Tuesday, …). To do that, you can use strftime(x,'%A') to get the name of the day of a x date in the POSIXct format as in the time_hour column, ex: strftime("2013-01-01 05:00:00 EST",'%A') return “Tuesday” ).
  • We can count the number of canceled flight (cancel_day) by day of the week (wday).
  • We can pipe transformed and filtered tibble into a ggplot function.
  • We can use geom_col to have a barplot of the number of cancel_day for each. wday
  • You can use the function fct_reorder() to reorder the wday by number of cancel_day and make the plot easier to read.
Solution

flights %>% 
  mutate(
    canceled = is.na(dep_time) | is.na(arr_time)
  ) %>% 
  filter(canceled) %>% 
  mutate(wday = strftime(time_hour,'%A')) %>% 
  group_by(wday) %>% 
  summarise(
    cancel_day = n()
  ) %>%
  ggplot(mapping = aes(x = fct_reorder(wday, cancel_day), y = cancel_day)) +
  geom_col()

4.2 Second challenge

Is the proportion of canceled flights by day of the week related to the average departure delay?

Solution

flights %>% 
  mutate(
    canceled = is.na(dep_time) | is.na(arr_time)
  ) %>% 
  mutate(wday = strftime(time_hour,'%A')) %>% 
  group_by(wday) %>% 
  summarise(
    prop_cancel_day = sum(canceled)/n(),
    av_delay = mean(dep_delay, na.rm = TRUE)
  ) %>%
  ungroup() %>% 
  ggplot(mapping = aes(x = av_delay, y = prop_cancel_day, color = wday)) +
  geom_point()

Which day would you prefer to book a flight ?

We can add error bars to this plot to justify our decision. Brainstorm a way to have access to the mean and standard deviation or the prop_cancel_day and av_delay.

Solution

flights %>% 
  mutate(
    canceled = is.na(dep_time) | is.na(arr_time)
  ) %>% 
  mutate(wday = strftime(time_hour,'%A')) %>% 
  group_by(day) %>% 
  mutate(
    prop_cancel_day = sum(canceled)/sum(!canceled),
    av_delay = mean(dep_delay, na.rm = TRUE)
  ) %>%
  group_by(wday) %>% 
  summarize(
    mean_cancel_day = mean(prop_cancel_day, na.rm = TRUE),
    sd_cancel_day = sd(prop_cancel_day, na.rm = TRUE),
    mean_av_delay = mean(av_delay, na.rm = TRUE),
    sd_av_delay = sd(av_delay, na.rm = TRUE)
  ) %>% 
  ggplot(mapping = aes(x = mean_av_delay, y = mean_cancel_day, color = wday)) +
  geom_point() +
  geom_errorbarh(mapping = aes(
    xmin = -sd_av_delay + mean_av_delay,
    xmax = sd_av_delay + mean_av_delay
  )) +
  geom_errorbar(mapping = aes(
    ymin = -sd_cancel_day + mean_cancel_day,
    ymax = sd_cancel_day + mean_cancel_day
  ))

Now that you are aware of the interest of using geom_errorbar, what hour of the day should you fly if you want to avoid delays as much as possible?

Solution

flights %>% 
  group_by(hour) %>% 
  summarise(
    mean_delay = mean(arr_delay, na.rm = T),
    sd_delay = sd(arr_delay, na.rm = T),
  ) %>% 
  ggplot() +
  geom_errorbar(mapping = aes(
    x = hour,
    ymax = mean_delay + sd_delay,
    ymin = mean_delay - sd_delay)) +
  geom_point(mapping = aes(
    x = hour,
    y = mean_delay,
  ))

4.3 Third challenge

Which carrier has the worst delays?

Solution

flights %>% 
  group_by(carrier) %>% 
  summarise(
    carrier_delay = mean(arr_delay, na.rm = T)
  ) %>%
  mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
  ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
  geom_col(alpha = 0.5)

Can you disentangle the effects of bad airports vs. bad carriers? (Hint: think about group_by(carrier, dest) %>% summarise(n=n()))

Solution

flights %>% 
  group_by(carrier, dest) %>% 
  summarise(
    carrier_delay = mean(arr_delay, na.rm = T),
    number_of_flight = n()
  ) %>%
  mutate(carrier = fct_reorder(carrier, carrier_delay)) %>%
  ggplot(mapping = aes(x = carrier, y = carrier_delay)) +
  geom_boxplot() +
  geom_jitter(height = 0)

4.4 See you in R.6: tidydata