R.6: tidydata

Laurent Modolo laurent.modolo@ens-lyon.fr; Carine Rey carine.rey@ens-lyon.fr

2022

https://can.gitbiopages.ens-lyon.fr/R_basis/

1 Introduction

Until now we have worked with data already formated in a nice way. In the tidyverse data formated in a nice way are called tidy The goal of this practical is to understand how to transform an hugly blob of information into a tidy data set.

1.1 Tidydata

There are three interrelated rules which make a dataset tidy:

  • Each variable must have its own column.
  • Each observation must have its own row.
  • Each value must have its own cell.

Doing this kind and transformation is often called data wrangling, due to the felling that we have to wrangle with the data to force them into a tidy format.

But once this step is finish most of the subsequent analysis will be realy fast to do !

As usual we will need the tidyverse library.

Solution

library(tidyverse)

For this practical we are going to use the table set of datasets which demonstrate multiple ways to layout the same tabular data.

Use the help to know more about table1 dataset

Solution
?table1

table1, table2, table3, table4a, table4b, and table5 all display the number of TB (Tuberculosis) cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000. The data contains values associated with four variables (country, year, cases, and population), but each table organizes the values in a different layout.

The data is a subset of the data contained in the World Health Organization Global Tuberculosis Report

2 Pivoting data

2.1 pivot longer

wide_example <- tibble(X1 = c("A","B"),
                      X2 = c(1,2),
                      X3 = c(0.1,0.2),
                      X4 = c(10,20))

If you have a wide dataset, such as wide_example, that you want to make longer, you will use the pivot_longer() function.

You have to specify the names of the columns you want to pivot into longer format (X2,X3,X4):

wide_example %>%
  pivot_longer(c(X2,X3,X4))

… or the reverse selection (-X1):

wide_example %>% pivot_longer(-X1)

You can specify the names of the columns where the data will be tidy (by default, it is names and value):

long_example <- wide_example %>%
  pivot_longer(-X1, names_to = "V1", values_to = "V2")

2.1.1 Exercice

Visualize the table4a dataset (you can use the View() function).

View(table4a)

Is the data tidy ? How would you transform this dataset to make it tidy ?

Solution

We have information about 3 variables in the table4a: country, year and number of cases. However, the variable information (year) is stored as column names. We want to pivot the horizontal column year, vertically and make the table longer.

You can use the pivot_longer fonction to make your table longer and have one observation per row and one variable per column.

For this we need to :

  • specify which column to select (all except country).
  • give the name of the new variable (year)
  • give the name of the variable stored in the cells of the columns years (case)
table4a %>% 
  pivot_longer(-country,
               names_to = "year",
               values_to = "case")
# A tibble: 6 × 3
  country     year    case
  <chr>       <chr>  <int>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

2.2 pivot wider

If you have a long dataset, that you want to make wider, you will use the pivot_wider() function.

You have to specify which column contains the name of the output column (names_from), and which column contains the cell values from (values_from).

long_example %>% pivot_wider(names_from = V1,
                             values_from = V2)

2.2.1 Exercice

Visualize the table2 dataset Is the data tidy ? How would you transform this dataset to make it tidy ? (you can now make also make a guess from the name of the subsection)

Solution

The column count store two types of information: the population size of the country and the number of cases in the country.

You can use the pivot_wider fonction to make your table wider and have one observation per row and one variable per column.

table2 %>% 
  pivot_wider(names_from = type,
              values_from = count)
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

3 Merging data

3.1 Relational data

To avoid having a huge table and to save space, information is often splited between different tables.

In our flights dataset, information about the carrier or the airports (origin and dest) are saved in a separate table (airlines, airports).

library(nycflights13)
flights
# A tibble: 336,776 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      517        515       2     830     819      11 UA     
 2  2013     1     1      533        529       4     850     830      20 UA     
 3  2013     1     1      542        540       2     923     850      33 AA     
 4  2013     1     1      544        545      -1    1004    1022     -18 B6     
 5  2013     1     1      554        600      -6     812     837     -25 DL     
 6  2013     1     1      554        558      -4     740     728      12 UA     
 7  2013     1     1      555        600      -5     913     854      19 B6     
 8  2013     1     1      557        600      -3     709     723     -14 EV     
 9  2013     1     1      557        600      -3     838     846      -8 B6     
10  2013     1     1      558        600      -2     753     745       8 AA     
# … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
airlines
# A tibble: 16 × 2
   carrier name                       
   <chr>   <chr>                      
 1 9E      Endeavor Air Inc.          
 2 AA      American Airlines Inc.     
 3 AS      Alaska Airlines Inc.       
 4 B6      JetBlue Airways            
 5 DL      Delta Air Lines Inc.       
 6 EV      ExpressJet Airlines Inc.   
 7 F9      Frontier Airlines Inc.     
 8 FL      AirTran Airways Corporation
 9 HA      Hawaiian Airlines Inc.     
10 MQ      Envoy Air                  
11 OO      SkyWest Airlines Inc.      
12 UA      United Air Lines Inc.      
13 US      US Airways Inc.            
14 VX      Virgin America             
15 WN      Southwest Airlines Co.     
16 YV      Mesa Airlines Inc.         
airports
# A tibble: 1,458 × 8
   faa   name                             lat    lon   alt    tz dst   tzone    
   <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
 1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
 2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
 3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
 4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
 5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
 6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
 7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
 8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
 9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
# … with 1,448 more rows
weather
# A tibble: 26,115 × 15
   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_g…¹
   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>    <dbl>
 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4        NA
 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06       NA
 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5        NA
 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7        NA
 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7        NA
 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5        NA
 7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0        NA
 8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4        NA
 9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0        NA
10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8        NA
# … with 26,105 more rows, 4 more variables: precip <dbl>, pressure <dbl>,
#   visib <dbl>, time_hour <dttm>, and abbreviated variable name ¹​wind_gust
flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)

3.2 Relational schema

The relationships between tables can be seen in a relational graph. The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation.

3.3 Joints

If you have to combine data from 2 tables in a a new table, you will use joints.

There are several types of joints depending of what you want to get.

Small concrete examples:

3.3.1 inner_joint()

keeps observations in x AND y

flights2 %>%
  inner_join(airlines)
Joining, by = "carrier"
# A tibble: 336,776 × 9
    year month   day  hour origin dest  tailnum carrier name                    
   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>                   
 1  2013     1     1     5 EWR    IAH   N14228  UA      United Air Lines Inc.   
 2  2013     1     1     5 LGA    IAH   N24211  UA      United Air Lines Inc.   
 3  2013     1     1     5 JFK    MIA   N619AA  AA      American Airlines Inc.  
 4  2013     1     1     5 JFK    BQN   N804JB  B6      JetBlue Airways         
 5  2013     1     1     6 LGA    ATL   N668DN  DL      Delta Air Lines Inc.    
 6  2013     1     1     5 EWR    ORD   N39463  UA      United Air Lines Inc.   
 7  2013     1     1     6 EWR    FLL   N516JB  B6      JetBlue Airways         
 8  2013     1     1     6 LGA    IAD   N829AS  EV      ExpressJet Airlines Inc.
 9  2013     1     1     6 JFK    MCO   N593JB  B6      JetBlue Airways         
10  2013     1     1     6 LGA    ORD   N3ALAA  AA      American Airlines Inc.  
# … with 336,766 more rows

3.3.2 left_joint()

keeps all observations in x

flights2 %>%
  left_join(airlines)
Joining, by = "carrier"
# A tibble: 336,776 × 9
    year month   day  hour origin dest  tailnum carrier name                    
   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>                   
 1  2013     1     1     5 EWR    IAH   N14228  UA      United Air Lines Inc.   
 2  2013     1     1     5 LGA    IAH   N24211  UA      United Air Lines Inc.   
 3  2013     1     1     5 JFK    MIA   N619AA  AA      American Airlines Inc.  
 4  2013     1     1     5 JFK    BQN   N804JB  B6      JetBlue Airways         
 5  2013     1     1     6 LGA    ATL   N668DN  DL      Delta Air Lines Inc.    
 6  2013     1     1     5 EWR    ORD   N39463  UA      United Air Lines Inc.   
 7  2013     1     1     6 EWR    FLL   N516JB  B6      JetBlue Airways         
 8  2013     1     1     6 LGA    IAD   N829AS  EV      ExpressJet Airlines Inc.
 9  2013     1     1     6 JFK    MCO   N593JB  B6      JetBlue Airways         
10  2013     1     1     6 LGA    ORD   N3ALAA  AA      American Airlines Inc.  
# … with 336,766 more rows

3.3.3 right_joint()

keeps all observations in y

flights2 %>%
  right_join(airlines)
Joining, by = "carrier"
# A tibble: 336,776 × 9
    year month   day  hour origin dest  tailnum carrier name                    
   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>                   
 1  2013     1     1     5 EWR    IAH   N14228  UA      United Air Lines Inc.   
 2  2013     1     1     5 LGA    IAH   N24211  UA      United Air Lines Inc.   
 3  2013     1     1     5 JFK    MIA   N619AA  AA      American Airlines Inc.  
 4  2013     1     1     5 JFK    BQN   N804JB  B6      JetBlue Airways         
 5  2013     1     1     6 LGA    ATL   N668DN  DL      Delta Air Lines Inc.    
 6  2013     1     1     5 EWR    ORD   N39463  UA      United Air Lines Inc.   
 7  2013     1     1     6 EWR    FLL   N516JB  B6      JetBlue Airways         
 8  2013     1     1     6 LGA    IAD   N829AS  EV      ExpressJet Airlines Inc.
 9  2013     1     1     6 JFK    MCO   N593JB  B6      JetBlue Airways         
10  2013     1     1     6 LGA    ORD   N3ALAA  AA      American Airlines Inc.  
# … with 336,766 more rows

3.3.4 full_joint()

keeps all observations in x and y

flights2 %>%
  full_join(airlines)
Joining, by = "carrier"
# A tibble: 336,776 × 9
    year month   day  hour origin dest  tailnum carrier name                    
   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>                   
 1  2013     1     1     5 EWR    IAH   N14228  UA      United Air Lines Inc.   
 2  2013     1     1     5 LGA    IAH   N24211  UA      United Air Lines Inc.   
 3  2013     1     1     5 JFK    MIA   N619AA  AA      American Airlines Inc.  
 4  2013     1     1     5 JFK    BQN   N804JB  B6      JetBlue Airways         
 5  2013     1     1     6 LGA    ATL   N668DN  DL      Delta Air Lines Inc.    
 6  2013     1     1     5 EWR    ORD   N39463  UA      United Air Lines Inc.   
 7  2013     1     1     6 EWR    FLL   N516JB  B6      JetBlue Airways         
 8  2013     1     1     6 LGA    IAD   N829AS  EV      ExpressJet Airlines Inc.
 9  2013     1     1     6 JFK    MCO   N593JB  B6      JetBlue Airways         
10  2013     1     1     6 LGA    ORD   N3ALAA  AA      American Airlines Inc.  
# … with 336,766 more rows

3.4 Defining the key columns

The default, by = NULL, uses all variables that appear in both tables, the so called natural join.

flights2 %>% 
  left_join(weather)
Joining, by = c("year", "month", "day", "hour", "origin")
# A tibble: 336,776 × 18
    year month   day  hour origin dest  tailnum carrier  temp  dewp humid
   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <dbl> <dbl> <dbl>
 1  2013     1     1     5 EWR    IAH   N14228  UA       39.0  28.0  64.4
 2  2013     1     1     5 LGA    IAH   N24211  UA       39.9  25.0  54.8
 3  2013     1     1     5 JFK    MIA   N619AA  AA       39.0  27.0  61.6
 4  2013     1     1     5 JFK    BQN   N804JB  B6       39.0  27.0  61.6
 5  2013     1     1     6 LGA    ATL   N668DN  DL       39.9  25.0  54.8
 6  2013     1     1     5 EWR    ORD   N39463  UA       39.0  28.0  64.4
 7  2013     1     1     6 EWR    FLL   N516JB  B6       37.9  28.0  67.2
 8  2013     1     1     6 LGA    IAD   N829AS  EV       39.9  25.0  54.8
 9  2013     1     1     6 JFK    MCO   N593JB  B6       37.9  27.0  64.3
10  2013     1     1     6 LGA    ORD   N3ALAA  AA       39.9  25.0  54.8
# … with 336,766 more rows, and 7 more variables: wind_dir <dbl>,
#   wind_speed <dbl>, wind_gust <dbl>, precip <dbl>, pressure <dbl>,
#   visib <dbl>, time_hour <dttm>

If the two tables contain columns with the same names but corresponding to different things (such as year in flights2 and planes) you have to manually define the key or the keys.

flights2 %>% 
  left_join(planes, by = "tailnum")
# A tibble: 336,776 × 16
   year.x month   day  hour origin dest  tailnum carrier year.y type     manuf…¹
    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>    <chr>  
 1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixed w… BOEING 
 2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixed w… BOEING 
 3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixed w… BOEING 
 4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixed w… AIRBUS 
 5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixed w… BOEING 
 6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixed w… BOEING 
 7   2013     1     1     6 EWR    FLL   N516JB  B6        2000 Fixed w… AIRBUS…
 8   2013     1     1     6 LGA    IAD   N829AS  EV        1998 Fixed w… CANADA…
 9   2013     1     1     6 JFK    MCO   N593JB  B6        2004 Fixed w… AIRBUS 
10   2013     1     1     6 LGA    ORD   N3ALAA  AA          NA <NA>     <NA>   
# … with 336,766 more rows, 5 more variables: model <chr>, engines <int>,
#   seats <int>, speed <int>, engine <chr>, and abbreviated variable name
#   ¹​manufacturer

If you want to join by data that are in two columns with different names, you must specify the correspondence with a named character vector: by = c("a" = "b"). This will match variable a in table x to variable b in table y.

flights2 %>% 
  left_join(airports, c("dest" = "faa"))
# A tibble: 336,776 × 15
    year month   day  hour origin dest  tailnum carrier name     lat   lon   alt
   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>  <dbl> <dbl> <dbl>
 1  2013     1     1     5 EWR    IAH   N14228  UA      Georg…  30.0 -95.3    97
 2  2013     1     1     5 LGA    IAH   N24211  UA      Georg…  30.0 -95.3    97
 3  2013     1     1     5 JFK    MIA   N619AA  AA      Miami…  25.8 -80.3     8
 4  2013     1     1     5 JFK    BQN   N804JB  B6      <NA>    NA    NA      NA
 5  2013     1     1     6 LGA    ATL   N668DN  DL      Harts…  33.6 -84.4  1026
 6  2013     1     1     5 EWR    ORD   N39463  UA      Chica…  42.0 -87.9   668
 7  2013     1     1     6 EWR    FLL   N516JB  B6      Fort …  26.1 -80.2     9
 8  2013     1     1     6 LGA    IAD   N829AS  EV      Washi…  38.9 -77.5   313
 9  2013     1     1     6 JFK    MCO   N593JB  B6      Orlan…  28.4 -81.3    96
10  2013     1     1     6 LGA    ORD   N3ALAA  AA      Chica…  42.0 -87.9   668
# … with 336,766 more rows, and 3 more variables: tz <dbl>, dst <chr>,
#   tzone <chr>

If 2 columns have identical names in the input tables but are not used in the join, they are automatically renamed with the suffix .x and .y because all column names must be different in the output table.

flights2 %>% 
  left_join(airports, c("dest" = "faa")) %>% 
  left_join(airports, c("origin" = "faa"))
# A tibble: 336,776 × 22
    year month   day  hour origin dest  tailnum carrier name.x lat.x lon.x alt.x
   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>  <dbl> <dbl> <dbl>
 1  2013     1     1     5 EWR    IAH   N14228  UA      Georg…  30.0 -95.3    97
 2  2013     1     1     5 LGA    IAH   N24211  UA      Georg…  30.0 -95.3    97
 3  2013     1     1     5 JFK    MIA   N619AA  AA      Miami…  25.8 -80.3     8
 4  2013     1     1     5 JFK    BQN   N804JB  B6      <NA>    NA    NA      NA
 5  2013     1     1     6 LGA    ATL   N668DN  DL      Harts…  33.6 -84.4  1026
 6  2013     1     1     5 EWR    ORD   N39463  UA      Chica…  42.0 -87.9   668
 7  2013     1     1     6 EWR    FLL   N516JB  B6      Fort …  26.1 -80.2     9
 8  2013     1     1     6 LGA    IAD   N829AS  EV      Washi…  38.9 -77.5   313
 9  2013     1     1     6 JFK    MCO   N593JB  B6      Orlan…  28.4 -81.3    96
10  2013     1     1     6 LGA    ORD   N3ALAA  AA      Chica…  42.0 -87.9   668
# … with 336,766 more rows, and 10 more variables: tz.x <dbl>, dst.x <chr>,
#   tzone.x <chr>, name.y <chr>, lat.y <dbl>, lon.y <dbl>, alt.y <dbl>,
#   tz.y <dbl>, dst.y <chr>, tzone.y <chr>

You can change the suffix using the option suffix

flights2 %>% 
  left_join(airports, by = c("dest" = "faa")) %>% 
  left_join(airports, by = c("origin" = "faa"), suffix = c(".dest",".origin"))
# A tibble: 336,776 × 22
    year month   day  hour origin dest  tailnum carrier name.d…¹ lat.d…² lon.d…³
   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>      <dbl>   <dbl>
 1  2013     1     1     5 EWR    IAH   N14228  UA      George …    30.0   -95.3
 2  2013     1     1     5 LGA    IAH   N24211  UA      George …    30.0   -95.3
 3  2013     1     1     5 JFK    MIA   N619AA  AA      Miami I…    25.8   -80.3
 4  2013     1     1     5 JFK    BQN   N804JB  B6      <NA>        NA      NA  
 5  2013     1     1     6 LGA    ATL   N668DN  DL      Hartsfi…    33.6   -84.4
 6  2013     1     1     5 EWR    ORD   N39463  UA      Chicago…    42.0   -87.9
 7  2013     1     1     6 EWR    FLL   N516JB  B6      Fort La…    26.1   -80.2
 8  2013     1     1     6 LGA    IAD   N829AS  EV      Washing…    38.9   -77.5
 9  2013     1     1     6 JFK    MCO   N593JB  B6      Orlando…    28.4   -81.3
10  2013     1     1     6 LGA    ORD   N3ALAA  AA      Chicago…    42.0   -87.9
# … with 336,766 more rows, 11 more variables: alt.dest <dbl>, tz.dest <dbl>,
#   dst.dest <chr>, tzone.dest <chr>, name.origin <chr>, lat.origin <dbl>,
#   lon.origin <dbl>, alt.origin <dbl>, tz.origin <dbl>, dst.origin <chr>,
#   tzone.origin <chr>, and abbreviated variable names ¹​name.dest, ²​lat.dest,
#   ³​lon.dest

3.5 Filtering joins

Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:

  • semi_join(x, y) keeps all observations in x that have a match in y.
  • anti_join(x, y) drops all observations in x that have a match in y.
top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(10)
flights %>% 
  semi_join(top_dest)
Joining, by = "dest"
# A tibble: 141,145 × 19
    year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
   <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
 1  2013     1     1      542        540       2     923     850      33 AA     
 2  2013     1     1      554        600      -6     812     837     -25 DL     
 3  2013     1     1      554        558      -4     740     728      12 UA     
 4  2013     1     1      555        600      -5     913     854      19 B6     
 5  2013     1     1      557        600      -3     838     846      -8 B6     
 6  2013     1     1      558        600      -2     753     745       8 AA     
 7  2013     1     1      558        600      -2     924     917       7 UA     
 8  2013     1     1      558        600      -2     923     937     -14 UA     
 9  2013     1     1      559        559       0     702     706      -4 B6     
10  2013     1     1      600        600       0     851     858      -7 B6     
# … with 141,135 more rows, 9 more variables: flight <int>, tailnum <chr>,
#   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#   minute <dbl>, time_hour <dttm>, and abbreviated variable names
#   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

3.6 Set operations

These expect the x and y inputs to have the same variables, and treat the observations like sets:

  • intersect(x, y): return only observations in both x and y.
  • union(x, y): return unique observations in x and y.
  • setdiff(x, y): return observations in x, but not in y.

3.7 See you in R.7: String & RegExp