library(tidyverse)
6 R.6: tidydata
6.1 Introduction
Until now we have worked with data already formatted in a nice way. In the tidyverse
data formatted in a nice way are called tidy The goal of this session is to understand how to transform an ugly blob of information into a tidy data set.
6.1.1 Tidydata
There are three interrelated rules which make a dataset tidy:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
Doing this kind and transformation is often called data wrangling, due to the feeling that we have to wrangle (struggle) with the data to force them into a tidy format.
But once this step is finish most of the subsequent analysis will be really fast to do !
As usual we will need the tidyverse
library.
Solution
For this session, we are going to use the table*
set of datasets which demonstrate multiple ways to layout the same tabular data.
Use the help to know more about table1
dataset.
Solution
?table1
table1
, table2
, table3
, table4a
, table4b
, and table5
all display the number of TB (Tuberculosis) cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000. The data contains values associated with four variables (country, year, cases, and population), but each table organizes the values in a different layout.
6.2 Pivoting data
6.2.1 pivot longer
<- tibble(
wide_example X1 = c("A", "B"),
X2 = c(1, 2),
X3 = c(0.1, 0.2),
X4 = c(10, 20)
)
If you have a wide dataset, such as wide_example
, that you want to make longer, you will use the pivot_longer()
function.
You have to specify the names of the columns you want to pivot into longer format (X2, X3, X4):
%>%
wide_example pivot_longer(c(X2, X3, X4))
… or the reverse selection (-X1):
%>%
wide_example pivot_longer(-X1)
You can specify the names of the columns where the data will be tidy (by default, it is names
and value
):
<- wide_example %>%
long_example pivot_longer(-X1, names_to = "V1", values_to = "V2")
6.2.1.1 Exercice
Visualize the table4a
dataset (you can use the View()
function).
View(table4a)
Is the data tidy ? How would you transform this dataset to make it tidy ?
Solution
We have information about 3 variables in table4a
: country
, year
and number of cases
. However, the variable information (year
) is stored as column names. We want to pivot the horizontal column year, vertically and make the table longer.
You can use the pivot_longer
function to make your table longer and have one observation per row and one variable per column.
For this we need to :
- specify which column to select (all except
country
). - give the name of the new variable (
year
) - give the name of the variable stored in the cells of the columns years (
case
)
%>%
table4a pivot_longer(-country,
names_to = "year",
values_to = "case"
)
# A tibble: 6 × 3
country year case
<chr> <chr> <dbl>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
6.2.2 pivot wider
If you have a long dataset, that you want to make wider, you will use the pivot_wider()
function.
You have to specify which column contains the name of the output column (names_from
), and which column contains the cell values from (values_from
).
%>% pivot_wider(
long_example names_from = V1,
values_from = V2
)
6.2.2.1 Exercice
Visualize the table2
dataset.
Is the data tidy ? How would you transform this dataset to make it tidy ? (you can now make also make a guess from the name of the subsection)
Solution
The column count
store two types of information: the population
size of the country and the number of cases
in the country.
You can use the pivot_wider
function to make your table wider and have one observation per row and one variable per column.
%>%
table2 pivot_wider(
names_from = type,
values_from = count
)
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
6.3 Merging data
6.3.1 Relational data
To avoid having a huge table and to save space, information is often split between different tables.
In our flights
dataset, information about the carrier
or the airports
(origin and dest) are saved in a separate table (airlines
, airports
).
library(nycflights13)
flights
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
airlines
# A tibble: 16 × 2
carrier name
<chr> <chr>
1 9E Endeavor Air Inc.
2 AA American Airlines Inc.
3 AS Alaska Airlines Inc.
4 B6 JetBlue Airways
5 DL Delta Air Lines Inc.
6 EV ExpressJet Airlines Inc.
7 F9 Frontier Airlines Inc.
8 FL AirTran Airways Corporation
9 HA Hawaiian Airlines Inc.
10 MQ Envoy Air
11 OO SkyWest Airlines Inc.
12 UA United Air Lines Inc.
13 US US Airways Inc.
14 VX Virgin America
15 WN Southwest Airlines Co.
16 YV Mesa Airlines Inc.
airports
# A tibble: 1,458 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
# ℹ 1,448 more rows
weather
# A tibble: 26,115 × 15
origin year month day hour temp dewp humid wind_dir wind_speed
<chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0
8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4
9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0
10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8
# ℹ 26,105 more rows
# ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
# visib <dbl>, time_hour <dttm>
<- flights %>%
flights2 select(year:day, hour, origin, dest, tailnum, carrier)
6.3.2 Relational schema
Relationships between tables can be displayed in a relational graph. The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation.
6.3.3 Joints
If you have to combine data from two tables in a new one, you will use *_joint
functions.
There are several types of joints depending of what you want to get.
Small concrete examples:
6.3.3.1 inner_joint()
Keeps observations in x
AND y
%>%
flights2 inner_join(airlines)
Joining with `by = join_by(carrier)`
# A tibble: 336,776 × 9
year month day hour origin dest tailnum carrier name
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
1 2013 1 1 5 EWR IAH N14228 UA United Air Lines Inc.
2 2013 1 1 5 LGA IAH N24211 UA United Air Lines Inc.
3 2013 1 1 5 JFK MIA N619AA AA American Airlines Inc.
4 2013 1 1 5 JFK BQN N804JB B6 JetBlue Airways
5 2013 1 1 6 LGA ATL N668DN DL Delta Air Lines Inc.
6 2013 1 1 5 EWR ORD N39463 UA United Air Lines Inc.
7 2013 1 1 6 EWR FLL N516JB B6 JetBlue Airways
8 2013 1 1 6 LGA IAD N829AS EV ExpressJet Airlines Inc.
9 2013 1 1 6 JFK MCO N593JB B6 JetBlue Airways
10 2013 1 1 6 LGA ORD N3ALAA AA American Airlines Inc.
# ℹ 336,766 more rows
6.3.3.2 left_joint()
Keeps all observations in x
%>%
flights2 left_join(airlines)
Joining with `by = join_by(carrier)`
# A tibble: 336,776 × 9
year month day hour origin dest tailnum carrier name
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
1 2013 1 1 5 EWR IAH N14228 UA United Air Lines Inc.
2 2013 1 1 5 LGA IAH N24211 UA United Air Lines Inc.
3 2013 1 1 5 JFK MIA N619AA AA American Airlines Inc.
4 2013 1 1 5 JFK BQN N804JB B6 JetBlue Airways
5 2013 1 1 6 LGA ATL N668DN DL Delta Air Lines Inc.
6 2013 1 1 5 EWR ORD N39463 UA United Air Lines Inc.
7 2013 1 1 6 EWR FLL N516JB B6 JetBlue Airways
8 2013 1 1 6 LGA IAD N829AS EV ExpressJet Airlines Inc.
9 2013 1 1 6 JFK MCO N593JB B6 JetBlue Airways
10 2013 1 1 6 LGA ORD N3ALAA AA American Airlines Inc.
# ℹ 336,766 more rows
6.3.3.3 right_joint()
Keeps all observations in y
%>%
flights2 right_join(airlines)
Joining with `by = join_by(carrier)`
# A tibble: 336,776 × 9
year month day hour origin dest tailnum carrier name
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
1 2013 1 1 5 EWR IAH N14228 UA United Air Lines Inc.
2 2013 1 1 5 LGA IAH N24211 UA United Air Lines Inc.
3 2013 1 1 5 JFK MIA N619AA AA American Airlines Inc.
4 2013 1 1 5 JFK BQN N804JB B6 JetBlue Airways
5 2013 1 1 6 LGA ATL N668DN DL Delta Air Lines Inc.
6 2013 1 1 5 EWR ORD N39463 UA United Air Lines Inc.
7 2013 1 1 6 EWR FLL N516JB B6 JetBlue Airways
8 2013 1 1 6 LGA IAD N829AS EV ExpressJet Airlines Inc.
9 2013 1 1 6 JFK MCO N593JB B6 JetBlue Airways
10 2013 1 1 6 LGA ORD N3ALAA AA American Airlines Inc.
# ℹ 336,766 more rows
6.3.3.4 full_joint()
Keeps all observations in x
and y
%>%
flights2 full_join(airlines)
Joining with `by = join_by(carrier)`
# A tibble: 336,776 × 9
year month day hour origin dest tailnum carrier name
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
1 2013 1 1 5 EWR IAH N14228 UA United Air Lines Inc.
2 2013 1 1 5 LGA IAH N24211 UA United Air Lines Inc.
3 2013 1 1 5 JFK MIA N619AA AA American Airlines Inc.
4 2013 1 1 5 JFK BQN N804JB B6 JetBlue Airways
5 2013 1 1 6 LGA ATL N668DN DL Delta Air Lines Inc.
6 2013 1 1 5 EWR ORD N39463 UA United Air Lines Inc.
7 2013 1 1 6 EWR FLL N516JB B6 JetBlue Airways
8 2013 1 1 6 LGA IAD N829AS EV ExpressJet Airlines Inc.
9 2013 1 1 6 JFK MCO N593JB B6 JetBlue Airways
10 2013 1 1 6 LGA ORD N3ALAA AA American Airlines Inc.
# ℹ 336,766 more rows
6.3.4 Defining the key columns
The default, by = NULL
, uses all variables that appear in both tables, the so called natural join.
%>%
flights2 left_join(weather)
Joining with `by = join_by(year, month, day, hour, origin)`
# A tibble: 336,776 × 18
year month day hour origin dest tailnum carrier temp dewp humid
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2013 1 1 5 EWR IAH N14228 UA 39.0 28.0 64.4
2 2013 1 1 5 LGA IAH N24211 UA 39.9 25.0 54.8
3 2013 1 1 5 JFK MIA N619AA AA 39.0 27.0 61.6
4 2013 1 1 5 JFK BQN N804JB B6 39.0 27.0 61.6
5 2013 1 1 6 LGA ATL N668DN DL 39.9 25.0 54.8
6 2013 1 1 5 EWR ORD N39463 UA 39.0 28.0 64.4
7 2013 1 1 6 EWR FLL N516JB B6 37.9 28.0 67.2
8 2013 1 1 6 LGA IAD N829AS EV 39.9 25.0 54.8
9 2013 1 1 6 JFK MCO N593JB B6 37.9 27.0 64.3
10 2013 1 1 6 LGA ORD N3ALAA AA 39.9 25.0 54.8
# ℹ 336,766 more rows
# ℹ 7 more variables: wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
# precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
If the two tables contain columns with the same names but corresponding to different things (such as year
in flights2
and planes
) you have to manually define the key or the keys.
%>%
flights2 left_join(planes, by = "tailnum")
# A tibble: 336,776 × 16
year.x month day hour origin dest tailnum carrier year.y type
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int> <chr>
1 2013 1 1 5 EWR IAH N14228 UA 1999 Fixed wing mult…
2 2013 1 1 5 LGA IAH N24211 UA 1998 Fixed wing mult…
3 2013 1 1 5 JFK MIA N619AA AA 1990 Fixed wing mult…
4 2013 1 1 5 JFK BQN N804JB B6 2012 Fixed wing mult…
5 2013 1 1 6 LGA ATL N668DN DL 1991 Fixed wing mult…
6 2013 1 1 5 EWR ORD N39463 UA 2012 Fixed wing mult…
7 2013 1 1 6 EWR FLL N516JB B6 2000 Fixed wing mult…
8 2013 1 1 6 LGA IAD N829AS EV 1998 Fixed wing mult…
9 2013 1 1 6 JFK MCO N593JB B6 2004 Fixed wing mult…
10 2013 1 1 6 LGA ORD N3ALAA AA NA <NA>
# ℹ 336,766 more rows
# ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
# seats <int>, speed <int>, engine <chr>
If you want to join by data that are in two columns with different names, you must specify the correspondence with a named character vector: by = c("a" = "b")
. This will match variable a
in table x
to variable b
in table y
.
%>%
flights2 left_join(airports, c("dest" = "faa"))
# A tibble: 336,776 × 15
year month day hour origin dest tailnum carrier name lat lon alt
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2013 1 1 5 EWR IAH N14228 UA Georg… 30.0 -95.3 97
2 2013 1 1 5 LGA IAH N24211 UA Georg… 30.0 -95.3 97
3 2013 1 1 5 JFK MIA N619AA AA Miami… 25.8 -80.3 8
4 2013 1 1 5 JFK BQN N804JB B6 <NA> NA NA NA
5 2013 1 1 6 LGA ATL N668DN DL Harts… 33.6 -84.4 1026
6 2013 1 1 5 EWR ORD N39463 UA Chica… 42.0 -87.9 668
7 2013 1 1 6 EWR FLL N516JB B6 Fort … 26.1 -80.2 9
8 2013 1 1 6 LGA IAD N829AS EV Washi… 38.9 -77.5 313
9 2013 1 1 6 JFK MCO N593JB B6 Orlan… 28.4 -81.3 96
10 2013 1 1 6 LGA ORD N3ALAA AA Chica… 42.0 -87.9 668
# ℹ 336,766 more rows
# ℹ 3 more variables: tz <dbl>, dst <chr>, tzone <chr>
If two columns have identical names in the input tables but are not used in the join, they are automatically renamed with the suffix .x
and .y
because all column names must be different in the output table.
%>%
flights2 left_join(airports, c("dest" = "faa")) %>%
left_join(airports, c("origin" = "faa"))
# A tibble: 336,776 × 22
year month day hour origin dest tailnum carrier name.x lat.x lon.x alt.x
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2013 1 1 5 EWR IAH N14228 UA Georg… 30.0 -95.3 97
2 2013 1 1 5 LGA IAH N24211 UA Georg… 30.0 -95.3 97
3 2013 1 1 5 JFK MIA N619AA AA Miami… 25.8 -80.3 8
4 2013 1 1 5 JFK BQN N804JB B6 <NA> NA NA NA
5 2013 1 1 6 LGA ATL N668DN DL Harts… 33.6 -84.4 1026
6 2013 1 1 5 EWR ORD N39463 UA Chica… 42.0 -87.9 668
7 2013 1 1 6 EWR FLL N516JB B6 Fort … 26.1 -80.2 9
8 2013 1 1 6 LGA IAD N829AS EV Washi… 38.9 -77.5 313
9 2013 1 1 6 JFK MCO N593JB B6 Orlan… 28.4 -81.3 96
10 2013 1 1 6 LGA ORD N3ALAA AA Chica… 42.0 -87.9 668
# ℹ 336,766 more rows
# ℹ 10 more variables: tz.x <dbl>, dst.x <chr>, tzone.x <chr>, name.y <chr>,
# lat.y <dbl>, lon.y <dbl>, alt.y <dbl>, tz.y <dbl>, dst.y <chr>,
# tzone.y <chr>
You can change the suffix using the option suffix
%>%
flights2 left_join(airports, by = c("dest" = "faa")) %>%
left_join(airports, by = c("origin" = "faa"), suffix = c(".dest", ".origin"))
# A tibble: 336,776 × 22
year month day hour origin dest tailnum carrier name.dest lat.dest
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl>
1 2013 1 1 5 EWR IAH N14228 UA George Bush In… 30.0
2 2013 1 1 5 LGA IAH N24211 UA George Bush In… 30.0
3 2013 1 1 5 JFK MIA N619AA AA Miami Intl 25.8
4 2013 1 1 5 JFK BQN N804JB B6 <NA> NA
5 2013 1 1 6 LGA ATL N668DN DL Hartsfield Jac… 33.6
6 2013 1 1 5 EWR ORD N39463 UA Chicago Ohare … 42.0
7 2013 1 1 6 EWR FLL N516JB B6 Fort Lauderdal… 26.1
8 2013 1 1 6 LGA IAD N829AS EV Washington Dul… 38.9
9 2013 1 1 6 JFK MCO N593JB B6 Orlando Intl 28.4
10 2013 1 1 6 LGA ORD N3ALAA AA Chicago Ohare … 42.0
# ℹ 336,766 more rows
# ℹ 12 more variables: lon.dest <dbl>, alt.dest <dbl>, tz.dest <dbl>,
# dst.dest <chr>, tzone.dest <chr>, name.origin <chr>, lat.origin <dbl>,
# lon.origin <dbl>, alt.origin <dbl>, tz.origin <dbl>, dst.origin <chr>,
# tzone.origin <chr>
6.3.5 Filtering joins
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
semi_join(x, y)
keeps all observations inx
that have a match iny
.anti_join(x, y)
drops all observations inx
that have a match iny
.
<- flights %>%
top_dest count(dest, sort = TRUE) %>%
head(10)
%>%
flights semi_join(top_dest)
Joining with `by = join_by(dest)`
# A tibble: 141,145 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 542 540 2 923 850
2 2013 1 1 554 600 -6 812 837
3 2013 1 1 554 558 -4 740 728
4 2013 1 1 555 600 -5 913 854
5 2013 1 1 557 600 -3 838 846
6 2013 1 1 558 600 -2 753 745
7 2013 1 1 558 600 -2 924 917
8 2013 1 1 558 600 -2 923 937
9 2013 1 1 559 559 0 702 706
10 2013 1 1 600 600 0 851 858
# ℹ 141,135 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
6.3.6 Set operations
These expect the x
and y
inputs to have the same variables, and treat the observations like sets:
intersect(x, y)
: return only observations in bothx
andy
.union(x, y)
: return unique observations inx
andy
.setdiff(x, y)
: return observations inx
, but not iny
.
See you in R.7: String & RegExp
License: FIXME.
Made with Quarto.