R.8: Factors

Laurent Modolo laurent.modolo@ens-lyon.fr

2022

https://can.gitbiopages.ens-lyon.fr/R_basis/

1 Introduction

In this session, you will learn more about the factor type in R. Factors can be very useful, but you have to be mindful of the implicit conversions from simple vectors to factors! These conversions are the source of a lot of pain for R programmers.
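
As an illustration of this kind of pain, here is a minimal base R sketch. The data frame df and its size column are hypothetical, and stringsAsFactors = TRUE is set explicitly to reproduce the implicit conversion that was the default before R 4.0.0.

```r
# Hypothetical data: numbers stored as text, converted to a factor on creation.
# stringsAsFactors = TRUE was the default behaviour before R 4.0.0.
df <- data.frame(size = c("10", "20", "5"), stringsAsFactors = TRUE)

class(df$size)
# [1] "factor"

# as.numeric() returns the internal integer codes, not the original numbers
codes <- as.numeric(df$size)
codes
# [1] 1 2 3

# Converting to character first recovers the values
values <- as.numeric(as.character(df$size))
values
# [1] 10 20  5
```

The codes follow the order of the levels ("10", "20", "5"), which is why the result looks plausible but is wrong.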

As usual we will need the tidyverse library.

library(tidyverse)

2 Creating factors

Imagine that you have a variable that records month:

x1 <- c("Dec", "Apr", "Jan", "Mar")

Using a string to record this variable has two problems:

  1. There are only twelve possible months, and there’s nothing saving you from typos:
x2 <- c("Dec", "Apr", "Jam", "Mar")
  2. It doesn’t sort in a useful way:
sort(x1)
[1] "Apr" "Dec" "Jan" "Mar"

You can fix both of these problems with a factor.

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
y1 <- factor(x1, levels = month_levels)
y1
[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(y1)
[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

And any values not in the set will be converted to NA. With factor() this happens silently; parse_factor() from readr raises a warning instead:

y2 <- parse_factor(x2, levels = month_levels)
Warning: 1 parsing failure.
row col           expected actual
  3  -- value in level set    Jam
y2
[1] Dec  Apr  <NA> Mar 
attr(,"problems")
# A tibble: 1 × 4
    row   col expected           actual
  <int> <int> <chr>              <chr> 
1     3    NA value in level set Jam   
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
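
For comparison, the following base R sketch shows the silent conversion done by factor() and one possible way to list the offending values afterwards. The names y2_silent and bad_values are introduced here for illustration only.

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x2 <- c("Dec", "Apr", "Jam", "Mar")

# factor() converts values outside the level set to NA without any warning
y2_silent <- factor(x2, levels = month_levels)

# Values that became NA although the input itself was not missing
bad_values <- x2[is.na(y2_silent) & !is.na(x2)]
bad_values
# [1] "Jam"
```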

Sometimes you’d prefer that the order of the levels match the order of the first appearance in the data.

f2 <- x1 %>% factor() %>% fct_inorder()
f2
[1] Dec Apr Jan Mar
Levels: Dec Apr Jan Mar
levels(f2)
[1] "Dec" "Apr" "Jan" "Mar"
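
If forcats is not available, the same ordering can be sketched in base R by passing the unique values, in their order of appearance, as levels. The name f2_base is an assumption made for this example.

```r
x1 <- c("Dec", "Apr", "Jan", "Mar")

# Without explicit levels, factor() sorts the levels alphabetically
levels(factor(x1))
# [1] "Apr" "Dec" "Jan" "Mar"

# unique() keeps the order of first appearance
f2_base <- factor(x1, levels = unique(x1))
levels(f2_base)
# [1] "Dec" "Apr" "Jan" "Mar"
```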

3 General Social Survey

gss_cat is a sample of data from the General Social Survey that ships with the forcats package (loaded with the tidyverse). We can count the respondents for each level of the race factor:

gss_cat %>%
  count(race)
# A tibble: 3 × 2
  race      n
  <fct> <int>
1 Other  1959
2 Black  3129
3 White 16395

By default, ggplot2 will drop levels that don’t have any values. You can force them to display with:

ggplot(gss_cat, aes(x = race)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)
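
To see what a level without values looks like, here is a small base R sketch on a hypothetical factor named answers. Running levels(gss_cat$race) should, in the same way, list every level of race, including any that never occurs in the data.

```r
# Hypothetical factor with a declared level ("maybe") that never occurs
answers <- factor(c("yes", "no", "yes"), levels = c("yes", "no", "maybe"))

# table() keeps the empty level and reports a count of 0 for it
counts <- table(answers)
counts

# droplevels() removes the levels that have no values
trimmed <- droplevels(answers)
levels(trimmed)
# [1] "yes" "no"
```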

4 Modifying factor order

It’s often useful to change the order of the factor levels in a visualisation.

relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
ggplot(relig_summary, aes(x = tvhours, y = relig)) + geom_point()

It is difficult to interpret this plot because there’s no overall pattern. We can improve it by reordering the levels of relig using fct_reorder(). fct_reorder() takes three arguments:

  • f, the factor whose levels you want to modify (named .f in current versions of forcats).
  • x, a numeric vector that you want to use to reorder the levels (named .x).
  • Optionally, fun, a function that’s used if there are multiple values of x for each value of f (named .fun). The default value is median.
ggplot(relig_summary, aes(x = tvhours, y = fct_reorder(relig, tvhours))) +
  geom_point()
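
To make the reordering rule concrete, here is a minimal base R sketch on hypothetical toy vectors f and x. It summarises x for each level of f with the median and sorts the levels by that summary in increasing order; it is an illustration of the idea, not the forcats implementation. The base function reorder() from the stats package does a similar job.

```r
# Hypothetical toy data: two values of x per level of f
f <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(5, 7, 1, 3, 9, 2)

# One summary value per level: a = 6, b = 2, c = 5.5
med <- tapply(x, f, median)

# Rebuild the factor with the levels sorted by their median
f_manual <- factor(f, levels = names(sort(med)))
levels(f_manual)
# [1] "b" "c" "a"

# Same level order with stats::reorder()
f_reorder <- reorder(f, x, FUN = median)
levels(f_reorder)
# [1] "b" "c" "a"
```

Only the order of the levels changes; the values stored in the factor stay the same.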

As you start making more complicated transformations, I’d recommend moving them out of aes() and into a separate mutate() step. For example, you could rewrite the plot above as:

relig_summary %>%
  mutate(relig = fct_reorder(relig, tvhours)) %>%
  ggplot(aes(x = tvhours, y = relig)) +
    geom_point()

5 fct_reorder2()

Another type of reordering is useful when you are colouring the lines on a plot. fct_reorder2() reorders the factor by the y values associated with the largest x values. This makes the plot easier to read because the line colours line up with the legend.

by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))
ggplot(by_age, aes(x = age, y = prop, colour = marital)) +
  geom_line(na.rm = TRUE)

ggplot(by_age, aes(x = age, y = prop, colour = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(colour = "marital")
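
The rule used by fct_reorder2() can be sketched in base R on a hypothetical data frame df with columns group, x and y: for each group, take the y value at the largest x, then order the levels by that value, largest first. This is a simplified illustration under those assumptions, not the forcats code.

```r
# Hypothetical toy data: three lines (a, b, c) observed at x = 1, 2, 3
df <- data.frame(
  group = rep(c("a", "b", "c"), each = 3),
  x = rep(1:3, times = 3),
  y = c(1, 2, 3, 9, 5, 1, 4, 6, 8)
)

# y value at the largest x for each group: a = 3, b = 1, c = 8
last_y <- sapply(split(df, df$group), function(d) d$y[which.max(d$x)])

# Largest final value first, so the legend follows the right-hand end of the lines
ordered_levels <- names(sort(last_y, decreasing = TRUE))
df$group <- factor(df$group, levels = ordered_levels)
levels(df$group)
# [1] "c" "a" "b"
```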

6 Materials

There is a lot of material online for R, and more particularly for the tidyverse and RStudio.

You can find cheat sheets for all the packages of the tidyverse on this page: https://www.rstudio.com/resources/cheatsheets/

The RStudio websites are also a good place to learn more about R and the meta-package maintained by the RStudio community.

For example, rmarkdown is a great way to turn your analyses into high quality documents, reports, presentations and dashboards.

In addition, most packages provide vignettes on how to perform an analysis from scratch. On the bioconductor.org website (specialised in R packages for biologists), you will find direct links to the vignettes of each package.

Finally, don’t forget to search the web for your problems or errors in R. Websites like Stack Overflow contain high-quality and well-curated answers.