Workshop on R

May 2018

Hosted by Virginia Education Science Training (VEST) Program at UVA

Overview
Schedule
Getting started
Modules
Data

Reshaping data

  

Reshaping data is a common data wrangling task. Whether going from wide to long format or long to wide, this can be a painful process. Though you can reshape data frames using base R commands, the best way I know to reshape data in R is by using functions in the tidyr library.

## library
library(tidyverse)
── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──

✔ ggplot2 3.1.0     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.8
✔ tidyr   0.8.2     ✔ stringr 1.3.1
✔ readr   1.3.1     ✔ forcats 0.3.0

── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

Create toy data

For clarity, we’ll use toy data for this example. Let’s say these data represent average school-wide test scores on three tests, math, reading, and science, for four schools (A, B, C, and D), in 2013.

df <- data.frame(schid = c('A','B','C','D'),
                 year = 2013,
                 math = round(rnorm(4, 500, 10)),
                 read = round(rnorm(4, 300, 20)),
                 science = round(rnorm(4, 800, 15)),
                 stringsAsFactors = FALSE) %>%
    tbl_df()

This data structure should be wide and look like this:

schid year math read science
A 2013 496 306 814
B 2013 504 280 786
C 2013 507 322 788
D 2013 486 307 807
## confirm that it is wide
df
# A tibble: 4 x 5
  schid  year  math  read science
  <chr> <dbl> <dbl> <dbl>   <dbl>
1 A      2013   496   306     814
2 B      2013   504   280     786
3 C      2013   507   322     788
4 D      2013   486   307     807

Wide –> long

To start, the data are wide. While this setup can be efficient for storage, it’s not always the best for analysis or even just browsing. What we want is for the data to be long. Instead of each test having its own column, there should be one column for the test subject (test) and another column that gives the value (score). It should look like this:

schid year test score
A 2013 math 496
A 2013 read 306
A 2013 science 814
B 2013 math 504
B 2013 read 280
B 2013 science 786
C 2013 math 507
C 2013 read 322
C 2013 science 788
D 2013 math 486
D 2013 read 307
D 2013 science 807

To go from wide to long format, use the gather(key, value) function, where key is a new column that will hold all the variable names that were columns in the wide data frame and value is a new column that will hold their values. In our case, the key will be test (we can call it what we want), because that’s what we’re gathering in. The value will be score and will hold the test score values.

We don’t want to gather every column, though. Since we want the schid and year columns to remain associated with their rows, we ignore them (-c(...)). By ignoring them, gather() will repeat their values down the rows as necessary.

Finally, to tidy up, we arrange() the data by school ID and the variable name.

df_long <- df %>%
    gather(test, score, -c(schid, year)) %>%
    arrange(schid, test)

## show
df_long
# A tibble: 12 x 4
   schid  year test    score
   <chr> <dbl> <chr>   <dbl>
 1 A      2013 math      496
 2 A      2013 read      306
 3 A      2013 science   814
 4 B      2013 math      504
 5 B      2013 read      280
 6 B      2013 science   786
 7 C      2013 math      507
 8 C      2013 read      322
 9 C      2013 science   788
10 D      2013 math      486
11 D      2013 read      307
12 D      2013 science   807

Quick exercise

What happens if you don’t include -c(schid, year) in the gather() function? Try it.

Long –> wide

To go in the opposite direction, use the spread(var, value) function, which makes columns for every unique var and assigns the value that was in the var’s row. In our case, these are test and score, respectively. Unlike gather(), we don’t have explicitly say to ignore columns that want to ignore.

df_wide <- df_long %>%
    spread(test, score) %>%
    arrange(schid)

## show
df_wide
# A tibble: 4 x 5
  schid  year  math  read science
  <chr> <dbl> <dbl> <dbl>   <dbl>
1 A      2013   496   306     814
2 B      2013   504   280     786
3 C      2013   507   322     788
4 D      2013   486   307     807

In theory, our new df_wide data frame should be the same as the one we started with. Let’s check:

## confirm that df_wide == df
identical(df, df_wide)
[1] TRUE

Success!

Quick exercise

Reshape this long data frame wide and then back:

df <- data.frame(id = rep(c('A','B','C','D'), each = 4),
                 year = paste0('y', rep(2000:2003, 4)),
                 score = rnorm(16),
                 stringsAsFactors = FALSE) %>%
      tbl_df()