Wrangling data I

Being able to read, manipulate, and save data, that is, wrangle data, is a key part of any analysis. In fact (as you’re probably already aware), building and cleaning data usually takes more time and lines of code than the actual analysis.

In this module, I’m going to show you some data wrangling procedures, using only base R functions. There’s much to be said for the tidyverse way of doing things (which we’ll cover in the next module), but I think it’s still good to know how to use core commands for those edge cases where tidyverse functions don’t quite work the way you want.

Data for this module come from the public release files of the NCES Education Longitudinal Study of 2002. For descriptions of the variables, see the codebook.

Data wrangling with base R

First things first, let’s read in the data. Base R can load() its own data formats, .rda and .RData, as well as read flat files like .txt, .csv, and .tsv files. (We’ll discuss how to read in data files from other languages later.)

Since the data come in a CSV file, we could use the special command read.csv(). The more generic function read.table() works just as well, though, as long as we tell R that items in each row of our data file are separated by a , using the sep = ',' argument. By default, R assumes that the data just begin, but since our file has the variable names in the first row, we also need to use header = TRUE. We won’t talk about factors until later, but let’s read in the data keeping string values as character vectors as well.

## read in the data, making sure that first line is read as column names
df <- read.table('../data/els_plans.csv', sep = ',', header = TRUE,
                 stringsAsFactors = FALSE)

…error!

If you tried to read the data and got an error that looked like this

Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file '../data/els_plans.csv': No such file or directory

then you have one of three issues:

You haven’t downloaded the data (get it from the link at top of page)
You don’t have your directory/folder structure set up correctly (go back to the main module page for information about how you should set things up)
You don’t have R set to the correct working directory

To check your current working directory, use the getwd() function.

## check current directory
getwd()

If the output isn’t the scripts subdirectory, then your working directory isn’t correct. If you aren’t in the right directory, this means that R can’t find your data because even though it has directions on where to find it—in the path given to the read.table() function—it’s starting in the wrong spot. The directions are worthless!

If you know where your scripts directory/folder is, you can use the setwd() function to change your working directory to the right one.

## set the working directory, uncomment and change <path>/<to> as needed
## setwd('<path>/<to>/rworkshop/scripts')

Since we’re using RStudio, you can also correctly set the working directory by opening the script using Rstudio’s Files menu, which is in the bottom right-hand window. Once you’ve found your script, use the More menu option “Set As Working Directory” to set the scripts directory to the working directory. Try reading the data again.

Viewing data

Let’s look at the first few rows and the variable names using the head() function. RStudio makes it easy to see your data by using its viewer, (which really just calls the View() function around the data object).

## show the first few rows (or view in RStudio's view)
head(df)

  stu_id sch_id strat_id   psu f1sch_id  bystuwt  bysex
101101   1011      101 psu 1     1011 178.9513 female
101102   1011      101 psu 1     1011  28.2951 female
101104   1011      101 psu 1     1011 589.7248 female
101105   1011      101 psu 1     1011 235.7822 female
101106   1011      101 psu 1     1011 178.9513 female
101107   1011      101 psu 1     1011 256.9656   male
                                    byrace bydob_p
               hispanic, race specified  198512
asian, hawaii/pac. islander,non-hispanic  198605
                    white, non-hispanic  198601
black or african american, non-hispanic  198607
            hispanic, no race specified  198511
            hispanic, no race specified  198510
                             bypared                           bymothed
attended college, no 4-year degree         did not finish high school
attended college, no 4-year degree attended college, no 4-year degree
graduated from high school or ged  graduated from high school or ged
graduated from high school or ged  graduated from high school or ged
       did not finish high school         did not finish high school
graduated from high school or ged  graduated from high school or ged
                            byfathed         byincome byses1 byses2
attended college, no 4-year degree  $50,001-$75,000  -0.25  -0.23
attended college, no 4-year degree $75,001-$100,000   0.58   0.69
graduated from high school or ged  $50,001-$75,000  -0.85  -0.68
graduated from high school or ged   $1,000 or less  -0.80  -0.89
       did not finish high school  $15,001-$20,000  -1.41  -1.28
       did not finish high school  $35,001-$50,000  -1.07  -0.93
                                   bystexp bynels2m bynels2r    f1qwt
attend or complete 2-year college/school    47.84    39.04 152.9769
obtain phd, md, or other advanced degree    55.30    36.35  25.3577
                           {don^t know}    66.24    42.68 709.4246
                  graduate from college    35.33    27.86 199.7193
                  graduate from college    29.97    13.07 152.9769
attend college, 4-year degree incomplete    24.28    11.70 205.2692
   f1pnlwt                               f1psepln
155.6312 don^t know or planning but unspecified
25.4906        four-year college or university
725.6926        four-year college or university
205.1919             two-year community college
155.6312        four-year college or university
211.4690             two-year community college
                               f2ps1sec
{Survey component legitimate skip/NA}
             Public, 4-year or above
             Public, 4-year or above
                      Public, 2-year
                      Public, 2-year
           {Item legitimate skip/NA}

Remember that we can also use names() to see just the variable names.

## show the column names
names(df)

 [1] "stu_id"   "sch_id"   "strat_id" "psu"      "f1sch_id" "bystuwt" 
 [7] "bysex"    "byrace"   "bydob_p"  "bypared"  "bymothed" "byfathed"
[13] "byincome" "byses1"   "byses2"   "bystexp"  "bynels2m" "bynels2r"
[19] "f1qwt"    "f1pnlwt"  "f1psepln" "f2ps1sec"

Add variables

Add a column by giving it a name and assigning what you want. R will repeat the values as necessary to fill the number of rows. You can also use data from other columns. R will assign values row by row, using the right-hand side values that align.

## add a column of ones (the 1 will repeat and fill each row)
df$ones <- 1

## add sum of test scores (bynels2r + bynels2m)
df$sum_test <- df$bynels2r + df$bynels2m

## check names
names(df)

 [1] "stu_id"   "sch_id"   "strat_id" "psu"      "f1sch_id" "bystuwt" 
 [7] "bysex"    "byrace"   "bydob_p"  "bypared"  "bymothed" "byfathed"
[13] "byincome" "byses1"   "byses2"   "bystexp"  "bynels2m" "bynels2r"
[19] "f1qwt"    "f1pnlwt"  "f1psepln" "f2ps1sec" "ones"     "sum_test"

Quick exercise

Create a new column that is the average of the test scores.

Drop variables

Drop variables by assigning NULL to the column name.

## drop follow up one panel weight
df$f1pnlwt <- NULL

## check names
names(df)

 [1] "stu_id"   "sch_id"   "strat_id" "psu"      "f1sch_id" "bystuwt" 
 [7] "bysex"    "byrace"   "bydob_p"  "bypared"  "bymothed" "byfathed"
[13] "byincome" "byses1"   "byses2"   "bystexp"  "bynels2m" "bynels2r"
[19] "f1qwt"    "f1psepln" "f2ps1sec" "ones"     "sum_test"

Conditionally change values

This can be tricky at first. To conditionally change or assign values, you need to tell R where the conditions apply. There are a couple of ways.

The first way uses brackets, [], after the variable name to set the condition where the assignment is true. For version 1 below, the new variable female is assigned a value of 1 in the rows where it is TRUE that bysex == 'female'. In the rows where that expression is FALSE, R will assign NA since there’s no information. We can back fill 0s using the second line.

The other way is to use the ifelse(test, yes, no) function. Going row by row, the test (bysex == 'female') is performed. If TRUE, the new variable gets a 1; if FALSE, it gets a 0.

## make a numeric column that == 1 if bysex is female, 0 otherwise
## v.1
df$female_v1[df$bysex == 'female'] <- 1 # double == for IS EQUAL TO
df$female_v1[df$bysex != 'female'] <- 0 # != --> NOT EQUAL TO

## v.2
df$female_v2 <- ifelse(df$bysex == 'female', 1, 0)

## the same?
identical(df$female_v1, df$female_v2)

[1] TRUE

Important Note The code above assumes that bysex has only two outcomes, male and female. But since bysex has missing values, which are coded with other string names, the code above could be misinterpreted in later analyses. Though when female == 1 it will always be true that bysex == 'female', when female == 0, bysex could be male or one of the other other string values that indicates missing values. We will leave it as is for now, but will return to missing values later.

Quick exercise

Create a new column called ses_gender that uses byses1 for women and byses2 for men. (HINT: if you use a condition, you need to use it on both sides of the arrow.)

Filter

You can also use brackets to conditionally drop rows, such as those with missing values.

In this data set, each student’s date of birth, bydob_p, is coded as the four-digit year plus two-digit month run together: January 1983 becomes 198301. If the value is missing, it is given a negative number. We can use the less than operator (<) to filter.

## assign as NA if < 0
df$bydob_p[df$bydob_p < 0] <- NA
nrow(df)

[1] 16160

## drop if NA
df <- df[!is.na(df$bydob_p),]
nrow(df)

[1] 15183

Quick exercise

The variable bynels2m also uses negative values to represent missing values. Reassign NAs to values that are less than zero. Next drop observations from the data set if they are missing bynels2m values. (HINT 1: Pay attention to your commas each time!) (HINT 2: Before dropping observations, save your data set object, df in another object, df_hold, just in case things don’t go well the first time…)

Order

Sort the data frame using the order() function as a condition.

## show first few rows of student and base year math scores
df[1:10, c('stu_id','bydob_p')]         # subset columns using c() + names

   stu_id bydob_p
101101  198512
101102  198605
101104  198601
101105  198607
101106  198511
101107  198510
101108  198607
101109  198512
101110  198505
101111  198507

## since a data frame has two dims, notice the comma in the brackets
df <- df[order(df$bydob_p),]

## show again first few rows of ID and DOB
df[1:10, c('stu_id','bydob_p')]

      stu_id bydob_p
133211  198300
195210  198300
195213  198300
225203  198300
268219  198300
268225  198300
274222  198300
280215  198300
324119  198300
327128  198300

Quick exercise

Sort by bydob_p again, but this time from youngest to oldest, that is, reverse order. Type ?order to see the help file for the function and the argument you might need to do that.

Aggregate

To collapse the data, generating a summary statistic in the process, use the aggregate(x, by, FUN), where x is the data frame, by is the grouping variable in a list(), and FUN is the function that you want to use. The function you use can be a base R function or one you create yourself. Let’s get the average math score within each school.

Quick Note Because mean() cannot compute a mean when missing values are present (try it and see what you get), we can add the na.rm = TRUE argument, which tells the function to drop NA values. Normally, the argument needs to be inside the mean() function’s parentheses. The aggregate() function and others like are special in that they will let you just tack on any arguments to the FUN function at the end, separated by commas.

## first, make test score values < 0 ==> NA (if you didn't already)
df$bynels2m[df$bynels2m < 0] <- NA

## create new data frame with mean math scores, dropping NAs
sch_m <- aggregate(df$bynels2m, by = list(df$sch_id), FUN = mean, na.rm = T)

## show
head(sch_m)

  Group.1        x
  1011 45.26387
  1012 43.30400
  1021 28.91529
  1022 38.60290
  1031 40.01636
  1032 35.34429

Quick exercise

Get the average reading score for each school and save it to a different object. Don’t forget to account for missing values.

Merge

Since you can have multiple data frames in memory (as objects) at the same time in R, you may not find yourself merging data sets as often you would in another language (like Stata, where you have to). That said, it still needs to happen. Use the merge() function to do so. Let’s merge the aggregated test score data back into the data set.

## first fix names from aggregated data set
names(sch_m) <- c('sch_id', 'sch_bynels2m')

## merge on school ID variable
df <- merge(df, sch_m, by = 'sch_id')

## show
head(df)

  sch_id stu_id strat_id   psu f1sch_id  bystuwt  bysex
 1011 101126      101 psu 1     1011  28.2951 female
 1011 101105      101 psu 1     1011 235.7822 female
 1011 101106      101 psu 1     1011 178.9513 female
 1011 101132      101 psu 1       -8 192.4304   male
 1011 101116      101 psu 1     1011  30.2245   male
 1011 101131      101 psu 1     1011 620.1837   male
                                    byrace bydob_p
asian, hawaii/pac. islander,non-hispanic  198410
black or african american, non-hispanic  198607
            hispanic, no race specified  198511
            hispanic, no race specified  198611
asian, hawaii/pac. islander,non-hispanic  198612
                    white, non-hispanic  198610
                                  bypared
            did not finish high school
     graduated from high school or ged
            did not finish high school
     graduated from high school or ged
completed master^s degree or equivalent
          graduated from 2-year school
                                 bymothed
            did not finish high school
     graduated from high school or ged
            did not finish high school
     graduated from high school or ged
completed master^s degree or equivalent
     attended 2-year school, no degree
                           byfathed          byincome byses1 byses2
      did not finish high school   $25,001-$35,000  -0.64  -0.67
graduated from high school or ged    $1,000 or less  -0.80  -0.89
      did not finish high school   $15,001-$20,000  -1.41  -1.28
graduated from high school or ged   $50,001-$75,000  -0.16  -0.24
attended 2-year school, no degree $100,001-$200,000   0.99   0.88
    graduated from 2-year school   $35,001-$50,000   0.48   0.67
                                   bystexp bynels2m bynels2r    f1qwt
obtain phd, md, or other advanced degree    66.73    16.65  25.3577
                  graduate from college    35.33    27.86 199.7193
                  graduate from college    29.97    13.07 152.9769
                  graduate from college    41.28    17.34   0.0000
obtain phd, md, or other advanced degree    69.08    45.74  26.0130
                  graduate from college    57.19    28.60 736.6029
                         f1psepln                                f2ps1sec
four-year college or university                          Public, 2-year
    two-year community college                          Public, 2-year
four-year college or university                          Public, 2-year
               {nonrespondent}               {Item legitimate skip/NA}
four-year college or university Private not-for-profit, 4-year or above
    two-year community college                          Public, 2-year
  ones sum_test female_v1 female_v2 sch_bynels2m
  1    83.38         1         1     45.26387
  1    63.19         1         1     45.26387
  1    43.04         1         1     45.26387
  1    58.62         0         0     45.26387
  1   114.82         0         0     45.26387
  1    85.79         0         0     45.26387

We’ll talk more about joins (another word for merge) in the next module, but know now that by default, the merge() function only keeps rows in the x and y data sets (df and sch_m, respectively, in our case) that can be matched. If you want to keep unmatched rows from the x or y data frame, you need to use all.x = TRUE and/or all.y = TRUE arguments as needed. See the merge() help file for more information.

Quick exercise

Merge the average school level reading score data set you created to the full data set.

Write

Finally we can write our new data set to disk. We can save it as an R data file type, but since we may want to share with non-R users, we’ll save it as a csv file again.

It’s not strictly necessary but good practice nonetheless to change the name of the modified file. That way, we still have the untouched raw data in case we need to change how we wrangle new data sets in the future.

write.csv(df, '../data/els_plans_mod.csv', row.names = FALSE)

Workshop on R

May 2018

Overview
Schedule
Getting started
Modules
Data

Wrangling data I

Data wrangling with base R

…error!

Viewing data

Add variables

Quick exercise

Drop variables

Conditionally change values

Quick exercise

Filter

Quick exercise

Order

Quick exercise

Aggregate

Quick exercise

Merge

Quick exercise

Write