Hosted by Virginia Education Science Training (VEST) Program at UVA
Being able to read, manipulate, and save data, that is, wrangle data, is a key part of any analysis. In fact (as you’re probably already aware), building and cleaning data usually takes more time and lines of code than the actual analysis.
In this module, I’m going to show you some data wrangling procedures, using only base R functions. There’s much to be said for the tidyverse way of doing things (which we’ll cover in the next module), but I think it’s still good to know how to use core commands for those edge cases where tidyverse functions don’t quite work the way you want.
Data for this module come from the public release files of the NCES Education Longitudinal Study of 2002. For descriptions of the variables, see the codebook.
First things first, let’s read in the data. Base R can load()
its own
data formats, .rda
and .RData
, as well as read flat files like
.txt
, .csv
, and .tsv
files. (We’ll discuss how to read in data
files from other languages later.)
Since the data come in a CSV file, we could use the special command
read.csv()
. The more generic function read.table()
works just as
well, though, as long as we tell R that items in each row of our data
file are separated by a ,
using the sep = ','
argument. By default,
R assumes that the data just begin, but since our file has the variable
names in the first row, we also need to use header = TRUE
. We won’t
talk about factors until later, but let’s read in the data keeping
string values as character vectors as well.
## read in the data, making sure that first line is read as column names
df <- read.table('../data/els_plans.csv', sep = ',', header = TRUE,
stringsAsFactors = FALSE)
If you tried to read the data and got an error that looked like this
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file '../data/els_plans.csv': No such file or directory
then you have one of three issues:
To check your current working directory, use the getwd()
function.
## check current directory
getwd()
If the output isn’t the scripts
subdirectory, then your working
directory isn’t correct. If you aren’t in the right directory, this
means that R can’t find your data because even though it has directions
on where to find it—in the path given to the read.table()
function—it’s starting in the wrong spot. The directions are
worthless!
If you know where your scripts
directory/folder is, you can use the
setwd()
function to change your working directory to the right one.
## set the working directory, uncomment and change <path>/<to> as needed
## setwd('<path>/<to>/rworkshop/scripts')
Since we’re using RStudio, you can also correctly set the working
directory by opening the script using Rstudio’s Files menu, which is
in the bottom right-hand window. Once you’ve found your script, use the
More menu option “Set As Working Directory” to set the scripts
directory to the working directory. Try reading the data again.
Let’s look at the first few rows and the variable names using the
head()
function. RStudio makes it easy to see your data by using its
viewer,
(which really just calls the View()
function around the data object).
## show the first few rows (or view in RStudio's view)
head(df)
stu_id sch_id strat_id psu f1sch_id bystuwt bysex
1 101101 1011 101 psu 1 1011 178.9513 female
2 101102 1011 101 psu 1 1011 28.2951 female
3 101104 1011 101 psu 1 1011 589.7248 female
4 101105 1011 101 psu 1 1011 235.7822 female
5 101106 1011 101 psu 1 1011 178.9513 female
6 101107 1011 101 psu 1 1011 256.9656 male
byrace bydob_p
1 hispanic, race specified 198512
2 asian, hawaii/pac. islander,non-hispanic 198605
3 white, non-hispanic 198601
4 black or african american, non-hispanic 198607
5 hispanic, no race specified 198511
6 hispanic, no race specified 198510
bypared bymothed
1 attended college, no 4-year degree did not finish high school
2 attended college, no 4-year degree attended college, no 4-year degree
3 graduated from high school or ged graduated from high school or ged
4 graduated from high school or ged graduated from high school or ged
5 did not finish high school did not finish high school
6 graduated from high school or ged graduated from high school or ged
byfathed byincome byses1 byses2
1 attended college, no 4-year degree $50,001-$75,000 -0.25 -0.23
2 attended college, no 4-year degree $75,001-$100,000 0.58 0.69
3 graduated from high school or ged $50,001-$75,000 -0.85 -0.68
4 graduated from high school or ged $1,000 or less -0.80 -0.89
5 did not finish high school $15,001-$20,000 -1.41 -1.28
6 did not finish high school $35,001-$50,000 -1.07 -0.93
bystexp bynels2m bynels2r f1qwt
1 attend or complete 2-year college/school 47.84 39.04 152.9769
2 obtain phd, md, or other advanced degree 55.30 36.35 25.3577
3 {don^t know} 66.24 42.68 709.4246
4 graduate from college 35.33 27.86 199.7193
5 graduate from college 29.97 13.07 152.9769
6 attend college, 4-year degree incomplete 24.28 11.70 205.2692
f1pnlwt f1psepln
1 155.6312 don^t know or planning but unspecified
2 25.4906 four-year college or university
3 725.6926 four-year college or university
4 205.1919 two-year community college
5 155.6312 four-year college or university
6 211.4690 two-year community college
f2ps1sec
1 {Survey component legitimate skip/NA}
2 Public, 4-year or above
3 Public, 4-year or above
4 Public, 2-year
5 Public, 2-year
6 {Item legitimate skip/NA}
Remember that we can also use names()
to see just the variable names.
## show the column names
names(df)
[1] "stu_id" "sch_id" "strat_id" "psu" "f1sch_id" "bystuwt"
[7] "bysex" "byrace" "bydob_p" "bypared" "bymothed" "byfathed"
[13] "byincome" "byses1" "byses2" "bystexp" "bynels2m" "bynels2r"
[19] "f1qwt" "f1pnlwt" "f1psepln" "f2ps1sec"
Add a column by giving it a name and assigning what you want. R will repeat the values as necessary to fill the number of rows. You can also use data from other columns. R will assign values row by row, using the right-hand side values that align.
## add a column of ones (the 1 will repeat and fill each row)
df$ones <- 1
## add sum of test scores (bynels2r + bynels2m)
df$sum_test <- df$bynels2r + df$bynels2m
## check names
names(df)
[1] "stu_id" "sch_id" "strat_id" "psu" "f1sch_id" "bystuwt"
[7] "bysex" "byrace" "bydob_p" "bypared" "bymothed" "byfathed"
[13] "byincome" "byses1" "byses2" "bystexp" "bynels2m" "bynels2r"
[19] "f1qwt" "f1pnlwt" "f1psepln" "f2ps1sec" "ones" "sum_test"
Quick exercise
Create a new column that is the average of the test scores.
Drop variables by assigning NULL
to the column name.
## drop follow up one panel weight
df$f1pnlwt <- NULL
## check names
names(df)
[1] "stu_id" "sch_id" "strat_id" "psu" "f1sch_id" "bystuwt"
[7] "bysex" "byrace" "bydob_p" "bypared" "bymothed" "byfathed"
[13] "byincome" "byses1" "byses2" "bystexp" "bynels2m" "bynels2r"
[19] "f1qwt" "f1psepln" "f2ps1sec" "ones" "sum_test"
This can be tricky at first. To conditionally change or assign values, you need to tell R where the conditions apply. There are a couple of ways.
The first way uses brackets, []
, after the variable name to set the
condition where the assignment is true. For version 1 below, the new
variable female
is assigned a value of 1 in the rows where it is
TRUE
that bysex == 'female'
. In the rows where that expression is
FALSE
, R will assign NA
since there’s no information. We can back
fill 0
s using the second line.
The other way is to use the ifelse(test, yes, no)
function. Going row
by row, the test
(bysex == 'female'
) is performed. If TRUE
, the
new variable gets a 1; if FALSE
, it gets a 0.
## make a numeric column that == 1 if bysex is female, 0 otherwise
## v.1
df$female_v1[df$bysex == 'female'] <- 1 # double == for IS EQUAL TO
df$female_v1[df$bysex != 'female'] <- 0 # != --> NOT EQUAL TO
## v.2
df$female_v2 <- ifelse(df$bysex == 'female', 1, 0)
## the same?
identical(df$female_v1, df$female_v2)
[1] TRUE
Important Note The code above assumes that bysex
has only two
outcomes, male
and female
. But since bysex
has missing values,
which are coded with other string names, the code above could be
misinterpreted in later analyses. Though when female == 1
it will
always be true that bysex == 'female'
, when female == 0
, bysex
could be male
or one of the other other string values that indicates
missing values. We will leave it as is for now, but will return to
missing values later.
Quick exercise
Create a new column called
ses_gender
that usesbyses1
for women andbyses2
for men. (HINT: if you use a condition, you need to use it on both sides of the arrow.)
You can also use brackets to conditionally drop rows, such as those with missing values.
In this data set, each student’s date of birth, bydob_p
, is coded as
the four-digit year plus two-digit month run together: January 1983
becomes 198301. If the value is missing, it is given a negative number.
We can use the less than operator (<
) to filter.
## assign as NA if < 0
df$bydob_p[df$bydob_p < 0] <- NA
nrow(df)
[1] 16160
## drop if NA
df <- df[!is.na(df$bydob_p),]
nrow(df)
[1] 15183
Quick exercise
The variable
bynels2m
also uses negative values to represent missing values. ReassignNA
s to values that are less than zero. Next drop observations from the data set if they are missingbynels2m
values. (HINT 1: Pay attention to your commas each time!) (HINT 2: Before dropping observations, save your data set object,df
in another object,df_hold
, just in case things don’t go well the first time…)
Sort the data frame using the order()
function as a condition.
## show first few rows of student and base year math scores
df[1:10, c('stu_id','bydob_p')] # subset columns using c() + names
stu_id bydob_p
1 101101 198512
2 101102 198605
3 101104 198601
4 101105 198607
5 101106 198511
6 101107 198510
7 101108 198607
8 101109 198512
9 101110 198505
10 101111 198507
## since a data frame has two dims, notice the comma in the brackets
df <- df[order(df$bydob_p),]
## show again first few rows of ID and DOB
df[1:10, c('stu_id','bydob_p')]
stu_id bydob_p
1589 133211 198300
4286 195210 198300
4288 195213 198300
5578 225203 198300
7528 268219 198300
7532 268225 198300
7782 274222 198300
8055 280215 198300
10046 324119 198300
10203 327128 198300
Quick exercise
Sort by
bydob_p
again, but this time from youngest to oldest, that is, reverse order. Type?order
to see the help file for the function and the argument you might need to do that.
To collapse the data, generating a summary statistic in the process, use
the aggregate(x, by, FUN)
, where x
is the data frame, by
is the
grouping variable in a list()
, and FUN
is the function that you want
to use. The function you use can be a base R function or one you create
yourself. Let’s get the average math score within each school.
Quick Note Because mean()
cannot compute a mean when missing
values are present (try it and see what you get), we can add the na.rm
= TRUE
argument, which tells the function to drop NA
values.
Normally, the argument needs to be inside the mean()
function’s
parentheses. The aggregate()
function and others like are special in
that they will let you just tack on any arguments to the FUN
function
at the end, separated by commas.
## first, make test score values < 0 ==> NA (if you didn't already)
df$bynels2m[df$bynels2m < 0] <- NA
## create new data frame with mean math scores, dropping NAs
sch_m <- aggregate(df$bynels2m, by = list(df$sch_id), FUN = mean, na.rm = T)
## show
head(sch_m)
Group.1 x
1 1011 45.26387
2 1012 43.30400
3 1021 28.91529
4 1022 38.60290
5 1031 40.01636
6 1032 35.34429
Quick exercise
Get the average reading score for each school and save it to a different object. Don’t forget to account for missing values.
Since you can have multiple data frames in memory (as objects) at the
same time in R, you may not find yourself merging data sets as often you
would in another language (like Stata, where you have to). That said, it
still needs to happen. Use the merge()
function to do so. Let’s merge
the aggregated test score data back into the data set.
## first fix names from aggregated data set
names(sch_m) <- c('sch_id', 'sch_bynels2m')
## merge on school ID variable
df <- merge(df, sch_m, by = 'sch_id')
## show
head(df)
sch_id stu_id strat_id psu f1sch_id bystuwt bysex
1 1011 101126 101 psu 1 1011 28.2951 female
2 1011 101105 101 psu 1 1011 235.7822 female
3 1011 101106 101 psu 1 1011 178.9513 female
4 1011 101132 101 psu 1 -8 192.4304 male
5 1011 101116 101 psu 1 1011 30.2245 male
6 1011 101131 101 psu 1 1011 620.1837 male
byrace bydob_p
1 asian, hawaii/pac. islander,non-hispanic 198410
2 black or african american, non-hispanic 198607
3 hispanic, no race specified 198511
4 hispanic, no race specified 198611
5 asian, hawaii/pac. islander,non-hispanic 198612
6 white, non-hispanic 198610
bypared
1 did not finish high school
2 graduated from high school or ged
3 did not finish high school
4 graduated from high school or ged
5 completed master^s degree or equivalent
6 graduated from 2-year school
bymothed
1 did not finish high school
2 graduated from high school or ged
3 did not finish high school
4 graduated from high school or ged
5 completed master^s degree or equivalent
6 attended 2-year school, no degree
byfathed byincome byses1 byses2
1 did not finish high school $25,001-$35,000 -0.64 -0.67
2 graduated from high school or ged $1,000 or less -0.80 -0.89
3 did not finish high school $15,001-$20,000 -1.41 -1.28
4 graduated from high school or ged $50,001-$75,000 -0.16 -0.24
5 attended 2-year school, no degree $100,001-$200,000 0.99 0.88
6 graduated from 2-year school $35,001-$50,000 0.48 0.67
bystexp bynels2m bynels2r f1qwt
1 obtain phd, md, or other advanced degree 66.73 16.65 25.3577
2 graduate from college 35.33 27.86 199.7193
3 graduate from college 29.97 13.07 152.9769
4 graduate from college 41.28 17.34 0.0000
5 obtain phd, md, or other advanced degree 69.08 45.74 26.0130
6 graduate from college 57.19 28.60 736.6029
f1psepln f2ps1sec
1 four-year college or university Public, 2-year
2 two-year community college Public, 2-year
3 four-year college or university Public, 2-year
4 {nonrespondent} {Item legitimate skip/NA}
5 four-year college or university Private not-for-profit, 4-year or above
6 two-year community college Public, 2-year
ones sum_test female_v1 female_v2 sch_bynels2m
1 1 83.38 1 1 45.26387
2 1 63.19 1 1 45.26387
3 1 43.04 1 1 45.26387
4 1 58.62 0 0 45.26387
5 1 114.82 0 0 45.26387
6 1 85.79 0 0 45.26387
We’ll talk more about joins (another word for merge) in the next module,
but know now that by default, the merge()
function only keeps rows in
the x
and y
data sets (df
and sch_m
, respectively, in our case)
that can be matched. If you want to keep unmatched rows from the x
or
y
data frame, you need to use all.x = TRUE
and/or all.y = TRUE
arguments as needed. See the
merge()
help file for more information.
Quick exercise
Merge the average school level reading score data set you created to the full data set.
Finally we can write our new data set to disk. We can save it as an R data file type, but since we may want to share with non-R users, we’ll save it as a csv file again.
It’s not strictly necessary but good practice nonetheless to change the name of the modified file. That way, we still have the untouched raw data in case we need to change how we wrangle new data sets in the future.
write.csv(df, '../data/els_plans_mod.csv', row.names = FALSE)