vignettes/crosswalkr.Rmd
crosswalkr.Rmd
Researchers often must compile master data sets from a number of smaller data sets that are not consistent in terms of variable names or value encodings. This can be especially true for large administrative data sets that span multiple years and/or departments. Other times, teams of researchers must work together to maintain a master data set and it is important for replicability and future collaboration that the team rely on consistent naming and encoding conventions.
For example, let’s say there are three flat files of student information that need to be merged into a single large data set for analysis.
s_id | name | sta | score |
---|---|---|---|
7 | Tyler | North Dakota | 91 |
8 | Grant | South Dakota | 82 |
9 | Adams | Illinois | 89 |
It is clear that these files contain the same basic information, but
neither the names nor encodings for state
| st
| sta
are consistent.
One solution is to just fix these one at a time before joining them. For example:
df1 <- file_1 %>%
rename(id = sid,
last_name = lname,
stabbr = stat,
score = t_score)
df2 <- file_2 %>%
rename(id = stu_id,
stabbr = st,
score = test_score) %>%
mutate(stabbr = as.character(stabbr))
df3 <- file_3 %>%
rename(id = s_id,
stabbr = sta,
last_name = name)
df <- rbind(df1, df2, df3)
df
## id last_name stabbr score
## 1 1 Jackson VA 74
## 2 2 Harrison KY 86
## 3 3 Nixon IL 78
## 4 4 Washington 35 92
## 5 5 Roosevelt 11 82
## 6 6 Taylor 47 89
## 7 7 Tyler North Dakota 91
## 8 8 Grant South Dakota 82
## 9 9 Adams Illinois 89
The problem, of course, is there is a lot of room for error since the renaming process has to be repeated for each data frame.
Instead, it makes more sense to create a crosswalk data set that
aligns old (or raw) column names with new (or clean) column names and,
if desired, labels. The crosswalk
to join these files could
be:
clean | label | file_1_raw | file_2_raw | file_3_raw |
---|---|---|---|---|
id | Student ID | sid | stu_id | s_id |
last_name | Student last name | lname | last_name | name |
stabbr | State abbreviation | stat | st | sta |
score | Test score | t_score | test_score | score |
The crosswalk file (cw_file
) could be:
'./path/to/crosswalk.csv'
) of a flat file of one of the
following types:*.csv
)*.tsv
)*.txt
) with delimiter
option set to delimiter string (e.g.,
delimiter = '|'
)*.xls
or *.xlsx
) with
sheet
option set to sheet number or string name (defaulting
to the first sheet)*.rdata
, *.rda
,
*.rds
)*.dta
)If given a string to the cw_file
argument,
renamefrom()
and encodefrom()
determine the
type of file by its ending.
To rename using the renamefrom()
command:
df1 <- renamefrom(file_1, cw_file = crosswalk, raw = file_1_raw, clean = clean, label = label)
df2 <- renamefrom(file_2, cw_file = crosswalk, raw = file_2_raw, clean = clean, label = label)
df3 <- renamefrom(file_3, cw_file = crosswalk, raw = file_3_raw, clean = clean, label = label)
df <- rbind(df1, df2, df3)
df
## id last_name stabbr score
## 1 1 Jackson VA 74
## 2 2 Harrison KY 86
## 3 3 Nixon IL 78
## 4 4 Washington 35 92
## 5 5 Roosevelt 11 82
## 6 6 Taylor 47 89
## 7 7 Tyler North Dakota 91
## 8 8 Grant South Dakota 82
## 9 9 Adams Illinois 89
And check out the labels:
var_label(df)
## $id
## [1] "Student ID"
##
## $last_name
## [1] "Student last name"
##
## $stabbr
## [1] "State abbreviation"
##
## $score
## [1] "Test score"
As new raw data files are added to the project, they could simply be given a new column in the crosswalk file that mapped their raw column names to the clean versions.
These same example files have inconsistent encodings for state: one
uses two-letter abbreviations, another the FIPS code, and another the
full name. Again, instead of fixing each one at a time, a separate
crosswalk for encoding these values could be used. The
crosswalkr
package includes a state-level crosswalk,
stcrosswalk
:
data(stcrosswalk)
stcrosswalk
## # A tibble: 51 × 7
## stfips stabbr stname cenreg cenregnm cendiv cendivnm
## <int> <chr> <chr> <int> <chr> <int> <chr>
## 1 1 AL Alabama 3 South 6 East South Central
## 2 2 AK Alaska 4 West 9 Pacific
## 3 4 AZ Arizona 4 West 8 Mountain
## 4 5 AR Arkansas 3 South 7 West South Central
## 5 6 CA California 4 West 9 Pacific
## 6 8 CO Colorado 4 West 8 Mountain
## 7 9 CT Connecticut 1 Northeast 1 New England
## 8 10 DE Delaware 3 South 5 South Atlantic
## 9 11 DC District of Columbia 3 South 5 South Atlantic
## 10 12 FL Florida 3 South 5 South Atlantic
## # ℹ 41 more rows
The encodefrom()
function works much like
renamefrom()
. The only difference is that a vector of
encoded values is returned that can be added to an existing
dataframe.
encodefrom()
returns either base R factors or labels
depending on whether the input data frame is a tibble.
df1$state <- encodefrom(file_1, var = stat, stcrosswalk, raw = stabbr, clean = stfips, label = stname)
df1
## id last_name stabbr score state
## 1 1 Jackson VA 74 Virginia
## 2 2 Harrison KY 86 Kentucky
## 3 3 Nixon IL 78 Illinois
sapply(df1, class)
## id last_name stabbr score state
## "integer" "character" "character" "numeric" "factor"
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
df1$state <- encodefrom(file_1_, var = stat, stcrosswalk, raw = stabbr,
clean = stfips, label = stname)
as_factor(df1)
## id last_name stabbr score state
## 1 1 Jackson VA 74 Virginia
## 2 2 Harrison KY 86 Kentucky
## 3 3 Nixon IL 78 Illinois
zap_labels(df1)
## id last_name stabbr score state
## 1 1 Jackson VA 74 51
## 2 2 Harrison KY 86 21
## 3 3 Nixon IL 78 17
dplyr
chain
The renamefrom()
and encodefrom()
functions
can be combined in a dplyr
chain.
df <- rbind(file_1 %>%
tbl_df() %>%
renamefrom(., crosswalk, file_1_raw, clean, label) %>%
mutate(stabbr = encodefrom(., stabbr, stcrosswalk, stabbr, stfips, stname)),
## append file 2
file_2 %>%
tbl_df() %>%
renamefrom(., crosswalk, file_2_raw, clean, label) %>%
mutate(stabbr = encodefrom(., stabbr, stcrosswalk, stfips, stfips, stname)),
## append file 3
file_3 %>%
tbl_df() %>%
renamefrom(., crosswalk, file_3_raw, clean, label) %>%
mutate(stabbr = encodefrom(., stabbr, stcrosswalk, stname, stfips, stname)))
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## ℹ Please use `tibble::as_tibble()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
df
## # A tibble: 9 × 4
## id last_name stabbr score
## <int> <chr> <int+lbl> <dbl>
## 1 1 Jackson 51 [Virginia] 74
## 2 2 Harrison 21 [Kentucky] 86
## 3 3 Nixon 17 [Illinois] 78
## 4 4 Washington 35 [New Mexico] 92
## 5 5 Roosevelt 11 [District of Columbia] 82
## 6 6 Taylor 47 [Tennessee] 89
## 7 7 Tyler 38 [North Dakota] 91
## 8 8 Grant 46 [South Dakota] 82
## 9 9 Adams 17 [Illinois] 89
as_factor(df)
## # A tibble: 9 × 4
## id last_name stabbr score
## <int> <chr> <fct> <dbl>
## 1 1 Jackson Virginia 74
## 2 2 Harrison Kentucky 86
## 3 3 Nixon Illinois 78
## 4 4 Washington New Mexico 92
## 5 5 Roosevelt District of Columbia 82
## 6 6 Taylor Tennessee 89
## 7 7 Tyler North Dakota 91
## 8 8 Grant South Dakota 82
## 9 9 Adams Illinois 89