In some situations, you may want to use `encodefrom()` to collapse values, that is, to group unique raw values into a smaller set of clean values and labels. For example, say you have the following data set, which gives each state’s census division number and name:
id | state | cendiv | cendiv_name |
---|---|---|---|
1 | AL | 6 | East South Central |
2 | AK | 9 | Pacific |
3 | AZ | 8 | Mountain |
4 | AR | 7 | West South Central |
5 | CA | 9 | Pacific |
6 | CO | 8 | Mountain |
7 | CT | 1 | New England |
8 | DE | 5 | South Atlantic |
10 | FL | 5 | South Atlantic |
12 | HI | 9 | Pacific |
14 | IL | 3 | East North Central |
15 | IN | 3 | East North Central |
16 | IA | 4 | West North Central |
31 | NJ | 2 | Middle Atlantic |
33 | NY | 2 | Middle Atlantic |
Instead of using the nine census divisions, you would rather group states by census region. You have the following crosswalk:
cendiv | cenreg | cenregnm |
---|---|---|
1 | 1 | Northeast |
2 | 1 | Northeast |
3 | 2 | Midwest |
4 | 2 | Midwest |
5 | 3 | South |
6 | 3 | South |
7 | 3 | South |
8 | 4 | West |
9 | 4 | West |
As long as

1. `raw` values are unique in the crosswalk, and
2. `clean` and `label` columns have a 1:1 match,

you can use `encodefrom()` to collapse categories as you move from raw to clean values.
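The two conditions can be checked before encoding. The following is a minimal base R sketch, not part of the package: the crosswalk is rebuilt as a plain data frame so the block runs on its own, and the names `raw_unique`, `pairs`, and `one_to_one` are illustrative only.

```r
## crosswalk rebuilt here so the block is self-contained
cw <- data.frame(cendiv = 1:9,
                 cenreg = c(1, 1, 2, 2, 3, 3, 3, 4, 4),
                 cenregnm = c('Northeast', 'Northeast', 'Midwest', 'Midwest',
                              'South', 'South', 'South', 'West', 'West'))

## condition 1: no raw value appears more than once
raw_unique <- anyDuplicated(cw$cendiv) == 0

## condition 2: each clean value pairs with exactly one label and vice versa;
## reduce to distinct (clean, label) pairs, then look for repeats on either side
pairs <- unique(cw[, c('cenreg', 'cenregnm')])
one_to_one <- anyDuplicated(pairs$cenreg) == 0 &&
    anyDuplicated(pairs$cenregnm) == 0

## stop with an error if either condition fails
stopifnot(raw_unique, one_to_one)
```

If a clean value were paired with two different labels, or one label with two clean values, `one_to_one` would be `FALSE` and `stopifnot()` would raise an error.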
## libraries
library(dplyr)
library(crosswalkr)

## data
df <- tibble(id = c(1:8, 10, 12, 14:16, 31, 33),
             state = c('AL','AK','AZ','AR','CA','CO','CT','DE','FL','HI',
                       'IL','IN','IA','NJ','NY'),
             cendiv = c(6, 9, 8, 7, 9, 8, 1, 5, 5, 9, 3, 3, 4, 2, 2),
             cendiv_name = c('East South Central','Pacific','Mountain',
                             'West South Central','Pacific','Mountain','New England',
                             'South Atlantic','South Atlantic','Pacific',
                             'East North Central','East North Central',
                             'West North Central','Middle Atlantic','Middle Atlantic'))

## crosswalk
cw <- tibble(cendiv = 1:9,
             cenreg = c(1, 1, 2, 2, 3, 3, 3, 4, 4),
             cenregnm = c('Northeast','Northeast','Midwest','Midwest',
                          'South','South','South','West','West'))

## encode new column: map each division (raw) to its region (clean) and region name (label)
df <- df %>%
    mutate(cenreg = encodefrom(., var = cendiv, cw_file = cw, raw = cendiv,
                               clean = cenreg, label = cenregnm))

df
## # A tibble: 15 × 5
## id state cendiv cendiv_name cenreg
## <dbl> <chr> <dbl> <chr> <dbl+lbl>
## 1 1 AL 6 East South Central 3 [South]
## 2 2 AK 9 Pacific 4 [West]
## 3 3 AZ 8 Mountain 4 [West]
## 4 4 AR 7 West South Central 3 [South]
## 5 5 CA 9 Pacific 4 [West]
## 6 6 CO 8 Mountain 4 [West]
## 7 7 CT 1 New England 1 [Northeast]
## 8 8 DE 5 South Atlantic 3 [South]
## 9 10 FL 5 South Atlantic 3 [South]
## 10 12 HI 9 Pacific 4 [West]
## 11 14 IL 3 East North Central 2 [Midwest]
## 12 15 IN 3 East North Central 2 [Midwest]
## 13 16 IA 4 West North Central 2 [Midwest]
## 14 31 NJ 2 Middle Atlantic 1 [Northeast]
## 15 33 NY 2 Middle Atlantic 1 [Northeast]
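
To see the collapse itself, the same mapping can be reproduced with a plain lookup. This is a base R sketch of the idea only: it makes no claim about how `encodefrom()` works internally, and it returns ordinary vectors rather than a labelled column. The names `idx`, `cenreg`, and `cenregnm` are illustrative.

```r
## raw division codes from the data set above, in row order
cendiv <- c(6, 9, 8, 7, 9, 8, 1, 5, 5, 9, 3, 3, 4, 2, 2)

## crosswalk rebuilt here so the block is self-contained
cw <- data.frame(cendiv = 1:9,
                 cenreg = c(1, 1, 2, 2, 3, 3, 3, 4, 4),
                 cenregnm = c('Northeast', 'Northeast', 'Midwest', 'Midwest',
                              'South', 'South', 'South', 'West', 'West'))

## find each raw value's row in the crosswalk, then pull clean value and label
idx <- match(cendiv, cw$cendiv)
cenreg <- cw$cenreg[idx]
cenregnm <- cw$cenregnm[idx]

## states per region after collapsing the divisions
table(cenregnm)
```

The six divisions present in the data collapse to four regions, which matches the `cenreg` column printed above.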