Convert a column of unique but restricted IDs into a set of new IDs using a secure (SHA-2) hashing algorithm. Users have the option of saving a crosswalk between the old and new IDs in case observations need to be reidentified at a later date.
deid_dua( df, id_col = NULL, new_id_name = "id", id_length = 64, existing_crosswalk = NULL, write_crosswalk = FALSE, crosswalk_filename = NULL )
Argument | Description |
---|---|
df | Data frame. |
id_col | Column name with IDs to be replaced. By default it is NULL. |
new_id_name | New hashed ID column name, which must be different from the old name. |
id_length | Length of new hashed ID; cannot be fewer than 12 characters (default is 64 characters). |
existing_crosswalk | File name of existing crosswalk. If existing crosswalk is used, then … |
write_crosswalk | Write crosswalk between old ID and new hash ID to console (unless …). |
crosswalk_filename | Name of crosswalk file with path; defaults to generic name with current date (YYYYMMDD) appended. |
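To make the idea behind the arguments above concrete, here is a minimal sketch of hashing IDs with SHA-256 (a member of the SHA-2 family) and truncating the result to id_length characters. It is not the package's implementation: the helper name hash_ids is hypothetical, the digest package is an assumed dependency, and the hashes it produces should not be expected to match the output of deid_dua().

```r
library(digest)

## hypothetical helper (not part of duawranglr): hash each ID with
## SHA-256 and keep the first `id_length` hexadecimal characters
hash_ids <- function(ids, id_length = 64) {
    ## mirror the documented limits: at least 12, at most 64 characters
    stopifnot(id_length >= 12, id_length <= 64)
    hashed <- vapply(as.character(ids),
                     function(x) digest(x, algo = "sha256", serialize = FALSE),
                     character(1),
                     USE.NAMES = FALSE)
    substr(hashed, 1, id_length)
}

## restricted IDs in the same format as the example data
old_ids <- c("000-00-0001", "000-00-0002", "000-00-0003")
new_ids <- hash_ids(old_ids, id_length = 12)

## a crosswalk is simply a table linking old and new IDs
crosswalk <- data.frame(sid = old_ids, id = new_ids,
                        stringsAsFactors = FALSE)
```

Because the same input always yields the same hash, rerunning the helper on the same IDs reproduces the same new IDs. Shorter values of id_length raise the chance that two different IDs collide, which is one plausible reason for a lower bound such as 12.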
## --------------
## Setup
## --------------

## set DUA crosswalk
dua_cw <- system.file('extdata', 'dua_cw.csv', package = 'duawranglr')
set_dua_cw(dua_cw)

## read in data
admin <- system.file('extdata', 'admin_data.csv', package = 'duawranglr')
df <- read_dua_file(admin)

## --------------

## show identified data
df
#> # A tibble: 9 x 10
#>   sid         sname    dob      gender raceeth tid   tname zip   mathscr readscr
#>   <chr>       <chr>    <chr>    <chr>  <chr>   <chr> <chr> <chr> <chr>   <chr>
#> 1 000-00-0001 Schaefer 19900114 0      2       1     Smith 22906 515     496
#> 2 000-00-0002 Hodges   19900225 0      1       1     Smith 22906 488     489
#> 3 000-00-0003 Kirby    19900305 0      4       1     Smith 22906 522     498
#> 4 000-00-0004 Estrada  19900419 0      3       1     Smith 22906 516     524
#> 5 000-00-0005 Nielsen  19900530 1      2       1     Smith 22906 483     509
#> 6 000-00-0006 Dean     19900621 1      1       2     Brown 22906 503     523
#> 7 000-00-0007 Hickman  19900712 1      1       2     Brown 22906 539     509
#> 8 000-00-0008 Bryant   19900826 0      2       2     Brown 22906 499     490
#> 9 000-00-0009 Lynch    19900902 1      3       2     Brown 22906 499     493

## deidentify
df <- deid_dua(df, id_col = 'sid', new_id_name = 'id', id_length = 12)

## show deidentified data
df
#> # A tibble: 9 x 10
#>   id           sname    dob     gender raceeth tid   tname zip   mathscr readscr
#>   <chr>        <chr>    <chr>   <chr>  <chr>   <chr> <chr> <chr> <chr>   <chr>
#> 1 a14856cc5d00 Schaefer 199001… 0      2       1     Smith 22906 515     496
#> 2 a141ce13114a Hodges   199002… 0      1       1     Smith 22906 488     489
#> 3 de520f632a2c Kirby    199003… 0      4       1     Smith 22906 522     498
#> 4 889d833f94ed Estrada  199004… 0      3       1     Smith 22906 516     524
#> 5 2993f4bda3cd Nielsen  199005… 1      2       1     Smith 22906 483     509
#> 6 86c8de9a8d63 Dean     199006… 1      1       2     Brown 22906 503     523
#> 7 cdb300787c0b Hickman  199007… 1      1       2     Brown 22906 539     509
#> 8 ef91ae029e71 Bryant   199008… 0      2       2     Brown 22906 499     490
#> 9 0fb2736cec2c Lynch    199009… 1      3       2     Brown 22906 499     493

if (FALSE) {
## save crosswalk between old and new ids for future
deid_dua(df, write_crosswalk = TRUE)

## use existing crosswalk (good for panel datasets that need link)
deid_dua(df, existing_crosswalk = './crosswalk/master_crosswalk.csv')
}
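The reason an existing crosswalk helps with panel data can be illustrated with base R alone. The sketch below is not how deid_dua() handles existing_crosswalk internally; it only shows the general idea of reusing a saved old-to-new mapping so that a later wave of data receives the same hashed IDs. The data frame wave2 and its scores are hypothetical; the two crosswalk rows reuse IDs from the example output above.

```r
## crosswalk saved from an earlier run (two rows from the example above)
crosswalk <- data.frame(sid = c("000-00-0001", "000-00-0002"),
                        id  = c("a14856cc5d00", "a141ce13114a"),
                        stringsAsFactors = FALSE)

## hypothetical second wave of data with the same restricted IDs
wave2 <- data.frame(sid     = c("000-00-0002", "000-00-0001"),
                    mathscr = c(501, 520),
                    stringsAsFactors = FALSE)

## attach the previously assigned hashed IDs, then drop the restricted column
linked <- merge(wave2, crosswalk, by = "sid", all.x = TRUE)
linked$sid <- NULL
```

Each student in the later wave ends up with the hashed ID assigned earlier, so the two waves can be joined on id without the restricted sid column ever appearing in the shared files.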