Collapsing

2019-12-30

In some situations, you may want to use encodefrom() to collapse values, that is, group unique raw values into a smaller set of clean values / labels. For example, say you have the following data set, which gives each state’s census division number and name:

Data

id state cendiv cendiv_name
1 AL 6 East South Central
2 AK 9 Pacific
3 AZ 8 Mountain
4 AR 7 West South Central
5 CA 9 Pacific
6 CO 8 Mountain
7 CT 1 New England
8 DE 5 South Atlantic
10 FL 5 South Atlantic
12 HI 9 Pacific
14 IL 3 East North Central
15 IN 3 East North Central
16 IA 4 West North Central
31 NJ 2 Middle Atlantic
33 NY 2 Middle Atlantic

Rather than using the nine census divisions, you would rather group states by their regions. You have the following crosswalk:

Crosswalk

cendiv cenreg cenregnm
1 1 Northeast
2 1 Northeast
3 2 Midwest
4 2 Midwest
5 3 South
6 3 South
7 3 South
8 4 West
9 4 West

As long as

  1. raw values are unique in the crosswalk
  2. clean and label columns have a 1:1 match

Then you can use encodefrom() to collapse categories as you move from raw to clean values.

library(crosswalkr)
library(dplyr)
library(haven)
## data
df <- tibble(id = c(1:8,10,12,14:16,31,33),
             state = c('AL','AK','AZ','AR','CA','CO','CT','DE','FL','HI',
                       'IL','IN','IA','NJ','NY'),
             cendiv = c(6,9,8,7,9,8,1,5,5,9,3,3,4,2,2),
             cendiv_name = c('East South Central','Pacific','Mountain',
                             'West South Central','Pacific','Mountain','New England',
                             'South Atlantic','South Atlantic','Pacific',
                             'East North Central','East North Central',
                             'West North Central','Middle Atlantic','Middle Atlantic'))
             
## crosswalk
cw <- tibble(cendiv = 1:9,
             cenreg = c(1,1,2,2,3,3,3,4,4),
             cenregnm = c('Northeast','Northeast','Midwest','Midwest',
                          'South','South','South','West','West'))
## encode new column
df <- df %>%
    mutate(cenreg = encodefrom(., var = cendiv, cw_file = cw, raw = cendiv,
                               clean = cenreg, label = cenregnm))
df
## # A tibble: 15 x 5
##       id state cendiv cendiv_name               cenreg
##    <dbl> <chr>  <dbl> <chr>                  <dbl+lbl>
##  1     1 AL         6 East South Central 3 [South]    
##  2     2 AK         9 Pacific            4 [West]     
##  3     3 AZ         8 Mountain           4 [West]     
##  4     4 AR         7 West South Central 3 [South]    
##  5     5 CA         9 Pacific            4 [West]     
##  6     6 CO         8 Mountain           4 [West]     
##  7     7 CT         1 New England        1 [Northeast]
##  8     8 DE         5 South Atlantic     3 [South]    
##  9    10 FL         5 South Atlantic     3 [South]    
## 10    12 HI         9 Pacific            4 [West]     
## 11    14 IL         3 East North Central 2 [Midwest]  
## 12    15 IN         3 East North Central 2 [Midwest]  
## 13    16 IA         4 West North Central 2 [Midwest]  
## 14    31 NJ         2 Middle Atlantic    1 [Northeast]
## 15    33 NY         2 Middle Atlantic    1 [Northeast]