The sdcHierarchies package lets you create, modify and
export nested hierarchies that are used, for example, to define tables in
statistical disclosure control (SDC) software such as sdcTable.
Before use, the package needs to be loaded with library(sdcHierarchies).
hier_create() creates a hierarchy. Argument
root specifies the name of the root node. Optionally,
nodes can be added to the top level by listing their names in
argument nodes. hier_display() shows
the hierarchical structure of the current tree:
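A minimal sketch of a call that would produce the tree shown next. It assumes the argument names root and nodes that appear in the generated code later in this document; the object name h matches the one used in the following examples.

```r
library(sdcHierarchies)

# create a hierarchy with root node "Total" and five top-level nodes A to E
h <- hier_create(root = "Total", nodes = LETTERS[1:5])

# print the tree structure
hier_display(h)
```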
## Total
## ├─A
## ├─B
## ├─C
## ├─D
## └─E
Once such an object is created, it can be modified by the following functions:
- hier_add(): adds nodes to the hierarchy
- hier_delete(): deletes nodes from the tree
- hier_rename(): renames nodes
These functions can be applied as shown below:
# adding nodes below the node specified in argument `root`
h <- hier_add(h, root = "A", nodes = c("a1", "a2"))
h <- hier_add(h, root = "B", nodes = c("b1", "b2"))
h <- hier_add(h, root = "b1", nodes = c("b1_a", "b1_b"))
# deleting one or more nodes from the hierarchy
h <- hier_delete(h, nodes = c("a1", "b2"))
h <- hier_delete(h, nodes = c("a2"))
# rename nodes
h <- hier_rename(h, nodes = c("C" = "X", "D" = "Y"))
hier_display(h)
## Total
## ├─A
## ├─B
## │ └─b1
## │   ├─b1_a
## │   └─b1_b
## ├─X
## ├─Y
## └─E
We note that the underlying data.tree package allows objects to be modified by reference, so no explicit assignment is required.
Function hier_info() returns metadata for specific nodes
provided in the nodes argument. If this argument is
omitted, the function returns information for all nodes in the
hierarchy.
info is a named list in which each list element refers to a
queried node. The results for node b1 can be extracted
as shown below:
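A sketch of the calls that would lead to the output below, assuming the hierarchy built in the previous steps. It is rebuilt here from the generated code shown later in this document so that the example is self-contained.

```r
library(sdcHierarchies)

# rebuild the hierarchy from the previous steps
h <- hier_create(root = "Total", nodes = c("A", "B", "X", "Y", "E"))
h <- hier_add(h, root = "B", nodes = "b1")
h <- hier_add(h, root = "b1", nodes = c("b1_a", "b1_b"))

# metadata for all nodes (argument `nodes` is omitted)
info <- hier_info(h)

# extract the entry for node b1
info$b1
```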
## $name
## [1] "b1"
##
## $is_rootnode
## [1] FALSE
##
## $level
## [1] 3
##
## $is_leaf
## [1] FALSE
##
## $siblings
## character(0)
##
## $contributing_codes
## [1] "b1_a" "b1_b"
##
## $children
## [1] "b1_a" "b1_b"
##
## $parent
## [1] "B"
##
## $is_bogus
## [1] TRUE
##
## $parent_bogus
## [1] "B"
Function hier_convert() converts the network-based structure of a
hierarchy to different formats, while
hier_export() does the conversion and writes the result to
a file on disk. The following formats are currently supported:
- df: a “@;label”-based format used in sdcTable
- dt: the same as df, but the result is returned as a data.table
- argus: also a “@;label”-based format used to create hrc-files for τ-argus
- json: a json-encoded string
- code: the R code required to rebuild the current hierarchy
- sdc: a list which is a suitable input for sdcTable
## level name
## 1 @ Total
## 2 @@ A
## 3 @@ B
## 4 @@@ b1
## 5 @@@@ b1_a
## 6 @@@@ b1_b
## 7 @@ X
## 8 @@ Y
## 9 @@ E
The required code to create this hierarchy could be computed using:
## library(sdcHierarchies)
## tree <- hier_create(root = 'Total', nodes = c('A', 'B', 'X', 'Y', 'E'))
## tree <- hier_add(tree = tree, root = 'B', nodes = 'b1')
## tree <- hier_add(tree = tree, root = 'b1', nodes = c('b1_a', 'b1_b'))
## print(tree)
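The two outputs above (the “@;label” table and the generated code) would come from calls along the following lines. This is a sketch that assumes the hierarchy h from the earlier steps, rebuilt here for completeness; the object names res_df and code are illustrative.

```r
library(sdcHierarchies)

# rebuild the hierarchy from the previous steps
h <- hier_create(root = "Total", nodes = c("A", "B", "X", "Y", "E"))
h <- hier_add(h, root = "B", nodes = "b1")
h <- hier_add(h, root = "b1", nodes = c("b1_a", "b1_b"))

# convert to the "@;label" data.frame format
res_df <- hier_convert(h, as = "df")
print(res_df)

# obtain the R code that rebuilds the hierarchy and print it line by line
code <- hier_convert(h, as = "code")
cat(code, sep = "\n")
```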
Using hier_export(), one can write the results to a
file. This is useful, for example, to create
hrc-files that can be used as input for τ-argus, which can be achieved as
follows:
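A hedged sketch of such an export. The target path is a hypothetical temporary file, and the argument names as and path are assumed to be those of hier_export().

```r
library(sdcHierarchies)

# rebuild the hierarchy from the previous steps
h <- hier_create(root = "Total", nodes = c("A", "B", "X", "Y", "E"))
h <- hier_add(h, root = "B", nodes = "b1")
h <- hier_add(h, root = "b1", nodes = c("b1_a", "b1_b"))

# hypothetical output location: a fresh temporary file with extension .hrc
path <- tempfile(fileext = ".hrc")

# write the hierarchy in the format used for hrc-files
hier_export(h, as = "argus", path = path)
```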
hier_import() returns a network-based hierarchy given
either a data.frame (in “@;label” format),
json, code or a τ-argus compatible
hrc-file. For example, to create a hierarchy
based on res_df:
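A sketch of the import step, assuming res_df is the data.frame obtained from hier_convert(h, as = "df"). The name n_df for the imported hierarchy is illustrative.

```r
library(sdcHierarchies)

# rebuild the hierarchy and its "@;label" representation
h <- hier_create(root = "Total", nodes = c("A", "B", "X", "Y", "E"))
h <- hier_add(h, root = "B", nodes = "b1")
h <- hier_add(h, root = "b1", nodes = c("b1_a", "b1_b"))
res_df <- hier_convert(h, as = "df")

# create a network-based hierarchy from the data.frame and display it
n_df <- hier_import(inp = res_df, from = "df")
hier_display(n_df)
```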
## Total
## ├─A
## ├─B
## │ └─b1
## │   ├─b1_a
## │   └─b1_b
## ├─X
## ├─Y
## └─E
Using hier_import(inp = "hierarchy.hrc", from = "argus"),
one could create an sdc hierarchy object directly from an
hrc-file.
Often, nested hierarchy information is encoded
in a string. Function hier_compute() transforms
such strings into hierarchy objects. Two cases can be distinguished: in the
first, all input codes have the same length; in the
second, the lengths of the codes differ. Let’s assume we have a
geographic code given in geo_m where digits 1-2 refer to
the first level, digit 3 to the second and digits 4-5 to the third level
of the hierarchy.
geo_m <- c(
  "01051", "01053", "01054", "01055", "01056", "01057", "01058", "01059", "01060", "01061", "01062",
  "02000",
  "03151", "03152", "03153", "03154", "03155", "03156", "03157", "03158", "03251", "03252", "03254", "03255",
  "03256", "03257", "03351", "03352", "03353", "03354", "03355", "03356", "03357", "03358", "03359", "03360",
  "03361", "03451", "03452", "03453", "03454", "03455", "03456",
  "10155"
)
Often, hierarchical information is encoded within character strings
(e.g., geographic or sector codes). The hier_compute()
function allows you to transform such vectors into hierarchy objects.
The method argument provides two ways to define how these
levels are encoded:
- endpos: Requires a numeric vector in dim_spec defining the end position (index) of each level within the string.
- len: Requires a numeric vector in dim_spec defining the number of characters (width) allocated to each level.
If the overall total is not explicitly encoded in the input strings,
the root argument can be used to provide a name for the
top-level node. Additionally, the as parameter specifies
the output format. For example, setting as = "df" returns
the result as a data.frame in the “@;label”
format.
As shown below, these two methods are interchangeable and yield identical hierarchies:
# Using end positions (e.g., level 1 ends at index 2, level 2 at 3, level 3 at 5)
v1 <- hier_compute(
  inp = geo_m,
  dim_spec = c(2, 3, 5),
  root = "Tot",
  method = "endpos",
  as = "df"
)
# Using lengths (e.g., level 1 is 2 chars, level 2 is 1 char, level 3 is 2 chars)
v2 <- hier_compute(
  inp = geo_m,
  dim_spec = c(2, 1, 2),
  root = "Tot",
  method = "len",
  as = "df"
)
identical(v1, v2)
## [1] TRUE
## Tot
## ├─01
## │ └─010
## │ ├─01051
## │ ├─01053
## │ ├─01054
## │ ├─01055
## │ ├─01056
## │ ├─01057
## │ ├─01058
## │ ├─01059
## │ ├─01060
## │ ├─01061
## │ └─01062
## ├─02
## │ └─020
## │ └─02000
## ├─03
## │ ├─031
## │ │ ├─03151
## │ │ ├─03152
## │ │ ├─03153
## │ │ ├─03154
## │ │ ├─03155
## │ │ ├─03156
## │ │ ├─03157
## │ │ └─03158
## │ ├─032
## │ │ ├─03251
## │ │ ├─03252
## │ │ ├─03254
## │ │ ├─03255
## │ │ ├─03256
## │ │ └─03257
## │ ├─033
## │ │ ├─03351
## │ │ ├─03352
## │ │ ├─03353
## │ │ ├─03354
## │ │ ├─03355
## │ │ ├─03356
## │ │ ├─03357
## │ │ ├─03358
## │ │ ├─03359
## │ │ ├─03360
## │ │ └─03361
## │ └─034
## │ ├─03451
## │ ├─03452
## │ ├─03453
## │ ├─03454
## │ ├─03455
## │ └─03456
## └─10
## └─101
## └─10155
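The equivalence of the two specifications can be checked with base R alone: the end positions are the cumulative sums of the level widths. The sketch below is purely illustrative and does not claim to mirror how hier_compute() works internally. It splits one code from geo_m into the prefixes that label its ancestors in the tree above.

```r
# widths of the three levels, as passed with method = "len"
len <- c(2, 1, 2)

# corresponding end positions, as passed with method = "endpos"
endpos <- cumsum(len)

# start position of each level within the string
startpos <- c(1, head(endpos, -1) + 1)

code <- "03151"

# prefixes identifying the node on each level of the hierarchy
prefixes <- substring(code, 1, endpos)

# the characters contributed by each individual level
pieces <- substring(code, startpos, endpos)

print(endpos)
print(prefixes)
print(pieces)
```

Here prefixes evaluates to "03", "031" and "03151", which are the labels on the path from the root to this code in the displayed tree.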
If the total is already contained within the string (for
example, in the first 3 positions), the hierarchy can be computed by
including that segment in the dim_spec and omitting the
root argument:
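An input vector of this kind can be built by prepending the total to every code. The sketch below uses the name geo_m_with_tot from the call that follows and repeats the definition of geo_m so that it runs on its own.

```r
geo_m <- c(
  "01051", "01053", "01054", "01055", "01056", "01057", "01058", "01059", "01060", "01061", "01062",
  "02000",
  "03151", "03152", "03153", "03154", "03155", "03156", "03157", "03158", "03251", "03252", "03254", "03255",
  "03256", "03257", "03351", "03352", "03353", "03354", "03355", "03356", "03357", "03358", "03359", "03360",
  "03361", "03451", "03452", "03453", "03454", "03455", "03456",
  "10155"
)

# prepend the overall total "Tot" to every code
geo_m_with_tot <- paste0("Tot", geo_m)

# show the first six codes
head(geo_m_with_tot)
```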
## [1] "Tot01051" "Tot01053" "Tot01054" "Tot01055" "Tot01056" "Tot01057"
v3 <- hier_compute(
  inp = geo_m_with_tot,
  dim_spec = c(3, 2, 1, 2),
  method = "len"
)
hier_display(v3)
## Tot
## ├─01
## │ └─010
## │ ├─01051
## │ ├─01053
## │ ├─01054
## │ ├─01055
## │ ├─01056
## │ ├─01057
## │ ├─01058
## │ ├─01059
## │ ├─01060
## │ ├─01061
## │ └─01062
## ├─02
## │ └─020
## │ └─02000
## ├─03
## │ ├─031
## │ │ ├─03151
## │ │ ├─03152
## │ │ ├─03153
## │ │ ├─03154
## │ │ ├─03155
## │ │ ├─03156
## │ │ ├─03157
## │ │ └─03158
## │ ├─032
## │ │ ├─03251
## │ │ ├─03252
## │ │ ├─03254
## │ │ ├─03255
## │ │ ├─03256
## │ │ └─03257
## │ ├─033
## │ │ ├─03351
## │ │ ├─03352
## │ │ ├─03353
## │ │ ├─03354
## │ │ ├─03355
## │ │ ├─03356
## │ │ ├─03357
## │ │ ├─03358
## │ │ ├─03359
## │ │ ├─03360
## │ │ └─03361
## │ └─034
## │ ├─03451
## │ ├─03452
## │ ├─03453
## │ ├─03454
## │ ├─03455
## │ └─03456
## └─10
## └─101
## └─10155
The result is identical to v1 and v2.
hier_compute() is also robust enough to handle input
strings of varying lengths:
# Example with unequal string lengths; overall total provided via 'root'
yae_h <- c(
  "1.1.1.", "1.1.2.",
  "1.2.1.", "1.2.2.", "1.2.3.", "1.2.4.", "1.2.5.", "1.3.1.",
  "1.3.2.", "1.3.3.", "1.3.4.", "1.3.5.",
  "1.4.1.", "1.4.2.", "1.4.3.", "1.4.4.", "1.4.5.",
  "1.5.", "1.6.", "1.7.", "1.8.", "1.9.", "2.", "3."
)
v1 <- hier_compute(
  inp = yae_h,
  dim_spec = c(2, 2, 2),
  root = "Tot",
  method = "len"
)
hier_display(v1)
## Tot
## ├─1.
## │ ├─1.1.
## │ │ ├─1.1.1.
## │ │ └─1.1.2.
## │ ├─1.2.
## │ │ ├─1.2.1.
## │ │ ├─1.2.2.
## │ │ ├─1.2.3.
## │ │ ├─1.2.4.
## │ │ └─1.2.5.
## │ ├─1.3.
## │ │ ├─1.3.1.
## │ │ ├─1.3.2.
## │ │ ├─1.3.3.
## │ │ ├─1.3.4.
## │ │ └─1.3.5.
## │ ├─1.4.
## │ │ ├─1.4.1.
## │ │ ├─1.4.2.
## │ │ ├─1.4.3.
## │ │ ├─1.4.4.
## │ │ └─1.4.5.
## │ ├─1.5.
## │ ├─1.6.
## │ ├─1.7.
## │ ├─1.8.
## │ └─1.9.
## ├─2.
## └─3.
Alternatively, you can create a hierarchy by setting
method = "list". In this mode, the input should be a named
list where each element’s name is interpreted as a parent
node, and the element’s content represents its child
nodes.
yae_ll <- list()
yae_ll[["Total"]] <- c("1.", "2.", "3.")
yae_ll[["1."]] <- paste0("1.", 1:9, ".")
yae_ll[["1.1."]] <- paste0("1.1.", 1:2, ".")
yae_ll[["1.2."]] <- paste0("1.2.", 1:5, ".")
yae_ll[["1.3."]] <- paste0("1.3.", 1:5, ".")
yae_ll[["1.4."]] <- paste0("1.4.", 1:6, ".")
d <- hier_compute(inp = yae_ll, root = "Total", method = "list")
## Argument 'dim_spec' is ignored when constructing a hierarchy from a nested list.
hier_display(d)
## Total
## ├─1.
## │ ├─1.1.
## │ │ ├─1.1.1.
## │ │ └─1.1.2.
## │ ├─1.2.
## │ │ ├─1.2.1.
## │ │ ├─1.2.2.
## │ │ ├─1.2.3.
## │ │ ├─1.2.4.
## │ │ └─1.2.5.
## │ ├─1.3.
## │ │ ├─1.3.1.
## │ │ ├─1.3.2.
## │ │ ├─1.3.3.
## │ │ ├─1.3.4.
## │ │ └─1.3.5.
## │ ├─1.4.
## │ │ ├─1.4.1.
## │ │ ├─1.4.2.
## │ │ ├─1.4.3.
## │ │ ├─1.4.4.
## │ │ ├─1.4.5.
## │ │ └─1.4.6.
## │ ├─1.5.
## │ ├─1.6.
## │ ├─1.7.
## │ ├─1.8.
## │ └─1.9.
## ├─2.
## └─3.
The hier_grid() function computes all possible
combinations of codes from multiple hierarchies. This is a crucial step
in building complete tables for Statistical Disclosure Control
(SDC).
A “bogus” chain occurs when a parent node has only a single child. In
such cases, the parent and the child represent the same set of
underlying units, which can cause redundancies in SDC software. In the
example below, both h1 and h2 contain bogus
structures:
- In h1, the node A has only one child a1, which in turn has only one child aa1.
- In h2, the nodes b and d each have only a single child (b1 and d1 respectively).
h1 <- hier_create("Total", nodes = LETTERS[1:3])
h1 <- hier_add(h1, root = "A", nodes = "a1")
h1 <- hier_add(h1, root = "a1", nodes = "aa1")
hier_display(h1)
## Total
## ├─A
## │ └─a1
## │   └─aa1
## ├─B
## └─C
h2 <- hier_create("Total", nodes = letters[1:5])
h2 <- hier_add(h2, root = "b", nodes = "b1")
h2 <- hier_add(h2, root = "d", nodes = "d1")
hier_display(h2)
## Total
## ├─a
## ├─b
## │ └─b1
## ├─c
## ├─d
## │ └─d1
## └─e
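To make the notion of a bogus chain concrete, the following base R sketch represents h1 as a named list mapping each parent to its children (the same shape as the input for method = "list"). It finds parents with exactly one child and follows each chain down to its most granular descendant. The helper resolve() is hypothetical and only illustrates the idea; it is not part of sdcHierarchies.

```r
# parent -> children representation of h1
h1_children <- list(
  Total = c("A", "B", "C"),
  A = "a1",
  a1 = "aa1"
)

# parents with exactly one child are the starting points of bogus chains
n_children <- lengths(h1_children)
bogus_parents <- names(n_children)[n_children == 1]

# hypothetical helper: follow a single-child chain to its last node
resolve <- function(node, children) {
  while (node %in% names(children) && length(children[[node]]) == 1) {
    node <- children[[node]]
  }
  node
}

# most granular descendant for each bogus parent
replacement <- vapply(bogus_parents, resolve, character(1), children = h1_children)
print(replacement)
```

Both A and a1 resolve to aa1, which is the code that remains in the grid once the redundant parents are dropped.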
When calling hier_grid(), setting
add_dups = FALSE automatically prunes these redundant
parent nodes (like A, a1, b, and
d). They are replaced by their most granular descendants
(e.g., aa1, b1, and d1), ensuring
the resulting grid aligns with the granularity of the raw microdata.
# cell_id is a unique string created by concatenating default codes
r <- hier_grid(h1, h2, add_dups = FALSE, add_levs = TRUE)
print(r)
## v1 v2 cell_id levs_v1 levs_v2 leaf_id
## <char> <char> <char> <int> <int> <int>
## 1: Total Total 0000000 1 1 NA
## 2: aa1 Total 0111000 4 1 NA
## 3: B Total 0200000 2 1 NA
## 4: C Total 0300000 2 1 NA
## 5: Total a 0000010 1 2 NA
## 6: aa1 a 0111010 4 2 3
## 7: B a 0200010 2 2 1
## 8: C a 0300010 2 2 2
## 9: Total b1 0000021 1 3 NA
## 10: aa1 b1 0111021 4 3 6
## 11: B b1 0200021 2 3 4
## 12: C b1 0300021 2 3 5
## 13: Total c 0000030 1 2 NA
## 14: aa1 c 0111030 4 2 9
## 15: B c 0200030 2 2 7
## 16: C c 0300030 2 2 8
## 17: Total d1 0000041 1 3 NA
## 18: aa1 d1 0111041 4 3 12
## 19: B d1 0200041 2 3 10
## 20: C d1 0300041 2 3 11
## 21: Total e 0000050 1 2 NA
## 22: aa1 e 0111050 4 2 15
## 23: B e 0200050 2 2 13
## 24: C e 0300050 2 2 14
## v1 v2 cell_id levs_v1 levs_v2 leaf_id
## <char> <char> <char> <int> <int> <int>
For large datasets, mapping microdata strings to grid cells using
character matching is computationally expensive. By setting
add_contributing_cells = TRUE, sdcHierarchies
generates an optimized integer-based indexing system:
- leaf_id: A unique integer assigned to every combination of base-level codes (the most granular codes in the hierarchies).
- contributing_leaf_ids: A list-column containing the integers of all base-level codes that contribute to a specific cell (e.g., all codes falling under a “Total” or “Sub-total”).
library(data.table)
# Create an SDC-optimized grid
r_sdc <- hier_grid(h1, h2, add_dups = FALSE, add_contributing_cells = TRUE)
# Generate microdata using base-level codes for region and sector
# Note: 'aa1', 'b1', and 'd1' are the granular leaf nodes
microdata <- data.table(
  region = c("aa1", "B", "C", "aa1", "B"),
  sector = c("a", "b1", "c", "d1", "e"),
  val = c(10, 20, 30, 40, 50)
)
# Map microdata to base-level IDs using a named list
microdata[, leaf_id := hier_create_ids(
  data = microdata,
  dims = list("region" = h1, "sector" = h2)
)]
print(microdata)
## region sector val leaf_id
## <char> <char> <num> <int>
## 1: aa1 a 10 3
## 2: B b1 20 4
## 3: C c 30 8
## 4: aa1 d1 40 12
## 5: B e 50 13
# Fast aggregation: Summing 'Total_Total' using integer lookups
total_ids <- r_sdc[v1 == "Total" & v2 == "Total", contributing_leaf_ids[[1]]]
print(total_ids)
## [1] 1 2 3 7 8 9 13 14 15 4 5 6 10 11 12
## [1] 150
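The total of 150 can be reproduced by summing val over all microdata rows whose leaf_id is among the contributing IDs. The sketch below uses a plain data.frame holding the values printed above, so it runs without the package. With the data.table objects from this section, the same lookup could presumably be written as microdata[leaf_id %in% total_ids, sum(val)].

```r
# microdata with the leaf IDs shown in the printed table above
microdata <- data.frame(
  region = c("aa1", "B", "C", "aa1", "B"),
  sector = c("a", "b1", "c", "d1", "e"),
  val = c(10, 20, 30, 40, 50),
  leaf_id = c(3L, 4L, 8L, 12L, 13L)
)

# IDs contributing to the cell Total x Total, copied from the output above
total_ids <- c(1, 2, 3, 7, 8, 9, 13, 14, 15, 4, 5, 6, 10, 11, 12)

# aggregate by integer lookup instead of string matching
total_val <- sum(microdata$val[microdata$leaf_id %in% total_ids])
print(total_val)
```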
The leaf_id column serves as a built-in classifier to
distinguish between different cell types in the grid:
- If leaf_id contains an integer, the row represents a unique combination of base-level codes. These are the “internal” cells where microdata is directly mapped.
- If leaf_id is NA, the row represents an aggregate cell.
This allows for extremely fast filtering during SDC tasks, such as isolating primary cells for sensitivity testing:
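A sketch of such a filter on a small excerpt of the grid printed earlier (its first eight rows), written in base R so that it runs on its own. The object name grid is illustrative. On the data.table returned by hier_grid(), the equivalent filter would presumably be r_sdc[!is.na(leaf_id)].

```r
# excerpt of the grid: the first eight rows of the printed output
grid <- data.frame(
  v1 = c("Total", "aa1", "B", "C", "Total", "aa1", "B", "C"),
  v2 = c("Total", "Total", "Total", "Total", "a", "a", "a", "a"),
  leaf_id = c(NA, NA, NA, NA, NA, 3L, 1L, 2L)
)

# cells defined by base-level codes in every dimension
inner_cells <- grid[!is.na(grid$leaf_id), ]

# aggregate cells (at least one dimension is a total or sub-total)
aggregate_cells <- grid[is.na(grid$leaf_id), ]

print(inner_cells)
```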
The sdcHierarchies package includes a
Shiny-based interactive application accessible via
hier_app(). This interface is designed for users who prefer
a visual approach to building or refining complex structures.
The app accepts either a raw character vector (to be converted using
hier_compute() logic) or an existing hierarchy object. For
example, to modify the hierarchy created in the previous section:
- [...] dim_spec and method arguments.
- The R code required to reproduce the current state of the hierarchy is updated in real time and can be copied or saved.
- [...] hrc file.
Because hier_app() returns the modified hierarchy object
upon closing, it is recommended to assign the function call to an object
(for example, res <- hier_app(x)) to capture your interactive changes for further use in
your SDC pipeline.
The sdcHierarchies package provides a robust framework
for hierarchical data management. In case you have any suggestions or
improvements, please feel free to file an issue at our
issue tracker or contribute by filing a pull
request.