In this example we will show how reclin2 can be used in combination with
machine learning to perform record linkage. We will use the same example
as in the introduction vignette and will skip over some of the initial
steps in the linkage project. We will use plain logistic regression. It
is not the most sophisticated machine learning algorithm, but it is more
than enough for this simple example. Other algorithms are easily
substituted.
When performing record linkage, we compare combinations of records from both datasets. After comparison we end up with a large dataset of pairs with properties of these pairs (the comparison vectors). The goal of record linkage is to divide these pairs into two groups: the matching set, containing the pairs in which both records belong to the same object, and the unmatched set, containing the pairs in which the records belong to different objects. Record linkage is, therefore, a classification problem: when we know for some of the pairs whether they belong to the matching set or the unmatched set, we can use that to train a supervised classification method.
First we have to generate all pairs and compare these. This is similar to regular probabilistic linkage.
> library(reclin2)
> data("linkexample1", "linkexample2")
> print(linkexample1)
  id lastname firstname    address sex postcode
1  1    Smith      Anna 12 Mainstr   F  1234 AB
2  2    Smith    George 12 Mainstr   M  1234 AB
3  3  Johnson      Anna 61 Mainstr   F  1234 AB
4  4  Johnson   Charles 61 Mainstr   M  1234 AB
5  5  Johnson    Charly 61 Mainstr   M  1234 AB
6  6 Schwartz       Ben  1 Eaststr   M  6789 XY
> print(linkexample2)
  id lastname firstname       address  sex postcode
1  2    Smith    Gearge 12 Mainstreet <NA>  1234 AB
2  3   Jonson        A. 61 Mainstreet    F  1234 AB
3  4  Johnson   Charles    61 Mainstr    F  1234 AB
4  6 Schwartz       Ben        1 Main    M  6789 XY
5  7 Schwartz      Anna     1 Eaststr    F  6789 XY
> pairs <- pair_blocking(linkexample1, linkexample2, "postcode")
> compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"),
+ inplace = TRUE, comparators = list(lastname = cmp_jarowinkler(),
+ firstname = cmp_jarowinkler(), address = cmp_jarowinkler()))
> print(pairs)
First data set: 6 records
Second data set: 5 records
Total number of pairs: 17 pairs
Blocking on: 'postcode'
.x .y lastname firstname address sex
<int> <int> <num> <num> <num> <lgcl>
1: 1 1 1.000000 0.4722222 0.9230769 NA
2: 1 2 0.000000 0.5833333 0.8641026 TRUE
3: 1 3 0.447619 0.4642857 0.9333333 TRUE
4: 2 1 1.000000 0.8888889 0.9230769 NA
5: 2 2 0.000000 0.0000000 0.8641026 FALSE
6: 2 3 0.447619 0.5396825 0.9333333 FALSE
7: 3 1 0.447619 0.4722222 0.8641026 NA
8: 3 2 0.952381 0.5833333 0.9230769 TRUE
9: 3 3 1.000000 0.4642857 1.0000000 TRUE
10: 4 1 0.447619 0.6428571 0.8641026 NA
11: 4 2 0.952381 0.0000000 0.9230769 FALSE
12: 4 3 1.000000 1.0000000 1.0000000 FALSE
13: 5 1 0.447619 0.5555556 0.8641026 NA
14: 5 2 0.952381 0.0000000 0.9230769 FALSE
15: 5 3 1.000000 0.8492063 1.0000000 FALSE
16: 6 4 1.000000 1.0000000 0.6111111 TRUE
17: 6 5 1.000000 0.5277778 1.0000000 FALSE
One of the things we run into is that the variable sex has missing
values. We could set these to FALSE (this is what is done when calling
problink_em during estimation of the model), but with machine learning
we could also include these as a separate category. For that we first
need to define a custom comparison function.
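> # Codes: 0 = values differ, 1 = values are equal, 2 = at least one is missing.
> # The labels are attached in the order of levels 0:2, so code 0 (values
> # differ) is labelled "eq" and code 1 (values equal) is labelled "uneq".
> # The model only uses the labels as category names.
> na_as_class <- function(x, y) {
+   factor(
+     ifelse(is.na(x) | is.na(y), 2L, (y == x)*1L),
+     levels = 0:2, labels = c("eq", "uneq", "mis"))
+ }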
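To see how such a three-level comparison enters a regression model, the sketch below applies a copy of the function to a few made-up values (the vectors x and y are hypothetical and not part of the example data) and builds the corresponding design matrix with model.matrix from base R.

```r
# Copy of the comparison function defined above
na_as_class <- function(x, y) {
  factor(
    ifelse(is.na(x) | is.na(y), 2L, (y == x) * 1L),
    levels = 0:2, labels = c("eq", "uneq", "mis"))
}

# Made-up values, only for illustration
x <- c("F", "M", NA, "F")
y <- c("F", "F", "F", NA)
sex_cmp <- na_as_class(x, y)
print(sex_cmp)

# A factor with three levels enters the model as two dummy columns
# next to the intercept; the first level is the reference category
mm <- model.matrix(~ sex_cmp)
print(colnames(mm))
```

The missing comparisons get their own column, so the model can estimate a separate effect for them instead of treating them as disagreement.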
We then remove the old variable sex (otherwise compare_pairs will
complain that we cannot assign a factor to a logical vector) and compare
the pairs again with the new comparison function.
> pairs[, sex := NULL]
First data set: 6 records
Second data set: 5 records
Total number of pairs: 17 pairs
Blocking on: 'postcode'
.x .y lastname firstname address
<int> <int> <num> <num> <num>
1: 1 1 1.000000 0.4722222 0.9230769
2: 1 2 0.000000 0.5833333 0.8641026
3: 1 3 0.447619 0.4642857 0.9333333
4: 2 1 1.000000 0.8888889 0.9230769
5: 2 2 0.000000 0.0000000 0.8641026
6: 2 3 0.447619 0.5396825 0.9333333
7: 3 1 0.447619 0.4722222 0.8641026
8: 3 2 0.952381 0.5833333 0.9230769
9: 3 3 1.000000 0.4642857 1.0000000
10: 4 1 0.447619 0.6428571 0.8641026
11: 4 2 0.952381 0.0000000 0.9230769
12: 4 3 1.000000 1.0000000 1.0000000
13: 5 1 0.447619 0.5555556 0.8641026
14: 5 2 0.952381 0.0000000 0.9230769
15: 5 3 1.000000 0.8492063 1.0000000
16: 6 4 1.000000 1.0000000 0.6111111
17: 6 5 1.000000 0.5277778 1.0000000
> compare_pairs(pairs, on = c("lastname", "firstname", "address", "sex"),
+ inplace = TRUE, comparators = list(lastname = cmp_jarowinkler(),
+ firstname = cmp_jarowinkler(), address = cmp_jarowinkler(), sex = na_as_class))
> print(pairs)
First data set: 6 records
Second data set: 5 records
Total number of pairs: 17 pairs
Blocking on: 'postcode'
.x .y lastname firstname address sex
<int> <int> <num> <num> <num> <fctr>
1: 1 1 1.000000 0.4722222 0.9230769 mis
2: 1 2 0.000000 0.5833333 0.8641026 uneq
3: 1 3 0.447619 0.4642857 0.9333333 uneq
4: 2 1 1.000000 0.8888889 0.9230769 mis
5: 2 2 0.000000 0.0000000 0.8641026 eq
6: 2 3 0.447619 0.5396825 0.9333333 eq
7: 3 1 0.447619 0.4722222 0.8641026 mis
8: 3 2 0.952381 0.5833333 0.9230769 uneq
9: 3 3 1.000000 0.4642857 1.0000000 uneq
10: 4 1 0.447619 0.6428571 0.8641026 mis
11: 4 2 0.952381 0.0000000 0.9230769 eq
12: 4 3 1.000000 1.0000000 1.0000000 eq
13: 5 1 0.447619 0.5555556 0.8641026 mis
14: 5 2 0.952381 0.0000000 0.9230769 eq
15: 5 3 1.000000 0.8492063 1.0000000 eq
16: 6 4 1.000000 1.0000000 0.6111111 uneq
17: 6 5 1.000000 0.5277778 1.0000000 eq
In order to estimate the model we need some pairs for which we know the truth. One way of obtaining this information is by reviewing some of the pairs. The number of pairs will generally grow as O(N^2), with N the size of the smallest dataset. The number of matches in these pairs is usually O(N). The fraction of matches among the pairs is therefore O(1/N), which is usually very small. When sampling pairs for review it is therefore usually a good idea not to sample the pairs completely at random but, for example, to oversample pairs that agree on more variables.
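A minimal sketch of this argument in base R is given below. Everything in it is made up for the illustration: the data frame pairs_demo, the assumption that every record has exactly one true match, the simulated agreement count n_agree and the sampling weights. It only shows that the fraction of matches among all pairs equals 1/N and how a weighted sample can favour pairs that agree on more variables.

```r
set.seed(1)

# Hypothetical setting: two datasets of N records, one true match per record
N <- 200
pairs_demo <- expand.grid(x = seq_len(N), y = seq_len(N))
pairs_demo$match <- pairs_demo$x == pairs_demo$y

# N matches among N^2 pairs: the fraction of matches is 1/N
frac_matches <- mean(pairs_demo$match)
print(frac_matches)

# Simulated number of agreeing variables (0 to 4); matches agree more often
pairs_demo$n_agree <- rbinom(nrow(pairs_demo), size = 4,
  prob = ifelse(pairs_demo$match, 0.9, 0.1))

# Draw pairs for review with a probability that increases with agreement
w <- pairs_demo$n_agree + 1
idx <- sample(nrow(pairs_demo), size = 500, prob = w)
print(mean(pairs_demo$match[idx]))
```

When pairs are sampled with unequal probabilities like this, the review sample is no longer representative of all pairs, which should be kept in mind when the reviewed pairs are used to estimate error rates.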
Another way of getting a training dataset arises when additional
information is available. For example, when linking a dataset to a
population register, an official id might be available for some of the
records in the dataset. For these records the true match status can be
determined. This is what we will simulate in the example below. Let’s
assume we know the id for three of the records in linkexample2:
> linkexample2$known_id <- linkexample2$id
> linkexample2$known_id[c(2,5)] <- NA
> setDT(linkexample2)
We then know the true match status in the pairs for these records. Below we add this to the pairs:
> compare_vars(pairs, "y", on_x = "id", on_y = "known_id", y = linkexample2, inplace = TRUE)
Note that we supply y = linkexample2 in the call. This is needed as the
copy of linkexample2 stored with pairs does not contain the known_id
column. We can also add the true status for all records to measure the
performance of the linkage in the end:
> compare_vars(pairs, "y_true", on_x = "id", on_y = "id", inplace = TRUE)
> print(pairs)
First data set: 6 records
Second data set: 5 records
Total number of pairs: 17 pairs
Blocking on: 'postcode'
.x .y lastname firstname address sex y y_true
<int> <int> <num> <num> <num> <fctr> <lgcl> <lgcl>
1: 1 1 1.000000 0.4722222 0.9230769 mis FALSE FALSE
2: 1 2 0.000000 0.5833333 0.8641026 uneq NA FALSE
3: 1 3 0.447619 0.4642857 0.9333333 uneq FALSE FALSE
4: 2 1 1.000000 0.8888889 0.9230769 mis TRUE TRUE
5: 2 2 0.000000 0.0000000 0.8641026 eq NA FALSE
6: 2 3 0.447619 0.5396825 0.9333333 eq FALSE FALSE
7: 3 1 0.447619 0.4722222 0.8641026 mis FALSE FALSE
8: 3 2 0.952381 0.5833333 0.9230769 uneq NA TRUE
9: 3 3 1.000000 0.4642857 1.0000000 uneq FALSE FALSE
10: 4 1 0.447619 0.6428571 0.8641026 mis FALSE FALSE
11: 4 2 0.952381 0.0000000 0.9230769 eq NA FALSE
12: 4 3 1.000000 1.0000000 1.0000000 eq TRUE TRUE
13: 5 1 0.447619 0.5555556 0.8641026 mis FALSE FALSE
14: 5 2 0.952381 0.0000000 0.9230769 eq NA FALSE
15: 5 3 1.000000 0.8492063 1.0000000 eq FALSE FALSE
16: 6 4 1.000000 1.0000000 0.6111111 uneq TRUE TRUE
17: 6 5 1.000000 0.5277778 1.0000000 eq NA FALSE
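The two new columns can be mimicked in base R. The sketch below is not how reclin2 implements compare_vars; it only reproduces the y and y_true columns from the ids and the pair indices printed above. The vector names (id_x, id_y, known_id, x_idx, y_idx) are made up for the illustration.

```r
# Ids of the records in both datasets, as printed earlier
id_x <- 1:6
id_y <- c(2L, 3L, 4L, 6L, 7L)

# The id is assumed to be known for records 1, 3 and 4 of the second dataset
known_id <- id_y
known_id[c(2, 5)] <- NA

# Row indices of the 17 pairs (columns .x and .y of pairs)
x_idx <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6)
y_idx <- c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 5)

# Comparing with a missing id gives NA: the match status is unknown
y <- id_x[x_idx] == known_id[y_idx]
y_true <- id_x[x_idx] == id_y[y_idx]

print(data.frame(.x = x_idx, .y = y_idx, y = y, y_true = y_true))
```

Pairs that involve a record without a known id get NA for y and are therefore left out when the model is estimated.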
We now have all of the information needed to estimate our (machine learning) model. Note that this will give a number of warnings, as we are estimating six parameters with only eleven observations, and the parameters will not be reliably estimated.
> m <- glm(y ~ lastname + firstname + address + sex, data = pairs, family = binomial())
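The number of observations and parameters can be checked directly. The sketch below copies the comparison vectors from the printed pairs into a plain data.frame (called pairs_df here, a name introduced only for this illustration) so that it runs without reclin2, fits the same formula and counts both. suppressWarnings is used only to keep the output short; the warnings themselves are the expected sign that the estimates are unreliable.

```r
# Comparison vectors copied from the printed pairs above
pairs_df <- data.frame(
  lastname = c(1, 0, 0.447619, 1, 0, 0.447619, 0.447619, 0.952381, 1,
    0.447619, 0.952381, 1, 0.447619, 0.952381, 1, 1, 1),
  firstname = c(0.4722222, 0.5833333, 0.4642857, 0.8888889, 0, 0.5396825,
    0.4722222, 0.5833333, 0.4642857, 0.6428571, 0, 1, 0.5555556, 0,
    0.8492063, 1, 0.5277778),
  address = c(0.9230769, 0.8641026, 0.9333333, 0.9230769, 0.8641026,
    0.9333333, 0.8641026, 0.9230769, 1, 0.8641026, 0.9230769, 1,
    0.8641026, 0.9230769, 1, 0.6111111, 1),
  sex = factor(c("mis", "uneq", "uneq", "mis", "eq", "eq", "mis", "uneq",
    "uneq", "mis", "eq", "eq", "mis", "eq", "eq", "uneq", "eq"),
    levels = c("eq", "uneq", "mis")),
  y = c(FALSE, NA, FALSE, TRUE, NA, FALSE, FALSE, NA, FALSE, FALSE, NA,
    TRUE, FALSE, NA, FALSE, TRUE, NA)
)

# Pairs with unknown status (y is NA) are dropped by glm
m_df <- suppressWarnings(
  glm(y ~ lastname + firstname + address + sex, data = pairs_df,
    family = binomial()))

print(nobs(m_df))          # pairs with a known match status
print(length(coef(m_df)))  # intercept, three similarities, two sex dummies
```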
We can then add the predictions to pairs and check how well we have
done:
> pairs[, prob := predict(m, type = "response", newdata = pairs)]
First data set: 6 records
Second data set: 5 records
Total number of pairs: 17 pairs
Blocking on: 'postcode'
.x .y lastname firstname address sex y y_true prob
<int> <int> <num> <num> <num> <fctr> <lgcl> <lgcl> <num>
1: 1 1 1.000000 0.4722222 0.9230769 mis FALSE FALSE 2.220446e-16
2: 1 2 0.000000 0.5833333 0.8641026 uneq NA FALSE 1.000000e+00
3: 1 3 0.447619 0.4642857 0.9333333 uneq FALSE FALSE 7.317210e-12
4: 2 1 1.000000 0.8888889 0.9230769 mis TRUE TRUE 1.000000e+00
5: 2 2 0.000000 0.0000000 0.8641026 eq NA FALSE 2.220446e-16
6: 2 3 0.447619 0.5396825 0.9333333 eq FALSE FALSE 2.220446e-16
7: 3 1 0.447619 0.4722222 0.8641026 mis FALSE FALSE 2.220446e-16
8: 3 2 0.952381 0.5833333 0.9230769 uneq NA TRUE 2.214629e-12
9: 3 3 1.000000 0.4642857 1.0000000 uneq FALSE FALSE 2.220446e-16
10: 4 1 0.447619 0.6428571 0.8641026 mis FALSE FALSE 1.665098e-11
11: 4 2 0.952381 0.0000000 0.9230769 eq NA FALSE 2.220446e-16
12: 4 3 1.000000 1.0000000 1.0000000 eq TRUE TRUE 1.000000e+00
13: 5 1 0.447619 0.5555556 0.8641026 mis FALSE FALSE 2.220446e-16
14: 5 2 0.952381 0.0000000 0.9230769 eq NA FALSE 2.220446e-16
15: 5 3 1.000000 0.8492063 1.0000000 eq FALSE FALSE 4.477438e-11
16: 6 4 1.000000 1.0000000 0.6111111 uneq TRUE TRUE 1.000000e+00
17: 6 5 1.000000 0.5277778 1.0000000 eq NA FALSE 2.220446e-16
> pairs[, select := prob > 0.5]
First data set: 6 records
Second data set: 5 records
Total number of pairs: 17 pairs
Blocking on: 'postcode'
.x .y lastname firstname address sex y y_true prob
<int> <int> <num> <num> <num> <fctr> <lgcl> <lgcl> <num>
1: 1 1 1.000000 0.4722222 0.9230769 mis FALSE FALSE 2.220446e-16
2: 1 2 0.000000 0.5833333 0.8641026 uneq NA FALSE 1.000000e+00
3: 1 3 0.447619 0.4642857 0.9333333 uneq FALSE FALSE 7.317210e-12
4: 2 1 1.000000 0.8888889 0.9230769 mis TRUE TRUE 1.000000e+00
5: 2 2 0.000000 0.0000000 0.8641026 eq NA FALSE 2.220446e-16
6: 2 3 0.447619 0.5396825 0.9333333 eq FALSE FALSE 2.220446e-16
7: 3 1 0.447619 0.4722222 0.8641026 mis FALSE FALSE 2.220446e-16
8: 3 2 0.952381 0.5833333 0.9230769 uneq NA TRUE 2.214629e-12
9: 3 3 1.000000 0.4642857 1.0000000 uneq FALSE FALSE 2.220446e-16
10: 4 1 0.447619 0.6428571 0.8641026 mis FALSE FALSE 1.665098e-11
11: 4 2 0.952381 0.0000000 0.9230769 eq NA FALSE 2.220446e-16
12: 4 3 1.000000 1.0000000 1.0000000 eq TRUE TRUE 1.000000e+00
13: 5 1 0.447619 0.5555556 0.8641026 mis FALSE FALSE 2.220446e-16
14: 5 2 0.952381 0.0000000 0.9230769 eq NA FALSE 2.220446e-16
15: 5 3 1.000000 0.8492063 1.0000000 eq FALSE FALSE 4.477438e-11
16: 6 4 1.000000 1.0000000 0.6111111 uneq TRUE TRUE 1.000000e+00
17: 6 5 1.000000 0.5277778 1.0000000 eq NA FALSE 2.220446e-16
select
<lgcl>
1: FALSE
2: TRUE
3: FALSE
4: TRUE
5: FALSE
6: FALSE
7: FALSE
8: FALSE
9: FALSE
10: FALSE
11: FALSE
12: TRUE
13: FALSE
14: FALSE
15: FALSE
16: TRUE
17: FALSE
> table(pairs$select, pairs$y_true)
        FALSE TRUE
  FALSE    12    1
  TRUE      1    3
Given the small size of the dataset we have to estimate the model on, this is not too bad.
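The table can be summarised with the usual quality measures. The sketch below types in the counts from the table above (rows: selected or not, columns: true match status) and computes precision and recall; the object names are chosen for this illustration only.

```r
# Counts copied from the table above; matrix() fills column by column
conf <- matrix(c(12, 1, 1, 3), nrow = 2,
  dimnames = list(selected = c("FALSE", "TRUE"), truth = c("FALSE", "TRUE")))

# Precision: fraction of the selected pairs that are true matches
precision <- conf["TRUE", "TRUE"] / sum(conf["TRUE", ])

# Recall: fraction of the true matches that were selected
recall <- conf["TRUE", "TRUE"] / sum(conf[, "TRUE"])

print(c(precision = precision, recall = recall))
```

Here three of the four selected pairs are true matches and three of the four true matches are selected, so both measures equal 0.75.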
We now know which pairs are to be linked, but we still have to actually
link them. link does that (the optional arguments all_x and all_y
control the type of linkage):
> linked_data_set <- link(pairs, selection = "select", all_y = TRUE)
> print(linked_data_set)
Total number of pairs: 5 pairs
Key: <.y>
.y .x id.x lastname.x firstname.x address.x sex.x postcode.x id.y
<int> <int> <int> <fctr> <fctr> <fctr> <fctr> <fctr> <int>
1: 1 2 2 Smith George 12 Mainstr M 1234 AB 2
2: 2 1 1 Smith Anna 12 Mainstr F 1234 AB 3
3: 3 4 4 Johnson Charles 61 Mainstr M 1234 AB 4
4: 4 6 6 Schwartz Ben 1 Eaststr M 6789 XY 6
5: 5 NA NA <NA> <NA> <NA> <NA> <NA> 7
lastname.y firstname.y address.y sex.y postcode.y
<fctr> <fctr> <fctr> <fctr> <fctr>
1: Smith Gearge 12 Mainstreet <NA> 1234 AB
2: Jonson A. 61 Mainstreet F 1234 AB
3: Johnson Charles 61 Mainstr F 1234 AB
4: Schwartz Ben 1 Main M 6789 XY
5: Schwartz Anna 1 Eaststr F 6789 XY
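The result above keeps every record of the second dataset and fills the columns of the first dataset with NA where no pair was selected. As a rough analogy only, and not a description of how link is implemented, the same shape can be obtained in base R with two calls to merge. The data frames and the column names x_row and y_row below are made up for the illustration; the selected pairs are the four pairs with select equal to TRUE.

```r
# Reduced copies of both datasets with their row numbers
x <- data.frame(x_row = 1:6,
  firstname_x = c("Anna", "George", "Anna", "Charles", "Charly", "Ben"))
y <- data.frame(y_row = 1:5,
  firstname_y = c("Gearge", "A.", "Charles", "Ben", "Anna"))

# The selected pairs as row numbers in x and y
selected <- data.frame(x_row = c(1L, 2L, 4L, 6L), y_row = c(2L, 1L, 3L, 4L))

# Attach the x records, then keep all y records (comparable to all_y = TRUE)
linked <- merge(merge(selected, x, by = "x_row"), y, by = "y_row",
  all.y = TRUE)
print(linked)
```

Record 5 of the second dataset has no selected pair and therefore ends up with missing values for the columns that come from the first dataset.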