We are going to work with the dataset town_name
included
in the package. The dataset contains a collection of town names as
observed in administrative dataset. The first column name
contains the names as observed. The second column
official_name
the official town name. We are going to
assume that the second column is not available (or only for a part of
the observations). The goal is to recode the 584 town names into a
smaller set of town names knowing that most of the observed town names
are actually misspelled versions of a smaller set of town names. We
could also have solved the problem differently by linking the observed
town names to a dataset containing all official town names. Often
cleaning up these kind of misspellings is a first step in an actual
linkage process. By first cleaning up the town names, subsequent use of
the variable is easier and can lead to better quality linkage.
> library(reclin2)
: data.table
Loading required package
> data(town_names)
> head(town_names)
name official_name1 alblasserdam Alblasserdam
2 amsterdam Amsterdam
3 amsterdam-z.o. Amsterdam
4 amsterdam-zuidoost Amsterdam
5 amsterdam z-o Amsterdam
6 amsterdam z.o. Amsterdam
When performing deduplication we will link a dataset to itself and
will try to link different records belonging to the same object. When a
dataset to itself, it is not necessary to both compare record i
to j and j to i and we certainly do not want
to compare a record to itself. The option deduplication
of
the pair_
functions makes sure that only the needed pairs
are generated. This is a small dataset so we can easily generate all
pairs:
> pairs <- pair(town_names, deduplication = TRUE)
> print(pairs)
: 584 records
First data set: 584 records
Second data set: 170 236 pairs
Total number of pairs
: <.x, .y>
Key
.x .y<int> <int>
1: 1 2
2: 1 3
3: 1 4
4: 1 5
5: 1 6
---
170232: 581 583
170233: 581 584
170234: 582 583
170235: 582 584
170236: 583 584
We will compare the records on name
and use a string
similarity function.
> compare_pairs(pairs, on = "name",
+ comparators = list(cmp_jarowinkler()),
+ inplace = TRUE)
> print(pairs)
: 584 records
First data set: 584 records
Second data set: 170 236 pairs
Total number of pairs
: <.x, .y>
Key
.x .y name<int> <int> <num>
1: 1 2 0.6679894
2: 1 3 0.5753968
3: 1 4 0.5383598
4: 1 5 0.5882173
5: 1 6 0.5753968
---
170232: 581 583 0.5219298
170233: 581 584 0.5228070
170234: 582 583 0.5351852
170235: 582 584 0.7092593
170236: 583 584 0.9296296
Now comes the difficult part: selecting a threshold. The problem is
that it is not really possible to say beforehand what an appropriate
threshold is. That depends on the exact problem and also depends on the
number of different objects that are expected. To explain that, first a
short explanation how the deduplicate_equivalence
function
that we are going to use later works. Let’s assume we have two actual
town names and using our string similarity function we select pairs that
differ one letter from each other, so we end up with the following set
of pairs as an example
rotterdam -> rottrdam
rotterdam -> rotterdm
rotterdm -> rottrdm
rtterdam -> rotterdam
amsterdam -> amstrdam
amstrdam -> amstdam
amsterdm -> amsterdam
That means that we are saying that rotterdam
is the same
object as rottrdam
which is the same object as
rottrdm
. Therefore, rotterdam
and
rottrdm
are the same object although we didn’t select a
pair rotterdam -> rottrdm
. So all names
rotterdam
, rottrdam
, rotterdm
,
rottrdm
and rtterdam
are going to be in one
class. When the number of misspelled names increases and when the number
of actual town names increases, the likelihood that two names that do
not belong to the same object are linked by a chain of pairs increases.
This is a bit like the game where you have to change one word into
another in a given number of steps by changing one letter at a time (the
words in between have to be valid words). When the vocabulary is bigger
this becomes easier. Therefore, the optimal threshold depends on the
number of actual town names and the number of misspellings.
We have the official names and can therefore measure how many errors
we make. We make an error when we put two records from x
in
the same group while they actually belong to different object (official
town names). First we add a variable indicating whether two pairs have
the same official name:
> compare_vars(pairs, "true", on_x = "official_name",
+ inplace = TRUE)
In practice this information is not available, but it might be available for a subset of records, for example, after manual inspection of a subset of the pairs. We now round the similarity scores and count how many errors we make for each value of the similarity score threshold:
> pairs$threshold <- trunc(pairs$name/0.05)*0.05
> thresholds <- pairs[, .(ftrue = mean(true)), by = threshold]
> print(thresholds[order(ftrue)])
: 16 pairs
Total number of pairs
threshold ftrue<num> <num>
1: 0.25 0.000000000
2: 0.30 0.001727116
3: 0.35 0.014810045
4: 0.50 0.034417054
5: 0.45 0.036163208
6: 0.00 0.055555556
7: 0.40 0.055960708
8: 0.55 0.057551227
9: 0.60 0.176160406
10: 0.65 0.365251342
11: 0.70 0.573084217
12: 0.75 0.772509347
13: 0.80 0.895522388
14: 0.85 0.941742311
15: 0.90 0.995657100
16: 0.95 1.000000000
For a threshold of 0.95 and 1.00 we make no errors. Below that we start making errors. So let’s work with a threshold of 0.95 for now
> select_threshold(pairs, "select", "name", threshold = 0.95,
+ inplace = TRUE)
> res <- deduplicate_equivalence(pairs, "group", "select")
> print(res)
name official_name group<fctr> <fctr> <int>
1: alblasserdam Alblasserdam 541
2: amsterdam Amsterdam 427
3: amsterdam-z.o. Amsterdam 427
4: amsterdam-zuidoost Amsterdam 558
5: amsterdam z-o Amsterdam 427
---
580: amsterdam (duivendrecht) Amsterdam-Duivendrecht 580
581: amsterdam zzuidoost Amsterdam 558
582: hoovliet (rotterdam) Hoogvliet Rotterdam 568
583: roettrdam Rotterdam 496
584: srotterdam Rotterdam 496
With deduplicate_equivalence
we take all selected pairs
(indicated by the column select
) and put them in the same
group.
res
now contains the original dataset with a
group
column added that indicates the unique objects (towns
in this case). We can see how many towns we have in the resulting
dataset:
> length(unique(res$group))
1] 162 [
This is quite large. We started with 584 town names and reduced that to 162 while there are actually 19 town names. We can measure the quality by counting how often we have more than one official town name in one group:
> qual <- res[, .(errors = length(unique(official_name))-1, n = .N), by = group]
> qual$ferrors <- qual$errors/qual$n
> qual[errors > 0]
data.table (0 rows and 4 cols): group,errors,n,ferrors Empty
So we have a large number of groups and no errors: no town names have been classified in the same group while actually being different towns. We can check what happens when we decrease the threshold. We will probably introduce some errors while we decrease the number of groups:
> # Create a sequence of thresholds and initialise the result vectors
> thresholds <- seq(0.5, 1, by = 0.02)
> sizes <- numeric(length(thresholds))
> nerrors <- numeric(length(thresholds))
> for (i in seq_along(thresholds)) {
+ threshold <- thresholds[i]
+ # Perform deduplication with the given threshold
+ select_threshold(pairs, "select", "name", threshold = threshold, inplace = TRUE)
+ res <- deduplicate_equivalence(pairs, "group", "select")
+ # Count the number of unique groups
+ sizes[i] <- length(unique(res$group))
+ # Count the number of errors
+ qual <- res[, .(errors = length(unique(official_name))-1, n = .N), by = group]
+ nerrors[i] <- sum(qual$errors)
+ }
The results are plotted in the figure below.
We can see that as the threshold decreases the number of errors increases and the number of groups decreases. We cannot get much less than the 161 groups we found without introducing some errors. How many errors and/or groups are acceptable depends on the application and the amount of time one s willing to spend in manually merging the groups. In this case manually inspecting the groups and merging them will probably take only a few hours and
With a threshold of 0.9 we should get approximately 100 groups and 5 errors which seems a reasonable trade-off. So, let’s rerun some of the previous code with a threshold of 0.90.
> select_threshold(pairs, "select", "name", threshold = 0.9,
+ inplace = TRUE)
> res <- deduplicate_equivalence(pairs, "group", "select")
> qual <- res[, .(errors = length(unique(official_name))-1, n = .N), by = group]
> qual$ferrors <- qual$errors/qual$n
> qual[errors > 0]
group errors n ferrors<int> <num> <int> <num>
1: 543 1 256 0.00390625
2: 571 1 7 0.14285714
3: 441 1 78 0.01282051
4: 578 1 10 0.10000000
5: 296 1 6 0.16666667
One way of assigning names to the groups we derived, is to use the most frequent name used in the group. Assuming that most people will correctly spell the town names this should give us the official town name belonging to each group. In this example dataset each town name occurs only once so can’t use that trick. However, we can use the most frequent official name. We first define a function that returns the most frequent value of a vector and use that to derive the name of the group.
> most_frequent <- function(x) {
+ t <- table(x)
+ t <- sort(t)
+ tail(names(t), 1)
+ }
> res[, assigned_name := most_frequent(official_name), by = group]
> print(res)
name official_name group<fctr> <fctr> <int>
1: alblasserdam Alblasserdam 453
2: amsterdam Amsterdam 543
3: amsterdam-z.o. Amsterdam 543
4: amsterdam-zuidoost Amsterdam 543
5: amsterdam z-o Amsterdam 543
---
580: amsterdam (duivendrecht) Amsterdam-Duivendrecht 580
581: amsterdam zzuidoost Amsterdam 543
582: hoovliet (rotterdam) Hoogvliet Rotterdam 156
583: roettrdam Rotterdam 441
584: srotterdam Rotterdam 441
assigned_name<char>
1: Alblasserdam
2: Amsterdam
3: Amsterdam
4: Amsterdam
5: Amsterdam
---
580: Amsterdam-Duivendrecht
581: Amsterdam
582: Hoogvliet Rotterdam
583: Rotterdam
584: Rotterdam
We can now also look at the errors:
> print(res[assigned_name != official_name])
name official_name group assigned_name<fctr> <fctr> <int> <char>
1: rotterdam hoogvliet Hoogvliet Rotterdam 441 Rotterdam
2: rotterdam-hoogvliet Hoogvliet Rotterdam 441 Rotterdam
3: nw. amsterdam Nieuw Amsterdam 543 Amsterdam
4: rotterdam - hoogvliet Hoogvliet Rotterdam 441 Rotterdam
5: amsterdam (driemond) Amsterdam 578 Diemen
6: nw-amsterdam Nieuw Amsterdam 543 Amsterdam
7: nw amsterdam Nieuw Amsterdam 543 Amsterdam
8: amsterdam- driemond Amsterdam 578 Diemen
9: amsterdam-driemond Amsterdam 578 Diemen
10: rotterdam -hoogvliet Hoogvliet Rotterdam 441 Rotterdam
11: rotterdam/hoogvliet Hoogvliet Rotterdam 441 Rotterdam
12: rotterdam hoogvliet Hoogvliet Rotterdam 441 Rotterdam
13: nw.amsterdam Nieuw Amsterdam 543 Amsterdam
14: amsterdam driemond Amsterdam 578 Diemen
15: veerdam Veendam 571 Leerdam
16: rotterdam(hoogvliet) Hoogvliet Rotterdam 441 Rotterdam
17: rotterdam (hoogvliet) Hoogvliet Rotterdam 441 Rotterdam
18: rotterdam-hoogvlie Hoogvliet Rotterdam 441 Rotterdam
19: rotterdam (prinsen Rotterdam 296 Pernis Rotterdam
20: rotterdam- hoogvliet Hoogvliet Rotterdam 441 Rotterdam
21: rotterdam hoogvlie Hoogvliet Rotterdam 441 Rotterdam
22: amsterdam-driemand Amsterdam 578 Diemen
name official_name group assigned_name
We see that we make a lot of errors with the town of
Hoogvliet Rotterdam
. The problem we have is a difficult
one. For example, rotterdam charlois
should be called
Rotterdam
while rotterdam hoogvliet
should be
called Hoogvliet Rotterdam
. We can’t really expect that a
computer is able to distinguish between these two without additional
information. One other way of solving this problem is actually consider
this as a linkage problem: we want to link a set of written town names
to an official set of town names.