-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path2_record_linker.tex
514 lines (406 loc) · 36.3 KB
/
2_record_linker.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
\begin{savequote}[75mm]
--No matter what he does, every person on earth plays a central role in the history of the world. And normally he doesn't know it.
\qauthor{Paulo Coelho, The Alchemist}
\end{savequote}
% % % % % % % % % % % % % % % % % % % % % %
\chapter{Record Linkage Pipeline}
\label{ch:record_linker}
\chaplettrine{I}{n this chapter}, we will give an overview of the record linkage system that was developed to solve issues that were discussed in \cref{sec:problem_definition}.
It is also the basis for the experiments described in \cref{ch:experiments}.
The methodology described is not much different from techniques that historians would use in a non-automated manner.
Conceptually, it involves the extraction of occurrences of names, the grouping of promising candidate pairs, the comparison of the records therein, followed by tagging each candidate pair as a \emph{match} or \emph{non-match}.
This chapter describes how these steps are performed in an automated fashion in the order they are processed in the pipeline.
The strength of an automated system lies in its ability to both extract the required statistics from the data easily and use it to its advantage rapidly in the classification step.
Viewing the dataset as a whole can provide further insight, whereas this would be infeasible for humans to perform.
As we will see, the record linkage procedure is constructed around the idea that matching records should be similar, i.e., they contain similar values.
The classification step utilizes this assumption by attempting to cluster records that are similar with respect to their individual fields.
However, since the data is assumed to be unannotated, the references must first be extracted from text and segmented.
After the classification of record pairs the final results are analyzed in order to find more links and resolve inconsistencies.
Many choices can be made regarding the individual components and their parameters, causing a combinatorial explosion of possible settings.
To keep things in perspective, we have chosen to give an overview of methods that are fundamentally different in their approach.
These methods are then evaluated in isolation in \cref{ch:experiments} so as to study their applicability for entity disambiguation in a historical context.
% % % % % % % % % % % % % % % % % % % % % %
\section{Named-entity extraction}
\label{sec:segmentation}
Traditional record linkage techniques naturally assume that the input consist of structured records (i.e. tuples).
Our goal, however, is to perform entity resolution in documents consisting of natural language.
The first step is therefore to extract structured records from the input data.
We have considered using existing tools from the field of Natural Language Processing (NLP), but these posed several issues.
The task is often referred to as \emph{Named-Entity Recognition} (NER) and is usually solved with either grammars or statistical (machine learning) approaches.
One of the problems with existing grammar-based approaches is that they often assume modern English and have problems with older English, making them less suitable for a generic system.
Statistical techniques generally rely on the availability of labeled data for training, which is often not the case.
Luckily, the variant of NER we are trying to achieve here is rather simple as it only involves the detection of people in text and not, for example, place names.
Also, person names have a well-defined structure and use capitalization, making it a lot easier to detect them.
The method that we settled on is a simplified version of a context-free grammar-based approach and mainly requires a list of first names as a \emph{seed}.
The input is processed as a sequence of tokens, which is generated simply by splitting the text on spaces after removing punctuation; no further processing on the tokens is performed.
After that, tokens can be searched for first names, ignoring tokens that are not recognized as such.
First names are used as anchor points in the grammar by applying rules that try to match tokens based on their relative position to the names.
By applying grammar rules in this way, surnames can effectively be learned.
These in turn could be used as an anchor point, similar to the first names, which yields a new set of first names that were unknown before.
Repeated application of this procedure can consequently be used to expand a relatively small list of known names into a bigger one in a process called \emph{bootstrapping}.
Caution should be taken, however, to prevent errors made in one iteration of the bootstrapping, since they can propagate to a next iteration.
It is therefore best to manually inspect the learned names after each iteration if the size of the list allows for it.
An advantage of a manually crafted grammar, like the one proposed, is that tokens are classified as part of the parsing, which effectively segments the references into structured records.
These can then be further processed using traditional record linkage techniques.
% % % % % % % % % % % % % % % % % % % % % %
\section{Blocking}
\label{sec:blocking}
As stated in \cref{sec:challenges}, for larger datasets it becomes rapidly infeasible to compare all possible pairs of records within a single database, since it grows quadratically.
The total number of possible record links, and thus comparisons in a naive setting, within a dataset of size $n$ is equal to $n\times(n-1)/2$.
Because the similarity between records that we are computing is symmetric, it only needs to be computed once.
Since the comparison step is often the most expensive step of the record linkage pipeline, a technique called \emph{blocking} is often deployed.
The goal of blocking is to efficiently partition the set of all possible pairs into several mutually exclusive sets.
Only the records within a set are compared, effectively reducing the number of comparisons that is required.
Note that even though the goal of record linkage itself is to create such a partitioning, in blocking this process is rather coarse-grained, and involves computationally cheap operations.
The effectiveness of blocking depends on its ability to partition the references in blocks of roughly equal size, which is often a problem.
Another problem arises when the blocking function is unable to group records together that should have been grouped.
In this case the records will not be compared and are therefore incorrectly classified as non-matches and therefore reduces the system's recall.
Blocking also indirectly affects the precision of the system since it allows for more expensive, and thus more thorough, candidate pair comparisons, which in theory yield more accurate results \citep{Baxter2003}.
Though it is not the main focus of this thesis to study blocking techniques, it is often an important step in the process and we will therefore discuss it shortly.
As we shall see in \cref{sec:standard_blocking}, simple blocking techniques can easily be implemented with often only a slight reduction in recall while vastly reducing the number of required comparisons.
% % % % % % % % % % % % % % % % % % % % % %
\subsection{Standard blocking}
\label{sec:standard_blocking}
A straight-forward approach to blocking has already been touched on and is often referred to as \emph{standard blocking}.
This method of blocking is comparable to a hash function that maps a set of elements to distinct partitions indexed with a \emph{key}.
In the case of standard blocking, the function is based on the value that is contained in a field.
For example, we can consider the very simple function that takes the first $k$ characters of a \texttt{first name} field and uses it as key.
The operation executed by the function is inexpensive and will map first names to a fairly large number of blocks.
Assuming that the least errors occur at the beginning of a name, it may also exclude a large number of relevant records when a high setting for $k$ is chosen.
If this is not the case, however, it may result in a lower specificity.
Depending on the content and quality of the data and the cost of a missing a true match, the user of the system should therefore make a decision on the setting of $k$.
Various blocking techniques can be parametrized as to allow a user to choose an appropriate trade-off between sensitivity and specificity.
\begin{figure}
\centering
%\input{plots/blocking_example.tex}
\includegraphics[width=\textwidth,height=\gratio\textwidth]{plots/blocking_example.tikz}
\caption{The number of comparisons grows quadratically with the number of records and can greatly be reduced using a blocking technique. The red curve shows theoretical optimum of blocking technique that uses the first letter of the first name as a key if it results in equally sized blocks. The black curve is computed using a realistic dataset of the same size.}
\label{fig:blocking_example}
\end{figure}
In \cref{fig:blocking_example} the number of candidate pair comparisons is plotted against the number of input records on a logarithmic scale.
If no blocking scheme is utilized, the number of required comparisons quickly grows, resulting in the blue line.
Even with the very simple blocking scheme that uses the first character of the name, we can expect the number of comparisons to be much less than without a blocking method.
In the case of a uniform spread of starting characters, which is rather optimistic, we obtain the red line.
The black line shows the number of comparisons against the total number of input records as computed using an empirical distribution of starting characters derived from the Gascon Rolls \citep{GasconRolls}.
It can be seen that for moderately large to large datasets, even with this simple blocking scheme, the number of comparisons are reduced by an order of magnitude.
Blocking keys can also be combined into more specific keys by concatenating them \citep{Christen2012} or by using different keys in multiple passes \citep{Baxter2003}.
Parameter settings for the individual functions can be set to more conservative so that larger blocks are obtained.
The idea is that by looking at various aspects of the records in such a way, errors caused by too specific blocking settings can be mitigated.
A disadvantage is that many different combinations are often possible, making it hard to decide on a setting.
Trying too many settings defeats the purpose of blocking, so other strategies might have to be considered.
% % % % % % % % % % % % % % % % % % % % % %
\subsection{Sorted neighborhood}
\label{sec:sorted_neighborhood}
The previously described blocking strategy relies on a mapping function that maps records that are ``close enough'' to the same key.
Constructing such a function can be hard to accomplish and alternatives may be required.
Consider, for example, a function that is to map age values to a key, and assume that there is an error caused by the submitter of the data caused by the guessing of the age.
In this case it is natural to compare records that contain approximately the same age value, but constructing a single key is not a solution.
The \emph{sorted neighborhood} blocking technique \citep{Hernandez1995} makes use of the fact that it is much easier to detect records of approximately the same age if the records are first sorted on this field.
Constructing blocks of records of similar ages can then be achieved by moving a window of size $w$ over the sorted records.
After each iteration, the window moves one record further, requiring only the newly observed record to be compared with the previous $w-1$ records.
Completing a full pass thus requires $\mathcal{O}(wn)$ comparisons.
Since for large databases it holds that $w \ll n$, the computational complexity of this process is $\mathcal{O}(n)$.
Sorting of the databases can be achieved in $\mathcal{O}(n \log n)$.
A problem with the sorted neighborhood approach is that some values may occur very frequently, causing records that have the same sorting key to fall outside the window and are therefore not matched.
An example of this is given in \cref{tab:blocking_example}, where the first row containing the last name ``Asimov'' falls outside the window that contains the last mention.
One simple solution to overcome this problem is to group records that have the same sorting key together using an inversed index.
The window is then moved over the inverted index instead \citep{Christen2012}.
In this case the value of $w$ must be reconsidered, because many more records are now covered within a window.
Experimental evaluation of this modification has shown that it can lead to more true matches being included in the set of candidate record pairs and thus to a higher recall \citep{Christen2012}.
\begin{table}
\caption{Graphical illustration of the sorted neighborhood blocking method. Here, the records were first sorted by last name and then a window of length three was moved across them.}
\label{tab:blocking_example}
\centering
\begin{tabular}{l l l l}
\toprule
\textbf{First name} & \textbf{Middle name(s)} & \textbf{Last name} & \textbf{Birth year}\\
\midrule
Douglas & Noel & Adams & 1952 \\
Franklin & Robert & Adams & 1932 \\
Robert & & Adams & \\
Isaac & & Asimov & 1920\tikzmark[xshift=3.5em]{asimov1} \\
\tikzmark[xshift=-8pt,yshift=1ex]{window_begin}Isaak & & Asimov & \\
Janet & & Asimov & 1926 \\
\tikzmark[xshift=-8pt,yshift=1ex]{window_end}I & & Asimov & 1926 \\
Elizabeth & & Bear & 1971 \\
Gregory & Dale & Bear & 1951 \\
Philip & Kindred & Dick & 1928 \\
William & Ford & Gibson & 1948 \\
Robert & Anson & Heinlein & 1907 \\
Aldous & Leonard & Huxley & 1894 \\
Ann & & Leckie & 1966 \\
Isaak & Yudovich & Ozimov & \tikzmark[xshift=3.5em]{asimov2} \\
Sarah & Bear Elizabeth & Wishnevsky & 1971 \\
\bottomrule
\end{tabular}
\drawcurvedarrow[bend left=60,-stealth]{asimov1}{asimov2}
\drawbrace[brace mirrored, thick]{window_begin}{window_end}
\annote[left]{brace-1}{Window}
%\annote[right]{arrow-1}{ }
\end{table}
% % % % % % % % % % % % % % % % % % % % % %
\section{Field-based comparisons}
\label{sec:field_comparisons}
The fundamental principle of the entity resolution techniques described in this thesis are based on the assumption that references to the same entity are relatively similar.
Naturally, if the references are exactly the same we are confident that indeed the references refer to the same real-world entity.
In practice, however, it is not always this obvious, since there are many circumstances in which the references differ, even though the pair should be classified as a match.
There are a number of reasons that cause these differences, e.g. an error was made during transcription or a name was abbreviated.
To be able to perform entity resolution under such conditions we can make use of approximate string matching techniques.
These work by measuring the proximity of one string to another which gives an intuition about the likeliness of the equivalence of the string.
A distance function $d(\cdot, \cdot)$ maps a pair of strings $\sigma_1$ and $\sigma_2$ to a real number $r$, with a greater value of $r$ indicating a smaller similarity between $\sigma_1$ and $\sigma_2$ \citep{Cohen2003}.
Conversely, a similarity function $s(\cdot, \cdot)$ outputs numbers $r$ with a greater value indicating a larger similarity.
These values are often normalized such that their value is between $0 \leq r \leq 1$.
Distance and similarity functions for strings must intuitively have a number of properties, which are in general \citep{Christen2012}:
\begin{itemize}
\item $s(\sigma_1, \sigma_1)=1$: the similarity is maximal when a string is compared to itself.
\item $s(\sigma_1, \sigma_2)=0$: the strings are `completely different', meaning that $\sigma_1$ and $\sigma_2$ have no characters in common, resulting in the minimum similarity value of $0$.
\item $0 < s(\sigma_1, \sigma_2) < 1$: a value between the two extremes indicates that the strings are `somewhat similar'.
\end{itemize}
With similarity values between $0$ and $1$, it is easy to convert to distance values by subtracting them from $1$, and vice versa.
It depends on the calculation of the proximity measure whether it makes more sense to talk about one or the other.
For the record linker it does not make a huge difference, as the interpretation of the numbers is simply inversed.
We will see that a threshold can then be set for each component, mapping the scores to binary values, or used as-is in the classification of references which will be further discussed in \cref{sec:classification}.
The remaining section describes a number of distance metrics that were used in our record linkage system to perform pairwise comparisons between the fields of a pair of records, which we assume to be of the string data type.
We denote with $\sigma$ some string constructed from individual literals taken from an alphabet $\Sigma$, and use the superscript notation, e.g. $\sigma^{i}$, to refer to its constituent parts.
The notation $\abs{\cdot}$ is used to refer to the length of a string.
Note that we are now considering the field strings in isolation and not the records as a whole, hence a \emph{match} in this context means that the fields are `similar enough'.
% % % % % % % % % % % % % % % % % % % % % %
\subsection{Edit and Levenshtein distance}
\label{sec:edit_distance}
As stated, misspellings and typographical errors are a common source of errors and often turn up as a single character being replaced or perhaps an additional being added.
A natural way to compute a distance between two strings is therefore to use the smallest number of modifications that is needed to turn one string into the other.
This is the underlying principle of the class of edit distances.
In its simplest form, three operations are considered: substitutions, insertions and deletions, each of them associated with a cost of $1$.
This variant of the edit distance is often called the \emph{Levenshtein distance} \citep{Levenshtein1966}.
The edit distance can efficiently be computed in time $\mathcal{O}(\abs{\sigma_1} \times \abs{\sigma_2})$ using a dynamic programming approach.
It breaks up the computation into smaller subproblems, namely the computation of the edit distance between all possible prefixes of one string and all possible prefixes of the second.
The algorithm for computing edit distances is given in \cref{alg:edit_distance}.
First thing to note is that the first row of the matrix keeps track of the cost of deleting literals from $\sigma_{2}$ while the first column keeps track of deleting literals from $\sigma_{1}$.
These can be filled in without requiring any knowledge of the strings themselves, except for their length.
A cell $D[i,j]$ in the matrix corresponds to the number of edits required to convert the first $i$ characters of the string $\sigma_1$ (shown in the first column of a matrix) into the string comprised of the first $j$ characters of string $\sigma_2$ (shown in top row of a matrix) \citep{Christen2012}.
If we look at the example as depicted in Figure~\ref{fig:edit_distance}, we can see how the algorithm computes the Levenshtein distance (i.e. the edit distance with equal cost of all edit operations) between Owen (British) and Owain (Welsh).
The computation starts in the top left corner with an initial score of $0$.
Aligning the literal `O' of Owain with the empty string results in a mismatch and induces a cost of $1$, therefore $D[0, 1]=1$.
A vertical move down requires similar reasoning and also results in a cost of $1$.
Aligning both starting letters results in a match and the diagonal move introduces no cost, therefore $D[1, 1]=0$.
This process is repeated until all cells of the matrix have been computed.
The edit distance can be found in the lower right corner and is $2$ in this case.
Shown in bold is a construction path that shows one of the possible alignments associated with the computed distance.
It is constructed by backtracking from the lower right corner to the top left each of the steps that led to the minimum value, i.e. the lowest value of the vertical, horizontal and diagonal moves must be selected.
From the example it can be seen that a choice sometimes occurs, since multiple alignments can exist for a certain edit distance.
\begin{algorithm}
\input{algorithms/edit_distance.tex}
\caption{Computes the edit distance between two strings $\sigma_1$ and $\sigma_2$}
\label{alg:edit_distance}
\end{algorithm}
\begin{figure}
\centering
\begin{minipage}{.65\textwidth}
\centering
\input{tables/edit_distance_matrix.tex}
\end{minipage}%
\begin{minipage}{.35\textwidth}
\centering
\begin{tabular}{c||c|c|c|c|c}
$\sigma_1$ & O & w & a & i & n \\\hline
$\sigma_2$ & O & w & e & \_ & n
\end{tabular}
\end{minipage}
\caption[Example of edit distance]{The matrix in the left shows the computation of the edit distance between Owen and Owain. The construction path, presented in bold, can be seen as substituting the `a' for an `e' and removing the `i' from Owain as to obtain Owen, as depicted on the right.}
\label{fig:edit_distance}
\end{figure}
% % % % % % % % % % % % % % % % % % % % % %
\subsection{Q-gram string matching}
\label{subsec:qgram}
Another class of approximate string similarity functions are based on \emph{$q$-grams}\citep{Ukkonen1992}, sometimes called \emph{$n$-grams}.
A $q$-gram is a sub-sequence of $q$ characters.
Often selected values for $q$ are $q=2$ (called bigrams or digrams) and $q=3$ (called trigrams).
They can be obtained from an input string by moving a sliding window of width $q$ over the string and counting occurrences of each q-gram encountered, effectively computing a multiset of $q$-grams.
The general approach of $q$-gram string matching is to measure similarity of the obtained multisets using the number of $q$-grams that they have in common.
Many of ways exist for comparing multisets, but three are most often used.
\begin{align}
\text{Overlap coefficient: }sim_{overlap}(\sigma_1, \sigma_2) &= \frac{c_{common}}{min(c_1, c2)} \\
\text{Jaccard coefficient: }sim_{jaccard}(\sigma_1, \sigma_2) &= \frac{c_{common}}{c_1 + c_2 - c_{common}} \\
\text{Dice coefficient: }sim_{dice}(\sigma_1, \sigma_2) &= \frac{2 \times c_{common}}{c_1 + c_2}
\end{align}
\noindent A comprehensive overview of multiset distances can be found in \citep{Kosters2008}.
An advantage of $q$-gram based functions is that they are efficiently computable.
Computation involves the extraction of $q$-grams from both strings, which can be computed in time $\mathcal{O}(\vert \sigma_1 \vert + \vert \sigma_2 \vert)$, followed by the computation of the intersection, computed in $\mathcal{O}$.
Compared to the time complexity of $\mathcal{O}(\vert \sigma_1 \vert \times \vert \sigma_2 \vert)$ of the edit distance this an improvement.
A disadvantage is that, due to the nature of a set, information is lost about the ordering of the $q$-grams in the string.
For example, the strings ``abc'' and ``bca'' both include the bigram ``bc'', but in a different place.
While a method based on edit distance would penalize accordingly, a $q$-gram based method will simply disregard this fact.
% % % % % % % % % % % % % % % % % % % % % %
\subsection{Jaro and Jaro-Winkler similarity}
\label{subsec:jaro}
The family of Jaro similarity functions\citep{Jaro1989} combine some of the advantages of both the edit distance and $q$-gram similarity.
It counts the number of characters $c$ that are in common within a window that is of size $\lfloor \frac{\text{max}(\SizeOf{\sigma_1}, \SizeOf{\sigma_2})}{2}\rfloor-1$.
Moreover, the Jaro similarity function also accounts for the number of transpositions $t$, i.e., adjacent characters that are swapped in the two strings.
Put together, the metric is defined as follows:
\begin{equation}
\mathrm{s_{jaro}}(\sigma_1, \sigma_2) =
\begin{dcases}
\frac{1}{3} \left( \frac{c}{\vert \sigma_1 \vert} + \frac{c}{\vert \sigma_2 \vert} + \frac{c-t}{c} \right) & \text{if } m = 0 \\
0 & \text{otherwise}
\end{dcases}
\end{equation}
In the context of record linkage it makes sense to spend time on improving approximate similarity functions on a specific use case, namely for names.
Names regularly appear abbreviated in text, but this does not change the prefix.
The Jaro-Winkler\citep{Winkler1990} similarity function takes this into account by increasing the similarity value based on the number of agreeing characters at the beginning of the two strings.
It is computed as follows:
\begin{equation}
\mathrm{s_{winkler}}(\sigma_1, \sigma_2)=\mathrm{s_{jaro}}(\sigma_1, \sigma_2) + (1-\mathrm{s_{jaro}}(\sigma_1, \sigma_2)\frac{p}{10})
\end{equation}
\noindent with $p\in\{0,\dots,4\}$ being the length of the longest common prefix of at most $4$.
\begin{table}
\centering
\begin{tabular}{?c | ^c | ^c | ^c | ^c | ^c | ^c | ^c |}
\rowstyle{\itshape}
Index & 0 & 1 & 2 & 3 & 4 & 5 & 6 \\\hline
$\sigma_1$ & \textbf{j} & \textbf{o} & n & e & s & & \\
Matches & 0 & 1 & 4 & - & 3 & - & - \\
$\sigma_2$ & \textbf{j} & \textbf{o} & h & s & n & o & n \\\hline
\end{tabular}
\caption[Example matching]{This table shows the matching of individual characters between string ``jones'' and ``johsnon''. The longest sequence of matching starting characters is $2$ and is printed in bold face.}
\label{tab:jaro_winkler}
\end{table}
\cref{tab:jaro_winkler} shows how individual characters in the strings ``jones'' and ``johsnon'' are matched.
We can see that four characters can be matches and there is one transposition, since one character must be swapped to turn the list of indexes into a sorted array.
The Jaro similarity can be computed as follows.
\begin{equation*}
\mathrm{s_{jaro}}(\text{``jones''}, \text{``johsnon''}) = \frac{1}{3} \left(\frac{4}{5} + \frac{4}{7} + \frac{3}{4} \right) = 0.70714
\end{equation*}
\noindent The number of matching starting character is $2$, which is used by the Jaro-Winker distance to increase the similarity as follows:
\begin{align*}
\mathrm{s_{winkler}}(\text{``jones''}, \text{``johsnon''}) &= \mathrm{s_{jaro}}(\text{``jones''}, \text{``johsnon''}) + \\
&+(1-\mathrm{s_{jaro}}(\text{``jones''}, \text{``johsnon''}))\frac{p}{10} \\
&= 0.70714 + (1 - 0.70714)\times\frac{2}{10} \\
&= 0.76571
\end{align*}
% % % % % % % % % % % % % % % % % % % % % %
\subsection{Soundex and Editex phonetic similarity}
\label{subsec:soundex}
Another class of approximate string similarities is the \emph{Soundex}\citep{Soundex} phonetic similarity.
It aims to take into account the phonetics, i.e., the auditory properties of the strings, with emphasis on the \emph{homophones}.
These are words that are pronounced in the same way, but are written differently, such as road/rode/rowed or seas/sees/seize.
Ambiguity caused by homophones in conversation are often resolved because of context.
This does not apply to names, however, which may cause a name to be misspelled.
An illustrative example of a special case of homophones can be found in datasets SC8 \citep{SC8} and C53 \citep{C53}.
In C53 there is a mentioned of ``Walter de Gloucestre'' and in SC8 a mention of ``Walter de Gloucester''.
If these are indeed the same people -- this is unfortunately impossible to know for certain -- then this misspelling was probably caused by a cross-language homophone.
Such differences in spelling can be dealt with by devising a phonetic string comparison function, such as soundex.
A soundex encoding of a string consists of the first character of the string, followed by three digits that encode the remaining consonants.
Consonants that are often pronounced similarly are mapped to the same character.
In total, procedure is as follows:
\begin{enumerate}
\item The starting character of the string is used as the first character of the encoding.
\item Remove all occurrences of `a', `e', `i', `o', `u', `y', `h' and `w'.
\item Substitute consonants after the first character with digits as follows:\\
\begin{align*}
b, f, p, v &\rightarrow 1 \\
c, g, j, k, q, s, x, z &\rightarrow 2 \\
d, t &\rightarrow 3 \\
l &\rightarrow 4 \\
m, n &\rightarrow 5 \\
r &\rightarrow 6
\end{align*}
\end{enumerate}
Additionally, the following rules are applied \citep{Soundex}:
\begin{itemize}
\item \textbf{Names with double letters}: whenever the surname has any double letters, they should be treated as one letter.
\item \textbf{Names with letters side-by-side that have the same soundex code number}: if the surname has different letters side-by-side that have the same number in the soundex coding guide, they should be treated as one letter.
\item \textbf{Names with prefixes}: if a surname has a prefix, such as `van', `con', `de', `di', `la', or `le', code both with and without the prefix because the surname might be listed under either code.
\item \textbf{Consonant separators}: if a vowel separates two consonants that have the same soundex code, the consonant to the right of the vowel is coded.
\end{itemize}
A drawback of the Soundex encoding is that the rules are specifically geared towards a certain language, English, in this case.
To capture the phonetics of another language a different mapping must be developed and tested.
Variants for the Indian language are available \citep{Shah2014}.
Soundex encodings can be used in approximate string matching by checking the encodings of the input strings for equality, resulting in a binary number.
A different approach is to apply the idea of treating phonetically similar values equivalently in way comparable to the edit distance.
The similarity function \emph{Editex} modifies the cost function of the edit distance slightly.
\begin{equation}
c_{editex}(a_i, a_j) =
\begin{dcases}
0 & \text{if } a_i = a_j \\
1 & \text{if } a_i \neq a_j \wedge a_i \equiv_s a_j \\
2 & \text{otherwise}
\end{dcases}
\end{equation}
\noindent where $\equiv_s$ is the relation that two characters belong to the same equivalence class in Soundex, i.e., they map to the same character, such as `b' and `f'.
This approach has proven to be quite successful for matching names\citep{Zobel1996}.
Since it is computed analogous to the edit distance it is also computed in time $\mathcal{O}(\SizeOf{\sigma_1} \times \SizeOf{\sigma_2})$.
% % % % % % % % % % % % % % % % % % % % % %
\section{Candidate pair classification}
\label{sec:classification}
The last step in the pipeline is to classify candidates pairs, i.e., pairs of occurrences within the same block, into \emph{matches} and \emph{non-matches}.
The classifier bases its decision on roughly two criteria.
\begin{figure}
\centering
\begin{subfigure}[b]{\linewidth}
\centering
\begin{tabular}{| l | c | c | c |}
\cline{2-4}
\multicolumn{1}{l|}{ } & \textit{First name} & \textit{Article} & \textit{Last name} \\\hline
$p$ & $0.182$ & $0.917$ & $0.00214$ \\\hline
$o_1$ & John & de & Engelfield \\\hline
\multicolumn{1}{c}{ } & \multicolumn{1}{c}{$0.0 \updownarrow$} & \multicolumn{1}{c}{ } & \multicolumn{1}{c}{$0.13 \updownarrow$} \\\hline
$o_2$ & John & & Englefield \\\hline
$p$ & $0.182$ & & $0.00321$\\\hline
\multicolumn{1}{l|}{ } & \textit{First name} & \textit{Article} & \textit{Last name}\\\cline{2-4}
\end{tabular}
\subcaption{}
\label{fig:classify1}
\end{subfigure}
\par\bigskip
\begin{subfigure}[b]{\linewidth}
\centering
\begin{tabular}{| l | c | c | c |}
\cline{2-4}
\multicolumn{1}{l|}{ } & \textit{First name} & \textit{Article} & \textit{Last name} \\\hline
$p'$ & $0.182$ & & $0.0535$ \\\hline
$E_q$ & $1$ & $0$ & $1$\\\hline
\end{tabular}
\subcaption{}
\label{fig:classify2}
\end{subfigure}
\caption{\subref{fig:classify1} shows the pairwise comparison of two occurrences, $o_1$ and $o_2$, based on three fields. The probability $p$ of each value is given if it is known, and the string distance is shown next to the arrow. The first names match exactly and the last names are quite similar, with a distance of $0.13$. \subref{fig:classify2} shows the aggregated probabilities $p'$ and the equivalence status $E_q$.}
\end{figure}
First, the equivalence of the values of each attribute is established in a pair-wise fashion, using the output of a string distance function.
An example of this is shown in \cref{fig:classify1} where the references ``John de Engelfield'' and ``John Englefield'' are compared.
The first name matches exactly, while the last name is a partial match, with a distance of $0.13$.
By setting a threshold on the distance, an \emph{equivalence function} is obtained that outputs $1$ if the values are considered equivalent and $0$ otherwise, while empty strings are never considered equivalent.
This is shown in \cref{fig:classify2}, where the threshold has been set at $0.2$, resulting in a binary vector of length $3$.
Note that different equivalence functions could be set for each of the individual fields, depending on the type of value, such as string, categorical and ordinal, and the thresholds could be varied.
Secondly, if the values are equivalent, then the prior probabilities of these value occurring is retrieved.
In these best case, these statistics are readily available, for example from a similar dataset or as computed on a census.
In many cases, these statistics are not available and the statistics can be computed from the occurrences in the dataset itself.
This is however sub-optimal, since it introduces bias, based on how many times a person was referenced in the dataset.
If the values were not equal, however, this leaves a choice whether to choose the probability of the value from either the first or second occurrence, or both.
Since the equivalence function determined that the values are equivalent, we decided to therefore sum the two probabilities, effectively treating the two values as if they were equivalent.
In the same example it can be seen that the probabilities of ``Engelfield'' and ``Englefield'' have been summed to a value of $0.0535$.
To compute an overall confidence score, the probabilities are aggregated into a single number.
One way of doing this is by assuming that the attributes are independent and multiplying the probabilities in order to obtain the posterior probability given the attribute values.
When multiplied with a population size estimate $\alpha$, the estimate number of people having the described properties can be found.
\begin{equation}
c_{\pi}(P) = \alpha \prod_{i=0}^{m} p_i
\end{equation}
\noindent with $m$ the number of fields considered.
Even though the population size of certain areas might be retrievable from other historical sources, it can be assumed that the number of people contained within a document $\alpha$ is unknown, since determining this is one of the goals of entity disambiguation.
For the purposes of entity disambiguation it is not relevant, however, as it is simply a scaling factor, i.e., if candidate pairs are ranked based on their score, omitting $\alpha$ does not influence the ranking.
Instead of multiplying probabilities, the sum of logarithms can also be used.
\begin{equation}
c_{\mathrm{log}}(P) = -\sum_{i=0}^{m} \log{p_i}
\label{eq:confidence1}
\end{equation}
\noindent While the $\alpha$ parameter makes the $c_{\pi}$ score easier to interpret, here it is of no use and simply omitted.
The $c_{\mathrm{log}}$ score has the advantage that it is more efficient to compute and has a better numerical stability.
Multiplying a large amount of small numbers can lead to underflow errors, which is resolved by instead summing the logarithms.
Another advantage is of more practical nature: they are much easier to display in tables and plots, which is why all confidence scores will be presented as $c_{\mathrm{log}}$ scores in this thesis.
\begin{algorithm}[tb]
\caption{Classifies candidate pair $(a_i, a_j$) as a match or non-match}
\label{alg:record_linker}
\input{algorithms/record_linker.tex}
\end{algorithm}
To classify candidate pairs, a threshold value is set on the confidence score.
All candidate pairs that have a confidence score greater than or equal to the threshold are classified as matches, and all others as non-matches.
The complete algorithm is shown in \cref{alg:record_linker}.