4_experiments.tex

\begin{savequote}[75mm] 
That which can be asserted without evidence, can be dismissed without evidence.
\qauthor{Christopher Hitchens}
\end{savequote}


% % % % % % % % % % % % % % % % % % % % % %


\chapter{Experimental Evaluation}
\label{ch:experiments}

\chaplettrine{S}{ince} the aim of this project is to aid historians in their research, we found it important to evaluate the described techniques on a dataset that is exemplary of historical research in which entity disambiguation plays an important role.
To this extent, our methods were applied to a dataset that was made available to us by The National Archives.
In this chapter, the followed procedure is described and the results are compared to the ones that were manually obtained by historians.
With the experiments, we try to point out the advantages and difficulties of the system and provide a basis for future research.

As has been pointed out in \cref{sec:challenges}, it is impossible to be certain about the results because of a lack of a ``golden standard'', i.e., a dataset that can be assumed to be absolutely correct.
The experiments must be seen as a best effort approach to assess the validity of the system.
Errors have been discovered both on the part of the record linkage system and the data as provided by the historians, some of which will be investigated in this chapter.


% % % % % % % % % % % % % % % % % % % % % %


\section{Overview of the dataset}
\label{sec:dataset}

The dataset that has been used for validation of the system is that of \emph{The Gascon Rolls project} \citep{GasconRolls}.
The Gascon Rolls are records that were drawn up by the English royal administration of Aquitaine-Gascony (south-western France) between 1273 and 1468 and contain grants of land, oaths of treaties and other important documents.
Until 1453, the region of Aquitaine was under English rule and the rolls served as a means of communicating the situation to the government in England.
This tradition continued until 1468 even though the French annexed it at the end of the Hundred Year's War in 1453.
The rolls are considered to be of high historical importance because the detailed
descriptions of events prove an invaluable basis for the biographies of people mentioned.
People have a central role, since most of the text concerns agreements, making it an ideal candidate for record linkage.

\begin{listing}[tbp]  % minted's listing (float) environment overrides the default placement options
    \inputminted[
        breaklines,
        bgcolor=codebg,
        tabsize=2,
        fontsize=\small
    ]{xml}{xml/gascon_example.xml}
    \caption{An extract from the Gascon Rolls XML source data.}
    \label{lst:gascon_xml}
\end{listing}

In 2009, a project began to produce an on-line calendar, i.e., a descriptive summary of an original archive document in which all elements that can be of historical importance are recorded, of the rolls.
The project aims to provide an easy to access substitute of the rolls, opening up research to anyone who might be interested.
The data thus consists of a translated and summarized extract of the original parchment rolls in digital XML format.
Historians have thoroughly annotated this data to supply additional information and correct errors, while also providing most of the original text.
The documents were mostly written in Latin, with only a few parts in French, and have been translated to modern English.

An example of this is given in \cref{lst:gascon_xml}, where an extract of the original data in XML format can be seen.
The text concerns a grant of the office, that was formerly held by a Master Bernat de Rions, who had deceased, to Master Per-Arnaut de Taller.
The text has been split into three sections using separate tags: \texttt{head}, \texttt{body} and \texttt{closer}.
The header usually includes the the recipient, while the closer contains the phrase ``By K.'', which originally stated ``per ipsum regem'', indicating the letter was sent under the authority of the reigning king.
Within the \texttt{opener} tag, the date and location from which the letter was signed are usually included.
The example also shows most of the different styles of annotations that were provided by the historians.
Person names are enclosed within a \texttt{name} tag of type \texttt{person}, while $53$ other types exist for names of things such as places, castles and titles.
Each of these is also accompanied with a key that uniquely identifies them throughout the dataset.
Other tags that might be of use to a human reader are: \texttt{orig}, that displays the original spelling of a word; \texttt{supplied}, that gives more information than was originally present; and \texttt{note}, which gives additional information about something present on the original document that might be of interest to the reader, such as a note in the margin or a reference to another text.

Even though the annotations provided in the data are invaluable for the evaluation of the record linker, they were not used in the linkage itself.
Since the goal is to develop a system that can automatically provide these tags, we have proceeded as if no information were present except for the plain text.
This ensures the tools developed can be applied to a wide range of datasets.

\begin{table}
    \centering
    \caption{Overview of the size of the Gascon Rolls dataset.}
    \label{tab:data_overview}
    \begin{tabular}{l r}
        \toprule
        & Size \\
        \midrule
        Number of rolls & $66$ \\
        Number of membranes & $943$ \\
        Number of sections & $1153$ \\
        Number of occurrences & $30426$ \\
        Sum of file sizes & $27$ MB \\
        \bottomrule
    \end{tabular}
\end{table}

Not all $112$ existing Gascon rolls are yet available in XML format, but $66$ of them are.
Out of the $1053$ membranes, $943$ are made available, although some of them appear to be unfinished.
To give an impression of the size of the dataset, some of its characteristics are displayed in \cref{tab:data_overview}.
The elements were counted using the provided annotations, e.g., the number of occurrences were computed by counting the XML \texttt{name} elements of type \texttt{person}.
The size of the data is quite modest compared to typical record linkage tasks that can often involve millions of records.
However, there are over $30$ thousand occurrences that were not only identified in text, but also disambiguated, i.e., each of them has a unique identifier.
This makes the Gascon rolls dataset an invaluable resource for the task at hand.


% % % % % % % % % % % % % % % % % % % % % %


\section{Occurrence extraction}
\label{sec:occurrence_extraction}

The record linkage system works on a list of person references, while the source data is in XML format.
The first step, therefore, is to remove the provided XML tags and parse the data using the methods described in \cref{sec:segmentation}.
The identifiers that were contained within the original tags were stored by their starting and ending position in text, so that they could be copied to the corresponding occurrences after segmentation.
Only occurrences in paragraphs enclosed in \texttt{ab} tags were considered, to make sure the calendar itself was parsed, and not the additional text provided by researchers, such as introductory text.
The amount of annotated occurrences contained herein is \num{25836}.
Using a compact grammar, we were able to find \num{22206} occurrences.
Note that this is not necessarily a subset of the larger set, as not all names were annotated and the grammar is likely to return some errors. 
In total, we were unable to match \num{1684} occurrences found by the parser with annotated occurrences, so these were removed as to obtain \num{20522} occurrences, each with a unique identifier.

\begin{table}
    \footnotesize
    \centering
    \caption{Overview of the occurrence fields as extracted from the source text.}
    \label{tab:occurrence_overview}
    \begin{tabularx}{\textwidth}{l X r}
        \toprule
        \textbf{Field} & \textbf{Description} & \textbf{Example} \\
        \midrule
        id & As provided in the source document. & $5603$ \\
        forename & & John \\
        article & Word(s) between the forename and surname. & de la \\
        surname & & Somerby \\
        provenance & Origin of a person. & England \\
        title & Used to address a person. & Master \\
        role & Ascribed to person. & esquire \\
        regnal\_number & Used to distinguish monarchs. & \emph{III} \\
%       father & & \\
        fileId & Filename & C61\_33.xml \\
        sectionId & Section identifier. & it082\_43\_07f\_074-2 \\
        pos & Start position in section. & $104$ \\
        endpos & End position in section. & $120$ \\
        words & Words and their frequencies in section. & \\
        orig & Original text. & John of France, knight \\
        \bottomrule
    \end{tabularx}
\end{table}

In \cref{tab:occurrence_overview}, an overview of the occurrence fields can be found.
Some of the fields are the result of applying the grammar to the plain text, while other fields contain metadata such as the name of the file in which the occurrence was found.
The \texttt{words} field contains a set of words that appear in the section of the text in which the occurrence appears, along with their frequencies.
These are used by the record linker to determine whether an element of a miki was mentioned in the same section as the occurrence.


% % % % % % % % % % % % % % % % % % % % % %


\section{Miki extraction}
\label{sec:miki_evaluation}

In \cref{sec:miki} it was discussed how the inclusion of contextual information might improve the performance of the record linker.
Assuming that a person appears within a similar context, the goal is to determine the ``topics'' discussed therein using miki's.
The first question that arises is what the context of the occurrence is.
The level was chosen to be at that of the sections, because topics are very consistent throughout them and can naturally be regarded as the context.
One alternative is the sentence level, but this would greatly reduce the probability of an item of a MIKI occurring in the same sentence as an occurrence.
Another option is to capture information on a membrane or roll level, but since most of the information contained therein is unrelated, it unlikely to be of relevance to an occurrence.

To compute a MIKI for the Gascon rolls corpus, the sections are first \emph{tokenized} by splitting sentences on the whitespace character and converting all words to lowercase.
The latter step ensures no distinction is made between capitalized and non-capitalized tokens.
These lists of tokens are then converted to multisets by computing the frequency of each token, and stored in the \texttt{words} field of each occurrence that occurs in the same section, for reference.

The time required to compute a MIKI can be reduced by discarding uninformative tokens beforehand.
Three strategies were used:

\begin{enumerate}
    \item \emph{Infrequent tokens} are of low entropy, hence they are unlikely to contribute much to the joint entropy of a MIKI. Therefore, any tokens that appear less than $5$ times are discarded.
    \item Conversely, \emph{stop words}, such as ``the'' and ``and'', are highly frequent tokens that are of low entropy. They are unrelated to the topics discussed in text and are very likely to appear, thus not contributing much to the joint entropy of an itemset. All tokens appearing in a fixed list of $319$ stop words were ignored.
    \item Tokens that are part of \emph{occurrences}, such as first names, were ignored. The aim is to extract tokens that represent a certain topic and relate them to occurrences. The constituent parts of an occurrence directly relate to the person, instead of capturing contextual information, and are used in the field-based comparisons described in \cref{sec:field_comparisons}.
\end{enumerate}

An index is constructed by scanning over the entire corpus once and taking note of the token frequencies.
Infrequent tokens were then removed using the previously mentioned condition and stop words were simply ignored, resulting in an index consisting of \num{3221} entries.
The multisets of tokens were then converted into regular sets by including only tokens that are within the index. 
Following the representation used in \cref{sec:miki}, the bags were transformed into a matrix of size $\num{1153} \times \num{3221}$ on which the MIKIs were computed using the \emph{Forward Selection} algorithm \citep{Knobbe2006}.

\begin{table}
    \centering
    \input{tables/miki_words_gascon.tex}
    \caption{The words of the 20-MIKI found in the Gascon Rolls dataset are shown in the order that they were found in, accompanied with their cumulative entropy and document frequency.}
    \label{tab:miki_words_gascon}
\end{table}

In \cref{tab:miki_words_gascon} the resulting MIKI of size $20$ is shown, along with the cumulative joint entropy of the set of size $k$ and the probability of a token appearing in a section, comparable to the inverse document frequency.
The first item found is the token ``king'', which naturally appears frequently in texts, because many of the texts concern oaths to or grants by the king.
However, since the presence and absence of an item are treated equally in the computation of the entropy, the token would be of low entropy if were to occur to frequently.
In the same table, it can be seen that it occurs in \SI{36}{\percent} of the sections, resulting in a decent entropy of $0.94$.
The second item is the token ``letters'', which appears in texts that contain the phrase ``letters of general attorney'', which indicates that the people mentioned thereafter were joining the principality of the king.
This fact could be used to tell something about the status of a person, though it is unlikely that an occurrence appears twice in contexts containing this phrase.

Thus, a problem with the usage of miki's becomes apparent: although it makes sense to model the topics in a section using miki's, their usefulness is far from guaranteed.
The computation of the miki and the record linkage procedure are treated as two separate processes, so in the current setting it is impossible to communicate findings of one process to the other.
If a person is likely to be mentioned in relation to a certain topic only once, that topic is not useful for entity disambiguation and should not be considered.
However, this implies that the match status of some of the candidate pairs is known, which is not the case in the current set-up, suggesting that an iterative approach might be useful.

\begin{figure}
    \centering
    \includegraphics[width=\textwidth,height=\gratio\textwidth]{plots/miki_gascon.tikz}
    \caption{The cumulative entropy of the MIKI obtained from the Gascon Rolls dataset has a weak slope, indicating that the items contained within them are less orthogonal than would be ideal for the purpose of context modelling.}
    \label{fig:gascon_convergence}
\end{figure}

In \cref{fig:gascon_convergence} the cumulative entropy is shown as it increases with size $k$ of the miki.
It is compared to the cumulative entropy of the miki of a subset of pages of Wikipedia.
A similar approach to the construction of the Gascon rolls dataset was taken to construct it, though the sections in this case consist of entire pages.
The subset was created by starting at the page of Albert Einstein and following links on pages in a breadth-first fashion until a set of \num{1000} pages was collected.
From the figure, it can be seen that the convergence behavior is quite different from that of the Gascon Rolls miki.
The slope of the latter is weaker, indicating that the items found are less orthogonal, or, similarly, there is higher \emph{mutual information} between the items.
Shown as a black dashes line is the theoretical optimal slope that occurs when the items of the MIKI are completely orthogonal (mutual information is $0$).
It is hard to determine whether this is caused by the contents of the dataset or the difference in the definition of a section that was chosen.

The length of Wikipedia pages is usually a lot longer than the length of the section in the Gascon Rolls (respectively \num{2511} vs. \num{60} average words per page) and the vocabulary of words is much larger (respectively \num{52320} vs. \num{9411}).
A shorter section length decreases the probability of a certain word occurring in that section.
Infrequent words will therefore appear in fewer texts, reducing their contribution to the joint entropy of an itemsets.
This is the reason that words that appear in only one text are removed from the vocabulary before the computation of the MIKI.
We can see that the section length determines the size of the \emph{effective vocabulary}, i.e., the vocabulary after removal of frequent (stop) words and infrequent words.
The vocabulary of the Gascon Rolls is thus further reduced, making it less likely that combinations of uncorrelated words are found, which would result in a MIKI with a high joint entropy.

Luckily, however, the MIKI is computed as part of an unsupervised learning strategy and its influence can be controlled by limiting its size.
The cumulative joint entropy, as shown in \cref{fig:gascon_convergence}, can be kept track of by storing the joint entropy of each subsequently larger MIKI and can aid in setting this parameter.
While for the Gascon Rolls we might want to limit the size of the MIKI to perhaps $4$, the joint entropy of the MIKI found for the Wikipedia dataset can be deemed sufficiently large at $8$.


% % % % % % % % % % % % % % % % % % % % % %


\section{Linking entities}
\label{sec:linking_entities}

The record linker uses the data as described in the previous two sections to classify presented candidate pairs.
We will denote with $P$ the set of ``truly matching pairs'' (or positives) and with $N$ the set of ``truly non-matching pairs'' (negatives), according to the annotations.
Note that even though the used dataset was created with a lot of effort, it can never be regarded as a ``ground truth'' since the entities were linked with only limited information and will certainly contain errors.
Since the dataset was parsed and occurrences were segmented, more errors have likely been introduced, so the results of evaluation using the dataset can not be considered as truly reflecting the performance of the linker.
The dataset provides a realistic use case, however, and findings will certainly help in understanding better the problems faced and provide the information needed to improve the linker.

The simple attributes, such as first names, are processed as described in \cref{sec:classification}: if two attributes in the two records are considered equivalent, the logarithm of the probabilities are summed to obtain a confidence score.
The contextual information residing in the \texttt{words} attribute is mapped to $k$ additional binary fields, one for each item in the MIKI.
These attributes are $1$ whenever the corresponding word is contained in the \texttt{words} attribute and $0$ otherwise.
This allows these attributes to be treated in a similar fashion to the other attributes, as described in \cref{sec:miki_utilization}.
Using the logical and function as a similarity function, only the probabilities of the words that appear in both references are retrieved and aggregated using the $c_{\mathrm{log}}$ function to obtain a confidence score.

However, in our experiments we noticed that the use of a MIKI did not prove useful to increase the score of candidate pairs with very low confidence scores.
It might happen, for example, that a candidate record pair contains only the first name ``John'', which occurs frequently and thus results in a relatively low confidence score.
When a MIKI is incorporated in the confidence score computation in these cases, the score is boosted too much, resulting in a lot of false positives.
Therefore we use the MIKI as a \emph{booster}: it is only addressed when the confidence score up until that point is greater than the threshold.
This focuses the application of the MIKI on the more promising pairs, i.e., those with a higher probability of being a truly matching pair.

\begin{figure}
    \centering
    \includegraphics[width=\textwidth,height=\gratio\textwidth]{plots/kde.tikz}
    \caption{A kernel density plot was created using pairs having score $\geq 0.5$. The y-axis has been normalized, such that the maximum is $1$. The plot shows that there is a major overlap between the scores of truly positive and truly negative candidate pairs, suggesting that the linker is unable to clearly distinguish the two classes.}
    \label{fig:kde}
\end{figure}

The aim of the linker is to give elements of the set of truly matching pairs $P$ a relatively higher score than the set non-matching pairs $N$.
\cref{fig:kde} shows the distribution (computed by a kernel density estimation (KDE) \citep{Rosenblatt1956}) based on the scores of truly matching and truly non-matching candidate pairs using the attributes \texttt{title}, \texttt{forename}, \texttt{article}, \texttt{surname}, \texttt{role} and no MIKI.
Only candidate pairs with a score higher than $0.5$ have been taken into account; the other pairs are most likely negatives.
Note that the two curves have been normalized and are plotted on a different scale, allowing us to plot both curves in one figure, even though there are many more negatives than positives.
It can be seen that there is a considerable area related to the negatives, that is overlapping with that of the positives.
The plot suggests that the threshold should be set at a confidence score of approximately $2$ to avoid capturing a lot of negatives, but only a small fraction of the positives can be retrieved this way.
Lowering the threshold, however, will quickly reduce the precision as more and more negatives are retrieved, which might not be acceptable in many cases.

\begin{figure}
    \centering
    \includegraphics[width=\textwidth,height=\gratio\textwidth]{plots/roc.tikz}
    \caption{The ROC curve shows that the trade-off is able to retrieve a large number of truly matching pairs if compared to the number of truly negative pairs that is retrieved. However, this gives an overly optimistic view on the performance of the linker, since the number of truly negative pairs is almost three orders of magnitude larger.}
    \label{fig:roc}
\end{figure}

\cref{fig:roc} shows the receiver operating characteristic (ROC) curve of the system using various thresholds at which MIKIs are used to boost the score, e.g., a threshold of $0.0$ always uses MIKIs, while a threshold of $3.0$ starts taking information from the MIKI into account if the confidence score without use of MIKIs is at least $3.0$.
It appears as if the trade-off between the true positive rate and the false positive rate is not affected by the use of a MIKI, since the ROC curves are identical.
The area under the ROC curve is quite large, meaning a large fraction of the truly matching pairs are found while retrieving only a small fraction of the truly non-matching pairs.
However, as explained in \cref{sec:problem_definition}, since the false positive rate is based on the number of truly non-matching pairs, it is not a very informative number, since $P \lll N$.
In the same figure, it can be seen that there is a very steep slope in which most of the truly matching pairs are retrieved.
The long diagonal moving to the upper left corner is associated with candidates pairs that have a low confidence score, of which many are truly non-matching pairs.

\begin{figure}
    \centering
    \includegraphics[width=\textwidth,height=\gratio\textwidth]{plots/precision_recall_curve.tikz}
   \caption{The precision-recall curve shows that only a small portion of the truly postive pairs can be retrieved with reasonable precision. The steep decline suggests that the scores of truly matching and non-matching pairs is very similar, which prohibits the retrieval of truly matching pairs without also retrieving a large number of truly non-matching pairs, explaining the low precision.}
    \label{fig:precision_recall}
\end{figure}

A more informative plot is perhaps the precision-recall curve shown in \cref{fig:precision_recall}.
It shows a trade-off between the number of truly matching pairs that are matched by the linker (recall) and the fraction of the pairs retrieved that are indeed truly matching pairs (precision).
A linker that has a maximum recall of $1$ can easily be devised by classifying all candidate pairs as matches, but this results in a very low precision.
The area under the precision-recall curve can be used as a measure of the system's ability to optimize both these aspects.
About $30\%$ of the truly matching pairs can be found without too much difficulty, indicating that these pairs have a relatively high score compared to the truly non-matching pairs, making them easy to distinguish by setting the threshold to a high confidence score, as could also be seen in the KDE plot.
For the remaining part, the linker has more problems, suggesting that truly matching and non-matching pairs start to get more similar scores at that point.

\begin{figure}
    \centering
    \includegraphics[width=\textwidth,height=\gratio\textwidth]{plots/f1.tikz}
    \caption{The F1-measure is optimal at a confidence value of approximately $4$ with a value of $0.35$. Varying the threshold value at which the MIKI is taken into does not affect the curve much, although the curve is wider, indicating that F1-measure score is more stable.}
    \label{fig:f1}
\end{figure}

Using the F1-measure of \cref{def:fmeasure} with $\beta = 1$, we aggregate the precision and recall in a single metric in which both aspects are weighted equally.
The latter is a rather arbitrary decision and any weighting could be achieved by varying $\beta$.
The resulting F1-measure is plotted against the various thresholds settings in \cref{fig:f1}.
A clear peak can be seen around a confidence score of $4$ for all different settings of the threshold at which point the MIKI is taken into account.
When related to the precision-recall curve of \cref{fig:precision_recall}, we know that at this threshold, most of the negatives are not retrieved and the precision is high, but the recall is low, explaining the low F1 value of only $0.35$ at the peak.