\chapter{Chapter 3\newline Introduction to the data}\label{chap:data}
\newthought{The results that we present} later in this thesis are the outcome of many experiments, run on different data sets and with different settings. We used multiple data sets to investigate how performance is affected in cross-domain settings, and did this for two languages: English and Dutch, in particular the Dutch spoken in Flanders.\footnote{Since the experiments on improving automatic speech recognition are carried out at the KU Leuven, we use their acoustic models. These acoustic models are trained on the Flemish-Dutch part of the Spoken Dutch Corpus (introduced later in this chapter), and are part of an automatic speech recogniser for Flemish-Dutch speech (see~\cref{ch:speech}).}
In this chapter we introduce these data sets; throughout the thesis we will refer back to this chapter for their details.
\section{English data}
For the experiments on English we use four corpora: two large generic corpora and two smaller domain-specific corpora. %Although the two generic corpora comprise news and encyclopedic texts, we consider them less domain-specific.\footnote{why do we think this?}
\subsection{News commentary: One billion word benchmark (\obw)}
The first generic corpus is the Google 1 billion word benchmark for measuring progress in statistical language modelling (henceforth \obw). It comprises a shuffled corpus of 769 million tokens of news text.\autocite{chelba2013one} For training we use sets 1 through 100, out of the 101 available training sets; for testing we use all 50 available test sets (8 million tokens).
The corpus is based on the corpora provided for the sixth workshop on statistical machine translation (WMT11). Chelba and colleagues only used the English part of the News Crawl data.\footnote{reference} In postprocessing they removed duplicate sentences, shuffled the remaining sentences into a random order,\footnote{While maintaining the word order within each sentence.} and applied a word type count threshold of 3. The word types that did not reach the threshold were mapped to an UNK token.
This resulted in a corpus of 800 million tokens including sentence markers,\footnote{One begin-of-sentence and one end-of-sentence marker per sentence.} and around 793 thousand word types.
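To make these steps concrete, the sketch below shows a minimal pipeline of this kind in Python: duplicate removal, sentence shuffling, and mapping of rare word types to an UNK token. The file name, the UNK symbol, and the variable names are purely illustrative; this is a sketch of the procedure described above, not the original pipeline of Chelba and colleagues.
\begin{verbatim}
import random
from collections import Counter

THRESHOLD = 3      # word type count threshold
UNK = "<UNK>"      # placeholder symbol for thresholded types

# One tokenised sentence per line (illustrative file name).
with open("news-crawl.en.tok") as f:
    sentences = [line.split() for line in f]

# 1. Remove duplicate sentences.
unique = list({" ".join(s): s for s in sentences}.values())

# 2. Shuffle the sentences into a random order; the word order
#    within each sentence is kept intact.
random.shuffle(unique)

# 3. Count word types and map rare types to UNK.
counts = Counter(w for s in unique for w in s)
vocab = {w for w, c in counts.items() if c >= THRESHOLD}
corpus = [[w if w in vocab else UNK for w in s] for s in unique]
\end{verbatim}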
\subsection{Encyclopaedic articles: Wikipedia corpus (\wp)}
The second generic corpus is a Wikipedia snapshot of November 2013 (henceforth \wp), as used and provided by \textcite{pickhardt2014generalized}.
The data consists of Wikipedia articles which are not postprocessed, except for the removal of markup and tokenisation\footnote{``We filtered the word tokens by removing all character sequences which did not contain any letter, digit or common punctuation marks.''}.
However, we find that some of the tokenisation is too aggressive, as it has introduced errors such as split abbreviations:
\begin{quote}The vacuum state is defined as the state with no particle or antiparticle , i . e . , and . </s>\footnote{This example also lacks the formulae before and after `and': ``$a_k |0\rangle=0$ and $b_k |0\rangle=0$''.}\end{quote}
\begin{quote}All of Solzhenitsyn's sons became U . S . citizens . </s>\footnote{Especially when the final dot of U . S . is also the sentence final punctuation, this is misleading.}\end{quote}
Other frequently occurring errors are run-ons such as
\begin{quote}Stephen C Kleene defined as his nowfamousThesis Iknown as the ChurchTuring thesis . </s>\end{quote}
where dashes and quotation marks have been stripped together with the surrounding whitespace. Furthermore, boxed content is not always separated from the main text, which causes many tables and other marked-up content to end up in the running text.
In total the data comprises 1.4 billion words for training and 367 million words for testing.
\subsection{Legislative texts: JRC-Acquis corpus (\jrc)}
The first domain-specific corpus is JRC-Acquis v3.0\autocite{steinberger2006jrc}, which contains legislative texts of the European Union. It is the Joint Research Centre's collection of the Acquis Communautaire\footnote{The Acquis Communautaire (AC) is the total body of European Union (EU) law applicable in the EU Member States.} in 22 of the official languages of the European Union. The collection comprises selected texts written between the 1950s and 2007.
We only use the English part of this aligned multilingual parallel corpus. We shuffle the sentences and create a training split (31 million words, 288 thousand types) and a test split (8 million words).
\subsection{Biomedical texts: European Medicines Agency corpus (\emea)}
The second domain-specific corpus consists of converted\footnote{The composers of the corpus note that there are known problems with tables and multi-column layouts, as the documents are converted with \texttt{pdftotext}.} PDF documents from the European Medicines Agency, EMEA\autocite{tiedemann2009news}. As with \jrc, \emea is the English part of an aligned multilingual parallel corpus.
We shuffled all sentences and selected 20\% of them as the test set (3 million tokens); the remaining 80\% was reserved for training (12 million tokens, 61 thousand types).
\subsection{Data handling}
The hierarchical Pitman-Yor process language model (HPYPLM, see \cref{chap:introblm}) uses a substantial amount of memory, even with histogram-based sampling,\footnote[][-2em]{The seating and table configuration of the Chinese restaurant process is not explicitly tracked; a more compact histogram notation is used instead. \cite{blunsom2009note}} so we cannot model the complete \obw data set without thresholding the patterns in the model.\footnote[][]{\label{fn:shareghi}This area is still under investigation, as the memory cost of these models is addressed in recent papers such as \autocite{shareghi2017compressed}.} We used a high occurrence threshold of 100 on the unigrams,\footnote{Compared to the original threshold of 3 used in the corpus.} yielding 99,553 types that meet this threshold. We use all $n$-grams and skipgrams that occurred at least twice and have an included unigram as focus word, with UNKs\footnote{We map their UNK type to our \emph{\{?\}}.} occupying the positions of words not in the vocabulary. Note that because these settings differ from those of models competing on this benchmark, the results in this thesis cannot be compared to those results.\footnote{Recent work such as that mentioned in \cref{fn:shareghi} does not report the number of types, making it hard to compare our memory usage to theirs.}
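To illustrate this pattern thresholding, the sketch below counts all $n$-grams up to length four over an UNK-mapped corpus and keeps only the patterns that occur at least twice and whose focus word (assumed here to be the final token) is in the vocabulary. It is purely illustrative; in practice we use \texttt{colibri-patternmodeller}, and for brevity the sketch omits the skipgram patterns, which additionally allow gaps in the context.
\begin{verbatim}
from collections import Counter

MAX_ORDER = 4   # n-grams up to length four
MIN_COUNT = 2   # occurrence threshold on the patterns
UNK = "{?}"     # our symbol for out-of-vocabulary words

# Toy UNK-mapped corpus; in reality this is the thresholded
# training data described above.
corpus = [
    ["the", "patient", "was", "treated", "."],
    ["the", "patient", "was", "{?}", "."],
]

def ngrams(sentence, n):
    """All contiguous n-grams of a tokenised sentence."""
    return [tuple(sentence[i:i + n])
            for i in range(len(sentence) - n + 1)]

counts = Counter()
for sentence in corpus:
    for n in range(1, MAX_ORDER + 1):
        counts.update(ngrams(sentence, n))

# Keep patterns above the threshold whose focus word (assumed
# here to be the final token) is not the UNK symbol.
patterns = {p: c for p, c in counts.items()
            if c >= MIN_COUNT and p[-1] != UNK}
\end{verbatim}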
% hit rates:
% determine: colibri-patternmodeller -i /vol/tensusers/lonrust/slm/models/1bw_4S_W100_t2_T2_s10_p0_v2_train.patternmodel -f /vol/tensusers/lonrust/slm/models/1bw_4S_W100_t2_T2_s10_p0_v2_train.dat -c /vol/tensusers/lonrust/slm/models/1bw_4S_W100_t2_T2_s10_p0_v2_train.cls -l 4 -m 4 -u -P | cut -f 1 > /scratch/lonrust/train-1bw.4gram
% and colibri-extract
%
%rate: for train in emea jrc 1bw; do for test in emea jrc 1bw; do for i in 2 3 4; do TOTAL=$(wc -l ./test-${test}.${i}gram | cut -d' ' -f1); PART=$(fgrep -o -x -f ./test-${test}.${i}gram ./train-${train}.${i}gram | wc -l); echo $train $test $TOTAL $PART; done; done; done
%
% see: https://docs.google.com/spreadsheets/d/1425o5GMjbUaCoK_XKvpinRzgkT8YGcqU4ucHzJJVghE/edit#gid=462900132
% NO SEE: https://docs.google.com/spreadsheets/d/1-dAbMnq1vHzRA4i-d_4gj2vTvFjEEK-KO7Y8Y8k54uA/edit#gid=0
% AND /vol/tensusers/lonrust/overlapcount
\begin{table}%
\centering
\subfloat[Unigrams]{{\begin{tabular}{llll}
& \obw & \emea & \jrc \\
\obw & 1.5 & 9.9 & 5.9 \\
\emea & 18.3 & 0.1 & 10.1 \\
\jrc & 12.3 & 14.9 & 0.4 \\
\end{tabular}\label{tab:ex1ample1} }}%
\qquad
\subfloat[Bigrams]{{\begin{tabular}{llll}
& \obw & \emea & \jrc \\
\obw & 98.5 & 92.5 & 93.5 \\
\emea & 46.2 & 98.3 & 59.9 \\
\jrc & 59.7 & 60.5 & 95.5 \\
\end{tabular}\label{tab:ex1ample2} }}%
\qquad
\subfloat[Trigrams]{{\begin{tabular}{llll}
& \obw & \emea & \jrc \\
\obw & 85.6 & 69.8 & 66.0 \\
\emea & 17.0 & 94.8 & 26.6 \\
\jrc & 26.8 & 32.1 & 88.3 \\
\end{tabular}\label{tab:ex1ample3} }}%
\qquad
\subfloat[4-grams]{{\begin{tabular}{llll}
& \obw & \emea & \jrc \\
\obw & 62.5 & 42.2 & 33.3 \\
\emea & 5.8 & 91.2 & 10.0 \\
\jrc & 9.2 & 14.1 & 78.3 \\
\end{tabular}\label{tab:ex1ample4} }}%
\caption{Out-of-vocabulary rates (in \%) for the unigrams; hit rates (in \%) for bi-, tri-, and 4-grams. A test pattern `hits' when it also occurs in the training data. The training corpora are on the rows, the test sets in the columns.}%
\label{tab:ex1ample}%
\setfloatalignment{b}
\end{table}
\section{Analysis of cross-corpora OOV and hit rates}
Most corpora comprise a coherent set of texts that share certain properties. The composition of a corpus and the properties of its texts may be selected or sampled with a specific research question in mind, concerning a phenomenon such as genre-specific fixed phrases or the diachronic change of word meanings. Given these inherent differences in language use between corpora, the most obvious difference is in the words themselves.
\begin{figure*}
% \centering
\subfloat[\obw]{%\hspace*{-4em}
\begin{scaletikzpicturetowidth}{0.3*\textwidth}
\tikzsetnextfilename{wf1}
\begin{tikzpicture}
\begin{axis}[
xmode=log,
ymode=log,
title=Comparable word frequency,
xlabel=rank,
ylabel=frequency,
width=5cm, %0.5\textwidth,
legend cell align=left,
legend pos=outer north east,
legend style={draw=none}]
\addplot +[mark=none] table [y=tr1val, x=tr1pos]{data/tr1.dat};
\addplot +[mark=none] table [y=te1val, x=tr1pos]{data/tr1.dat};
%\addlegendentry{1-grams}
\addplot +[mark=none] table [y=tejval, x=tr1pos]{data/tr1.dat};
%\addlegendentry{2-grams}
\addplot +[mark=none] table [y=teeval, x=tr1pos]{data/tr1.dat};
\end{axis}
\end{tikzpicture}
\end{scaletikzpicturetowidth}}%\qquad
\subfloat[\jrc]{ \begin{scaletikzpicturetowidth}{0.3*\textwidth}
\tikzsetnextfilename{wf2}
\begin{tikzpicture}
\begin{axis}[
xmode=log,
ymode=log,
%title=lol,
xlabel=rank,
% ylabel=types,
width=5cm,%0.5\textwidth,
legend cell align=left,
legend pos=outer north east,
legend style={draw=none}]
\addplot +[mark=none]table [y=trjval, x=trjpos]{data/trj.dat};
\addplot +[mark=none]table [y=te1val, x=trjpos]{data/trj.dat};
%\addlegendentry{1-grams}
\addplot +[mark=none]table [y=tejval, x=trjpos]{data/trj.dat};
%\addlegendentry{2-grams}
\addplot +[mark=none]table [y=teeval, x=trjpos]{data/trj.dat};
\end{axis}
\end{tikzpicture}
\end{scaletikzpicturetowidth}}%\qquad
\subfloat[\emea]{\begin{scaletikzpicturetowidth}{0.3*\textwidth}
\tikzsetnextfilename{wf3}
\begin{tikzpicture}
\begin{axis}[
xmode=log,
ymode=log,
%title=lol,
xlabel=rank,
% ylabel=types,
width=5cm, %0.5\textwidth,
legend cell align=left,
legend pos=outer north east,
legend style={draw=none}]
\addplot +[mark=none]table [y=treval, x=trepos]{data/tre.dat};
\addplot +[mark=none]table [y=te1val, x=trepos]{data/tre.dat};
%\addlegendentry{1-grams}
\addplot +[mark=none]table [y=tejval, x=trepos]{data/tre.dat};
%\addlegendentry{2-grams}
\addplot +[mark=none]table [y=teeval, x=trepos]{data/tre.dat};
\end{axis}
\end{tikzpicture}
\end{scaletikzpicturetowidth}}
\caption[][2em]{The plots show the differences in word use, by means of word ranks and frequencies. The blue line represents the training data, the red line the corresponding test data; these two lines are comparable, modulo the difference in size. The other two lines are the test sets of the other two corpora: not only do they differ in size, but they also do not follow the Zipfian distribution of the training corpus.}
\end{figure*}
The problem with out-of-vocabulary words is not only that no specific probability can be learned for each OOV word,\footnote{Unlike for all the other words that are in the vocabulary.} but also that one has to resort to other means, such as a single generic OOV probability, or a hierarchical structure with a character-level language model.
The second problem is that perplexity is the default evaluation criterion for language models, and OOV words are not considered when computing perplexity. This has many implications, especially since such an evaluation says little about downstream tasks in which OOV words are a real issue.
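Concretely, with $V$ the vocabulary, $N'$ the number of test tokens $w_i \in V$, and $h_i$ their histories, excluding the OOV words amounts to computing
\[
\mathrm{PPL} = \exp\left( -\frac{1}{N'} \sum_{i \colon w_i \in V} \log p(w_i \mid h_i) \right),
\]
so that a model is never penalised for the words it cannot represent.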
In \cref{tab:ex1ample1} we list the OOV rates for each combination of training and test corpus. For each word in the test data, we checked whether it occurred in the training data.\footnote{After thresholding, in the case of \obw.}
Here we see that the OOV rates are lowest on the diagonal, which contains the within-corpus cases. Off the diagonal the OOV rates are higher, up to 22.2\%, which means that more than 1 out of 5 words in \wp does not occur in \emea. Although this is not necessarily a problem,\footnote{Since \emea is a small corpus with 3 million words, and no one would use it as a general-purpose corpus.} we have to take it into consideration when we look at the perplexity scores later in this thesis.
For \obw the cross-corpus OOV rates are lower, which supports our view that it is a general-domain text corpus.
Similarly, it is also interesting to look at the hit rates for the higher-order $n$-grams. In \cref{tab:ex1ample2,tab:ex1ample3,tab:ex1ample4} we list the rates of test patterns that are also present in the training data. The effects are generally much stronger than for the OOV rates. For example, in \cref{tab:ex1ample3} we see that about 86\% of the trigrams in the test data of \obw literally also occur in the training data of \obw.
% For the other test corpora this number is obviously much lower.
The same table also shows that the coverage with \emea as training corpus is much lower than with \obw.
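The OOV and hit rates in \cref{tab:ex1ample} come down to simple set membership tests; the following sketch (purely illustrative, with hypothetical function names) mirrors how such rates can be computed:
\begin{verbatim}
def oov_rate(train_sents, test_sents):
    """Percentage of test tokens whose type is not in the
    training vocabulary."""
    vocab = {w for s in train_sents for w in s}
    test_tokens = [w for s in test_sents for w in s]
    missing = sum(1 for w in test_tokens if w not in vocab)
    return 100.0 * missing / len(test_tokens)

def hit_rate(train_sents, test_sents, n):
    """Percentage of test n-grams that literally occur in the
    training data."""
    def ngrams(sents):
        return [tuple(s[i:i + n])
                for s in sents for i in range(len(s) - n + 1)]
    train_set = set(ngrams(train_sents))
    test_ngrams = ngrams(test_sents)
    hits = sum(1 for g in test_ngrams if g in train_set)
    return 100.0 * hits / len(test_ngrams)
\end{verbatim}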
\section{Dutch data}
\subsection{Newspaper texts: Mediargus corpus (\mediargus)}
For the experiments on Flemish-Dutch data, we use the Mediargus corpus as training material. It contains 5 years of newspaper texts from 12 Flemish newspapers and magazines, totalling 1.4 billion words over 5 million word types.
Since there is no official paper introducing the Mediargus corpus,\footnote{ESAT, the Department of Electrical Engineering at the KU Leuven, obtained the texts from Mediargus, a supplier of Southern-Dutch newspaper texts, for language model training. The corpus is also not publicly available.} we briefly introduce it here.
\begin{table}
\begin{tabular}{ll}
Source & \thead{Tokens \\ ($\times 10^6$)} \\ \hline
De Morgen & 135 \\
De Standaard & 118 \\
De Tijd & 98 \\
Gazet van Antwerpen & 240 \\
Het Belang van Limburg & 106 \\
Het Laatste Nieuws & 284 \\
Het Nieuwsblad & 322 \\
Het Volk & 133
\end{tabular}
\caption{The sources of the \mediargus corpus. The numbers are millions of tokens.}
\setfloatalignment{b}
\end{table}
Similarly to the \obw models, we applied an occurrence threshold to the word types such that we obtain a comparable vocabulary size (100 thousand types); for \mediargus this requires a threshold of 250. We used the same occurrence threshold of 2 on the $n$-grams and skipgrams.
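The threshold of 250 follows from the target vocabulary size rather than the other way around, so it can be determined directly from the unigram counts, as in the purely illustrative sketch below (the function name and the toy counts are ours):
\begin{verbatim}
from collections import Counter

def threshold_for_vocab_size(counts, target_size):
    """Smallest occurrence threshold that keeps at most
    `target_size` word types."""
    # Sort type frequencies from high to low; the frequency at
    # position `target_size` is the highest count that must be
    # excluded, so the threshold is one above it.
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) <= target_size:
        return 1                    # no thresholding needed
    return freqs[target_size] + 1

# Toy example: keeping at most 2 types yields a threshold of 5.
toy = Counter({"de": 10, "het": 7, "een": 4, "krant": 1})
print(threshold_for_vocab_size(toy, 2))   # -> 5
\end{verbatim}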
\subsection{Transcribed speech: Corpus Gesproken Nederlands (\cgn)}
For testing we use the Flemish part of the Spoken Dutch Corpus (CGN, henceforth denoted as \cgn),\autocite{oostdijk2000spoken} comprising 3 million words divided over 15 components (see~\cref{tab:cgn}), ranging from spontaneous speech to read-aloud books. The corpus aimed to be a plausible sample of contemporary standard Dutch as spoken by adult speakers in Flanders and the Netherlands. One third of the data was to be collected in Flanders, the other two thirds were to originate from the Netherlands.
\begin{table}
\begin{tabular}{llllllll}
& \obw & \emea & \jrc & \wp & \mediargus & \cgn & nbest \\ \cline{2-8}
train & 769M & 12M & 31M & 1.4G & 1.3G & & \\
test & 8M & 3M & 8M & 367M & & 3M & 0.5M
\end{tabular}
\caption[][-2em]{Summary of the number of words for each training and test set. M denotes millions of words, and G denotes billions ($10^9$) of words. Some pairs are incomplete, meaning that the missing part is not used in this thesis; for example, we have not trained models on \cgn.}
\setfloatalignment{b}
\end{table}
\begin{table}
\begin{tabular}{lll}
& Component description & \thead{Tokens \\ ($\times 10^3$)} \\ \hline
a & Spontaneous conversations (`face-to-face') & 955 \\
b & Interviews with teachers of Dutch & 350 \\
c & Spontaneous telephone dialogues (switchboard) & 559 \\ %(recorded via a switchboard)
d & Spontaneous telephone dialogues (local interface) & 390 \\ %(recorded on MD via a local interface)
e & Simulated business negotiations & 0 \\
f & Interviews, discussions, and debates (broadcast) & 282 \\
g & (Political) discussions, debates, and meetings (non-broadcast) & 161 \\
h & Lessons recorded in the classroom & 119 \\
i & Live commentaries (broadcast) & 89 \\
j & News reports and reportages (broadcast) & 108 \\
k & News (broadcast) & 93 \\
l & Commentaries, columns, and reviews (broadcast) & 75 \\
m & Ceremonious speeches and sermons & 14 \\
n & Lectures and seminars & 88 \\
o & Read speech & 407
\end{tabular}
\caption{Components in the Spoken Dutch Corpus. The reported number of tokens is for the Flemish part of the corpus. }
\label{tab:cgn}
\setfloatalignment{b}
\end{table}
The second series of test sets consists of the Flemish parts of the NBest benchmark\autocite{kessens2007n} (475 thousand words; the train, development, and test sets). The benchmark was introduced to stimulate further development of Dutch ASR systems, and it draws on a wide variety of texts, among which \mediargus and \cgn for the language models, and \cgn\footnote{Components f, i, j, k, l, c, and d.} for the acoustic models.