.TH FoLiA-stats 1 "2020 apr 02"
.SH NAME
FoLiA-stats - gather n-gram statistics from FoLiA files
.SH SYNOPSIS
FoLiA-stats [options] FILE
FoLiA-stats [options] DIR
.SH DESCRIPTION
When a DIR is provided,
.B FoLiA-stats
will process all FoLiA files in DIR and store its results in the current
directory in files called DIR.wordfreqlist.tsv, DIR.lemmafreqlist.tsv etc.
When a FILE is provided,
.B FoLiA-stats
will process that file and store its results in the directory where FILE is
found.
.PP
The output consists of 2 or 4 tab-separated columns, depending on the
.B -p
option.
.PP
First column:
the 'word', 'lemma' or 'POS tag' at hand.
.br
Second column:
the frequency of the 'word', 'lemma' or 'POS tag'.
.PP
When
.B -p
is provided:
.PP
Third column:
the accumulated frequency for all entries up to and including this one.
.br
Fourth column:
the relative presence of the entry: what percentage of the corpus does the
entry belong to?
.SH OPTIONS
.B --clip
number
.RS
Clipping factor or frequency cut-off. When an item's frequency is lower than 'number', it will not be stored.
.RE
.B -p
.RS
Also output accumulated counts and percentages
.RE
.B --lower
.RS
Lowercase all words.
.RE
.B --separator
sep
.RS
Define a separator value to connect n-grams. The default is an underscore (_).
.RE
.B --underscore
.RS
Backward compatibility. Preferably use --separator=_
.RE
.B --languages
Lan1,Lan2,...LanN
.RS
specify which languages to consider, based on the language tag as inserted
in the FoLiA xml by the programs Ucto, Frog or FoLiA-langcat.
Lan1 is the default language. Text that is not assigned to Lan1,Lan2,... is
counted as Lan1 (the default language), except when Lan1 equals 'skip'.
In the latter case, text in an undetected language is skipped.
When Lan1 equals 'all', all languages are collected separately.
When Lan1 equals 'none', language information is ignored.
.RE
.B --lang
language
.RS
backward compatibility. Equals
.B --languages=skip,language
meaning: only accept words from 'language'
.RE
.B --aggregate
.RS
create a combined frequency list (per n-gram) per language.
.RE
.B --ngram
count
.RS
extract all n-grams of length 'count' using the separator
.RE
.B --max-ngram
max
.RS
Construct all n-grams up to and including length 'max'.
When --ngram is also specified, its value is used as the minimum n-gram length.
.RE
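.PP
As an illustration of how the --ngram, --max-ngram, --separator, --lower and
--clip options relate to each other, here is a minimal Python sketch.
It is not the FoLiA-stats implementation: the function count_ngrams and its
parameters are hypothetical, and it assumes the input is a single sentence
that has already been tokenized.
.nf
```python
from collections import Counter


def count_ngrams(tokens, min_n=1, max_n=1, separator="_", lower=False, clip=0):
    """Count n-grams of length min_n..max_n in one tokenized sentence.

    Hypothetical helper for illustration only, not FoLiA-stats code.
    """
    if lower:
        tokens = [t.lower() for t in tokens]
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the sentence.
        for i in range(len(tokens) - n + 1):
            counts[separator.join(tokens[i:i + n])] += 1
    # Drop entries whose frequency is lower than the clip value.
    return {gram: freq for gram, freq in counts.items() if freq >= clip}


sentence = ["The", "cat", "saw", "the", "cat"]
freqs = count_ngrams(sentence, min_n=1, max_n=2, lower=True)
print(freqs["the_cat"])
```
.fi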
.B --mode
value
.RS
Do special actions:
.B string_in_doc
.RS
Collect ALL <str> nodes from the document and handle them as one long Sentence.
.RE
.B word_in_doc
.RS
Collect ALL <word> nodes from the document and handle them as one long Sentence.
.RE
.B lemma_pos
.RS
When processing nodes, also collect lemma and POS tag information. This implies --tags=s
.RE
.RE
.B --tags
tagset
.RS
collect text from all nodes in the list 'tagset'
.RE
.B --skiptags
tagset
.RS
skip text from nodes in the list 'tagset'
.RE
.B -s
.RS
Backward compatibility. Equals --tags=p
.RE
.B -S
.RS
Backward compatibility. Equals --mode=string_in_doc
.RE
.B --class
class
.RS
use 'class' as the folia text class of the text nodes to process.
(default is 'current'). You may provide an empty string.
.RE
.B --collect
.RS
collect all n-gram values in one output file. If not specified, the specific n-grams will be gathered in separate files.
.RE
.B --hemp
file
.RS
Create a historical emphasis file. This is based on words consisting of single, space-separated
letters. Printers in the past might typeset a name such as 'CAESAR' as 'C A E S A R', for emphasis.
.RE
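.PP
The idea behind historical emphasis can be sketched in Python as follows.
This is an assumption-laden illustration, not the FoLiA-stats algorithm:
the function join_emphasis and the min_letters threshold are hypothetical,
and the sketch only shows how runs of single-letter tokens might be
recognized and joined, not how the emphasis file is written.
.nf
```python
def join_emphasis(tokens, min_letters=2):
    """Join runs of single-letter tokens, e.g. C A E S A R becomes CAESAR.

    Hypothetical helper; min_letters is an assumed threshold.
    """
    result = []
    run = []
    # The trailing None is a sentinel that flushes the last run.
    for token in list(tokens) + [None]:
        if token is not None and len(token) == 1 and token.isalpha():
            run.append(token)
            continue
        if len(run) >= min_letters:
            result.append("".join(run))
        else:
            # A run that is too short is kept as separate tokens.
            result.extend(run)
        run = []
        if token is not None:
            result.append(token)
    return result


words = ["the", "emperor", "C", "A", "E", "S", "A", "R", "spoke"]
print(join_emphasis(words))
```
.fi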
.B --detokenize
.RS
When processing FoLiA with ucto tokenizer information, UNDO that tokenization.
(default is to keep it)
.RE
.B -t
or
.B --threads
number
.RS
Use 'number' of threads to run on. You may use --threads="max" to use as many
threads as possible. This will allocate 2 processors less than given by the
$OMP_NUM_THREADS environment variable, leaving some processor power for other
purposes.
.RE
.B -V
or
.B --version
.RS
Show VERSION
.RE
.B -v
or
.B --verbose
.RS
be verbose about what is happening
.RE
.B -e
expr
.RS
When searching for files,
.B
FoLiA-stats
will only consider files that match the expression 'expr', which may contain wildcards. The 'expr' is matched only against the file name, not against the path.
.RE
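.PP
Matching against the file name only can be sketched in Python as follows.
The sketch assumes shell-style wildcards, which this page does not state
explicitly, and the function matches_expr is hypothetical rather than part
of FoLiA-stats.
.nf
```python
import fnmatch
import os


def matches_expr(path, expr):
    """Match expr against the file name only, ignoring directories."""
    return fnmatch.fnmatch(os.path.basename(path), expr)


# The pattern is tested against doc1.folia.xml, not against the full path.
print(matches_expr("corpus/2020/doc1.folia.xml", "*.folia.xml"))
print(matches_expr("corpus/2020/doc1.folia.xml", "corpus/*"))
```
.fi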
.B -o
outprefix
.RS
use outprefix for all output files.
.RE
.B -R
.RS
when a DIR is provided,
.B FoLiA-stats
will recurse through this DIR and its subdirs to find files.
.RE
.B -h
or
.B --help
.RS
show usage information
.RE
.SH BUGS
possible
.SH AUTHORS
Ko van der Sloot: [email protected]
Martin Reynaert: [email protected]