docs/FoLiA-stats.1

.TH FoLiA-stats 1 "2020 apr 02"

.SH NAME
FoLiA-stats - gather n-gram statistics from FoLiA files

.SH SYNOPSIS
FoLiA-stats [options] FILE

FoLiA-stats [options] DIR

.SH DESCRIPTION

When a DIR is provided,
.B FoLiA-stats
will process all FoLiA files in DIR and store its results in the current
directory in files called DIR.wordfreqlist.tsv, DIR.lemmafreqlist.tsv etc.

When a FILE is provided,
.B FoLiA-stats
will process that file and store its results in the directory where FILE is
found.

The output format will be 2 or 4 <tab> separated columns (depending on the
.B -p
option)

First column:
the 'word', 'lemma' or 'POS tag' at hand.

Second colum:
the Frequency of the 'word', 'lemma' or 'POS tag'

when -p is provided:

Third column:
the accumulated frequency for all entries up to and including this one.

Fourth column:
the relative presence of the entry: what percentage of the corpus does the
entry belong to?

.SH OPTIONS
.B --clip
number
.RS
clipping factor or frequnecy cut-off. When an item's frequency is lower than 'number', it will not be stored.
.RE

.B -p
.RS
Also output accumulated counts and percentages

.RE

.B --lower
.RS
Lowercase all words.
.RE

.B --separator
sep
.RS
Define a separator value to connect ngrams. Default is an underscore. (_)
.RE

.B --underscore
.RS
Backward compatibility. Preferably use --separator=_
.RE

.B --languages
Lan1,Lan2,...LanN
.RS
specify which languages to consider, based on the language tag as inserted
in the FoLiA xml by the programs Ucto, Frog or FoLiA-langcat.

Lan1 is the default language. Text that is not assigned to Lan1,Lan2,... is
counted as Lan1 (the default language), except when Lan1 equals 'skip'.
In the latter case, text in an undetected language is skipped.
When Lan1 equals 'all', all languages are collected separately.
When Lan1 equals 'none', language information is ignored.
.RE

.B --lang
language
.RS
backward compatibility. Equals
.B --languages=skip,language
meaning: only accept words from 'language'
.RE

.B --aggregate
.RS
create a combined frequency list (per n-gram) per language.
.RE


.B --ngram
count
.RS
extract all n-grams of length 'count' using the separator
.RE

.B --max-ngram
max
.RS
Construct all n-grams up to and including length 'max'
When --ngram is specified too, that is used as the minimum n-gram length.
.RE

.B --mode
value
.RS
Do special actions:

.B string_in_doc
.RS
Collect ALL <str> nodes from the document and handle them as one long Sentence.
.RE
.B word_in_doc
.RS
Collect ALL <word> nodes from the document and handle them as one long Sentence.
.RE
.B lemma_pos
.RS
When processsing nodes, also collect lemma and POS tag information. THIS implies --tags=s
.RE
.RE

.B --tags
tagset
.RS
 collect text from all nodes in the list 'tagset'
.RE

.B --skiptags
tagset
.RS
 skip text from nodes in the list 'tagset'
.RE

.B -s
.RS
backward compatibility. equals --tags=p
.RE

.B -S
.RS
backward compatibility. equals --mode=string_in_doc
.RE

.B --class
class
.RS
use 'class' as the folia text class of the text nodes to process.
(default is 'current'). You may provide an empty string.

.RE

.B --collect
.RS
collect all n-gram values in one output file. If not specified, the specific n-grams will be gathered in separate files.
.RE

.B --hemp
file
.RS
Create a historical emphasis file. This is based on words consisting of singe space
separated letters. Printers in the past might print-set a name such as 'CAESAR' as 'C A E S A R', for emphasis.
.RE

.B --detokenize
.RS
When processing FoLiA with ucto tokenizer information, UNDO that tokenization.
(default is to keep it)
.RE

.B -t
or
.B --threads
number
.RS
use 'nummber' of threads to run on. You may us --threads="max" to use as many
threads as possible. This will allocate 2 processors less than given by the
$OMP_NUM_THREADS environment variable, leaving some processor power for other
purposes.
.RE

.B -V
or
.B --version
.RS
Show VERSION
.RE

.B -v
or
.B --verbose
.RS
be verbose about what is happening
.RE

.B -e
expr
.RS
when searching for files,
.B
FoLiA-stats
will only considers files that match with the expression 'expr', which may contain wildcards. The 'expr' is only matched against the file part. Not against paths.
.RE

.B -o
outprefix
.RS
use outprefix for all output files.
.RE

.B -R
.RS
when a DIR is provided,
.B FoLiA-stats
will recurse through this DIR and its subdirs to find files.
.RE

.B -h
or
.B --help
.RS
show usage information
.RE

.SH BUGS
possible

.SH AUTHORS
Ko van der Sloot: lamasoftware@science.ru.nl

Martin Reynaert: reynaert@uvt.nl