January 15, 2025
- For Amharic and Tigrinya, Constraint Grammar rules now perform some morphological disambiguation and some Universal Dependency annotation automatically (see ?? for details).
HornMorpho (HM) is a Python program that performs morphological analysis and generation for various languages of the Horn of Africa. The languages supported in Version 5.3 are Amharic (አማርኛ), Oromo (Afaan Oromoo, Oromiffa), Tigrinya (Tigrigna, ትግርኛ), and Tigre (ትግሬ, ትግራይት). Most examples within this document are Amharic; future versions will include more examples from the other languages.
If your application can benefit from explicit linguistic information about the structure and grammatical properties of words in these languages, then you may want to use HM. HM can tell you, for example, that the verb የማይደረገው is negative, that the noun አባቴን is the object of some verb, that the stem (the word without prefixes and suffixes) of the verb የምንፈልጋቸው is -ፈልግ-, that the lemma (basic form) of the verb እንደሚመኟቸው is ተመኘ, that is, that this verb has something to do with ‘longing’. HM can also tell you that the word እንደሚመኟቸው consists of five segments (morphemes): እንደም+ይ+መኝ+ኡ+ኣቸው.
HM is a rule-based program; that is, the knowledge in the program is based on explicit linguistic rules and a lexicon, a dictionary of basic word forms (stems and roots), rather than on machine learning of the knowledge from a corpus.
- For Amharic, the lexicon is extracted mainly from Amsalu Aklilu’s Amharic-English Dictionary (Addis Ababa, Kuraz, 2004). The rules come from many grammars of the language.
- For Tigrinya, the lexicon is from Thomas Leiper Kane’s Tigrinya-English Dictionary (Kensington, MD, USA, Dunwoody Press, 2000). The rules come mainly from Wolf Leslau’s Documents Tigrigna, Grammaire et Textes (Paris, Librarie C. Klincksieck, 1941) and Amanuel Sahle’s ሰዋስው ትግርኛ ብሰፊሕ (Lawrenceville, NJ, Red Sea Press, 1998).
- For Oromo, the lexicon is from two dictionaries, Gene B. Gragg’s Oromo Dictionary (African Studies Center, Michigan State University, 1982) and Tamene Bitima’s A Dictionary of Oromo Technical Terms (Oromo-English) (Rüdiger Köppe, Köln, 2000). The rules come mainly from Catherine Griefenow-Mewis’s A Grammatical Sketch of Written Oromo (Köln, Rüdiger Köppe Verlag, 2001).
- For Tigre, all of the words and rules are from the Mansa` dialect of the language. The lexicon is still quite limited, containing only several hundred noun and adjective roots and 86 verb roots. The roots are taken from Saleh Mahmud Idris’s A Comparative Study of the Tigrinya Dialects (Aachen, Shaker [Semitica et Semiohamitica Berolinensia 18], 2015) and from Shlomo Raz’s Tigre Grammar and Texts (Malibu, CA, USA, Undena Publications, 1983). The rules come from Raz.
Though HM does not make use of machine learning, it is possible to use its output in models that do. For example, Gezmu & Nürnberger (2023) use HM’s segmentation of Amharic words for neural machine translation.
HM assigns a part-of-speech (POS) to each word, but if you want a POS tagger, you should look elsewhere. A word’s POS often depends on the other words in the sentence in which it occurs, and HM analyzes words without looking at their context.
HM has a list of Amharic person and place names, but if you want named entity recognition, you should look for a program that has been trained to do this. If a name is not in HM’s list for Amharic, it will just be treated as an unknown word, and this will be true for almost all names in Tigrinya, Oromo, and Tigre.
Version 5 replaces Version 4.5 for Amharic. For other languages, see Version 4.3. Version 5 is not backward compatible with earlier versions. If you have used earlier versions of HM and would like to switch to Version 5, please contact [email protected] for help.
It is highly recommended that you install the program in a virtual environment, but this is not required. If you are using a virtual environment, you will need to create the environment and activate it before running pip install.
First download the wheel file from the dist/ folder:
HornMorpho-5.3-py3-none-any.whl
Then, to install from the wheel file, run the following in a terminal (not inside the Python interpreter), from the folder where the wheel file is located:
pip install HornMorpho-5.3-py3-none-any.whl
If this fails, it may mean that you don’t have wheel installed, so try again after installing wheel.
Then to use the program, in a Python shell, do
>>> import hm
The first time you use HornMorpho, you will need to download the data for the languages that you will be using. Each language’s data is stored in a compressed .tgz archive. To download a language’s archive, do this
>>> hm.download(language)
where language is 'a' for Amharic, 't' for Tigrinya, 'o' for Oromo, or 'te' for Tigre. This will download the compressed file from the HornMorpho GitHub repository and then uncompress it. If you try to use any of the functions described below without first downloading the data for the relevant language, you will be prompted to download the data.
If you have problems with installation, contact [email protected].
If you aren’t interested in learning more about what HM can do and just want to use it to analyze the words in a corpus of sentences, this section has the minimum that you’ll need to know.
To analyze the words in a corpus, use the function anal_corpus(), passing the sentences as a list of strings, using the keyword data, or as a path to a file containing the sentences, using the keyword path.
(1) >>> c = hm.anal_corpus('a', data=["በሶ የበላው አበበ አይደለም ።", "ጫላ ጩቤዬን ጨብጧል ።"])
This returns an instance of the class Corpus, which has a write() method that you can call to write the analyses to a file, using the keyword path, or to standard output if you specify no path. You can tell which word attributes you want to write with the keyword attribs. Some possible attributes are part-of-speech ('pos'), morphological features ('um'), segmentation into morphemes ('seg'), and lemma ('lemma').
(2) >>> c.write(attribs=['pos', 'um', 'lemma'])
በሶ የበላው አበበ አይደለም ።
በሶ N SG በሶ
የበላው V *RELC;3;DEF;MASC;PFV;SG በላ
አበበ V 3;MASC;PFV;SG አበበ
አይደለም COP 3;MASC;NEG;PRS;SG ነው
። PUNCT
ጫላ ጩቤዬን ጨብጧል ።
ጫላ PROPN SG ጫላ
ጩቤዬን N ACC;PSS1S;SG ጩቤ
ጨብጧል V 3;MASC;PRF;SG ጨበጠ
። PUNCT
HM is a rule-based morphological analyzer and generator, implemented in the form of finite-state transducers weighted with feature structures. For the theory behind the program, see Gasser (2011).
Most users of HM will be interested in morphological analysis. The program also works in the opposite direction, performing morphological generation, taking as input the root and grammatical features of a word and returning the word form. Documentation of the generation functions is forthcoming.
The simplest HM function, anal, takes a word and returns an instance of the Word class. An HM Word is a list of Python dicts, each representing a separate analysis of the input word. [1] You can use the usual Python ways of accessing the elements in a list or dict. For example, here is how you would analyze the Amharic word የቤታችን. The first argument to anal specifies the language; 'a' is Amharic, 't' Tigrinya, 'o' Oromo, 'te' Tigre.
(3) >>> w = hm.anal('a', "የቤታችን")
The keys in the dict for an analysis of a word represent different pieces of information that you may be interested in. For example, you may want the lemma of the input word. This is the basic form of the word. For nouns in all of the languages, this is the stem of the word without any prefixes or suffixes. Here’s how you’d get the lemma for the above analysis of the word የቤታችን. w[0] returns the first analysis dict in the list of analyses, and w[0]['lemma'] returns the value associated with the keyword lemma in this dict. [2]
(4) >>> w[0]['lemma']
'ቤት'
Other dict keys are described below.
You will probably not want to use HM to analyze individual words, as in the above example. There are also functions for analyzing sentences and corpora of sentences, anal_sentence() and anal_corpus(), described below. These functions call anal() on the words in the sentences.
A morphologically complex word consists of multiple morphemes, that is, more than one meaningful unit. One morpheme, the stem, is the part that conveys the basic meaning (the lexical meaning) of the word. The other morphemes, those that appear before the stem (as prefixes), after the stem (as suffixes) or within the stem (as infixes), modify the lexical meaning in various ways. For example, the Amharic word ለቤቶቻችን ‘for our houses’ consists of the stem ቤት and three additional morphemes, the prefix ለ- and the suffixes -ኦች and -ኣችን. [3]
A morphological segmentation of a word consists of a representation of the sequence of morphemes that make up the word. Morphological segmentation may be useful in NLP applications that make use of subword units, for example, language models. In these cases it provides an alternative to segmentation methods that are not informed by linguistic knowledge, such as byte pair encoding and the WordPiece method.
To get the segmentation of an analysis of a word, use the keyword 'seg'. HM represents the segmentation of the word ለቤቶቻችን in the following way.
(5) >>> w = hm.anal('a', 'ለቤቶቻችን')
>>> w[0]['seg']
'ለ-<ቤት>ኦች-ኣችን--'
The stem is the part between < and >. The instances of - separate the other morphemes. When - is not followed by a morpheme, the word lacks a possible morpheme in a particular position. Here are two other words showing prefixes and suffixes in other positions.
(6) >>> w1 = hm.anal('a', 'ለየቤታቸው')
>>> w1[0]['seg']
'ለ-እየ<ቤት>-ኣቸው--'
(7) >>> w2 = hm.anal('a', 'ቤቱንም')
>>> w2[0]['seg']
'-<ቤት>-ኡ-ን-ም'
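The segmentation strings shown above can be taken apart with ordinary string operations. The following is a minimal sketch, not part of HM: the helper name parse_seg is hypothetical, and it assumes only the notation described here (the stem between < and >, other morphemes separated by -, and unanalyzed words with no < > at all).

```python
def parse_seg(seg):
    """Split an HM segmentation string into (prefixes, stem, suffixes)."""
    if '<' not in seg or '>' not in seg:
        # Unanalyzed or unknown words carry no stem markers.
        return [], seg, []
    start, end = seg.index('<'), seg.index('>')
    stem = seg[start + 1:end]
    # Empty strings between '-' separators are unfilled positions; drop them.
    prefixes = [m for m in seg[:start].split('-') if m]
    suffixes = [m for m in seg[end + 1:].split('-') if m]
    return prefixes, stem, suffixes


print(parse_seg('ለ-<ቤት>ኦች-ኣችን--'))
print(parse_seg('-<ቤት>-ኡ-ን-ም'))
```

Dropping the empty positions loses the information about which slot each morpheme occupies; keep the raw split lists instead if the slot positions matter for your application.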
Like nouns, verbs in all four languages have a stem, which can take suffixes and (in Amharic, Tigrinya, and Tigre) prefixes. Here are two examples.
(8) >>> w1 = hm.anal('a', 'አልሰበሩም')
>>> w1[0]['seg']
'-ኣል-<ሰበር>ኡ--ም---'
(9) >>> w2 = hm.anal('a', 'የማይሰብረው')
>>> w2[0]['seg']
'የም-ኣ-ይ<ሰብር>-አው----'
But Amharic, Tigrinya, and Tigre verbs are much more complicated than nouns. As in other Semitic languages, the stem of a verb can be further separated into a root, consisting of a sequence of consonants, and a vowel template, consisting of a sequence of vowels and, in some cases, the gemination, or lengthening, of particular root consonants (Amharic: መጥበቅ).
For example, the stem of the verb in (8) above, ሰበር, could in turn be segmented into the root, consisting of the consonants /s b r/, and the template, consisting of the vowel ǝ (ኧ) between /s/ and /b/ and the same vowel between /b/ and /r/, as well as the gemination of the /b/. We could represent this in the following way, indicating gemination with / before the lengthened consonant.
(10) {ስ ብ ር} + {1 ǝ /2 ǝ 3}
Combining the root and template in (10), with 1=ስ, 2=ብ, and 3=ር, gives the stem ሰ/በር (/sǝbbǝr/), to which the prefixes and suffixes in (8) are added to yield the final form አልሰበሩም.
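The root-and-template combination in (10) can be illustrated with a few lines of Python. This is only a sketch of the idea, not how HM works internally (HM uses finite-state transducers). The function name apply_template is hypothetical, and the example works on a Latin transcription, because each Ethiopic character fuses a consonant with a vowel.

```python
def apply_template(root, template):
    """Interleave root consonants with a vowel template.

    Template tokens are separated by spaces. A digit selects a root
    consonant (1-based), '/' before a digit doubles (geminates) that
    consonant, and any other token is copied unchanged as a vowel.
    """
    out = []
    for token in template.split():
        geminate = token.startswith('/')
        slot = token.lstrip('/')
        if slot.isdigit():
            consonant = root[int(slot) - 1]
            out.append(consonant * 2 if geminate else consonant)
        else:
            out.append(token)
    return ''.join(out)


# Root {s b r} with the template from example (10).
print(apply_template(['s', 'b', 'r'], '1 ǝ /2 ǝ 3'))  # sǝbbǝr
```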
You can get the root of a verb using the keyword 'root'.
(11) >>> w = hm.anal('a', 'ትሰብራላችሁ')
>>> w[0]['root']
'ስብር:A'
The letter following the characters in the root specifies the root class. HM distinguishes nine different root classes, each with particular sets of templates. Normally the class doesn’t matter, but in some cases the same set of root consonants can belong to two different classes. This is true, for example, for the Amharic roots ጥብቅ:B ‘wait for’ and ጥብቅ:A ‘be tight’.
(12) >>> w1 = hm.anal('a', 'ጠብቂው')
>>> w1[0]['root']
'ጥብቅ:B'
(13) >>> w2 = hm.anal('a', 'አጥብቂው')
>>> w2[0]['root']
'ጥብቅ:A'
Oromo verbs are simpler, but they also have a stem that is followed by one or more suffixes, and this stem in turn can consist of a root followed by one or more suffixes. For example, the verb adeemsiise has the stem adeemsiis-, which in turn consists of the root adeem- followed by the causative suffix -siis.
(14) >>> w = hm.anal('o', "adeemsiise")
>>> w[0]['seg']
'<adeem-siis>e-'
Prefixes and suffixes may take different forms depending on what precedes or follows them, and they may also affect the form of the stem that they are attached to. For example, consider the suffix that means ‘my’, that is, the first person singular possessive suffix. When this follows a consonant, it is pronounced /-e/ (ኤ), as, for example, in ቤቴ. When it follows a vowel, however, it is usually pronounced /-ye/ (ዬ), as, for example, in ሥራዬ. To indicate that these forms are actually the same morpheme, HM shows them in a single canonical form.
(15) >>> w1 = hm.anal('a', "ቤቴ")
>>> w1[0]['seg']
'-<ቤት>-ኤ--'
(16) >>> w2 = hm.anal('a', "ሥራዬ")
>>> w2[0]['seg']
'-<ስራ>-ኤ--'
Here is a more complicated example with verbs. It shows how the suffix -ኢ /i/, indicating that the subject of the verb is second person feminine singular, can take different forms and can affect the end of the stem that it follows. In this case, HM normalizes both the stem and the suffix. Note how the segmentations in both (17) and (18) contain the suffix -ኢ and how the stem in the segmentations in both (18) and (19) is መርጥ.
(17) >>> w1 = hm.anal('a', 'ትፈልጊያለሽ')
>>> w1[0]['seg']
'--ት<ፈልግ>ኢ---ኣለሽ--'
(18) >>> w2 = hm.anal('a', 'ትመርጫለሽ')
>>> w2[0]['seg']
'--ት<መርጥ>ኢ---ኣለሽ--'
(19) >>> w3 = hm.anal('a', 'ትመርጣለህ')
>>> w3[0]['seg']
'--ት<መርጥ>---ኣለህ--'
Example (16) illustrates another HM feature. The Amharic and Tigrinya writing systems have some characters that have the same pronunciation, for example, ስ and ሥ, and because the choice of a character within these sets is not standardized, in HM, the spelling of words is normalized. That is, characters with the same pronunciation are mapped to a single character. Thus both ሥራ and ስራ appear in HM output as ስራ; both ጸጉር and ፀጉር appear as ጸጉር. In Tigrinya, both ስራሕ and ሥራሕ appear in HM output as ስራሕ; both መጸ and መፀ appear as መጸ.
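To make the idea of spelling normalization concrete, here is a small sketch that maps the two character series mentioned above onto their canonical counterparts. It is an illustration only, not HM's own normalization table, which may cover more character sets. The names SERIES_PAIRS and normalize are hypothetical, and the code relies on the fact that the Unicode Ethiopic block lays out each consonant series in consecutive code points.

```python
# First code point of each variant series and of its canonical series.
# Only the two pairs named in the text are covered (an assumption made
# for this example; HM's own table may be larger).
SERIES_PAIRS = [
    (0x1220, 0x1230),  # ሠ series -> ሰ series
    (0x1340, 0x1338),  # ፀ series -> ጸ series
]

# Map the seven basic vowel orders of each variant series.
NORMALIZE = {
    src + order: dst + order
    for src, dst in SERIES_PAIRS
    for order in range(7)
}


def normalize(text):
    """Replace variant characters with their canonical equivalents."""
    return text.translate(NORMALIZE)


print(normalize('ሥራ'), normalize('ፀጉር'))
```

Normalizing your own word lists in the same way before comparing them with HM output avoids spurious mismatches caused by spelling variants.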
In addition to segmenting a word, morphological analysis can yield a set of grammatical (or morphological) features, which, together with the word’s stem, completely describe the word.
For example, given the Amharic word ለቤቶቻችን, morphological analysis should tell us that the stem of this word is the noun ቤት (so it refers to a house) and that it in addition has three grammatical features: it is plural (it refers to more than one house), it has the preposition ለ- (it means “for” or “to” the houses in question), and it has the possessive suffix -ኣችን (so the houses are “our” houses).
HM represents morphological features in terms of an internal format that you probably won’t have to deal with and, in addition, in terms of the features used in two cross-linguistic projects, Universal Morphology and Universal Dependencies.
The Universal Morphology project (UM) aims to develop a set of morphological features that permit morphological annotation of data from any language. The features are defined as values along 23 dimensions, for example, gender and number, but many of these are not relevant for a given language such as Amharic, Tigrinya, Oromo, or Tigre. For a given word, the UM specification consists of a list of all of the feature values that apply to that word, separated by semicolons. To see the UM features for an analyzed word in HM, use the keyword 'um'.
(20) >>> w = hm.anal('a', 'ለቤቶቻችን')
>>> w[0]['um']
'DAT;PL;PSS1P'
The first of these features, DAT, is an abbreviation of dative, a linguistic term that describes some (though not all) of the functions of the preposition ለ-. You can think of DAT as standing in for all of these functions. The second, PL, indicates that the word is plural. The third, PSS1P, is an abbreviation of first person plural possessive, that is, ‘our’ in English.
For verbs, the features include ones for subject agreement, that is, features specifying the person, number, and gender of the subject of the verb, and ones for aspect, that is, features related to the time of the event or state referred to by the verb.
(21) >>> w1 = hm.anal('a', 'ትፈልጋለች')
>>> w1[0]['um']
'3;FEM;IPFV;SG'
(22) >>> w2 = hm.anal('a', 'ፈለገች')
>>> w2[0]['um']
'3;FEM;PFV;SG'
In both (21) and (22) the subject is third person ('3') singular ('SG') feminine ('FEM'), that is, ‘she’. In (21), the aspect is imperfective ('IPFV'), corresponding roughly to English present or future. In (22), the aspect is perfective ('PFV'), corresponding roughly to English past.
The full range of grammatical features that are output by HM will not be described here. You can find out more about UM features in Sylak-Glassman (2016). HM adds several features that are not part of the UM set; these are preceded by '*' in the HM output.
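Because a UM specification is a semicolon-separated string, it is easy to turn it into Python lists. The sketch below separates the standard UM features from the HM-specific ones marked with '*'. The helper name split_um is hypothetical and is not part of HM.

```python
def split_um(um):
    """Return (standard UM features, HM-specific features) from a 'um' string."""
    standard, hm_specific = [], []
    for feature in um.split(';'):
        if feature.startswith('*'):
            # HM-specific features are marked with a leading '*'.
            hm_specific.append(feature[1:])
        elif feature:
            standard.append(feature)
    return standard, hm_specific


# The feature string shown for የበላው in example (2).
print(split_um('*RELC;3;DEF;MASC;PFV;SG'))
```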
There are two levels of structure in sentences: morphology, structure within words, and syntax, structure between words. Syntactic structure can be represented in terms of labeled directed arcs joining words, known as syntactic relations or dependencies. Here is a simple Amharic example. The arrow from አዳነቻት to አልማዝ denotes the dependency with head አዳነቻት and dependent አልማዝ; አልማዝ is the subject ('nsubj') of አዳነቻት.
One very influential syntactic framework in NLP, similar to the Universal Morphology project, is the Universal Dependencies project (UD), which seeks to develop a set of universal dependencies that could adequately cover the syntactic structure of all languages.
Because morphology and syntax are related, annotating sentences within the UD framework includes not only linking words with dependencies but also assigning morphological features to words and possibly linking some of the morphemes within words by dependencies. Here is how that might appear for the sentence in (23).
Although morphological features in UD are based on those in UM, the format is different. You can see the UD features in an analysis with the keyword 'udfeats'. Here are the UD features for the words in examples (20) and (21).
(25) >>> w1 = hm.anal('a', 'ለቤቶቻችን')
>>> w1[0]['udfeats']
'Case=Dat|Number=Plur|Number[psor]=Plur|Person[psor]=1'
(26) >>> w2 = hm.anal('a', 'ትፈልጋለች')
>>> w2[0]['udfeats']
'Aspect=Imp|Gender=Fem|Number=Sing|Person=3'
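UD feature strings consist of Name=Value pairs joined by |, so they convert naturally to a dict. The following sketch shows one way to do this. The function name parse_udfeats is hypothetical, and treating '_' as "no features" follows the CoNLL-U convention rather than anything stated about HM.

```python
def parse_udfeats(udfeats):
    """Convert a UD feature string such as 'Case=Dat|Number=Plur' to a dict."""
    if not udfeats or udfeats == '_':
        return {}
    # Split each pair on the first '=' only.
    return dict(pair.split('=', 1) for pair in udfeats.split('|'))


feats = parse_udfeats('Case=Dat|Number=Plur|Number[psor]=Plur|Person[psor]=1')
print(feats['Case'], feats['Person[psor]'])
```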
In UD, annotated sentences are represented in a plain text format known as CoNLL-U. Each word is assigned a line, with ten slots for its properties, including morphological features and POS. There are a number of tools for annotating sentences using files in CoNLL-U format.
With the function anal_corpus(), HM allows a user to analyze the words in a set of sentences and save the analyses in a file in CoNLL-U format, including the morphological information (segmentation, morphological features, within-word dependencies) and POS tags but not the dependencies between the words in the sentences. The resulting file can then be uploaded using one of the annotation tools so that the dependencies can be added. To learn how to use HM to create CoNLL-U files, see the section Working with CoNLL-U format below.
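To give a feel for the ten slots in a CoNLL-U line, here is a sketch that builds one line from an HM analysis dict. It is not HM's own CoNLL-U writer. The function name conllu_line is hypothetical, the POS tag is copied unchanged (HM's own output may map tags such as 'N' to UD tags), and the slots that need sentence-level information are left as '_'.

```python
# The ten CoNLL-U columns, in order.
CONLLU_FIELDS = ('ID', 'FORM', 'LEMMA', 'UPOS', 'XPOS',
                 'FEATS', 'HEAD', 'DEPREL', 'DEPS', 'MISC')


def conllu_line(index, analysis):
    """Build a tab-separated CoNLL-U line from one analysis dict."""
    token = analysis.get('token') or '_'
    values = {
        'ID': str(index),
        'FORM': token,
        'LEMMA': analysis.get('lemma') or token,
        'UPOS': analysis.get('pos') or '_',
        'FEATS': analysis.get('udfeats') or '_',
    }
    # Columns not filled in above (XPOS, HEAD, DEPREL, DEPS, MISC) stay empty.
    return '\t'.join(values.get(field, '_') for field in CONLLU_FIELDS)


# A hand-built analysis dict using values from examples (21) and (26).
analysis = {'token': 'ትፈልጋለች', 'pos': 'V',
            'udfeats': 'Aspect=Imp|Gender=Fem|Number=Sing|Person=3'}
print(conllu_line(1, analysis))
```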
Many words are ambiguous; they may have multiple interpretations. Sometimes the ambiguity is orthographic; that is, the spelling represents two different words with different pronunciations. This can happen in Amharic, Tigrinya, and Tigre because the orthography does not indicate gemination. For example, the Amharic verb ይሰማሉ can mean either ‘they hear’ or ‘they are heard’, depending on the gemination of the /s/ and /m/. Besides gemination, these two interpretations can be distinguished on the basis of their lemmas, ሰማ in the first case, ተሰማ in the second. Other words are ambiguous because forms with the same pronunciation can have multiple interpretations. For example, the Amharic verb ብትመጣ can mean either ‘if you (masc.) come’ or ‘if she comes’; that is, the two interpretations differ in their grammatical features.
Each HM Word is a list of analyses, one for each possible interpretation that HM finds.
(27) >>> w1 = hm.anal('a', 'ይሰማሉ')
>>> w1[0]['lemma']
'ሰማ'
>>> w1[1]['lemma']
'ተሰማ'
(28) >>> w2 = hm.anal('a', 'ብትመጣ')
>>> w2[0]['um']
'2;COND;IPFV;MASC;SG'
>>> w2[1]['um']
'3;COND;FEM;IPFV;SG'
HM analyses include consonant gemination, but by default, the gemination is omitted in the output. To show gemination, indicated by / before the geminated consonant, you can use the keyword degem=False. This can show, for example, the difference in the pronunciation of the two analyses of ይሰማሉ.
(29) >>> w3 = hm.anal('a', 'ይሰማሉ', degem=False)
>>> w3[0]['seg']
'--ይ<ሰማ>---ኣ/ሉ--'
>>> w3[1]['seg']
'--ይ</ሰ/ማ>---ኣ/ሉ--'
HM attempts to order the analyses of ambiguous words on the basis of their relative frequency, but the frequency estimates are still somewhat crude, so you should not assume that the first analysis is always the most likely one.
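Since the first analysis is not guaranteed to be the right one, it can be useful to check how the analyses of a word differ before picking one. The sketch below collects the distinct values of one key across all analyses. The helper names distinct_values and is_ambiguous are hypothetical, and the code treats a Word simply as a list of dicts, as described above.

```python
def distinct_values(word, key):
    """Values of `key` across a word's analyses, in order, without repeats."""
    seen = []
    for analysis in word:
        value = analysis.get(key)
        if value is not None and value not in seen:
            seen.append(value)
    return seen


def is_ambiguous(word, key='lemma'):
    """True if the analyses disagree on the value of `key`."""
    return len(distinct_values(word, key)) > 1


# A plain list standing in for hm.anal('a', 'ይሰማሉ'), with the lemmas from (27).
word = [{'lemma': 'ሰማ'}, {'lemma': 'ተሰማ'}]
print(distinct_values(word, 'lemma'), is_ambiguous(word))
```

Checking 'um' as well as 'lemma' catches cases like (28), where the lemma is the same but the grammatical features differ.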
Returns an instance of the Language class, given its abbreviation, if the language has been loaded.
Returns an instance of the Word class belonging to the Language that is represented by the language argument.
If word belongs to HM’s list of common, unanalyzed words, the analysis returns no segmentation.
(30) >>> w = hm.anal('a', "ወደ")
>>> w[0]
{'token': 'ወደ', 'pos': 'ADP', 'nsegs': 1, 'freq': 4999}
If HM cannot analyze word, it returns an unanalyzed representation of the word, with POS 'UNK'. Note that HM is very strict when it comes to the spelling of words.
(31) >>> w = hm.anal('a', "አላቹ")
>>> w[0]
{'seg': 'አላቹ', 'pos': 'UNK', 'nsegs': 1}
If the returned Word includes analyses, each is a dict, with keywords for the different aspects of the analysis that you can access. See below under the description of the Word class for a list of these keywords.
language is a string abbreviation of the language: 'a' for Amharic, 't' for Tigrinya, 'o' for Oromo, 'te' for Tigre. If the word’s Language hasn’t been instantiated, that happens when anal() is called. If the data for that Language has not been downloaded, the user is prompted to download it when anal() is called.
word is a string representing an Amharic, Tigrinya, Oromo, or Tigre word or, if mwe=True is specified, a multi-word phrase.
**kwargs (keyword arguments with default values)
degem=True
If False, specifies that the word’s segments are not “degeminated,” that is, that consonant gemination is indicated (see example (29) above).
mwe=False
If True, specifies that the word string contains one or two spaces and that HM should try to analyze it as a multi-word expression (MWE). HM knows a limited number of MWEs, so this will not always succeed as expected.
(32) >>> w1 = hm.anal('a', "የትምህርት ቤታቸው", mwe=True)
>>> w1[0]['seg']
'የ--ትምህርት <ቤት>-ኣቸው--'
>>> w1[0]['lemma']
'ቤት'
(33) >>> w2 = hm.anal('a', "ብድግ ብለው", mwe=True)
>>> w2[0]['seg']
'ብድግ --<ብል>አው-----'
>>> w2[0]['lemma']
'አለ'
Returns an instance of the Sentence class.
language is a string abbreviation of the language, as for anal().
sentence is a string representation of a sentence to be analyzed.
You can get a list of analyzed Word instances in the sentence using the Sentence attribute words.
(34) >>> s = hm.anal_sentence('a', "ጫላ ጩቤ ጨበጠ ።")
>>> s.words
[W16:ጫላ[1], W18:ጩቤ[1], W19:ጨበጠ[1], W20:።[1]]
>>> for word in s.words: print(word[0].get('pos'))
PROPN
N
V
PUNCT
anal_sentence() tries to find known multi-word expressions before it analyzes individual words in the sentence.
(35) >>> s = hm.anal_sentence('a', 'ዝም ብላ ትምህርት ቤት ገባች ።')
>>> s.words
[W4:ዝም ብላ[1], W5:ትምህርት ቤት[1], W6:ገባች[1], W7:።[1]]
>>> for word in s.words: print(word[0].get('seg'))
ዝም --<ብል>ኣ-----
--ትምህርት <ቤት>---
--<ገባ>አች-----
None
Returns an instance of the Corpus class. A Corpus’s sentences attribute is a list of Sentence instances.
language is a string abbreviation of the language, as for anal().
**kwargs (keyword arguments with default values)
data=None
A list of strings, each representing an unanalyzed sentence, or None. If data is not None, each of the sentences is analyzed with anal_sentence(), and the resulting Sentence objects are stored in the Corpus’s sentences attribute.
(36) >>> c = hm.anal_corpus('a', data=['አበበ በሶ በላ ።', 'ጫላ ጩቤ ጭበጠ ።'])
>>> c.sentences
[S1::አበበ በሶ በላ ።, S1::ጫላ ጩቤ ጭበጠ ።]
>>> c.sentences[0].words[2][0]['um']
'3;MASC;PFV;SG'
>>> c.sentences[1].words[0][0]['pos']
'PROPN'
path=None
A string representing a path to a file containing unanalyzed sentences, one per line, or None. If data is None, and path is not None, the sentences in the file are read in and analyzed with anal_sentence(), and the resulting Sentence objects are stored in the Corpus’s sentences attribute.
degem=False
As for anal().
disambiguate=False
If True, Corpus.disambiguate() is called, opening a GUI in which the user can select analyses for ambiguous words. This only works within the set of functions that create CoNLL-U representations for sentences. See the section Working with CoNLL-U format below to find out more.
CGdisambiguate=True
If True and VISL CG3 is installed, it is called using the Constraint Grammar disambiguation rules for the language. (See Working with CoNLL-U format below.)
annotate=True
If True and VISL CG3 is installed, it is called using the Constraint Grammar dependency annotation rules for the language. (See Working with CoNLL-U format below.)
start=0
If path is specified, start gives the index of the sentence in the file to begin from. The default is 0, the first sentence.
n_sents=0
If path is specified and n_sents is not 0, it specifies the number of sentences to be analyzed and added to the corpus.
name=''
If name is not empty, the corpus is named C_name. Otherwise its name is C_#, where # is a unique integer.
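Once a corpus has been analyzed, its sentences and words can be traversed with ordinary loops. The sketch below counts the POS of the first analysis of every word. The function name pos_counts is hypothetical, and it relies only on the sentences and words attributes described in this document. So that the example runs without HM installed, it uses stand-in objects built with SimpleNamespace; with HM you would pass the object returned by anal_corpus() instead.

```python
from collections import Counter
from types import SimpleNamespace


def pos_counts(corpus):
    """Count the POS of the first analysis of every word in a corpus."""
    counts = Counter()
    for sentence in corpus.sentences:
        for word in sentence.words:
            if word:  # skip words that have no analyses at all
                counts[word[0].get('pos', 'UNK')] += 1
    return counts


# Stand-in for a Corpus with one sentence, using the POS values from (34).
fake_corpus = SimpleNamespace(sentences=[
    SimpleNamespace(words=[
        [{'pos': 'PROPN'}], [{'pos': 'N'}], [{'pos': 'V'}], [{'pos': 'PUNCT'}],
    ]),
])
print(pos_counts(fake_corpus))
```

The counts reflect only the first analysis of each word, which is not always the most likely one for an ambiguous word.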
Normally you will not need to instantiate any of these classes directly. Instead you will be creating them using the functions above.
A Language is created using the static method Language.make(). Language.make() is called whenever any of the analysis methods, anal(), anal_sentence(), or anal_corpus(), is called for the first time with the language’s abbreviation.
An instantiated Language consists of a set of properties, which are read in from files called *.lg and *.um, where * is the language’s abbreviation, and a set of finite-state transducers (FSTs) for each POS, loaded from compiled pickle files. These FSTs include the combined lexical and grammatical information for the particular POS.
Normally you should not need to change any of these files. If you would like to add to or correct the knowledge that HM has for a particular language, please contact [email protected].
hm.Language.version
A str representing the current version of the language’s compiled data files.
Word is a subclass of list. Each element of the list is a dict representing a single analysis of a word string. See below for the keywords in the analysis dicts.
Instances of Word are normally created by the analysis functions, anal(), anal_sentence(), and anal_corpus().
An analysis dict may have the following possible keywords. (Note that not all keywords appear for words that are not actually analyzed, for example, prepositions like ወደ and conjunctions like ግን, or for unknown words.)
'token'
The input word string.
'pos'
The word’s part-of-speech.
'lemma'
The word’s lemma. For nouns and adjectives, this is the stem, that is, the form of the word with no prefixes or suffixes. For Amharic, Tigrinya, and Tigre verbs, this is the third person singular masculine perfective form.
(37) >>> w = hm.anal('a', "ያነጋግሩታል")
>>> w[0]['lemma']
'አነጋገረ'
For Oromo verbs, the lemma is the infinitive.
(38) >>> w = hm.anal('o', "dubbattani")
>>> w[0]['lemma']
'dubbachuu'
'root'
For nouns this is the same as the lemma. For Amharic, Tigrinya, and Tigre verbs, it consists of the consonants in the verb’s root, followed by a letter representing the root’s class. For some Amharic roots, HM uses the characters እ or ይ to represent consonants that were pronounced at some point in the language’s history but now only affect the particular vowels that appear.
(39) >>> w1 = hm.anal('a', "ብሉት")
>>> w1[0]['root']
'ብልእ:A'
(40) >>> w2 = hm.anal('a', "ፃፈችው")
>>> w2[0]['root']
'ጽእፍ:A'
(41) >>> w3 = hm.anal('a', "ሰለቸኝ")
>>> w3[0]['root']
'ስልችይ:E'
For Oromo verbs, the root is either the same as the stem, or it is a more basic form to which suffixes representing additional meanings have been added. For many verbs, HornMorpho will return both possibilities as separate analyses.
(42) >>> w = hm.anal('o', "banamaniiru")
>>> w[0]['seg'];w[0]['um'];w[0]['root']
'<banam>aniiru-'
'3;PFV;PL'
'banam'
>>> w[1]['seg'];w[1]['um'];w[1]['root']
'<ban-am>aniiru-'
'3;PASS;PFV;PL'
'ban'
'um'
The UM features for the analysis. See examples (20), (21), (22), and (28).
'seg'
The segmentation of the word into its constituent morphemes, with - representing morpheme positions that are empty and < > enclosing the word’s stem. See examples (5)-(9), (15)-(19), and (29).
'pre', 'suf', 'stem'
The prefixes, suffixes, and stem of the word. 'pre' and 'suf' are each a list of dicts, with '' in any position that is empty; 'stem' is a single dict. Each dict contains keys for the morpheme’s form ('seg'); its POS ('pos'); the index within the word of its dependency head ('head'), that is, the index of the stem; its relation to the word’s head ('dep'); and a set of features that are relevant for UD CoNLL-U format ('udfeats').
(43) >>> w = hm.anal('a', 'ሲያጎበድዱላቸውና')
>>> w[0]['pre']
[{'seg': 'ስ', 'pos': 'SCONJ', 'dep': 'mark', 'head': 2}, '', {'seg': 'ይ', 'pos': 'PRON', 'dep': 'nsubj', 'head': 2}]
>>> w[0]['stem']
{'seg': 'ኣ-ጎበድድ', 'pos': 'V', 'head': 2, 'udfeats': 'Aspect=Imp|ClauseType=AdvCl|DatNum=Plur|DatPers=3|Number=Plur|Person=3|Voice=Trans'}
>>> w[0]['suf']
[{'seg': 'ኡ', 'pos': 'PRON', 'dep': 'nsubj', 'head': 2}, {'seg': 'ላቸው', 'pos': 'PRON', 'dep': 'obl', 'head': 2}, '', '', '', {'seg': 'ና', 'pos': 'CCONJ', 'dep': 'cc', 'head': 2}]
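The 'pre', 'stem', and 'suf' values can be flattened into a single list of morphemes in word order. The following sketch assumes the shapes shown in example (43): lists containing dicts or '' for 'pre' and 'suf', and a single dict for 'stem'. The function name morphemes is hypothetical.

```python
def morphemes(analysis):
    """List (form, POS, relation) for each morpheme, in word order."""
    parts = list(analysis.get('pre') or [])
    parts.append(analysis.get('stem') or '')
    parts.extend(analysis.get('suf') or [])
    # Empty positions are represented by '' and are skipped.
    return [(m['seg'], m.get('pos'), m.get('dep')) for m in parts if m]


# A shortened stand-in for the analysis in example (43).
analysis = {
    'pre': [{'seg': 'ስ', 'pos': 'SCONJ', 'dep': 'mark', 'head': 2}, '',
            {'seg': 'ይ', 'pos': 'PRON', 'dep': 'nsubj', 'head': 2}],
    'stem': {'seg': 'ኣ-ጎበድድ', 'pos': 'V', 'head': 2},
    'suf': [{'seg': 'ኡ', 'pos': 'PRON', 'dep': 'nsubj', 'head': 2}, '', ''],
}
for form, pos, dep in morphemes(analysis):
    print(form, pos, dep)
```

The stem has no 'dep' key in the example, so its relation comes back as None.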
'freq'
An estimate of the frequency of the word’s stem within a corpus, used to sort analyses.
'udfeats'
The set of morphological features in Universal Dependencies format. See examples (25) and (26). This is only something you need to know about if you are using HM to create a file of sentences in CoNLL-U format.
language is a string abbreviation of the language, as for anal().
sentence is a string representation of a sentence. HM doesn’t have a tokenizer, so it is assumed that the sentence is already tokenized, in particular, that punctuation is separated from words.
**kwargs (keyword arguments)
degem=True
As for anal(), if False, specifies that the segments of each word are not "degeminated," that is, that consonant gemination is indicated (see example (29) above).
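Because the sentence is expected to arrive already tokenized, you may need to separate punctuation yourself before calling the analysis functions. Here is a minimal sketch using the standard re module. The function name separate_punctuation is hypothetical, and the choice of which marks to split off (the Ethiopic punctuation characters U+1362 to U+1368 plus a few ASCII marks) is an assumption made for this example, not a description of what HM requires.

```python
import re

# Ethiopic punctuation from the full stop (U+1362) to the paragraph
# separator (U+1368), plus a few ASCII marks. Which marks to split off
# is an assumption made for this example.
PUNCT = re.compile(r'([\u1362-\u1368.,;:!?])')


def separate_punctuation(sentence):
    """Put spaces around punctuation marks and collapse repeated spaces."""
    spaced = PUNCT.sub(r' \1 ', sentence)
    return ' '.join(spaced.split())


print(separate_punctuation('ጫላ ጩቤ ጨበጠ።'))
```

This simple rule also splits decimal numbers and abbreviations that contain a period, so real text may need a more careful tokenizer.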
hm.Sentence.words
A list of analyzed Word objects.
hm.Sentence.print_conllu()
Prints the CoNLL-U representation for the Sentence.
**kwargs: see anal_corpus().
hm.Corpus.sentences
A list of analyzed Sentence objects.
hm.Corpus.write(path, properties)
Writes specified properties of the analyses in the corpus’s sentences to a file or standard output. See example (2).
Parameters
path
A string representing a path to a file or None. If None, the analyses are written to standard output.
properties
A list of strings consisting of analysis keywords to be written, for example, 'pos' and 'seg'.
The features described in this section are being used as part of a project to create Amharic and Tigrinya treebanks. [4]
The process involves five steps. 1, 2, and 4 are automatic; 3 and 5 are manual.
1. Each sentence is analyzed by HornMorpho, which treats each word (or multi-word expression in its lexicon) independently. This results in significant ambiguity.
2. A set of Constraint Grammar disambiguation rules is run on the sentences. Given ambiguity for a word, the rules look at its context and attempt to assign the correct interpretation or at least to eliminate incorrect interpretations.
3. The manual disambiguator is run on the sentences. For words that are still ambiguous following step 2, the user selects the correct interpretation.
4. A set of Constraint Grammar dependency annotation rules is run on the sentences. These rules may assign the root of a sentence or create dependencies from one word to another, with or without an associated relation label.
5. The output of step 4 is a partial analysis of each sentence in CoNLL-U format. For all but the simplest sentences, this will be incomplete, so the sentences will need to be annotated by hand, using one or another annotation tool.
Constraint Grammar (CG) is a framework for the rule-based dependency parsing of sentences. CG rules are of two types: those that disambiguate words and those that assign dependencies between words. HornMorpho includes rules of both types for Amharic and Tigrinya, but these should be considered experimental since they still need to be tested on a large number of sentences.
In order to use the rules while analyzing the sentences in a corpus with HM, you will need to first install VISL CG3, a program that reads and applies CG rules to sentences. Installation instructions are here. The first time you call anal_corpus() on Amharic or Tigrinya text, you will be prompted for the path to VISL CG3 or given the option of not using CG at all.
To create representations of the sentences in a corpus in CoNLL-U format, first pass disambiguate=True to anal_corpus() when you call it on the corpus. After the sentences in the corpus have been analyzed (with anal_sentence()), the CG disambiguation rules will be applied if you have installed VISL CG3 and set the path to it. Next a GUI window will open so that you can select analyses for ambiguous words by hand.
(44) >>> c = hm.anal_corpus("a", data=["ድምፅ ከቤቱ ውስጥ ይሰማል ።"], disambiguate=True)
The GUI displays the analyses for one word at a time, beginning with the first word in the first sentence. The current word is underlined where the current sentence is displayed.
To move to the next or the previous word in the current sentence, use the directional buttons under “Word ID” (or the arrow keys). To move to the next or the previous sentence, use the buttons under “Sentence ID”.
Unambiguous words are displayed with gray backgrounds, for example, the word ድምፅ in figure (45) above. The word’s single analysis is shown in the space at the bottom of the window.
Figure (46) shows an example of an ambiguous word, ከቤቱ. The options are shown in the two yellow boxes under the ambiguous suffix, which can mean either ‘his’ (the upper box) or ‘the’ (the lower box). To choose one of the options, click on your choice.
Figure (47) shows another sort of ambiguity, for the word ይሰማል. Two entire analyses are shown at the bottom of the window. The difference between the two interpretations, ‘hears’ and ‘is heard’, is represented in HM in terms of the features (one is PASS (passive); the other is not) and in terms of the different lemmas, ሰማ and ተሰማ. To choose one of the options for cases like this, click on the number to the left of the analysis, 2 in this case.
There is an Undo button that lets you undo selections that you have made. When you are finished disambiguating, clicking on Quit closes the window.
The analyses will have been updated for all of the words that you have disambiguated; that is, each of these words will now have only one CoNLL-U style analysis.
At this point, if you have installed VISL CG3 and set the path to it, the dependency annotation rules will be applied to the sentences.
To write the CoNLL-U representations to a file or standard output, pass conllu=True to the Corpus method write().
(48) >>> c.write(conllu=True)
# text = ድምፅ ከቤቱ ውስጥ ይሰማል ።
# sent_id = s1
1 ድምፅ ድምጽ NOUN NOUN Number=Sing 1 _ _ _
2-4 ከቤቱ _ _ _ _ _ _ _ _
2 ከ ከ ADP ADP Case=Abl 3 case _ _
3 ቤት ቤት NOUN NOUN _ 3 _ _ _
4 ኡ ኡ DET DET Definite=Def 3 det _ _
5 ውስጥ ውስጥ ADP ADP _ 5 _ _ _
6-8 ይሰማል _ _ _ _ _ _ _ _
6 ይ ይ PRON PRON Gender=Masc|Number=Sing|Person=3 7 nsubj _ _
7 ሰማ ተሰማ VERB VERB Aspect=Imp|Voice=Pass 7 _ _ _
8 ኣል ኣል AUX AUX _ 7 aux _ _
9 ። ። PUNCT PUNCT _ 9 _ _ _
Once you have written the CoNLL-U representation of a set of sentences to a file, you can upload the file to a UD annotation tool like Arborator, where you can add the missing dependencies between words.
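Before uploading, it can be useful to see which words still lack a relation label. The following is a minimal sketch, not part of HornMorpho: read_conllu and unlabeled are hypothetical helper names, and the sample text is the sentence from example (48). Real CoNLL-U files separate columns with tabs; the reader falls back to splitting on whitespace so that it also works on the output as it is displayed in this guide.

```python
# The ten standard CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

SAMPLE = """\
# text = ድምፅ ከቤቱ ውስጥ ይሰማል ።
# sent_id = s1
1 ድምፅ ድምጽ NOUN NOUN Number=Sing 1 _ _ _
2-4 ከቤቱ _ _ _ _ _ _ _ _
2 ከ ከ ADP ADP Case=Abl 3 case _ _
3 ቤት ቤት NOUN NOUN _ 3 _ _ _
4 ኡ ኡ DET DET Definite=Def 3 det _ _
5 ውስጥ ውስጥ ADP ADP _ 5 _ _ _
6-8 ይሰማል _ _ _ _ _ _ _ _
6 ይ ይ PRON PRON Gender=Masc|Number=Sing|Person=3 7 nsubj _ _
7 ሰማ ተሰማ VERB VERB Aspect=Imp|Voice=Pass 7 _ _ _
8 ኣል ኣል AUX AUX _ 7 aux _ _
9 ። ። PUNCT PUNCT _ 9 _ _ _
"""


def read_conllu(text):
    """Yield one list of token dicts per sentence; comment lines are skipped."""
    sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif not line.startswith("#"):
            cols = line.split("\t") if "\t" in line else line.split()
            sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


def unlabeled(sentence):
    """Return word lines whose DEPREL column is still empty ("_").

    Multiword-token range lines such as 2-4 are skipped, since they never
    carry a relation of their own.
    """
    return [t for t in sentence if t["id"].isdigit() and t["deprel"] == "_"]


sentences = list(read_conllu(SAMPLE))
for token in unlabeled(sentences[0]):
    print(token["id"], token["form"])
```

For the sample sentence this lists words 1, 3, 5, 7, and 9, which are the ones that would need a relation added in the annotation tool.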
[1] As we’ll see below, words out of context can be morphologically ambiguous, with more than one possible analysis.
[2] In general it’s safer to use the
[3] For prefixes and suffixes beginning with the vowel /a/, HM uses the character ኣ. For suffixes beginning with the vowel /ǝ/, it uses the character አ rather than the usual Amharic character ኧ. This convention agrees with normal spelling rules in most other languages written with the Ge’ez script, which do not use the character ኧ.
[4] CoNLL-U format is not yet supported for Oromo.