Morphological annotation of texts in the Slovak National Corpus

Morphological annotation is a fundamental and very common type of linguistic information found in corpora, especially for inflectional languages. It records the part of speech and the morphological features of a word in context. It is usually preceded by lemmatization, the assignment of a basic form to a particular lexeme. In the Slovak National Corpus there are two types of morphological annotation and lemmatization:
  • manual morphological annotation in the subcorpus r-mak, based on a set of tags and rules that includes the lemmatization rules,
  • automatic morphological annotation in all the other corpora and subcorpora, using the same set of tags and rules; the tagger MorphoDiTa was trained and tuned on the subcorpus r-mak, and data (word forms) from the Morphological Database of the SNC were also used.

All the tags can be viewed in the following charts. Examples are taken from the manually annotated subcorpus.

Adjective     Conjunction             Undefinable part of speech
Pronoun       Particle                Non-verbal element
Numeral       Interjection            Foreign language citation
Verb          Reflexive morpheme      Number
Participle    Conditional morpheme    Proper name
Adverb        Abbreviation, symbol    Incorrect spelling

A document on the morphological annotation of texts in the Slovak National Corpus can be found here.

Each token is subject to morphological annotation. A token is a sequence of characters normally found between two spaces; punctuation marks are also tokens, because spaces are systematically added before them during segmentation.
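The definition of a token above can be illustrated with a minimal Python sketch. It is not the segmentation procedure used by the Slovak National Corpus: the function name `tokenize` and the rule that every non-alphanumeric, non-space character counts as punctuation are assumptions made for illustration only. Real segmentation has to handle abbreviations, numbers, hyphenated words and similar cases, which this sketch ignores.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split text into tokens (hypothetical helper, simplified rules)."""
    # Put a space on both sides of every punctuation character so that
    # each one ends up as a separate token after splitting.
    spaced = re.sub(r"([^\w\s])", r" \1 ", text)
    # A token is then whatever stands between two runs of whitespace.
    return spaced.split()


print(tokenize("Dobrý deň, svet!"))
```

Under these assumptions the comma and the exclamation mark come out as tokens of their own, next to the three word tokens.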

Each token is assigned a lemma and a tag.

A lemma is essentially the dictionary entry form of a token; for inflected parts of speech and adverbs, one lemma covers all of their word forms.

A tag gives the formal description of a token.
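The relation between token, lemma and tag can be sketched as a small data structure. The sketch below assumes a tab-separated line layout (word form, lemma, tag), which is a common corpus convention but is not stated in this text. The names `Token`, `parse_line` and `forms_by_lemma` are hypothetical, and the tag values `TAG_A`, `TAG_B`, `TAG_C` are placeholders, not real tags from the tagset.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str   # the word form as it appears in the text
    lemma: str  # the basic (dictionary) form
    tag: str    # the formal description of the token


def parse_line(line: str) -> Token:
    """Parse one 'form<TAB>lemma<TAB>tag' line (assumed layout)."""
    form, lemma, tag = line.rstrip("\n").split("\t")
    return Token(form, lemma, tag)


def forms_by_lemma(tokens: list[Token]) -> dict[str, set[str]]:
    """Collect the word forms that share each lemma."""
    grouped: dict[str, set[str]] = defaultdict(set)
    for token in tokens:
        grouped[token.lemma].add(token.form)
    return dict(grouped)


# Placeholder tags; two forms of the noun "mesto" plus a comma.
lines = ["mesta\tmesto\tTAG_A", "mestá\tmesto\tTAG_B", ",\t,\tTAG_C"]
tokens = [parse_line(line) for line in lines]
print(forms_by_lemma(tokens))
```

Grouping by lemma shows how one lemma stands for several inflected word forms, while the tag keeps the information that distinguishes them.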

More detailed information (in Slovak) on tokenization, lemmatization and morphological annotation of the Slovak National Corpus can be found here.