Reference corpus

The reference corpus prim-7.0-frk is a subset of the corpus prim-7.0-public-all. It was built on the concept of the Frequency Dictionary of Slovak based on the Slovak National Corpus and on the bibliographic and style-genre annotation, according to the following criteria:

  • Texts are in standard written Slovak with diacritical marks, and 88.86% of them underwent linguistic proofreading before release;
  • Texts originate exclusively from print sources, so texts originally written and published on the Internet are not included;
  • Texts were published between 1991 and 2015 and therefore capture the vocabulary of contemporary Slovak;
  • Texts are evenly distributed over the three main styles (fiction, non-fiction and journalistic), with rhyming texts accounting for 0.2% of the final version of the corpus.

The prim-7.0-frk corpus contains 253,127,609 tokens from a total of 158,281 documents. It is lemmatized and morphologically annotated by the MorphoDiTa tagger (specially trained to recognise proper nouns), tuned on the SNC tagset.

Selected examples in Declension of nouns in Slovak with corpus examples were also taken from the prim-7.0-frk reference corpus.