Structure of the corpus prim-11.0

Main corpus of written texts:

Version prim-11.0 comprises the following publicly available subcorpora:

  • prim-11.0-public-all – all publicly available SNC texts (69.45 % journalistic, 17.93 % fiction, 11.76 % professional, 0.86 % other texts), 1 859 466 001 tokens, 1 494 472 047 words
  • prim-11.0-public-sane – excluding texts with incorrect diacritics and texts from outside the territory of Slovakia (69.11 % journalistic, 18.19 % fiction, 11.87 % professional, 0.82 % other texts), 1 830 899 368 tokens, 1 470 656 952 words
  • prim-11.0-public-vyv – balanced subcorpus (33.34 % journalistic, 33.33 % fiction, 33.33 % professional texts), 655 921 497 tokens, 527 684 718 words
  • prim-11.0-public-inf – subcorpus of journalistic (informational) texts, 1 265 407 184 tokens, 1 015 747 671 words
  • prim-11.0-public-prf – subcorpus of scientific, professional and non-fiction texts, 217 314 642 tokens, 176 427 670 words
  • prim-11.0-public-img – subcorpus of fiction texts, 333 119 892 tokens, 266 065 753 words
  • prim-11.0-public-sk – subcorpus of original Slovak texts (78.66 % journalistic, 12.83 % fiction, 7.52 % professional, 0.99 % other texts), 1 491 727 093 tokens, 1 200 473 815 words
  • prim-11.0-public-img-sk – subcorpus of original Slovak fiction texts, 112 147 673 tokens, 90 130 279 words
  • r1955az1989-8.0 – specific corpus of texts from the years 1955–1989 (3.70 % journalistic, 82.17 % fiction, 10.15 % professional, 3.97 % other texts), 118 208 927 tokens, 95 077 456 words
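
The token and word counts listed above can be compared directly. The following is a minimal sketch, not part of the corpus tooling: the dictionary name SUBCORPORA and the helper word_share are hypothetical, and the figures are copied from the list. It prints the share of tokens that are counted as words in each subcorpus.

```python
# Illustrative only: SUBCORPORA and word_share are hypothetical names.
# Each entry maps a subcorpus name to (tokens, words), copied from the list above.
SUBCORPORA = {
    "prim-11.0-public-all": (1_859_466_001, 1_494_472_047),
    "prim-11.0-public-sane": (1_830_899_368, 1_470_656_952),
    "prim-11.0-public-vyv": (655_921_497, 527_684_718),
    "prim-11.0-public-inf": (1_265_407_184, 1_015_747_671),
    "prim-11.0-public-prf": (217_314_642, 176_427_670),
    "prim-11.0-public-img": (333_119_892, 266_065_753),
    "prim-11.0-public-sk": (1_491_727_093, 1_200_473_815),
    "prim-11.0-public-img-sk": (112_147_673, 90_130_279),
    "r1955az1989-8.0": (118_208_927, 95_077_456),
}


def word_share(tokens: int, words: int) -> float:
    """Return the percentage of tokens that are counted as words."""
    return 100.0 * words / tokens


for name, (tokens, words) in SUBCORPORA.items():
    print(f"{name}: {word_share(tokens, words):.2f} % of tokens are words")
```

In every subcorpus the share comes out at roughly four fifths of all tokens.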

The corpus is provided with detailed bibliographical, style and genre annotation. It is lemmatized and morphologically annotated by a spaCy tagger trained at the Slovak National Corpus; an internal database of word forms was also used to correct wrongly assigned tags.

Frequency statistics of the corpora

The following frequency statistics are available for individual corpora and subcorpora:

Statistical information on earlier versions is available here: prim-10.0, prim-9.0, prim-8.0, prim-7.0, prim-6.1, prim-6.0, prim-5.0, prim-4.0, prim-3.0 and prim-2.1.