Structure of the corpus prim-9.0

Version prim-9.0 comprises the following publicly available subcorpora:

  • prim-9.0-public-all – all publicly available SNC texts (73.96 % journalistic, 15.98 % fiction, 9.15 % professional, 0.91 % other texts), 1 652 197 242 tokens, 1 282 202 460 words

  • prim-9.0-public-sane – excluding texts with incorrect diacritics and texts from outside the territory of Slovakia (73.69 % journalistic, 16.21 % fiction, 9.23 % professional, 0.87 % other texts), 1 620 900 802 tokens, 1 256 679 127 words

  • prim-9.0-public-vyv – balanced subcorpus (33.33 % journalistic, 33.33 % fiction, 33.33 % professional texts), 453 594 173 tokens, 354 964 595 words

  • prim-9.0-public-inf – subcorpus of journalistic (informational) texts, 1 194 435 396 tokens, 919 577 280 words

  • prim-9.0-public-prf – subcorpus of scientific, professional and non-fiction texts, 149 581 785 tokens, 117 253 528 words

  • prim-9.0-public-img – subcorpus of fiction texts, 262 818 945 tokens, 208 414 905 words

  • prim-9.0-public-sk – subcorpus of original Slovak texts, 1 257 727 282 tokens, 976 508 960 words

  • prim-9.0-public-img-sk – subcorpus of original Slovak fiction texts, 93 429 604 tokens, 74 277 009 words

  • r1955az1989-6.0 – specific corpus of texts from the years 1955–1989 (4.50 % journalistic, 78.62 % fiction, 12.44 % professional, 4.43 % other texts), 98 544 125 tokens, 78 516 963 words
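
The token and word counts in the list above can be compared directly. The following is a minimal Python sketch, not part of the corpus tooling: it copies the figures from the list and computes, for each subcorpus, what share of its tokens are words. The names SUBCORPORA, word_share and summarize are hypothetical, and the reading of the result assumes the common corpus convention that tokens include punctuation marks while words do not.

```python
# Token and word counts copied from the subcorpus list above.
# The dictionary and function names are illustrative only.
SUBCORPORA = {
    "prim-9.0-public-all": (1_652_197_242, 1_282_202_460),
    "prim-9.0-public-sane": (1_620_900_802, 1_256_679_127),
    "prim-9.0-public-vyv": (453_594_173, 354_964_595),
    "prim-9.0-public-inf": (1_194_435_396, 919_577_280),
    "prim-9.0-public-prf": (149_581_785, 117_253_528),
    "prim-9.0-public-img": (262_818_945, 208_414_905),
    "prim-9.0-public-sk": (1_257_727_282, 976_508_960),
    "prim-9.0-public-img-sk": (93_429_604, 74_277_009),
    "r1955az1989-6.0": (98_544_125, 78_516_963),
}


def word_share(tokens, words):
    """Return the number of words as a percentage of the number of tokens."""
    return 100.0 * words / tokens


def summarize(subcorpora):
    """Map each subcorpus name to its word share, rounded to two decimals."""
    return {
        name: round(word_share(tokens, words), 2)
        for name, (tokens, words) in subcorpora.items()
    }


if __name__ == "__main__":
    for name, share in summarize(SUBCORPORA).items():
        print(f"{name:24s} {share:6.2f} % of tokens are words")
```

Under the stated assumption, the remaining share of tokens in each subcorpus would consist mostly of punctuation.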

Each text is provided with detailed bibliographical, style and genre annotation; it is lemmatized and morphologically annotated by the MorphoDiTa tagger, which was trained at the Slovak National Corpus.

 

Frequency statistics of the corpora

Statistical information on earlier versions is available here: prim-8.0, prim-7.0, prim-6.1, prim-6.0, prim-5.0, prim-4.0, prim-3.0 and prim-2.1.