Structure of the corpus prim-8.0

MAIN CORPUS OF WRITTEN TEXTS

Version prim-8.0 of the Slovak National Corpus (SNC) comprises the following publicly available subcorpora:

  • prim-8.0-public-all – all publicly available SNC texts (71.10 % journalistic, 15.22 % fiction, 8.51 % professional and 5.17 % other texts), 1 477 447 216 tokens, 1 160 286 731 words
  • prim-8.0-public-sane – all public texts except those with incorrect diacritics, those published before 1955, those from outside the territory of Slovakia, and those from linguistic journals (73.75 % journalistic, 16.33 % fiction, 8.91 % professional, 1.01 % other texts), 1 368 990 447 tokens, 1 076 309 519 words
  • prim-8.0-public-vyv – balanced subcorpus (33.33 % journalistic, 33.33 % fiction, 33.33 % professional texts), 377 138 077 tokens, 297 524 160 words
  • prim-8.0-public-inf – subcorpus of journalistic (informational) texts, 1 009 613 215 tokens, 791 376 893 words
  • prim-8.0-public-prf – subcorpus of scientific, professional and non-fiction texts, 121 926 591 tokens, 96 084 340 words
  • prim-8.0-public-img – subcorpus of fiction texts, 223 552 510 tokens, 177 545 076 words
  • prim-8.0-public-sk – subcorpus of original Slovak texts (81.24 % journalistic, 7.91 % fiction, 9.53 % professional, 1.32 % other texts), 1 042 623 207 tokens, 821 878 724 words
  • prim-8.0-public-img-sk – subcorpus of original Slovak fiction texts, 82 503 983 tokens, 65 627 003 words
  • r1955az1989-4.0 – specific corpus of texts from the years 1955–1989 (5.11 % journalistic, 75.73 % fiction, 13.82 % professional, 5.34 % other texts), 83 631 422 tokens, 66 825 217 words.
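
The figures in the list above can be turned into rough per-genre sizes and a words-per-token ratio. The following is only a minimal sketch: the dictionary layout and the function names are hypothetical, and it assumes that the genre percentages are measured in tokens, which this page does not state explicitly.

```python
# Figures copied from the list above: (tokens, words, genre shares in %).
# ASSUMPTION: the percentages are shares of tokens; the page does not say so.
SUBCORPORA = {
    "prim-8.0-public-all": (
        1_477_447_216,
        1_160_286_731,
        {"journalistic": 71.10, "fiction": 15.22,
         "professional": 8.51, "other": 5.17},
    ),
    "r1955az1989-4.0": (
        83_631_422,
        66_825_217,
        {"journalistic": 5.11, "fiction": 75.73,
         "professional": 13.82, "other": 5.34},
    ),
}


def words_per_token(tokens: int, words: int) -> float:
    """Return the fraction of tokens that are counted as words."""
    return words / tokens


def approx_tokens_by_genre(tokens: int, shares: dict) -> dict:
    """Estimate the token count of each genre from its percentage share."""
    return {genre: round(tokens * pct / 100) for genre, pct in shares.items()}


if __name__ == "__main__":
    for name, (tokens, words, shares) in SUBCORPORA.items():
        print(name, f"words/tokens = {words_per_token(tokens, words):.4f}")
        for genre, estimate in approx_tokens_by_genre(tokens, shares).items():
            print(f"  {genre}: ~{estimate:,} tokens")
```

Because the published percentages are rounded to two decimal places, the per-genre estimates are approximate and need not add up exactly to the stated token total.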

The corpus has detailed bibliographic, style and genre annotation. It is lemmatized and morphologically annotated with the MorphoDiTa tagger, which was trained at the Slovak National Corpus.

Frequency statistics of the corpora

Statistical information on earlier versions is available here: prim-7.0, prim-6.1, prim-6.0, prim-5.0, prim-4.0, prim-3.0 and prim-2.1.