Structure of the corpus prim-10.0

Main corpus of written texts

Version prim-10.0 comprises the following publicly available subcorpora:

  • prim-10.0-public-all – all publicly available SNC texts (71.00 % journalistic, 16.82 % fiction, 11.28 % professional, 0.90 % other texts), 1 688 211 881 tokens, 1 355 262 962 words
  • prim-10.0-public-sane – excluding texts with incorrect diacritics and texts originating outside the territory of Slovakia (70.52 % journalistic, 17.15 % fiction, 11.46 % professional, 0.87 % other texts), 1 649 561 653 tokens, 1 323 046 192 words
  • prim-10.0-public-vyv – balanced subcorpus (33.33 % journalistic, 33.33 % fiction, 33.33 % professional texts), 571 526 056 tokens, 459 358 995 words
  • prim-10.0-public-inf – subcorpus of journalistic (informational) texts, 1 163 232 349 tokens, 931 861 092 words
  • prim-10.0-public-prf – subcorpus of scientific, professional and non-fiction texts, 189 007 940 tokens, 153 180 224 words
  • prim-10.0-public-img – subcorpus of fiction texts, 282 950 554 tokens, 226 154 881 words
  • prim-10.0-public-sk – subcorpus of original Slovak texts (79.82 % journalistic, 7.09 % fiction, 12.06 % professional, 1.03 % other texts), 1 361 493 241 tokens, 1 093 242 491 words
  • prim-10.0-public-img-sk – subcorpus of original Slovak fiction texts, 96 575 573 tokens, 77 595 977 words
  • r1955az1989-7.0 – specific corpus of texts from the years 1955–1989 (3.99 % journalistic, 81.15 % fiction, 11.10 % professional, 3.76 % other texts), 108 567 651 tokens, 87 398 831 words
The corpus is provided with detailed bibliographic, style and genre annotation. It is lemmatized and morphologically annotated by the MorphoDiTa tagger, which was trained at the Slovak National Corpus.
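
The token and word counts in the list can be cross-checked with a few lines of arithmetic. The following Python sketch is only an illustration: the helper names `parse_count` and `share` are hypothetical and not part of any SNC tool, and the comparison of the genre subcorpora with prim-10.0-public-sane is an observation drawn from the published figures, not a documented property of the corpus.

```python
def parse_count(text: str) -> int:
    """Parse a count written with spaces as thousands separators."""
    return int("".join(text.split()))


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole, rounded to two decimals."""
    return round(100 * part / whole, 2)


# Token counts copied from the list above (keys are shortened subcorpus names).
TOKENS = {
    "all": parse_count("1 688 211 881"),
    "sane": parse_count("1 649 561 653"),
    "inf": parse_count("1 163 232 349"),
    "prf": parse_count("189 007 940"),
    "img": parse_count("282 950 554"),
}
# Word count of prim-10.0-public-all, also copied from the list.
WORDS_ALL = parse_count("1 355 262 962")

if __name__ == "__main__":
    print("words per 100 tokens (public-all):", share(WORDS_ALL, TOKENS["all"]))
    # Assumption: the genre subcorpora are compared against public-sane.
    for genre in ("inf", "img", "prf"):
        print(genre, "share of public-sane tokens:",
              share(TOKENS[genre], TOKENS["sane"]))
```

On the figures above, the three genre subcorpora come out at 70.52, 17.15 and 11.46 percent of the prim-10.0-public-sane tokens, which agrees with the percentages listed for that subcorpus.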
 

Frequency statistics of the corpora

Statistical information on earlier versions is available here: prim-9.0, prim-8.0, prim-7.0, prim-6.1, prim-6.0, prim-5.0, prim-4.0, prim-3.0 and prim-2.1.