Main corpus of written texts
Version prim-10.0 comprises the following publicly available subcorpora:
- prim-10.0-public-all – all publicly available SNC texts (71.00 % journalistic, 16.82 % fiction, 11.28 % professional, 0.90 % other texts), 1 688 211 881 tokens, 1 355 262 962 words
- prim-10.0-public-sane – excluding texts with incorrect diacritics and texts from outside the territory of Slovakia (70.52 % journalistic, 17.15 % fiction, 11.46 % professional, 0.87 % other texts), 1 649 561 653 tokens, 1 323 046 192 words
- prim-10.0-public-vyv – balanced subcorpus (33.33 % journalistic, 33.33 % fiction, 33.33 % professional texts), 571 526 056 tokens, 459 358 995 words
- prim-10.0-public-inf – subcorpus of journalistic (informational) texts, 1 163 232 349 tokens, 931 861 092 words
- prim-10.0-public-prf – subcorpus of scientific, professional and non-fiction texts, 189 007 940 tokens, 153 180 224 words
- prim-10.0-public-img – subcorpus of fiction texts, 282 950 554 tokens, 226 154 881 words
- prim-10.0-public-sk – subcorpus of original Slovak texts (79.82 % journalistic, 7.09 % fiction, 12.06 % professional, 1.03 % other texts), 1 361 493 241 tokens, 1 093 242 491 words
- prim-10.0-public-img-sk – subcorpus of original Slovak fiction texts, 96 575 573 tokens, 77 595 977 words
- r1955az1989-7.0 – specific corpus of texts from the years 1955–1989 (3.99 % journalistic, 81.15 % fiction, 11.10 % professional, 3.76 % other texts), 108 567 651 tokens, 87 398 831 words
The corpus is provided with detailed bibliographic, style and genre annotation. It is lemmatized and morphologically annotated by the MorphoDiTa tagger, which was trained at the Slovak National Corpus.
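
The figures in the subcorpus list can be cross-checked with a short script. The following is a minimal Python sketch, not an SNC tool: it copies the published numbers for three of the subcorpora and checks that the genre shares add up to 100 % and how many words there are per token. The names `SUBCORPORA`, `share_total` and `words_per_token` are hypothetical and chosen only for this illustration.

```python
# Figures copied from the subcorpus list; all names here are illustrative only.
SUBCORPORA = {
    "prim-10.0-public-all": {
        "shares": {"journalistic": 71.00, "fiction": 16.82,
                   "professional": 11.28, "other": 0.90},
        "tokens": 1_688_211_881,
        "words": 1_355_262_962,
    },
    "prim-10.0-public-sk": {
        "shares": {"journalistic": 79.82, "fiction": 7.09,
                   "professional": 12.06, "other": 1.03},
        "tokens": 1_361_493_241,
        "words": 1_093_242_491,
    },
    "r1955az1989-7.0": {
        "shares": {"journalistic": 3.99, "fiction": 81.15,
                   "professional": 11.10, "other": 3.76},
        "tokens": 108_567_651,
        "words": 87_398_831,
    },
}


def share_total(name: str) -> float:
    """Sum of the genre percentages listed for one subcorpus."""
    return sum(SUBCORPORA[name]["shares"].values())


def words_per_token(name: str) -> float:
    """Ratio of words to tokens (tokens also include punctuation etc.)."""
    entry = SUBCORPORA[name]
    return entry["words"] / entry["tokens"]


if __name__ == "__main__":
    for name in SUBCORPORA:
        print(f"{name}: shares sum to {share_total(name):.2f} %, "
              f"{words_per_token(name):.4f} words per token")
```

A share total that deviates noticeably from 100 % would point to a transcription error in the list rather than to a property of the corpus.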