MAIN CORPUS OF WRITTEN TEXTS
Version prim-8.0 of the SNC comprises the following publicly available subcorpora:
- prim-8.0-public-all – all publicly available SNC texts (71.10 % journalistic, 15.22 % fiction, 8.51 % professional and 5.17 % other texts), 1 477 447 216 tokens, 1 160 286 731 words
- prim-8.0-public-sane – excludes texts with incorrect diacritics, texts published before 1955, texts from outside the territory of Slovakia, and texts from linguistic journals (73.75 % journalistic, 16.33 % fiction, 8.91 % professional, 1.01 % other texts), 1 368 990 447 tokens, 1 076 309 519 words
- prim-8.0-public-vyv – balanced subcorpus (33.33 % journalistic, 33.33 % fiction, 33.33 % professional texts), 377 138 077 tokens, 297 524 160 words
- prim-8.0-public-inf – subcorpus of journalistic (informational) texts, 1 009 613 215 tokens, 791 376 893 words
- prim-8.0-public-prf – subcorpus of scientific, professional and non-fiction texts, 121 926 591 tokens, 96 084 340 words
- prim-8.0-public-img – subcorpus of fiction texts, 223 552 510 tokens, 177 545 076 words
- prim-8.0-public-sk – subcorpus of original Slovak texts (81.24 % journalistic, 7.91 % fiction, 9.53 % professional, 1.32 % other texts), 1 042 623 207 tokens, 821 878 724 words
- prim-8.0-public-img-sk – subcorpus of original Slovak fiction texts, 82 503 983 tokens, 65 627 003 words
- r1955az1989-4.0 – specific corpus of texts from the years 1955–1989 (5.11 % journalistic, 75.73 % fiction, 13.82 % professional, 5.34 % other texts), 83 631 422 tokens, 66 825 217 words.
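Each entry above gives both a token count and a word count. As a rough illustration of how the two figures relate, the following minimal Python sketch computes the share of tokens that are counted as words in each subcorpus. The counts are copied from the list; the dictionary name `SUBCORPORA` and the helper `word_share` are hypothetical names chosen for this example, and the reading of the remaining tokens as non-word tokens such as punctuation is an assumption, not something stated here.

```python
# Token and word counts copied from the list of subcorpora above.
# SUBCORPORA and word_share are illustrative names, not part of any SNC tool.
SUBCORPORA = {
    "prim-8.0-public-all": (1_477_447_216, 1_160_286_731),
    "prim-8.0-public-sane": (1_368_990_447, 1_076_309_519),
    "prim-8.0-public-vyv": (377_138_077, 297_524_160),
    "prim-8.0-public-inf": (1_009_613_215, 791_376_893),
    "prim-8.0-public-prf": (121_926_591, 96_084_340),
    "prim-8.0-public-img": (223_552_510, 177_545_076),
    "prim-8.0-public-sk": (1_042_623_207, 821_878_724),
    "prim-8.0-public-img-sk": (82_503_983, 65_627_003),
    "r1955az1989-4.0": (83_631_422, 66_825_217),
}


def word_share(tokens: int, words: int) -> float:
    """Return the percentage of tokens that are counted as words."""
    return 100.0 * words / tokens


# Print one line per subcorpus, e.g. for a quick comparison across subcorpora.
for name, (tokens, words) in SUBCORPORA.items():
    share = word_share(tokens, words)
    print(f"{name}: {share:.2f} % of tokens are words")
```

The same dictionary could be reused to compare subcorpus sizes, for example by dividing each token count by that of prim-8.0-public-all.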
The corpus is provided with detailed bibliographic, style and genre annotation. It is lemmatized and morphologically annotated by the MorphoDiTa tagger, which was trained at the Slovak National Corpus.
Frequency statistics of the corpora
Statistical information on earlier versions is available here: prim-7.0, prim-6.1, prim-6.0, prim-5.0, prim-4.0, prim-3.0 and prim-2.1.