Structure of the corpus prim-8.0

MAIN CORPUS OF WRITTEN TEXTS

Version prim-8.0 of the Slovak National Corpus (SNC) comprises the following publicly available subcorpora:

prim-8.0-public-all – all publicly available SNC texts (71.10 % journalistic, 15.22 % fiction, 8.51 % professional and 5.17 % other texts), 1 477 447 216 tokens, 1 160 286 731 words
prim-8.0-public-sane – all public texts except those with incorrect diacritics, those published before 1955, those from outside the territory of Slovakia, and those from linguistic journals (73.75 % journalistic, 16.33 % fiction, 8.91 % professional, 1.01 % other texts), 1 368 990 447 tokens, 1 076 309 519 words
prim-8.0-public-vyv – balanced subcorpus (33.33 % journalistic, 33.33 % fiction, 33.33 % professional texts), 377 138 077 tokens, 297 524 160 words
prim-8.0-public-inf – subcorpus of journalistic (informational) texts, 1 009 613 215 tokens, 791 376 893 words
prim-8.0-public-prf – subcorpus of scientific, professional and non-fiction texts, 121 926 591 tokens, 96 084 340 words
prim-8.0-public-img – subcorpus of fiction texts, 223 552 510 tokens, 177 545 076 words
prim-8.0-public-sk – subcorpus of original Slovak texts (81.24 % journalistic, 7.91 % fiction, 9.53 % professional, 1.32 % other texts), 1 042 623 207 tokens, 821 878 724 words
prim-8.0-public-img-sk – subcorpus of original Slovak fiction texts, 82 503 983 tokens, 65 627 003 words
r1955az1989-4.0 – specific corpus of texts from the years 1955–1989 (5.11 % journalistic, 75.73 % fiction, 13.82 % professional, 5.34 % other texts), 83 631 422 tokens, 66 825 217 words
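
The token counts listed above can be cross-checked against the genre percentages. The following is a minimal Python sketch, not part of the SNC tooling: the helper names `parse_count` and `share_percent` are hypothetical, and the idea that the percentages are token-based shares is an assumption inferred from the published figures rather than something the corpus documentation states.

```python
def parse_count(text: str) -> int:
    """Turn a space-grouped figure such as '1 368 990 447' into an int.

    str.split() with no argument also splits on non-breaking spaces,
    which are common in figures copied from web pages.
    """
    return int("".join(text.split()))


# Token counts copied from the subcorpus list above.
TOKENS = {
    "prim-8.0-public-sane": parse_count("1 368 990 447"),
    "prim-8.0-public-inf": parse_count("1 009 613 215"),
    "prim-8.0-public-img": parse_count("223 552 510"),
    "prim-8.0-public-prf": parse_count("121 926 591"),
}


def share_percent(part: str, whole: str) -> float:
    """Share of one subcorpus in another, in percent, rounded to 2 places."""
    return round(100 * TOKENS[part] / TOKENS[whole], 2)


if __name__ == "__main__":
    # Assumption being tested: genre percentages are shares of tokens.
    for name in ("prim-8.0-public-inf", "prim-8.0-public-img", "prim-8.0-public-prf"):
        print(name, share_percent(name, "prim-8.0-public-sane"))
```

Under that assumption, the shares of the journalistic, fiction and professional subcorpora come out as the percentages listed for prim-8.0-public-sane, which suggests that these three genre subcorpora are drawn from it.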

The corpus carries detailed bibliographic, style and genre annotation. It is lemmatized and morphologically annotated with the MorphoDiTa tagger, which was trained at the Slovak National Corpus.

Frequency statistics of the corpora

Statistical information on earlier versions is available for prim-7.0, prim-6.1, prim-6.0, prim-5.0, prim-4.0, prim-3.0 and prim-2.1.
