Structure of the corpus prim-10.0

Main corpus of written texts

The version prim-10.0 comprises the following publicly available subcorpora:

prim-10.0-public-all – all publicly available SNC texts (71.00 % journalistic, 16.82 % fiction, 11.28 % professional, 0.90 % other texts), 1 688 211 881 tokens, 1 355 262 962 words
prim-10.0-public-sane – excluding texts with incorrect diacritics and texts from outside the territory of Slovakia (70.52 % journalistic, 17.15 % fiction, 11.46 % professional, 0.87 % other texts), 1 649 561 653 tokens, 1 323 046 192 words
prim-10.0-public-vyv – balanced subcorpus (33.33 % journalistic, 33.33 % fiction, 33.33 % professional texts), 571 526 056 tokens, 459 358 995 words
prim-10.0-public-inf – subcorpus of journalistic (informational) texts, 1 163 232 349 tokens, 931 861 092 words
prim-10.0-public-prf – subcorpus of scientific, professional and non-fiction texts, 189 007 940 tokens, 153 180 224 words
prim-10.0-public-img – subcorpus of fiction texts, 282 950 554 tokens, 226 154 881 words
prim-10.0-public-sk – subcorpus of original Slovak texts (79.82 % journalistic, 7.09 % fiction, 12.06 % professional, 1.03 % other texts), 1 361 493 241 tokens, 1 093 242 491 words
prim-10.0-public-img-sk – subcorpus of original Slovak fiction texts, 96 575 573 tokens, 77 595 977 words
r1955az1989-7.0 – specific corpus of texts from the years 1955 – 1989 (3.99 % journalistic, 81.15 % fiction, 11.10 % professional, 3.76 % other texts), 108 567 651 tokens, 87 398 831 words
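
The page does not state how the genre percentages are calculated. One plausible reading is that they are shares of tokens, and that the genre subcorpora (inf, img, prf) are drawn from prim-10.0-public-sane. The following Python sketch checks that assumption against the token counts listed above; the names TOKENS and share are illustrative only and are not part of any SNC tool.

```python
# Token counts copied from the subcorpus list above.
TOKENS = {
    "prim-10.0-public-sane": 1_649_561_653,
    "prim-10.0-public-inf": 1_163_232_349,  # journalistic
    "prim-10.0-public-img": 282_950_554,    # fiction
    "prim-10.0-public-prf": 189_007_940,    # professional
}

BASE = "prim-10.0-public-sane"  # assumed parent corpus (not stated on the page)


def share(part: str, whole: str = BASE) -> float:
    """Return the percentage of tokens of `whole` contained in `part`."""
    return round(100 * TOKENS[part] / TOKENS[whole], 2)


# Compute the token share of each genre subcorpus relative to the base.
shares = {name: share(name) for name in TOKENS if name != BASE}

for name, pct in shares.items():
    print(f"{name}: {pct:.2f} %")
```

Under this assumption the computed shares agree with the 70.52 %, 17.15 % and 11.46 % given for prim-10.0-public-sane, which suggests, but does not prove, that the percentages are token-based.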

The corpus is provided with detailed bibliographical, style and genre annotation. It is lemmatized and morphologically annotated by the MorphoDiTa tagger, which was trained at the Slovak National Corpus.

Frequency statistics of the corpora

Statistical information on earlier versions is available here: prim-9.0, prim-8.0, prim-7.0, prim-6.1, prim-6.0, prim-5.0, prim-4.0, prim-3.0 and prim-2.1.
