Structure of the corpus prim-7.0

Main corpus of written texts

Version prim-7.0 of the Slovak National Corpus (SNC) comprises the following publicly available subcorpora:

  • prim-7.0-public-all – all publicly available SNC texts (65.1 % journalistic, 15.1 % fiction, 9.5 % professional and 10.3 % other texts), 1 250 382 876 tokens, 971 799 239 words
  • prim-7.0-public-sane – excludes texts with incorrect diacritics, texts from before 1955,
    texts from outside the territory of Slovakia, and texts from linguistic journals, 1 089 102 930 tokens, 848 547 025 words
  • prim-7.0-public-vyv – balanced subcorpus (33.3 % journalistic, 33.3 % fiction, 33.3 %
    professional texts), 340 708 046 tokens, 266 732 524 words
  • prim-7.0-public-inf – subcorpus of journalistic (informational) texts, 771 248 707 tokens, 597 141 681 words
  • prim-7.0-public-prf – subcorpus of scientific, professional and non-fiction texts, 114 081 861 tokens, 89 152 482 words
  • prim-7.0-public-img – subcorpus of fiction texts, 187 749 798 tokens, 149 220 076 words
  • prim-7.0-public-sk – subcorpus of original Slovak texts, 806 707 046 tokens, 629 681 531 words
  • prim-7.0-public-img-sk – subcorpus of original Slovak fiction texts, 65 009 205 tokens, 51 839 437 words
  • r1955az1989-4.0 – specific corpus of texts from the years 1955–1989 (7.4 % journalistic,
    69.3 % fiction, 16.6 % professional and 6.7 % other texts), 67 392 068 tokens, 53 998 092 words.
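
The token and word counts above can be compared with a little arithmetic. The following is a minimal sketch, not part of the SNC tooling: it copies the figures from the list and computes the share of words among tokens for each subcorpus. The helper name word_share is hypothetical, and the sketch assumes that "words" are the subset of tokens left after punctuation and similar non-word tokens are excluded.

```python
# (tokens, words) for each subcorpus, copied from the list above.
COUNTS = {
    "prim-7.0-public-all": (1_250_382_876, 971_799_239),
    "prim-7.0-public-sane": (1_089_102_930, 848_547_025),
    "prim-7.0-public-vyv": (340_708_046, 266_732_524),
    "prim-7.0-public-inf": (771_248_707, 597_141_681),
    "prim-7.0-public-prf": (114_081_861, 89_152_482),
    "prim-7.0-public-img": (187_749_798, 149_220_076),
    "prim-7.0-public-sk": (806_707_046, 629_681_531),
    "prim-7.0-public-img-sk": (65_009_205, 51_839_437),
    "r1955az1989-4.0": (67_392_068, 53_998_092),
}


def word_share(name):
    """Hypothetical helper: words divided by tokens for one subcorpus.

    Assumes words are a subset of tokens (an assumption, not stated above).
    """
    tokens, words = COUNTS[name]
    return words / tokens


# Print one line per subcorpus with its word share, three decimal places.
for name in COUNTS:
    print(f"{name:24s} {word_share(name):.3f}")
```

A ratio well below 1 for every subcorpus would be consistent with the assumption that tokens include punctuation while words do not.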

The corpus is provided with detailed bibliographical, style and genre annotation. It is also lemmatized and morphologically annotated.

Statistical information on earlier versions is available here: prim-6.1, prim-6.0, prim-5.0, prim-4.0, prim-3.0 and prim-2.1.