Contents
- 1. Written corpora − synchronous, general
- 2. Written corpora − synchronous, web
- 3. Written corpora − synchronous, merged
- 4. Written corpora − parallel
- 5. Written corpora − synchronous, acquisitional
- 6. Written corpora − synchronous, specialised
- 7. Written corpora of texts before the year 1955 (mainly texts of books from the SME Golden Fund)
- 8. Spoken corpora − synchronous, standard
- 9. Corpora of dialects of the SNC
- 10. Historical corpus
1. Written corpora − synchronous, general
version of the main corpus and subcorpora | size (number of tokens / number of words) | lemmatisation, morphological annotation | year of release | characteristics |
1 961 million tokens / 1 572 million words | yes | – internal corpus | main (monolingual) corpus comprised of all texts published or written after the year 1955 | |
1 688 million tokens / 1 355 million words | yes | 2022 | main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search (71.0 % journalistic, 16.8 % fiction, 11.3 % professional, 0.9 % other texts) | |
1 921 million tokens / 1 540 million words | yes | – | main (monolingual) corpus excluding texts with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc. | |
1 650 million tokens / 1 323 million words | yes | 2022 | main (monolingual) corpus of texts under the license on on-line search, excluding texts: with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc. − the corpus is further divided into subcorpora | |
572 million tokens / 459 million words | yes | 2022 | subcorpus balanced with regard to style (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts) | |
1 163 million tokens / 932 million words | yes | 2022 | subcorpus of journalistic (informational) texts | |
189 million tokens / 153 million words | yes | 2022 | subcorpus of scientific, professional and popular science texts | |
283 million tokens / 226 million words | yes | 2022 | subcorpus of fiction texts | |
1 361 million tokens / 1 093 million words | yes | 2022 | subcorpus of original texts written in Slovak | |
97 million tokens / 78 million words | yes | 2022 | subcorpus of original fiction texts written in Slovak | |
109 million tokens / 87 million words | yes | 2022 | subcorpus of texts from years 1955–1989 (4.0 % journalistic, 81.2 % fiction, 11.1 % professional, 3.7 % other texts) | |
1 870 million tokens / 1 455 million words | yes | – | main (monolingual) corpus comprised of all texts published or written after the year 1955 | |
1 652 million tokens / 1 282 million words | yes | 2020 | main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search (74.0 % journalistic, 16.0 % fiction, 9.2 % professional, 0.9 % other texts) | |
1 838 million tokens / 1 429 million words | yes | – | main (monolingual) corpus excluding texts with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc. | |
1 621 million tokens / 1 257 million words | yes | 2020 | main (monolingual) corpus of texts under the license on on-line search, excluding texts: with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc. − the corpus is further divided into subcorpora | |
454 million tokens / 355 million words | yes | 2020 | subcorpus balanced with regard to style (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts) | |
1 194 million tokens / 920 million words | yes | 2020 | subcorpus of journalistic (informational) texts | |
150 million tokens / 117 million words | yes | 2020 | subcorpus of scientific, professional and popular science texts | |
263 million tokens / 208 million words | yes | 2020 | subcorpus of fiction texts | |
1 258 million tokens / 977 million words | yes | 2020 | subcorpus of original texts written in Slovak | |
93 million tokens / 74 million words | yes | 2020 | subcorpus of original fiction texts written in Slovak | |
99 million tokens / 79 million words | yes | 2020 | subcorpus of texts from years 1955–1989 (4.5 % journalistic, 78.6 % fiction, 12.4 % professional, 4.4 % other texts) | |
1 647 million tokens / 1 295 million words | yes | – | main (monolingual) corpus comprised of all texts published or written after the year 1955 | |
1 477 million tokens / 1 160 million words | yes | 2018 | main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search (71.1 % journalistic, 15.4 % fiction, 8.5 % professional, 5.0 % other texts) | |
1 518 million tokens / 1 195 million words | yes | – | main (monolingual) corpus excluding texts with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc. | |
1 369 million tokens / 1 076 million words | yes | 2018 | main (monolingual) corpus of texts under the license on on-line search, excluding texts: with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc. − the corpus is further divided into subcorpora | |
377 million tokens / 298 million words | yes | 2018 | subcorpus balanced with regard to style (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts) | |
1 010 million tokens / 791 million words | yes | 2018 | subcorpus of journalistic (informational) texts | |
122 million tokens / 96 million words | yes | 2018 | subcorpus of scientific, professional and popular science texts | |
224 million tokens / 178 million words | yes | 2018 | subcorpus of fiction texts | |
1 043 million tokens / 822 million words | yes | 2018 | subcorpus of original texts written in Slovak | |
83 million tokens / 66 million words | yes | 2018 | subcorpus of original fiction texts written in Slovak | |
84 million tokens / 67 million words | yes | 2018 | subcorpus of texts from years 1955–1989 (5.3 % journalistic, 75.3 % fiction, 14.0 % professional, 5.4 % other texts) | |
1 437 million tokens / 1 119 million words | yes | – | main (monolingual) corpus comprised of all texts published or written after the year 1955 | |
1 250 million tokens / 972 million words | yes | 2015 | main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search (65.1 % journalistic, 15.1 % fiction, 9.5 % professional, 10.3 % other texts) | |
1 202 million tokens / 938 million words | yes | – | main (monolingual) corpus excluding texts with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc. | |
1 089 million tokens / 849 million words | yes | 2015 | main (monolingual) corpus of texts under the license on on-line search, excluding texts: with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc. − the corpus is further divided into subcorpora | |
341 million tokens / 267 million words | yes | 2015 | subcorpus balanced with regard to style (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts) | |
771 million tokens / 597 million words | yes | 2015 | subcorpus of journalistic (informational) texts | |
114 million tokens / 89 million words | yes | 2015 | subcorpus of scientific, professional and popular science texts | |
188 million tokens / 149 million words | yes | 2015 | subcorpus of fiction texts | |
807 million tokens / 630 million words | yes | 2015 | subcorpus of original texts written in Slovak | |
65 million tokens / 52 million words | yes | 2015 | subcorpus of original fiction texts written in Slovak | |
67 million tokens / 54 million words | yes | 2015 | subcorpus of texts from years 1955–1989 (7.4 % journalistic, 69.3 % fiction, 16.6 % professional, 6.7 % other texts) | |
830 million tokens / 656 million words | yes | 2013 | main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search (68.8 % journalistic, 13.9 % fiction, 15.3 % professional, 2 % other texts) | |
63 million tokens / 51 million words | yes | 2013 | subcorpus of texts from years 1955–1989 (11.9 % journalistic, 55.5 % fiction, 24.1 % professional, 8.5 % other texts) | |
1 155 million tokens / 939 million words | yes | 2013 | main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search (77.8 % journalistic, 9.8 % fiction, 11 % professional, 1.4 % other texts) | |
719 million tokens / 599 million words | yes | 2011 | main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search (73 % journalistic, 14 % fiction, 12 % professional, 1 % other texts) | |
44 million tokens / 35 million words | yes | 2011 | subcorpus of texts from years 1955–1989 | |
526 million tokens / 429 million words | yes | 2009 | main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search (65 % journalistic, 17 % fiction, 16 % professional, 2 % other texts) | |
40 million tokens / 32 million words | yes | 2009 | subcorpus of texts from years 1955–1989 | |
339 million tokens / 276 million words | yes | 2007 | main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search (57 % journalistic, 21.5 % fiction, 18.5 % professional, 3 % other texts) | |
294 million tokens / 229 million words | yes | 2006 | main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search (63 % journalistic, 20 % fiction, 12 % professional, 5 % other texts) | |
prim-2.0-public-all | 250 million tokens | pilot | 2005 | main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search |
prim-1.0-public-all | 182 million tokens | test | 2004 | main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search |
prim-0.2-public-all | 170 million tokens | no | 2003 | main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search |
prim-0.1-public-all | 30 million tokens | no | 2003 | main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search |
2. Written corpora − synchronous, web
corpus | size (number of tokens / number of words) | lemmatisation, morphological annotation | year of release | characteristics |
4 373 million tokens / 3 639 million words | yes | 2022 | corpus of Slovak texts available on the web | |
4 042 million tokens / 3 326 million words | yes | 2020 | corpus of Slovak texts available on the web | |
2 963 million tokens / 2 440 million words | yes | 2018 | corpus of Slovak texts available on the web | |
2 372 million tokens / 1 993 million words | yes | 2015 | corpus of Slovak texts available on the web | |
1 046 million tokens / 839 million words | yes | 2012 | corpus of Slovak texts available on the web | |
952 million tokens / 773 million words | yes | 2011 | corpus of Slovak texts available on the web | |
51 million tokens / 38 million words | yes | 2020 | corpus of texts from Slovak Wikipédia (as of 2019-08-01) | |
47 million tokens / 35 million words | yes | 2018 | corpus of texts from Slovak Wikipédia and Necyklopédia (as of 2018-03-15) | |
45 million tokens / 34 million words | yes | 2017 | corpus of texts from Slovak Wikipédia and Necyklopédia (as of 2017-02-28) | |
43 million tokens / 34 million words | yes | 2016 | corpus of texts from Slovak Wikipédia and Necyklopédia (as of 2016-02-26) | |
40 million tokens / 32 million words | yes | 2015 | corpus of texts from Slovak Wikipédia and Necyklopédia (as of 2015-02-28) |
3. Written corpora − synchronous, merged
corpus | size (number of tokens / number of words) | lemmatisation, morphological annotation | year of release | characteristics |
2 239 million tokens | yes | 2013 | Corpus omnia-2.0-public: a corpus merged from prim-6.0-public-all, s-hovor-4.0, legal-1.1, web-1.1 and web-1.2 after removing duplicate texts or duplicate parts of texts (deduplication), with minor modifications in tokenization (hyphenated words are treated as one token) and in lemmatization (negated forms are assigned to the affirmative lemma). It was prepared from SNC sources by V. Benko, primarily for the needs of the staff of the Department of Contemporary Lexicology and Lexicography of the Štúr Institute of Linguistics of the Slovak Academy of Sciences. |
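The two preprocessing ideas named in the omnia description, hyphenated words counted as one token and removal of duplicate texts, can be illustrated with a minimal Python sketch. It is not the SNC pipeline: the regular expression, the function names `tokenize` and `deduplicate`, and the choice to compare whole texts after whitespace and case normalisation are all assumptions made for illustration. Removal of duplicate parts of texts and the lemmatization of negated forms are not covered.

```python
import hashlib
import re

# Assumed token pattern: a run of word characters, optionally joined by
# internal hyphens, is one token; any other non-space character is a
# token of its own.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into tokens, keeping hyphenated words together."""
    return TOKEN_RE.findall(text)


def deduplicate(texts):
    """Keep the first occurrence of each text.

    Two texts count as duplicates when they are equal after lower-casing
    and collapsing whitespace (an assumed, deliberately simple criterion).
    """
    seen = set()
    unique = []
    for text in texts:
        normalised = " ".join(text.lower().split())
        key = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

With this pattern, `tokenize("slovensko-anglický korpus")` yields two tokens, whereas a tokenizer that splits on hyphens would yield four. Differences of this kind are one reason token counts are not directly comparable between the merged corpus and its source corpora.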
4. Written corpora − parallel
corpus | size (number of tokens / number of words) | lemmatisation, morphological annotation | year of release (first version released in) | characteristics |
556 million tokens / 436 million words | yes, | 2015 | Slovak-English parallel corpus: 261 million tokens in Slovak half, 295 million tokens in English half | |
200 million tokens / 160 million words | yes, | 2015 | Slovak-English parallel corpus, fiction subcorpus: 92 million tokens in Slovak half, 108 million tokens in English half | |
163 million tokens / 108 million words | yes, both languages | 2014 | Slovak-Bulgarian parallel corpus: 78 million tokens in Slovak half, 85 million tokens in Bulgarian half | |
418 million tokens / 306 million words | yes, | 2016 | Slovak-Czech parallel corpus: 209 million tokens in Slovak half, 209 million tokens in Czech half | |
31.5 million tokens / 25.0 million words | yes, | 2018 | Slovak-Czech parallel corpus, fiction subcorpus: 15.7 million tokens in Slovak half, 15.8 million tokens in Czech half | |
449 million tokens / 332 million words | yes, | 2016 | Slovak-French parallel corpus: 217 million tokens in Slovak half, 233 million tokens in French half | |
9.9 million tokens / 8.3 million words | yes, | 2016 | Slovak-French parallel corpus, fiction subcorpus: 4.3 million tokens in Slovak half, 5.5 million tokens in French half | |
5.0 million tokens / 4.1 million words | yes, | 2018 | Slovak-Latin parallel corpus: 2.7 million tokens in Slovak half, 2.3 million tokens in Latin half | |
99 million tokens / 75 million words | yes, | 2015 | Slovak-Hungarian parallel corpus: 51 million tokens in Slovak half, 48 million tokens in Hungarian half | |
4.0 million tokens / 3.2 million words | yes, | 2015 | Slovak-Hungarian parallel corpus, fiction subcorpus: 2.0 million tokens in Slovak half, 1.9 million tokens in Hungarian half | |
468 million tokens / 318 million words | yes, | 2022 (2014) | Slovak-German parallel corpus: 230 million tokens in Slovak half, 238 million tokens in German half | |
29.7 million tokens / 24.1 million words | yes, | 2022 | Slovak-German parallel corpus, fiction subcorpus: 13.7 million tokens in Slovak half, 16.0 million tokens in German half | |
446 million tokens / 300 million words | yes, both languages | 2016 (2014) | Slovak-German parallel corpus: 220 million tokens in Slovak half, 226 million tokens in German half | |
7.6 million tokens / 6.2 million words | yes, | 2016 | Slovak-German parallel corpus, fiction subcorpus: 3.5 million tokens in Slovak half, 4.1 million tokens in German half | |
8.2 million tokens / 6.5 million words | yes, both languages | 2018 (2018) | Slovak-Polish parallel corpus: 4.1 million tokens in Slovak half, 4.1 million tokens in Polish half | |
1.3 million tokens / 1.0 million words | yes, | 2017 | Slovak-Romanian parallel corpus: 603 111 tokens in Slovak half, 688 867 tokens in Romanian half | |
8.5 million tokens / 6.6 million words | yes, | 2014 | Slovak-Russian parallel corpus: 4.2 million tokens in Slovak half, 4.2 million tokens in Russian half | |
35.6 million tokens / 29.4 million words | yes, | 2022 | Slovak-Spanish parallel corpus: 16.7 million tokens in Slovak half, 18.9 million tokens in Spanish half | |
11.5 million tokens / 9.6 million words | yes, | 2019 | Slovak-Spanish parallel corpus: 5.5 million tokens in Slovak half, 6.0 million tokens in Spanish half |
5. Written corpora − synchronous, acquisitional
corpus | size (number of tokens / number of words) | lemmatisation, morphological annotation | year of release | characteristics |
137 393 tokens / 112 271 words | yes | 2022 | corpus of written texts of students learning Slovak as a foreign language |
6. Written corpora − synchronous, specialised
corpus | size (number of tokens / number of words) | lemmatisation, morphological annotation | year of release | characteristics |
66 million tokens / 54 million words | yes | 2014 | corpus of religious texts | |
15 million tokens / 12 million words | yes | 2008 | corpus of religious texts | |
1.6 million tokens / 1.2 million words | yes | 2014 | corpus of copywriting texts | |
165 million tokens / 140 million words | yes | 2016 | corpus of economic texts (3.8 % professional and 96.2 % journalistic texts from the field of economics, banking, trade, management and merchandising) | |
20 million tokens / 17 million words | yes | 2014 | corpus of economic texts (81.4 % professional and 18.6 % journalistic texts from the field of economics, banking, trade, management and merchandising) | |
11.7 million tokens / 9.6 million words | yes | 2019 | corpus of texts of state service | |
39 million tokens / 30 million words | yes | 2016 | corpus of humanistic texts | |
1.5 million tokens / 1.3 million words | yes | 2015 | corpus of judicial decisions | |
49 million tokens / 40 million words | yes | 2013 | corpus of legal texts (deduplicated) | |
147 million tokens / 114 million words | yes | 2011 | corpus of legal texts | |
4 149 million tokens | yes | 2019 | corpus of texts of judgments (corpus from project OpenData) | |
253 million tokens / 203 million words | yes | 2018 | The reference corpus prim-7.0-frk was the source for “Frekvenčný slovník slovenčiny na báze Slovenského národného korpusu” (Slovak Frequency Dictionary Based on the Slovak National Corpus), as well as for the examples listed in the publication “Skloňovanie podstatných mien v slovenčine s korpusovými príkladmi” (Declension of the Slovak Nouns with Corpus Examples). | |
1 199 794 tokens / 977 871 words | yes | 2017 | manually morphologically annotated corpus (30.6 % journalistic, 50.2 % fiction, 19.2 % professional texts) | |
1 200 088 tokens / 977 871 words | yes | 2016 | manually morphologically annotated corpus (28.5 % journalistic, 44.5 % fiction, 27 % professional texts) | |
1 199 224 tokens / 976 877 words | yes | 2013 | manually morphologically annotated corpus (36.2 % journalistic, 44.9 % fiction, 18.9 % professional texts) | |
1 207 813 tokens / 983 714 words | yes | 2008 | manually morphologically annotated corpus (36.7 % journalistic, 44.3 % fiction, 19.0 % professional texts) | |
511 432 tokens / 410 177 words | yes | 2007 | manually morphologically annotated corpus (28.9 % journalistic, 58.1 % fiction, 13.0 % professional texts) | |
322 498 tokens / 256 647 words | yes | 2006 | manually morphologically annotated corpus (41.8 % journalistic, 57.9 % fiction, 0.2 % professional texts) |
7. Written corpora of texts before the year 1955 (mainly texts of books from the SME Golden Fund)
corpus | size (number of tokens / number of words) | lemmatisation, morphological annotation | year of release | characteristics |
2.1 million tokens / 1.6 million words | no | 2015 | corpus of texts from 864–1843: texts transcribed into contemporary Slovak, orthography as used in the latest edition | |
24 million tokens / 19 million words | no | 2015 | corpus of texts from 1843–1954: texts transcribed into contemporary Slovak, orthography as used in the latest edition |
8. Spoken corpora − synchronous, standard
version of corpus and subcorpus | size (number of tokens / number of words) | lemmatisation, morphological annotation | year of release | characteristics |
7.9 million tokens | yes | 2022 | corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia | |
4.2 million tokens | yes | 2022 | subcorpus of the Corpus of Spoken Slovak: utterances and their transcriptions, excluding the recordings provided by the Nation’s Memory Institute | |
3.6 million tokens | yes | 2022 | subcorpus of the Corpus of Spoken Slovak: utterances and their transcriptions from the Project Oral History within the Nation’s Memory Institute | |
6.6 million tokens / 5.5 million words | yes | 2017 | corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia | |
3.7 million tokens / 3.0 million words | yes | 2017 | subcorpus of the Corpus of Spoken Slovak: utterances and their transcriptions, excluding the recordings provided by the Nation’s Memory Institute | |
2.9 million tokens / 2.4 million words | yes | 2017 | subcorpus of the Corpus of Spoken Slovak: utterances and their transcriptions from the Project Oral History within the Nation’s Memory Institute | |
5.7 million tokens / 4.7 million words | yes | 2015 | corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia | |
3.6 million tokens / 3.0 million words | yes | 2015 | subcorpus of the Corpus of Spoken Slovak: utterances and their transcriptions, excluding the recordings provided by the Nation’s Memory Institute | |
2.1 million tokens / 1.8 million words | yes | 2015 | subcorpus of the Corpus of Spoken Slovak: utterances and their transcriptions from the Project Oral History within the Nation’s Memory Institute | |
2.6 million tokens / 2.2 million words | yes | 2012 | corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia | |
1.6 million tokens / 1.3 million words | yes | 2012 | subcorpus of the Corpus of Spoken Slovak: utterances and their transcriptions, excluding the recordings provided by the Nation’s Memory Institute | |
1.0 million tokens / 0.9 million words | yes | 2012 | subcorpus of the Corpus of Spoken Slovak: utterances and their transcriptions from the Project Oral History within the Nation’s Memory Institute | |
2.1 million tokens / 1.4 million words | yes | 2011 | corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia | |
678 592 tokens / 560 933 words | yes | 2010 | corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia | |
127 714 tokens / 104 458 words | yes | 2008 | corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia |
9. Corpora of dialects of the SNC
corpus | size (number of tokens / number of words) | lemmatisation, morphological annotation | year of release | characteristics |
980 643 tokens / 786 312 words | no | 2022 | corpus of dialects of the Slovak National Corpus: published texts based on dialect audio or transcribed recordings that cover various dialect areas of Slovakia | |
711 766 tokens / 571 352 words | no | 2018 | corpus of dialects of the Slovak National Corpus: published texts based on dialect audio or transcribed recordings that cover various dialect areas of Slovakia | |
494 722 tokens / 403 180 words | no | 2016 | corpus of dialects of the Slovak National Corpus: published texts based on dialect audio or transcribed recordings that cover various dialect areas of Slovakia | |
328 907 tokens / 252 166 words | no | 2015 | corpus of dialects of the Slovak National Corpus: published texts based on dialect audio or transcribed recordings that cover various dialect areas of Slovakia | |
73 855 tokens / 54 598 words | no | 2014 | corpus of dialects of the Slovak National Corpus: published texts based on dialect audio or transcribed recordings that cover various dialect areas of Slovakia |
10. Historical corpus
corpus | size (number of tokens / number of words) | lemmatisation, morphological annotation | year of release | characteristics |
916 743 tokens / 720 492 words | no | 2022 | corpus of historical Slovak: source materials (in original spelling) | |
997 809 tokens / 731 498 words | no | 2020 | corpus of historical Slovak: source materials (in original spelling) | |
917 586 tokens / 668 245 words | no | 2016 | corpus of historical Slovak: source materials (in original spelling) | |
836 393 tokens / 600 410 words | no | 2015 | corpus of historical Slovak: source materials (in original spelling) | |
551 973 tokens / 422 166 words | no | 2014 | corpus of historical Slovak: source materials (in original spelling) | |
370 758 tokens | no | 2012 | corpus of historical Slovak: source materials (in original spelling) |