Publicly available SNC corpora

1. Written corpora − synchronous, general

version of the main corpus and subcorpora	size (number of tokens / number of words)	lemmatisation, morphological annotation	year of release	characteristics
prim-11.0-juls-all	2 137 million tokens / 1 716 million words	yes	– internal corpus	main (monolingual) corpus comprised of all texts published or written after the year 1955
prim-11.0-public-all	1 859 million tokens / 1 494 million words	yes	2025	main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search (69.5 % journalistic, 17.9 % fiction, 11.8 % professional, 0.9 % other texts)
prim-11.0-juls-sane	2 108 million tokens / 1 692 million words	yes	– internal corpus	main (monolingual) corpus excluding texts with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc.
prim-11.0-public-sane	1 831 million tokens / 1 471 million words	yes	2025	main (monolingual) corpus of texts under the license on on-line search, excluding texts: with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc. − the corpus is further divided into subcorpora
prim-11.0-public-vyv	656 million tokens / 528 million words	yes	2025	subcorpus balanced with regard to style (33.4 % journalistic, 33.3 % fiction, 33.3 % professional texts)
prim-11.0-public-inf	1 265 million tokens / 1 016 million words	yes	2025	subcorpus of journalistic (informational) texts
prim-11.0-public-prf	217 million tokens / 176 million words	yes	2025	subcorpus of scientific, professional and popular science texts
prim-11.0-public-img	333 million tokens / 266 million words	yes	2025	subcorpus of fiction texts
prim-11.0-public-sk	1 492 million tokens / 1 200 million words	yes	2025	subcorpus of original texts written in Slovak
prim-11.0-public-img-sk	112 million tokens / 90 million words	yes	2025	subcorpus of original fiction texts written in Slovak
r1955az1989-8.0	118 million tokens / 95 million words	yes	2025	subcorpus of texts from years 1955–1989 (3.7 % journalistic, 82.2 % fiction, 10.2 % professional, 4.0 % other texts)
prim-10.0-juls-all	1 961 million tokens / 1 572 million words	yes	– internal corpus	main (monolingual) corpus comprised of all texts published or written after the year 1955
prim-10.0-public-all	1 688 million tokens / 1 355 million words	yes	2022	main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search (71.0 % journalistic, 16.8 % fiction, 11.3 % professional, 0.9 % other texts)
prim-10.0-juls-sane	1 921 million tokens / 1 540 million words	yes	– internal corpus	main (monolingual) corpus excluding texts with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc.
prim-10.0-public-sane	1 650 million tokens / 1 323 million words	yes	2022	main (monolingual) corpus of texts under the license on on-line search, excluding texts: with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc. − the corpus is further divided into subcorpora
prim-10.0-public-vyv	572 million tokens / 459 million words	yes	2022	subcorpus balanced with regard to style (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts)
prim-10.0-public-inf	1 163 million tokens / 932 million words	yes	2022	subcorpus of journalistic (informational) texts
prim-10.0-public-prf	189 million tokens / 153 million words	yes	2022	subcorpus of scientific, professional and popular science texts
prim-10.0-public-img	283 million tokens / 226 million words	yes	2022	subcorpus of fiction texts
prim-10.0-public-sk	1 361 million tokens / 1 093 million words	yes	2022	subcorpus of original texts written in Slovak
prim-10.0-public-img-sk	97 million tokens / 78 million words	yes	2022	subcorpus of original fiction texts written in Slovak
r1955az1989-7.0	109 million tokens / 87 million words	yes	2022	subcorpus of texts from years 1955–1989 (4.0 % journalistic, 81.2 % fiction, 11.1 % professional, 3.7 % other texts)
prim-9.0-juls-all	1 870 million tokens / 1 455 million words	yes	– internal corpus	main (monolingual) corpus comprised of all texts published or written after the year 1955
prim-9.0-public-all	1 652 million tokens / 1 282 million words	yes	2020	main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search (74.0 % journalistic, 16.0 % fiction, 9.2 % professional, 0.9 % other texts)
prim-9.0-juls-sane	1 838 million tokens / 1 429 million words	yes	– internal corpus	main (monolingual) corpus excluding texts with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc.
prim-9.0-public-sane	1 621 million tokens / 1 257 million words	yes	2020	main (monolingual) corpus of texts under the license on on-line search, excluding texts: with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc. − the corpus is further divided into subcorpora
prim-9.0-public-vyv	454 million tokens / 355 million words	yes	2020	subcorpus balanced with regard to style (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts)
prim-9.0-public-inf	1 194 million tokens / 920 million words	yes	2020	subcorpus of journalistic (informational) texts
prim-9.0-public-prf	150 million tokens / 117 million words	yes	2020	subcorpus of scientific, professional and popular science texts
prim-9.0-public-img	263 million tokens / 208 million words	yes	2020	subcorpus of fiction texts
prim-9.0-public-sk	1 258 million tokens / 977 million words	yes	2020	subcorpus of original texts written in Slovak
prim-9.0-public-img-sk	93 million tokens / 74 million words	yes	2020	subcorpus of original fiction texts written in Slovak
r1955az1989-6.0	99 million tokens / 79 million words	yes	2020	subcorpus of texts from years 1955–1989 (4.5 % journalistic, 78.6 % fiction, 12.4 % professional, 4.4 % other texts)
prim-8.0-juls-all	1 647 million tokens / 1 295 million words	yes	– internal corpus	main (monolingual) corpus comprised of all texts published or written after the year 1955
prim-8.0-public-all	1 477 million tokens / 1 160 million words	yes	2018	main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search (71.1 % journalistic, 15.4 % fiction, 8.5 % professional, 5.0 % other texts)
prim-8.0-juls-sane	1 518 million tokens / 1 195 million words	yes	– internal corpus	main (monolingual) corpus excluding texts with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc.
prim-8.0-public-sane	1 369 million tokens / 1 076 million words	yes	2018	main (monolingual) corpus of texts under the license on on-line search, excluding texts: with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc. − the corpus is further divided into subcorpora
prim-8.0-public-vyv	377 million tokens / 298 million words	yes	2018	subcorpus balanced with regard to style (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts)
prim-8.0-public-inf	1 010 million tokens / 791 million words	yes	2018	subcorpus of journalistic (informational) texts
prim-8.0-public-prf	122 million tokens / 96 million words	yes	2018	subcorpus of scientific, professional and popular science texts
prim-8.0-public-img	224 million tokens / 178 million words	yes	2018	subcorpus of fiction texts
prim-8.0-public-sk	1 043 million tokens / 822 million words	yes	2018	subcorpus of original texts written in Slovak
prim-8.0-public-img-sk	83 million tokens / 66 million words	yes	2018	subcorpus of original fiction texts written in Slovak
r1955az1989-5.0	84 million tokens / 67 million words	yes	2018	subcorpus of texts from years 1955–1989 (5.3 % journalistic, 75.3 % fiction, 14.0 % professional, 5.4 % other texts)
prim-7.0-juls-all	1 437 million tokens / 1 119 million words	yes	– internal corpus	main (monolingual) corpus comprised of all texts published or written after the year 1955
prim-7.0-public-all	1 250 million tokens / 972 million words	yes	2015	main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search (65.1 % journalistic, 15.1 % fiction, 9.5 % professional, 10.3 % other texts)
prim-7.0-juls-sane	1 202 million tokens / 938 million words	yes	– internal corpus	main (monolingual) corpus excluding texts with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc.
prim-7.0-public-sane	1 089 million tokens / 849 million words	yes	2015	main (monolingual) corpus of texts under the license on on-line search, excluding texts: with incorrect diacritics, from outside the territory of Slovakia, from linguistic journals, scholarly works etc. − the corpus is further divided into subcorpora
prim-7.0-public-vyv	341 million tokens / 267 million words	yes	2015	subcorpus balanced with regard to style (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts)
prim-7.0-public-inf	771 million tokens / 597 million words	yes	2015	subcorpus of journalistic (informational) texts
prim-7.0-public-prf	114 million tokens / 89 million words	yes	2015	subcorpus of scientific, professional and popular science texts
prim-7.0-public-img	188 million tokens / 149 million words	yes	2015	subcorpus of fiction texts
prim-7.0-public-sk	807 million tokens / 630 million words	yes	2015	subcorpus of original texts written in Slovak
prim-7.0-public-img-sk	65 million tokens / 52 million words	yes	2015	subcorpus of original fiction texts written in Slovak
r1955az1989-4.0	67 million tokens / 54 million words	yes	2015	subcorpus of texts from years 1955–1989 (7.4 % journalistic, 69.3 % fiction, 16.6 % professional, 6.7 % other texts)
prim-6.1-public-all	830 million tokens / 656 million words	yes	2013	main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search (68.8 % journalistic, 13.9 % fiction, 15.3 % professional, 2 % other texts)
r55az89-3.0	63 million tokens / 51 million words	yes	2013	subcorpus of texts from years 1955–1989 (11.9 % journalistic, 55.5 % fiction, 24.1 % professional, 8.5 % other texts)
prim-6.0-public-all	1 155 million tokens / 939 million words	yes	2013	main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search (77.8 % journalistic, 9.8 % fiction, 11 % professional, 1.4 % other texts)
prim-5.0-public-all	719 million tokens / 599 million words	yes	2011	main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search (73 % journalistic, 14 % fiction, 12 % professional, 1 % other texts)
r55az89-2.0	44 million tokens / 35 million words	yes	2011	subcorpus of texts from years 1955–1989
prim-4.0-public-all	526 million tokens / 429 million words	yes	2009	main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search (65 % journalistic, 17 % fiction, 16 % professional, 2 % other texts)
r55az89-1.0	40 million tokens / 32 million words	yes	2009	subcorpus of texts from years 1955–1989
prim-3.0-public-all	339 million tokens / 276 million words	yes	2007	main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search (57 % journalistic, 21.5 % fiction, 18.5 % professional, 3 % other texts)
prim-2.1-public-all	294 million tokens / 229 million words	yes	2006	main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search (63 % journalistic, 20 % fiction, 12 % professional, 5 % other texts)
prim-2.0-public-all	250 million tokens	pilot	2005	main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search
prim-1.0-public-all	182 million tokens	test	2004	main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search
prim-0.2-public-all	170 million tokens	no	2003	main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search
prim-0.1-public-all	30 million tokens	no	2003	main (monolingual) corpus comprised of all texts published or written after the year 1955 under the license on on-line search

2. Written corpora − synchronous, web

corpus	size (number of tokens / number of words)	lemmatisation, morphological annotation	year of release	characteristics
web-6.0	4 373 million tokens / 3 639 million words	yes	2022	corpus of Slovak texts available on the web
web-5.0	4 042 million tokens / 3 326 million words	yes	2020	corpus of Slovak texts available on the web
web-4.0	2 963 million tokens / 2 440 million words	yes	2018	corpus of Slovak texts available on the web
web-3.0	2 372 million tokens / 1 993 million words	yes	2015	corpus of Slovak texts available on the web
web-2.0	1 046 million tokens / 839 million words	yes	2012	corpus of Slovak texts available on the web
web-1.0	952 million tokens / 773 million words	yes	2011	corpus of Slovak texts available on the web
wiki-2019-08	51 million tokens / 38 million words	yes	2020	corpus of texts from Slovak Wikipédia (as of 2019-08-01)
wiki-2018-03	47 million tokens / 35 million words	yes	2018	corpus of texts from Slovak Wikipédia and Necyklopédia (as of 2018-03-15)
wiki-2017-02	45 million tokens / 34 million words	yes	2017	corpus of texts from Slovak Wikipédia and Necyklopédia (as of 2017-02-28)
wiki-2016-02	43 million tokens / 34 million words	yes	2016	corpus of texts from Slovak Wikipédia and Necyklopédia (as of 2016-02-26)
wiki-2015-02	40 million tokens / 32 million words	yes	2015	corpus of texts from Slovak Wikipédia and Necyklopédia (as of 2015-02-28)

3. Written corpora − synchronous, merged

corpus	size (number of tokens / number of words)	lemmatisation, morphological annotation	year of release	characteristics
omnia-2.0-public	2 239 million tokens	yes	2013	merged corpus built from prim-6.0-public-all, s-hovor-4.0, legal-1.1, web-1.1 and web-1.2 after removing duplicate texts and duplicate parts of texts (deduplication), with minor changes to tokenization (hyphenated words are treated as a single token) and to lemmatization (negated forms are assigned the affirmative lemma). It was prepared from SNC sources by V. Benko, primarily for the staff of the Department of Contemporary Lexicology and Lexicography of the Štúr Institute of Linguistics of the Slovak Academy of Sciences.
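
The two preparation steps named for omnia-2.0-public, deduplication and hyphen-aware tokenization, can be illustrated with a minimal sketch. This is not the procedure used to build the corpus: the regular expression, the paragraph-level hashing and the function names `tokenize` and `deduplicate` are assumptions made for illustration only, and the lemmatization of negated forms is not covered.

```python
import hashlib
import re

# Assumption: a token is a run of word characters, optionally joined by
# internal hyphens (so "česko-slovenský" stays one token); any other
# non-space character becomes a token of its own.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into tokens, keeping hyphenated words together."""
    return TOKEN_RE.findall(text)


def deduplicate(paragraphs):
    """Keep the first occurrence of each paragraph.

    Paragraphs are compared after lower-casing and collapsing whitespace,
    which is a simplification of real near-duplicate detection.
    """
    seen = set()
    unique = []
    for paragraph in paragraphs:
        normalised = " ".join(paragraph.lower().split())
        key = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(paragraph)
    return unique
```

Exact-match hashing only catches verbatim repeats; removing duplicate parts of texts, as the description mentions, would need a finer-grained comparison than this sketch performs.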

4. Written corpora − parallel

corpus	size (number of tokens / number of words)	lemmatisation, morphological annotation	year of release (first version released in)	characteristics
par-sken-all-4.0	556 million tokens / 436 million words	yes, both languages	2015 (2010)	Slovak-English parallel corpus: 261 million tokens in Slovak half, 295 million tokens in English half
par-sken-fic-4.0	200 million tokens / 160 million words	yes, both languages	2015	Slovak-English parallel corpus, fiction subcorpus: 92 million tokens in Slovak half, 108 million tokens in English half
par-skbg-free-0.1	163 million tokens / 108 million words	yes, both languages	2014	Slovak-Bulgarian parallel corpus: 78 million tokens in Slovak half, 85 million tokens in Bulgarian half
par-skcs-all-4.0	418 million tokens / 306 million words	yes, both languages	2016 (2010)	Slovak-Czech parallel corpus: 209 million tokens in Slovak half, 209 million tokens in Czech half
par-skcs-fic-5.0	31.5 million tokens / 25.0 million words	yes, both languages	2018 (2010)	Slovak-Czech parallel corpus, fiction subcorpus: 15.7 million tokens in Slovak half, 15.8 million tokens in Czech half
par-skfr-all-3.0	449 million tokens / 332 million words	yes, both languages	2016 (2006)	Slovak-French parallel corpus: 217 million tokens in Slovak half, 233 million tokens in French half
par-skfr-fic-3.0	9.9 million tokens / 8.3 million words	yes, both languages	2016 (2006)	Slovak-French parallel corpus, fiction subcorpus: 4.3 million tokens in Slovak half, 5.5 million tokens in French half
par-skla-3.0	5.0 million tokens / 4.1 million words	yes, both languages	2018 (2012)	Slovak-Latin parallel corpus: 2.7 million tokens in Slovak half, 2.3 million tokens in Latin half
par-skhu-all-1.0	99 million tokens / 75 million words	yes, both languages	2015 (2014)	Slovak-Hungarian parallel corpus: 51 million tokens in Slovak half, 48 million tokens in Hungarian half
par-skhu-fic-1.0	4.0 million tokens / 3.2 million words	yes, both languages	2015	Slovak-Hungarian parallel corpus, fiction subcorpus: 2.0 million tokens in Slovak half, 1.9 million tokens in Hungarian half
par-skde-all-3.0	468 million tokens / 318 million words	yes, both languages	2022 (2014)	Slovak-German parallel corpus: 230 million tokens in Slovak half, 238 million tokens in German half
par-skde-fic-3.0	29.7 million tokens / 24.1 million words	yes, both languages	2022	Slovak-German parallel corpus, fiction subcorpus: 13.7 million tokens in Slovak half, 16.0 million tokens in German half
par-skde-all-2.0	446 million tokens / 300 million words	yes, both languages	2016 (2014)	Slovak-German parallel corpus: 220 million tokens in Slovak half, 226 million tokens in German half
par-skde-fic-2.0	7.6 million tokens / 6.2 million words	yes, both languages	2016	Slovak-German parallel corpus, fiction subcorpus: 3.5 million tokens in Slovak half, 4.1 million tokens in German half
par-skpl-1.0	8.2 million tokens / 6.5 million words	yes, both languages	2018 (2018)	Slovak-Polish parallel corpus: 4.1 million tokens in Slovak half, 4.1 million tokens in Polish half
par-skro-1.1	1.3 million tokens / 1.0 million words	yes, both languages	2017 (2016)	Slovak-Romanian parallel corpus: 603 111 tokens in Slovak half, 688 867 tokens in Romanian half
par-skru-2.0	8.5 million tokens / 6.6 million words	yes, both languages	2014 (2005)	Slovak-Russian parallel corpus: 4.2 million tokens in Slovak half, 4.2 million tokens in Russian half
par-skes-2.0	35.6 million tokens / 29.4 million words	yes, both languages	2022 (2019)	Slovak-Spanish parallel corpus: 16.7 million tokens in Slovak half, 18.9 million tokens in Spanish half
par-skes-1.0	11.5 million tokens / 9.6 million words	yes, both languages	2019 (2019)	Slovak-Spanish parallel corpus: 5.5 million tokens in Slovak half, 6.0 million tokens in Spanish half
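
Each row of the parallel-corpora table states a total size and the sizes of the two language halves, so the figures can be cross-checked. The sketch below assumes the rows are available as tab-separated text with the five columns shown in the table; the function names `parse_tokens` and `halves_match_total` and the 1 % rounding tolerance are illustrative choices, not part of any SNC tool.

```python
import re

# Matches sizes such as "556 million tokens", "31.5 million tokens"
# or "603 111 tokens" (digits grouped by spaces, optional "million").
SIZE_RE = re.compile(r"(\d[\d ]*(?:\.\d+)?)\s*(million)?\s*tokens")


def parse_tokens(text):
    """Return every token count found in `text`, as a number of tokens."""
    counts = []
    for number, unit in SIZE_RE.findall(text):
        value = float(number.replace(" ", ""))
        counts.append(value * 1_000_000 if unit else value)
    return counts


def halves_match_total(row, tolerance=0.01):
    """Check that the two halves add up to the stated total.

    `tolerance` is a relative margin that absorbs rounding in the table.
    """
    _name, size, _annotation, _year, characteristics = row.split("\t")
    total = parse_tokens(size)[0]
    halves = parse_tokens(characteristics)
    return abs(sum(halves) - total) <= tolerance * total
```

For par-sken-all-4.0, for instance, the halves are 261 million and 295 million tokens, which add up to the stated 556 million.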

5. Written corpora − synchronous, acquisitional

corpus	size (number of tokens / number of words)	lemmatisation, morphological annotation	year of release	characteristics
errkorp-pilot	137 393 tokens / 112 271 words	yes	2022	corpus of written texts of students learning Slovak as a foreign language

6. Written corpora − synchronous, specialised

corpus	size (number of tokens / number of words)	lemmatisation, morphological annotation	year of release	characteristics
blf-2.0	66 million tokens / 54 million words	yes	2014	corpus of religious texts
blf-1.0	15 million tokens / 12 million words	yes	2008	corpus of religious texts
cw-2014-all	1.6 million tokens / 1.2 million words	yes	2014	corpus of copywriting texts
ecn-2.0-public	165 million tokens / 140 million words	yes	2016	corpus of economic texts (3.8 % professional and 96.2 % journalistic texts from the field of economics, banking, trade, management and merchandising)
ecn-1.0-public	20 million tokens / 17 million words	yes	2014	corpus of economic texts (81.4 % professional and 18.6 % journalistic texts from the field of economics, banking, trade, management and merchandising)
gov-web-1.0	11.7 million tokens / 9.6 million words	yes	2019	corpus of texts of state service
hum-1.0-public	39 million tokens / 30 million words	yes	2016	corpus of texts from the humanities
judikat-1.0	1.5 million tokens / 1.3 million words	yes	2015	corpus of judicial decisions
legal-1.1	49 million tokens / 40 million words	yes	2013	corpus of legal texts (deduplicated)
legal-1.0	147 million tokens / 114 million words	yes	2011	corpus of legal texts
od-justice-1.0	4 149 million tokens	yes	2019	corpus of texts of judgments (corpus from project OpenData)
prim-7.0-frk	253 million tokens / 203 million words	yes	2018	The reference corpus prim-7.0-frk was the source for “Frekvenčný slovník slovenčiny na báze Slovenského národného korpusu” (Slovak Frequency Dictionary Based on the Slovak National Corpus), as well as for the examples listed in the publication “Skloňovanie podstatných mien v slovenčine s korpusovými príkladmi” (Declension of the Slovak Nouns with Corpus Examples).
r-mak-6.0	1 199 794 tokens / 977 871 words	yes	2017	manually morphologically annotated corpus (30.6 % journalistic, 50.2 % fiction, 19.2 % professional texts)
r-mak-5.0	1 200 088 tokens / 977 871 words	yes	2016	manually morphologically annotated corpus (28.5 % journalistic, 44.5 % fiction, 27 % professional texts)
r-mak-4.0	1 199 224 tokens / 976 877 words	yes	2013	manually morphologically annotated corpus (36.2 % journalistic, 44.9 % fiction, 18.9 % professional texts)
r-mak-3.0	1 207 813 tokens / 983 714 words	yes	2008	manually morphologically annotated corpus (36.7 % journalistic, 44.3 % fiction, 19.0 % professional texts)
r-mak-2.0	511 432 tokens / 410 177 words	yes	2007	manually morphologically annotated corpus (28.9 % journalistic, 58.1 % fiction, 13.0 % professional texts)
r-mak-1.0	322 498 tokens / 256 647 words	yes	2006	manually morphologically annotated corpus (41.8 % journalistic, 57.9 % fiction, 0.2 % professional texts)

7. Written corpora of texts before the year 1955 (mainly texts of books from the SME Golden Fund)

corpus	size (number of tokens / number of words)	lemmatisation, morphological annotation	year of release	characteristics
r864az1843-1.0	2.1 million tokens / 1.6 million words	no	2015	corpus of texts from 864–1843: texts transcribed into contemporary Slovak, orthography as used in the latest edition
r1843az1954-1.0	24 million tokens / 19 million words	no	2015	corpus of texts from 1843–1954: texts transcribed into contemporary Slovak, orthography as used in the latest edition

8. Spoken corpora − synchronous, standard

version of corpus and subcorpus	size (number of tokens / number of words)	lemmatisation, morphological annotation	year of release	characteristics
s-hovor-7.0	7.9 million tokens	yes	2022	corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia
s-hovor-7.0-sane	4.2 million tokens	yes	2022	subcorpus of the Corpus of Spoken Slovak: utterances and their transcriptions, excluding the recordings provided by the Nation’s Memory Institute
s-hovor-7.0-upn	3.6 million tokens	yes	2022	subcorpus of the Corpus of Spoken Slovak: utterances and their transcriptions from the Project Oral History within the Nation’s Memory Institute
s-hovor-6.0	6.6 million tokens / 5.5 million words	yes	2017	corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia
s-hovor-6.0-sane	3.7 million tokens / 3.0 million words	yes	2017	subcorpus of the Corpus of Spoken Slovak: utterances and their transcriptions, excluding the recordings provided by the Nation’s Memory Institute
s-hovor-6.0-upn	2.9 million tokens / 2.4 million words	yes	2017	subcorpus of the Corpus of Spoken Slovak: utterances and their transcriptions from the Project Oral History within the Nation’s Memory Institute
s-hovor-5.0	5.7 million tokens / 4.7 million words	yes	2015	corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia
s-hovor-5.0-sane	3.6 million tokens / 3.0 million words	yes	2015	subcorpus of the Corpus of Spoken Slovak: utterances and their transcriptions, excluding the recordings provided by the Nation’s Memory Institute
s-hovor-5.0-upn	2.1 million tokens / 1.8 million words	yes	2015	subcorpus of the Corpus of Spoken Slovak: utterances and their transcriptions from the Project Oral History within the Nation’s Memory Institute
s-hovor-4.0	2.6 million tokens / 2.2 million words	yes	2012	corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia
s-hovor-4.0-sane	1.6 million tokens / 1.3 million words	yes	2012	subcorpus of the Corpus of Spoken Slovak: utterances and their transcriptions, excluding the recordings provided by the Nation’s Memory Institute
s-hovor-4.0-upn	1.0 million tokens / 0.9 million words	yes	2012	subcorpus of the Corpus of Spoken Slovak: utterances and their transcriptions from the Project Oral History within the Nation’s Memory Institute
s-hovor-3.0	2.1 million tokens / 1.4 million words	yes	2011	corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia
s-hovor-2.0	678 592 tokens / 560 933 words	yes	2010	corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia
s-hovor-1.0	127 714 tokens / 104 458 words	yes	2008	corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia

9. Corpora of dialects of the SNC

corpus	size (number of tokens / number of words)	lemmatisation, morphological annotation	year of release	characteristics
dialekt-5.0	980 643 tokens / 786 312 words	no	2022	corpus of dialects of the Slovak National Corpus: published texts based on dialect audio or transcribed recordings that cover various dialect areas of Slovakia
dialekt-4.0	711 766 tokens / 571 352 words	no	2018	corpus of dialects of the Slovak National Corpus: published texts based on dialect audio or transcribed recordings that cover various dialect areas of Slovakia
dialekt-3.0	494 722 tokens / 403 180 words	no	2016	corpus of dialects of the Slovak National Corpus: published texts based on dialect audio or transcribed recordings that cover various dialect areas of Slovakia
dialekt-2.0	328 907 tokens / 252 166 words	no	2015	corpus of dialects of the Slovak National Corpus: published texts based on dialect audio or transcribed recordings that cover various dialect areas of Slovakia
dialekt-1.0	73 855 tokens / 54 598 words	no	2014	corpus of dialects of the Slovak National Corpus: published texts based on dialect audio or transcribed recordings that cover various dialect areas of Slovakia

10. Historical corpus

corpus	size (number of tokens / number of words)	lemmatisation, morphological annotation	year of release	characteristics
hist-6.0	916 743 tokens / 720 492 words	no	2022	corpus of historical Slovak: source materials (in original spelling)
hist-5.0	997 809 tokens / 731 498 words	no	2020	corpus of historical Slovak: source materials (in original spelling)
hist-4.0	917 586 tokens / 668 245 words	no	2016	corpus of historical Slovak: source materials (in original spelling)
hist-3.0	836 393 tokens / 600 410 words	no	2015	corpus of historical Slovak: source materials (in original spelling)
hist-2.0	551 973 tokens / 422 166 words	no	2014	corpus of historical Slovak: source materials (in original spelling)
hist-1.0	370 758 tokens	no	2012	corpus of historical Slovak: source materials (in original spelling)
