Publicly available corpora of the Slovak National Corpus (SNC)

1. Written corpora − synchronic, general

corpus

size − number of tokens / number of words

lemmatisation, morphological annotation

year of release

characteristics

prim-7.0-public-all

1 250 million tokens / 972 million words

yes

2015

all publicly available texts in SNC (65.1 % journalistic, 15.1 % fiction, 9.5 % professional and 10.3 % other texts)

prim-7.0-public-sane

1 089 million tokens / 849 million words

yes

2015

corpus excluding: texts with incorrectly used diacritics, texts from outside the territory of Slovakia, linguistic journals, students' papers etc.

prim-7.0-public-vyv

341 million tokens / 267 million words

yes

2015

subcorpus balanced with regard to style (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts)

prim-7.0-public-inf

771 million tokens / 597 million words

yes

2015

subcorpus of journalistic (informational) texts

prim-7.0-public-prf

114 million tokens / 89 million words

yes

2015

subcorpus of scientific, professional and popular science texts

prim-7.0-public-img

188 million tokens / 149 million words

yes

2015

subcorpus of fiction texts

prim-7.0-public-sk

807 million tokens / 630 million words

yes

2015

subcorpus of original texts written in Slovak

prim-7.0-public-img-sk

65 million tokens / 52 million words

yes

2015

subcorpus of original fiction texts written in Slovak

r1955az1989-4.0

67 million tokens / 54 million words

yes

2015

specific corpus of texts from the years 1955–1989 (7.4 % journalistic, 69.3 % fiction, 16.6 % professional and 6.7 % other texts)

prim-6.1-public-all

830 million tokens / 656 million words

yes

2013

all publicly available SNC texts (68.8 % journalistic, 13.9 % fiction, 15.3 % professional and 2 % other texts)

prim-6.1-public-sane

773 million tokens / 610 million words

yes

2013

corpus excluding: texts with incorrectly used diacritics, texts from outside the territory of Slovakia, linguistic journals, students' papers etc.

prim-6.1-public-vyv

317 million tokens / 252 million words

yes

2013

subcorpus balanced with regard to style (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts)

prim-6.1-public-inf

541 million tokens / 425 million words

yes

2013

subcorpus of journalistic (informational) texts

prim-6.1-public-prf

106 million tokens / 84 million words

yes

2013

subcorpus of scientific, professional and popular science texts

prim-6.1-public-img

114 million tokens / 91 million words

yes

2013

subcorpus of fiction texts

prim-6.1-public-sk

558 million tokens / 441 million words

yes

2013

subcorpus of original texts written in Slovak

prim-6.1-public-img-sk

35 million tokens / 28 million words

yes

2013

subcorpus of original Slovak fiction texts

r55az89-3.0

63 million tokens / 51 million words

yes

2013

specific corpus of texts from the years 1955–1989 (11.9 % journalistic, 55.5 % fiction, 24.1 % professional and 8.5 % other texts)

prim-6.0-public-all

1 155 million tokens / 939 million words

yes

2013

all publicly available SNC texts (77.8 % journalistic, 9.8 % fiction, 11 % professional, 1.4 % other texts)

prim-5.0-public-all

719 million tokens / 599 million words

yes

2011

all publicly available SNC texts (73 % journalistic, 14 % fiction, 12 % professional, 1 % other texts)

prim-4.0-public-all

526 million tokens / 429 million words

yes

2009

all publicly available SNC texts (65 % journalistic, 17 % fiction, 16 % professional, 2 % other texts)

prim-3.0-public-all

339 million tokens / 276 million words

yes

2007

all publicly available SNC texts (57 % journalistic, 21.5 % fiction, 18.5 % professional, 3 % other texts)

prim-2.1-public-all

294 million tokens / 240 million words

yes

2006

all publicly available SNC texts (63 % journalistic, 20 % fiction, 12 % professional, 5 % other texts)

web-3.0

2 372 million tokens / 1 993 million words

yes

2015

corpus of Slovak texts available on the web

wiki-2016-02

43 million tokens / 34 million words

yes

2016

corpus of texts from Slovak Wikipédia and Necyklopédia (as of 2016-02-26)

wiki-2015-02

40 million tokens / 32 million words

yes

2015

corpus of texts from Slovak Wikipédia and Necyklopédia (as of 2015-02-28)

2. Written corpora − synchronic, specialised

corpus

size − number of tokens / number of words

lemmatisation, morphological annotation

year of release

characteristics

cw-2014-all

1.6 million tokens / 1.2 million words

yes

2014

corpus of copywriting texts

ecn-2.0-public

165 million tokens / 140 million words

yes

2016

corpus of economic texts (3.76 % professional and 96.24 % journalistic texts from the field of economics, banking, trade, management and merchandising)

ecn-1.0-public

20 million tokens / 17 million words

yes

2014

corpus of economic texts (81.4 % professional and 18.6 % journalistic texts from the field of economics, banking, trade, management and merchandising)

blf-2.0

66 million tokens / 54 million words

yes

2014

corpus of religious texts

legal-1.1

49 million tokens / 40 million words

yes

corpus of legal texts (deduplicated)

legal-1.0

147 million tokens / 114 million words

yes

2011

corpus of legal texts

judikat-1.0

1.5 million tokens / 1.3 million words

yes

2015

corpus of judicial decisions

r-mak-5.0

1.2 million tokens / 978 thousand words

yes

2016

manually morphologically annotated corpus (28.5 % journalistic, 44.5 % fiction, 27 % professional texts)

r-mak-4.0

1.2 million tokens / 977 thousand words

yes

2013

manually morphologically annotated corpus (36.2 % journalistic, 44.9 % fiction, 18.9 % professional texts)

3. Written corpora − parallel

corpus

size − number of tokens / number of words

lemmatisation, morphological annotation

year of release (first version released in)

characteristics

par-sken-4.0

556 million tokens

yes, both languages

2015 (2010)

Slovak-English parallel corpus: 261 million tokens in Slovak half, 295 million tokens in English half

par-skbg-free-0.1

163 million tokens

yes, both languages

2014

Slovak-Bulgarian parallel corpus: 78 million tokens in Slovak half, 85 million tokens in Bulgarian half

par-skcs-all-4.0

418 million tokens

yes, both languages

2016 (2010)

Slovak-Czech parallel corpus: 209 million tokens in Slovak half, 209 million tokens in Czech half

par-skfr-all-2.0

442 million tokens

yes, both languages

2016 (2006)

Slovak-French parallel corpus: 213 million tokens in Slovak half, 228 million tokens in French half

par-skla-2.0

1.4 million tokens

yes, both languages

2012 (2012)

Slovak-Latin parallel corpus: 781 thousand tokens in Slovak half, 662 thousand tokens in Latin half

par-skhu-1.0

99 million tokens

yes, both languages

2015 (2014)

Slovak-Hungarian parallel corpus: 51 million tokens in Slovak half, 48 million tokens in Hungarian half

par-skde-all-2.0

446 million tokens

yes, both languages

2016 (2014)

Slovak-German parallel corpus: 220 million tokens in Slovak half, 226 million tokens in German half

par-skru-2.0

8.5 million tokens

yes, both languages

2014 (2005)

Slovak-Russian parallel corpus: 4.2 million tokens in Slovak half, 4.2 million tokens in Russian half
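
The total size given for each parallel corpus should roughly equal the sum of its two language halves. The following is a minimal sketch of that consistency check, using only the figures listed in this section (in millions of tokens). The names `PARALLEL`, `relative_gap`, `check` and the 5 % tolerance are illustrative assumptions, not part of any SNC tool.

```python
# Sizes in millions of tokens, copied from the parallel corpora listed above:
# (corpus, stated total, Slovak half, other-language half)
PARALLEL = [
    ("par-sken-4.0", 556, 261, 295),
    ("par-skbg-free-0.1", 163, 78, 85),
    ("par-skcs-all-4.0", 418, 209, 209),
    ("par-skfr-all-2.0", 442, 213, 228),
    ("par-skla-2.0", 1.4, 0.781, 0.662),
    ("par-skhu-1.0", 99, 51, 48),
    ("par-skde-all-2.0", 446, 220, 226),
    ("par-skru-2.0", 8.5, 4.2, 4.2),
]


def relative_gap(total, slovak, other):
    """Difference between the stated total and the sum of halves, relative to the total."""
    return abs(total - (slovak + other)) / total


def check(rows, tolerance=0.05):
    """Return the names of corpora whose halves miss the stated total by more than `tolerance`."""
    return [
        name
        for name, total, slovak, other in rows
        if relative_gap(total, slovak, other) > tolerance
    ]


if __name__ == "__main__":
    for name, total, slovak, other in PARALLEL:
        gap = relative_gap(total, slovak, other)
        print(f"{name}: total={total}, halves={slovak + other:.3f}, gap={gap:.2%}")
    print("outside tolerance:", check(PARALLEL))
```

Small gaps are expected because the published figures are rounded; with a tolerance of zero, only the entries whose rounded halves do not add up exactly to the rounded total are reported.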

4. Written corpora of texts before the year 1955

corpus

size − number of tokens / number of words

lemmatisation, morphological annotation

year of release

characteristics

r864az1843-1.0

2.1 million tokens

no

2015

corpus of texts from 864–1843: texts transcribed into contemporary Slovak, orthography as used in the latest edition

r1843az1954-1.0

24 million tokens

no

2015

corpus of texts from 1843–1954: texts transcribed into contemporary Slovak, orthography as used in the latest edition

5. Historical corpus

corpus

size − number of tokens / number of words

lemmatisation, morphological annotation

year of release

characteristics

hist-3.0

836 thousand tokens

no

2015

corpus of historical Slovak: source materials (in original spelling)

hist-2.0

552 thousand tokens

no

2014

corpus of historical Slovak: source materials (in original spelling)

hist-1.0

371 thousand tokens

no

2012

corpus of historical Slovak: source materials (in original spelling)

6. Spoken corpora − synchronic, standard

corpus

size − number of tokens / number of words

lemmatisation, morphological annotation

year of release

characteristics

s-hovor-5.0

5.7 million tokens

yes

2015

corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia

s-hovor-4.0

2.6 million tokens

yes

2012

corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia

s-hovor-3.0

2.1 million tokens

yes

2011

corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia

s-hovor-2.0

679 thousand tokens

yes

2010

corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia

s-hovor-1.0

128 thousand tokens

yes

2008

corpus of spoken Slovak: speech utterances and their transcriptions into standardized Slovak covering the whole territory of Slovakia

7. Corpora of dialects of the SNC

corpus

size − number of tokens / number of words

lemmatisation, morphological annotation

year of release

characteristics

dialekt-2.0

329 thousand tokens

no

2015

corpus of dialects of the Slovak National Corpus: published texts based on dialect audio or transcribed recordings that cover various dialect areas

dialekt-1.0

74 thousand tokens

no

2014

corpus of dialects of the Slovak National Corpus: published texts based on dialect audio or transcribed recordings that cover various dialect areas