
Attributes and structural values in the publicly available SNC corpora

1. Written corpora − synchronic, general

corpus

size − number of tokens / number of words

lemmatisation, morphological annotation

year of release

characteristics

attributes

structures

prim-9.0-public-all

1 652 million tokens / 1 282 million words

yes

2020

all publicly available texts in SNC (74.0 % journalistic, 16.0 % fiction, 9.2 % professional and 0.9 % other texts)

word, lemma, tag, prec, word_lc, lemma_lc

doc, s, p, g, noise, hi

prim-9.0-public-sane

1 621 million tokens / 1 257 million words

yes

2020

corpus excluding: texts with incorrectly used diacritics, texts written before 1955, texts from outside the territory of Slovakia, texts from linguistic journals

word, lemma, tag, prec, word_lc, lemma_lc

doc, s, p, g, noise, hi

prim-9.0-public-vyv

454 million tokens / 355 million words

yes

2020

balanced subcorpus (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts)

word, lemma, tag, prec, word_lc, lemma_lc

doc, s, p, g, noise, hi

prim-9.0-public-inf

1 194 million tokens / 920 million words

yes

2020

subcorpus of journalistic (informational) texts

word, lemma, tag, prec, word_lc, lemma_lc

doc, s, p, g, noise, hi

prim-9.0-public-prf

150 million tokens / 117 million words

yes

2020

subcorpus of scientific, professional and popular science texts

word, lemma, tag, prec, word_lc, lemma_lc

doc, s, p, g, hi

prim-9.0-public-img

263 million tokens / 208 million words

yes

2020

subcorpus of fiction texts

word, lemma, tag, prec, word_lc, lemma_lc

doc, s, p, g, hi

prim-9.0-public-sk

1 258 million tokens / 977 million words

yes

2020

subcorpus of original texts written in Slovak

word, lemma, tag, prec, word_lc, lemma_lc

doc, s, p, g, hi

prim-9.0-public-img-sk

93 million tokens / 74 million words

yes

2020

subcorpus of original fiction texts written in Slovak

word, lemma, tag, prec, word_lc, lemma_lc

doc, s, p, g, hi

r1955az1989-6.0

99 million tokens / 79 million words

yes

2020

specific corpus of texts from the years 1955–1989 (4.5 % journalistic, 78.6 % fiction, 12.4 % professional and 4.4 % other texts)

word, lemma, tag, prec, word_lc, lemma_lc

doc, s, p, g, hi

prim-8.0-public-all

1 477 million tokens / 1 160 million words

yes

2018

all publicly available texts in SNC (71.1 % journalistic, 15.4 % fiction, 8.5 % professional and 5.0 % other texts)

word, lemma, tag, prec, word_lc, lemma_lc

doc, s, p, g, noise, hi

prim-8.0-public-sane

1 369 million tokens / 1 076 million words

yes

2018

corpus excluding: texts with incorrectly used diacritics, texts written before 1955, texts from outside the territory of Slovakia, texts from linguistic journals

word, lemma, tag, prec, word_lc, lemma_lc

doc, s, p, g, noise, hi

prim-8.0-public-vyv

377 million tokens / 298 million words

yes

2018

balanced subcorpus (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts)

word, lemma, tag, prec, word_lc, lemma_lc

doc, s, p, g, noise, hi

prim-8.0-public-inf

1 010 million tokens / 791 million words

yes

2018

subcorpus of journalistic (informational) texts

word, lemma, tag, prec, word_lc, lemma_lc

doc, s, p, g, noise, hi

prim-8.0-public-prf

122 million tokens / 96 million words

yes

2018

subcorpus of scientific, professional and popular science texts

word, lemma, tag, prec, word_lc, lemma_lc

doc, s, p, g, hi

prim-8.0-public-img

224 million tokens / 178 million words

yes

2018

subcorpus of fiction texts

word, lemma, tag, prec, word_lc, lemma_lc

doc, s, p, g, hi

prim-8.0-public-sk

1 043 million tokens / 822 million words

yes

2018

subcorpus of original texts written in Slovak

word, lemma, tag, prec, word_lc, lemma_lc

doc, s, p, g, hi

prim-8.0-public-img-sk

83 million tokens / 66 million words

yes

2018

subcorpus of original fiction texts written in Slovak

word, lemma, tag, prec, word_lc, lemma_lc

doc, s, p, g, hi

r1955az1989-5.0

84 million tokens / 67 million words

yes

2018

specific corpus of texts from the years 1955–1989 (5.3 % journalistic, 75.3 % fiction, 14.0 % professional and 5.4 % other texts)

word, lemma, tag, prec, word_lc, lemma_lc

doc, s, p, g, hi

prim-7.0-public-all

1 250 million tokens / 972 million words

yes

2015

all publicly available texts in SNC (65.1 % journalistic, 15.1 % fiction, 9.5 % professional and 10.3 % other texts)

word, lemma, tag, prec

doc, s, p, g

prim-7.0-public-sane

1 089 million tokens / 849 million words

yes

2015

corpus excluding: texts with incorrectly used diacritics, texts written before 1955, texts from outside the territory of Slovakia, texts from linguistic journals

word, lemma, tag, prec

doc, s, p, g

prim-7.0-public-vyv

341 million tokens / 267 million words

yes

2015

balanced subcorpus (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts)

word, lemma, tag, prec

doc, s, p, g

prim-7.0-public-inf

771 million tokens / 597 million words

yes

2015

subcorpus of journalistic (informational) texts

word, lemma, tag, prec

doc, s, p, g

prim-7.0-public-prf

114 million tokens / 89 million words

yes

2015

subcorpus of scientific, professional and popular science texts

word, lemma, tag, prec

doc, s, p, g

prim-7.0-public-img

188 million tokens / 149 million words

yes

2015

subcorpus of fiction texts

word, lemma, tag, prec

doc, s, p, g

prim-7.0-public-sk

807 million tokens / 630 million words

yes

2015

subcorpus of original texts written in Slovak

word, lemma, tag, prec

doc, s, p, g

prim-7.0-public-img-sk

65 million tokens / 52 million words

yes

2015

subcorpus of original fiction texts written in Slovak

word, lemma, tag, prec

doc, s, p, g

r1955az1989-4.0

67 million tokens / 54 million words

yes

2015

specific corpus of texts from the years 1955–1989 (7.4 % journalistic, 69.3 % fiction, 16.6 % professional and 6.7 % other texts)

word, lemma, tag, prec

doc, s, p, g

prim-6.1-public-all

830 million tokens / 656 million words

yes

2013

all publicly available SNC texts (68.8 % journalistic, 13.9 % fiction, 15.3 % professional and 2 % other texts)

word, lemma, tag, prec

doc, s, p, g

prim-6.1-public-sane

773 million tokens / 610 million words

yes

2013

corpus excluding: texts with incorrectly used diacritics, texts written before 1955, texts from outside the territory of Slovakia, texts from linguistic journals

word, lemma, tag, prec

doc, s, p, g

prim-6.1-public-vyv

317 million tokens / 252 million words

yes

2013

balanced subcorpus (33.3 % journalistic, 33.3 % fiction, 33.3 % professional texts)

word, lemma, tag, prec

doc, s, p, g

prim-6.1-public-inf

541 million tokens / 425 million words

yes

2013

subcorpus of journalistic (informational) texts

word, lemma, tag, prec

doc, s, p, g

prim-6.1-public-prf

106 million tokens / 84 million words

yes

2013

subcorpus of scientific, professional and popular science texts

word, lemma, tag, prec

doc, s, p, g

prim-6.1-public-img

114 million tokens / 91 million words

yes

2013

subcorpus of fiction texts

word, lemma, tag, prec

doc, s, p, g

prim-6.1-public-sk

558 million tokens / 441 million words

yes

2013

subcorpus of original texts written in Slovak

word, lemma, tag, prec

doc, s, p, g

prim-6.1-public-img-sk

35 million tokens / 28 million words

yes

2013

subcorpus of original Slovak fiction texts

word, lemma, tag, prec

doc, s, p, g

r55az89-3.0

63 million tokens / 51 million words

yes

2013

specific corpus of texts from the years 1955–1989 (11.9 % journalistic, 55.5 % fiction, 24.1 % professional and 8.5 % other texts)

word, lemma, tag, prec

doc, s, p, g

prim-6.0-public-all

1 155 million tokens / 939 million words

yes

2013

all publicly available SNC texts (77.8 % journalistic, 9.8 % fiction, 11 % professional, 1.4 % other texts)

word, lemma, tag, prec

doc, s, p, g

prim-5.0-public-all

719 million tokens / 599 million words

yes

2011

all publicly available SNC texts (73 % journalistic, 14 % fiction, 12 % professional, 1 % other texts)

word, lemma, tag, prec

doc, s, p, br, noise, picture, head, hi, equation, table

prim-4.0-public-all

526 million tokens / 429 million words

yes

2009

all publicly available SNC texts (65 % journalistic, 17 % fiction, 16 % professional, 2 % other texts)

word, lemma, tag, prec

doc, s, p, br, noise, picture, head, hi, equation, table

prim-3.0-public-all

339 million tokens / 276 million words

yes

2007

all publicly available SNC texts (57 % journalistic, 21.5 % fiction, 18.5 % professional, 3 % other texts)

word, lemma, tag, hlemma, htag

doc, s, p, br, noise, picture, head, hi, equation, table

prim-2.1-public-all

294 million tokens / 240 million words

yes

2006

all publicly available SNC texts (63 % journalistic, 20 % fiction, 12 % professional, 5 % other texts)

word, lemma, tag, hlemma, htag

doc, s, p, br, noise, picture, head, hi, equation, table

web-5.0

4 042 million tokens / 3 326 million words

yes

2020

corpus of Slovak texts available on the web

word, lemma, tag, prec, word_lc, lemma_lc

doc, p, s, g, pgap, sgap

web-4.0

2 963 million tokens / 2 440 million words

yes

2018

corpus of Slovak texts available on the web

word, lemma, tag, prec, word_lc, lemma_lc

doc, p, s, g, pgap, sgap

web-3.0

2 372 million tokens / 1 993 million words

yes

2015

corpus of Slovak texts available on the web

word, lemma, tag, prec

doc, p, s, g, gap

wiki-2019-08

51 million tokens / 38 million words

yes

2020

corpus of texts from the Slovak Wikipedia

word, lemma, tag, prec

doc, s, p, m, g

wiki-2018-03

47 million tokens / 35 million words

yes

2018

corpus of texts from the Slovak Wikipedia and Necyklopédia

word, lemma, tag, prec

doc, s, p

wiki-2017-02

45 million tokens / 34 million words

yes

2017

corpus of texts from the Slovak Wikipedia and Necyklopédia

word, lemma, tag, prec

doc, s, p

wiki-2016-02

43 million tokens / 34 million words

yes

2016

corpus of texts from the Slovak Wikipedia and Necyklopédia

word, lemma, tag, prec

doc, s, p

wiki-2015-02

40 million tokens / 32 million words

yes

2015

corpus of texts from the Slovak Wikipedia and Necyklopédia

word, lemma, tag, prec

doc, s, p

prim-7.0-frk

253 million tokens / 203 million words

yes

2018

reference corpus that served as the source for Frekvenčný slovník slovenčiny na báze Slovenského národného korpusu (Slovak Frequency Dictionary Based on the Slovak National Corpus) and for the examples in the publication Skloňovanie podstatných mien v slovenčine s korpusovými príkladmi (Declension of Slovak Nouns with Corpus Examples)

word, lemma, tag, prec

doc, s, p, g

r-mak-6.0

1.2 million tokens / 978 000 words

yes

2017

manually morphologically annotated corpus (30.6 % journalistic, 50.2 % fiction, 19.2 % professional texts)

word, lemma, tag

doc, s, p, br, noise, picture, head, hi, equation, table

r-mak-5.0

1.2 million tokens / 978 000 words

yes

2016

manually morphologically annotated corpus (28.5 % journalistic, 44.5 % fiction, 27 % professional texts)

word, lemma, tag

doc, s, p, br, noise, picture, head, hi, equation, table

r-mak-4.0

1.2 million tokens / 977 000 words

yes

2013

manually morphologically annotated corpus (36.2 % journalistic, 44.9 % fiction, 18.9 % professional texts)

word, lemma, tag

doc, s, p, hi
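The size cells above use several notations ("1 652 million tokens", "1.2 million tokens", "978 000 words"). For readers who want to compare corpora programmatically, the following is a minimal sketch of how such a cell could be converted into absolute counts. The function name `parse_size` and the regular expression are illustrative assumptions and are not part of any SNC tool.

```python
import re

# Matches e.g. "1 652 million tokens", "1.2 million words", "978 000 words".
# The number may contain digit-group spaces and a decimal point.
_COUNT = re.compile(r"(\d[\d.\s]*?)\s*(million|mil\.)?\s*(tokens|words)")

def parse_size(cell):
    """Convert a size cell into absolute counts keyed by 'tokens' / 'words'."""
    counts = {}
    for number, scale, unit in _COUNT.findall(cell):
        value = float(re.sub(r"\s", "", number))  # drop digit-group spaces
        if scale:                                 # "million" or "mil."
            value *= 1_000_000
        counts[unit] = round(value)
    return counts

size = parse_size("1 652 million tokens / 1 282 million words")
# Share of tokens that are counted as words in prim-9.0-public-all.
print(size["words"] / size["tokens"])
```

Cells that list only tokens (as some older corpora do) simply yield a dictionary without a "words" key.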

2. Written corpora − synchronic, specialised

corpus

size − number of tokens / number of words

lemmatisation, morphological annotation

year of release

characteristics

attributes

structures

blf-2.0

66 million tokens / 54 million words

yes

2014

corpus of religious texts

word, lemma, tag, prec

doc, s, p, g

cw-2014-all

1.6 million tokens / 1.2 million words

yes

2014

corpus of copywriting texts

word, lemma, tag, prec

doc, s, p, g

ecn-2.0-public

165 million tokens / 140 million words

yes

2016

corpus of economic texts (3.76 % professional and 96.24 % journalistic texts from the fields of economics, banking, trade, management and merchandising)

word, lemma, tag, prec

doc, s, p, g

ecn-1.0-public

20 million tokens / 17 million words

yes

2014

corpus of economic texts (81.4 % professional and 18.6 % journalistic texts from the fields of economics, banking, trade, management and merchandising)

word, lemma, tag, prec

doc, s, p, g

hum-1.0-public

39 million tokens / 30 million words

yes

2016

corpus of humanities texts

word, lemma, tag, prec

doc, s, p, g

judikat-1.0

1.5 million tokens / 1.3 million words

yes

2015

corpus of judicial decisions

word, lemma, tag, prec

doc, s, p

legal-1.1

49 million tokens / 40 million words

yes

corpus of legal texts (deduplicated)

word, lemma, tag, ftag, rgtag

doc, p, s, s0, g

legal-1.0

147 million tokens / 114 million words

yes

2011

corpus of legal texts

3. Written corpora − parallel

corpus

size − number of tokens / number of words

lemmatisation, morphological annotation

year of release (first version released in)

characteristics

attributes

structures

par-skbg-free-0.1

163 million tokens / 108 million words

yes, both languages

2014 (2014)

Slovak-Bulgarian parallel corpus: 78 million tokens in Slovak half, 85 million tokens in Bulgarian half

word, lemma, tag

doc, s

par-skcs-all-4.0

418 million tokens / 306 million words

yes, both languages

2016 (2010)

Slovak-Czech parallel corpus: 209 million tokens in Slovak half, 209 million tokens in Czech half

word, lemma, tag

doc, s

par-skcs-fic-5.0

31.5 million tokens / 25.0 million words

yes, both languages

2018 (2010)

Slovak-Czech parallel corpus, subcorpus fiction: 15.7 million tokens in Slovak half, 15.8 million tokens in Czech half

word, lemma, tag

doc, s

par-skde-all-2.0

446 million tokens / 300 million words

yes, both languages

2016 (2014)

Slovak-German parallel corpus: 220 million tokens in Slovak half, 226 million tokens in German half

word, lemma, tag

doc, s

par-sken-4.0

556 million tokens / 436 million words

yes, both languages

2015 (2010)

Slovak-English parallel corpus: 261 million tokens in Slovak half, 295 million tokens in English half

word, lemma, tag

doc, s

par-skfr-all-3.0

449 million tokens / 332 million words

yes, both languages

2016 (2006)

Slovak-French parallel corpus: 217 million tokens in Slovak half, 232 million tokens in French half

word, lemma, tag

doc, s

par-skhu-1.0

99 million tokens / 75 million words

yes, both languages

2015 (2014)

Slovak-Hungarian parallel corpus: 51 million tokens in Slovak half, 48 million tokens in Hungarian half

word, lemma, tag

doc, s

par-skhu-0.2

3.9 million tokens

yes, both languages

2015 (2014)

Slovak-Hungarian parallel corpus: 2.0 million tokens in Slovak half, 1.9 million tokens in Hungarian half

word, lemma, tag

doc, s

par-skla-3.0

5.0 million tokens / 4.1 million words

yes, both languages

2018 (2012)

Slovak-Latin parallel corpus: 2.7 million tokens in Slovak half, 2.3 million tokens in Latin half

word, lemma, tag

doc, s

par-skro-1.1

1.3 million tokens / 1.0 million words

yes, both languages

2017 (2016)

Slovak-Romanian parallel corpus: 603 000 tokens in Slovak half, 689 000 tokens in Romanian half

word, lemma, tag

doc, s

par-skpl-1.0

8.2 million tokens / 6.5 million words

yes, both languages

2018 (2018)

Slovak-Polish parallel corpus: 4.1 million tokens in Slovak half, 4.1 million tokens in Polish half

word, lemma, tag

doc, s

par-skru-2.0

8.5 million tokens / 6.6 million words

yes, both languages

2014 (2012)

Slovak-Russian parallel corpus: 4.2 million tokens in Slovak half, 4.2 million tokens in Russian half

word, lemma, tag

doc, s
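Each parallel corpus above states a total size and the size of its two language halves, so the figures can be cross-checked by simple addition. The sketch below does this for a few rows copied from the table. The helper `halves_consistent` and the tolerance of 0.15 million tokens are assumptions made for illustration; the tolerance allows for the published figures being rounded to one decimal place.

```python
# (corpus, stated total, Slovak half, other half), in millions of tokens,
# copied from the rows above.
ROWS = [
    ("par-skbg-free-0.1", 163, 78, 85),
    ("par-skcs-fic-5.0", 31.5, 15.7, 15.8),
    ("par-sken-4.0", 556, 261, 295),
    ("par-skla-3.0", 5.0, 2.7, 2.3),
    ("par-skru-2.0", 8.5, 4.2, 4.2),
]

def halves_consistent(total, slovak, other, tolerance=0.15):
    """True if the two halves add up to the stated total within `tolerance`."""
    return abs(total - (slovak + other)) <= tolerance

for name, total, slovak, other in ROWS:
    status = "ok" if halves_consistent(total, slovak, other) else "check"
    print(f"{name}: {slovak} + {other} vs {total} -> {status}")
```

For par-skru-2.0 the halves sum to 8.4 million against a stated 8.5 million, a gap that is consistent with rounding of the individual figures.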

4. Written corpora of texts before the year 1955

corpus

size − number of tokens / number of words

lemmatisation, morphological annotation

year of release

characteristics

attributes

structures

r864az1843-1.0

2.1 million tokens / 1.6 million words

no

2015

corpus of texts from 864–1843

word

doc, s, p, g

r1843az1954-1.0

24 million tokens / 19 million words

no

2015

corpus of texts from 1843–1954

word

doc, s, p, g

5. Historical corpus

corpus

size − number of tokens / number of words

lemmatisation, morphological annotation

year of release

characteristics

attributes

structures

hist-5.0

998 000 tokens / 731 000 words

no

2020

Corpus of historical Slovak

word, lemma

doc, s, p, g, noise, rem, miss

hist-4.0

918 000 tokens / 668 000 words

no

2016

Corpus of historical Slovak

word, lemma

doc, s, p, g

hist-3.0

836 000 tokens / 600 000 words

no

2015

Corpus of historical Slovak

word, lemma

doc, s, p, g

hist-2.0

552 000 tokens / 422 000 words

no

2014

Corpus of historical Slovak

word, lemma

doc, s, p, g

hist-1.0

371 000 tokens

no

2012

Corpus of historical Slovak

word, nword

doc, s, p, g

6. Spoken corpora − synchronic, standard

corpus

size − number of tokens / number of words

lemmatisation, morphological annotation

year of release

characteristics

attributes

structures

s-hovor-6.0

6.6 million tokens / 5.5 million words

yes

2017

Corpus of spoken Slovak

word, pron, lemma, tag, prec

structures for s-hovor-6.0

s-hovor-5.0

5.7 million tokens / 4.7 million words

yes

2015

Corpus of spoken Slovak

word, pron, lemma, tag, prec

doc, section, turn, event, sync, background, who, spk

s-hovor-4.0

2.6 million tokens / 2.2 million words

yes

2012

Corpus of spoken Slovak

word, pron, lemma, tag, prec

doc, section, turn, event, sync, background, who, spk

s-hovor-3.0

2.1 million tokens / 1.4 million words

yes

2011

Corpus of spoken Slovak

word, pron, lemma, tag, dcount

doc, section, turn, event, sync, background, who

s-hovor-2.0

679 000 tokens / 561 000 words

yes

2010

Corpus of spoken Slovak

word, pron, lemma, tag, dcount

doc, section, turn, event, sync, background, who

s-hovor-1.0

128 000 tokens / 104 000 words

yes

2008

Corpus of spoken Slovak

word, pron, lemma, tag, dcount

doc, section, turn, event, sync, background, who

7. Corpora of dialects of the SNC

corpus

size − number of tokens / number of words

lemmatisation, morphological annotation

year of release

characteristics

attributes

structures

dialekt-4.0

712 000 tokens / 571 000 words

no

2020

Corpora of dialects of the SNC

word, lemma

doc, spk, s, p, rem

dialekt-3.0

495 000 tokens / 403 000 words

no

2016

Corpora of dialects of the SNC

word, lemma

doc, spk, s, p, rem

dialekt-2.0

329 000 tokens / 252 000 words

no

2015

Corpora of dialects of the SNC

word, lemma

doc, spk, s, p, rem

dialekt-1.0

74 000 tokens / 55 000 words

no

2014

Corpora of dialects of the SNC

word, lemma

doc, s, p