Version 6.0
The sixth version, wiki-2019-08, containing 50 619 991 tokens, was made available in January 2020. The corpus contains texts from Slovak Wikipédia, as of 2019-08-01. This version carries four notable changes:
- several errors in parsing MediaWiki markup have been fixed;
- mathematical expressions (<math> elements) in articles have been substituted with the single token <m/>;
- we use the new structure <g/> (glue) where there is no space character between the surrounding tokens in the original text;
- Necyklopédia has been excluded from the corpus.
The corpus is lemmatized (the lemma is capitalized when it is a proper noun) and morphologically annotated; information on the source is provided.
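The role of <g/> and <m/> can be illustrated with a minimal Python sketch that rebuilds running text from tokenized input. It is not the tooling used to build the corpus. It assumes a vertical layout with one token per line, the word form in the first tab-separated column, and <g/> and other structural tags on lines of their own; the function name detokenize, the [math] placeholder and the sample annotations are hypothetical.

```python
def detokenize(vertical_lines, math_placeholder="[math]"):
    """Rebuild running text from token lines, honouring <g/> glue marks.

    Assumed (not confirmed by the corpus documentation): one token per
    line, word form in the first tab-separated column, structural tags
    such as <s> or <doc ...> on their own lines.
    """
    pieces = []
    glue = False  # True when a <g/> was seen since the last token
    for raw in vertical_lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.strip() == "<g/>":
            # No space stood here in the original text.
            glue = True
            continue
        word = line.split("\t", 1)[0]
        if word == "<m/>":
            # A mathematical expression replaced by a single token.
            word = math_placeholder
        elif word.startswith("<") and word.endswith(">"):
            # Any other structural tag carries no text; skip it.
            continue
        if pieces and not glue:
            pieces.append(" ")
        pieces.append(word)
        glue = False
    return "".join(pieces)


# Hypothetical sample; the lemma and tag columns are placeholders.
sample = [
    "<s>",
    "Vzorec\tvzorec\tTAG",
    "<m/>",
    "platí\tplatiť\tTAG",
    "<g/>",
    ".\t.\tTAG",
    "</s>",
]
print(detokenize(sample))
```

Without the <g/> line, the final full stop would be separated from the preceding word by a space, which is the ambiguity the glue structure is meant to remove.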
Version 5.0
The fifth version, wiki-2018-03, containing 47 283 205 tokens, was made available in May 2018. The corpus contains texts from Slovak Wikipédia and Necyklopédia, as of 2018-03-15. It is lemmatized (the lemma is capitalized when it is a proper noun) and morphologically annotated; information on the source is provided.
Version 4.0
The fourth version, wiki-2017-02, containing 45 109 693 tokens, was made available in March 2017. The corpus contains texts from Slovak Wikipédia and Necyklopédia, as of 2017-02-28. It is lemmatized (the lemma is capitalized when it is a proper noun) and morphologically annotated; information on the source is provided.
Version 3.0
The third version, wiki-2016-02, containing 42 615 597 tokens, was made available in March 2016. The corpus contains texts from Slovak Wikipédia and Necyklopédia, as of 2016-02-26. It is lemmatized and morphologically annotated; information on the source is provided.
Version 2.0
The second version, wiki-2015-02, containing 40 million tokens, was released in March 2015. It includes texts from Slovak Wikipédia and Necyklopédia, as of February 2015.
Version 1.0
The first version, wiki-2014-02, containing 37 548 997 tokens, was released in February 2014.