Web corpus

The current version, web-8.0, was released on 28 February 2026 and contains 5 889 464 749 tokens.

The corpus contains data from a beta version of Araneum Slovacum VII Maximum (26.03), developed within the Aranea project. It comprises texts from 2013 to 2026. Sentence segmentation and tokenization follow the original Araneum Slovacum corpus, with improved tokenization compared to the previous version.

Compared to the previous version, the corpus increased by approximately 600 000 000 tokens.

If you use the corpus in your work or wish to cite it, please use the following reference:

Benko, V. (2024). The Aranea Corpora Family: Ten+ Years of Processing Web-Crawled Data. In: Nöth, E., Horák, A., Sojka, P. (eds) Text, Speech, and Dialogue. TSD 2024. Lecture Notes in Computer Science, vol 15048. Springer, Cham.
https://doi.org/10.1007/978-3-031-70563-2_5

Version 7.0

The corpus web-7.0 was released in February 2024 and contains 5 300 485 736 tokens.

The corpus contains data from a beta version of Araneum Slovacum VII Maximum (24.02), developed within the Aranea project (base corpus author – Vladimír Benko). It comprises texts from 2013 to 2024. Sentence segmentation and tokenization follow the original Araneum Slovacum corpus, but the data are further lemmatized and morphologically annotated by the tagger that is also used for the main corpus prim-10.0. The texts include information on their URL and retrieval time.

Compared to the previous version, the corpus grew by nearly a billion tokens.

Version 6.0

The corpus web-6.0 was created in March 2022 and released in June 2022. It contains 4 373 231 228 tokens.

The corpus contains data from a beta version of Araneum Slovacum VI Maximum (22.01), developed within the Aranea project (base corpus author – Vladimír Benko). The tokenised and segmented data are further lemmatised and morphologically annotated by the MorphoDiTa tagger, trained and tuned on the SNC tagset that is also used for tagging the written corpora. The texts are provided with basic information on their URL and time of retrieval.

Version 5.0

Version web-5.0 of the corpus, containing 4 042 363 283 tokens, was released in January 2020.

This corpus contains data from the Araneum Slovacum V Maximum (20.01) web corpus, developed within the Aranea project (base corpus author – Vladimír Benko). The data were tokenised, segmented, lemmatised and morphologically annotated by the MorphoDiTa tagger, trained and tuned on the SNC tagset. The texts are provided with basic information on their URL and time of retrieval.

Version 4.0

Version web-4.0, containing 2 963 462 451 tokens, was released in January 2018.

This version contains Slovak texts obtained within the Araneum project (base corpus author – Vladimír Benko). The corpus is lemmatised and morphologically annotated by the MorphoDiTa tagger, which was trained and tuned on the tagset developed by the SNC. The texts are provided with basic information on their URL and time of retrieval.

Version 3.0

Version web-3.0 of the corpus, containing 2 372 769 958 tokens, was released in March 2015.

The web corpus was compiled from three sources: Slovak texts downloaded from the web and provided by the Faculty of Informatics of Masaryk University in Brno in 2010 (988 474 323 tokens, including duplicate content and texts in Czech); Slovak texts downloaded from the web by the SNC during 2011–2012 (489 869 717 tokens, excluding duplicate content and foreign texts); and Slovak texts from the Araneum project (3 221 914 708 tokens, including duplicate content and foreign texts).

The corpus texts are lemmatised and morphologically annotated, and bibliographic information is provided. Lists of the 1000 most frequent word forms and lemmas are also available.

Version 2.0

Version web-2.0, containing 1 045 558 148 tokens, was released in March 2012.

Version 1.0

The first version, web-1.0, was released in 2011. The corpus, containing 952 095 260 tokens, was developed jointly with the Faculty of Informatics, Masaryk University in Brno.
