Corpus of Texts from 1843–1954

The latest version of the corpus, r1843az1954-2.0, was created on 16 January 2026 and has been available since 5 February 2026. It contains almost 4 million tokens (3,897,816 tokens; 3,112,096 words).

Texts from the SME Golden Fund were excluded from the current version of the corpus because they contained grammatical, syntactic, and lexical changes to the original texts. Similarly, texts whose Dateorig (first edition) falls between 1843 and 1954 but which were actually published much later are not included in the corpus. After this selection, the texts remaining from the first version of the corpus were supplemented with new texts, which were processed and prepared at the Slovak National Corpus during 2024 and 2025 from the obtained originals.

The main difference between the first and second versions of the corpus is the change in the keys used to build it. The previous version, r1843az1954-1.0, used two keys: Date and Dateorig. The new version uses only the Date key (year of publication between 1843 and 1954), so that the texts reflect the language and grammatical principles of standard Slovak at the time of publication, as well as the principles of the editors or publishers of that time.
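The change in selection keys can be illustrated with a minimal Python sketch. The record layout, the function names, and the sample records are hypothetical, introduced only for illustration. The sketch also assumes that version 1.0 accepted a text when either key fell within the range; the description states only which keys were used, not how they were combined.

```python
from dataclasses import dataclass

START_YEAR, END_YEAR = 1843, 1954


@dataclass
class TextRecord:
    # Hypothetical record layout; only the key names Date and Dateorig
    # come from the corpus description.
    title: str
    date: int      # Date: year of publication of the edition used
    dateorig: int  # Dateorig: year of the first edition


def in_range(year: int) -> bool:
    """Return True if the year lies within the period covered by the corpus."""
    return START_YEAR <= year <= END_YEAR


def selected_v1(rec: TextRecord) -> bool:
    # Assumption: version 1.0 accepted a text if either key was in range.
    return in_range(rec.date) or in_range(rec.dateorig)


def selected_v2(rec: TextRecord) -> bool:
    # Version 2.0 looks at the Date key only.
    return in_range(rec.date)


# Made-up sample records, not actual corpus texts.
records = [
    TextRecord("period edition", date=1900, dateorig=1900),
    TextRecord("later reprint", date=2005, dateorig=1880),
]

v2_titles = [r.title for r in records if selected_v2(r)]
print(v2_titles)
```

Under these assumptions, a later reprint of a text first published within the period passes the version 1.0 check but is left out by the version 2.0 check.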

In the latest version of the corpus of texts from 1843–1954, each text carries detailed bibliographical, style, and genre annotation. All text units are experimentally lemmatized and morphologically annotated using the spaCy tool trained on the Slovak National Corpus, and an internal database of word forms was used to correct incorrectly identified forms.
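The correction step can be pictured as a lookup that overrides the automatic annotation whenever a word form is found in the database. The following Python sketch is a hypothetical illustration of that idea, not the procedure actually used: the structure of the internal database is not described here, the tags are placeholders rather than the corpus tagset, and the automatic annotation is represented by hand-written tokens instead of real spaCy output.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # word form as it appears in the text
    lemma: str  # lemma assigned by the automatic annotation
    tag: str    # morphological tag (placeholder values, not the corpus tagset)


# Hypothetical excerpt of a word-form database: lowercased form -> (lemma, tag).
WORD_FORM_DB = {
    "bol": ("byť", "VERB"),
}


def correct(tokens: list[Token], db: dict[str, tuple[str, str]]) -> list[Token]:
    """Override the automatic lemma and tag for forms listed in the database."""
    corrected = []
    for tok in tokens:
        entry = db.get(tok.form.lower())
        if entry is None:
            # Form not in the database: keep the automatic annotation.
            corrected.append(tok)
        else:
            lemma, tag = entry
            corrected.append(Token(tok.form, lemma, tag))
    return corrected


# Hand-written stand-in for automatic output with one misidentified form.
automatic = [
    Token("Bol", "bol", "NOUN"),
    Token("doma", "doma", "ADV"),
]
fixed = correct(automatic, WORD_FORM_DB)
```

In this sketch the database entry always wins over the automatic analysis; a real workflow might instead apply corrections only in selected cases.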

Version 1.0

The first version of the corpus, r1843az1954-1.0, containing 24 million tokens, was released on 5 February 2015. It contains a range of publications, mostly from the so-called Zlatý fond SME (SME Golden Fund). The corpus includes texts written after the language standardization attempts by Ludevít Štúr. The transcribed texts follow the grammatical principles used at that time, as well as the principles of the editors or publishers of that time. The corpus includes basic bibliographical and style annotation; the texts are neither lemmatized nor morphologically annotated. Users can search for a word form or use CQL (Corpus Query Language).