Corpus of Historical Slovak

The Corpus of Historical Slovak (CHS) is a diachronic corpus of texts from the pre-codification period. It contains both texts transliterated by the project team from photocopies of the originals and printed editions that preserve the original orthography.

The first version, hist-1.0, containing 370,758 tokens, was released in December 2012. It comprised electronically processed texts published in Pramene k dejinám slovenčiny, I. – III. The current, sixth version, hist-6.0, contains 916,743 tokens from 20 texts. The sources processed in the hist-6.0 corpus are listed with full bibliographic data in the section Text sources and corpus versions.

To improve the quality of the corpus, the annotation was unified compared to the previous version: the duplicate annotation of language and abbreviations was removed. The current annotation is described in the section Specific Structure Tags. Texts whose original orthography had been modified were also removed. The only remaining text with a known orthographic intervention (an adaptation to Bernolák’s Slovak) is Valaská škola by H. Gavlovič, because the intervention was discovered only after the release of the hist-6.0 corpus. This error will be corrected in the next version of CHS.

Texts in CHS are neither lemmatized nor morphologically annotated; users can search for a word form or use CQL (Corpus Query Language). Transliterated texts include information about the origin of the text, its storage (or release), and its date. The corpus is accessible after registration in the NoSketch Engine.