Slovak-Hungarian Parallel Corpus


The current version, par-skhu-1.1, was developed on 20 January 2023 and made accessible on 26 January 2023.

Compared to the previous version, the content of the corpus remains unchanged, but style and genre metadata have been added to the texts. Users can therefore search the corpus by the keys of style and genre annotation as well as by the keys of bibliographical annotation.

The corpus consists of two parts: the subcorpus of fiction (4 million tokens, 2 million per language) and the subcorpus of freely available texts. To access the fiction subcorpus, you can use the NoSketch Engine interface to query the Hungarian texts, or the Slovak texts.

To access the whole corpus, use the NoSketch Engine interface to query the Hungarian texts, or the Slovak texts; knowledge of the NoSketch Engine and CQL is recommended.
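For readers new to CQL, the following is a minimal sketch of how a simple query string might be assembled before pasting it into the NoSketch Engine query field. The helper names (cql_token, cql_sequence) and the attribute names word and lemma are assumptions made for illustration; the attributes actually available in this corpus should be checked in the interface.

```python
def cql_token(**attrs):
    """Build one CQL token expression, e.g. [lemma="dom"].

    Attribute names are assumptions; values are passed through as
    regular expressions, with only double quotes escaped.
    """
    parts = []
    for name, value in attrs.items():
        escaped = value.replace('"', '\\"')
        parts.append(f'{name}="{escaped}"')
    # Several conditions on the same token are joined with "&".
    return "[" + " & ".join(parts) + "]"


def cql_sequence(*tokens):
    """Join token expressions into a query matching consecutive tokens."""
    return " ".join(tokens)


# Hypothetical example: the lemma "dom" followed by the word form "na".
query = cql_sequence(cql_token(lemma="dom"), cql_token(word="na"))
print(query)  # [lemma="dom"] [word="na"]
```

The resulting string is an ordinary CQL expression, so it can be typed by hand just as well; the helpers only make the bracket and quoting conventions explicit.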

The Slovak-Hungarian Parallel Corpus is a database containing texts in both Slovak and Hungarian. Slovak texts are translated into Hungarian or vice versa; the freely available texts were translated from a third language. The texts are automatically aligned at the sentence level. The Slovak texts are automatically morphologically annotated by the Morče tagger, trained on the Slovak tagset developed by SNK. The Hungarian texts are annotated by the HUNPOS tagger.


Previous versions

The version par-skhu-1.0 from 17 December 2015 contains 99 million tokens (51 million in the Slovak half, 48 million in the Hungarian half).

The earlier version par-skhu-0.2 was released in May 2015 and contained 4 million tokens (approximately 2 million tokens per language).

The pilot version par-skhu-0.1 was released in January 2014 and contained 3 million tokens (approximately 1.5 million tokens per language).

Developed jointly by Slovenský národný korpus, Jazykovedný ústav Ľ. Štúra SAV and Magyar Tudományos Akadémia, Nyelvtudományi Intézet.