Slovak-Hungarian Parallel Corpus

The current version, par-skhu-1.0, dated 17 December 2015, contains 99 million tokens (51 million in the Slovak half and 48 million in the Hungarian half).

The corpus consists of two parts: a fiction subcorpus (4 million tokens, 2 million per language) and a subcorpus of freely available texts. To access the fiction subcorpus, use the NoSketch Engine interface to query either the Hungarian texts or the Slovak texts.

To access the whole corpus, use the NoSketch Engine interface to query either the Hungarian texts or the Slovak texts. Familiarity with NoSketch Engine and CQL is recommended.
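
For readers new to CQL, the following is a minimal sketch of how token-level queries are composed. It only builds query strings that can be pasted into the NoSketch Engine query box; it does not contact any server. The helper names cql_token and cql_sequence are hypothetical, and the attribute name lemma is an assumption about how this corpus is indexed, so check the corpus information page for the attributes actually available.

```python
def cql_token(**attrs):
    """Build one CQL token expression, e.g. [lemma="dom"].

    Attribute names (lemma, word, tag, ...) are assumptions; the
    attributes actually available depend on how the corpus is indexed.
    Values are passed through unchanged, so they may be regular expressions.
    """
    if not attrs:
        return "[]"  # an empty pair of brackets matches any single token
    parts = []
    for name, value in attrs.items():
        if '"' in value:
            # Keep the sketch simple: refuse values that would need escaping.
            raise ValueError("double quotes in values are not supported here")
        parts.append(f'{name}="{value}"')
    # Several conditions on the same token are joined with "&".
    return "[" + " & ".join(parts) + "]"


def cql_sequence(*tokens):
    """Join token expressions into a query for consecutive tokens."""
    return " ".join(tokens)


# Hypothetical example: the lemma "pekný" immediately followed by the lemma "dom".
query = cql_sequence(cql_token(lemma="pekný"), cql_token(lemma="dom"))
print(query)
```

The printed string is what would be entered as a CQL query in the interface; restricting results by the aligned language is done in the interface itself and is not covered by this sketch.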

The Slovak-Hungarian Parallel Corpus is a database of texts in both Slovak and Hungarian. The texts were translated from Slovak into Hungarian or vice versa; the freely available texts were translated from a third language. All texts are automatically aligned at the sentence level. The Slovak texts are automatically morphologically annotated by the Morče tagger, trained on the Slovak tagset developed by SNK. The Hungarian texts are annotated by the HUNPOS tagger.

Previous versions

The previous version, par-skhu-0.2, was released in May 2015 and contained 4 million tokens (approximately 2 million tokens per language).

The pilot version, par-skhu-0.1, was released in January 2014 and contained 3 million tokens (approximately 1.5 million tokens per language).

Developed jointly by Slovenský národný korpus, Jazykovedný ústav Ľ. Štúra SAV and Magyar Tudományos Akadémia, Nyelvtudományi Intézet.