The current version, par-skhu-1.0, released on 17 December 2015, contains 99 million tokens (51 million in the Slovak half and 48 million in the Hungarian half).
The corpus consists of two parts: a fiction subcorpus (4 million tokens, 2 million per language) and a subcorpus of freely available texts. To access the fiction subcorpus, you can use the NoSketch Engine interface to query either the Hungarian texts or the Slovak texts.
The Slovak-Hungarian Parallel Corpus is a database of texts in both Slovak and Hungarian. Slovak texts are translated into Hungarian or vice versa; the freely available texts were translated from a third language. The texts are automatically aligned at the sentence level. The Slovak texts are automatically morphologically annotated by the Morče tagger, trained on the Slovak tagset developed by SNK. The Hungarian texts are annotated by the HUNPOS tagger.
The previous version, par-skhu-0.2, was released in May 2015 and contained 4 million tokens (approximately 2 million tokens per language).
The pilot version, par-skhu-0.1, was released in January 2014 and contained 3 million tokens (approximately 1.5 million tokens per language).
Developed jointly by Slovenský národný korpus, Jazykovedný ústav Ľ. Štúra SAV and Magyar Tudományos Akadémia, Nyelvtudományi Intézet.