Slovak-Ukrainian Parallel Corpus

The first version of the Slovak-Ukrainian Parallel Corpus par-skuk-1.0, made available on 22 March 2023, contains 4.3 million tokens (2.1 million tokens in the Slovak half, 2.2 million in the Ukrainian half).

The corpus consists mostly of fiction: 64 % of the texts were originally written in Ukrainian, 7 % in Slovak, and the remaining 29 % are translations from Russian and Polish. Searching the corpus (the Ukrainian part or the Slovak part) via NoSketchEngine requires registration.
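As a quick sanity check on the figures quoted above, the following minimal sketch recomputes the total size and each language's share of the tokens. It uses only the rounded numbers given in this description, so the results are approximate; the variable names are illustrative and not part of any corpus tooling.

```python
# Rounded figures quoted in the corpus description.
# Token counts are in millions; source shares are percentages of texts.
tokens_millions = {"Slovak": 2.1, "Ukrainian": 2.2}
source_share_percent = {
    "originally Ukrainian": 64,
    "originally Slovak": 7,
    "translated from Russian and Polish": 29,
}

# Total corpus size, rounded to the same precision as the inputs.
total = round(sum(tokens_millions.values()), 1)

# Approximate share of tokens in each language half, in percent.
shares = {
    lang: round(100 * count / total, 1)
    for lang, count in tokens_millions.items()
}

print(f"Total: {total} million tokens")
for lang, share in shares.items():
    print(f"{lang}: about {share} % of tokens")
print(f"Source shares sum to {sum(source_share_percent.values())} %")
```

The two halves come out nearly balanced, as expected for a sentence-aligned parallel corpus. Note that the source-language percentages refer to texts, not tokens, so they cannot be converted into token counts from the information given here.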

The texts are automatically sentence-aligned. The Slovak texts are morphologically annotated by the MorphoDiTa tagger, trained and tuned on the tagset developed by the SNK. The Ukrainian texts are annotated by UDPipe, trained on the MULTEXT-East tagset.