Slovak-Czech Parallel Corpus

The most recent releases of the Slovak-Czech Parallel Corpus are as follows:

  • a new version of the fiction subcorpus, par-skcs-fic-5.0, was released on 13 December 2018; it contains 31.5 million tokens (15.72 million tokens in the Slovak half, 15.77 million tokens in the Czech half),
  • the complete corpus par-skcs-all-4.0, released on 25 May 2016 and containing 418.5 million tokens (209.2 million tokens in the Slovak half, 209.3 million tokens in the Czech half), remains available in its original version.

The corpus par-skcs-all-4.0 consists of two parts, the subcorpus of fiction and the free subcorpus:

  • the subcorpus of fiction (19 million tokens) contains, apart from fiction, also popular science, literary non-fiction, etc. Subcorpus par-skcs-fic-4.0 contains texts identical to those in par-skcs-fic-3.0,
  • the free subcorpus consists of EU legal texts and reports, and of computer and other manuals translated from a third language (English). The texts can be downloaded here.

Compared with previous versions, subcorpus par-skcs-fic-5.0 includes more than 12 million additional tokens and 217 books (116 translated from Slovak into Czech, 56 translated from Czech into Slovak, 3 written in both Czech and Slovak by the same author (V. Zamarovský), 28 texts translated into both Czech and Slovak from a third language (English), and 14 texts translated into both Slovak and Czech from other languages).
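The figures quoted for par-skcs-fic-5.0 can be cross-checked with simple arithmetic. The snippet below is only a sanity check on the numbers given in this text; the dictionary keys are illustrative labels, not official category names.

```python
# Book counts for par-skcs-fic-5.0 as quoted in the text.
# The keys are illustrative labels chosen for this sketch.
books_by_origin = {
    "translated from Slovak into Czech": 116,
    "translated from Czech into Slovak": 56,
    "written in both languages by the same author": 3,
    "translated into both from English": 28,
    "translated into both from other languages": 14,
}
total_books = sum(books_by_origin.values())

# Token counts (in millions) for the two halves of the subcorpus.
slovak_tokens_m = 15.72
czech_tokens_m = 15.77
total_tokens_m = round(slovak_tokens_m + czech_tokens_m, 1)

print(total_books, total_tokens_m)
```

The categories add up to the 217 books stated, and the two halves round to the 31.5 million tokens quoted for the release.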

The Slovak-Czech Parallel Corpus is a database of texts that are translations of each other: Slovak texts are translated into Czech or vice versa. Texts are automatically sentence-aligned. Slovak texts are automatically morphologically annotated by the taggers Morče and MorphoDiTa, which have been trained and tuned on the tagset developed by the Slovak National Corpus. Czech texts are annotated by the tagger Morče, which has been trained and tuned on the tagset developed by the Czech National Corpus.
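To make the notions of sentence alignment and morphological annotation more concrete, here is a minimal sketch of how one aligned sentence pair might be represented in memory. It does not reflect the corpus's actual storage or export format, which is not described here: the class names, field names and the "TAG" placeholder values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Token:
    """One annotated token (hypothetical layout)."""
    form: str   # word form as it appears in the text
    lemma: str  # dictionary form
    tag: str    # morphological tag; "TAG" below is a placeholder, not a real tag


@dataclass(frozen=True)
class AlignedPair:
    """One Slovak sentence aligned with its Czech counterpart (hypothetical layout)."""
    sk: Tuple[Token, ...]
    cs: Tuple[Token, ...]


def surface(tokens: Tuple[Token, ...]) -> str:
    """Join the word forms of a tokenized sentence with single spaces."""
    return " ".join(t.form for t in tokens)


# A toy example pair; the tags are placeholders rather than values
# from the Slovak or Czech National Corpus tagsets.
pair = AlignedPair(
    sk=(Token("Ďakujem", "ďakovať", "TAG"), Token(".", ".", "TAG")),
    cs=(Token("Děkuji", "děkovat", "TAG"), Token(".", ".", "TAG")),
)

print(surface(pair.sk), "|", surface(pair.cs))
```

Keeping both halves in one record mirrors the idea that each unit in the corpus is a pair of sentences, with annotation attached to individual tokens on each side.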

There are several ways to query the corpus:

Version 4.0

This version includes the same texts as par-skcs-3.0.

Version 3.0

The corpus par-skcs-3.0 was released in January 2014. The database contained 240 million tokens (119.4 million in the Slovak half, 119.53 million in the Czech half).

The subcorpus of fiction par-skcs-fic-3.0 contained 19 million tokens (approximately 9.5 million for each half).

Version 2.0

The corpus par-skcs-2.0 contained 6 433 thousand sentence pairs (approximately 120 million tokens for each half).

The subcorpus of fiction contained 740 thousand sentence pairs (approximately 10 million tokens for each half).

Version 1.0

The corpus par-skcs-1.0 contained 735 thousand sentence pairs (10 million tokens per language).

Development of the free corpus was supported by the EC grant FP7-ICT-2009-5 Bringing Machine Translation for European Languages to the User – Enlarged European Union (EuroMatrixPlus-X).

Developed jointly by the Slovak National Corpus, Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences; the Czech National Corpus, Faculty of Arts, Charles University in Prague; and the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague.