Slovak-English Parallel Corpus

The current version, par-sken-5.0, released in May 2025, contains 526 million tokens (277 million tokens in the Slovak half, 249 million tokens in the English half).

The corpus consists of two parts: the subcorpus of fiction (88 million tokens – 40 million tokens in the Slovak part, 48 million tokens in the English part) and the subcorpus of freely available texts, mainly EU documents.
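
The size of the freely available part is not stated directly, but it can be estimated from the figures above. The following is a minimal sketch of that arithmetic; the variable names are illustrative, and because the published counts are rounded to millions, the derived values are approximate.

```python
# Token counts for par-sken-5.0 in millions, as stated in the text above.
total = {"sk": 277, "en": 249}
fiction = {"sk": 40, "en": 48}

# Remaining tokens per language: whole corpus minus the fiction subcorpus.
# The inputs are rounded to millions, so the results are approximate.
non_fiction = {lang: total[lang] - fiction[lang] for lang in total}

print(sum(total.values()))        # 526
print(sum(fiction.values()))      # 88
print(non_fiction)                # {'sk': 237, 'en': 201}
print(sum(non_fiction.values()))  # 438
```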

Free registration is required to access the whole corpus. Use the NoSketch Engine web interface to query the Slovak texts or the English texts. Knowledge of NoSketch Engine and CQL (Corpus Query Language) is recommended.
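
For readers new to CQL, the sketch below assembles a few query strings of the kind that can be typed into the NoSketch Engine query field. The helper functions are hypothetical, and the attribute names word, lemma and tag are an assumption based on common NoSketch Engine setups; check the corpus information page for the attributes this corpus actually provides.

```python
def cql_token(**attrs):
    """Build one CQL token expression, e.g. [lemma="dom"].

    Attribute values are regular expressions in CQL, so they are passed
    through unchanged. Attribute names are assumptions, not taken from
    the corpus documentation.
    """
    conditions = [f'{name}="{value}"' for name, value in attrs.items()]
    return "[" + " & ".join(conditions) + "]"


def cql_sequence(*tokens):
    """Join token expressions into a query matching consecutive tokens."""
    return " ".join(tokens)


# A Slovak query: any form of the lemma "dom".
slovak_query = cql_token(lemma="dom")

# An English query: an adjective (Penn tag JJ) followed by the lemma "house".
english_query = cql_sequence(cql_token(tag="JJ"), cql_token(lemma="house"))

print(slovak_query)   # [lemma="dom"]
print(english_query)  # [tag="JJ"] [lemma="house"]
```

Several conditions inside one pair of brackets are combined with &, so cql_token(tag="NN.*", word="h.*") describes a single token that satisfies both conditions.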

The Slovak-English Parallel Corpus is a database of texts in both Slovak and English: Slovak texts translated into English, or vice versa. Texts are automatically aligned at the sentence level. Slovak texts are automatically morphologically annotated by the Morče tagger, which has been trained and tuned on the tagset developed by the Slovak National Corpus (SNK). English texts are part-of-speech tagged with the Penn Treebank tagset, using the TreeTagger software.
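
To make the description above concrete, the following sketch shows one possible in-memory representation of a sentence-aligned, annotated pair. It is not the corpus's storage format: the class names, the example sentence and the lookup function are hypothetical. The English tags follow the Penn Treebank style, but the labels actually stored in the corpus may differ; the Slovak tags are left empty because the SNK tagset is not described here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    word: str                  # surface form as it appears in the text
    lemma: str                 # dictionary form
    tag: Optional[str] = None  # morphological or POS tag; None if not given


@dataclass
class AlignedPair:
    slovak: List[Token]   # one Slovak sentence
    english: List[Token]  # the English sentence aligned with it


def find_pairs(pairs, english_lemma):
    """Return the pairs whose English side contains the given lemma."""
    return [
        pair for pair in pairs
        if any(token.lemma == english_lemma for token in pair.english)
    ]


# An invented example pair, for illustration only.
pair = AlignedPair(
    slovak=[
        Token("Dom", "dom"),
        Token("je", "byť"),
        Token("starý", "starý"),
        Token(".", "."),
    ],
    english=[
        Token("The", "the", "DT"),
        Token("house", "house", "NN"),
        Token("is", "be", "VBZ"),
        Token("old", "old", "JJ"),
        Token(".", ".", "."),
    ],
)

matches = find_pairs([pair], "house")
print(len(matches))                          # 1
print([t.word for t in matches[0].slovak])   # ['Dom', 'je', 'starý', '.']
```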

The difference in data between versions 4 and 5 is due to license limitations.

Version 4.0

Version par-sken-4.0, released in December 2015, contains 556 million tokens (261 million tokens in the Slovak half, 295 million tokens in the English half).

The corpus consists of two parts: the subcorpus of fiction (200 million tokens – 92 million tokens in the Slovak part, 108 million tokens in the English part) and the subcorpus of freely available texts. To query the subcorpus of fiction, use NoSketch Engine.

Version 3.0

There are several ways to query the corpus:

The most straightforward – a simple WWW interface. Enter the query term (a Slovak or English word, a lemma or a regular expression) into the Search input field. In the corpus selection box, choose the desired source (par-sken-3.0-sk for Slovak texts and par-sken-3.0-en for English texts). Clicking on the leftmost column displays a short bibliography.
The simplest – a dictionary interface. It does not contain the whole corpus, only automatically selected translation equivalents.

Version 2.0

The whole parallel corpus par-sken-2.0 contained 10 million sentence pairs (196 million tokens in the English half, 173 million tokens in the Slovak one).

The subcorpus of fiction contained 4 million sentence pairs (63 million tokens in the English half, 54 million tokens in the Slovak one).

Version 1.0

Corpus par-sken-1.0 contained 1.6 million sentence pairs (24 million tokens in the English half, 20 million tokens in the Slovak one).

The corpus par-sken-1.0 was supported by the EC grant FP7-ICT-2009-5 Bringing Machine Translation for European Languages to the User – Enlarged European Union (EuroMatrixPlus-X). The extended version of par-sken-2.0 was also supported by the EC grant.

Developed jointly by the Slovak National Corpus, Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, and the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague.

