Slovak-English Parallel Corpus

The current version, par-sken-4.0, released in December 2015, contains 556 million tokens (261 million tokens in the Slovak half, 295 million tokens in the English one).

The corpus consists of two parts: the subcorpus of fiction (200 million tokens – 92 million tokens in the Slovak part, 108 million tokens in the English part) and the subcorpus of freely available texts. To query the subcorpus of fiction, use NoSketch Engine for the English half or for the Slovak half.

To access the whole corpus, use the NoSketch Engine web interface to query the Slovak texts or the English texts. Knowledge of NoSketch Engine and CQL is recommended.

The Slovak-English Parallel Corpus is a database containing texts in both Slovak and English. Slovak texts are translated into English or vice versa. Texts are automatically aligned at the sentence level. Slovak texts are automatically morphologically annotated by the tagger Morče, which has been trained and tuned on the tagset developed by the SNK. English texts are part-of-speech tagged with the Penn Treebank tagset, using the TreeTagger software.
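Because the texts are lemmatized and tagged, CQL queries can address these annotations as well as the word forms. The following is a minimal Python sketch of how such query strings might be assembled before pasting them into NoSketch Engine. The helper names cql_token and cql_sequence are hypothetical, and the attribute names word, lemma and tag are assumed (they are common in NoSketch Engine corpora but are not confirmed here for par-sken); check the attribute list of the corpus you are querying.

```python
def cql_token(**conditions: str) -> str:
    """Build a CQL expression for one token, e.g. [lemma="pes" & tag="NN.*"].

    Each keyword is an attribute name (assumed: word, lemma, tag) and each
    value is a regular expression that the attribute must match.
    """
    parts = [f'{attr}="{value}"' for attr, value in conditions.items()]
    return "[" + " & ".join(parts) + "]"


def cql_sequence(*tokens: str) -> str:
    """Join token expressions into a query for adjacent tokens."""
    return " ".join(tokens)


# All forms of one lemma (Slovak side, assuming a "lemma" attribute).
print(cql_token(lemma="pes"))

# Word forms starting with "corp" that are tagged as nouns (English side,
# assuming Penn Treebank style tags in a "tag" attribute).
print(cql_token(word="corp.*", tag="NN.*"))

# An adjective immediately followed by the lemma "corpus".
print(cql_sequence(cql_token(tag="JJ"), cql_token(lemma="corpus")))
```

The sketch only builds strings; it does not contact any server. Values are passed through unchanged, so they are interpreted as regular expressions by the query engine.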

Version 3.0

There are several ways to query the corpus:

  • The most straightforward – a simple WWW interface. Enter the query term (a Slovak or English word, a lemma, or a regular expression) into the input field Search. In the selection box corpus, choose the desired source (par-sken-3.0-sk for Slovak texts and par-sken-3.0-en for English texts). Clicking on the leftmost column displays a short bibliography.

  • The simplest – a dictionary interface. This does not contain the whole corpus, just automatically selected translation equivalents.

Version 2.0

The whole parallel corpus par-sken-2.0 contained 10 million sentence pairs (196 million tokens in the English half, 173 million tokens in the Slovak one).

The subcorpus of fiction contained 4 million sentence pairs (63 million tokens in the English half, 54 million tokens in the Slovak one).

Version 1.0

Corpus par-sken-1.0 contained 1.6 million sentence pairs (24 million tokens in the English half, 20 million tokens in the Slovak one).

 
The corpus par-sken-1.0 was supported by the EC grant FP7-ICT-2009-5 Bringing Machine Translation for European Languages to the User – Enlarged European Union (EuroMatrixPlus-X). The extended version of par-sken-2.0 was also supported by the EC grant.