Slovak-English Parallel Corpus
The current version, par-sken-4.0, released in December 2015, contains 556 million tokens (261 million tokens in the Slovak half, 295 million tokens in the English half).
The corpus consists of two parts: the subcorpus of fiction (200 million tokens: 92 million in the Slovak part, 108 million in the English part) and the subcorpus of freely available texts. To query the subcorpus of fiction, use NoSketch Engine, which is available separately for the English half and for the Slovak half.
The Slovak-English Parallel Corpus is a database containing texts in both Slovak and English. Slovak texts are translated into English or vice versa. The texts are automatically aligned at the sentence level. Slovak texts are automatically morphologically annotated by the tagger Morče, which has been trained and tuned on the tagset developed by the SNK. English texts are part-of-speech tagged with the Penn Treebank tagset, using the TreeTagger software.
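The sentence-level alignment and token-level annotation described above can be pictured with a minimal sketch. The class names Token and SentencePair and the function token_counts are hypothetical and chosen only for illustration; they do not reflect the corpus's actual storage format. The English tags are real Penn Treebank tags, while the Slovak tags are left empty because the SNK tagset values are not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Token:
    word: str                  # surface form as it appears in the text
    tag: Optional[str] = None  # morphological or part-of-speech tag, if known


@dataclass
class SentencePair:
    # One aligned unit: a Slovak sentence and its English counterpart.
    slovak: List[Token]
    english: List[Token]


def token_counts(pairs: List[SentencePair]) -> Tuple[int, int]:
    """Return (Slovak tokens, English tokens) summed over all pairs."""
    sk = sum(len(p.slovak) for p in pairs)
    en = sum(len(p.english) for p in pairs)
    return sk, en


# Illustrative pair (not taken from the corpus).
# English tags are Penn Treebank tags; Slovak tags are deliberately omitted.
pair = SentencePair(
    slovak=[Token("Pes"), Token("šteká"), Token(".")],
    english=[
        Token("The", "DT"),
        Token("dog", "NN"),
        Token("barks", "VBZ"),
        Token(".", "."),
    ],
)

print(token_counts([pair]))
```

Counting tokens separately per language, as token_counts does, mirrors how the sizes of the Slovak and English halves are reported for each version.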
The version par-sken-3.0 was released in January 2014. The database contains 392 million tokens (184 million tokens in the Slovak half, 208 million tokens in the English half); the subcorpus of fiction comprises 170 million tokens.
There are several ways to query the corpus:
The most straightforward: a simple WWW interface. Enter the query term (a Slovak or English word, a lemma, or a regular expression) into the Search input field. In the corpus selection box, choose the desired source (par-sken-3.0-sk for Slovak texts or par-sken-3.0-en for English texts). Clicking on the leftmost column displays a short bibliographic record.
The simplest: a dictionary interface. This does not contain the whole corpus, just automatically selected translation equivalents.
You can query the subcorpus of "fiction" (140 million tokens) through the simple web interface or through NoSketch Engine, separately for the English half and for the Slovak half. Texts of the free subcorpus are available for download.
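As an illustration of what a regular-expression query term does, here is a minimal sketch in Python. It assumes that a pattern must match a whole token rather than a substring, which is the usual behaviour in corpus query tools; the actual semantics of the web interface may differ. The function match_tokens and the token list are hypothetical and are not part of the corpus software.

```python
import re
from typing import List


def match_tokens(pattern: str, tokens: List[str]) -> List[str]:
    """Return the tokens that the pattern matches in full (not as a substring)."""
    compiled = re.compile(pattern)
    return [t for t in tokens if compiled.fullmatch(t)]


# Made-up token list mixing Slovak and English word forms.
tokens = ["kniha", "knihy", "knižnica", "zákon", "book", "books", "notebook"]

# Slovak word forms starting with "knih"
print(match_tokens(r"knih.*", tokens))

# "book" or "books", but not "notebook", because the whole token must match
print(match_tokens(r"books?", tokens))
```

Under the whole-token assumption, a pattern such as books? does not find "notebook"; to match it, the pattern would need a leading .* as well.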
The whole parallel corpus par-sken-2.0 contained 10 million sentence pairs (196 million tokens in the English half, 173 million tokens in the Slovak one).
The subcorpus of fiction contained 4 million sentence pairs (63 million tokens in the English half, 54 million tokens in the Slovak one).
Corpus par-sken-1.0 contained 1.6 million sentence pairs (24 million tokens in the English half, 20 million tokens in the Slovak one).
Corpus par-sken-1.0 was supported by the EC grant FP7-ICT-2009-5 Bringing Machine Translation for European Languages to the User – Enlarged European Union (EuroMatrixPlus-X). The extended version of par-sken-2.0 was also supported by the EC grant.
Developed jointly by the Slovak National Corpus, Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, and the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague.