The data is jointly released by the Slovak National Corpus, Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, and the Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague.
These (and other) datasets relevant for MT are also available from the Clarin ERIC repository located at the LINDAT-Clarin project page.
To get access to the files, please contact us.
Translation tables for the Moses MT system
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.
These tables will help you build your own MT system.
- English->Slovak based on parallel corpus of fiction
- Slovak->English based on parallel corpus of fiction
- English->Slovak factorized (lemma+tag) model, based on parallel corpus of fiction
- Slovak->English factorized (lemma+tag) model, based on parallel corpus of fiction
- English->Slovak based on parallel corpus of fiction+Europarl v6
- Slovak->English based on parallel corpus of fiction+Europarl v6
- Czech->Slovak based on parallel Slovak-Czech corpus and Europarl v6
- Slovak->Czech based on parallel Slovak-Czech corpus and Europarl v6
- Slovak->Czech supplementary phrase table of inflected word forms
- Czech->Slovak supplementary phrase table of inflected word forms
- Slovak->English supplementary phrase table of inflected noun word forms
- English->Slovak supplementary phrase table of inflected noun word forms
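For orientation, here is a minimal sketch of reading one entry from such a table. It assumes the files use the standard Moses plain-text phrase-table layout, in which fields are separated by ` ||| ` and the first three fields are the source phrase, the target phrase, and the feature scores. The names `PhraseEntry` and `parse_phrase_table_line` and the sample line are hypothetical and serve only as illustration; the sample is not taken from the released tables.

```python
from typing import NamedTuple, Tuple


class PhraseEntry(NamedTuple):
    """One phrase pair with its feature scores (hypothetical helper type)."""
    source: str
    target: str
    scores: Tuple[float, ...]


def parse_phrase_table_line(line: str) -> PhraseEntry:
    """Parse one line, assuming the standard Moses ' ||| ' field separator.

    Splitting on ' ||| ' (with spaces) leaves single '|' characters intact,
    so factored tokens such as word|lemma|tag are not broken apart.
    """
    fields = line.rstrip("\n").split(" ||| ")
    if len(fields) < 3:
        raise ValueError("expected at least source, target and scores fields")
    source, target, raw_scores = fields[0], fields[1], fields[2]
    scores = tuple(float(s) for s in raw_scores.split())
    return PhraseEntry(source.strip(), target.strip(), scores)


# Invented sample line for illustration only.
sample = "house ||| dom ||| 0.5 0.25 0.5 0.125 ||| 0-0 ||| 4 4 2\n"
entry = parse_phrase_table_line(sample)
print(entry.source, "->", entry.target, entry.scores)
```

Any fields after the scores (word alignments and counts in the sample) are ignored by this sketch. If the downloaded tables are compressed, they would need to be opened with a matching reader first.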
You can get more complete language models here.
Parallel corpora (English-Slovak)
Slovak texts are automatically lemmatized and morphologically annotated with the Slovak National Corpus tagset. English texts are lemmatized and part-of-speech tagged with the Penn Treebank tagset.
Corpus | Source | Sentence pairs |
Official Journal of the European Union | http://apertium.eu/data (oj4-ss-1) | 3272180 |
OPUS, the open parallel corpus | http://opus.lingfil.uu.se/ | |
OPUS-EMEA | | 1054178 |
OPUS-EUconst | | 10119 |
OPUS-KDE4 | | 105425 |
OPUS-PHP | | 31173 |
JRC-Acquis 3.0 | http://langtech.jrc.it/JRC-Acquis.html | 1115765 |
The European Commission webpage | http://ec.europa.eu/ | 13050 |
Europarl v6 | http://www.statmt.org/europarl/ | 460779 |
Parallel corpora (Slovak-Czech)
Slovak texts are automatically lemmatized and morphologically annotated with the Slovak National Corpus tagset. Czech texts are automatically lemmatized and morphologically annotated with the Czech National Corpus tagset.
Corpus | Source | Sentence pairs |
Official Journal of the European Union | http://apertium.eu/data (oj4-ss-1) | 3078210 |
OPUS, the open parallel corpus | http://opus.lingfil.uu.se/ | |
OPUS-EMEA | | 1067905 |
OPUS-EUconst | | 10630 |
OPUS-KDE4 | | 97260 |
OPUS-PHP | | 28084 |
JRC-Acquis 3.0 | http://langtech.jrc.it/JRC-Acquis.html | 926082 |
The European Commission webpage | http://ec.europa.eu/ | 24190 |
Europarl v6 | http://www.statmt.org/europarl/ | 459089 |
Supported by the EC grant FP7-ICT-2009-5 Bringing Machine Translation for European Languages to the User – Enlarged European Union (EuroMatrixPlus-X).