These and other datasets relevant to MT are also available from the Clarin ERIC repository on the LINDAT-Clarin project page.
To get access to the files, please contact us.
Translation tables for the Moses MT system
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.
These tables will help you build your own MT system.
- English->Slovak based on parallel corpus of fiction
- Slovak->English based on parallel corpus of fiction
- English->Slovak factorized (lemma+tag) model, based on parallel corpus of fiction
- Slovak->English factorized (lemma+tag) model, based on parallel corpus of fiction
- English->Slovak based on parallel corpus of fiction+Europarl v6
- Slovak->English based on parallel corpus of fiction+Europarl v6
- Czech->Slovak based on parallel Slovak-Czech corpus and Europarl v6
- Slovak->Czech based on parallel Slovak-Czech corpus and Europarl v6
- Slovak->Czech supplementary phrase table of inflected word forms
- Czech->Slovak supplementary phrase table of inflected word forms
- Slovak->English supplementary phrase table of inflected noun word forms
- English->Slovak supplementary phrase table of inflected noun word forms
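As a minimal sketch of how such a table could be inspected before plugging it into Moses, the snippet below parses one line of a plain-text phrase table. It assumes the tables follow the usual Moses text layout, in which fields are separated by `|||` and the third field holds space-separated feature scores; the page itself does not state the file format. The names `PhraseEntry` and `parse_phrase_table_line`, and the sample line, are hypothetical and used only for illustration.

```python
from typing import NamedTuple, Tuple


class PhraseEntry(NamedTuple):
    """One phrase pair with its feature scores (hypothetical helper type)."""
    source: str
    target: str
    scores: Tuple[float, ...]


def parse_phrase_table_line(line: str) -> PhraseEntry:
    """Parse a single phrase-table line.

    Assumption: fields are separated by '|||' in the order
    source ||| target ||| scores [||| further optional fields].
    Any fields after the scores are ignored here.
    """
    fields = [field.strip() for field in line.rstrip("\n").split("|||")]
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}: {line!r}")
    source, target, raw_scores = fields[0], fields[1], fields[2]
    scores = tuple(float(value) for value in raw_scores.split())
    return PhraseEntry(source, target, scores)


if __name__ == "__main__":
    # Made-up sample line, not taken from the distributed tables.
    sample = "the house ||| ten dom ||| 0.5 0.25 0.5 0.125 ||| 0-0 1-1 ||| 4 2 2"
    entry = parse_phrase_table_line(sample)
    print(entry.source, "->", entry.target, entry.scores)
```

In the factorized (lemma+tag) tables, each token would presumably carry its factors joined by a single `|`, which is why the sketch splits only on the triple `|||` separator and leaves the tokens themselves untouched.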
You can get more complete language models here.
Parallel corpora (English-Slovak)
Slovak texts are automatically lemmatized and morphologically annotated with the Slovak National Corpus tagset. English texts are lemmatized and part-of-speech tagged with the Penn Treebank Tagset.
Corpus | Source | Sentence pairs |
Official Journal of the European Union | http://apertium.eu/data (oj4-ss-1) | 3272180 |
OPUS, the open parallel corpus | http://opus.lingfil.uu.se/ | |
OPUS-EMEA | | 1054178 |
OPUS-EUconst | | 10119 |
OPUS-KDE4 | | 105425 |
OPUS-PHP | | 31173 |
JRC-Acquis 3.0 | http://langtech.jrc.it/JRC-Acquis.html | 1115765 |
The European Commission webpage | http://ec.europa.eu/ | 13050 |
Europarl v6 | http://www.statmt.org/europarl/ | 460779 |
Parallel corpora (Slovak-Czech)
Slovak texts are automatically lemmatized and morphologically annotated with the Slovak National Corpus tagset. Czech texts are automatically lemmatized and morphologically annotated with the Czech National Corpus tagset.
Corpus | Source | Sentence pairs |
Official Journal of the European Union | http://apertium.eu/data (oj4-ss-1) | 3078210 |
OPUS, the open parallel corpus | http://opus.lingfil.uu.se/ | |
OPUS-EMEA | | 1067905 |
OPUS-EUconst | | 10630 |
OPUS-KDE4 | | 97260 |
OPUS-PHP | | 28084 |
JRC-Acquis 3.0 | http://langtech.jrc.it/JRC-Acquis.html | 926082 |
The European Commission webpage | http://ec.europa.eu/ | 24190 |
Europarl v6 | http://www.statmt.org/europarl/ | 459089 |
Supported by the EC grant FP7-ICT-2009-5 Bringing Machine Translation for European Languages to the User – Enlarged European Union (EuroMatrixPlus-X).