Slovak-German Parallel Corpus

Version 3.0

The current version, par-skde-3.0, has been extended with more than 170 publications and now contains 468 million tokens (229.9 million tokens in the Slovak part and 238.1 million tokens in the German part).

The corpus consists of two parts – the subcorpus of fiction and the free subcorpus (containing EU documents).

You can query the whole Slovak-German corpus using NoSketch Engine, either in the German half or in the Slovak half.

Previous experience with NoSketch Engine and CQL is highly recommended.

The Slovak-German Parallel Corpus is a database of texts in both Slovak and German. It contains Slovak texts translated into German and vice versa, as well as translations from a third language. The texts are kept in the form in which they were originally written or published, so older texts preserve their original orthography.

The texts are automatically aligned at the sentence level. Slovak texts are automatically morphologically annotated by the taggers Morče and MorphoDiTa, which have been developed by the SNC. German texts are part-of-speech tagged using the TreeTagger software.

Version 2.0

The previous version, par-skde-2.0, was released in May 2016. The database contained 446.2 million tokens (219.8 million tokens in the Slovak half, 226.4 million tokens in the German half).

The corpus consisted of two parts – the subcorpus of fiction (7.5 million tokens) and the free subcorpus (containing EU documents).

Version 1.0

The corpus par-skde-1.0 was released in December 2014. The database contained almost 263 million tokens (129.5 million tokens in the Slovak half, 133 million tokens in the German half).

The subcorpus of fiction contained 7.5 million tokens.
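
The token counts reported for the three versions can be cross-checked with a short script. The following is only a minimal sketch: the figures are copied from the version notes on this page, while the names VERSIONS, total_tokens and growth are hypothetical helpers introduced for illustration and are not part of any corpus tool.

```python
from decimal import Decimal

# Token counts in millions, copied from the version notes on this page.
# Decimal is used so that the sums are exact rather than floating-point.
VERSIONS = {
    "par-skde-1.0": {"sk": Decimal("129.5"), "de": Decimal("133.0")},
    "par-skde-2.0": {"sk": Decimal("219.8"), "de": Decimal("226.4")},
    "par-skde-3.0": {"sk": Decimal("229.9"), "de": Decimal("238.1")},
}


def total_tokens(version):
    """Return the combined Slovak + German token count (in millions)."""
    halves = VERSIONS[version]
    return halves["sk"] + halves["de"]


def growth(older, newer):
    """Return how many million tokens were added between two versions."""
    return total_tokens(newer) - total_tokens(older)


if __name__ == "__main__":
    for name in VERSIONS:
        print(name, total_tokens(name), "million tokens")
    print("2.0 -> 3.0:", growth("par-skde-2.0", "par-skde-3.0"), "million tokens added")
```

Under these figures the halves sum to 262.5, 446.2 and 468.0 million tokens for versions 1.0, 2.0 and 3.0, which matches the totals stated for each version after rounding.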