Slovak WordNet is a database of semantic relations between the most frequent Slovak nouns, adjectives, verbs and adverbs, following the general model of the English WordNet. Slovak entries (synsets) are mapped to English synsets.
The project is still in the development stage. The database currently contains 25 000 synsets. It has been made available to give an insight into the data and the processing technologies. The file format may still change.
File format
The files are encoded in UTF-8 with Unix line endings (LF, \n, U+000A …). Each synset is a single line consisting of two records separated by the symbol ␞ (U+241E SYMBOL FOR RECORD SEPARATOR). The first record is the Slovak one; it is linked to the second record, which comes from Princeton WordNet.
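The line layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the project's tooling: the function names split_synset_line and read_synsets are hypothetical, and it assumes that the separator occurs exactly once per line with no padding around it.

```python
RECORD_SEPARATOR = "\u241e"  # the character U+241E SYMBOL FOR RECORD SEPARATOR


def split_synset_line(line):
    """Split one synset line into (slovak_record, english_record).

    Assumption: the separator appears once per line; anything after
    the first occurrence is treated as the Princeton WordNet record.
    """
    line = line.rstrip("\n")  # drop the Unix line ending, keep tabs
    slovak, sep, english = line.partition(RECORD_SEPARATOR)
    if not sep:
        raise ValueError("line has no U+241E record separator")
    return slovak, english


def read_synsets(path):
    """Yield (slovak_record, english_record) pairs from a UTF-8 file."""
    # newline="\n" keeps Python from translating other line endings
    with open(path, encoding="utf-8", newline="\n") as handle:
        for line in handle:
            if line.strip():  # skip blank lines, if there are any
                yield split_synset_line(line)
```

Both records are returned as raw strings so that the tab-separated annotations inside them stay intact for later processing.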
Format of the Slovak record
Each record consists of four annotations separated by tabs (\t):
Number is a synset identifier.
Part of speech classification:
n for nouns
v for verbs
a for adjectives
r for adverbs.
Words are literals grouped by similarity of meaning. Literals are separated by semicolons; an explanation or further clarification can be given in brackets. A plus sign (+) marks the semantically ‘most important’ literal in the synset. A minus sign (-) indicates that there is no direct equivalent in the target language. A question mark (?) marks an unclear synset.
Gloss is an optional definition of, or comment on, the synset; in most cases this annotation remains empty.
A Slovak synset can be linked to several English synsets.
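As an illustration of the four annotations, a Slovak record could be parsed roughly as follows. This is a sketch under stated assumptions: the names SlovakRecord and parse_slovak_record are hypothetical, an empty gloss is assumed to appear either as an empty fourth field or as a missing one, and semicolons are assumed not to occur inside bracketed explanations. The markers +, - and ? are left inside the literal strings untouched, because their exact placement is not specified here.

```python
from dataclasses import dataclass
from typing import List

# Part-of-speech codes as listed in the format description
POS_NAMES = {"n": "noun", "v": "verb", "a": "adjective", "r": "adverb"}


@dataclass
class SlovakRecord:
    synset_id: str       # kept as text; no numeric range is assumed
    pos: str             # one of the keys of POS_NAMES
    literals: List[str]  # semicolon-separated words, markers untouched
    gloss: str           # optional, usually empty


def parse_slovak_record(record):
    """Parse one tab-separated Slovak record into a SlovakRecord."""
    fields = record.split("\t")
    if len(fields) == 3:
        fields.append("")  # assumption: a missing gloss means an empty one
    if len(fields) != 4:
        raise ValueError(f"expected 4 annotations, got {len(fields)}")
    synset_id, pos, words, gloss = fields
    if pos not in POS_NAMES:
        raise ValueError(f"unknown part of speech: {pos!r}")
    # Split on semicolons and drop surrounding whitespace and empty items
    literals = [w.strip() for w in words.split(";") if w.strip()]
    return SlovakRecord(synset_id, pos, literals, gloss)
```

For a made-up record such as "123\tn\tpes; psík\t", this yields the identifier "123", the part of speech "n", the literals "pes" and "psík", and an empty gloss.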
Please cite
Ondrej Dzurjuv, Ján Genči and Radovan Garabík: Generating Sets of Synonyms between Languages. In: Natural Language Processing, Multilinguality. Proceedings of the 6th International Conference SLOVKO 2011. Eds. D. Majchráková, R. Garabík. November 2011, Tribun, Brno.
Licences
Slovak WordNet is available under the following licences:
- GNU Affero General Public License, version 3
- Creative Commons Attribution-ShareAlike 3.0 Unported License
- Open Database License (ODbL) v1.0
Related links
The website for the Lithuanian WordNet can be found here.