DIAKRITIK is a tool for reconstructing diacritics in Slovak text. It was developed at the Slovak National Corpus (SNC) and made available on August 18, 2014. It relies on a language model built from a large corpus of Slovak texts. Reconstruction can use one of the following methods, which differ in their trade-off between error rate and speed:
| Method | Description |
|---|---|
| first | Selects the first reconstruction option it finds in the text. |
| random | Where possible, replaces each word with a randomly chosen variant with diacritics. |
| naïve | Selects the most frequent variant of each word with diacritics. |
| n-gram | Uses a language model: words are reconstructed in sections of length n so that the probability of the resulting sentence occurring in Slovak is as high as possible. The higher the n, the better the accuracy, but the greater the computational cost. |
| remove diacritics | The opposite procedure: the tool removes diacritics from the uploaded text. |
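
The difference between the naïve and n-gram methods can be illustrated with a minimal Python sketch. This is not DIAKRITIK's implementation: the function names, the four-sentence toy corpus, the choice of n = 2 and the add-alpha smoothing are all assumptions made for illustration, and input is assumed to be lowercase and already split into words.

```python
import math
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(text):
    """'remove diacritics': decompose to NFD and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_model(sentences):
    """Collect candidate forms plus unigram and bigram counts (hypothetical helper)."""
    candidates = defaultdict(Counter)  # stripped form -> counts of forms with diacritics
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        prev = "<s>"  # sentence-start marker
        unigrams[prev] += 1
        for token in sentence:
            candidates[strip_diacritics(token)][token] += 1
            unigrams[token] += 1
            bigrams[(prev, token)] += 1
            prev = token
    return candidates, unigrams, bigrams


def restore_naive(words, candidates):
    """'naive': pick the most frequent variant of each word, ignoring context."""
    restored = []
    for word in words:
        options = candidates.get(word)
        restored.append(options.most_common(1)[0][0] if options else word)
    return restored


def restore_bigram(words, candidates, unigrams, bigrams, alpha=0.1):
    """'n-gram' with n = 2: Viterbi search for the most probable word sequence."""
    vocab_size = len(unigrams) or 1

    def log_prob(prev, cur):
        # Add-alpha smoothing keeps unseen bigrams from getting zero probability.
        return math.log((bigrams[(prev, cur)] + alpha)
                        / (unigrams[prev] + alpha * vocab_size))

    best = {"<s>": (0.0, [])}  # last chosen word -> (log score, path so far)
    for word in words:
        options = list(candidates.get(word, {word: 1}))  # unknown words stay as-is
        new_best = {}
        for cand in options:
            score, path = max(
                ((s + log_prob(prev, cand), p) for prev, (s, p) in best.items()),
                key=lambda pair: pair[0],
            )
            new_best[cand] = (score, path + [cand])
        best = new_best
    return max(best.values(), key=lambda pair: pair[0])[1]


# Toy corpus (an assumption): "pas" (passport) is rarer than "pás" (belt).
corpus = [
    ["cestovný", "pas"],
    ["bezpečnostný", "pás"],
    ["bezpečnostný", "pás"],
    ["nový", "pás"],
]
candidates, unigrams, bigrams = build_model(corpus)

print(restore_naive(["cestovny", "pas"], candidates))
# ['cestovný', 'pás']  (the globally most frequent form, wrong in this context)
print(restore_bigram(["cestovny", "pas"], candidates, unigrams, bigrams))
# ['cestovný', 'pas']  (the preceding word selects the right form)
```

In this toy setting the naïve method picks "pás" because it is more frequent overall, while the bigram model uses the preceding word to recover "cestovný pas". A larger n would extend the context at the price of a larger search space, which matches the accuracy and cost trade-off described in the table.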
The error rate of the reconstructed text, i.e. the proportion of words with incorrect diacritics, is approximately 0.21%, so about one word in five hundred is reconstructed incorrectly. The closer the text is to standard Slovak, the more successful the reconstruction.
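
The error rate defined above can be computed as in the following minimal sketch. It assumes that words are separated by whitespace and that the reference and the reconstructed text contain the same number of words, which holds when only diacritics are changed; the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, reconstructed):
    """Share of words whose reconstructed form differs from the reference."""
    ref_words = reference.split()
    rec_words = reconstructed.split()
    if len(ref_words) != len(rec_words):
        raise ValueError("texts must contain the same number of words")
    if not ref_words:
        return 0.0
    errors = sum(r != h for r, h in zip(ref_words, rec_words))
    return errors / len(ref_words)


# One wrong word out of two gives an error rate of 0.5.
print(word_error_rate("cestovný pas", "cestovný pás"))  # 0.5
```

At the reported rate of 0.21%, a text of 500 words would contain 500 × 0.0021 ≈ 1.05 incorrectly reconstructed words on average.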
More links
- language models of the Slovak National Corpus
- CZACCENT – adding diacritics in Czech
- an alternative tool for adding diacritics at brm.sk
- text correction at KEMT Natural Language Processing, TUKE Košice