DIAKRITIK is a tool for reconstructing diacritics. It was developed at the Slovak National Corpus (SNC) and made available on August 18, 2014. It relies on a language model built from a large corpus of Slovak texts.
Reconstruction can use one of the following methods, which differ in their trade-off between error rate and speed:
|first||Selects the first reconstruction option it finds in the text.|
|random||Where possible, replaces each word with a randomly chosen variant with diacritics.|
|naïve||Selects, for each word, the most frequent variant with diacritics.|
|n-gram||Uses a language model: words are reconstructed in sections of length n so that the probability of the resulting sentence occurring in Slovak is as high as possible. The higher the n, the better the accuracy, but the greater the computational cost.|
|remove diacritics||The reverse operation: the tool removes diacritics from the uploaded text.|
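The following Python sketch illustrates the ideas behind the naïve, n-gram, and remove-diacritics methods. It is not DIAKRITIK's implementation: the function names, the tiny training corpus, the choice of n = 2, and the add-one smoothing are all assumptions made for illustration only.

```python
import math
import unicodedata
from collections import Counter, defaultdict


def remove_diacritics(text):
    """Strip combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def train(tokens):
    """Collect candidate forms per bare word, plus bigram counts (toy model)."""
    candidates = defaultdict(Counter)
    for tok in tokens:
        candidates[remove_diacritics(tok)][tok] += 1
    bigrams = Counter(zip(tokens, tokens[1:]))
    return candidates, bigrams


def restore_naive(words, candidates):
    """Pick the most frequent diacritized form of each word independently."""
    return [candidates[w].most_common(1)[0][0] if w in candidates else w
            for w in words]


def restore_bigram(words, candidates, bigrams):
    """Viterbi search for the most probable sequence under a bigram model."""
    unigrams = Counter()
    for forms in candidates.values():
        unigrams.update(forms)
    total, vocab = sum(unigrams.values()), len(unigrams)

    def options(word):
        # Unknown words are passed through unchanged.
        return list(candidates[word]) if word in candidates else [word]

    def logp(prev, cur):
        # Add-one smoothing (an assumption of this sketch).
        return math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))

    # best[form] = (log-probability, path) of the best path ending in form
    best = {f: (math.log((unigrams[f] + 1) / (total + vocab)), [f])
            for f in options(words[0])}
    for word in words[1:]:
        new_best = {}
        for cur in options(word):
            score, path = max(
                ((s + logp(prev, cur), p) for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[cur] = (score, path + [cur])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical toy corpus; a real model would be trained on a large corpus.
corpus = ("on má dom ona má psa vidí ma často "
          "počuje ma dobre ona má auto").split()
candidates, bigrams = train(corpus)

print(remove_diacritics("Ľudovít Štúr"))
print(restore_naive("vidi ma casto".split(), candidates))
print(restore_bigram("vidi ma casto".split(), candidates, bigrams))
```

In this toy corpus "má" is more frequent than "ma", so the naïve method restores "vidi ma casto" incorrectly, while the bigram model uses the neighbouring words to choose "ma". This mirrors the accuracy and cost trade-off described in the table.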
The error rate of the reconstructed text, i.e. the proportion of words with incorrect diacritics, is approximately 0.2%, so about one word in five hundred will be reconstructed incorrectly. The closer the text is to standard Slovak, the more successful the reconstruction.
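As a rough illustration of how such an error rate can be measured, the sketch below compares a reconstructed text with a reference word by word. The function name and the sample sentences are hypothetical, and the sketch assumes whitespace tokenisation and equal word counts in both texts.

```python
def diacritic_error_rate(reference, restored):
    """Return the proportion of words whose restored form differs from the reference."""
    ref_words = reference.split()
    out_words = restored.split()
    if len(ref_words) != len(out_words):
        raise ValueError("texts must contain the same number of words")
    if not ref_words:
        return 0.0
    errors = sum(r != o for r, o in zip(ref_words, out_words))
    return errors / len(ref_words)


# Hypothetical example: one wrong word out of six.
rate = diacritic_error_rate("vidí ma často a má dom", "vidí má často a má dom")
print(f"{rate:.3f}")

# An error rate of 0.2% corresponds to one expected error per 500 words.
print(0.002 * 500)
```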
The tool is currently being developed by the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, but not within the project Building and Development of the Slovak National Corpus (5th Stage).