Corpus of Dialects of the Slovak National Corpus

We began preparing the Corpus of Dialects of the Slovak National Corpus (hereinafter referred to as CD SNC) in 2013. The aim of the initial phase is to gather existing dialect audio recordings or handwritten transcriptions, in particular those already published, to process them using corpus methodology and tools, and to make them available for research.

The new version, dialekt-2.0, containing 328 907 text units, was made accessible in August 2015. Compared with the previous version, the corpus has been supplemented with a number of processed texts from monographs mapping the Slovak dialect areas.

CD SNC is neither lemmatised nor morphologically annotated. Users can browse the corpus by searching for a word or by using CQL (Corpus Query Language). The transcribed texts carry sociolinguistic metadata about the respondents and informants and about the origin and content of each recording. The corpus is accessible through the NoSketch Engine web interface, but users must first register for an account.
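Because the corpus carries no lemmas or morphological tags, CQL queries can only target the word forms themselves. The following minimal Python sketch shows how such queries might be assembled as strings before being pasted into the CQL field of the interface. It assumes the positional attribute is named `word` (the usual NoSketch Engine default, not confirmed here for CD SNC); the helper names `token` and `phrase` and the sample word forms are hypothetical and serve only as illustration.

```python
def token(pattern: str, attr: str = "word") -> str:
    """Build one CQL token expression, e.g. [word="vod.*"].

    `pattern` is a regular expression that must match the whole token.
    The attribute name "word" is an assumption about the corpus setup.
    """
    if '"' in pattern:
        # Quoting rules are not covered by this sketch, so reject quotes.
        raise ValueError("double quotes are not supported in this sketch")
    return f'[{attr}="{pattern}"]'


def phrase(*patterns: str) -> str:
    """Build a CQL query matching the given tokens in sequence."""
    return " ".join(token(p) for p in patterns)


# A single word form, then any form starting with "vod".
print(token("voda"))
print(token("vod.*"))

# Two adjacent tokens: "na" followed by a form starting with "vod".
print(phrase("na", "vod.*"))
```

Since dialect transcriptions record many spelling variants of the same word, a regular-expression pattern on the word form is the closest available substitute for a lemma search in this corpus.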

A specialized virtual keyboard named SNK-DIALEKT, containing the special characters used in the transcriptions, has been available in the NoSketch Engine interface since version dialekt-2.0. The corpus also includes several specific values.

Version 1.0

The beta version was prepared in March 2014. Version dialekt-1.0, containing 73 855 tokens, was released in September 2014 and included among the publicly available resources of the Slovak National Corpus. It includes a number of previously published texts as well as texts provided by the Department of Dialectology of the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences.