About corpus and corpus resources

THE CORPUS IS NEITHER AN ELECTRONIC DICTIONARY, NOR IS IT A SUBSTITUTE FOR THE CODIFICATION MANUALS.

The corpus material is most commonly obtained directly in electronic form, or less frequently by technical processing of a previously published printed work. Several technical phases follow: removal of characters and symbols generated by the editing software or left over from the graphic components of the text, conversion into a uniform format, and segmentation of the text into the smallest possible units. Depending on the corpus type, the segmented text can then be annotated with further information such as bibliographic data, information on the text structure, and linguistic information at the word level (e.g. part of speech, lemma, i.e. the basic form of a word) or at the sentence level (e.g. word function, semantics).
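The processing phases described above can be illustrated with a minimal sketch. It is not the pipeline actually used for the corpus: the function names clean and segment, the Token structure and the regular-expression tokenisation rule are all assumptions made for illustration, and real lemma and part-of-speech values would come from a separate tagging step.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """One segmented unit with slots for word-level annotation (hypothetical layout)."""
    form: str        # the unit exactly as it appears in the text
    lemma: str = ""  # basic form of the word, left empty here
    pos: str = ""    # part of speech, left empty here


def clean(raw: str) -> str:
    """Convert to a uniform Unicode form and drop editing-software residue."""
    text = unicodedata.normalize("NFC", raw)
    # Remove control and format characters (e.g. soft hyphens, NUL bytes),
    # keeping only newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
    # Collapse runs of spaces and tabs into a single space.
    return re.sub(r"[ \t]+", " ", text).strip()


def segment(text: str) -> List[Token]:
    """Split text into word tokens and single punctuation marks."""
    return [Token(form=m) for m in re.findall(r"\w+|[^\w\s]", text)]


if __name__ == "__main__":
    raw = "Slo\u00advenský  korpus\x00."
    for token in segment(clean(raw)):
        print(token)
```

Keeping the annotation slots on each token means that later steps can fill in the lemma and part of speech without re-segmenting the text.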

Written corpora

  • Main corpus prim and specialised corpora

Corpora of written texts contain electronically processed texts of the type on which the given corpus focuses. The main corpus, prim, contains written texts of contemporary Slovak produced since 1955 across different styles, genres, subject areas, regions and more. A text can be included in the corpus only with the consent of its author or copyright holder, which is set out in a licence agreement. The same conditions apply to the specialised corpora (e.g. the corpus of economic texts), but not to corpora containing legal regulations, official decisions and judicial decisions, as Slovakia’s Copyright Act does not cover such texts.

Special thanks go to all providers for their willingness to cooperate in building the Slovak National Corpus and for contributing texts for linguistic and other research.

  • Corpus of dialects

The Corpus of Dialects of the Slovak National Corpus brings together, in electronic form, existing and predominantly textual transcriptions of dialect material drawn from previously released audio recordings or handwritten transcriptions. The texts have been uniformly processed using corpus methodology and tools, and they carry information on the origin and content of the recordings as well as sociolinguistic data on the informants and the fieldworkers who recorded them. Together, this enables comprehensive research of dialects.

  • Historical corpora

The Slovak National Corpus offers three corpora of texts written in Slovak before 1955. The first two, r864az1843-1.0 and r1843az1954-1.0, contain corpus-processed texts from publications available in the SME Gold Fund, courtesy of Petit Press, a.s. The third historical corpus differs substantially from these two, whose texts were transcribed according to the rules of written Slovak in force when the texts were published and according to the principles followed by their editors or publishers at the time.

For the Corpus of Historical Slovak, source materials in the original orthography were selected and processed. Most of them had earlier appeared in Pramene k dejinám slovenčiny (Sources for the History of the Slovak Language). In addition, on a smaller scale, unpublished historical texts are being transcribed within the Slovak National Corpus project.

  • Web corpus

The versions of the web corpus contain Slovak texts found on the web, which are automatically downloaded for each year and subsequently processed. The first version of the web corpus was released in 2010 and was developed jointly with the Faculty of Informatics, Masaryk University in Brno. Starting with the third version, the web corpus also includes data produced by Araneum.

  • Parallel corpora

Each parallel corpus contains the same texts in two languages: one may be a translation of the other, or both may be translations from a third language. The Slovak texts, mostly translations from other languages, are licensed for inclusion in these corpora, while the foreign-language texts have been obtained from Internet sources. Some of the texts in the parallel corpora that cover translations into Slovak from English, Bulgarian, Czech, French, Hungarian and German are not protected by copyright, as they originate from European legislation.

Texts in the SNC parallel corpora are aligned at the sentence level. Although each parallel corpus can be searched in both directions, this does not mean that the foreign-language texts were translated from a Slovak original; the foreign-language text may itself be the source. For major languages such as English, German or French, none or almost none of the texts were originally written in Slovak. In the other parallel corpora, the ratio of texts originally written in Slovak to texts whose source is another language varies. For example, in the Slovak-Czech parallel corpus, over 53% of the texts were originally written in Slovak, while only a little over 20% were originally written in Czech.
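Sentence-level alignment, two-way searching and the question of which language a text was originally written in can be sketched as follows. This is only an illustration, not the SNC search interface: the AlignedPair structure, its source field, both function names and the four sample sentence pairs are invented for the example, and the shares are counted per sentence pair as a simplification.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AlignedPair:
    """One aligned sentence pair (hypothetical layout)."""
    sk: str      # Slovak sentence
    other: str   # aligned sentence in the other language
    source: str  # language the text was originally written in


def search(pairs: List[AlignedPair], query: str, side: str) -> List[AlignedPair]:
    """Two-way search: match the query on the 'sk' or the 'other' side."""
    q = query.lower()
    return [p for p in pairs if q in getattr(p, side).lower()]


def source_shares(pairs: List[AlignedPair]) -> Dict[str, float]:
    """Share of pairs by original language, regardless of search direction."""
    counts = Counter(p.source for p in pairs)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


# Invented sample data for a Slovak-Czech corpus.
PAIRS = [
    AlignedPair("Dobrý deň.", "Dobrý den.", source="sk"),
    AlignedPair("Prosím.", "Prosím.", source="sk"),
    AlignedPair("Ďakujem.", "Děkuji.", source="cs"),
    AlignedPair("Vitajte.", "Vítejte.", source="en"),  # both sides translated from a third language
]

if __name__ == "__main__":
    print(search(PAIRS, "den", side="other"))
    print(source_shares(PAIRS))
```

Storing the original language on each pair keeps the search direction separate from the translation direction, which is the distinction the paragraph above draws.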

Spoken corpora

The spoken corpus comprises audio recordings linked to corresponding transcripts of the recorded utterances. The transcripts are always accompanied by sociolinguistic information about the respondents and general information about the origin and content of the recording. The recordings were either made by SNC staff or provided by institutions from their own archives.

Besides the basic transcript of the utterances, which follows the rules of written Slovak in the same way as theatre and film scripts, dialogues in fiction and newspaper interview transcripts, the corpus also captures overlapping, incomplete or repeated utterances, lapses and other accompanying non-verbal phenomena. Some speakers deviate significantly from standard Slovak, so the transcript also needs to capture pronunciation without softening, lengthening or assimilation. Among supra-segmental phenomena, pauses and expressive quantity are marked, while melody is indicated only by basic end punctuation.