ERRKORP – text corpus for foreigners learning Slovak as a foreign language

ERRKORP, an acquisition corpus, was created by the Department of Slovak National Corpus at the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences and by Studia Academica Slovaca, a centre for Slovak as a foreign language (SFL) at the Faculty of Arts of Comenius University in Bratislava.

The project was launched in 2017 and concluded in 2024. Its objective was to describe the types of language errors observed in the process of learning SFL and to explore the correlations between error types and the various factors that influence the learning of SFL, based on a corpus of written texts by non-native speakers. The corpus is available free of charge not only to members of the project team, but also to others interested in teaching Slovak as a foreign language and to all registered users of the Slovak National Corpus (SNK), as one of the text corpora in the NoSketch Engine tool.

In 2020, the project received support from the APVV agency, and its development became part of the broader project Language Errors in Slovak as a Foreign Language Based on Learner Corpus (APVV-19-0155).

The project team included employees of the Slovak National Corpus at the Ľ. Štúr Institute of Linguistics of the Slovak Academy of Sciences (K. Gajdošová, J. Levická, J. Mášik, K. Rausová; up to 2020 also L. Klimová, M. Šimková), employees of Studia Academica Slovaca at the Faculty of Arts of Comenius University in Bratislava (J. Pekarovičová, M. Mošaťová, H. Ľos Ivoríková, P. Kollárová), and colleagues from the University of Prešov in Prešov (M. Imrichová, M. Kyseľová, M. Ivanová) and Matej Bel University in Banská Bystrica (A. Gálisová, L. Urbancová). Major text contributors included teachers of Slovak and Slovak culture abroad and the Institute for Language and Professional Preparation for Foreign Students at Comenius University in Bratislava.

As part of the APVV project, the following versions of the corpus were made available: errkorp-pilot (2022), errkorp-1.0 (2023) and errkorp-2.0 (2024):

The pilot version, errkorp-pilot, was released on August 5, 2022, and contains 137,393 tokens.

The first version of the corpus, errkorp-1.0, was made available on June 15, 2023, and contains 347,395 tokens. It comprises 1,063 texts written by students learning Slovak as a foreign language, with different mother tongues and different levels of proficiency in Slovak. Compared to the pilot version, it contains newly added data as well as data whose manual error annotation has been qualitatively improved.

The second version of the corpus, errkorp-2.0, was released on June 26, 2024, and contains 727,668 tokens. It comprises 2,185 texts written by students learning Slovak as a foreign language, with different mother tongues and different levels of proficiency in Slovak.

Since the closure of the APVV project (June 2024), the versions have been hosted by the Department of Slovak National Corpus of the Ľ. Štúr Institute of Linguistics, which prepared the third version of the acquisition corpus within the project Building and Development of the Slovak National Corpus (5th Stage). This version was released on January 26, 2026, and contains 953,156 tokens in 3,054 texts. Like the previous versions, it is automatically lemmatized and morphologically annotated by the MorphoDiTa tagger.
