Slovak Named Entity Corpus

The latest version, snec-1.0, was released on 28 February 2024 and contains 468 715 tokens.

The corpus comprises 201 texts from the free encyclopedia Wikipedia. It contains 27 000 sentences with more than 67 000 annotated entities. The texts were annotated manually and have since also undergone a supervised, semi-automated check.

The corpus uses data obtained within the project "Koncepcia a realizácia sémantickej anotácie korpusu (identifikácia viacslovných pomenovaní, ručná anotácia pomenovaných jednotiek, budovanie ontológií)". The texts are semantically annotated.