Slovak Named Entity Corpus

The latest version, snec-1.0, was released on 28 February 2024 and contains 468 715 tokens.

The corpus consists of 201 texts from the free encyclopedia Wikipedia. It contains 27 000 sentences with more than 67 000 annotated entities. The manually annotated texts have also undergone a supervised, semi-automated check.

The corpus uses data obtained in the project Koncepcia a realizácia sémantickej anotácie korpusu (identifikácia viacslovných pomenovaní, ručná anotácia pomenovaných jednotiek, budovanie ontológií) [Conception and Implementation of Semantic Annotation of the Corpus (identification of multiword expressions, manual annotation of named entities, ontology building)]. The texts are semantically annotated, and a dedicated tagset has been defined for the corpus. The annotation covers a collection of named-entity lexicons for the identified categories, which were manually selected and disambiguated in accordance with the SNC tagset.

The corpus manager allows you to search by word, lemma, or tag. NE tags are displayed in the NoSkE tool as structural tags. Clicking on a word's reference shows its values in the structure ne.type.
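As a rough illustration of such a search, the sketch below builds query strings in CQL, the query language used by NoSkE. It assumes that the positional attributes are named word, lemma and tag, and that the ne structure carries a type attribute whose values are tagset codes such as P; both are assumptions inferred from the description above, and the helper names cql_token and within_ne are hypothetical.

```python
# Minimal sketch: compose CQL query strings for a NoSkE corpus.
# Attribute names (word, lemma, tag) and the ne/type structure are
# assumptions based on the corpus description, not a documented API.

ALLOWED_ATTRS = {"word", "lemma", "tag"}


def cql_token(attr: str, value: str) -> str:
    """Return a single-token CQL condition such as [lemma="Štúr"]."""
    if attr not in ALLOWED_ATTRS:
        raise ValueError(f"unsupported attribute: {attr}")
    # Escape backslashes and double quotes so the value stays inside the string.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'[{attr}="{escaped}"]'


def within_ne(query: str, ne_type: str) -> str:
    """Restrict a query to tokens inside an ne structure of the given type."""
    return f'{query} within <ne type="{ne_type}"/>'


# Example: the lemma "Štúr" when it occurs inside an entity of type P.
query = within_ne(cql_token("lemma", "Štúr"), "P")
print(query)
```

The resulting string can be pasted into the CQL field of the corpus manager; the values in CQL conditions are treated as regular expressions, so special characters in a search term may need additional escaping.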

The tagset used for snec-1.0 is available here. Categories such as numerals, personal names and time periods are supplemented with so-called supertags, each consisting of one capital letter (N, P, T), which mark the complete name of an entity as a whole. For instance, the name Ľudovít Štúr is tagged as P (personal name), the lexeme Ľudovít as pf (first name) and the lexeme Štúr as ps (surname).
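The two levels of annotation in this example can be pictured with a small data structure. The following is only an illustrative sketch, not the corpus's storage format: the Entity class and its methods are hypothetical, and the dictionaries list only the tags mentioned above rather than the full tagset.

```python
from dataclasses import dataclass

# Only the tags named in the description; the real tagset contains more.
SUPERTAGS = {"N": "numeral", "P": "personal name", "T": "time period"}
COMPONENT_TAGS = {"pf": "first name", "ps": "surname"}


@dataclass
class Entity:
    """Hypothetical container for one annotated entity."""
    supertag: str                  # one capital letter covering the whole name
    parts: list[tuple[str, str]]   # (lexeme, component tag) pairs

    def text(self) -> str:
        """Join the lexemes back into the full name."""
        return " ".join(lexeme for lexeme, _ in self.parts)

    def describe(self) -> list[str]:
        """List the supertag line first, then one line per lexeme."""
        lines = [f"{self.text()}: {self.supertag} ({SUPERTAGS[self.supertag]})"]
        lines += [
            f"{lexeme}: {tag} ({COMPONENT_TAGS[tag]})"
            for lexeme, tag in self.parts
        ]
        return lines


stur = Entity("P", [("Ľudovít", "pf"), ("Štúr", "ps")])
for line in stur.describe():
    print(line)
```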