Search instructions

SEARCHING THE SLOVAK NATIONAL CORPUS

1. SNC data searched with NoSketch Engine

The tool for searching the Slovak National Corpus is the NoSketch Engine. Originally, the corpus manager Manatee was used with the client Bonito, developed at the Faculty of Informatics of Masaryk University in Brno. The NoSketch Engine web interface with the SNC data is accessible at https://bonito.korpus.sk. To use it, you must register first.

2. Simple search with no registration – web interface

Simple search via the web interface requires no registration, but its possibilities are limited: only a few basic corpora are available (prim-6.0-public-all, r-mak-3.0, and others), with no option to retrieve statistical or other data. You must read and sign the SNC Conditions of Use before searching the corpora.

HOW TO CITE THE CORPUS

The SNC versions and subcorpora, as well as individual sources contained therein, need to be cited according to the following instructions.

SNC TEXT ANNOTATION TYPES AND TAGS

SELECTION OF THE MOST COMMONLY USED SEARCH METHODS

CQL queries use metacharacters inside attribute values. A query for a single token always has the form [attribute="searched_token"], e.g., [lemma="head"]. Attributes can also be combined and their values written as regular expressions, e.g., [word=".*able" & tag!="A.*"] (all word forms ending in -able that are not adjectives).
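The single-token query form can be illustrated with a minimal Python sketch. It does not call the corpus manager: the token list and its one-letter tags are made-up illustrative values, and the helper name `matches` is hypothetical. Python's `re.fullmatch` stands in for the corpus regular-expression engine, on the assumption that a pattern has to match the whole attribute value.

```python
import re

# Toy tokens; the one-letter tags are simplified, illustrative values,
# not real corpus annotations.
tokens = [
    {"word": "table", "lemma": "table", "tag": "S"},
    {"word": "capable", "lemma": "capable", "tag": "A"},
    {"word": "cable", "lemma": "cable", "tag": "S"},
    {"word": "enable", "lemma": "enable", "tag": "V"},
    {"word": "head", "lemma": "head", "tag": "S"},
]


def matches(token, conditions):
    """Check one token against (attribute, pattern, must_match) triples joined by AND."""
    return all(
        (re.fullmatch(pattern, token[attribute]) is not None) == must_match
        for attribute, pattern, must_match in conditions
    )


# Counterpart of [word=".*able" & tag!="A.*"]
query = [("word", ".*able", True), ("tag", "A.*", False)]
hits = [token["word"] for token in tokens if matches(token, query)]
print(hits)
```

The adjective "capable" is excluded by the negated tag condition, and "head" by the word pattern.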

In the corpus, a character means any character other than a space. The DOT metacharacter therefore replaces any single character, including digits, punctuation marks, parentheses, etc.

The following examples apply to Bonito I, NoSketch Engine, and Sketch Engine.

Metacharacter: .
Meaning: The DOT replaces any single character.
Usage: home....
Expected result: homework, homebody

Metacharacter: *
Meaning: The ASTERISK specifies that the character before it is repeated any number of times (including zero times).
Usage: oh*
Expected result: o, oh, ohh, ohhh

Metacharacter: +
Meaning: The PLUS specifies that the character before it is repeated once or several times.
Usage: oh+
Expected result: oh, ohh, ohhh…

Metacharacter: { }
Meaning: CURLY BRACKETS. The number inside the curly brackets determines how many times the character or expression before the brackets is repeated.
Usage: home.{4}
Expected result: homemade, homework,…

Metacharacter: {m,n}
Meaning: The curly brackets may also contain an interval giving the minimum and maximum number of repetitions of the character before them.
Usage: .{5,10}; oh{1,4}
Expected result: any tokens consisting of 5 to 10 characters (abced, abcedf, zzzzzzzzzz); oh, ohh, ohhh, ohhhh

Metacharacter: |
Meaning: The VERTICAL LINE functions as the operator OR.
Usage: home|homework
Expected result: home, homework

Metacharacter: [ ]
Meaning: SQUARE BRACKETS. They define a set of characters, any one of which may occur at the position of the brackets. The characters in the set are listed without separating commas, or they are defined as an interval, e.g. a-z (the interval covers a sequence of characters without diacritics).
Usage: [tsc]on; [c-t]out
Expected result: ton, son, con; cout, dout, eout,… sout, tout (if such tokens occur in the corpus)

Metacharacter: ( )
Meaning: PARENTHESES (round brackets) group a part of a regular expression, typically a set of alternatives, so that it applies to one position in the search term.
Usage: (H|h)ouse; ([Cc]|[Mm])at
Expected result: House, house; Cat, cat, Mat, mat

Metacharacter: (?i)
Meaning: This expression makes the search ignore case.
Usage: (?i)house
Expected result: House, house, houSe, HOUse, houSE, HOusE,…

Metacharacter: \
Meaning: The BACKSLASH before a metacharacter means that the search engine treats that character as literal text, not as part of a regular expression.
Usage: subst\.
Expected result: subst. only; the dot after the backslash means a literal dot and no other character (without the backslash, the pattern would also match substa, substb,…)

Metacharacter: ?
Meaning: The QUESTION MARK means zero or one occurrence of the character preceding it.
Usage: s?harp
Expected result: sharp, harp

Metacharacter: ^
Meaning: The CARET, placed first inside square brackets, means that the characters listed after it must not occur at the given position.
Usage: SSfs[^2].*
Expected result: The 2 must not follow "s", so the result is all feminine singular nouns except the genitive forms, i.e. SSfs1, SSfs3, SSfs4, SSfs5, SSfs6, SSfs7; theoretically, if such tags existed, it could also be e.g. SSfsA, SSfsaBBBB,…

Metacharacter: &
Meaning: The AMPERSAND expresses the AND (AT THE SAME TIME) function, which makes it possible to define several conditions concurrently.
Usage: [tag="SAms4" & lemma=".*ci"]
Expected result: all nouns (S) with the adjectival paradigm (A), masculine animate gender (m), singular (s), accusative (4), whose lemma ends in -ci, e.g. domáceho, kupujúceho, vedúceho (the lemmas are domáci, kupujúci, vedúci); see https://korpus.juls.savba.sk/subst_en.html
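
The behaviour of the individual metacharacters can be tried out offline with a short Python sketch. This is only an approximation under stated assumptions: Python's `re` module stands in for the corpus regular-expression engine, `re.fullmatch` is used so that a pattern must cover the whole token, and the word list and the helper `search` are hypothetical.

```python
import re

# Toy word list standing in for corpus tokens (not real corpus data).
words = [
    "o", "oh", "ohh", "ohhh", "ohhhhh",
    "home", "homework", "homebody",
    "House", "house", "HOUSE",
    "sharp", "harp", "subst.", "substa",
    "ton", "son", "con", "don",
]


def search(pattern):
    """Return the words that the pattern matches as a whole token."""
    return [w for w in words if re.fullmatch(pattern, w)]


patterns = [
    r"home....",       # dot: exactly four arbitrary characters after "home"
    r"oh*",            # asterisk: zero or more "h"
    r"oh+",            # plus: one or more "h"
    r"oh{1,4}",        # interval: one to four "h"
    r"home|homework",  # vertical line: OR
    r"[tsc]on",        # square brackets: one character from the set
    r"[^s]on",         # caret inside brackets: any character except "s"
    r"(H|h)ouse",      # parentheses: grouped alternatives
    r"(?i)house",      # ignore case
    r"s?harp",         # question mark: optional "s"
    r"subst.",         # unescaped dot: any character
    r"subst\.",        # escaped dot: a literal dot only
]

for pattern in patterns:
    print(f"{pattern:15} -> {search(pattern)}")
```

Comparing the last two patterns shows the effect of the backslash: the unescaped dot also accepts "substa", the escaped one does not.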

1. Metacharacter combinations

Combination: .*
Meaning: The DOT ASTERISK combination replaces any character any number of times, including zero. The search for .*am returns the words ending in -am, but also the word am itself, e.g. am, madam, Madam, Adam, Sam, tram, Tram, Osram, RAM, dram…

Combination: .+
Meaning: The DOT PLUS combination replaces at least one arbitrary character and is used when searching for words with a certain prefix, suffix, letter group, etc. The search for still.+ returns all the words beginning with still- (except for the word still itself). A regular expression may be used anywhere in the search term. For example, un.+ed finds all words beginning with un- and ending in -ed (except for the word uned). Conversely, .*man.* returns all the words containing the base "man". Modifying this entry to .*ma(n|t).* makes the search engine also find words containing an alternation in the given base (for example, the words mat, man, mansion, many, matchball, Emanuel, snowman, tomato, automatically, performance).
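
The difference between the two combinations (zero or more characters versus at least one) can be checked with a small Python sketch. As before, `re.fullmatch` only approximates whole-token matching in the corpus, and the word list and the helper `search` are hypothetical.

```python
import re

# Toy word list (not real corpus data).
words = [
    "am", "madam", "Adam", "tram",
    "still", "stillness",
    "uned", "unused", "undecided",
    "man", "mat", "snowman", "tomato", "performance", "moon",
]


def search(pattern):
    """Return the words that the pattern matches as a whole token."""
    return [w for w in words if re.fullmatch(pattern, w)]


ending_am = search(r".*am")         # ".*" may be empty, so "am" itself is found
after_still = search(r"still.+")    # ".+" needs at least one character, so "still" is not
un_ed = search(r"un.+ed")           # "uned" has nothing between "un" and "ed"
man_or_mat = search(r".*ma(n|t).*") # alternation inside the base

for name, result in [(".*am", ending_am), ("still.+", after_still),
                     ("un.+ed", un_ed), (".*ma(n|t).*", man_or_mat)]:
    print(name, "->", result)
```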

2. Conditions used in the corpus search

2.1. within

Example: [tag="S.*"]{2} within [tag="V.*"][]*[tag="V.*"]
Meaning: Two nouns in immediate succession within a span of tokens delimited by two verbs.
Expected result: …they invited my wife Alice to a garden party to help the girls…

Example: [lemma="zelený"] within <doc auth="Vincent Šikula"/>
Meaning: All occurrences of the lemma "zelený" in the works of Vincent Šikula.
Expected result: e.g. Aký je zelený , — divili sa chlapci .

Example: [lemma="hlava"][lemma="deravý"] within <s/>[]*</s>
Meaning: Displays collocations of the two lemmas "hlava" and "deravý" within a sentence (only the searched tokens are highlighted in a different color).
Expected result: e.g.
Každý má na hlave deravý klobúk a pred sebou šálku, z ktorej stúpa riedky dym.
Veru tak, hlava opitá, hlava deravá!
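
The effect of restricting a token sequence to one sentence can be sketched in Python as follows. This is a simplified model, not the corpus manager's implementation: sentences are plain lists of (word, lemma) pairs, the lemmas are hand-assigned for illustration, and the function name `within_sentence` is hypothetical.

```python
# Each sentence is a list of (word, lemma) pairs; lemmas are hand-assigned
# for illustration and do not come from the corpus.
sentences = [
    [("Veru", "veru"), ("tak", "tak"), (",", ","), ("hlava", "hlava"),
     ("opitá", "opitý"), (",", ","), ("hlava", "hlava"),
     ("deravá", "deravý"), ("!", "!")],
    [("Bolí", "bolieť"), ("ma", "ja"), ("hlava", "hlava")],
    [("Deravý", "deravý"), ("klobúk", "klobúk"), ("leží", "ležať")],
]


def within_sentence(sentences, lemma_sequence):
    """Find the lemma sequence, never letting a match cross a sentence boundary."""
    hits = []
    size = len(lemma_sequence)
    for sentence_index, sentence in enumerate(sentences):
        lemmas = [lemma for _, lemma in sentence]
        for start in range(len(lemmas) - size + 1):
            if lemmas[start:start + size] == lemma_sequence:
                matched_words = [word for word, _ in sentence[start:start + size]]
                hits.append((sentence_index, matched_words))
    return hits


hits = within_sentence(sentences, ["hlava", "deravý"])
print(hits)
```

The second sentence ends with "hlava" and the third begins with "Deravý", but that pair is not reported, because the match would span two sentences.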

2.2. containing

Example: <s/>[]*</s> containing [lemma="hlava"][lemma="deravý"]
Meaning: Displays complete sentences that contain the lemmas "hlava" and "deravý" in immediate succession.
Expected result: e.g. Sňal si z hlavy deravý slamený širák , zotrel z čela pot .

Example: [tag="V.*"][]{5} [tag="V.*"] containing [tag="S.*"]{3}
Meaning: Displays entire 7-token phrases that begin and end with a verb and contain a group of three nouns in immediate succession.
Expected result: e.g. vybral z vrecka balíček cigariet a podal
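
The second example (a verb, five arbitrary tokens, a verb, with three consecutive nouns somewhere inside) can be modelled in Python as a sliding window. This is a rough sketch only: the one-letter tags are simplified illustrative values, and the function `verb_span_containing` and its parameters are hypothetical.

```python
# Toy tokens as (word, tag) pairs; the one-letter tags are simplified,
# illustrative values rather than real corpus annotations.
tokens = [
    ("Potom", "D"), ("vybral", "V"), ("z", "E"), ("vrecka", "S"),
    ("balíček", "S"), ("cigariet", "S"), ("a", "O"), ("podal", "V"),
    ("mi", "P"), ("ich", "P"), (".", "Z"),
]


def verb_span_containing(tokens, inner_tag="S", inner_len=3, gap=5):
    """Return windows V + gap tokens + V that contain inner_len consecutive inner_tag tokens."""
    hits = []
    width = gap + 2  # two verbs plus the tokens between them
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        tags = [tag for _, tag in window]
        if not (tags[0].startswith("V") and tags[-1].startswith("V")):
            continue
        has_inner_group = any(
            all(tag.startswith(inner_tag) for tag in tags[i:i + inner_len])
            for i in range(width - inner_len + 1)
        )
        if has_inner_group:
            hits.append(" ".join(word for word, _ in window))
    return hits


hits = verb_span_containing(tokens)
print(hits)
```

The whole outer span is returned, not just the noun group, which is the point of the containing condition.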

2.3. meet

Example: (meet [tag="S.*"] [tag="VL.*"] -3 3)
Meaning: Displays nouns that have a verb in the past tense within three positions to the left or right (see also https://korpus.juls.savba.sk/subst_en.html and https://korpus.juls.savba.sk/verb_en.html).

2.4. union

Example: (union (meet [lemma="hovoriť"] [lemma="pravda"] -4 4) (meet [lemma="vysloviť"] [lemma="lož"] -4 4))
Meaning: Combines two collocation searches made with the meet condition using the OR function: the lemma "hovoriť" with "pravda" within four positions, or the lemma "vysloviť" with "lož" within four positions. Only the "hovoriť" or "vysloviť" lemma is displayed as the result.
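
The way meet and union work together can be approximated in Python. The sketch below is a simplified model under stated assumptions: tokens are (word, lemma) pairs with hand-assigned lemmas, sentence boundaries are ignored, and the function names `meet` and `union` merely mirror the query keywords; they are not calls into the corpus manager.

```python
# Toy tokens as (word, lemma) pairs; lemmas are hand-assigned for illustration.
tokens = [
    ("Vždy", "vždy"), ("hovoril", "hovoriť"), ("iba", "iba"),
    ("pravdu", "pravda"), (".", "."),
    ("Raz", "raz"), ("vyslovil", "vysloviť"), ("veľkú", "veľký"),
    ("lož", "lož"), (".", "."),
    ("Potom", "potom"), ("dlho", "dlho"), ("ešte", "ešte"),
    ("stále", "stále"), ("hovoril", "hovoriť"), (".", "."),
]


def meet(tokens, first_lemma, second_lemma, left, right):
    """Positions of first_lemma with second_lemma between `left` and `right` tokens away."""
    hits = []
    for i, (_, lemma) in enumerate(tokens):
        if lemma != first_lemma:
            continue
        start = max(0, i + left)
        end = min(len(tokens), i + right + 1)
        if any(tokens[j][1] == second_lemma for j in range(start, end) if j != i):
            hits.append(i)
    return hits


def union(*hit_lists):
    """OR: merge the positions found by several meet searches."""
    return sorted(set().union(*hit_lists))


positions = union(
    meet(tokens, "hovoriť", "pravda", -4, 4),
    meet(tokens, "vysloviť", "lož", -4, 4),
)
print([tokens[i][0] for i in positions])
```

Only the first lemma of each pair is reported, and the second "hovoril" is left out because no "pravda" occurs within four positions of it.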

3. General conditions used in the Sketch Engine

Example: 1:[] 2:[] & 1.tag = 2.tag
Meaning: All pairs of adjacent words whose morphological tags are identical.
Expected result: e.g. príliš automaticky, exkluzívne ekologické, až prakticky, celkom mimovoľne

Example: 1:[] 2:[] & 1.tag = 2.tag & f(1.tag) > 1000
Meaning: All pairs of adjacent words with the same morphological tag, where the frequency of the first word's tag in the given corpus must be greater than 1000.
Expected result: e.g. udržateľný ekonomický, Ježišom Kristom, alebo ako, aj keď
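
The two conditions above can be mimicked in Python on a toy token list. This is a sketch under stated assumptions: the one-letter tags are simplified illustrative values, the frequency is counted over the toy list itself rather than over a corpus, and the function `adjacent_same_tag` is hypothetical.

```python
from collections import Counter

# Toy tokens as (word, tag) pairs; the one-letter tags are simplified,
# illustrative values rather than real corpus annotations.
tokens = [
    ("quite", "D"), ("automatically", "D"), ("green", "A"),
    ("house", "S"), ("and", "O"), ("or", "O"), ("very", "D"),
]

# Counterpart of f(tag): how often each tag occurs in the (toy) corpus.
tag_frequency = Counter(tag for _, tag in tokens)


def adjacent_same_tag(tokens, min_frequency=0):
    """Pairs of neighbouring words with equal tags whose tag occurs more than min_frequency times."""
    return [
        (first_word, second_word)
        for (first_word, first_tag), (second_word, second_tag) in zip(tokens, tokens[1:])
        if first_tag == second_tag and tag_frequency[first_tag] > min_frequency
    ]


print(adjacent_same_tag(tokens))                   # 1:[] 2:[] & 1.tag = 2.tag
print(adjacent_same_tag(tokens, min_frequency=2))  # with an added frequency threshold
```

With the threshold set to 2, the pair tagged O is dropped because that tag occurs only twice in the toy list.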

Learn more about the search options in the Sketch Engine and the NoSketch Engine here.