Types of corpora

Corpora are divided according to:

1. language

– monolingual corpora are available for many languages (national corpora)

– bilingual and multilingual (parallel) corpora: original texts and their translations

2. language form

– besides corpora of written texts, there are also spoken corpora

3. size

– the earliest corpora, compiled up to 1975, contained fewer than 1 million word forms. At present, several corpora contain billions of words.

4. type of text

– corpora may be either general (not further specified) or specialized, depending on the source and scope of the linguistic question under study (a single-author corpus, a corpus of informal speech, a corpus of the most recent texts intended to capture neologisms, etc.)

5. mode of preservation (storage)

– corpora can either be preserved in plain text form or be lemmatised (each word form is accompanied by its base form) and morphologically, syntactically, semantically, or stylistically annotated.

6. date of origin

– synchronic corpora illustrate the contemporary state of a language; diachronic corpora document the stages of a language's evolution over time.
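
The difference between plain text storage and lemmatised, annotated storage (point 5) can be illustrated with a minimal sketch. The `Token` class, the three-token sample, and the tag labels below are hypothetical and hand-annotated for illustration; they do not follow any real tagset or corpus tool. The tab-separated layout with one token per line is only one possible way to store such annotation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str   # word form as it appears in the text
    lemma: str  # base (dictionary) form of the word
    tag: str    # illustrative morphological label, not a real tagset


# Hand-annotated toy sentence (assumption: annotation done manually).
SENTENCE = [
    Token("Corpora", "corpus", "NOUN.PL"),
    Token("were", "be", "VERB.PAST"),
    Token("annotated", "annotate", "VERB.PART"),
]


def to_plain_text(tokens):
    """Basic text form: only the word forms survive."""
    return " ".join(t.form for t in tokens)


def to_vertical(tokens):
    """Annotated form: one token per line as form, lemma, tag."""
    return "\n".join(f"{t.form}\t{t.lemma}\t{t.tag}" for t in tokens)


def forms_of(lemma, tokens):
    """Find every word form that shares the given base form."""
    return [t.form for t in tokens if t.lemma == lemma]


if __name__ == "__main__":
    print(to_plain_text(SENTENCE))
    print(to_vertical(SENTENCE))
    print(forms_of("be", SENTENCE))
```

A search by lemma, as in `forms_of`, is possible only in the annotated form; the plain text form would require listing every inflected variant by hand.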


From another point of view, corpora can be either representative or balanced. A representative corpus is a broad set of real examples of the use of a national language in all its forms and diversity (diverse text types, genres, various authors…). A balanced corpus is usually based on proportional sampling of the main text types; other parameters such as genre, domain, etc. are only recorded.
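
Proportional sampling of text types can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any particular corpus project: the text types (`fiction`, `journalism`, `academic`), their counts, and the function names are hypothetical, and fractional quotas are rounded with the largest-remainder method, which is one of several reasonable choices. Genre is carried along as metadata but plays no role in the selection.

```python
import random
from collections import defaultdict


def proportional_quotas(type_counts, sample_size):
    """Split sample_size across text types in proportion to their counts."""
    total = sum(type_counts.values())
    if sample_size > total:
        raise ValueError("sample_size exceeds the number of available texts")
    exact = {t: sample_size * n / total for t, n in type_counts.items()}
    quotas = {t: int(x) for t, x in exact.items()}  # round down first
    leftover = sample_size - sum(quotas.values())
    # Hand the remaining slots to the types with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda t: exact[t] - quotas[t], reverse=True)
    for t in by_remainder[:leftover]:
        quotas[t] += 1
    return quotas


def balanced_sample(texts, sample_size, seed=0):
    """texts: list of (text_type, genre, text_id); genre is recorded only."""
    by_type = defaultdict(list)
    for record in texts:
        by_type[record[0]].append(record)
    counts = {t: len(records) for t, records in by_type.items()}
    quotas = proportional_quotas(counts, sample_size)
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sample = []
    for t, quota in quotas.items():
        sample.extend(rng.sample(by_type[t], quota))
    return sample


if __name__ == "__main__":
    # Hypothetical population: 60 fiction, 30 journalism, 10 academic texts.
    population = (
        [("fiction", "novel", f"f{i}") for i in range(60)]
        + [("journalism", "news", f"j{i}") for i in range(30)]
        + [("academic", "article", f"a{i}") for i in range(10)]
    )
    chosen = balanced_sample(population, 10)
    print(sorted(text_type for text_type, _genre, _id in chosen))
```

With these hypothetical counts, a sample of 10 texts keeps the 6 : 3 : 1 ratio of the population.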