Corpus of Copywriting Texts on the Web

This specialized corpus was designed to collect copywriting texts (promotional and self-presentation texts) from the web. The corpus, named cw-2014-all, contains 1 648 229 tokens and was released in December 2014.

The corpus consists of 1 441 web pages from a total of 339 web domains that belong to large or small commercial enterprises or public institutions. Duplication of particular terms is due to the project objectives. Because the full texts were processed, duplicate content, e.g. sidebars or navigation bars, has not been removed.

The corpus was developed collaboratively by SNK and E. Jůnová from the Department of Mediamatics and Cultural Heritage, Faculty of Humanities, University of Žilina, during her research stay at SNK in November and December 2014.

The corpus has been lemmatized and morphologically annotated; the texts are accompanied by information about their source (web domain).