Corpus Factory

Warning

This publication does not fall under the Institute of Computer Science but under the Faculty of Informatics. The official page of the publication is on the muni.cz website.

Authors	KILGARRIFF Adam, REDDY Siva, POMIKÁLEK Jan
Year of publication	2009
Type	Article in proceedings
Faculty / MU workplace	Faculty of Informatics
Citation
www	http://www.kilgarriff.co.uk/Publications/2009-KilgReddyPomikalek-asialex-CorpFactory.doc
Description	State-of-the-art lexicography requires corpora, but for many languages there are no large, general-language corpora available. Until recently, all but the richest publishing houses could do little but shake their heads in dismay, as corpus-building was long, slow and expensive. But with the advent of the Web it can be highly automated and thereby fast and inexpensive. We have developed a ‘corpus factory’ where we build lexicographic corpora. In this paper we describe the method we use, how it has worked, and how various problems were solved, for five languages: Dutch, Hindi, Telugu, Thai and Vietnamese. The corpora we have developed are available for use in the Sketch Engine corpus query tool.
Related projects:	Intelligent models, algorithms, methods and tools for building the semantic web; Centre for Computational Linguistics; Means of creating a complex knowledge base for communication with the semantic web in natural language