Building a 70 billion word corpus of English from ClueWeb

Pomikálek, Jan; Rychlý, Pavel; Jakubíček, Miloš

Warning

This publication does not fall under the Institute of Computer Science but under the Faculty of Informatics. The official publication page is on the muni.cz website.

Authors	POMIKÁLEK Jan RYCHLÝ Pavel JAKUBÍČEK Miloš
Year of publication	2012
Type	Article in proceedings
Conference	Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
Faculty / MU unit	Faculty of Informatics
Citation
Web	http://nlp.fi.muni.cz/publications/lrec2012_xpomikal_pary_xjakub/lrec2012.pdf
Field	Informatics
Keywords	corpus; clueweb; English; encoding; word sketch
Attached files	lrec2012.pdf
Description	This work describes the creation of a 70 billion word text corpus of English. We used an existing language resource, the ClueWeb09 dataset, as the source of the corpus data. Processing such a vast amount of data presented several challenges, mainly in the pre-processing steps (boilerplate cleaning, text de-duplication) and the post-processing steps (indexing for efficient corpus querying with the Corpus Query Language, CQL). In this paper we explain how we tackled them: we describe the tools used for boilerplate cleaning (jusText) and for de-duplication (onion), which was performed not only on full (document-level) duplicates but also on near-duplicate texts. Moreover, we show the impact of each pre-processing step on the final corpus size. Finally, we show how the corpus indexing procedure was effectively parallelized within the Manatee corpus management system and during the computation of word sketches (one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour) from the resulting corpus.
Related projects:	Pattern Recognition-based Statistically Enhanced MT; Temporal Aspects of Knowledge and Information; LINDAT-Clarin Project: Building and Operating the Czech Node of the Pan-European Research Infrastructure