Practical Web Crawling for Text Corpora
Authors | |
---|---|
Year of publication | 2011 |
Type | Article in Proceedings |
Conference | Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2011 |
MU Faculty or unit | |
Citation | |
web | https://nlp.fi.muni.cz/raslan/2011/paper09.pdf |
Field | Informatics |
Keywords | crawler; web crawling; corpus; web corpus; text corpus |
Description | SpiderLing, a web spider for linguistics, is new software for creating text corpora from the web, which we present in this article. Many documents on the web contain only material that is not useful for text corpora, such as lists of links, lists of products, and other kinds of text not composed of full sentences. In fact, such pages represent the vast majority of the web. Unrestricted web crawls therefore typically download a lot of data that is filtered out during post-processing, which makes web corpus collection inefficient. The aim of our work is to focus crawling on the text-rich parts of the web and to maximize the number of words in the final corpus per downloaded megabyte. We present our preliminary results from creating web corpora of Czech and Tajik texts. |