arTenTen: a new, vast corpus for Arabic

Warning

This publication does not belong to the Institute of Computer Science but to the Faculty of Informatics. The official publication page is on the muni.cz website.

Authors	BELINKOV Yonatan, HABASH Nizar, KILGARRIFF Adam, ORDAN Noam, ROTH Ryan, SUCHOMEL Vít
Year of publication	2013
Type	Article in proceedings
Conference	Proceedings of WACL’2 Second Workshop on Arabic Corpus Linguistics
MU Faculty / Unit	Faculty of Informatics
Web	Workshop website; Book of abstracts
Keywords	Arabic corpus; Arabic Corpus Linguistics; MADA; Arabic Gigaword; Modern Standard Arabic
Description	We present arTenTen, a web-crawled corpus of Arabic gathered in 2012 and a member of the TenTen Corpus Family (Jakubíček et al. 2013). arTenTen comprises 5.8 billion words. It has been carefully cleaned, including duplicate removal, using the JusText and Onion tools (Pomikalek 2011). We are currently (May 2013) tokenising, lemmatising and part-of-speech tagging arTenTen with the leading MADA tool, version 3.2 (Habash and Rambow 2005; Habash et al. 2009). Once arTenTen is fully encoded, we will compare it with Arabic Gigaword and an earlier web-crawled corpus (Sharoff 2006). We also plan to explore arTenTen’s composition in relation to Modern Standard Arabic and the dialects, using, amongst other things, Buckwalter and Parkinson’s Frequency Dictionary (2011) and the keywords method presented in Kilgarriff (2012).
Related projects:	LINDAT-Clarin project - Establishment and operation of the Czech node of the pan-European research infrastructure