Set of Ethiopian Web Corpora

Suchomel, Vít; Rychlý, Pavel

Authors	SUCHOMEL Vít, RYCHLÝ Pavel
Year of publication	2016
Type	Software
MU Faculty or unit	Faculty of Informatics
web	http://habit-project.eu/wiki/SetOfEthiopianWebCorpora
Description	A set of 5 corpora for 4 Ethiopian languages: Amharic, Oromo, Somali and Tigrinya. The Amharic WIC corpus is a reprocessed version of an existing corpus with part-of-speech annotation. The released version has been cleaned (especially numeric expressions) and unifies two versions written in different scripts (Geez and SERA transliteration). The web corpora were built from Internet texts using automatic tools. They range in size from 2.5 million words (Tigrinya) to 80 million words (Somali).
Related projects:	Harvesting big text data for under-resourced languages