Set of Ethiopian Web Corpora

Warning

This publication does not fall under the Institute of Computer Science, but under the Faculty of Informatics. The official publication page is on the muni.cz website.

Authors	SUCHOMEL Vít, RYCHLÝ Pavel
Year of publication	2016
Type	Software
MU Faculty or unit	Faculty of Informatics
www	http://habit-project.eu/wiki/SetOfEthiopianWebCorpora
Description	A set of 5 corpora for 4 Ethiopian languages: Amharic, Oromo, Somali and Tigrinya. The Amharic WIC corpus is a reprocessed version of an existing corpus with part-of-speech annotation. The released version has been cleaned (especially numeric expressions) and unifies two versions written in different scripts (Geez and SERA transliteration). The web corpora were built from Internet texts using automatic tools. They range in size from 2.5 million words (Tigrinya) to 80 million words (Somali).
Related projects:	Harvesting big text data for under-resourced languages