Towards 100M Morphologically Annotated Corpus of Tajik

Dovudov, Gulshan; Suchomel, Vít; Šmerk, Pavel

Authors	DOVUDOV Gulshan, SUCHOMEL Vít, ŠMERK Pavel
Year of publication	2012
Type	Article in Proceedings
Conference	Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2012
MU Faculty or unit	Faculty of Informatics
Citation
web	https://nlp.fi.muni.cz/raslan/2012/paper15.pdf
Field	Linguistics
Keywords	web corpora; Tajik
Description	The paper presents work in progress: building a morphologically annotated corpus of the Tajik language with more than 100 million tokens. The corpus is, and will be, by far the largest available computer corpus of Tajik: even its current size is almost 85 million tokens. Because the available text sources are rather scarce, texts of lower quality also have to be included to reach this goal. This short paper briefly reviews the current state of the corpus and the analyzer, discusses problems with either “normalization” or at least categorization of low-quality texts, and finally outlines the prospects for the near future.
Related projects:	LINDAT-Clarin project - Building and operation of the Czech node of the pan-European infrastructure for research