POS Annotated 50M Corpus of Tajik Language

Dovudov, Gulshan; Suchomel, Vít; Šmerk, Pavel

Authors	DOVUDOV Gulshan, SUCHOMEL Vít, ŠMERK Pavel
Year of publication	2012
Type	Article in Proceedings
Conference	Proceedings of the Workshop on Language Technology for Normalisation of Less-Resourced Languages (SALTMIL 8/AfLaT 2012)
MU Faculty or unit	Faculty of Informatics
Web	http://www.cnts.ua.ac.be/sites/default/files/saltmil8-aflat2012.pdf
Field	Informatics
Keywords	Tajik language; Tajik corpus; morphological analysis of Tajik
Description	The paper presents by far the largest available computer corpus of the Tajik language, comprising more than 50 million words. Two different approaches were used to obtain the texts for the corpus, and the paper describes both. It then describes a newly developed morphological analyzer of Tajik and presents some statistics on its application to the corpus.
Related projects	LINDAT-Clarin project - Building and operation of the Czech node of the pan-European infrastructure for research