POS Annotated 50M Corpus of Tajik Language

Warning

This publication does not belong to the Institute of Computer Science but to the Faculty of Informatics. The official page of the publication is on the muni.cz website.

Authors	DOVUDOV Gulshan, SUCHOMEL Vít, ŠMERK Pavel
Year of publication	2012
Type	Article in proceedings
Conference	Proceedings of the Workshop on Language Technology for Normalisation of Less-Resourced Languages (SALTMIL 8/AfLaT 2012)
MU Faculty or unit	Faculty of Informatics
Citation
www	http://www.cnts.ua.ac.be/sites/default/files/saltmil8-aflat2012.pdf
Field	Informatics
Keywords	Tajik language; Tajik corpus; morphological analysis of Tajik
Description	The paper presents by far the largest available computer corpus of the Tajik language, comprising more than 50 million words. Two different approaches were used to obtain the texts for the corpus, and the paper describes both. It then describes a newly developed morphological analyzer of Tajik and presents some statistics from its application to the corpus.
Related projects:	LINDAT-Clarin project - Building and operating the Czech node of the pan-European research infrastructure