Large Corpora for Turkic Languages and Unsupervised Morphological Analysis

Authors

BAISA Vít, SUCHOMEL Vít

Year of publication 2012
Type Article in Proceedings
Conference Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
MU Faculty or unit

Faculty of Informatics

Citation
Web http://www.lrec-conf.org/proceedings/lrec2012/workshops/02.Turkic%20Languages%20Proceedings.pdf
Field Linguistics
Keywords corpus; Turkic languages; unsupervised morphological analysis
Description In this article we describe six new web corpora for the Turkish, Azerbaijani, Kazakh, Turkmen, Kyrgyz and Uzbek languages. The data for these corpora were crawled automatically from the web by SpiderLing. Only minimal knowledge of these languages was required to obtain the data in raw form. The corpora are tokenized only, since morphological analyzers and disambiguators are not available for these languages (except for Turkish). A subsequent experiment with unsupervised morphological segmentation was carried out on the Turkish corpus and achieved encouraging results. For evaluation we used the data provided for the MorphoChallenge competition.