LEMPAS: A Make-Do Lemmatizer for the Swedish PAROLE-Corpus




Year of publication 2006
Type Article in Periodical
Magazine / Source Prague Bulletin of Mathematical Linguistics
MU Faculty or unit Faculty of Informatics

Field Informatics
Keywords LEMPAS; PAROLE; Swedish; lemmatizer; rule-based
Description LEMPAS, the lemmatizer for the Swedish corpus PAROLE, came into existence as a by-product of running the Sketch Engine (Kilgarriff et al.) on Swedish, since many of the desirable features of the Sketch Engine, such as building word sketches, are only available for lemmatized corpora. We did not have access to any Swedish lexical sources and the time allowed for the lemmatization was very limited. Consequently, the lemmatizer had no great design ambitions. Initially, we were only attempting to bring related forms together under a pre-lemma, using general rules, and avoiding explicit lists where possible. When the initial rules gave surprisingly good lemmatizations of nouns, verbs and adjectives, we decided to transform the pre-lemmas into real lemmas. The improved lemmatizer made a very good impression. We have tested the program on the manually lemmatized Stockholm-Umeå Corpus (SUC), and have analyzed the results.
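
The idea of bringing related forms together under a pre-lemma with general rules can be illustrated with a minimal sketch. The suffix list, the minimum stem length, and the function names below are assumptions made for illustration only; they are not the rules used in LEMPAS, and they cover just a few regular Swedish noun endings.

```python
from collections import defaultdict

# Hypothetical suffix rules, longest first, so that "orna" is tried before "a".
# These are illustrative only and are NOT taken from LEMPAS.
SUFFIXES = ["orna", "arna", "erna", "or", "ar", "er", "en", "an", "a"]

# Assumed guard: never strip a suffix if fewer than 3 characters would remain.
MIN_STEM = 3


def pre_lemma(word):
    """Return a pre-lemma by stripping the first matching suffix, if any."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= MIN_STEM:
            return w[: -len(suffix)]
    return w


def group_forms(words):
    """Group word forms that share the same pre-lemma."""
    groups = defaultdict(set)
    for w in words:
        groups[pre_lemma(w)].add(w.lower())
    return dict(groups)


forms = ["flicka", "flickan", "flickor", "flickorna",
         "bil", "bilen", "bilar", "bilarna"]
for stem, members in sorted(group_forms(forms).items()):
    print(stem, sorted(members))
```

In this sketch the pre-lemma is only a grouping key ("flick", "bil"), not necessarily a real word; turning such keys into real lemmas (for example "flicka") would need a further step, which the sketch does not attempt.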