A Comparative Study of Text Retrieval Models on DaReCzech

Varování

Publikace nespadá pod Ústav výpočetní techniky, ale pod Fakultu informatiky. Oficiální stránka publikace je na webu muni.cz.
Autoři

ŠTETINA Jakub FAJČÍK Martin ŠTEFÁNIK Michal HRADIS Michal

Rok publikování 2024
Druh Článek ve sborníku
Konference The Eighteenth Workshop on Recent Advances in Slavonic Natural Language Processing
Fakulta / Pracoviště MU

Fakulta informatiky

Citace
Klíčová slova Information Retrieval; Evaluation; Comparison; Czech Language; Performance Assessment; Document Retrieval; Model Analysis
Popis This article presents a comprehensive evaluation of off-the-shelf document retrieval models: Splade, Plaid, Plaid-X, SimCSE, Contriever, OpenAI ADA and Gemma2 chosen to determine their performance on the Czech retrieval dataset DaReCzech. The primary objective of our experiments is to estimate the quality of modern retrieval approaches in the Czech language. Our analyses include retrieval quality, speed, and memory footprint. Secondly, we analyze whether it is better to use the model directly in Czech text, or to use machine translation into English, followed by retrieval in English. Our experiments identify the most effective option for Czech information retrieval. The findings revealed notable performance differences among the models, with Gemma22 achieving the highest precision and recall, while Contriever performing poorly. Conclusively, SPLADE and PLAID models offered a balance of efficiency and performance.
Související projekty:

Používáte starou verzi internetového prohlížeče. Doporučujeme aktualizovat Váš prohlížeč na nejnovější verzi.

Další info