ScaleText: The Design of a Scalable, Adaptable and User-Friendly Document System for Similarity Searches: Digging for Nuggets of Wisdom in Text

Rygl, Jan; Sojka, Petr; Růžička, Michal; Řehůřek, Radim

Authors	RYGL Jan, SOJKA Petr, RŮŽIČKA Michal, ŘEHŮŘEK Radim
Year of publication	2016
Type	Article in Proceedings
Conference	Proceedings of the Tenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2016
MU Faculty or unit	Faculty of Informatics
web	Workshop homepage, preprint
Field	Informatics
Keywords	ScaleText; vector space modelling; Latent Semantic Indexing; LSI; machine learning; scalable search; search system design; text mining
Description	This paper describes the design of ScaleText, a new system for scalable semantic indexing of heterogeneous textual corpora. We discuss the design decisions that led to a modular system architecture for indexing and searching using semantic vectors of document segments (nuggets of wisdom). The prototype implementation is evaluated by applying Latent Semantic Indexing (LSI) to the Enron corpus, and the Bpref measure is used to automate the comparison of different algorithms and system configurations.
Related projects:	Research in Applied Informatics at FI MU; Intelligent Software for Semantic Document Search
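
The description names Bpref as the measure used to compare algorithms and configurations automatically. As a rough illustration only, the following sketch computes Bpref for a single query in the commonly used trec_eval-style formulation, where unjudged documents are ignored. It is not taken from the paper: the function name `bpref`, its parameters, and the document identifiers are hypothetical.

```python
def bpref(ranking, relevant, nonrelevant):
    """Bpref for one query (trec_eval-style variant, an assumption here).

    ranking     -- list of document ids in retrieved order
    relevant    -- set of ids judged relevant
    nonrelevant -- set of ids judged non-relevant
    Documents in neither set are unjudged and are skipped.
    """
    num_rel, num_nonrel = len(relevant), len(nonrelevant)
    if num_rel == 0:
        return 0.0
    bound = min(num_rel, num_nonrel)
    nonrel_seen = 0   # judged non-relevant docs ranked so far
    total = 0.0
    for doc in ranking:
        if doc in nonrelevant:
            nonrel_seen += 1
        elif doc in relevant:
            if bound == 0:
                # No judged non-relevant docs: nothing can outrank this one.
                total += 1.0
            else:
                # Penalize by the share of judged non-relevant docs ranked higher.
                total += 1.0 - min(nonrel_seen, bound) / bound
    return total / num_rel


# Hypothetical judgments: d3 is unjudged and does not affect the score.
score = bpref(["d1", "d2", "d3", "d4", "d5"], {"d1", "d4"}, {"d2", "d5"})
print(score)
```

Because only judged documents count, a measure of this kind remains usable when relevance judgments are incomplete, which suits automated comparison of many configurations.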