Abstract—The current study proposes to compare document
retrieval precision under two language modeling techniques,
stemming and lemmatization.
Stemming is a procedure to reduce all words with the same stem
to a common form, whereas lemmatization removes inflectional
endings and returns the base or dictionary form of a word.
Both techniques were also compared against a baseline ranking
algorithm (i.e., with no language processing). A search engine
was developed and the algorithms were evaluated on a test
collection. Both mean average precision and histograms
indicate that stemming and lemmatization outperform the
baseline algorithm. Between the two language modeling
techniques, lemmatization produced better precision than
stemming; however, the differences are insignificant. Overall,
the findings suggest that language modeling techniques
improve document retrieval, with lemmatization producing the
best result.
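As a rough illustration of the terms used in the abstract, the Python sketch below contrasts a toy suffix-stripping stemmer with a toy dictionary-based lemmatizer, and computes mean average precision in its common textbook form. The suffix list, the lemma table, and all function names are hypothetical and chosen for illustration only; they are not the algorithms, data, or evaluation code used in the study.

```python
# Toy illustration only. SUFFIXES and LEMMAS are hypothetical examples,
# not the stemmer, lemmatizer, or test collection used in the study.
SUFFIXES = ("ing", "ies", "es", "ed", "s")  # checked in this order

LEMMAS = {
    "studies": "study",
    "studying": "study",
    "better": "good",
    "mice": "mouse",
}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look up the dictionary form; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query.

    Sums precision at each rank where a relevant document appears and
    divides by the total number of relevant documents (assumed definition).
    """
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    for word in ("studies", "studying", "mice"):
        print(word, toy_stem(word), toy_lemmatize(word))
    print(average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}))
```

In this toy setup the stemmer maps "studies" and "studying" to different strings ("stud" and "study"), while the lemmatizer maps both to "study". This is one way the two techniques can lead to different query-document matches.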
Index Terms—Document retrieval, language models,
lemmatization, stemming.
The authors are with the Faculty of Computer Science and Information
Systems, University of Malaya, Kuala Lumpur, Malaysia (e-mail:
vimala.balakrishnan@um.edu.my, ethel_lloyd@siswa.um.edu.my).
Cite: Vimala Balakrishnan and Ethel Lloyd-Yemoh, "Stemming and Lemmatization: A Comparison of Retrieval Performances," Lecture Notes on Software Engineering, vol. 2, no. 3, pp. 262-267, 2014.