A software tool for building a statistical prefix processor
Date
2012
Abstract
Information Retrieval and Text Classification applications need to match words in the user's input against words in the documents of a text collection. Matching words is not trivial, because words have grammatical (inflectional and derivational) variations. There are two main approaches to matching inflected words: stemming (removing word suffixes drawn from an ad hoc list of suffixes) and lemmatizing (replacing the inflected form with the base form of the word). However, both approaches normalize word variations on the rightmost side of the word. We claim that it is also beneficial to normalize words on the left side, by removing word prefixes. In this report, we present the architecture and operation of a software tool that can serve as the first stage of a Statistical Prefix Processor, a system that could effectively remove prefixes from words and act as a preprocessing stage for text analysis applications. The tool comprises two stages (subtools). During the first stage, possible prefixes of words within a collection of texts are identified. During the second stage, a number of users (native speakers) process the text collection: they automatically locate the words that contain each stem and characterize the prefixes used with each stemmed word. After all users have processed the text collection, statistical conclusions can be drawn for each stemmed word and its associated prefixes. Copyright 2012 ACM.
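The two stages described above can be illustrated with a minimal Python sketch. It is not the tool's actual implementation, which the abstract does not specify. It assumes a simple heuristic for the first stage (a leading string counts as a candidate prefix when removing it leaves another word of the collection) and a plain vote count for the second. The function names, length thresholds, and label values are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def candidate_prefixes(texts, min_prefix=2, max_prefix=5, min_stem=3):
    """Stage 1 (assumed heuristic): propose prefixes found in a text collection.

    A leading string is proposed as a prefix when stripping it from a word
    leaves a remainder that is itself a word of the collection.
    Returns a mapping: prefix -> set of (word, remainder) pairs.
    """
    vocab = set()
    for text in texts:
        vocab.update(re.findall(r"[a-z]+", text.lower()))

    support = defaultdict(set)
    for word in vocab:
        for k in range(min_prefix, max_prefix + 1):
            prefix, rest = word[:k], word[k:]
            # Require a reasonably long remainder to limit spurious splits.
            if len(rest) >= min_stem and rest in vocab:
                support[prefix].add((word, rest))
    return support


def tally_judgments(judgments):
    """Stage 2 (assumed format): aggregate user characterizations.

    `judgments` is an iterable of (user, stem, prefix, label) tuples, where
    `label` is whatever category a native speaker assigned to the prefix.
    Returns a mapping: (stem, prefix) -> Counter of labels.
    """
    counts = defaultdict(Counter)
    for _user, stem, prefix, label in judgments:
        counts[(stem, prefix)][label] += 1
    return counts


# Hypothetical usage on a toy collection.
texts = [
    "We had to rewrite and write the unhappy report",
    "A happy user can undo what others do",
]
for prefix, pairs in sorted(candidate_prefixes(texts).items()):
    print(prefix, sorted(pairs))
```

The minimum remainder length keeps very short remainders (such as "do" in "undo") from producing candidates. A real system would need a stronger filter, since a word such as "report" would also be split if "port" occurred in the collection.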