Improving Hierarchical Short Text Clustering through Dominant Feature Learning
Date
2022
Language
en
Subject
Abstract
This paper addresses the problem of short text clustering. Because short text documents typically exhibit high data sparseness and high dimensionality, this problem is generally considered more challenging than traditional clustering scenarios. Our proposed solution, named VEPH, is based on a novel, recently published algorithm that aims at optimally clustering short text documents. VEPH comprises two stages. During the first stage, the original text vectors are projected onto a lower-dimensional space, and documents whose projection vectors lie in the same dimensional space are grouped into the same cluster. The second stage is a refinement process that attempts to improve the quality of the clusters generated during the first stage. The quality of a cluster is determined by its homogeneity and completeness, which are the two primary design criteria of this stage. Initially, VEPH cleanses the clusters by removing all dissimilar elements; it then iteratively merges similar clusters in a hierarchical agglomerative manner. The proposed algorithm has been experimentally evaluated in terms of F1 and NMI on three datasets with diverse attributes. The results demonstrated its superiority over other state-of-the-art works in the relevant literature. © 2022 World Scientific Publishing Company.
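The refinement stage described in the abstract (cleansing, then agglomerative merging) can be illustrated with a minimal sketch. This is not the VEPH implementation: the cosine similarity measure, the summed-vector centroid, the two thresholds, the choice to turn removed documents into singleton clusters, and all function names (`vec`, `cleanse`, `merge`) are assumptions made for illustration only. The projection stage is omitted, and the initial clusters are given by hand.

```python
import math
from collections import Counter

def vec(text):
    """Bag-of-words vector of a short text (hypothetical helper)."""
    return Counter(text.split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(docs):
    """Sum of a cluster's vectors; only its direction matters for cosine."""
    total = Counter()
    for d in docs:
        total.update(d)
    return total

def cleanse(clusters, min_sim):
    """Detach documents that are dissimilar to their cluster centroid.

    Assumption: detached documents become singleton clusters.
    """
    result = []
    for docs in clusters:
        c = centroid(docs)
        kept = [d for d in docs if cosine(d, c) >= min_sim]
        removed = [d for d in docs if cosine(d, c) < min_sim]
        if kept:
            result.append(kept)
        result.extend([d] for d in removed)
    return result

def merge(clusters, min_sim):
    """Repeatedly merge the most similar pair of clusters until no pair
    of centroids reaches min_sim (agglomerative, bottom-up)."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        best, pair = min_sim, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(centroid(clusters[i]), centroid(clusters[j]))
                if s >= best:
                    best, pair = s, (i, j)
        if pair is None:
            break
        i, j = pair
        clusters[i].extend(clusters.pop(j))
    return clusters

# Hand-made initial clustering with one misplaced document.
initial = [
    [vec("apple iphone case"), vec("apple iphone cover"),
     vec("samsung galaxy phone")],
    [vec("samsung galaxy case")],
]
cleansed = cleanse(initial, min_sim=0.6)   # illustrative threshold
refined = merge(cleansed, min_sim=0.5)     # illustrative threshold
print([len(c) for c in refined])
```

In this toy input, cleansing detaches the misplaced "samsung galaxy phone" document, and merging then joins it with the other Samsung document, leaving two clusters of two documents each. Cleansing targets homogeneity, while merging targets completeness.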
Related items
Showing items related by title, author, creator and subject.
- A Scalable Short-Text Clustering Algorithm Using Apache Spark
  Akritidis L., Alamaniotis M., Fevgas A., Bozanis P. (2021) Short text clustering deals with the problem of grouping together semantically similar documents with small lengths. Nowadays, huge amounts of text data is being generated by numerous applications such as microblogs, ...
- Online clustering of distributed streaming data using belief propagation techniques
  Halkidi, M.; Koutsopoulos, I. (2011) Extraction of patterns out of streaming data that are generated from geographically dispersed devices is a major challenge in data mining. The sequential, distributed fashion in which data become available to the decision ...
- Distributed clustering in vehicular networks
  Maglaras, L. A.; Katsaros, D. (2012) Clustering in vanets is of crucial importance in order to cope with the dynamic features of the vehicular topologies. Algorithms that give good results in Manets fail to create stable clusters since vehicular nodes are ...