A Scalable Short-Text Clustering Algorithm Using Apache Spark
Date
2021
Language
en
Keyword
Abstract
Short text clustering addresses the problem of grouping together semantically similar short documents. Nowadays, huge amounts of text data are generated by numerous applications such as microblogs, messengers, and services that generate or aggregate entitled entities. This large volume of high-dimensional and sparse information may easily overwhelm current serial approaches and render them inefficient, or even inapplicable. Although many traditional clustering algorithms have been successfully parallelized in the past, the parallelization of short text clustering algorithms remains a rather overlooked problem. In this paper we introduce pVEPHC, a short text clustering method that can be executed in parallel on large computer clusters. The algorithm draws inspiration from VEPHC, a recent two-stage approach with decent performance in several diverse tasks. More specifically, in this work we employ the Apache Spark framework to design parallel implementations of both stages of VEPHC. During the first stage, pVEPHC generates an initial clustering by identifying and modelling common low-dimensional vector representations of the original documents. In the second stage, this initial clustering is improved by applying cluster split and merge operations in a hierarchical fashion. We have tested our implementation on an experimental Spark cluster and we report an almost linear improvement in the execution times of the algorithm. © 2021 IEEE.
Related items
Showing items related by title, author, creator and subject.
- Online clustering of distributed streaming data using belief propagation techniques
  Halkidi, M.; Koutsopoulos, I. (2011) Extraction of patterns out of streaming data that are generated from geographically dispersed devices is a major challenge in data mining. The sequential, distributed fashion in which data become available to the decision ...
- Distributed clustering in vehicular networks
  Maglaras, L. A.; Katsaros, D. (2012) Clustering in VANETs is of crucial importance in order to cope with the dynamic features of the vehicular topologies. Algorithms that give good results in MANETs fail to create stable clusters since vehicular nodes are ...
- Improving Hierarchical Short Text Clustering through Dominant Feature Learning
  Akritidis L., Alamaniotis M., Fevgas A., Tsompanopoulou P., Bozanis P. (2022) This paper focuses on the popular problem of short text clustering. Since the short text documents typically exhibit high degrees of data sparseness and dimensionality, the problem in question is generally considered more ...