A Scalable Short-Text Clustering Algorithm Using Apache Spark

Akritidis L., Alamaniotis M., Fevgas A., Bozanis P.

dc.creator	Akritidis L., Alamaniotis M., Fevgas A., Bozanis P.	en
dc.date.accessioned	2023-01-31T07:30:36Z
dc.date.available	2023-01-31T07:30:36Z
dc.date.issued	2021
dc.identifier	10.1109/ICTAI52525.2021.00149
dc.identifier.isbn	9781665408981
dc.identifier.issn	10823409
dc.identifier.uri	http://hdl.handle.net/11615/70349
dc.description.abstract	Short text clustering deals with the problem of grouping together semantically similar documents of small length. Nowadays, huge amounts of text data are being generated by numerous applications such as microblogs, messengers, and services that generate or aggregate entitled entities. This large volume of high-dimensional and sparse information may easily overwhelm current serial approaches and render them inefficient, or even inapplicable. Although many traditional clustering algorithms have been successfully parallelized in the past, the parallelization of short text clustering algorithms is a rather overlooked problem. In this paper we introduce pVEPHC, a short text clustering method that can be executed in parallel on large computer clusters. The algorithm draws inspiration from VEPHC, a recent two-stage approach with decent performance in several diverse tasks. More specifically, in this work we employ the Apache Spark framework to design parallel implementations of both stages of VEPHC. During the first stage, pVEPHC generates an initial clustering by identifying and modelling common low-dimensional vector representations of the original documents. In the second stage, the initial clustering is improved by applying cluster split and merge operations in a hierarchical fashion. We have evaluated our implementation on an experimental Spark cluster and we report an almost linear improvement in the execution times of the algorithm. © 2021 IEEE.	en
dc.language.iso	en	en
dc.source	Proceedings - International Conference on Tools with Artificial Intelligence, ICTAI	en
dc.source.uri	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85123944483&doi=10.1109%2fICTAI52525.2021.00149&partnerID=40&md5=ecd0b90a5627520b8d576ff9186da31b
dc.subject	Big data	en
dc.subject	Clustering algorithms	en
dc.subject	Machine learning	en
dc.subject	Parallel algorithms	en
dc.subject	'current	en
dc.subject	Clusterings	en
dc.subject	Large volumes	en
dc.subject	Micro-blog	en
dc.subject	Short text clustering	en
dc.subject	Short texts	en
dc.subject	Text Clustering	en
dc.subject	Text data	en
dc.subject	Text-clustering algorithm	en
dc.subject	Traditional clustering	en
dc.subject	Cluster analysis	en
dc.subject	IEEE Computer Society	en
dc.title	A Scalable Short-Text Clustering Algorithm Using Apache Spark	en
dc.type	conferenceItem	en

This item appears in the following collections

Publications in journals, conferences, book chapters, etc. [19705]

Related items

Online clustering of distributed streaming data using belief propagation techniques

Distributed clustering in vehicular networks

Improving Hierarchical Short Text Clustering through Dominant Feature Learning