Supervised papers classification on large-scale high-dimensional data with Apache Spark

Akritidis L., Bozanis P., Fevgas A.

dc.creator	Akritidis L., Bozanis P., Fevgas A.	en
dc.date.accessioned	2023-01-31T07:30:37Z
dc.date.available	2023-01-31T07:30:37Z
dc.date.issued	2018
dc.identifier	10.1109/DASC/PiCom/DataCom/CyberSciTec.2018.00140
dc.identifier.isbn	9781538675182
dc.identifier.uri	http://hdl.handle.net/11615/70353
dc.description.abstract	The problem of classifying a research article into one or more fields of science is of particular importance for academic search engines and digital libraries. A robust classification algorithm offers users a wide variety of useful tools, such as refinement of their search results, browsing of articles by category, and recommendation of similar articles. In the current literature, we encounter approaches that attempt to address this problem without taking into consideration important parameters such as the previous history of the authors and the categorization of the scientific journals that publish the articles. In addition, existing works overlook the huge volume of the involved academic data. In this paper, we expand an existing effective algorithm for research article classification, and we parallelize it on Apache Spark, a parallelization framework capable of sharing large amounts of data in the main memory of the nodes of a cluster, to enable the processing of large academic datasets. Furthermore, we present data manipulation methodologies that are useful not only for this particular problem, but also for most parallel machine learning approaches. In our experimental evaluation, we demonstrate that our proposed algorithm is considerably more accurate than the supervised learning approaches implemented within the machine learning library of Spark, and that it also outperforms them in terms of execution speed by a significant margin. © 2018 IEEE.	en
dc.language.iso	en	en
dc.source	Proceedings - IEEE 16th International Conference on Dependable, Autonomic and Secure Computing, IEEE 16th International Conference on Pervasive Intelligence and Computing, IEEE 4th International Conference on Big Data Intelligence and Computing and IEEE 3rd Cyber Science and Technology Congress, DASC-PICom-DataCom-CyberSciTec 2018	en
dc.source.uri	https://www.scopus.com/inward/record.uri?eid=2-s2.0-85056835173&doi=10.1109%2fDASC%2fPiCom%2fDataCom%2fCyberSciTec.2018.00140&partnerID=40&md5=004eb42f35ee78003707bfaa707505e0
dc.subject	Artificial intelligence	en
dc.subject	Classification (of information)	en
dc.subject	Clustering algorithms	en
dc.subject	Data mining	en
dc.subject	Digital libraries	en
dc.subject	Learning systems	en
dc.subject	Search engines	en
dc.subject	Dimensionality reduction	en
dc.subject	Effective algorithms	en
dc.subject	Experimental evaluation	en
dc.subject	High dimensional data	en
dc.subject	Large amounts of data	en
dc.subject	Robust classification	en
dc.subject	Sparse random projections	en
dc.subject	Supervised learning approaches	en
dc.subject	Big data	en
dc.subject	Institute of Electrical and Electronics Engineers Inc.	en
dc.title	Supervised papers classification on large-scale high-dimensional data with Apache Spark	en
dc.type	conferenceItem	en

This item appears in the following collection(s)

Publications in journals, conferences, book chapters, etc. [19705]