Evaluating the Effects of Modern Storage Devices on the Efficiency of Parallel Machine Learning Algorithms
Ημερομηνία
2020Γλώσσα
en
Λέξη-κλειδί
Επιτομή
Big Data analytics is presently one of the most emerging areas of research for both organizations and enterprises. The requirement for deployment of efficient machine learning algorithms over huge amounts of data led to the development of parallelization frameworks and of specialized libraries (like Mahout and MLlib) which implement the most important among these algorithms. Moreover, the recent advances in storage technology resulted in the introduction of high-performing devices, broadly known as Solid State Drives (SSDs). Compared to the traditional Hard Drives (HDDs), SSDs offer considerably higher performance and lower power consumption. Motivated by these appealing features and the growing necessity for efficient large-scale data processing, we compared the performance of several machine learning algorithms on MapReduce clusters whose nodes are equipped with HDDs, SSDs, and devices which implement the latest 3D XPoint technology. In particular, we evaluate several dataset preprocessing methods like vectorization and dimensionality reduction, two supervised classifiers, Naive Bayes and Linear Regression, and the popular k-Means clustering algorithm. We use an experimental cluster equipped with the three aforementioned storage devices under different configurations, and two large datasets, Wikipedia and HIGGS. The experiments showed that the benefits which derive from the usage of SSDs depend on the cluster setup and the nature of the applied algorithms. © 2020 World Scientific Publishing Company.