A Performance Evaluation of Distributed Deep Learning Frameworks on CPU Clusters Using Image Classification Workloads
Date
2021
Language
en
Abstract
In recent years, deep learning has been widely used in a variety of fields and applications. The constant growth of the data used to train complex models has opened up research in distributed learning. In this domain, two main architectures are used to train models in a distributed fashion: all-reduce and parameter server. Both support synchronous learning, while parameter server also supports asynchronous learning. These architectures have been adopted by tech companies, which have developed multiple systems for this purpose. Among the most popular and widely used distributed deep learning systems are Google TensorFlow, Facebook PyTorch and Apache MXNet. In this paper, we quantify the performance gap between these systems and present a detailed analysis of the parameters that affect their execution time. Overall, in synchronous learning setups, TensorFlow is slower than PyTorch by 2.65X on average, while the latter lags MXNet by 1.38X on average. In asynchronous learning, MXNet is faster than TensorFlow by 3.22X on average. © 2021 IEEE.
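The difference between the two architectures named in the abstract can be illustrated with a toy, single-process sketch. It is not taken from the paper and does not use the APIs of TensorFlow, PyTorch or MXNet; the names `all_reduce_mean`, `ParameterServer`, `push_sync` and `push_async`, the learning rate, and the gradient values are all hypothetical and chosen only for illustration.

```python
# Toy illustration only: gradients are plain lists of floats and the
# "workers" are simulated inside one process (assumption, not the paper's setup).

def all_reduce_mean(worker_grads):
    """Synchronous all-reduce: every worker ends up with the same averaged gradient."""
    n = len(worker_grads)
    mean = [sum(vals) / n for vals in zip(*worker_grads)]
    # Each worker receives its own copy of the reduced result.
    return [list(mean) for _ in worker_grads]


class ParameterServer:
    """Holds the shared weights; workers push their gradients to it."""

    def __init__(self, weights, lr):
        self.weights = list(weights)
        self.lr = lr

    def push_sync(self, worker_grads):
        # Synchronous mode: wait for all workers, then apply one averaged update.
        n = len(worker_grads)
        mean = [sum(vals) / n for vals in zip(*worker_grads)]
        self.weights = [w - self.lr * g for w, g in zip(self.weights, mean)]

    def push_async(self, grad):
        # Asynchronous mode: apply each worker's gradient as soon as it arrives.
        self.weights = [w - self.lr * g for w, g in zip(self.weights, grad)]


grads = [[1.0, 2.0], [3.0, 4.0]]  # hypothetical gradients from two workers

reduced = all_reduce_mean(grads)

sync_ps = ParameterServer([0.0, 0.0], lr=0.5)
sync_ps.push_sync(grads)

async_ps = ParameterServer([0.0, 0.0], lr=0.5)
for g in grads:
    async_ps.push_async(g)
```

In this sketch the synchronous server applies one averaged step, whereas the asynchronous server applies two separate steps, so the resulting weights differ. In a real asynchronous setup each worker would also compute its gradient on possibly stale weights, which this simplified example does not model.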