Publication · Preprint · Conference object · 2017

Performance Characterization of Multi-threaded Graph Processing Applications on Many-Integrated-Core Architecture

Lei Jiang, Langshi Chen, Judy Qiu
Open Access · English
Published: 15 Aug 2017
Abstract
Intel Xeon Phi many-integrated-core (MIC) architectures usher in a new era of terascale integration. Among emerging killer applications, parallel graph processing has become a critical technique for analyzing connected data. In this paper, we empirically evaluate several computing platforms, including an Intel Xeon E5 CPU, an Nvidia GeForce GTX 1070 GPU, and a Xeon Phi 7210 processor codenamed Knights Landing (KNL), in the domain of parallel graph processing. We show that the KNL achieves encouraging performance when processing graphs, making it a promising solution for accelerating multi-threaded graph applications. We further characterize the impact of KNL arch...
Subjects
Free-text keywords: Computer Science - Distributed, Parallel, and Cluster Computing; Graph theory; Cache; SIMD; Xeon Phi; Vector processor; Intel iPSC; Computer science; Computer architecture; Parallel computing; Xeon; CUDA