Locality-Aware Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication on Many-Core Processors

Publication: Article, 01 Jun 2016, English

Publisher: Institute of Electrical and Electronics Engineers (IEEE)

Journal: IEEE Transactions on Parallel and Distributed Systems, volume 27, pages 1713-1726 (ISSN: 1045-9219)

Funded by: EC | PRACE-4IP, UKRI | Development of PEMD for N..., EC | PRACE-5IP

Authors: Karsavuran M.O.; Akbudak K.; Aykanat C.

doi: 10.1109/tpds.2015.2453970

handle: 11693/36500

Abstract

Sparse matrix-vector and matrix-transpose-vector multiplication ($\mathrm{SpMM^TV}$), repeatedly performed as $z \leftarrow A^T x$ and $y \leftarrow A z$ (or $y \leftarrow A w$) for the same sparse matrix $A$, is a kernel operation widely used in various iterative solvers. One important optimization for serial $\mathrm{SpMM^TV}$ is reusing $A$-matrix nonzeros, which halves the memory bandwidth requirement. However, thread-level parallelization of $\mathrm{SpMM^TV}$ that reuses $A$-matrix nonzeros necessitates concurrent writes to the same output-vector entries. These concurrent writes can be handled in two ways: via atomic updates or via thread-local temporary output vectors that undergo a reduction operation. Neither approach is efficient or scalable on processors with many cores and complicated cache-coherency protocols. In this work, we identify five quality criteria for efficient and scalable thread-level parallelization of $\mathrm{SpMM^TV}$ that utilizes one-dimensional (1D) matrix partitioning. We also propose two locality-aware 1D partitioning methods, which achieve reusing $A$-matrix nonzeros and intermediate $z$-vector entries; exploiting locality in accessing $x$-, $y$-, and $z$-vector entries; and reducing the number of concurrent writes to the same output-vector entries. These two methods utilize rowwise and columnwise singly bordered block-diagonal (SB) forms of $A$. We evaluate the validity of our methods on a wide range of sparse matrices. Experiments on the 60-core cache-coherent Intel Xeon Phi processor show the validity of the identified quality criteria and of the proposed methods in practice. The results also show that the performance improvement from reusing $A$-matrix nonzeros compensates for the overhead of concurrent writes through the proposed SB-based methods.
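The nonzero reuse and the concurrent-write problem described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes $A$ is stored column by column (CSC), and the names `fused_columns`, `fused_spmmtv`, and `spmmtv_private_reduce` are hypothetical. Each column of $A$ is read once and used for both $z \leftarrow A^T x$ and $y \leftarrow A z$. The second function only simulates, sequentially, the "thread-local temporary output vectors plus reduction" option; no real threads are used.

```python
def fused_columns(col_ptr, row_idx, vals, x, z, y, lo, hi):
    """Process columns lo..hi-1 of A (CSC), updating z and y in place."""
    for j in range(lo, hi):
        start, end = col_ptr[j], col_ptr[j + 1]
        # z_j = (column j of A) . x, i.e. entry j of A^T x.
        zj = 0.0
        for k in range(start, end):
            zj += vals[k] * x[row_idx[k]]
        z[j] = zj
        # Reuse the same column nonzeros to accumulate y += A[:, j] * z_j.
        # This scatter is where concurrent writes would occur if several
        # threads owned different columns that share a row index.
        for k in range(start, end):
            y[row_idx[k]] += vals[k] * zj


def fused_spmmtv(n_rows, col_ptr, row_idx, vals, x):
    """Serial fused kernel: returns (z, y) with z = A^T x and y = A z."""
    n_cols = len(col_ptr) - 1
    z = [0.0] * n_cols
    y = [0.0] * n_rows
    fused_columns(col_ptr, row_idx, vals, x, z, y, 0, n_cols)
    return z, y


def spmmtv_private_reduce(n_rows, col_ptr, row_idx, vals, x, n_parts):
    """Simulated 1D column partitioning with private y vectors (assumption:
    contiguous, equal-sized column blocks; parts are run one after another)."""
    n_cols = len(col_ptr) - 1
    z = [0.0] * n_cols
    private_y = [[0.0] * n_rows for _ in range(n_parts)]
    for p in range(n_parts):
        lo = p * n_cols // n_parts
        hi = (p + 1) * n_cols // n_parts
        fused_columns(col_ptr, row_idx, vals, x, z, private_y[p], lo, hi)
    # Reduction: sum the private copies entry by entry.
    y = [sum(py[i] for py in private_y) for i in range(n_rows)]
    return z, y


# Example: A = [[1, 0, 2], [0, 3, 0], [4, 0, 5]] in CSC form.
col_ptr = [0, 2, 3, 5]
row_idx = [0, 2, 1, 0, 2]
vals = [1.0, 4.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]
z, y = fused_spmmtv(3, col_ptr, row_idx, vals, x)
print(z, y)
```

In the private-vector variant, memory use and reduction work grow with the number of parts, which gives some intuition for why the abstract considers this option poorly scalable on many-core processors.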

Related Organizations

Bilkent University
Turkey

Keywords

Sparse matrix, Sparse matrices, Sparse matrix-vector multiplication, Singly bordered block-diagonal form, Bordered block diagonal form, Matrix reordering, Iterative methods, Vectors, Matrix algebra, Cache locality, Computer architecture, Intel Many Integrated Core Architecture (Intel MIC), Intel Xeon Phi, Integrated core, 518

Impact by BIP!

- selected citations: 17. These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- popularity: Top 10%. This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network.
- influence: Top 10%. This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically).
- impulse: Top 10%. This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network.

Access routes: Green, bronze

Funded by

EC | PRACE-4IP, UKRI | Development of PEMD for Nuclear Coolant Systems, EC | PRACE-5IP