
arXiv: 1202.3173
Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen's algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA'11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range. Benchmarking our implementation on a Cray XT4, we obtain speedups over classical and Strassen-based algorithms ranging from 24% to 184% for a fixed matrix dimension n=94080, where the number of nodes ranges from 49 to 7203. Our parallelization approach generalizes to other fast matrix multiplication algorithms.
13 pages, 3 figures
FOS: Computer and information sciences, Numerical Analysis (math.NA), Computational Complexity (cs.CC), Computer Science - Computational Complexity, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Data Structures and Algorithms, FOS: Mathematics, Mathematics - Combinatorics, Data Structures and Algorithms (cs.DS), Mathematics - Numerical Analysis, Distributed, Parallel, and Cluster Computing (cs.DC), Combinatorics (math.CO), F.2.1, 68W40, 68W10
FOS: Computer and information sciences, Numerical Analysis (math.NA), Computational Complexity (cs.CC), Computer Science - Computational Complexity, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Data Structures and Algorithms, FOS: Mathematics, Mathematics - Combinatorics, Data Structures and Algorithms (cs.DS), Mathematics - Numerical Analysis, Distributed, Parallel, and Cluster Computing (cs.DC), Combinatorics (math.CO), F.2.1, 68W40, 68W10
| selected citations These citations are derived from selected sources. This is an alternative to the "Influence" indicator, which also reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | 82 | |
| popularity This indicator reflects the "current" impact/attention (the "hype") of an article in the research community at large, based on the underlying citation network. | Top 10% | |
| influence This indicator reflects the overall/total impact of an article in the research community at large, based on the underlying citation network (diachronically). | Top 10% | |
| impulse This indicator reflects the initial momentum of an article directly after its publication, based on the underlying citation network. | Top 10% |
