Name: CMD: A Cache-Assisted GPU Memory Deduplication Architecture
Keywords: FOS: Computer and information sciences, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture

Publication: Article, Preprint, 01 Oct 2025
Embargo end date: 01 Jan 2024
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Journal: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, volume 44, pages 3752-3763 (ISSN: 0278-0070, eISSN: 1937-4151)

Authors: Wei Zhao; Dan Feng; Wei Tong; Xueliang Wei; Bing Wu

doi: 10.1109/tcad.2025.3552674, 10.48550/arxiv.2408.09483

arXiv: http://arxiv.org/abs/2408.09483



Abstract

Massive off-chip accesses are the main performance bottleneck in GPUs. We divide these accesses into three types: (1) Write, (2) Data-Read, and (3) Read-Only. We also find that many writes are duplicates, and that the duplication takes two forms: inter-dup, where different memory blocks are identical, and intra-dup, where all the 4B elements in a line are the same. In this work, we propose CMD, a cache-assisted GPU memory deduplication architecture that reduces off-chip accesses by exploiting the data duplication in GPU applications. CMD makes three key design contributions, each targeting one of the three kinds of accesses. (1) A novel GPU memory deduplication architecture that removes inter-dup and intra-dup lines. For inter-dup detection, we reduce the extra read requests caused by the traditional read-verify hash process, and we design several techniques to manage duplicate blocks. (2) A cache-assisted read scheme that reduces reads to duplicate data. When an L2 cache miss targets a duplicate block and the reference block has already been fetched into L2 and is clean, we copy it to the missed L2 block without accessing off-chip DRAM. For reads to intra-dup data, CMD obtains the data from the on-chip metadata cache. (3) When a cache line is evicted, its clean sectors are invalidated while its dirty sectors are written back. However, most read-only victims are re-referenced from DRAM more than twice. We therefore add a fully associative FIFO that holds read-only (and thus clean) victims to reduce the number of re-references. Experiments show that CMD decreases off-chip accesses by 31.01%, reduces energy by 32.78%, and improves performance by 37.79%. For memory-intensive workloads, CMD improves performance by 50.18%.
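The two duplication types defined in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design the paper describes: the 128-byte line size, the `DedupTable` class, the `is_intra_dup` helper, and the use of SHA-256 followed by a full byte comparison are all hypothetical choices made for illustration. The paper's own inter-dup detection specifically aims to avoid the extra reads of a read-verify process, which this toy model does not attempt.

```python
import hashlib

LINE_BYTES = 128  # assumed line size; the abstract does not state one
ELEM_BYTES = 4    # the 4B element width named in the abstract


def is_intra_dup(line: bytes) -> bool:
    """Return True when every 4-byte element in the line equals the first one."""
    first = line[:ELEM_BYTES]
    return all(line[i:i + ELEM_BYTES] == first
               for i in range(0, len(line), ELEM_BYTES))


class DedupTable:
    """Toy model (hypothetical): maps a content hash to a reference line address.

    It ignores reference counting and the bookkeeping needed when a
    reference line is later overwritten.
    """

    def __init__(self):
        self.by_hash = {}  # digest -> address of the reference line
        self.memory = {}   # address -> stored line (stands in for DRAM)

    def write(self, addr: int, line: bytes) -> str:
        """Classify a written line as 'intra-dup', 'inter-dup', or 'unique'."""
        if is_intra_dup(line):
            # A single repeated element is enough to rebuild the line.
            return "intra-dup"
        digest = hashlib.sha256(line).digest()
        ref = self.by_hash.get(digest)
        # Full comparison guards against hash collisions and stale entries.
        if ref is not None and self.memory[ref] == line:
            return "inter-dup"
        self.by_hash[digest] = addr
        self.memory[addr] = line
        return "unique"
```

In this model, writing the same non-uniform line to two different addresses classifies the first write as unique and the second as inter-dup, while a line filled with one repeated 4-byte value is classified as intra-dup without touching the hash table.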

Related Organizations

Huazhong University of Science and Technology
China (People's Republic of)
Wuhan National Laboratory for Optoelectronics
China (People's Republic of)


Impact by BIP!

selected citations: 0
popularity: Average
influence: Average
impulse: Average
