Advanced Micro Devices, Inc., “AMD Accelerated Processing Units,” http://www.amd.com/us/products/technologies/apu/Pages/apu.aspx.
 Advanced Micro Devices, Inc., “OpenCL: The Future of Accelerated Application Performance Is Now,” https://www.amd.com/Documents/FirePro_OpenCL_Whitepaper.pdf.
 N. Agarwal, D. Nellans, M. O'Connor, S. W. Keckler, and T. F. Wenisch, “Unlocking Bandwidth for GPUs in CC-NUMA Systems,” in HPCA, 2015.
 J. Ahn, S. Jin, and J. Huh, “Revisiting Hardware-Assisted Page Walks for Virtualized Systems,” in ISCA, 2012.
 J. Ahn, S. Jin, and J. Huh, “Fast Two-Level Address Translation for Virtualized Systems,” IEEE TC, 2015.
 R. Ausavarungnirun, K. Chang, L. Subramanian, G. Loh, and O. Mutlu, “Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems,” in ISCA, 2012.
 R. Ausavarungnirun, S. Ghose, O. Kayıran, G. H. Loh, C. R. Das, M. T. Kandemir, and O. Mutlu, “Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance,” in PACT, 2015.
 R. Ausavarungnirun, J. Landgraf, V. Miller, S. Ghose, J. Gandhi, C. J. Rossbach, and O. Mutlu, “Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes,” in MICRO, 2017.
 R. Ausavarungnirun, V. Miller, J. Landgraf, S. Ghose, J. Gandhi, A. Jog, C. Rossbach, and O. Mutlu, “MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency,” in ASPLOS, 2018.
 A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt, “Analyzing CUDA Workloads Using a Detailed GPU Simulator,” in ISPASS, 2009.
 T. W. Barr, A. L. Cox, and S. Rixner, “Translation Caching: Skip, Don't Walk (the Page Table),” in ISCA, 2010.
 T. W. Barr, A. L. Cox, and S. Rixner, “SpecTLB: A Mechanism for Speculative Address Translation,” in ISCA, 2011.
 A. Basu, J. Gandhi, J. Chang, M. D. Hill, and M. M. Swift, “Efficient Virtual Memory for Big Memory Servers,” in ISCA, 2013.
 A. Bhattacharjee, “Large-Reach Memory Management Unit Caches,” in MICRO, 2013.
 A. Bhattacharjee, D. Lustig, and M. Martonosi, “Shared Last-level TLBs for Chip Multiprocessors,” in HPCA, 2011.