[PAST EVENT] Mohamed Assem Ibrahim, Computer Science - Dissertation Proposal
Graphics Processing Unit (GPU) architectures have become a critical component in most computing systems, as they provide orders-of-magnitude faster and more energy-efficient execution for many general-purpose (GPGPU) applications. Unlike CPUs, which typically have limited multi-threading capabilities, GPUs launch thousands of threads across multiple cores to exploit the high thread-level parallelism in GPGPU applications. To match the increasing computational demands of GPGPU applications and to improve peak compute throughput, GPU core counts have been increasing with every generation. However, the well-known memory wall is a major performance determinant in GPUs: in most cases, peak throughput is ultimately dictated by memory bandwidth. Therefore, to serve the memory demands of thousands of concurrently executing threads, GPUs are equipped with several sources of bandwidth, such as on-chip private/shared caching resources and off-chip high-bandwidth memories. However, the existing sources of bandwidth are often not sufficient for achieving optimal GPU performance. A straightforward approach to mitigate this issue is to scale the on/off-chip memory resources. However, memory bandwidth scaling is constrained by the cost and power budgets of the system and, more importantly, by the I/O limitations of the off-chip memories. This makes memory bandwidth a scarce and valuable resource. Therefore, it is important to conserve memory bandwidth and improve its utilization.
To achieve the aforementioned goal, this dissertation focuses on improving on-chip cache performance for GPUs. In particular, we improve on-chip cache bandwidth by managing data replication across L1 caches via rethinking the cache hierarchy and the interconnect design. Such data replication stems from the private nature of the L1 caches and from inter-core locality. Specifically, each GPU core can independently request and store a given cache line (in its local L1 cache) while being oblivious to the previous requests of other cores. This dissertation treats inter-core locality (i.e., data replication) as a double-edged sword, and proposes the following. First, this dissertation shows that efficient inter-core communication can exploit data replication across the L1 caches to unlock an additional potential source of on-chip bandwidth, which we call remote-core bandwidth. We propose to efficiently coordinate data movement across GPU cores to exploit this remote-core bandwidth by investigating: a) which data is replicated across cores, b) which cores have the replicated data, and c) how to fetch the replicated data as soon as possible. Second, this dissertation shows that if data replication is eliminated (or reduced), then the L1 caches can effectively cache more data, leading to higher hit rates and more on-chip bandwidth. We propose a renovated L1 cache design that eliminates data replication collectively across the L1s using efficient inter-core communication. Third, to improve the performance, area, and energy efficiency of the renovated L1 cache design, this dissertation proposes co-designing the GPU cache hierarchy and interconnect to limit data replication across L1s and increase their bandwidth utilization.
Finally, future work will explore other designs and techniques to improve on-chip bandwidth utilization by considering other bandwidth sources (e.g., shared memory and the L2 cache). Altogether, this dissertation develops several innovative techniques to improve the efficiency of the GPU on-chip memory system, which are necessary to address the memory wall problem.
Mohamed Assem Ibrahim is a Ph.D. candidate in the Department of Computer Science at William & Mary under the supervision of Professor Adwait Jog. Mohamed’s research interests lie in the broad area of computer architecture, with an emphasis on designing high-performance and energy-efficient GPU architectures. His research has been published in PACT, and two more papers are under peer review. Additionally, he has co-authored papers at other major computer architecture conferences such as MICRO, HPCA, and ICS. Mohamed worked as an intern with AMD Research in the summer of 2018. Before joining William & Mary, he received his bachelor's and master's degrees in Computer Engineering from Cairo University, Egypt.