[PAST EVENT] Mohamed Ibrahim, Computer Science - Ph.D. Dissertation Defense
Zoom Meeting Link: https://cwm.zoom.us/j/9809984990
To match the increasing computational demands of GPGPU applications and to improve peak compute throughput, the core counts in GPUs have been increasing with every generation. However, the famous memory wall is a major performance determinant in GPUs. In other words, in most cases, peak throughput in GPUs is ultimately dictated by memory bandwidth. To serve the memory demands of thousands of concurrently executing threads, GPUs are equipped with several sources of bandwidth, such as on-chip private/shared caching resources and off-chip high-bandwidth memories. Even so, the existing sources of bandwidth are often not sufficient for achieving optimal GPU performance. It is therefore important to conserve memory bandwidth and improve its utilization.
To achieve this goal, this dissertation focuses on improving on-chip cache bandwidth by managing cache line (data) replication across L1 caches via rethinking the cache hierarchy and interconnect design. First, this dissertation shows that efficient inter-core communication can exploit data replication across the L1s to unlock an additional source of on-chip bandwidth, which we call remote-core bandwidth. We propose to exploit this remote-core bandwidth by investigating: a) which data is replicated across cores, b) which cores have the replicated data, and c) how to fetch the replicated data as soon as possible. Second, this dissertation shows that if data replication is eliminated (or reduced), then the L1s can effectively cache more data, leading to higher hit rates and more on-chip bandwidth. We propose designing a shared L1 cache organization, which restricts each core to cache only a unique slice of the address range, eliminating data replication. We develop lightweight mechanisms to: a) reduce the inter-core communication overheads and b) identify applications that prefer the private L1 organization and execute them accordingly. Finally, to improve the performance, area, and energy efficiency of the shared L1 organization, this dissertation proposes a DC-L1 (DeCoupled-L1) cache, an L1 cache separated from the GPU core. We show how the decoupled nature of the DC-L1 caches provides an opportunity to aggregate the L1s and enables low-overhead, efficient data placement designs. These optimizations reduce data replication across the L1s and increase their bandwidth utilization.
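As a rough illustration of why a shared L1 organization removes replication, the toy Python sketch below compares how many cache-line copies end up across the L1s when every core reads the same data. It is not the dissertation's design: the line size, the core count, the modulo-interleaved address-to-core mapping, and the names home_core and cached_copies are all assumptions made for this example, and the caches are modeled with unlimited capacity.

```python
LINE_SIZE = 128  # bytes per cache line (assumed value)
NUM_CORES = 4    # number of GPU cores, one L1 each (assumed value)


def home_core(addr, num_cores=NUM_CORES, line_size=LINE_SIZE):
    """Return the core whose L1 owns the line holding byte address addr.

    Assumes simple line-granularity interleaving of the address range.
    """
    return (addr // line_size) % num_cores


def cached_copies(accesses, shared):
    """Count cache-line copies held across all L1s (capacity ignored).

    accesses: iterable of (core_id, byte_address) pairs.
    shared:   True  -> each line is cached only in its home core's L1
              False -> each line is cached in the requesting core's L1
    """
    l1 = [set() for _ in range(NUM_CORES)]
    for core, addr in accesses:
        line = addr // LINE_SIZE
        owner = home_core(addr) if shared else core
        l1[owner].add(line)
    return sum(len(lines) for lines in l1)


# Every core reads the same 8 cache lines.
accesses = [(c, i * LINE_SIZE) for c in range(NUM_CORES) for i in range(8)]

private_copies = cached_copies(accesses, shared=False)  # 4 cores x 8 lines
shared_copies = cached_copies(accesses, shared=True)    # 8 unique lines
print(private_copies, shared_copies)
```

With private L1s, each of the 4 cores keeps its own copy of all 8 lines (32 copies); with the sliced mapping, each line lives in exactly one L1 (8 copies), which frees capacity for other data at the cost of inter-core communication for lines owned by a remote core.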
Altogether, this dissertation develops several innovative techniques to improve the efficiency of the GPU on-chip memory system, which are necessary to address the memory wall problem. Future work will explore other designs and techniques to improve on-chip bandwidth utilization by considering other bandwidth sources (e.g., scratchpad and L2 cache).
Mohamed Assem Ibrahim is a Ph.D. candidate in the Department of Computer Science at William & Mary under the supervision of Professor Adwait Jog. Mohamed’s research interests lie in the broad area of computer architecture, with an emphasis on designing high-performance and energy-efficient GPU architectures. His Ph.D. research has been published at top venues: PACT 2019, PACT 2020, and HPCA 2021. Additionally, he has co-authored papers at other major computer architecture conferences such as MICRO and ICS. Mohamed worked as an intern with AMD Research in the summer of 2018 and the summer/fall of 2020. Before joining William & Mary, he received his bachelor's and master's degrees in Computer Engineering from Cairo University, Egypt.