[PAST EVENT] Bin Nie, Computer Science - Oral Exam
Over the past decade, GPUs have become an integral part of mainstream high performance computing (HPC) facilities. Because applications running on HPC systems are typically long-running, any error or failure can cause a significant loss of scientific productivity and system resources. Moreover, because HPC systems face severe resilience challenges as they progress toward exascale, it is imperative to develop a better understanding of GPU reliability. This dissertation fills this gap by providing an understanding of the effects of soft errors on the entire system and on specific applications. To understand system-level reliability, a large-scale field study of GPU soft errors is conducted. The occurrences of GPU soft errors are linked to several temporal and spatial features, such as specific workloads, node location, temperature, and power consumption. Further, machine-learning models are proposed to predict error occurrences on GPU nodes, so that costly error protection mechanisms can be turned on or off proactively and dynamically based on the prediction results. To understand the effects of soft errors at the application level, an error-injection framework is designed to characterize the reliability and resilience of GPGPU applications. The framework reduces the otherwise intractably large space of exhaustive error-injection locations to a manageable size while still preserving high accuracy. The results of this research provide valuable insights toward reliable GPU computing.
Bin Nie is a fourth-year Ph.D. candidate at William & Mary, advised by Dr. Evgenia Smirni. She received her bachelor's degree in Software Engineering from Xiamen University in 2012 and her master's degree in Computer Science from Fordham University in 2014. Her research interests focus on the reliability of GPUs and GPGPU applications.