A&S Graduate Studies
[PAST EVENT] Bin Nie, Computer Science - Ph.D. Defense
Abstract: Over the past decade, GPUs have become an integral part of mainstream high-performance computing (HPC) facilities. Because applications running on HPC systems are usually long-running, any error or failure can cause a significant loss of scientific productivity and system resources. Moreover, HPC systems face severe resilience challenges as they progress toward exascale, so it is imperative to develop a better understanding of GPU reliability. This dissertation addresses that need by characterizing the effects of soft errors both on the entire system and on specific applications.
To understand system-level reliability, a large-scale field study of GPU soft errors is conducted. The occurrences of GPU soft errors are linked to several temporal and spatial features, such as specific workloads, node location, temperature, and power consumption. Further, machine learning models are proposed to predict error occurrences on GPU nodes, so that the costly error protection mechanisms can be turned on or off proactively and dynamically based on the prediction results.
To understand the effects of soft errors at the application level, a fault-injection framework is designed to characterize the reliability and resilience of GPGPU applications. The framework reduces the very large number of candidate fault-injection locations to a manageable size while preserving high accuracy. It is validated with both single-bit and multi-bit fault models on various GPGPU benchmarks. Lastly, building on the proposed fault-injection framework, this dissertation develops a hierarchical approach to understanding the error resilience characteristics of GPGPU applications at different levels, including the kernel, CTA (cooperative thread array), and warp levels. In addition, given that some application outputs corrupted by soft errors are still acceptable, we present a use case showing how to enable low-overhead yet reliable GPU computing for GPGPU applications.
Bio: Bin Nie is a fifth-year Ph.D. candidate at William & Mary, advised by Dr. Evgenia Smirni. She received her bachelor's degree in Software Engineering from Xiamen University in 2012 and her master's degree in Computer Science from Fordham University in 2014. Her research interests lie in the reliability of GPUs and GPGPU applications.