[PAST EVENT] Lishan Yang, Computer Science - PhD dissertation proposal
Graphics Processing Units (GPUs) are becoming a de facto solution for accelerating a wide range of applications but remain susceptible to transient hardware faults (soft errors) that can easily compromise application output. One of the major challenges in the domain of GPU reliability is to accurately measure the resilience of general-purpose GPU (GPGPU) applications to transient faults. This challenge stems from the fact that a typical GPGPU application spawns a huge number of threads and then utilizes a large amount of the potentially unreliable compute and memory resources available on the GPU. As the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on application error resilience is impractical. As an alternative, fault site selection techniques have been proposed to achieve high accuracy with fewer fault injection experiments. However, most of the existing methods in the literature consider only the single-bit fault model and only a single input.
In this dissertation, we offer solutions to the two problems above. First, we extend a progressive fault site pruning technique to two multi-bit fault models: (a) multi-bit faults in the same word, and (b) multiple single-bit faults in different words accessed by the same thread. Second, we devise a methodology, SUGAR (Speeding Up GPGPU Application Resilience Estimation with input sizing), that dramatically speeds up the evaluation of GPGPU application error resilience by focusing on the effect of input size on the application resilience profile. Both solutions build on the well-established fact that error resilience in GPGPU applications is mostly determined by the dynamic instruction count at the thread level, which effectively allows the fault site space to be pruned. For multi-bit faults, we extend fault site pruning to the challenging cases of different multi-bit fault models. SUGAR is based on the discovery that a small fraction of the input is sufficient to estimate the application resilience with high accuracy; this discovery dramatically reduces the duration of experimentation. Key to the SUGAR estimation methodology are repeating thread patterns that develop as a function of the input size. These patterns allow for accurate prediction of application error resilience for arbitrarily large inputs.
With input-aware estimation strategies in place, we are able to pinpoint the vulnerabilities in a GPGPU application and propose low-overhead protection techniques accordingly. Based on the variation in thread resilience within GPGPU applications, we propose a methodology that identifies the resilience of threads and aims to map threads with the same resilience characteristics to the same warp. This method takes advantage of the hierarchical organization of a GPGPU application into threads, warps, and cooperative thread arrays. Our technique allows partial replication mechanisms for error detection/correction to be engaged at the warp level. By exploring 12 benchmarks (17 kernels) from 4 benchmark suites, we illustrate that threads can be remapped into reliable or unreliable warps with only minimal overhead, which enables selective protection via replication for those groups of threads that truly need it. Furthermore, we show that remapping threads to different warps does not sacrifice application performance. We show how this remapping facilitates warp replication for error detection and/or correction and achieves a significant reduction in execution cycles compared to standard techniques.
Lishan Yang is a Ph.D. candidate in the Computer Science Department at William & Mary, under the supervision of Prof. Evgenia Smirni. Her research interests include GPU architecture, reliability analysis, performance analysis, workload characterization of large-scale systems, and the reliability of HPC and large-scale systems. Her Ph.D. research has been published in top conferences (MICRO, ICSE, SIGMETRICS). Before coming to W&M, she received her bachelor's degree in computer science from the University of Science and Technology of China (USTC) in 2016.