[PAST EVENT] Colloquium: Adaptive Resilience for Extreme Scale Systems

February 11, 2014
8am - 9am
Location
McGlothlin-Street Hall, Room 020
251 Jamestown Rd
Williamsburg, VA 23185
Dong Li, Oak Ridge National Laboratory

Abstract:

The path to exascale computing poses several research challenges, including massive parallelism, resilience, and hardware heterogeneity. Resilience, i.e., providing efficiency and correctness in the presence of faults, is one of the most important of these challenges, because extreme scale systems are growing in component count while component reliability decreases. Current resilience mechanisms often employ a one-size-fits-all approach and come with large performance and power overheads. They lack the capability to respond adaptively to application needs, and they lack holistic reliability management.

In this talk, I will present our recent progress in addressing the above resilience problems. In particular, I will focus on coordinating algorithm-based fault tolerance and hardware error-correcting code (ECC) with a co-design approach and an adaptive policy to direct end-to-end resilience for the application and architecture. In addition, I will briefly discuss our efforts to use a data-centric approach to understand application vulnerability with a binary instrumentation tool. Finally, I will conclude with an overview of my future research to address other critical research challenges for extreme scale systems.


Bio:

Dr. Dong Li is a research scientist with the Future Technologies Group in the Computer Science and Mathematics Division of Oak Ridge National Laboratory. He received his Ph.D. in computer science from Virginia Tech in 2011. His research focuses broadly on high performance computing (HPC). The core theme of his research is enabling scalable and efficient execution of scientific applications on increasingly complex HPC systems. His work contributes innovations in runtime, architecture, and performance modeling to address the challenges of massive parallelism, resilience, and hardware heterogeneity for future extreme scale systems.