[PAST EVENT] Colloquium Talk: AI in HPC Systems Stack: Storage Systems and Deep Learning Systems

January 26, 2022
9am - 10am
Location
McGlothlin-Street Hall, Online on Zoom
251 Jamestown Rd
Williamsburg, VA 23185

Abstract:

In high-performance computing (HPC), scientific codes have been evolving continuously. As they move from numerical simulations and analyses to AI/ML-based applications, scientific codes execute at larger computational scales and periodically issue massive data movements for network communication and I/O at application runtime. In this talk, I will mainly discuss two of my recent works: understanding and improving the performance of HPC I/O subsystems by leveraging AI/ML algorithms, and optimizing network communication in large-scale deep learning systems. In particular, for the work in HPC I/O, I will talk about the challenges of benchmarking, modeling, and tuning the performance of supercomputer I/O systems based on their system design, deployment, and configuration, and about our ML-based solutions. For the work in AI systems, I will discuss our proposals for Horovod, a popular collective communication library for deep learning frameworks, which introduce a decentralized coordination scheme and a grouping mechanism in Horovod's control plane and data plane, respectively.


Bio:

Dr. Bing Xie is an HPC research scientist at the Oak Ridge Leadership Computing Facility (OLCF) of Oak Ridge National Laboratory (ORNL). Bing received a Ph.D. in Computer Science from Duke University in 2017 and joined ORNL in the same year. She conducts computer systems research with a strong publication record spanning multiple research areas, including large-scale parallel file systems, deep learning systems, and resource management. Her work has appeared in major conferences and journals, such as SC, ACM TOS, NSDI, HPDC, and IPDPS. Bing won the IEEE-CS TCHPC Early Career Researchers Award in 2021. Her performance study of parallel file systems was nominated for both best paper and best student paper at SC in 2012. Her improvements to HDF5, a widely used HPC I/O library, were adopted by OLCF. Her work on Horovod was incorporated into Horovod v0.20.1.

Contact

Adwait Jog