[PAST EVENT] Probir Roy, Computer Science - Oral Defense
Understanding memory and computational inefficiency in the software and hardware system is evident for achieving optimal performance. Furthermore, it is vital to have insightful guidance for applying various optimization to the software system. In many cases, we realize application behavior through memory access pattern. However, pinpointing performance issues and providing insightful guidance in real-world applications are often challenging. While static analysis fails to address the dynamic nature of the applications, existing runtime analysis tools typically leverage heavyweight memory instrumentations, which hinders the applicability of these tools for real long-running programs. Alternatively, measurement-based techniques impose less overhead. These profilers collect raw performance data from the underlying hardware architecture and pinpoint various data-locality issues and computational inefficiency in the application. However, identifying the complex performance issues such as cache miss classification or high-level optimization guidance such as structure splitting from raw performance data are challenging and yet to explore. To address these issues, we present a series of measurement-based analysis techniques that we employ to have deep insight toward optimization guidance.
In the first project, we address the challenges of structure splitting. Structure splitting, a code transformation technique, can significantly improve memory locality. However, pinpointing inefficient code and providing insightful guidance for structure splitting is challenging. Existing tools typically leverage heavyweight memory instrumentations, which hinders the applicability of these tools for real long-running programs. To address this issue, we develop StructSlim, a profiler to pinpoint top candidates that benefit from structure splitting. StructSlim makes three unique contributions. First, it adopts lightweight address sampling to collect and analyze memory traces. Second, StructSlim employs a set of novel methods to determine memory access patterns to guide structure splitting. We also formally prove that our method has high accuracy even with sparse memory access samples. Third, StructSlim scales on multithreaded machines. StructSlim works on fully optimized, unmodified binary executables independently from their compiler and language, incurring around 7% runtime overhead. To evaluate StructSlim, we study seven sequential and parallel benchmarks. With the guidance of StructSlim, we can significantly improve all these benchmarks; the speedup is up to 1.37x.
The second project studies the scalability challenges of simultaneous multithreading (SMT). Modern architectures employ SMT to increase thread-level parallelism. SMT threads share many functional units and the entire memory hierarchy of a physical core. Without a careful code design, SMT threads can easily contend with each other for these shared resources, causing severe performance degradation. Minimizing SMT thread contention for HPC applications running on dedicated platforms is very challenging because they typically spawn threads within Single Program Multiple Data (SPMD) models. Since these threads have similar resource requirements, their contention cannot be easily mitigated through simple thread scheduling. To address this critical issue, we first vigorously conduct a systematic performance evaluation on a wide range of representative HPC and CMP applications on three mainstream SMT architectures and quantify their performance sensitivity to SMT effects. Then we introduce a simple scheme for SMT-aware code optimization which aims to reduce the memory contention across SMT threads. Finally, we develop a lightweight performance tool, named SMTAnalyzer, to conclusively identify the optimization opportunities in the source code of multithreaded programs. Experiments on three SMT architectures (i.e., Intel Xeon, IBM POWER7, and Intel Xeon Phi) demonstrate that our proposed SMT-aware optimization scheme can significantly improve the performance for general HPC applications.
The third project addresses the unique challenges of cache miss-classification. As set-associative caches are widely used in modern architectures, capacity and conflict cache misses co-exist. These two types of cache misses require different optimization strategies. While cache misses are commonly studied using cache simulators, state-of-the-art simulators usually incur hundreds to thousands of times a program's execution runtime. Moreover, a simulator has difficulty in simulating complex real hardware. To overcome these limitations, measurement methods are proposed to directly monitor program execution on real hardware via performance monitoring units. However, existing measurement-based tools either focus on capacity cache misses or do not distinguish capacity and conflict cache misses. In this project, we design and implement CCProf, a lightweight measurement-based profiler that identifies conflict cache misses and associates them with program source code and data structures. CCProf incurs moderate runtime overhead that is at least an order of magnitude lower than simulators. With the evaluation on a number of representative programs, CCProf is able to guide optimizations on cache conflict misses and obtain nontrivial speedups.
In the final project, we study scalability challenge of a contemporary deep learning framework. Deep Learning Neural Networks (DNNs), have become increasingly popular in industry and academia for their powerful capability in pattern classification, image processing, and speech recognition. Recently, they have been widely adopted in the HPC environment for solving complex problems related to modeling, runtime prediction, and big data analysis. Current state-of-the-art designs for DNN on modern multi- and many-core CPU architectures, such as variants of Caffe, have reported promising performance in speedup and scalability, comparable with the GPU implementations. However, modern CPU architectures employ Non-Uniform Memory Access(NUMA) technique to integrate multiple sockets, which incurs unique challenges for designing highly efficient CNN frameworks. Without careful design, DNN frameworks can easily suffer from long memory latency due to a large number of memory accesses to remote NUMA domains, resulting in poor scalability. To address this challenge, we propose NUMA-aware multi-solver based CNN design, named NUMA-Caffe, for accelerating deep learning neural networks on multi- and many-core CPU architectures. NUMA-Caffe is independent of DNN network topology, does not impact network convergence rates, and provides superior scalability to the existing Caffe variants. Through a thorough empirical study on four contemporary NUMA-based multi- and many-core architectures, our experimental results demonstrate that NUMA-Caffe significantly outperforms the state-of-the-art Caffe designs regarding both throughput and scalability.
Probir Roy is a Ph.D. candidate in the Department of Computer Science at William & Mary. He is advised by Professor Xu Liu. He received his B.S. in Computer Science and Engineering from Bangladesh University of Engineering and Technology in 2009. His research interests include program analysis, high-performance computing, and operating system. Common threads of his research are to develop tools and techniques to improve software performance and developer productivity.