[PAST EVENT] Qihan Wang, Computer Science - Oral Preliminary Exam for the PhD.
As the computational capacity of general-purpose Graphics Processing Units (GPUs) grows, leveraging GPUs to accelerate parallel applications has become a critical topic in academia and industry. However, a wide range of irregular applications that are computation- or memory-intensive cannot easily achieve high GPU utilization. The challenges fall into four main areas: first, data dependence leads to coarse-grained kernels; second, heavy GPU memory usage may cause frequent memory evictions and extra I/O overhead; third, specific computation patterns produce memory redundancies; last, workload balance and data reusability jointly benefit overall performance, but there may be a dynamic trade-off between them.
To address these challenges, this dissertation proposes multiple optimizations to accelerate parallel irregular applications on GPU architectures. The dissertation focuses on two real-world applications as case studies: one is an eALS-based matrix factorization recommendation system; the other is the calculation of many-body correlation functions in a large-scale scientific system.
To parallelize the eALS-based recommendation system, this dissertation proposes an efficient CPU/GPU heterogeneous recommendation system, HEALS. HEALS employs newly designed architecture-adaptive data formats to achieve load balance and good data locality on both CPU and GPU. To mitigate data dependence, HEALS presents a CPU/GPU collaboration model that supports both task parallelism and data parallelism. Additionally, HEALS optimizes this collaboration model by overlapping kernel execution with communication and by partitioning the workload dynamically. HEALS also applies several kernel-level parallelization techniques for better GPU utilization: loop unrolling, vectorization, and warp reduction.
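As an illustration of what dynamic workload partitioning between a CPU and a GPU can look like, the following is a minimal sketch, not the HEALS implementation. It assumes the work can be split by rows and that per-device run times from the previous iteration are available; the names `rebalance` and `split_rows` and the clamping bounds are hypothetical.

```python
def rebalance(gpu_share, cpu_time, gpu_time, min_share=0.05, max_share=0.95):
    """Return an updated GPU share of the work (a fraction in [0, 1]).

    Assumption: each device's throughput is roughly constant, so the share
    that makes both devices finish together is proportional to throughput.
    """
    if cpu_time <= 0 or gpu_time <= 0:
        return gpu_share  # no usable measurement; keep the current split
    gpu_rate = gpu_share / gpu_time          # work fraction per second on GPU
    cpu_rate = (1.0 - gpu_share) / cpu_time  # work fraction per second on CPU
    new_share = gpu_rate / (gpu_rate + cpu_rate)
    # Clamp so that neither device is ever starved of measurements.
    return min(max_share, max(min_share, new_share))


def split_rows(n_rows, gpu_share):
    """Split row indices [0, n_rows) into a GPU range and a CPU range."""
    n_gpu = int(round(n_rows * gpu_share))
    return range(0, n_gpu), range(n_gpu, n_rows)
```

For example, if an even split took 4 s on the CPU and 1 s on the GPU, `rebalance(0.5, 4.0, 1.0)` moves the GPU share to 0.8, and `split_rows(10, 0.8)` then assigns 8 rows to the GPU and 2 to the CPU.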
To accelerate the calculation of many-body correlation functions, this dissertation presents MemHC, an optimized, systematic GPU memory management framework built on a series of new memory reduction designs. These designs cover GPU memory allocation, CPU/GPU communication, and GPU memory oversubscription. MemHC employs duplication-aware management and lazy release of GPU memory, with corresponding management on the host side, for better data reusability. MemHC also implements data reorganization and on-demand synchronization to eliminate redundant or unnecessary data transfers. Moreover, MemHC designs a novel Least Recently Used (LRU) eviction policy, called Pre-Protected LRU, to reduce evictions and increase memory hits.
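The abstract does not describe how Pre-Protected LRU works. The sketch below shows only the general idea of an LRU cache in which some entries are shielded from eviction, and it should not be read as MemHC's policy. The class name `ProtectedLRU`, its methods, and the assumption that the caller knows in advance which keys will be reused are all hypothetical.

```python
from collections import OrderedDict


class ProtectedLRU:
    """LRU cache that skips 'protected' keys when choosing a victim."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # key -> value, least recent first
        self.protected = set()        # keys the caller expects to reuse soon
        self.hits = 0
        self.misses = 0

    def protect(self, keys):
        """Replace the set of keys shielded from eviction (an assumption)."""
        self.protected = set(keys)

    def access(self, key, load):
        """Return the cached value for key, calling load(key) on a miss."""
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        if len(self.entries) >= self.capacity:
            self._evict_one()
        self.entries[key] = load(key)
        return self.entries[key]

    def _evict_one(self):
        # Evict the least recently used entry that is not protected.
        for victim in self.entries:
            if victim not in self.protected:
                del self.entries[victim]
                return
        # Every entry is protected: fall back to plain LRU.
        self.entries.popitem(last=False)
```

With a capacity of two, loading `a` and `b`, protecting `a`, and then loading `c` evicts `b` even though `a` is older, so a later access to `a` is a hit rather than a reload.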
To further optimize many-body correlation functions on GPU clusters, this dissertation implements an enhanced multi-GPU scheduling framework, MICCO, which takes both the data dimension (e.g., data reuse and data eviction) and the computation dimension into account. This work first performs a comprehensive study of the interplay between data reuse and load balance and introduces two new concepts, local reuse pattern and reuse bound, to capture the optimal trade-off between them. Based on this study, MICCO designs a heuristic scheduling algorithm and a machine-learning-based regression model to generate the optimal setting of reuse bounds.
In summary, this dissertation accelerates two typical irregular applications on GPUs by building three frameworks that cover CPU/GPU collaboration, GPU memory management, and multi-GPU scheduling. Ongoing and future work focuses on GPU topology-aware optimizations and on extending the implementation from single-node to multi-node GPU clusters.
Qihan Wang is a Ph.D. candidate in the Department of Computer Science at William & Mary. Her Ph.D. advisor is Prof. Bin Ren. Her research interests mainly include high-performance computing, GPU architectures, and machine learning. Her Ph.D. research has been accepted at IPDPS 2022, TACO 2021, HiPC 2021, and Smart Health 2020. Previously, she received her Bachelor of Software Engineering from Beihang University in 2017.