[PAST EVENT] Hongyuan Liu, Computer Science - Dissertation Defense
The big-data era has brought new challenges to computer architectures due to the large-scale computation and data. Moreover, this problem becomes critical in several domains where the computation is also irregular, among which we focus on automata processing in this dissertation. Automata are widely used in applications from different domains such as network intrusion detection, machine learning, and parsing. Large-scale automata processing is challenging for traditional von Neumann architectures. To this end, many accelerator prototypes have been proposed. Micron's Automata Processor (AP) is an example. However, as a spatial architecture, it is unable to handle large automata programs without repeated reconfiguration and re-execution. We found a large number of automata states are never enabled in the execution but still configured on the AP chips, leading to its underutilization. To address this issue, we proposed a lightweight offline profiling technique to predict the never-enabled states and keep them out of the AP. Furthermore, we develop SparseAP, a new execution mode for AP to handle the misprediction efficiently. Our software and hardware co-optimization obtains 2.1x speedup over the baseline AP execution across 26 applications.
Since the AP is not publicly available, we aim to reduce the performance gap between a general-purpose accelerator---Graphics Processing Unit (GPU) and AP. We identify excessive data movement in the GPU memory hierarchy and propose optimization techniques to reduce the data movement. Although our optimization techniques significantly alleviate these memory-related bottlenecks, a side effect of them is the static assignment of work to cores. This leads to poor compute utilization as GPU cores are wasted on idle automata states. Therefore, we propose a new dynamic scheme that effectively balances compute utilization with reduced memory usage. Our combined optimizations provide a significant improvement over the previous state-of-the-art GPU implementations of automata. Moreover, they enable current GPUs to outperform the AP across several applications while performing within an order of magnitude for the rest of them.
To make automata processing on GPU more generic to tasks with different amounts of parallelism, we propose AsyncAP, a lightweight approach that scales with the input length. Threads run asynchronously in AsyncAP, alleviating the bottleneck of thread block synchronization. The evaluation and detailed analysis demonstrate that AsyncAP achieves significant speedup or at least comparable performance under various scenarios for most of the applications.
The future work aims to design automatic ways to generate optimizations and mappings between automata and computation resources for different GPUs. We will broaden the scope of this dissertation to domains such as graph computing.
Hongyuan Liu is a Ph.D. candidate in the Department of Computer Science at William & Mary under the supervision of Professor Adwait Jog. Hongyuan's research interests lie in the broad area of computer architecture, emphasizing domain-specific accelerators and GPUs. His Ph.D. research has been published in MICRO 2018, ASPLOS 2020, and CLUSTER 2021. Hongyuan interned with Intel in Fall 2019 and worked as a visiting student with Argonne National Lab in Spring 2021. Before joining William & Mary, he received his bachelor's degree from Shandong University and master's degree from the University of Hong Kong.