2021
IEEE High Performance Extreme Computing
Virtual Conference
20 - 24 September 2021

4-S1: Graph Challenge Special (17:30-19:30)
Organizer(s): Jeremy Kepner

Fast Sparse Deep Neural Network Inference with Flexible SpMM Optimization Space Exploration
Jie Xin (Huazhong University of Science and Technology); Xianqi Ye (Huazhong University of Science and Technology); Long Zheng
(Huazhong University of Science and Technology)*; Qinggang Wang (Huazhong University of Science and Technology); Yu Huang
(Huazhong University of Science and Technology); Pengcheng Yao (Huazhong University of Science and Technology); Linchen Yu
(Huazhong University of Science and Technology); Xiaofei Liao (HUST); Hai Jin (Huazhong University of Science and Technology)
Deep neural networks (DNNs) have been widely used in many fields, but their scalability suffers as model sizes grow. Sparse deep neural
networks (SpDNNs) promise to resolve this problem, but sparse data is difficult to execute efficiently on GPUs due to load imbalance and
irregular memory accesses. The recent MIT/IEEE/Amazon GraphChallenge has shown several significant advances in fitting sparse DNNs onto GPUs, but we observe
that none of these earlier efforts is an absolute winner across all datasets, owing to the limited optimization spaces they consider. In this paper, we
identify new opportunities for optimizing SpDNN execution via a comprehensive analysis of previous work. Based on this enlarged SpDNN design
space, we present sparsity-aware SpMM algorithms that systematically explore a performance-optimal solution for SpDNN execution on GPUs,
and further generate optimized SpMM kernel implementations. Compared to the 2020 HPEC Sparse DNN Challenge champions, our approach achieves up
to 55.6 TeraEdges per second of inference throughput, with speedups of up to 13.74x [1] and 22.29x [2] on a single NVIDIA V100 GPU. We also show that
our approach on 4 GPUs can outperform the 2020 Challenge champion using 768 GPUs in many cases. The source code is available at
https://github.com/CGCL-codes/Graphchallenge21.
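The SpMM-plus-ReLU layer at the heart of the Sparse DNN Challenge workload can be sketched in a few lines. This is a minimal pure-Python illustration of the operation Y' = ReLU(YW + b) with the weight matrix in CSR form, not the authors' optimized GPU kernels:

```python
def spdnn_layer(Y, indptr, indices, data, n_out, bias):
    """One sparse DNN layer: Z = ReLU(Y @ W + bias), with the weight
    matrix W (k x n_out) stored in CSR form (indptr, indices, data)."""
    Z = [[0.0] * n_out for _ in Y]
    for Yb, Zb in zip(Y, Z):
        for j, yv in enumerate(Yb):
            if yv == 0.0:
                continue  # skip zero activations: sparsity in the features
            for p in range(indptr[j], indptr[j + 1]):
                Zb[indices[p]] += yv * data[p]
    # biased ReLU, applied elementwise
    return [[max(0.0, z + bias) for z in row] for row in Z]
```

The inner loops are where the optimization space discussed in the paper opens up on a GPU: how nonzeros are assigned to threads determines load balance, and the order of accesses to Z and data determines memory coalescing.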
Faster Stochastic Block Partition using Aggressive Initial Merging, Compressed Representation, and Parallelism Control
Ahsen J Uppal (The George Washington University)*; H. Howie Huang (The George Washington University); Jaeseok Choi (The
George Washington University); Thomas Rolinger (Laboratory for Physical Sciences)
The community detection problem continues to be challenging, particularly for large graph data. Although optimal graph partitioning is NP-hard, stochastic
methods, such as in the IEEE HPEC GraphChallenge, can provide good approximate solutions in reasonable time. But the scalability of such solutions with
increasing graph size remains a challenge. In this work, we describe three new techniques to speed up the stochastic block partition algorithm. The first
technique relies on reducing the initial number of communities via aggressive agglomerative merging (a portion of the algorithm with high parallel scalability)
to quickly reduce the amount of data that must be processed, resulting in an independent speedup of 1.85x for a 200k node graph. Our second technique
uses a novel compressed data structure to store the main bookkeeping information of the algorithm. Our compressed representation allows the processing
of large graphs that would otherwise be impossible due to memory constraints, and has a speedup of up to 1.19x over our uncompressed baseline
representation. The third technique carefully manages the amount of parallelism during different phases of the algorithm. Compared to our best baseline
configuration using a fixed number of threads, this technique yields an independent speedup of 2.26x for a 200k node graph. Combined, our
techniques result in speedups of 3.78x for a 50k node graph, 4.71x for a 100k node graph, and 5.13x for a 200k node graph over our previous best parallel
algorithm.
Towards Distributed Square Counting in Large Graphs
Trevor Steil (Lawrence Livermore National Laboratory)*; Geoffrey Sanders (LLNL); Roger Pearce (Lawrence Livermore National
Laboratory)
Counting all squares (4-cycles) in a large graph is an expensive but important graph computation for network scientists to understand the higher-order
structures within their large relational datasets. This is particularly true for bipartite graphs, where 4-cycles are fundamental (as one of the smallest motifs not
enumerable locally at vertices or edges) and highly useful for measuring notions of connectivity and clustering. We build a distributed and asynchronous
approach to counting squares and demonstrate its ability to perform this task on large real-world datasets. We build on several techniques from previous
approaches, most notably degree-based vertex ordering. We note that a bipartite graph with one vertex set significantly
smaller than the other is a special and common case, and we describe the benefits of our approach applied to this case. Although our approach is able to
count hundreds of trillions of squares in a massive distributed graph, we report poor scaling, setting the stage for much algorithmic improvement by the
GraphChallenge community. We discuss the bottlenecks within our approach that must be mitigated in order to achieve a highly performant and scalable
method.
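The wedge-counting identity that underlies most square-counting approaches can be sketched as follows. This is an illustrative serial version, not the authors' distributed asynchronous implementation: it sums c*(c-1)/2 over common-neighbor counts and halves the result, since each square is seen from both of its diagonals:

```python
from collections import defaultdict
from itertools import combinations

def count_squares(adj):
    """Count 4-cycles in a graph given as {vertex: neighbor list}.
    A vertex pair with c common neighbors closes c*(c-1)/2 squares;
    dividing by 2 corrects for counting each square once per diagonal."""
    common = defaultdict(int)
    for nbrs in adj.values():
        for u, w in combinations(sorted(nbrs), 2):
            common[(u, w)] += 1  # one wedge (2-path) between u and w
    return sum(c * (c - 1) // 2 for c in common.values()) // 2
```

The memory cost of the `common` table, which can hold an entry for every vertex pair at distance two, hints at why degree-based ordering and distributed aggregation matter at scale.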
Sparse Deep Neural Network Acceleration on HBM-Enabled FPGA Platform
Abhishek K Jain (Xilinx)*; Sharan Kumar (Xilinx); Aashish Tripathi (Xilinx); Dinesh Gaitonde (Xilinx)
Domain-specific architectures (DSAs) for deep neural networks (DNNs) are becoming mainstream because of high energy-efficiency and performance.
Currently, most of these DSAs perform dense linear algebra efficiently by exploiting temporal and spatial locality, high data reuse and regular memory
access patterns. However, state-of-the-art DNN models require DSAs with large amounts of compute, interconnect, and memory resources. Since
sparsity in DNN parameters can be exploited to reduce the resources and complexity of the DSAs, emerging DSAs are adding native support for sparsity.
This paper presents a sparse DNN inference engine on HBM-enabled Alveo U280 platform, built around FPGA-optimized DSA blocks. The DSA blocks have
native support for completely unstructured sparsity. We demonstrate the effectiveness of the proposed inference engine by running inference models of the
Sparse DNN Challenge. Results for the challenge benchmarks show that the proposed inference engine and multi-DSA parallelization on HBM-enabled
Alveo U280 achieve up to 3.7 billion edges per second inference throughput.
Productive High-Performance k-Truss Decomposition on GPU Using Linear Algebra
Runze Wang (Huazhong University of Science and Technology)*; Linchen Yu (Huazhong University of Science and Technology);
Qinggang Wang (Huazhong University of Science and Technology); Jie Xin (Huazhong University of Science and Technology); Long
Zheng (Huazhong University of Science and Technology)
The k-truss decomposition algorithm aims to find the cohesive subgraphs of a graph. It comprises a two-step procedure of triangle counting and weak-edge
deletion. Due to significant differences in graph topologies, the best performer in last year's MIT/IEEE/Amazon GraphChallenge manually explores a
well-optimized combination of many k-truss techniques for each graph dataset, which requires expertise from multiple disciplines.
Linear algebra provides a unified framework for writing parallel and scalable k-truss decomposition algorithms easily. However, work imbalance and low
memory efficiency remain. In this work, we propose a productive yet efficient k-truss decomposition implementation on GPU based on fine-grained linear
algebra. We propose a preprocessing method that renumbers the vertices to eliminate the longest adjacency lists, balancing the workloads, and places the
adjacency list of each source vertex into shared memory to improve memory access efficiency. Results show
that our approach outperforms a linear algebra-based implementation, eager (a champion in 2019), by up to 24.81x and a manually optimized implementation,
KTRUSSEXPLORER, by up to 2.69x.
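As a reference point for the two-step procedure of triangle counting and weak-edge deletion, a naive serial k-truss computation (not the authors' linear-algebra GPU formulation) can be written as repeated support counting and peeling:

```python
from collections import defaultdict

def k_truss(edges, k):
    """Return the edges of the k-truss: the maximal subgraph in which
    every edge is supported by at least k - 2 triangles."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    changed = True
    while changed:  # peel until no weak edge remains
        changed = False
        for u in list(adj):
            for v in list(adj[u]):
                # |adj[u] & adj[v]| is the edge's support (triangle count)
                if u < v and len(adj[u] & adj[v]) < k - 2:
                    adj[u].discard(v)  # weak edge: delete it
                    adj[v].discard(u)
                    changed = True
    return sorted((u, v) for u in adj for v in adj[u] if u < v)
```

Deleting one edge can lower the support of its neighbors, which is why the peeling must iterate to a fixed point; the repeated set intersections are exactly where work imbalance across adjacency lists of very different lengths arises.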
HyKernel: A Hybrid Selection of One/Two-Phase Kernels for Triangle Counting on GPUs
Mohammad Almasri (University of Illinois at Urbana-Champaign)*; Neo Vasudeva (University of Illinois at Urbana-Champaign);
Rakesh Nagi (University of Illinois at Urbana-Champaign); Jinjun Xiong (IBM Thomas J. Watson Research Center); Wen-Mei Hwu
(University of Illinois at Urbana-Champaign)
We present a new two-phase dynamic triangle counting approach on GPUs. The approach targets graphs with high maximum degrees and high degree
standard deviations. The first phase analyzes the edges of the graph and determines the most appropriate list intersection method and thread granularity
that achieve the highest performance for each edge. The second phase performs the list intersection with the thread granularity selected in the first phase.
Furthermore, we improve our static triangle counting kernels used in the two-phase approach, namely the two-pointer and binary-search intersections. We
introduce a hierarchical binary-search approach that loads the initial levels of the logical search tree into the GPU fast shared memory to reduce redundant
accesses to global memory. The search space reduction is another improvement to our static kernels that reduces the size of adjacency lists when orienting
the graph by vertex ID. We also present HyKernel, a hybrid approach that selects between our two-phase dynamic and our 2019 one-phase dynamic
approaches using simple heuristics. Lastly, we provide a study of the impact of two graph orientation approaches on the graph preprocessing times and
kernel times for our 2019 approach and the proposed two-phase approach. For evaluation, we compare our two-phase and HyKernel approaches with our
2019 submission and the 2019 Champion approach, H-Index. When orienting the graph by vertex ID, our proposed two-phase and HyKernel approaches
significantly outperform H-Index with geomean speedups of 4.0x (up to 31.2x) and 5.1x (up to 31.2x), respectively. With vertex-degree orientation, our
proposed two-phase and HyKernel approaches significantly outperform H-Index with geomean speedups of 4.7x (up to 26.8x) and 7.2x (up to 43.6x), respectively.
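The static kernel family described above, vertex-ID orientation followed by sorted-list intersection per edge, can be illustrated with a minimal serial sketch (in Python, not the authors' CUDA kernels; the two-pointer intersection is the method whose granularity the two-phase approach tunes per edge):

```python
from collections import defaultdict

def count_triangles(edges):
    """Count triangles: orient each edge from lower to higher vertex ID,
    then intersect the sorted out-neighbor lists of each edge's endpoints."""
    out = defaultdict(list)
    for u, v in edges:
        if u > v:
            u, v = v, u
        out[u].append(v)  # orientation removes duplicate triangle counts
    for nbrs in out.values():
        nbrs.sort()
    total = 0
    for u, nbrs in out.items():
        for v in nbrs:
            # two-pointer intersection of out[u] and out[v]
            a, b = nbrs, out.get(v, [])
            i = j = 0
            while i < len(a) and j < len(b):
                if a[i] == b[j]:
                    total += 1
                    i += 1
                    j += 1
                elif a[i] < b[j]:
                    i += 1
                else:
                    j += 1
    return total
```

The two-pointer walk is cheap when both lists are short and similar in length; binary search into the longer list wins when lengths are highly skewed, which is exactly the per-edge decision the paper's first phase makes.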
DPGS Graph Summarization Preserves Community Structure
L JK Durbeck (Virginia Tech)*; Peter Athanas (Virginia Tech)
Community detection, or clustering, is an NP-hard problem for which computationally efficient approximations have long been sought. Methods have sought
a balance between accuracy, simplicity, latency, and memory usage. In this paper, we assess whether a recently developed graph summarization technique
preserves the underlying community structure of the graph. The technique is Degree-Preserving Graph Summarization (DPGS). It lossily encodes the graph
using fewer nodes and edges but does not assume a uniform distribution of node degrees, as most Minimum Description Length-based graph summarization
methods do. Instead, it improves the reconstruction model for graphs with realistic skewed degree distributions by using the configuration model. We show that the
supernodes produced by DPGS capture the latent communities within the graph for a series of graphs with known community structure, while reducing the
graph description length by 25%.

2021 Abstract Book