2019 IEEE High Performance
Extreme Computing Conference
(HPEC ‘19)
Twenty-third Annual HPEC Conference
24 - 26 September 2019
Westin Hotel, Waltham, MA USA
Thursday, September 25, 2019
IEEE/MIT/Amazon Graph Challenge Awards
3:00-5:00 in Eden Vale B
Chair: Jeremy Kepner/ MIT Lincoln Laboratory
[Champion] A GPU Implementation of the Sparse Deep Neural Network Graph Challenge
Mauro Bisson, Massimiliano Fatica (NVIDIA)
This paper presents a CUDA implementation of the latest addition to the Graph Challenge, the inference computation on a collection of
large sparse deep neural networks. A single Tesla V100 can compute the inference at 3.7 TeraEdges/s. Using the managed memory API
available in CUDA allows for simple and efficient distribution of these computations across a multi-GPU NVIDIA DGX-2 server.
[Champion] H-INDEX: Hash-Indexing for Parallel Triangle Counting on GPUs
Santosh Pandey (UMass Lowell), Xiaoye S. Li, Aydin Buluc (LBNL), Jiejun Xu (HRL), Hang Liu (UMass Lowell)
Triangle counting is a graph algorithm that calculates the number of triangles involving each vertex in a graph. Briefly, a triangle
encompasses three vertices from a graph, where every vertex possesses at least one incidental edge to the other two vertices from the
triangle. Consequently, list intersection, which identifies the incidental edges, becomes the core algorithm for triangle counting. At the
meantime, attracted by the enormous potentials of Graphics Processing Units (GPUs), an array of efforts has surged to deploy triangle
counting on GPUs. While state-of-the-art intersection algorithms, such as merge- path and binary-search, perform well on traditional
multi-core CPU systems, deploying them on massively parallel GPUs turns out to be challenging. Particularly, merge-path based
approach experiences the hardship of evenly distributing the workload across vast GPU threads and irregular memory accesses. Binary-
search based approach suffers from the potential high time complexity problem. Further, both algorithms require sorted neighbor list from
the input graphs, which involves nontrivial preprocessing overhead. To this end, we introduce H-INDEX, a hash-indexing assisted triangle
counting that overcomes all the shortcomings from the state-of-the-art algorithms. Notably, H-INDEX is the first work to achieve beyond
100 billion TEPS computing rate for triangle counting problem.
[Champion] Write Quick, Run Fast: Sparse Deep Neural Network in 20 Minutes of Development Time via
SuiteSparse:GraphBLAS
Timothy A. Davis, Mohsen Aznaveh, and Scott Kolodziej (Texas A&M Univ.)
SuiteSparse:GraphBLAS is a full implementation of the GraphBLAS standard, which provides a powerful and expressive framework for
creating graph algorithms based on the elegant mathematics of sparse matrix operations on a semiring. Algorithms written in GraphBLAS
achieve high performance with minimal development time. Using GraphBLAS, it took a mere 20 minutes to write a first-cut computational
kernel that solves the Sparse Deep Neural Network Graph Challenge. Understanding the problem description and file format, writing
code to read in the files that define the problem, and comparing our results with the reference solution took a full day. The kernel consists
of a single for-loop around 4 lines of code, all of which are calls to GraphBLAS, and it worked perfectly the first time it was compiled. The
sequential performance of the GraphBLAS solution is 3x to 5x faster than the MATLAB reference implementation. OpenMP parallelism
gives an additional 10x to 15x speedup on a 20-core Intel processor, 17x on an IBM Power8 system, and 20x on a Power9 system, for the
largest problems. Since SuiteSparse:GraphBLAS does not yet employ MPI, this was added at the application level, a development effort
that took one week, primarily because of difficulties in resolving a load-balancing issue in the MPI-based parallel algorithm.
[Champion] Exploration of Fine-Grained Parallelism for Load Balancing Eager K-truss on GPU and CPU
Mark Blanco, Tze Meng Low (CMU), Kyungjoo Kim (Sandia)
In this work we present a performance exploration on Eager K-truss, a linear-algebraic formulation of the K-truss graph algorithm. We
address performance issues related to load imbalance of parallel tasks in symmetric, triangular graphs by presenting a fine-grained
parallel approach to executing the support computation. This approach also increases available parallelism, making it amenable to GPU
execution. We demonstrate our fine-grained parallel approach using implementations in Kokkos and evaluate them on an Intel Skylake
CPU and an Nvidia Tesla V100 GPU. Overall, we observe between a 1.26-1.48x improvement on the CPU and a 9.97-16.92x
improvement on the GPU due to our fine-grained parallel formulation.
[Champion] One Quadrillion Triangles Queried on One Million Processors
Roger Pearce, Trevor Steil, Benjamin Priest, Geoffrey Sanders (LLNL)
We update our prior 2017 Graph Challenge submission on large scale triangle counting in distributed memory by demonstrating scaling
and validation on trillion-edge scale-free graphs. We incorporate recent distributed communication optimizations developed for irregular
communication workloads, and demonstrate scaling up to 1.5 million cores of IBM BG/Q Sequoia at LLNL. We validate our
implementation using non- stochastic Kronecker graph generation where ground-truth local and global triangle counts are known, and
model our Kronecker graph inputs after the Graph500RMAT inputs. To our knowledge, our results are the largest triangle count
experiments on synthetic scale-free graphs to date.
[Innovation Award] Linear Algebra-Based Triangle Counting via Fine-Grained Tasking on Heterogeneous Environments
Abdurrahman Yasşar (GA Tech), Sivasankaran Rajamanickam, Jonathan Berry, Michael Wolf (Sandia), Jeffrey S. Young, Ümit V.
Çatalyürek (GA Tech)
Triangle counting is a representative graph problem that shows the challenges of improving graph algorithm performance using
algorithmic techniques and adopting graph algorithms to new architectures. In this paper, we describe an update to the linear-algebraic
formulation of the triangle counting problem. Our new approach relies on fine-grained tasking based on a tile layout. We adopt this task
based algorithm to heterogeneous architectures (CPUs and GPUs) for up to 10.8x speed up over past year's graph challenge submission.
This implementation also results in the fastest kernel time known at time of publication for real-world graphs like twitter (3.7 second) and
friendster (1.8 seconds) on GPU accelerators when the graph is GPU resident. This is a 1.7 and 1.2 time improvement over previous
state-of-the-art triangle counting on GPUs. We also improved end-to-end execution time by overlapping computation and communication
of the graph to the GPUs. In terms of end-to-end execution time, our implementation also achieves the fastest end-to-end times due to
very low overhead costs.
[Innovation Award] Scalable Triangle Counting on Distributed-Memory Systems
Seher Acer (Sandia), Abdurrahman Yasşar (GA Tech), Sivasankaran Rajamanickam, Michael Wolf (Sandia), Ümit V. Çatalyürek† (GA
Tech)
Triangle counting is a foundational graph-analysis kernel in network science. It has also been one of the challenge problems for the “Static
Graph Challenge”. In this work, we propose a novel, hybrid, parallel triangle counting algorithm based on its linear algebra formulation.
Our framework uses MPI and Cilk for exploiting the benefits of distributed-memory and shared-memory parallelism, respectively. The
problem is partitioned among MPI processes using a two-dimensional (2D) Cartesian block partitioning. One-dimensional (1D) rowwise
partitioning is used within the Cartesian blocks for shared-memory parallelism using the Cilk programming model. Besides exhibiting very
good strong scaling behavior in almost all tested graphs, our algorithm achieves the fastest time on the 1.4B edge real-world twitter graph,
which is 3.217 seconds, on 1,092 cores. In comparison to past distributed-memory parallel winners of the graph challenge, we
demonstrate speed up of 2.7x on this twitter graph. This is also the fastest time reported for parallel triangle counting on the twitter graph
when the graph is not replicated.
[Innovation Award] Scalable Inference for Sparse Deep Neural Networks using Kokkos Kernels
J. Austin Ellis and Sivasankaran Rajamanickam (Sandia)
Over the last decade, hardware advances have led to the feasibility of training and inference for very large deep neural networks.
Sparsified deep neural networks (DNNs) can greatly reduce memory costs and increase throughput of standard DNNs, if loss of accuracy
can be controlled. The IEEE HPEC Sparse Deep Neural Network Graph Challenge serves as a testbed for algorithmic and
implementation advances to maximize computational performance of sparse deep neural networks. We base our sparse network for
DNNs, KK-SpDNN, on the sparse linear algebra kernels within the Kokkos Kernels library. Using the sparse matrix-matrix multiplication in
Kokkos Kernels allows us to reuse a highly optimized kernel. We focus on reducing the single node and multi-node runtimes for 12 sparse
networks. We test KK-SpDNN on Intel Skylake and Knights Landing architectures and see 120-500x improvement on single node
performance over the serial reference implementation. We run in data-parallel mode with MPI to further speed up network inference,
ultimately obtaining an edge processing rate of 1.16e+12 on 20 Skylake nodes. This translates to a 13x speed up on 20 nodes compared
to our highly optimized multithreaded implementation on a single Skylake node.
[Innovation Award] Scaling and Quality of Modularity Optimization Methods for Graph Clustering
Sayan Ghosh, Mahantesh Halappanavar, Antonino Tumeo (PNNL), Ananth Kalyanaraman (Washington State Univ.)
Real-world graphs exhibit structures known as “communities” or “clusters” consisting of a group of vertices with relatively high connectivity
between them, as compared to the rest of the vertices in the network. Graph clustering or community detection is a fundamental graph
operation used to analyze real-world graphs occurring in the areas of compu- tational biology, cybersecurity, electrical grids, etc. Similar to
other graph algorithms, owing to irregular memory accesses and inherently sequential nature, current algorithms for community detection
are challenging to parallelize. However, in order to analyze large networks, it is important to develop scalable parallel implementations of
graph clustering that are capable of exploiting the architectural features of modern supercomputers. In response to the 2019 Streaming
Graph Challenge, we present quality and performance analysis of our distributed- memory community detection using Vite, which is our
distributed memory implementation of the popular Louvain method, on the ALCF Theta supercomputer. Clustering methods such as
Louvain that rely on modularity maximization are known to suffer from the resolution limit problem, preventing identification of clusters of
certain sizes. Hence, we also include quality analysis of our shared-memory implementation of the Fast-tracking Resistance method, in
comparison with Louvain on the challenge datasets. Furthermore, we introduce an edge-balanced graph distribution for our distributed
memory implementation, that significantly reduces communication, offering up to 80% improvement in the overall execution time. In
addition to performance/quality analysis, we also include details on the power/energy consump- tion, and memory traffic of the distributed-
memory clustering implementation using real-world graphs with over a billion edges.
[Innovation Award] Direction-Optimizing Label Propagation Algorithm
Xu Liu (Washington State Univ., PNNL), Jesun Sahariar Firoz, Marcin Zalewski, Mahantesh Halappanavar, Kevin J. Baker (PNNL),
Andrew Lumsdaine (PNNL, Univ. of Washington), Assefaw H. Gebremedhin∗ (Washington State Univ.)
Designing a scalable algorithm for community detection is challenging due to the simultaneous need for both high performance and
quality of solution. We propose a new distributed algorithm for community detection based on a novel Label Propagation algorithm. The
algorithm is inspired by the direction optimization technique in graph traversal algorithms, relies on the use of frontiers, and alternates
between abstractions called label push and label pull. This organization creates flexibility and affords us with opportunities for balancing
performance and quality of solution. We implement our algorithm in distributed memory with the active-message based asynchronous
many-task runtime AM++. We experiment with two seeding strategies for the initial seeding stage, namely, random seeding and degree
seeding. With the Graph Challenge dataset, our distributed implementation, in conjunction with the runtime support, detects the
communities in graphs having 20 million vertices in less than one second while achieving reasonably high quality of solution.
[Student Innovation Award] Fast Stochastic Block Partitioning via Sampling
Frank Wanye (Virginia Tech), Vitaliy Gleyzer (MIT LL), Wu-chun Feng (Virginia Tech)
Community detection in graphs, also known as graph partitioning, is a well-studied NP-hard problem. Various heuristic approaches have
been adopted to tackle this problem in polynomial time. One such approach, as outlined in the IEEE HPEC Graph Challenge, is Bayesian
statistics-based stochastic block partitioning. This method delivers high-quality partitions in sub-quadratic runtime, but it fails to scale to
very large graphs. In this paper, we present sampling as an avenue for speeding up the algorithm on large graphs. We first show that
existing sampling techniques can preserve a graph’s community structure. We then show that sampling for stochastic block partitioning
can be used to produce a speedup of between 2.18X and 7.26X for graph sizes between 5,000 and 50,000 vertices without a significant
loss in partitioning accuracy.
[Student Innovation Award] Accelerating DNN Inference with GraphBLAS and the GPU
Xiaoyun Wang, Zhongyi Lin, Carl Yang, John D. Owens (UC Davis)
This work addresses the 2019 Sparse Deep Neural Network Graph Challenge with an implementation of this challenge using the
GraphBLAS programming model. We demonstrate our solution to this challenge with GraphBLAST, a GraphBLAS implementation on the
GPU, and compare it to SuiteSparse, a GraphBLAS implementation on the CPU. The GraphBLAST implementation is 1.94 times faster
than SuiteSparse; the primary opportunity to increase performance on the GPU is a higher-performance sparse-matrix-times-sparse-
matrix (SpGEMM) kernel.
[Student Innovation Award] Update on k-truss Decomposition on GPU
Mohammad Almasri, Omer Anjum, Carl Pearson, Zaid Qureshi, Vikram S. Mailthody, Rakesh Nagi (UIUC), Jinjun Xiong (IBM TJ Watson),
and Wen-mei Hwu (UIUC)
In this paper, we present an update to our previous submission on k-truss decomposition from Graph Challenge2018. For single GPU k-
truss implementation, we propose multiple algorithmic optimizations that significantly improve performance by up to 35.2x (6.9x on
average) compared to our previous GPU implementation. In addition, we present a scalable multi-GPU implementation in which each
GPU handles a different K value. Compared to our prior multi-GPU implementation, the proposed approach is faster by up to 151.3x
(78.8x on average). In case when the edges with only maximal k-truss are sought, incrementing the ‘k’ value in each iteration is inefficient
particularly for graphs with large maximum k-truss. Thus, we propose binary search for the ‘k’ value to find the maximal k-truss. The
binary search approach on a single GPU is up to101.5 (24.3x on average) faster than our 2018k-truss submission. Lastly, we show that
the proposed binary search finds the maximum k-truss for “Twitter” graph dataset having 2.8 billion bidirectional edges in just 16 minutes
on a single V100 GPU.
[Student Innovation Award] DistTC: High Performance Distributed Triangle Counting
Loc Hoang, Vishwesh Jatala, Xuhao Chen, Udit Agarwal, Roshan Dathathri, Gurbinder Gill, Keshav Pingali (UT Austin)
We describe a novel multi-machine multi-GPU implementation of triangle counting which exploits a novel application-agnostic graph
partitioning strategy that eliminates almost all inter-host communication during triangle counting. Experimental results show that this
distributed triangle counting implementation can handle very large graphs such as clueweb12, which has almost one billion vertices and
37 billion edges, and it is up to 1.6× faster than TriCore, the 2018 Graph Challenge champion.