2019 IEEE High Performance Extreme Computing Conference(HPEC ‘19)Twenty-third Annual HPEC Conference24 - 26 September 2019Westin Hotel, Waltham, MA USA
Thursday, September 25, 2019IEEE/MIT/Amazon Graph Challenge Awards 3:00-5:00 in Eden Vale BChair: Jeremy Kepner/ MIT Lincoln Laboratory[Champion] A GPU Implementation of the Sparse Deep Neural Network Graph ChallengeMauro Bisson, Massimiliano Fatica (NVIDIA)This paper presents a CUDA implementation of the latest addition to the Graph Challenge, the inference computation on a collection of large sparse deep neural networks. A single Tesla V100 can compute the inference at 3.7 TeraEdges/s. Using the managed memory API available in CUDA allows for simple and efficient distribution of these computations across a multi-GPU NVIDIA DGX-2 server.[Student Innovation Award] Accelerating DNN Inference with GraphBLAS and the GPUXiaoyun Wang, Zhongyi Lin, Carl Yang, John D. Owens (UC Davis)This work addresses the 2019 Sparse Deep Neural Network Graph Challenge with an implementation of this challenge using the GraphBLAS programming model. We demonstrate our solution to this challenge with GraphBLAST, a GraphBLAS implementation on the GPU, and compare it to SuiteSparse, a GraphBLAS implementation on the CPU. The GraphBLAST implementation is 1.94 times faster than SuiteSparse; the primary opportunity to increase performance on the GPU is a higher-performance sparse-matrix-times-sparse-matrix (SpGEMM) kernel.[Innovation Award] Direction-Optimizing Label Propagation AlgorithmXu Liu (Washington State Univ., PNNL), Jesun Sahariar Firoz, Marcin Zalewski, Mahantesh Halappanavar, Kevin J. Baker (PNNL), Andrew Lumsdaine (PNNL, Univ. of Washington), Assefaw H. Gebremedhin∗ (Washington State Univ.)Designing a scalable algorithm for community detection is challenging due to the simultaneous need for both high performance and quality of solution. We propose a new distributed algorithm for community detection based on a novel Label Propagation algorithm. The algorithm is inspired by the direction optimization technique in graph traversal algorithms, relies on the use of frontiers, and alternates between abstractions called label push and label pull. This organization creates flexibility and affords us with opportunities for balancing performance and quality of solution. We implement our algorithm in distributed memory with the active-message based asynchronous many-task runtime AM++. We experiment with two seeding strategies for the initial seeding stage, namely, random seeding and degree seeding. With the Graph Challenge dataset, our distributed implementation, in conjunction with the runtime support, detects the communities in graphs having 20 million vertices in less than one second while achieving reasonably high quality of solution.[Student Innovation Award] DistTC: High Performance Distributed Triangle CountingLoc Hoang, Vishwesh Jatala, Xuhao Chen, Udit Agarwal, Roshan Dathathri, Gurbinder Gill, Keshav Pingali (UT Austin)We describe a novel multi-machine multi-GPU implementation of triangle counting which exploits a novel application-agnostic graph partitioning strategy that eliminates almost all inter-host communication during triangle counting. Experimental results show that this distributed triangle counting implementation can handle very large graphs such as clueweb12, which has almost one billion vertices and 37 billion edges, and it is up to 1.6× faster than TriCore, the 2018 Graph Challenge champion.[Champion] Exploration of Fine-Grained Parallelism for Load Balancing Eager K-truss on GPU and CPUMark Blanco, Tze Meng Low (CMU), Kyungjoo Kim (Sandia)In this work we present a performance exploration on Eager K-truss, a linear-algebraic formulation of the K-truss graph algorithm. We address performance issues related to load imbalance of parallel tasks in symmetric, triangular graphs by presenting a fine-grained parallel approach to executing the support computation. This approach also increases available parallelism, making it amenable to GPU execution. We demonstrate our fine-grained parallel approach using implementations in Kokkos and evaluate them on an Intel Skylake CPU and an Nvidia Tesla V100 GPU. Overall, we observe between a 1.26-1.48x improvement on the CPU and a 9.97-16.92x improvement on the GPU due to our fine-grained parallel formulation.[Student Innovation Award] Fast Stochastic Block Partitioning via SamplingFrank Wanye (Virginia Tech), Vitaliy Gleyzer (MIT LL), Wu-chun Feng (Virginia Tech)Community detection in graphs, also known as graph partitioning, is a well-studied NP-hard problem. Various heuristic approaches have been adopted to tackle this problem in polynomial time. One such approach, as outlined in the IEEE HPEC Graph Challenge, is Bayesian statistics-based stochastic block partitioning. This method delivers high-quality partitions in sub-quadratic runtime, but it fails to scale to very large graphs. In this paper, we present sampling as an avenue for speeding up the algorithm on large graphs. We first show that existing sampling techniques can preserve a graph’s community structure. We then show that sampling for stochastic block partitioning can be used to produce a speedup of between 2.18X and 7.26X for graph sizes between 5,000 and 50,000 vertices without a significant loss in partitioning accuracy.[Champion] H-INDEX: Hash-Indexing for Parallel Triangle Counting on GPUsSantosh Pandey (UMass Lowell), Xiaoye S. Li, Aydin Buluc (LBNL), Jiejun Xu (HRL), Hang Liu (UMass Lowell)Triangle counting is a graph algorithm that calculates the number of triangles involving each vertex in a graph. Briefly, a triangle encompasses three vertices from a graph, where every vertex possesses at least one incidental edge to the other two vertices from the triangle. Consequently, list intersection, which identifies the incidental edges, becomes the core algorithm for triangle counting. At the meantime, attracted by the enormous potentials of Graphics Processing Units (GPUs), an array of efforts has surged to deploy triangle counting on GPUs. While state-of-the-art intersection algorithms, such as merge- path and binary-search, perform well on traditional multi-core CPU systems, deploying them on massively parallel GPUs turns out to be challenging. Particularly, merge-path based approach experiences the hardship of evenly distributing the workload across vast GPU threads and irregular memory accesses. Binary-search based approach suffers from the potential high time complexity problem. Further, both algorithms require sorted neighbor list from the input graphs, which involves nontrivial preprocessing overhead. To this end, we introduce H-INDEX, a hash-indexing assisted triangle counting that overcomes all the shortcomings from the state-of-the-art algorithms. Notably, H-INDEX is the first work to achieve beyond 100 billion TEPS computing rate for triangle counting problem.[Innovation Award] Linear Algebra-Based Triangle Counting via Fine-Grained Tasking on Heterogeneous EnvironmentsAbdurrahman Yasşar (GA Tech), Sivasankaran Rajamanickam, Jonathan Berry, Michael Wolf (Sandia), Jeffrey S. Young, Ümit V. Çatalyürek (GA Tech)Triangle counting is a representative graph problem that shows the challenges of improving graph algorithm performance using algorithmic techniques and adopting graph algorithms to new architectures. In this paper, we describe an update to the linear-algebraic formulation of the triangle counting problem. Our new approach relies on fine-grained tasking based on a tile layout. We adopt this task based algorithm to heterogeneous architectures (CPUs and GPUs) for up to 10.8x speed up over past year's graph challenge submission. This implementation also results in the fastest kernel time known at time of publication for real-world graphs like twitter (3.7 second) and friendster (1.8 seconds) on GPU accelerators when the graph is GPU resident. This is a 1.7 and 1.2 time improvement over previous state-of-the-art triangle counting on GPUs. We also improved end-to-end execution time by overlapping computation and communication of the graph to the GPUs. In terms of end-to-end execution time, our implementation also achieves the fastest end-to-end times due to very low overhead costs.[Champion] One Quadrillion Triangles Queried on One Million ProcessorsRoger Pearce, Trevor Steil, Benjamin Priest, Geoffrey Sanders (LLNL)We update our prior 2017 Graph Challenge submission on large scale triangle counting in distributed memory by demonstrating scaling and validation on trillion-edge scale-free graphs. We incorporate recent distributed communication optimizations developed for irregular communication workloads, and demonstrate scaling up to 1.5 million cores of IBM BG/Q Sequoia at LLNL. We validate our implementation using non- stochastic Kronecker graph generation where ground-truth local and global triangle counts are known, and model our Kronecker graph inputs after the Graph500RMAT inputs. To our knowledge, our results are the largest triangle count experiments on synthetic scale-free graphs to date.[Innovation Award] Scalable Inference for Sparse Deep Neural Networks using Kokkos KernelsJ. Austin Ellis and Sivasankaran Rajamanickam (Sandia)Over the last decade, hardware advances have led to the feasibility of training and inference for very large deep neural networks. Sparsified deep neural networks (DNNs) can greatly reduce memory costs and increase throughput of standard DNNs, if loss of accuracy can be controlled. The IEEE HPEC Sparse Deep Neural Network Graph Challenge serves as a testbed for algorithmic and implementation advances to maximize computational performance of sparse deep neural networks. We base our sparse network for DNNs, KK-SpDNN, on the sparse linear algebra kernels within the Kokkos Kernels library. Using the sparse matrix-matrix multiplication in Kokkos Kernels allows us to reuse a highly optimized kernel. We focus on reducing the single node and multi-node runtimes for 12 sparse networks. We test KK-SpDNN on Intel Skylake and Knights Landing architectures and see 120-500x improvement on single node performance over the serial reference implementation. We run in data-parallel mode with MPI to further speed up network inference, ultimately obtaining an edge processing rate of 1.16e+12 on 20 Skylake nodes. This translates to a 13x speed up on 20 nodes compared to our highly optimized multithreaded implementation on a single Skylake node.[Innovation Award] Scalable Triangle Counting on Distributed-Memory SystemsSeher Acer (Sandia), Abdurrahman Yasşar (GA Tech), Sivasankaran Rajamanickam, Michael Wolf (Sandia), Ümit V. Çatalyürek† (GA Tech)Triangle counting is a foundational graph-analysis kernel in network science. It has also been one of the challenge problems for the “Static Graph Challenge”. In this work, we propose a novel, hybrid, parallel triangle counting algorithm based on its linear algebra formulation. Our framework uses MPI and Cilk for exploiting the benefits of distributed-memory and shared-memory parallelism, respectively. The problem is partitioned among MPI processes using a two-dimensional (2D) Cartesian block partitioning. One-dimensional (1D) rowwise partitioning is used within the Cartesian blocks for shared-memory parallelism using the Cilk programming model. Besides exhibiting very good strong scaling behavior in almost all tested graphs, our algorithm achieves the fastest time on the 1.4B edge real-world twitter graph, which is 3.217 seconds, on 1,092 cores. In comparison to past distributed-memory parallel winners of the graph challenge, we demonstrate speed up of 2.7x on this twitter graph. This is also the fastest time reported for parallel triangle counting on the twitter graph when the graph is not replicated.[Innovation Award] Scaling and Quality of Modularity Optimization Methods for Graph ClusteringSayan Ghosh, Mahantesh Halappanavar, Antonino Tumeo (PNNL), Ananth Kalyanaraman (Washington State Univ.)Real-world graphs exhibit structures known as “communities” or “clusters” consisting of a group of vertices with relatively high connectivity between them, as compared to the rest of the vertices in the network. Graph clustering or community detection is a fundamental graph operation used to analyze real-world graphs occurring in the areas of compu- tational biology, cybersecurity, electrical grids, etc. Similar to other graph algorithms, owing to irregular memory accesses and inherently sequential nature, current algorithms for community detection are challenging to parallelize. However, in order to analyze large networks, it is important to develop scalable parallel implementations of graph clustering that are capable of exploiting the architectural features of modern supercomputers. In response to the 2019 Streaming Graph Challenge, we present quality and performance analysis of our distributed- memory community detection using Vite, which is our distributed memory implementation of the popular Louvain method, on the ALCF Theta supercomputer. Clustering methods such as Louvain that rely on modularity maximization are known to suffer from the resolution limit problem, preventing identification of clusters of certain sizes. Hence, we also include quality analysis of our shared-memory implementation of the Fast-tracking Resistance method, in comparison with Louvain on the challenge datasets. Furthermore, we introduce an edge-balanced graph distribution for our distributed memory implementation, that significantly reduces communication, offering up to 80% improvement in the overall execution time. In addition to performance/quality analysis, we also include details on the power/energy consump- tion, and memory traffic of the distributed-memory clustering implementation using real-world graphs with over a billion edges.A GPU Implementation of the Sparse Deep Neural Network Graph ChallengeMauro Bisson, Massimiliano Fatica (NVIDIA)The MIT/IEEE/Amazon GraphChallenge.org encourages community approaches to developing new solutions for analyzing graphs and sparse data. Sparse AI analytics present unique scalability difficulties. The proposed Sparse Deep Neural Network (DNN) Challenge draws upon prior challenges from machine learning, high performance computing, and visual analytics to create a challenge that is reflective of emerging sparse AI systems. The Sparse DNN Challenge is based on a mathematically well-defined DNN inference computation and can be implemented in any programming environment. Sparse DNN inference is amenable to both vertex-centric implementations and array-based implementations (e.g., using the GraphBLAS.org standard). The computations are simple enough that performance predictions can be made based on simple computing hardware models. The input data sets are derived from the MNIST handwritten letters. The surrounding I/O and verification provide the context for each sparse DNN inference that allows rigorous definition of both the input and the output. Furthermore, since the proposed sparse DNN challenge is scalable in both problem size and hardware, it can be used to measure and quantitatively compare a wide range of present day and future systems. Ref- erence implementations have been implemented and their serial and parallel performance have been measured. Specifications, data, and software are publicly available at GraphChallenge.org[Student Innovation Award] Update on k-truss Decomposition on GPUMohammad Almasri, Omer Anjum, Carl Pearson, Zaid Qureshi, Vikram S. Mailthody, Rakesh Nagi (UIUC), Jinjun Xiong (IBM TJ Watson), and Wen-mei Hwu (UIUC)In this paper, we present an update to our previous submission on k-truss decomposition from Graph Challenge2018. For single GPU k-truss implementation, we propose multiple algorithmic optimizations that significantly improve performance by up to 35.2x (6.9x on average) compared to our previous GPU implementation. In addition, we present a scalable multi-GPU implementation in which each GPU handles a different K value. Compared to our prior multi-GPU implementation, the proposed approach is faster by up to 151.3x (78.8x on average). In case when the edges with only maximal k-truss are sought, incrementing the ‘k’ value in each iteration is inefficient particularly for graphs with large maximum k-truss. Thus, we propose binary search for the ‘k’ value to find the maximal k-truss. The binary search approach on a single GPU is up to101.5 (24.3x on average) faster than our 2018k-truss submission. Lastly, we show that the proposed binary search finds the maximum k-truss for “Twitter” graph dataset having 2.8 billion bidirectional edges in just 16 minutes on a single V100 GPU.[Champion] Write Quick, Run Fast: Sparse Deep Neural Network in 20 Minutes of Development Time via SuiteSparse:GraphBLASTimothy A. Davis, Mohsen Aznaveh, and Scott Kolodziej (Texas A&M Univ.)SuiteSparse:GraphBLAS is a full implementation of the GraphBLAS standard, which provides a powerful and expressive framework for creating graph algorithms based on the elegant mathematics of sparse matrix operations on a semiring. Algorithms written in GraphBLAS achieve high performance with minimal development time. Using GraphBLAS, it took a mere 20 minutes to write a first-cut computational kernel that solves the Sparse Deep Neural Network Graph Challenge. Understanding the problem description and file format, writing code to read in the files that define the problem, and comparing our results with the reference solution took a full day. The kernel consists of a single for-loop around 4 lines of code, all of which are calls to GraphBLAS, and it worked perfectly the first time it was compiled. The sequential performance of the GraphBLAS solution is 3x to 5x faster than the MATLAB reference implementation. OpenMP parallelism gives an additional 10x to 15x speedup on a 20-core Intel processor, 17x on an IBM Power8 system, and 20x on a Power9 system, for the largest problems. Since SuiteSparse:GraphBLAS does not yet employ MPI, this was added at the application level, a development effort that took one week, primarily because of difficulties in resolving a load-balancing issue in the MPI-based parallel algorithm.