2018 IEEE High Performane Extreme Computing (HPEC 2016)

Designed and maintained by Ballos Associates Web Design

2018 IEEE High Performance Extreme Computing Conference (HPEC ‘18) Twenty-second Annual HPEC Conference 25 - 27 September 2018 Westin Hotel, Waltham, MA USA

GraphChallenge.org: Raising the Bar on Graph Analytic Performance Jeremy Kepner (MIT Lincoln Laboratory)* The rise of graph analytic systems has created a need for new ways to measure and compare the capabilities of graph processing systems. The MIT/Amazon/IEEE Graph Challenge has been developed to provide a well-defined community venue for stimulating research and highlighting innovations in graph analysis software, hardware, algorithms, and systems. GraphChallenge.org provides a wide range of pre-parsed graph data sets, graph generators, mathematically defined graph algorithms, example serial implementations in a variety of languages, and specific metrics for measuring performance. Graph Challenge 2017 received 22 submissions by 111 authors from 36 organizations. The submissions highlighted graph analytic innovations in hardware, software, algorithms, systems, and visualization. These submissions produced many comparable performance measurements that can be used for assessing the current state of the art of the field. There were numerous submissions that implemented the triangle counting challenge and resulted in over 350 distinct measurements. Analysis of these submissions show that their execution time is a strong function of the number of edges in the graph, $N_e$, and is typically proportional to $N_e^{4/3}$ for large values of $N_e$. Combining the model fits of the submissions presents a picture of the current state of the art of graph analysis, which is typically $10^8$ edges processed per second for graphs with $10^8$ edges. These results are $30$ times faster than serial implementations commonly used by many graph analysts and underscore the importance of making these performance benefits available to the broader community. Graph Challenge provides a clear picture of current graph analysis systems and underscores the need for new innovations to achieve high performance on very large graphs. Fast Triangle Counting Using Cilk Abdurrahman Yaşar (Georgia Institute of Technology)*; Siva Rajamanickam (Sandia National Laboratories); Michael M Wolf (Sandia National Laboratories); Jonathan Berry (Sandia National Laboratory); Ümit V. Çatalyürek (Georgia Institute of Technology) Triangle counting is a representative graph analysis algorithm with several applications. It is also one of the three benchmarks used in the IEEE HPEC Graph Challenge. Triangle counting can be expressed as a graph algorithm and in a linear algebra formulation using sparse matrix-matrix multiplication (SpGEMM). The linear algebra formulation using the SpGEMM algorithm in the Kokkoskernels library was one of the fastest implementations for the triangle counting problem in last year’s Graph Challenge. This paper improves upon that work by developing an SpGEMM implementation that relies on a highly efficient, work- stealing, multithreaded runtime. We demonstrate that this new implementation results in improving the triangle counting runtime up to 5 times to 12 times on different architectures. This new implementation also breaks the 10^9 barrier for the rate measure on a single node for the triangle counting problem. We also compare the linear algebra formulation with a traditional graph based formulation. The linear algebra implementation is up to 2.96 times to 7 times faster on different architectures compared to the graph based implementation. Furthermore, we present analysis of the scaling of the triangle counting implementation as the graph sizes increase using both synthetic and real graphs from the graph challenge data set. High-Performance Triangle Counting on GPUs Yang Hu (The George Washington University); Hang Liu (UMass Lowell); H. Howie Huang (The George Washington University)* Counting triangles in a network is a primary step toward making sense of social networks. For instance, a graph with a large number of triangles is regarded as a “tightly knit community” with high degree of trust, because in this case all the friends (neighbors) of one vertex are also friends (connected) to each other. This work focuses on using Graphics Processing Units (GPUs) to accelerate triangle counting. To accommodate large graphs, the stat-of-the-art GPU-based triangle counting project – TriCore – simply stores the entire graph in the secondary storage to achieve communication free multi-GPU triangle counting. Our key observation is that each modern GPU server (Table I) often installs multiple GPUs which can easily overwhelm the disk bandwidth. Therefore, this paper introduces a new design for workload balancing to partition the graph and the workload in order to buffer each partitioned data in the CPU memory for faster data provisioning. Taken together, this work is the first, to the best of our knowledge, to advance the rate of triangle counting beyond 10^9 traversed edges per second (TEPS), as well as the first project that achieves > 10^8 TEPS for graphs with more than ten billion edges. Update on Static Graph Challenge on GPU Massimiliano Fatica (Nvidia); Mauro Bisson (NVIDIA)* This paper presents an updated CUDA implementation of the triangle counting and k-truss decomposition, the two analytics of the Subgraph Isomorphism Graph Challenge. Algorithmic improvements and a new GPU architecture (Volta) resulted in an order of magnitude speedup. K-Truss Decomposition for Scale-Free Graphs at Scale in Distributed Memory Roger Pearce (Lawrence Livermore National Laboratory)*; Geoffrey Sanders (LLNL) We describe and update to our prior 2017 Graph Challenge submission on large scale triangle counting in distributed memory, by extending it to compute the full k-truss decomposition of large scale-free graphs. We build on heuristics to minimize ‘wedge checks’, by operating on an ordered directed graph, and describe an algorithm to ‘unroll’ triangle counts when they are scheduled for pruning by the k-truss decomposition. Our k-truss algorithm is implemented using HavoqGT, an asynchronous vertex-centric graph analytics framework for distributed memory. We present a brief experimental evaluation on two large real scale-free graphs: a 128B edge web-graph and a 1.4B edge twitter follower graph. Fast and Adaptive List Intersections on the GPU James Fox (Georgia Institute of Technology)*; Oded Green (NVIDIA); Kasimir Gabert (Georgia Institute of Technology); Xiaojing An (Georgia Institute of Technology); David Bader (Georgia Institute of Technology) List intersections are ubiquitous and can be found in a wide range of applications, including triangle counting and finding the maximal k-truss, both of which are part of the HPEC Static Graph Challenge. For many graph based problems it is necessary to find intersections for a very large number of lists---these lists tend to vary greatly in size and are difficult to efficiently load- balance. Numerous parallel algorithms on list intersections for triangle counting have been proposed, but load-balancing decisions are typically made at a global level. In this paper we present an efficient and adaptive approach to load-balancing at a finer granularity. Our approach assigns a different number of threads for different intersections in order to effectively utilize the resources of the GPU. We show the applicability of our load-balancing method to two different intersection methods, one search- based and one merge-based. Our algorithm outperforms several recent triangle counting algorithms, including recent HPEC Graph Challenge Champions. Collaborative (CPU + GPU) Algorithms for Triangle Counting and Truss Decomposition Vikram Sharma Mailthody (University of Illinois at Urbana-Champaign)*; Ketan Date (University of Illinois at Urbana-Champaign); Zaid Qureshi (University of Illinois at Urbana-Champaign ); Carl Pearson (University of Illinois at Urbana-Champaign); Rakesh Nagi (University of Illinois at Urbana-Champaign); Jinjun Xiong (IBM Thomas J. Watson Research Center); Wen-Mei Hwu (University of Illinois at Urbana-Champaign) In this paper, we present an update to our previous submission from Graph Challenge 2017. This work describes and evaluates new software algorithm optimizations undertaken for our 2018 year submission on Collaborative CPU+GPU Algorithms for Triangle Counting and Truss Decomposition. First, we describe four major optimizations for the triangle counting which improved performance by up to 117x over our prior submission. Additionally, we show that our triangle-counting algorithm is on average 151.7x faster than NVIDIA's NVGraph library (max 476x) for SNAP datasets. Second, we propose a novel parallel k-truss decomposition algorithm that is time-efficient and is up to 13.9x faster than our previous submission. Third, we evaluate the effect of generational hardware improvements between the IBM "Minsky'' (POWER8, P100, NVLink 1.0) and "Newell'' (POWER9, V100, NVLink 2.0) platforms. Lastly, the software optimizations presented in this work and the hardware improvements in the Newell platform enable analytics and discovery on large graphs with millions of nodes and billions of edges in less than a minute. In sum, the new algorithmic implementations are significantly faster and can handle much larger "big'' graphs. Preliminary Exploration on Large-Scale Triangle Counting in Shared-Memory Multicore System Jiyuan Zhang (CMU)*; Franz Franchetti (CMU) As triangle counting is becoming a widely-used building block for a large amount of graph analytics applications, there is a growing need to make it run fast and scalable to large scales. In this work we conduct a preliminary exploration on the optimizations of triangle counting algorithms on shared-memory system with large dataset. Linear Algebraic Formulation of Edge-centric K-truss Algorithms with Adjacency Matrices Tze Meng Low (Carnegie Mellon University)*; Daniele Spampinato (Carnegie Mellon University); Anurag Kutuluru (Carnegie Mellon University); Upasana Sridhar (Carnegie Mellon University); Doru Thom Popovici (Carnegie Mellon University); Franz Franchetti (Carnegie Mellon University); Scott McMillan (CMU Software Engineering Institute) Edge-centric algorithms using the linear algebraic approach typically require the use of both the incidence and adjacency matrices. Due to the two different data formats, the information contained in the graph is replicated, thereby incurring both time and space for the replication. Using K-truss as an example, we demonstrate that an edge-centric K-truss algorithm, the Eager K- truss algorithm, can be identified from a linear algebraic formulation using only the adjacency matrix. In addition, we demonstrate that our implementation of the linear algebraic edge-centric K-truss algorithm out-performs a Galois K-truss implementation by an average (over 53 graphs) of more than 13 times, and up to 71 times. Discovering k-Trusses in Large-Scale Networks Alessio Conte (National Institute of Informatics, Japan)*; Daniele De Sensi (Universita' di Pisa); Roberto Grossi (Università di Pisa); Andrea Marino (University of Pisa); Luca Versari (Università di Pisa) A k-truss is a subgraph where every edge belongs to at least k-2 triangles in the subgraph. The truss decomposition assigns each edge the maximum k for which the edge belongs to a k-truss, and the trussness of a graph is the maximum among its edges. Discovery algorithms for k-trusses and truss decomposition provide useful insight for graph analytics (such as community detection). Even though they take polynomial time, on massive networks they suffer from handling a potentially cubic number of wedges: algorithms either need a long time to recompute triangles several times, have high memory usage, or rely on the large number of cores on graphic units. In this paper we describe EXTRUSS, a highly optimized algorithm for truss decomposition which outperforms existing algorithms. We then introduce a faster algorithm, HYBTRUSS, which finds the trussness of a graph using less time and space than EXTRUSS. Our algorithms take the best of existing approaches having good performance, low memory usage, and no need for sophisticated hardware systems. We compare our algorithms with the state-of-the-art on a set of real-world and synthetic networks. EXTRUSS processes graphs with over a billion edges, which seems difficult for the competitors, and our HYBTRUSS is the first algorithm able to find the trussness of a graph with over 25 billion edges.

Wednesday September 26, 2018

IEEE/DARPA/Amazon Graph Challenge Awards 3:00-5:00 in Eden Vale B Chair: Jeremy Kepner / MIT

Designed and maintained by Ballos Associates Web Design

IEEE Nondiscrimination Policy

GraphChallenge.org: Raising the Bar on Graph Analytic Performance Jeremy Kepner (MIT Lincoln Laboratory)* The rise of graph analytic systems has created a need for new ways to measure and compare the capabilities of graph processing systems. The MIT/Amazon/IEEE Graph Challenge has been developed to provide a well-defined community venue for stimulating research and highlighting innovations in graph analysis software, hardware, algorithms, and systems. GraphChallenge.org provides a wide range of pre-parsed graph data sets, graph generators, mathematically defined graph algorithms, example serial implementations in a variety of languages, and specific metrics for measuring performance. Graph Challenge 2017 received 22 submissions by 111 authors from 36 organizations. The submissions highlighted graph analytic innovations in hardware, software, algorithms, systems, and visualization. These submissions produced many comparable performance measurements that can be used for assessing the current state of the art of the field. There were numerous submissions that implemented the triangle counting challenge and resulted in over 350 distinct measurements. Analysis of these submissions show that their execution time is a strong function of the number of edges in the graph, $N_e$, and is typically proportional to $N_e^{4/3}$ for large values of $N_e$. Combining the model fits of the submissions presents a picture of the current state of the art of graph analysis, which is typically $10^8$ edges processed per second for graphs with $10^8$ edges. These results are $30$ times faster than serial implementations commonly used by many graph analysts and underscore the importance of making these performance benefits available to the broader community. Graph Challenge provides a clear picture of current graph analysis systems and underscores the need for new innovations to achieve high performance on very large graphs. Fast Triangle Counting Using Cilk Abdurrahman Yaşar (Georgia Institute of Technology)*; Siva Rajamanickam (Sandia National Laboratories); Michael M Wolf (Sandia National Laboratories); Jonathan Berry (Sandia National Laboratory); Ümit V. Çatalyürek (Georgia Institute of Technology) Triangle counting is a representative graph analysis algorithm with several applications. It is also one of the three benchmarks used in the IEEE HPEC Graph Challenge. Triangle counting can be expressed as a graph algorithm and in a linear algebra formulation using sparse matrix-matrix multiplication (SpGEMM). The linear algebra formulation using the SpGEMM algorithm in the Kokkoskernels library was one of the fastest implementations for the triangle counting problem in last year’s Graph Challenge. This paper improves upon that work by developing an SpGEMM implementation that relies on a highly efficient, work-stealing, multithreaded runtime. We demonstrate that this new implementation results in improving the triangle counting runtime up to 5 times to 12 times on different architectures. This new implementation also breaks the 10^9 barrier for the rate measure on a single node for the triangle counting problem. We also compare the linear algebra formulation with a traditional graph based formulation. The linear algebra implementation is up to 2.96 times to 7 times faster on different architectures compared to the graph based implementation. Furthermore, we present analysis of the scaling of the triangle counting implementation as the graph sizes increase using both synthetic and real graphs from the graph challenge data set. High-Performance Triangle Counting on GPUs Yang Hu (The George Washington University); Hang Liu (UMass Lowell); H. Howie Huang (The George Washington University)* Counting triangles in a network is a primary step toward making sense of social networks. For instance, a graph with a large number of triangles is regarded as a “tightly knit community” with high degree of trust, because in this case all the friends (neighbors) of one vertex are also friends (connected) to each other. This work focuses on using Graphics Processing Units (GPUs) to accelerate triangle counting. To accommodate large graphs, the stat-of-the-art GPU- based triangle counting project – TriCore – simply stores the entire graph in the secondary storage to achieve communication free multi- GPU triangle counting. Our key observation is that each modern GPU server (Table I) often installs multiple GPUs which can easily overwhelm the disk bandwidth. Therefore, this paper introduces a new design for workload balancing to partition the graph and the workload in order to buffer each partitioned data in the CPU memory for faster data provisioning. Taken together, this work is the first, to the best of our knowledge, to advance the rate of triangle counting beyond 10^9 traversed edges per second (TEPS), as well as the first project that achieves > 10^8 TEPS for graphs with more than ten billion edges. Update on Static Graph Challenge on GPU Massimiliano Fatica (Nvidia); Mauro Bisson (NVIDIA)* This paper presents an updated CUDA implementation of the triangle counting and k-truss decomposition, the two analytics of the Subgraph Isomorphism Graph Challenge. Algorithmic improvements and a new GPU architecture (Volta) resulted in an order of magnitude speedup. K-Truss Decomposition for Scale-Free Graphs at Scale in Distributed Memory Roger Pearce (Lawrence Livermore National Laboratory)*; Geoffrey Sanders (LLNL) We describe and update to our prior 2017 Graph Challenge submission on large scale triangle counting in distributed memory, by extending it to compute the full k-truss decomposition of large scale- free graphs. We build on heuristics to minimize ‘wedge checks’, by operating on an ordered directed graph, and describe an algorithm to ‘unroll’ triangle counts when they are scheduled for pruning by the k- truss decomposition. Our k-truss algorithm is implemented using HavoqGT, an asynchronous vertex-centric graph analytics framework for distributed memory. We present a brief experimental evaluation on two large real scale-free graphs: a 128B edge web-graph and a 1.4B edge twitter follower graph. Fast and Adaptive List Intersections on the GPU James Fox (Georgia Institute of Technology)*; Oded Green (NVIDIA); Kasimir Gabert (Georgia Institute of Technology); Xiaojing An (Georgia Institute of Technology); David Bader (Georgia Institute of Technology) List intersections are ubiquitous and can be found in a wide range of applications, including triangle counting and finding the maximal k- truss, both of which are part of the HPEC Static Graph Challenge. For many graph based problems it is necessary to find intersections for a very large number of lists---these lists tend to vary greatly in size and are difficult to efficiently load-balance. Numerous parallel algorithms on list intersections for triangle counting have been proposed, but load-balancing decisions are typically made at a global level. In this paper we present an efficient and adaptive approach to load- balancing at a finer granularity. Our approach assigns a different number of threads for different intersections in order to effectively utilize the resources of the GPU. We show the applicability of our load-balancing method to two different intersection methods, one search-based and one merge-based. Our algorithm outperforms several recent triangle counting algorithms, including recent HPEC Graph Challenge Champions. Collaborative (CPU + GPU) Algorithms for Triangle Counting and Truss Decomposition Vikram Sharma Mailthody (University of Illinois at Urbana- Champaign)*; Ketan Date (University of Illinois at Urbana- Champaign); Zaid Qureshi (University of Illinois at Urbana-Champaign ); Carl Pearson (University of Illinois at Urbana-Champaign); Rakesh Nagi (University of Illinois at Urbana-Champaign); Jinjun Xiong (IBM Thomas J. Watson Research Center); Wen-Mei Hwu (University of Illinois at Urbana-Champaign) In this paper, we present an update to our previous submission from Graph Challenge 2017. This work describes and evaluates new software algorithm optimizations undertaken for our 2018 year submission on Collaborative CPU+GPU Algorithms for Triangle Counting and Truss Decomposition. First, we describe four major optimizations for the triangle counting which improved performance by up to 117x over our prior submission. Additionally, we show that our triangle-counting algorithm is on average 151.7x faster than NVIDIA's NVGraph library (max 476x) for SNAP datasets. Second, we propose a novel parallel k-truss decomposition algorithm that is time-efficient and is up to 13.9x faster than our previous submission. Third, we evaluate the effect of generational hardware improvements between the IBM "Minsky'' (POWER8, P100, NVLink 1.0) and "Newell'' (POWER9, V100, NVLink 2.0) platforms. Lastly, the software optimizations presented in this work and the hardware improvements in the Newell platform enable analytics and discovery on large graphs with millions of nodes and billions of edges in less than a minute. In sum, the new algorithmic implementations are significantly faster and can handle much larger "big'' graphs. Preliminary Exploration on Large-Scale Triangle Counting in Shared-Memory Multicore System Jiyuan Zhang (CMU)*; Franz Franchetti (CMU) As triangle counting is becoming a widely-used building block for a large amount of graph analytics applications, there is a growing need to make it run fast and scalable to large scales. In this work we conduct a preliminary exploration on the optimizations of triangle counting algorithms on shared-memory system with large dataset. Linear Algebraic Formulation of Edge-centric K-truss Algorithms with Adjacency Matrices Tze Meng Low (Carnegie Mellon University)*; Daniele Spampinato (Carnegie Mellon University); Anurag Kutuluru (Carnegie Mellon University); Upasana Sridhar (Carnegie Mellon University); Doru Thom Popovici (Carnegie Mellon University); Franz Franchetti (Carnegie Mellon University); Scott McMillan (CMU Software Engineering Institute) Edge-centric algorithms using the linear algebraic approach typically require the use of both the incidence and adjacency matrices. Due to the two different data formats, the information contained in the graph is replicated, thereby incurring both time and space for the replication. Using K-truss as an example, we demonstrate that an edge-centric K- truss algorithm, the Eager K-truss algorithm, can be identified from a linear algebraic formulation using only the adjacency matrix. In addition, we demonstrate that our implementation of the linear algebraic edge-centric K-truss algorithm out-performs a Galois K-truss implementation by an average (over 53 graphs) of more than 13 times, and up to 71 times. Discovering k-Trusses in Large-Scale Networks Alessio Conte (National Institute of Informatics, Japan)*; Daniele De Sensi (Universita' di Pisa); Roberto Grossi (Università di Pisa); Andrea Marino (University of Pisa); Luca Versari (Università di Pisa) A k-truss is a subgraph where every edge belongs to at least k-2 triangles in the subgraph. The truss decomposition assigns each edge the maximum k for which the edge belongs to a k-truss, and the trussness of a graph is the maximum among its edges. Discovery algorithms for k-trusses and truss decomposition provide useful insight for graph analytics (such as community detection). Even though they take polynomial time, on massive networks they suffer from handling a potentially cubic number of wedges: algorithms either need a long time to recompute triangles several times, have high memory usage, or rely on the large number of cores on graphic units. In this paper we describe EXTRUSS, a highly optimized algorithm for truss decomposition which outperforms existing algorithms. We then introduce a faster algorithm, HYBTRUSS, which finds the trussness of a graph using less time and space than EXTRUSS. Our algorithms take the best of existing approaches having good performance, low memory usage, and no need for sophisticated hardware systems. We compare our algorithms with the state-of-the-art on a set of real-world and synthetic networks. EXTRUSS processes graphs with over a billion edges, which seems difficult for the competitors, and our HYBTRUSS is the first algorithm able to find the trussness of a graph with over 25 billion edges.

Wednesday September 26, 2018

IEEE/DARPA/Amazon Graph Challenge Awards 3:00-5:00 in Eden Vale B Chair: Jeremy Kepner / MIT

HPEC 2018 25 - 27 September 2018 Westin Hotel, Waltham, MA USA