2018 IEEE High Performance
Extreme Computing Conference
(HPEC ‘18)
Twenty-second Annual HPEC Conference
25 - 27 September 2018
Westin Hotel, Waltham, MA USA
GraphChallenge.org: Raising the Bar on Graph Analytic Performance
Jeremy Kepner (MIT Lincoln Laboratory)*
The rise of graph analytic systems has created a need for new ways to measure and compare the capabilities of graph
processing systems. The MIT/Amazon/IEEE Graph Challenge has been developed to provide a well-defined community venue
for stimulating research and highlighting innovations in graph analysis software, hardware, algorithms, and systems.
GraphChallenge.org provides a wide range of pre-parsed graph data sets, graph generators, mathematically defined graph
algorithms, example serial implementations in a variety of languages, and specific metrics for measuring performance. Graph
Challenge 2017 received 22 submissions by 111 authors from 36 organizations. The submissions highlighted graph analytic
innovations in hardware, software, algorithms, systems, and visualization. These submissions produced many comparable
performance measurements that can be used for assessing the current state of the art of the field. There were numerous
submissions that implemented the triangle counting challenge and resulted in over 350 distinct measurements. Analysis of these
submissions show that their execution time is a strong function of the number of edges in the graph, $N_e$, and is typically
proportional to $N_e^{4/3}$ for large values of $N_e$. Combining the model fits of the submissions presents a picture of the
current state of the art of graph analysis, which is typically $10^8$ edges processed per second for graphs with $10^8$ edges.
These results are $30$ times faster than serial implementations commonly used by many graph analysts and underscore the
importance of making these performance benefits available to the broader community. Graph Challenge provides a clear picture
of current graph analysis systems and underscores the need for new innovations to achieve high performance on very large
graphs.
Fast Triangle Counting Using Cilk
Abdurrahman Yaşar (Georgia Institute of Technology)*; Siva Rajamanickam (Sandia National Laboratories); Michael M Wolf
(Sandia National Laboratories); Jonathan Berry (Sandia National Laboratory); Ümit V. Çatalyürek (Georgia Institute of
Technology)
Triangle counting is a representative graph analysis algorithm with several applications. It is also one of the three benchmarks
used in the IEEE HPEC Graph Challenge. Triangle counting can be expressed as a graph algorithm and in a linear algebra
formulation using sparse matrix-matrix multiplication (SpGEMM). The linear algebra formulation using the SpGEMM algorithm in
the Kokkoskernels library was one of the fastest implementations for the triangle counting problem in last year’s Graph
Challenge. This paper improves upon that work by developing an SpGEMM implementation that relies on a highly efficient, work-
stealing, multithreaded runtime. We demonstrate that this new implementation results in improving the triangle counting runtime
up to 5 times to 12 times on different architectures. This new implementation also breaks the 10^9 barrier for the rate measure on
a single node for the triangle counting problem. We also compare the linear algebra formulation with a traditional graph based
formulation. The linear algebra implementation is up to 2.96 times to 7 times faster on different architectures compared to the
graph based implementation. Furthermore, we present analysis of the scaling of the triangle counting implementation as the
graph sizes increase using both synthetic and real graphs from the graph challenge data set.
High-Performance Triangle Counting on GPUs
Yang Hu (The George Washington University); Hang Liu (UMass Lowell); H. Howie Huang (The George Washington University)*
Counting triangles in a network is a primary step toward making sense of social networks. For instance, a graph with a large
number of triangles is regarded as a “tightly knit community” with high degree of trust, because in this case all the friends
(neighbors) of one vertex are also friends (connected) to each other. This work focuses on using Graphics Processing Units
(GPUs) to accelerate triangle counting. To accommodate large graphs, the stat-of-the-art GPU-based triangle counting project –
TriCore – simply stores the entire graph in the secondary storage to achieve communication free multi-GPU triangle counting.
Our key observation is that each modern GPU server (Table I) often installs multiple GPUs which can easily overwhelm the disk
bandwidth. Therefore, this paper introduces a new design for workload balancing to partition the graph and the workload in order
to buffer each partitioned data in the CPU memory for faster data provisioning. Taken together, this work is the first, to the best of
our knowledge, to advance the rate of triangle counting beyond 10^9 traversed edges per second (TEPS), as well as the first
project that achieves > 10^8 TEPS for graphs with more than ten billion edges.
Update on Static Graph Challenge on GPU
Massimiliano Fatica (Nvidia); Mauro Bisson (NVIDIA)*
This paper presents an updated CUDA implementation of the triangle counting and k-truss decomposition, the two analytics of
the Subgraph Isomorphism Graph Challenge. Algorithmic improvements and a new GPU architecture (Volta) resulted in an
order of magnitude speedup.
K-Truss Decomposition for Scale-Free Graphs at Scale in Distributed Memory
Roger Pearce (Lawrence Livermore National Laboratory)*; Geoffrey Sanders (LLNL)
We describe and update to our prior 2017 Graph Challenge submission on large scale triangle counting in distributed memory, by
extending it to compute the full k-truss decomposition of large scale-free graphs. We build on heuristics to minimize ‘wedge
checks’, by operating on an ordered directed graph, and describe an algorithm to ‘unroll’ triangle counts when they are scheduled
for pruning by the k-truss decomposition. Our k-truss algorithm is implemented using HavoqGT, an asynchronous vertex-centric
graph analytics framework for distributed memory. We present a brief experimental evaluation on two large real scale-free
graphs: a 128B edge web-graph and a 1.4B edge twitter follower graph.
Fast and Adaptive List Intersections on the GPU
James Fox (Georgia Institute of Technology)*; Oded Green (NVIDIA); Kasimir Gabert (Georgia Institute of Technology); Xiaojing
An (Georgia Institute of Technology); David Bader (Georgia Institute of Technology)
List intersections are ubiquitous and can be found in a wide range of applications, including triangle counting and finding the
maximal k-truss, both of which are part of the HPEC Static Graph Challenge. For many graph based problems it is necessary to
find intersections for a very large number of lists---these lists tend to vary greatly in size and are difficult to efficiently load-
balance. Numerous parallel algorithms on list intersections for triangle counting have been proposed, but load-balancing
decisions are typically made at a global level. In this paper we present an efficient and adaptive approach to load-balancing at a
finer granularity. Our approach assigns a different number of threads for different intersections in order to effectively utilize the
resources of the GPU. We show the applicability of our load-balancing method to two different intersection methods, one search-
based and one merge-based. Our algorithm outperforms several recent triangle counting algorithms, including recent HPEC
Graph Challenge Champions.
Collaborative (CPU + GPU) Algorithms for Triangle Counting and Truss Decomposition
Vikram Sharma Mailthody (University of Illinois at Urbana-Champaign)*; Ketan Date (University of Illinois at Urbana-Champaign);
Zaid Qureshi (University of Illinois at Urbana-Champaign ); Carl Pearson (University of Illinois at Urbana-Champaign); Rakesh
Nagi (University of Illinois at Urbana-Champaign); Jinjun Xiong (IBM Thomas J. Watson Research Center); Wen-Mei Hwu
(University of Illinois at Urbana-Champaign)
In this paper, we present an update to our previous submission from Graph Challenge 2017. This work describes and evaluates
new software algorithm optimizations undertaken for our 2018 year submission on Collaborative CPU+GPU Algorithms for
Triangle Counting and Truss Decomposition. First, we describe four major optimizations for the triangle counting which improved
performance by up to 117x over our prior submission. Additionally, we show that our triangle-counting algorithm is on average
151.7x faster than NVIDIA's NVGraph library (max 476x) for SNAP datasets. Second, we propose a novel parallel k-truss
decomposition algorithm that is time-efficient and is up to 13.9x faster than our previous submission. Third, we evaluate the effect
of generational hardware improvements between the IBM "Minsky'' (POWER8, P100, NVLink 1.0) and "Newell'' (POWER9,
V100, NVLink 2.0) platforms. Lastly, the software optimizations presented in this work and the hardware improvements in the
Newell platform enable analytics and discovery on large graphs with millions of nodes and billions of edges in less than a minute.
In sum, the new algorithmic implementations are significantly faster and can handle much larger "big'' graphs.
Preliminary Exploration on Large-Scale Triangle Counting in Shared-Memory Multicore System
Jiyuan Zhang (CMU)*; Franz Franchetti (CMU)
As triangle counting is becoming a widely-used building block for a large amount of graph analytics applications, there is a
growing need to make it run fast and scalable to large scales. In this work we conduct a preliminary exploration on the
optimizations of triangle counting algorithms on shared-memory system with large dataset.
Linear Algebraic Formulation of Edge-centric K-truss Algorithms with Adjacency Matrices
Tze Meng Low (Carnegie Mellon University)*; Daniele Spampinato (Carnegie Mellon University); Anurag Kutuluru (Carnegie
Mellon University); Upasana Sridhar (Carnegie Mellon University); Doru Thom Popovici (Carnegie Mellon University); Franz
Franchetti (Carnegie Mellon University); Scott McMillan (CMU Software Engineering Institute)
Edge-centric algorithms using the linear algebraic approach typically require the use of both the incidence and adjacency
matrices. Due to the two different data formats, the information contained in the graph is replicated, thereby incurring both time
and space for the replication. Using K-truss as an example, we demonstrate that an edge-centric K-truss algorithm, the Eager K-
truss algorithm, can be identified from a linear algebraic formulation using only the adjacency matrix. In addition, we demonstrate
that our implementation of the linear algebraic edge-centric K-truss algorithm out-performs a Galois K-truss implementation by an
average (over 53 graphs) of more than 13 times, and up to 71 times.
Discovering k-Trusses in Large-Scale Networks
Alessio Conte (National Institute of Informatics, Japan)*; Daniele De Sensi (Universita' di Pisa); Roberto Grossi (Università di
Pisa); Andrea Marino (University of Pisa); Luca Versari (Università di Pisa)
A k-truss is a subgraph where every edge belongs to at least k-2 triangles in the subgraph. The truss decomposition assigns
each edge the maximum k for which the edge belongs to a k-truss, and the trussness of a graph is the maximum among its
edges. Discovery algorithms for k-trusses and truss decomposition provide useful insight for graph analytics (such as community
detection). Even though they take polynomial time, on massive networks they suffer from handling a potentially cubic number of
wedges: algorithms either need a long time to recompute triangles several times, have high memory usage, or rely on the large
number of cores on graphic units. In this paper we describe EXTRUSS, a highly optimized algorithm for truss decomposition
which outperforms existing algorithms. We then introduce a faster algorithm, HYBTRUSS, which finds the trussness of a graph
using less time and space than EXTRUSS. Our algorithms take the best of existing approaches having good performance, low
memory usage, and no need for sophisticated hardware systems. We compare our algorithms with the state-of-the-art on a set
of real-world and synthetic networks. EXTRUSS processes graphs with over a billion edges, which seems difficult for the
competitors, and our HYBTRUSS is the first algorithm able to find the trussness of a graph with over 25 billion edges.
Wednesday September 26, 2018
IEEE/DARPA/Amazon Graph Challenge Awards
3:00-5:00 in Eden Vale B
Chair: Jeremy Kepner / MIT