2019 IEEE High Performance Extreme Computing Conference (HPEC ’19)
Twenty-third Annual HPEC Conference
24 - 26 September 2019
Westin Hotel, Waltham, MA USA

Wednesday, September 25, 2019
Graphs, Networks & Sparse Data 1
10:20-12:00 in Eden Vale C1
Chair: Michael Wolf / Sandia

[Best Paper Finalist] Efficient implementation of sparse matrix-sparse vector multiplication for large scale graph analytics
Mauricio Serrano (IBM)
We developed a parallel algorithm to improve the cache behavior and overall performance of sparse matrix-sparse vector multiplication (SpMSpV), an operation used increasingly in large-scale graph analytics, particularly for dynamic graphs in social networks and homeland security applications. The proposed algorithm builds upon the two-phase approach of partitioning the multiplication into a scaling phase and an aggregation phase to achieve more cache-friendly access patterns in each phase [6], [3]. To handle dynamic graphs and achieve better load balancing in a parallel implementation, we use a combination of private and shared bins, with synchronized access to the shared bins, to exchange product terms between the two phases. Each thread accumulates product terms in its private bins; when a private bin becomes full, its contents are bulk-transferred to a shared bin, and results are then aggregated from the shared bins. In addition, we employ heuristics to choose the best algorithm for SpMSpV based on the number of nonzeros involved in the operation: when the number of nonzeros is large, it may be better to perform the operation as SpMV (sparse matrix times dense vector) despite the added conversion cost, while for few nonzeros a simplified algorithm is advantageous. We compared our algorithm with existing SpMSpV algorithms, and our evaluation shows that execution time is reduced several-fold on large graphs.
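As background, the two-phase structure the abstract describes can be illustrated with a small sketch. The following serial C++ code is illustrative only; the names, data layout, and bin width are our assumptions, not the paper's. The scaling phase scatters scaled column entries into row-range bins, and the aggregation phase reduces each bin independently; comments mark where the paper's private/shared bin exchange would occur in a threaded version.

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

// Minimal serial sketch of two-phase SpMSpV, y = A*x (hypothetical layout,
// not the paper's code). A is stored in CSC so each nonzero of x selects
// one column of A.
struct CSCMatrix {
    int nrows = 0, ncols = 0;
    std::vector<int>    colptr;  // size ncols+1
    std::vector<int>    rowidx;  // row index of each stored entry
    std::vector<double> val;     // value of each stored entry
};

using SparseVec = std::vector<std::pair<int, double>>;  // (index, value)

SparseVec spmspv(const CSCMatrix& A, const SparseVec& x) {
    const int BIN_WIDTH = 1024;  // rows per bin: a tuning knob, arbitrary here
    const int nbins = (A.nrows + BIN_WIDTH - 1) / BIN_WIDTH;
    std::vector<SparseVec> bins(nbins);

    // Phase 1 (scaling): scatter scaled column entries into row-range bins.
    // In the paper's parallel algorithm each thread fills *private* bins and
    // bulk-transfers a private bin into a synchronized *shared* bin when full.
    for (const auto& [j, xj] : x)
        for (int k = A.colptr[j]; k < A.colptr[j + 1]; ++k)
            bins[A.rowidx[k] / BIN_WIDTH].emplace_back(A.rowidx[k], A.val[k] * xj);

    // Phase 2 (aggregation): bins cover disjoint row ranges, so each bin can
    // be reduced independently (e.g., one bin per thread).
    SparseVec y;
    for (const auto& bin : bins) {
        std::unordered_map<int, double> acc;
        for (const auto& [i, v] : bin) acc[i] += v;
        for (const auto& [i, v] : acc) y.emplace_back(i, v);
    }
    return y;
}
```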
Combinatorial Multigrid: Advanced Preconditioners For Ill-Conditioned Linear Systems
M. Harper Langston, Pierre-David Letourneau, Richard Lethin, James Ezick, Mitchell Harris (Reservoir Labs)
The Combinatorial Multigrid (CMG) technique is a practical and adaptable solver and combinatorial preconditioner for solving certain classes of large, sparse systems of linear equations. CMG is similar to Algebraic Multigrid (AMG) but replaces large groupings of fine-level variables with a single coarse-level one, resulting in simple and fast interpolation schemes. These schemes further provide control over the refinement strategies at different levels of the solver hierarchy, depending on the condition number of the system being solved. While many pre-existing solvers can solve large, sparse systems with relatively low complexity, inversion may require $O(n^2)$ space; whereas, if we know that a linear operator has $\tilde{n} = O(n)$ nonzero elements, we desire to use $O(n)$ space in order to reduce communication as much as possible. Being able to invert sparse linear systems of equations asymptotically as fast as the values can be read from memory has been identified by DARPA and the Department of Energy as increasingly necessary for scalable solvers and energy-efficient algorithms in scientific computing. Further, as industry and government agencies move toward exascale, fast solvers and communication avoidance will become even more necessary. In this paper, we present an optimized implementation of Combinatorial Multigrid in C using PETSc and analyze the solution of various systems using the CMG approach as a preconditioner on much larger problems than have been presented thus far. We compare the number of iterations, setup times, and solution times against other popular preconditioners for such systems, including Incomplete Cholesky and a multigrid approach in PETSc, on common problems, further exhibiting CMG's superior performance.
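PETSc attaches user-defined preconditioners through its PCSHELL mechanism, so a CMG cycle can be plugged into a Krylov solve as an apply callback. The sketch below is a minimal illustration, not the paper's code: the shell apply performs a simple Jacobi (diagonal) sweep as a runnable stand-in for where the CMG hierarchy would be applied.

```cpp
#include <petscksp.h>

/* PCSHELL apply callback: z = M^{-1} r. A real implementation would apply
 * one CMG cycle here; as a runnable stand-in we apply a Jacobi (diagonal)
 * sweep instead. (Allocating the diagonal on every apply is wasteful and
 * only acceptable in a sketch.) */
static PetscErrorCode ShellApply(PC pc, Vec r, Vec z) {
    Mat A;
    Vec diag;
    PetscCall(PCGetOperators(pc, NULL, &A));
    PetscCall(MatCreateVecs(A, &diag, NULL));
    PetscCall(MatGetDiagonal(A, diag));
    PetscCall(VecPointwiseDivide(z, r, diag));
    PetscCall(VecDestroy(&diag));
    return PETSC_SUCCESS;
}

PetscErrorCode SolveWithShellPC(Mat A, Vec b, Vec x) {
    KSP ksp;
    PC  pc;
    PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
    PetscCall(KSPSetOperators(ksp, A, A));
    PetscCall(KSPSetType(ksp, KSPCG));    /* CG: CMG targets symmetric systems */
    PetscCall(KSPGetPC(ksp, &pc));
    PetscCall(PCSetType(pc, PCSHELL));    /* user-defined preconditioner */
    PetscCall(PCShellSetApply(pc, ShellApply));
    PetscCall(KSPSolve(ksp, b, x));
    PetscCall(KSPDestroy(&ksp));
    return PETSC_SUCCESS;
}
```

The Incomplete Cholesky and multigrid baselines in the comparison correspond to PETSc's built-in PCICC and PCMG/PCGAMG preconditioner types, selectable at run time via -pc_type.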
Concurrent Katz Centrality for Streaming Graphs
Chunxing Yin, Jason Riedy (Georgia Institute of Technology)
Most current frameworks for streaming graph analysis “stop the world” and halt ingesting data while updating analysis results. Others add overhead for different forms of version control. In both methods, adding analysis kernels adds overhead to the entire system. A new formal model of concurrent analysis lets some algorithms, those valid for the model, update results concurrently with data ingest and without synchronization. Additional kernels incur very little overhead. Here we present the first experimental results for the new model, considering the performance and result latency of updating Katz centrality on a low-power edge platform. The Katz centrality values remain close to those of the synchronous algorithm while latency is reduced by 12.8× to 179×.
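For reference, one common convention defines Katz centrality as $c = \sum_{k \ge 1} \alpha^k A^k \mathbf{1}$ for a damping parameter $\alpha$ below the reciprocal of the largest eigenvalue of the adjacency matrix $A$, and the synchronous baseline can be computed by iterating $c \leftarrow \alpha A c + \mathbf{1}$. The C++ sketch below shows that baseline iteration; it is our illustration of the quantity being maintained, not the paper's concurrent variant.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Synchronous baseline Katz iteration, c <- alpha*A*c + 1, iterated to a
// fixed tolerance. Edge-direction and normalization conventions vary; the
// paper's contribution is updating c concurrently with edge insertions
// rather than recomputing it like this.
struct Edge { int src, dst; };

std::vector<double> katz(int n, const std::vector<Edge>& edges,
                         double alpha, double tol = 1e-8) {
    std::vector<double> c(n, 1.0), next(n);
    for (int iter = 0; iter < 1000; ++iter) {
        std::fill(next.begin(), next.end(), 1.0);
        for (const Edge& e : edges)              // next = 1 + alpha * A * c
            next[e.src] += alpha * c[e.dst];
        double diff = 0.0;
        for (int i = 0; i < n; ++i) diff += std::fabs(next[i] - c[i]);
        c.swap(next);
        if (diff < tol) break;                   // converged
    }
    return c;                                    // needs alpha < 1/lambda_max(A)
}
```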
Graph Algorithms in PGAS: Chapel and UPC++
Jesun Sahariar Firoz, Louis Jenkins, Marcin Zalewski, Cliff Joslyn, Mark Raugas (PNNL)
The Partitioned Global Address Space (PGAS) programming model can be implemented either with programming-language features or with runtime-library APIs, each implementation favoring different aspects (e.g., productivity, abstraction, flexibility, or performance). Certain language and runtime features, such as collectives, explicit and asynchronous communication primitives, and constructs facilitating overlap of communication and computation (such as futures and conjoined futures), can enable better performance and scaling for irregular applications, in particular for distributed graph analytics. We compare graph algorithms in one environment of each kind: the Chapel PGAS programming language and the UPC++ PGAS runtime library. We implement breadth-first search and triangle counting kernels in both environments, discuss the code in each, and compile performance data on a Cray Aries and an InfiniBand platform. Our results show that the library-based approach of UPC++ currently provides strong performance, while Chapel provides a high-level abstraction that, though harder to optimize, still delivers comparable performance.
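To give a flavor of the “futures and conjoined futures” idiom the abstract highlights, here is a tiny UPC++ example (ours, not one of the paper's kernels): asynchronous RPCs return futures, upcxx::when_all conjoins them, and local computation can overlap with the outstanding communication until wait() is called.

```cpp
#include <upcxx/upcxx.hpp>
#include <cstdio>

// Toy illustration of communication/computation overlap in UPC++.
// Each rank sends one RPC to each neighbor; the two futures are conjoined
// so neither blocks until both results are actually needed.
int main() {
    upcxx::init();
    int me = upcxx::rank_me(), n = upcxx::rank_n();

    // Issue two asynchronous RPCs; each immediately returns a future.
    auto f1 = upcxx::rpc((me + 1) % n,     [](int r) { return r * 2; }, me);
    auto f2 = upcxx::rpc((me + n - 1) % n, [](int r) { return r + 1; }, me);

    // Conjoin the futures; local work could proceed here while they run.
    auto both = upcxx::when_all(f1, f2);
    auto [a, b] = both.wait();   // block only when the results are required

    std::printf("rank %d got %d and %d\n", me, a, b);
    upcxx::barrier();
    upcxx::finalize();
    return 0;
}
```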