2018 IEEE High Performance Extreme Computing Conference (HPEC ‘18) Twenty-second Annual HPEC Conference 25 - 27 September 2018 Westin Hotel, Waltham, MA USA
Performance Assessment of Hybrid Parallelism for Large-Scale Reservoir Simulation on Multi- and Many-core Architectures Amani Alonazi (KAUST)*; Marcin Rogowski (Saudi Aramco); Ahmed Zawawi (Saudi Aramco); David Keyes (KAUST) Two trends are reshaping the landscape of petroleum reservoir simulators, one architecturally and one application driven: an increasing number of cores per node and increasing computational intensity arising from higher fidelity physics at each cell. Implicit algebraic solvers being the dominant kernels, we present hybrid MPI and OpenMP implementations of the linear solver of GigaPOWERS, a full-scale real-world massively parallel simulator for black oil and composition models. We also evaluate the impact of explicit communication and computation overlap by including the halo exchange in the task-dependency graph. We analyze the performance of these modifications across multi- and many-core architectures, i.e., Intel Haswell, Skylake, and Knights Landing, using a variety of synthetic and real-world models. The hybrid approach results in up to 50\% reduction of time to solution on a 16 million-cell SPE10-like model on Skylake whereas on a smaller, 1 million-cell, model on Haswell and Knights Landing both implementations achieve very similar performance. In the real-world reservoir simulations, the hybrid parallelism has reduced communication volume, memory consumption, and improved load balancing. Benchmarking Heterogeneous HPC Systems Including Reconfigurable Fabrics: Community Aspirations for Ideal Comparisons Peter Jamieson (Miami University)*; Ahmed Sanaullah (Boston University); Martin Herbordt (Boston University) We describe a progressive philosophy to help benchmark systems and designs in the High Performance Computing (HPC) domain.  These systems now include heterogeneous multi-node systems where nodes can include combinations of CPUs, GPUs, FPGAs, and other emerging computing devices.  Because of the heterogeneous nature of these systems, benchmarking and comparison of systems is an ever increasingly complex process.  Currently, there is no benchmark or other process that results good comparisons.  We introduce a set of tenets that will allow us to compare one system to another and to include tool innovations.  These tenets rely on our research community providing what we call a system context and an instance of a specific design (benchmark).  A better benchmarking process will make our future research stronger scientifically, and will allow us to improve future systems, their accompanying compilation/synthesis tools, and the designs that execute on these systems.  To achieve this, we survey existing benchmarks and their accompanying philosophies, and we describe our conceived ideal benchmarking scenario with a set of tenets, some definitions, a methodology, and a discussion that we hope will guide our research community forward. Improving Performance and Scalability of Algebraic Multigrid through a Specialized MATVEC Seyed Majid Rasouli Pichahi (University of Utah)*; Vidhi Zala (University of Utah); Robert Kirby (University of Utah); Hari Sundar (University of Utah) Algebraic Multigrid (AMG) is an extremely popular linear system solver and/or preconditioner approach for matrices obtained from the discretization of elliptic operators. However, its performance and scalability for large systems obtained from un- structured discretizations seem less consistent than for geometric multigrid (GMG). To a large extent, this is due to loss of sparsity at the coarser grids and the resulting increased cost and poor scalability of the matrix-vector multiplication. While there have been attempts to address this concern by designing sparsification algorithms, these affect the overall convergence. In this work, we focus on designing a specialized matrix-vector multiplication (MATVEC) that achieves high performance and scalability for a large variation in the levels of sparsity. We evaluate distributed and shared memory implementations of our MATVEC operator and demonstrate the improvements to its scalability and performance in AMG hierarchy and finally, we compare it with PETSc. Simulation Approach to Sensor Placement Using Unity3D Kimberlee C Chang (MIT Lincoln Laboratory)*; Nicole Lane (MIT Lincoln Laboratory); Andrew Uhmeyer (MIT Lincoln Laboratory); Michael S Jones (MIT Lincoln Laboratory); Matthew Hubbell (MIT Lincoln Laboratory); Albert Reuther (MIT Lincoln Laboratory); Robert Seater (MIT Lincoln Laboratory) 3D game simulation engines have demonstrated utility in the areas of training, scientific analysis, and knowledge solicitation. This paper will make the case for the use of 3D game simulation engines in the field of sensor placement optimization. Our study used a series of parallel simulations in the Unity3D simulation framework to answer the questions: how many sensors of various modalities are required and where they should be placed to meet a desired threat detection threshold? The result is a framework that not only answers this sensor placement question, but can be easily expanded to differing optimization criteria as well as answer how a particular configuration responds to differing crowd flows or informed/non-informed adversaries. Additionally, we demonstrate the scalability of this framework by running parallel instances on a supercomputing grid and illustrate the processing speed gained. A Fast and Efficient Parallel Algorithm for Pruned Landmark Labeling Qing Dong (University of Southern California)*; Kartik Lakhotia (USC); Hanqing Zeng (USC); Rajgopal Kannan (USC); Viktor K Prasanna (Unversity of Southern California); Guna Seetharaman (US Naval Research Laboratory) Hub labeling based shortest distance querying plays a key role in many important networked graph applications, such as route planning, socially-sensitive search and web page ranking. Over the last few years, Pruned Landmark Labeling (PLL) has emerged as the state-of-the-art technique for hub labeling. PLL drastically reduces the complexity of label construction by pruning Shortest-Path Trees (SPTs). However, PLL is inherently sequential, as different SPTs must be constructed in a specific order of source vertices to ensure small label size. Particularly, for large graphs, it takes significant processing time to construct even pruned SPTs from all vertices in the graph. While there are many works on parallelizing single source shortest path, these solutions cannot be directly used for PLL, as pruning and label querying introduce significant additional complexity while restricting parallelism within an SPT. In this paper, we propose a novel, fast and efficient algorithm to significantly accelerate PLL on large graphs based on a two-level parallelization of SPTs: intra- and inter-tree. For intra-tree, we generate pruned SPTs based on a modification of the Bellman-Ford (BF) algorithm. We further optimize BF to reduce SPT label querying and initialization costs. We implement our algorithm using the recently proposed Graph Processing Over Partitions (GPOP) which dramatically improves cache-efficiency and DRAM communication-bandwidth. When pruned SPTs become very small and parallelizing individual SPTs is not advantageous, we switch to inter-tree parallelization and construct multiple trees concurrently in a batch. Experiments conducted on a 36 core (2-way hyperthreaded) Intel Broadwell server show that on some datasets, our proposed parallel algorithm can achieve greater than 35.1× speedup over state-of-the-art sequential algorithm.
Thursday, September 27, 2018
Big Data 3:00-4:40 in Eden Vale C1/C2 Chair: Nikos Pitsianis / ELKE A.P.TH.  (Office of Research Support of Aristotle University)
Performance Assessment of Hybrid Parallelism for Large-Scale Reservoir Simulation on Multi- and Many-core Architectures Amani Alonazi (KAUST)*; Marcin Rogowski (Saudi Aramco); Ahmed Zawawi (Saudi Aramco); David Keyes (KAUST) Two trends are reshaping the landscape of petroleum reservoir simulators, one architecturally and one application driven: an increasing number of cores per node and increasing computational intensity arising from higher fidelity physics at each cell. Implicit algebraic solvers being the dominant kernels, we present hybrid MPI and OpenMP implementations of the linear solver of GigaPOWERS, a full-scale real- world massively parallel simulator for black oil and composition models. We also evaluate the impact of explicit communication and computation overlap by including the halo exchange in the task-dependency graph. We analyze the performance of these modifications across multi- and many-core architectures, i.e., Intel Haswell, Skylake, and Knights Landing, using a variety of synthetic and real-world models. The hybrid approach results in up to 50\% reduction of time to solution on a 16 million-cell SPE10-like model on Skylake whereas on a smaller, 1 million-cell, model on Haswell and Knights Landing both implementations achieve very similar performance. In the real-world reservoir simulations, the hybrid parallelism has reduced communication volume, memory consumption, and improved load balancing. Benchmarking Heterogeneous HPC Systems Including Reconfigurable Fabrics: Community Aspirations for Ideal Comparisons Peter Jamieson (Miami University)*; Ahmed Sanaullah (Boston University); Martin Herbordt (Boston University) We describe a progressive philosophy to help benchmark systems and designs in the High Performance Computing (HPC) domain.  These systems now include heterogeneous multi-node systems where nodes can include combinations of CPUs, GPUs, FPGAs, and other emerging computing devices.  Because of the heterogeneous nature of these systems, benchmarking and comparison of systems is an ever increasingly complex process.  Currently, there is no benchmark or other process that results good comparisons.  We introduce a set of tenets that will allow us to compare one system to another and to include tool innovations.  These tenets rely on our research community providing what we call a system context and an instance of a specific design (benchmark).  A better benchmarking process will make our future research stronger scientifically, and will allow us to improve future systems, their accompanying compilation/synthesis tools, and the designs that execute on these systems.  To achieve this, we survey existing benchmarks and their accompanying philosophies, and we describe our conceived ideal benchmarking scenario with a set of tenets, some definitions, a methodology, and a discussion that we hope will guide our research community forward. Improving Performance and Scalability of Algebraic Multigrid through a Specialized MATVEC Seyed Majid Rasouli Pichahi (University of Utah)*; Vidhi Zala (University of Utah); Robert Kirby (University of Utah); Hari Sundar (University of Utah) Algebraic Multigrid (AMG) is an extremely popular linear system solver and/or preconditioner approach for matrices obtained from the discretization of elliptic operators. However, its performance and scalability for large systems obtained from un-structured discretizations seem less consistent than for geometric multigrid (GMG). To a large extent, this is due to loss of sparsity at the coarser grids and the resulting increased cost and poor scalability of the matrix-vector multiplication. While there have been attempts to address this concern by designing sparsification algorithms, these affect the overall convergence. In this work, we focus on designing a specialized matrix- vector multiplication (MATVEC) that achieves high performance and scalability for a large variation in the levels of sparsity. We evaluate distributed and shared memory implementations of our MATVEC operator and demonstrate the improvements to its scalability and performance in AMG hierarchy and finally, we compare it with PETSc. Simulation Approach to Sensor Placement Using Unity3D Kimberlee C Chang (MIT Lincoln Laboratory)*; Nicole Lane (MIT Lincoln Laboratory); Andrew Uhmeyer (MIT Lincoln Laboratory); Michael S Jones (MIT Lincoln Laboratory); Matthew Hubbell (MIT Lincoln Laboratory); Albert Reuther (MIT Lincoln Laboratory); Robert Seater (MIT Lincoln Laboratory) 3D game simulation engines have demonstrated utility in the areas of training, scientific analysis, and knowledge solicitation. This paper will make the case for the use of 3D game simulation engines in the field of sensor placement optimization. Our study used a series of parallel simulations in the Unity3D simulation framework to answer the questions: how many sensors of various modalities are required and where they should be placed to meet a desired threat detection threshold? The result is a framework that not only answers this sensor placement question, but can be easily expanded to differing optimization criteria as well as answer how a particular configuration responds to differing crowd flows or informed/non-informed adversaries. Additionally, we demonstrate the scalability of this framework by running parallel instances on a supercomputing grid and illustrate the processing speed gained. A Fast and Efficient Parallel Algorithm for Pruned Landmark Labeling Qing Dong (University of Southern California)*; Kartik Lakhotia (USC); Hanqing Zeng (USC); Rajgopal Kannan (USC); Viktor K Prasanna (Unversity of Southern California); Guna Seetharaman (US Naval Research Laboratory) Hub labeling based shortest distance querying plays a key role in many important networked graph applications, such as route planning, socially-sensitive search and web page ranking. Over the last few years, Pruned Landmark Labeling (PLL) has emerged as the state-of-the-art technique for hub labeling. PLL drastically reduces the complexity of label construction by pruning Shortest-Path Trees (SPTs). However, PLL is inherently sequential, as different SPTs must be constructed in a specific order of source vertices to ensure small label size. Particularly, for large graphs, it takes significant processing time to construct even pruned SPTs from all vertices in the graph. While there are many works on parallelizing single source shortest path, these solutions cannot be directly used for PLL, as pruning and label querying introduce significant additional complexity while restricting parallelism within an SPT. In this paper, we propose a novel, fast and efficient algorithm to significantly accelerate PLL on large graphs based on a two-level parallelization of SPTs: intra- and inter-tree. For intra-tree, we generate pruned SPTs based on a modification of the Bellman-Ford (BF) algorithm. We further optimize BF to reduce SPT label querying and initialization costs. We implement our algorithm using the recently proposed Graph Processing Over Partitions (GPOP) which dramatically improves cache-efficiency and DRAM communication-bandwidth. When pruned SPTs become very small and parallelizing individual SPTs is not advantageous, we switch to inter-tree parallelization and construct multiple trees concurrently in a batch. Experiments conducted on a 36 core (2-way hyperthreaded) Intel Broadwell server show that on some datasets, our proposed parallel algorithm can achieve greater than 35.1× speedup over state-of-the-art sequential algorithm.
Thursday, September 27, 2018
Big Data 3:00-4:40 in Eden Vale C1/C2 Chair: Nikos Pitsianis / ELKE A.P.TH.  (Office of Research Support of Aristotle University)
HPEC 2018 25 - 27 September 2018 Westin Hotel, Waltham, MA USA