2018 IEEE High Performance
Extreme Computing Conference
(HPEC '18)
Twenty-second Annual HPEC Conference
25 - 27 September 2018
Westin Hotel, Waltham, MA USA
Performance Assessment of Hybrid Parallelism for Large-Scale Reservoir Simulation on Multi- and Many-core
Architectures
Amani Alonazi (KAUST)*; Marcin Rogowski (Saudi Aramco); Ahmed Zawawi (Saudi Aramco); David Keyes (KAUST)
Two trends are reshaping the landscape of petroleum reservoir simulators, one architecturally and one application driven: an
increasing number of cores per node and increasing computational intensity arising from higher fidelity physics at each cell.
Since implicit algebraic solvers are the dominant kernels, we present hybrid MPI and OpenMP implementations of the linear solver
of GigaPOWERS, a full-scale real-world massively parallel simulator for black oil and composition models. We also evaluate
the impact of explicit communication and computation overlap by including the halo exchange in the task-dependency graph.
We analyze the performance of these modifications across multi- and many-core architectures, i.e., Intel Haswell, Skylake,
and Knights Landing, using a variety of synthetic and real-world models. The hybrid approach yields up to a 50% reduction in time to solution on a 16 million-cell SPE10-like model on Skylake, whereas on a smaller, 1 million-cell model on Haswell and Knights Landing, both implementations achieve very similar performance. In real-world reservoir simulations, the hybrid parallelism reduced communication volume and memory consumption, and improved load balancing.
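The communication/computation overlap described above can be sketched in a few lines; the function names and the thread-based stand-in for a non-blocking halo exchange are hypothetical, meant only to illustrate the pattern of posting the exchange, updating interior cells while it is in flight, and finishing the boundary cells once the ghost values arrive (a real code would use MPI_Isend/Irecv or OpenMP tasks):

```python
from concurrent.futures import ThreadPoolExecutor

def exchange_halo(neighbors):
    # Stand-in for a non-blocking halo exchange: fetch the ghost cells
    # from the (here, in-process) neighbor arrays.
    left_nb, right_nb = neighbors
    left = left_nb[-1] if left_nb is not None else 0.0
    right = right_nb[0] if right_nb is not None else 0.0
    return left, right

def sweep(local, neighbors):
    # One Jacobi-style sweep over a 1-D subdomain, overlapping the
    # halo exchange with the interior update.
    new = local[:]
    with ThreadPoolExecutor(max_workers=1) as pool:
        halo = pool.submit(exchange_halo, neighbors)   # post the exchange
        for i in range(1, len(local) - 1):             # interior update
            new[i] = 0.5 * (local[i - 1] + local[i + 1])
        left, right = halo.result()                    # wait for ghosts
    new[0] = 0.5 * (left + local[1])                   # boundary update
    new[-1] = 0.5 * (local[-2] + right)
    return new
```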
Benchmarking Heterogeneous HPC Systems Including Reconfigurable Fabrics: Community Aspirations for Ideal
Comparisons
Peter Jamieson (Miami University)*; Ahmed Sanaullah (Boston University); Martin Herbordt (Boston University)
We describe a progressive philosophy to help benchmark systems and designs in the High Performance Computing (HPC)
domain. These systems now include heterogeneous multi-node systems where nodes can include combinations of CPUs,
GPUs, FPGAs, and other emerging computing devices. Because of the heterogeneous nature of these systems,
benchmarking and comparison of systems is an increasingly complex process. Currently, there is no benchmark or other process that yields good comparisons. We introduce a set of tenets that will allow us to compare one system to another and
to include tool innovations. These tenets rely on our research community providing what we call a system context and an
instance of a specific design (benchmark). A better benchmarking process will make our future research stronger scientifically,
and will allow us to improve future systems, their accompanying compilation/synthesis tools, and the designs that execute on
these systems. To achieve this, we survey existing benchmarks and their accompanying philosophies, and we describe our
conceived ideal benchmarking scenario with a set of tenets, some definitions, a methodology, and a discussion that we hope
will guide our research community forward.
Improving Performance and Scalability of Algebraic Multigrid through a Specialized MATVEC
Seyed Majid Rasouli Pichahi (University of Utah)*; Vidhi Zala (University of Utah); Robert Kirby (University of Utah); Hari
Sundar (University of Utah)
Algebraic Multigrid (AMG) is an extremely popular linear system solver and/or preconditioner approach for matrices obtained
from the discretization of elliptic operators. However, its performance and scalability for large systems obtained from unstructured discretizations seem less consistent than for geometric multigrid (GMG). To a large extent, this is due to loss of
sparsity at the coarser grids and the resulting increased cost and poor scalability of the matrix-vector multiplication. While
there have been attempts to address this concern by designing sparsification algorithms, these affect the overall convergence.
In this work, we focus on designing a specialized matrix-vector multiplication (MATVEC) that achieves high performance and
scalability for a large variation in the levels of sparsity. We evaluate distributed and shared memory implementations of our
MATVEC operator and demonstrate the improvements to its scalability and performance within the AMG hierarchy; finally, we compare it with PETSc.
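The paper's specialized MATVEC is not reproduced here, but the baseline it improves upon is the standard CSR sparse matrix-vector product, whose per-row cost grows with the fill-in that appears at coarser AMG levels. A minimal sketch (row-wise CSR in plain Python, standing in for a tuned distributed kernel):

```python
def csr_matvec(indptr, indices, data, x):
    """Compute y = A @ x for A in CSR format.

    indptr[row]..indptr[row+1] delimits the nonzeros of each row;
    in a distributed setting, rows would be partitioned across
    ranks/threads and off-process entries of x exchanged first.
    """
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(indptr) - 1):
        s = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            s += data[k] * x[indices[k]]
        y[row] = s
    return y
```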
Simulation Approach to Sensor Placement Using Unity3D
Kimberlee C Chang (MIT Lincoln Laboratory)*; Nicole Lane (MIT Lincoln Laboratory); Andrew Uhmeyer (MIT Lincoln
Laboratory); Michael S Jones (MIT Lincoln Laboratory); Matthew Hubbell (MIT Lincoln Laboratory); Albert Reuther (MIT
Lincoln Laboratory); Robert Seater (MIT Lincoln Laboratory)
3D game simulation engines have demonstrated utility in the areas of training, scientific analysis, and knowledge elicitation.
This paper will make the case for the use of 3D game simulation engines in the field of sensor placement optimization. Our
study used a series of parallel simulations in the Unity3D simulation framework to answer two questions: how many sensors of various modalities are required, and where should they be placed, to meet a desired threat detection threshold? The result is a framework that not only answers this sensor placement question, but can also be easily extended to different optimization criteria, as well as to evaluating how a particular configuration responds to differing crowd flows or informed/non-informed adversaries.
Additionally, we demonstrate the scalability of this framework by running parallel instances on a supercomputing grid and
illustrate the processing speed gained.
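The abstract does not give the optimization loop itself; one common baseline for this kind of coverage question is a greedy heuristic that repeatedly adds the sensor location detecting the most still-uncovered threat points until the threshold is met. A hypothetical sketch (the names and data layout are assumptions, not the paper's framework):

```python
def greedy_placement(candidates, coverage, threshold):
    """Pick sensor locations greedily until at least `threshold`
    (a fraction in [0, 1]) of all threat points are detected.
    `coverage[c]` is the set of threat points location c detects."""
    all_points = set().union(*coverage.values())
    covered, chosen = set(), []
    while len(covered) < threshold * len(all_points) and len(chosen) < len(candidates):
        # Choose the location adding the most uncovered points.
        best = max((c for c in candidates if c not in chosen),
                   key=lambda c: len(coverage[c] - covered), default=None)
        if best is None or not (coverage[best] - covered):
            break  # no candidate improves coverage further
        chosen.append(best)
        covered |= coverage[best]
    return chosen
```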
A Fast and Efficient Parallel Algorithm for Pruned Landmark Labeling
Qing Dong (University of Southern California)*; Kartik Lakhotia (USC); Hanqing Zeng (USC); Rajgopal Kannan (USC); Viktor
K Prasanna (University of Southern California); Guna Seetharaman (US Naval Research Laboratory)
Hub labeling based shortest distance querying plays a key role in many important networked graph applications, such as route
planning, socially-sensitive search and web page ranking. Over the last few years, Pruned Landmark Labeling (PLL) has
emerged as the state-of-the-art technique for hub labeling. PLL drastically reduces the complexity of label construction by
pruning Shortest-Path Trees (SPTs). However, PLL is inherently sequential, as different SPTs must be constructed in a
specific order of source vertices to ensure small label size. Particularly, for large graphs, it takes significant processing time to
construct even pruned SPTs from all vertices in the graph. While there are many works on parallelizing single source shortest
path, these solutions cannot be directly used for PLL, as pruning and label querying introduce significant additional complexity
while restricting parallelism within an SPT. In this paper, we propose a novel, fast and efficient algorithm to significantly
accelerate PLL on large graphs based on a two-level parallelization of SPTs: intra- and inter-tree. For intra-tree, we generate
pruned SPTs based on a modification of the Bellman-Ford (BF) algorithm. We further optimize BF to reduce SPT label
querying and initialization costs. We implement our algorithm using the recently proposed Graph Processing Over Partitions
(GPOP) which dramatically improves cache-efficiency and DRAM communication-bandwidth. When pruned SPTs become
very small and parallelizing individual SPTs is not advantageous, we switch to inter-tree parallelization and construct multiple
trees concurrently in a batch. Experiments conducted on a 36 core (2-way hyperthreaded) Intel Broadwell server show that on
some datasets, our proposed parallel algorithm can achieve a greater than 35.1× speedup over the state-of-the-art sequential algorithm.
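For context, the hub-labeling query that PLL's labels support is simple: the distance between u and v is the minimum of d(u,h) + d(h,v) over hubs h common to both labels. A minimal sketch with labels stored as hub-to-distance dictionaries (an illustration, not the paper's implementation):

```python
def hub_distance(label_u, label_v):
    """Answer a shortest-distance query from 2-hop hub labels.

    Each label maps hub vertex -> distance from the endpoint to
    that hub; the query scans one label and probes the other."""
    best = float('inf')
    for hub, du in label_u.items():
        dv = label_v.get(hub)
        if dv is not None:
            best = min(best, du + dv)
    return best  # inf if the labels share no hub
```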
Thursday, September 27, 2018
Big Data
3:00-4:40 in Eden Vale C1/C2
Chair: Nikos Pitsianis / ELKE A.P.TH. (Office of Research Support of Aristotle University)