2018 IEEE High Performance
Extreme Computing Conference
(HPEC '18)
Twenty-second Annual HPEC Conference
25 - 27 September 2018
Westin Hotel, Waltham, MA USA
Utilizing GPU Parallelism to Improve Fast Spherical Harmonic Transforms
Max L Carlson (University of Utah)*; Hari Sundar (University of Utah)
Spherical harmonics form an orthogonal basis for functions that live on the surface of a sphere and are useful for solving partial
differential equations and for numerical integration. The complexity of transforming a set of function samples to their corresponding
spherical harmonic coefficients is largely dominated by the computation of the associated Legendre transform. This associated
Legendre transform requires the computation of (L+1) dense matrix-vector products where L is the order of the spherical harmonic
expansion. Since the number of rows and columns of each of these matrices depends on L, this step is essentially O(L^3). In this
paper, we explore the GPU parallelism available to improve the butterfly compression approach. We present some preliminary
results showing performance increases for large problem sizes and eventually plan to release the MonarchSHT library for GPU
spherical harmonic transforms.
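To illustrate the O(L^3) estimate above, the following Python sketch counts the multiply-adds in the (L+1) dense matrix-vector products. It assumes a grid with L+1 latitude samples, so that the matrix for order m has L-m+1 rows and L+1 columns. The function name and the n_lat parameter are hypothetical and are not taken from the paper or from MonarchSHT.

```python
def legendre_matvec_ops(L, n_lat=None):
    """Count multiply-adds for the L+1 dense matrix-vector products.

    Assumption (not from the abstract): the matrix for order m has one row
    per degree l = m..L and one column per latitude sample.
    """
    if n_lat is None:
        n_lat = L + 1  # assumed number of latitude samples
    total = 0
    for m in range(L + 1):
        rows = L - m + 1          # degrees l = m, m+1, ..., L
        total += rows * n_lat     # one dense matrix-vector product
    return total


if __name__ == "__main__":
    for L in (64, 128, 256):
        print(L, legendre_matvec_ops(L))
```

Under these assumptions the count equals (L+1)^2 (L+2) / 2, so doubling L increases the work by roughly a factor of eight, which is the cubic growth the abstract refers to.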
A Multi-GPU PCISPH Implementation with Efficient Memory Transfers
Kevin Verma (ESS Engineering Software Steyr)*; Chong Peng (ESS Engineering Software Steyr); Kamil Szewc (ESS Engineering
Software Steyr); Robert Wille (Nil)
Smoothed Particle Hydrodynamics (SPH) is a particle-based method for fluid flow modeling. One promising variant of SPH is
Predictive-Corrective Incompressible SPH (PCISPH), which employs a dedicated prediction-correction scheme and thereby
outperforms other SPH variants by almost one order of magnitude. However, like other particle-based methods, it has a very
high computational cost: simulating real-world phenomena requires several million particles. To make
SPH applicable to real-world engineering problems, it is hence common to exploit the massive parallelism of multi-GPU architectures.
However, certain algorithmic characteristics of PCISPH make it non-trivial to parallelize this method efficiently across multiple GPUs.
In this work, we propose, for the first time, a multi-GPU implementation of PCISPH. To this end, we propose a scheme
that overlaps memory transfers between GPUs with computation and thereby avoids the drawbacks caused by
these algorithmic characteristics. Experimental evaluations confirm the efficiency of the proposed methods.
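The benefit of overlapping transfers with computation can be illustrated with a simple timing model. The sketch below is not the authors' scheme. It assumes that each GPU first computes the particles near its subdomain boundary, then sends them to its neighbour while it computes the interior particles. All names and the example timings are hypothetical.

```python
def step_time(interior_compute, boundary_compute, transfer, overlap):
    """Return the time of one simulation step on a single GPU.

    Hypothetical model: the boundary particles are computed first; their
    transfer to the neighbouring GPU can then either run concurrently with
    the interior computation (overlap=True) or after it (overlap=False).
    """
    if overlap:
        # The transfer is hidden as long as it is shorter than the interior work.
        return boundary_compute + max(interior_compute, transfer)
    return boundary_compute + interior_compute + transfer


if __name__ == "__main__":
    # Example timings in milliseconds (made up for illustration).
    print(step_time(8, 2, 5, overlap=False))
    print(step_time(8, 2, 5, overlap=True))
```

In this model the transfer cost disappears entirely whenever the interior computation takes at least as long as the transfer. Otherwise only the difference remains visible.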
GDP: GPU accelerated Detailed Placement
Shounak Dhar (University of Texas at Austin)*; David Z Pan (University of Texas at Austin)
Placement is one of the runtime bottlenecks in an EDA (Electronic Design Automation) tool flow. Detailed placement is an important
part of placement which is hard to parallelize on a large scale. In this paper, we demonstrate GPU acceleration of a dynamic
programming based detailed placement algorithm which solves a generalized version of the Linear Arrangement Problem. Although
we test our algorithm on FPGA benchmarks, it can also be applied to ASIC placement. Similar dynamic programming algorithms
have also been used for simultaneous placement and routing. To the best of our knowledge, this is the first reported GPU
accelerated detailed placement algorithm other than simulated annealing. We achieve up to a 7x speedup in runtime over a
multi-threaded CPU implementation without any loss of quality of results (QoR).
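For readers unfamiliar with the Linear Arrangement Problem, the sketch below solves its classic (non-generalized) form with a textbook dynamic program over subsets of cells. This is not the algorithm from the paper, whose details the abstract does not give. It only illustrates why dynamic programming fits the problem: the total wire length of an ordering equals the sum, over every prefix of the ordering, of the number of nets cut between the prefix and the remaining cells. The function name and the graph encoding are assumptions, and the graph is assumed to have no self-loops or duplicate edges.

```python
def min_linear_arrangement(n, edges):
    """Minimum total edge length over all orderings of n cells on a line.

    Textbook O(2^n * n) subset dynamic program, practical only for small n.
    """
    # adj[v] is a bitmask of the neighbours of cell v.
    adj = [0] * n
    for u, v in edges:
        adj[u] |= 1 << v
        adj[v] |= 1 << u

    full = (1 << n) - 1
    # cut[S] = number of edges with exactly one endpoint in the subset S.
    cut = [0] * (1 << n)
    for S in range(1, 1 << n):
        v = (S & -S).bit_length() - 1   # index of the lowest cell in S
        prev = S & (S - 1)              # S without that cell
        inside = bin(adj[v] & prev).count("1")        # edges no longer cut
        outside = bin(adj[v] & ~S & full).count("1")  # edges newly cut
        cut[S] = cut[prev] - inside + outside

    # dp[S] = best cost contributed by the prefixes of an ordering that
    # places exactly the cells of S first.
    dp = [0] * (1 << n)
    for S in range(1, 1 << n):
        best_prev = min(dp[S & ~(1 << v)] for v in range(n) if (S >> v) & 1)
        dp[S] = best_prev + cut[S]
    return dp[full]


if __name__ == "__main__":
    # A chain of four cells: the natural order gives every edge length 1.
    print(min_linear_arrangement(4, [(0, 1), (1, 2), (2, 3)]))
```

The table has 2^n entries, and in this formulation all subsets of the same size can be evaluated independently of one another.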
Benchmarking Scalability of GPU Accelerated SAR Image Formation
Edward H Hill (BAE Systems); Thomas J Kragh (BAE Systems)*; Howard E Nichols (BAE Systems); Michael Minardi (AFRL); Steven
Scarborough (AFRL); Alexander Boytim (AFRL)
Commercial off-the-shelf GPU-based computing hardware has enabled real-time SAR Image Formation Processing (IFP) within the
size, weight, power, and cost constraints of small- to medium-sized Unmanned Aerial Vehicles. This paper presents an analysis of
SAR IFP throughput as a function of multiple radar parameters including range swath, pulse repetition frequency, coherent
integration time, and numerical precision (e.g., single- vs. double-precision floating point). This study demonstrates the rates at which SAR
images can be continuously processed using a range of NVIDIA GPUs including server-class (K80), desktop-class (GTX 1080) and
embedded (Jetson TX2) options. The ability to scale the algorithms for multiple simultaneous radar channels and multiple GPUs is
also demonstrated and discussed.
WCET Analysis of GPU L1 Data Caches
Yijie Huangfu (Virginia Commonwealth University); Wei Zhang (Virginia Commonwealth University)*
Graphics Processing Units (GPUs) have become widely used in high-performance computing. For real-time applications demanding
high throughput, GPUs can provide abundant computing power with high energy efficiency. However, modern GPUs are designed
to boost average-case performance, not for time predictability. Therefore, for hard real-time applications, it is crucial to estimate their
worst-case execution time (WCET) when running on GPUs. As the first step toward this goal, this paper applies abstract interpretation
to GPU L1 data caches to estimate the worst-case L1 data cache miss rate of GPU applications. The experimental results show that
the proposed analyzer can achieve safe and very tight estimates of GPU L1 data cache misses.
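As background on abstract interpretation of caches, the sketch below implements the classic "must" analysis for a single LRU cache set. It is not the analyzer from the paper: the abstract does not state the replacement policy or the cache organization of the GPU L1 data cache, so LRU and a single set are assumptions, and all names are hypothetical. An abstract state maps each memory block to an upper bound on its age, and a block present in the state is guaranteed to be cached.

```python
def must_update(state, block, assoc):
    """Abstract effect of accessing `block` in an LRU set with `assoc` ways."""
    # A block that is not in the state is treated as having the maximal age.
    old_age = state.get(block, assoc)
    new_state = {}
    for other, age in state.items():
        if other == block:
            continue
        # Only blocks younger than the accessed block grow older.
        new_age = age + 1 if age < old_age else age
        if new_age < assoc:          # otherwise the block may have been evicted
            new_state[other] = new_age
    new_state[block] = 0             # the accessed block becomes the youngest
    return new_state


def must_join(s1, s2):
    """Merge two control-flow paths: keep common blocks with their maximal age."""
    return {b: max(s1[b], s2[b]) for b in s1.keys() & s2.keys()}


def classify(state, block):
    """An access is a guaranteed hit only if the block is in the must state."""
    return "always-hit" if block in state else "unclassified"


if __name__ == "__main__":
    state = {}
    for b in ["a", "b", "a", "c"]:
        print(b, classify(state, b))
        state = must_update(state, b, assoc=2)
```

Accesses classified as "unclassified" have to be counted as potential misses, which keeps the estimate safe. The fewer such accesses remain, the tighter the resulting bound.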
Thursday, September 27, 2018
GPU Computing
3:00-4:40 in Eden Vale A1/A2
Chair: Brian Stroka / MITRE