2018 IEEE High Performance Extreme Computing Conference (HPEC '18) Twenty-second Annual HPEC Conference 25 - 27 September 2018 Westin Hotel, Waltham, MA USA
Utilizing GPU Parallelism to Improve Fast Spherical Harmonic Transforms
Max L Carlson (University of Utah)*; Hari Sundar (University of Utah)
Spherical harmonics form an orthogonal basis for functions that live on the surface of a sphere and are useful for solving partial differential equations and for numerical integration. The complexity of transforming a set of function samples to their corresponding spherical harmonic coefficients is largely dominated by the computation of the associated Legendre transform. This associated Legendre transform requires the computation of (L+1) dense matrix-vector products, where L is the order of the spherical harmonic expansion. Since the number of rows and columns of each of these matrices depends on L, this step is essentially O(L^3). In this paper, we explore the GPU parallelism available to improve the butterfly compression approach. We present some preliminary results showing performance increases for large problem sizes and eventually plan to release the MonarchSHT library for GPU spherical harmonic transforms.

A Multi-GPU PCISPH Implementation with Efficient Memory Transfers
Kevin Verma (ESS Engineering Software Steyr)*; Chong Peng (ESS Engineering Software Steyr); Kamil Szewc (ESS Engineering Software Steyr); Robert Wille (Nil)
Smoothed Particle Hydrodynamics (SPH) is a particle-based method for fluid flow modeling. One promising variant of SPH is Predictive-Corrective Incompressible SPH (PCISPH), which employs a dedicated prediction-correction scheme and thereby outperforms other SPH variants by almost one order of magnitude. However, like other particle-based methods, it suffers from high numerical complexity: simulating real-world phenomena requires several million particles. To make SPH applicable to real-world engineering problems, it is hence common to exploit the massive parallelism of multi-GPU architectures. However, certain algorithmic characteristics of PCISPH make it a non-trivial task to parallelize this method efficiently on multi-GPU systems. In this work, we propose, for the first time, a multi-GPU implementation of PCISPH. To this end, we propose a scheme that overlaps the memory transfers between GPUs with actual computations and thereby avoids the drawbacks caused by the mentioned algorithmic characteristics of PCISPH. Experimental evaluations confirm the efficiency of the proposed methods.

GDP: GPU accelerated Detailed Placement
Shounak Dhar (University of Texas at Austin)*; David Z Pan (University of Texas at Austin)
Placement is one of the runtime bottlenecks in an EDA (Electronic Design Automation) tool flow. Detailed placement is an important part of placement which is hard to parallelize on a large scale. In this paper, we demonstrate GPU acceleration of a dynamic programming based detailed placement algorithm which solves a generalized version of the Linear Arrangement Problem. Although we test our algorithm on FPGA benchmarks, it can also be applied to ASIC placement. Similar dynamic programming algorithms have also been used for simultaneous placement and routing. To the best of our knowledge, this is the first reported GPU accelerated detailed placement algorithm other than simulated annealing. We achieve up to 7x speedup in runtime over a multi-threaded CPU implementation without any loss of QoR.

Benchmarking Scalability of GPU Accelerated SAR Image Formation
Edward H Hill (BAE Systems); Thomas J Kragh (BAE Systems)*; Howard E Nichols (BAE Systems); Michael Minardi (AFRL); Steven Scarborough (AFRL); Alexander Boytim (AFRL)
Commercial off-the-shelf GPU-based computing hardware has enabled real-time SAR Image Formation Processing (IFP) within the size, weight, power, and cost constraints of small- to medium-sized Unmanned Aerial Vehicles. This paper presents an analysis of SAR IFP throughput as a function of multiple radar parameters including range swath, pulse repetition frequency, coherent integration time, and numerical precision (e.g., single- vs. double-precision floating point). This study demonstrates the rates at which SAR images can be continuously processed using a range of NVIDIA GPUs including server-class (K80), desktop-class (GTX 1080) and embedded (Jetson TX2) options. The ability to scale the algorithms for multiple simultaneous radar channels and multiple GPUs is also demonstrated and discussed.

WCET Analysis of GPU L1 Data Caches
Yijie Huangfu (Virginia Commonwealth University); Wei Zhang (Virginia Commonwealth University)*
Graphics Processing Units (GPUs) have become widely used in high-performance computing. For real-time applications demanding high throughput, GPUs can provide abundant computing power with high energy efficiency. However, modern GPUs are designed to boost average-case performance, not for time predictability. Therefore, for hard real-time applications, it is crucial to estimate the worst-case execution time (WCET) of those applications when running on GPUs. As the first step toward this goal, this paper applies abstract interpretation to GPU L1 data caches to estimate the worst-case L1 data cache miss rate of GPU applications. The experimental results show that the proposed analyzer can achieve safe and very tight estimation of GPU L1 data cache misses.
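The O(L^3) estimate quoted in the spherical harmonic transform abstract above can be reproduced by a simple operation count. The sketch below is illustrative only and is not taken from the paper: it assumes one dense matrix per order m = 0..L, with (L - m + 1) rows (one per degree) and `n_lat` columns (one per latitude sample), and it assumes `n_lat = L + 1` by default. The function name `legendre_matvec_madds` is hypothetical.

```python
def legendre_matvec_madds(L, n_lat=None):
    """Count multiply-adds for the (L+1) dense matrix-vector products.

    Assumptions (not from the abstract): the matrix for order m has
    (L - m + 1) rows and n_lat columns, and n_lat defaults to L + 1.
    """
    if n_lat is None:
        n_lat = L + 1  # assumed number of latitude samples
    total = 0
    for m in range(L + 1):            # one dense matvec per order m
        n_degrees = L - m + 1         # degrees l = m..L for this order
        total += n_degrees * n_lat    # cost of one dense matvec
    return total


if __name__ == "__main__":
    for L in (64, 128, 256, 512):
        print(L, legendre_matvec_madds(L))
```

Under these assumptions the count equals (L+1)^2 (L+2) / 2, so doubling L multiplies the work by roughly eight, which is the cubic growth the abstract refers to.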
Thursday, September 27, 2018
GPU Computing 3:00-4:40 in Eden Vale A1/A2 Chair: Brian Stroka / MITRE