2019 IEEE High Performance
Extreme Computing Conference
(HPEC ’19)
Twenty-third Annual HPEC Conference
24 - 26 September 2019
Westin Hotel, Waltham, MA USA
Wednesday, September 25, 2019
GPU Applications and Technology
10:20-12:00 in Eden Vale A1/A2
Chair: David Cousins
Progressive Optimization of Batched LU Factorization on GPUs
Ahmad Abdelfattah, Stanimire Tomov, and Jack Dongarra (ICL UTK)
This paper presents a progressive approach to optimizing batched LU factorization on graphics processing units (GPUs). The paper shows that relying on level-3 BLAS routines alone does not really pay off, and that it is important to address the memory-bound part of the algorithm, especially when the problem size is very small. In this context, we develop a size-aware multi-level blocking technique that uses different kernel-fusion granularities according to the problem size. Our experiments, conducted on a Tesla V100 GPU, show that the multi-level blocking technique achieves speedups of up to 3.28x/2.69x in single/double precision over a generic LAPACK-style implementation, and is up to 8.72x/7.2x faster than the cuBLAS library in single and double precision, respectively.
The developed solution is integrated into the open-source MAGMA library.
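The blocking idea at the heart of this approach, factoring a narrow panel with a memory-bound kernel and then applying the trailing update as a compute-bound level-3 BLAS operation, can be sketched for a single matrix as follows. This is an illustrative NumPy sketch, not the MAGMA implementation: it omits pivoting for clarity, so it assumes a matrix (e.g., a diagonally dominant one) for which LU without pivoting is stable, and the function name `blocked_lu` and block size `nb` are our own.

```python
import numpy as np

def blocked_lu(A, nb):
    """Right-looking blocked LU factorization without pivoting.

    Returns unit-lower-triangular L and upper-triangular U with A = L @ U.
    """
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        b = min(nb, n - k)
        # Panel factorization: unblocked, memory-bound LU of A[k:n, k:k+b].
        for i in range(k, k + b):
            A[i + 1:n, i] /= A[i, i]
            A[i + 1:n, i + 1:k + b] -= np.outer(A[i + 1:n, i], A[i, i + 1:k + b])
        if k + b < n:
            # Triangular solve: U12 = L11^{-1} A12.
            L11 = np.tril(A[k:k + b, k:k + b], -1) + np.eye(b)
            A[k:k + b, k + b:] = np.linalg.solve(L11, A[k:k + b, k + b:])
            # Trailing update: the level-3 BLAS (GEMM) part, A22 -= L21 @ U12.
            A[k + b:, k + b:] -= A[k + b:, k:k + b] @ A[k:k + b, k + b:]
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    return L, U
```

With `nb` equal to the matrix size this degenerates to the unblocked algorithm; the paper's contribution is choosing the granularity (and fusing kernels) according to problem size, which this sketch does not attempt.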
[Best Student Paper Finalist] Low Overhead Instruction Latencies Characterization for NVIDIA GPGPUs
Yehia Arafa (NMSU), Abdel-Hameed A. Badawy (NMSU, LANL), Gopinath Chennupati (LANL), Nandakishore Santhi (LANL), Stephan Eidenbenz
(LANL)
The last decade has seen a shift in the computer systems industry as heterogeneous computing has become prevalent: Graphics Processing Units (GPUs) are now found in everything from supercomputers to mobile phones and tablets. GPUs are used for graphics operations as well as general-purpose computing (GPGPU) to boost the performance of compute-intensive applications. However, many microarchitectural characteristics remain undisclosed beyond what vendors provide. In this paper, we introduce a very low overhead, portable analysis that exposes the latency of each instruction executing in the GPU pipeline(s) and the access overhead of the various memory hierarchies found in GPUs at the microarchitecture level. Furthermore, we show the impact of the various optimizations the CUDA compiler can perform on these latencies. We perform our evaluation on seven different high-end NVIDIA GPUs from five generations/architectures: Kepler, Maxwell, Pascal, Volta, and Turing. The results in this paper can help architects obtain an accurate characterization of the latencies of these GPUs, which in turn helps in modeling the hardware accurately; software developers can likewise use them to perform informed optimizations of their applications.
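The core idea behind such latency microbenchmarks is to time a long chain of data-dependent operations, so that per-operation latency cannot be hidden by pipelining, and divide the elapsed time by the chain length. The authors do this on the GPU with hardware clock registers; the Python sketch below illustrates only the method, on the CPU, where interpreter overhead dominates the measured value, so the result is a per-iteration cost rather than a hardware instruction latency. The helper name `measure_chain_latency` is our own.

```python
import time

def measure_chain_latency(op, n=100_000):
    """Estimate per-operation cost (ns) of a chain of dependent ops.

    Each iteration consumes the previous result, serializing the chain
    so the loop time scales with the operation's latency.
    """
    x = 1.0
    t0 = time.perf_counter_ns()
    for _ in range(n):
        x = op(x)  # data dependency: this call needs the previous result
    t1 = time.perf_counter_ns()
    return (t1 - t0) / n

# Example: per-iteration cost of a dependent floating-point multiply.
mul_ns = measure_chain_latency(lambda x: x * 1.0000001)
```

On a GPU the same structure is expressed as an unrolled chain of dependent PTX/SASS instructions bracketed by `clock()` reads, which removes the loop and interpreter overheads this sketch suffers from.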
Distributed Deep Learning for Precipitation Nowcasting
Siddharth Samsi, Christopher J. Mattioli, Mark S. Veillette (MIT-LL)
Effective training of Deep Neural Networks requires massive amounts of data and compute. As a result, complex models that require large datasets take longer to train, which can severely limit research on model development and the exploitation of all available data. In this paper, this problem is investigated in the context of precipitation nowcasting, a term used to describe highly detailed short-term forecasts of precipitation and other hazardous weather. Convolutional Neural Networks (CNNs) are a powerful class of models well-suited to this task; however, the high-resolution input weather imagery, combined with the model complexity required to process it, makes training CNNs for this task time consuming. To address this issue, a data-parallel approach is implemented in which a CNN is replicated across multiple compute nodes and the training batches are distributed across them. By leveraging multiple GPUs, we show that the training time for a given nowcasting model architecture can be reduced from 59 hours to just over 1 hour. This allows faster iteration on CNN architectures and will facilitate future advances in nowcasting.
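The data-parallel scheme described above, replicating the model, giving each node a shard of the global batch, and averaging the local gradients (an all-reduce) before every weight update, can be sketched in a few lines. The NumPy sketch below simulates the nodes within a single process and uses a linear model in place of a CNN; names such as `data_parallel_step` are our own, not from the paper.

```python
import numpy as np

def data_parallel_step(w, X, y, lr=0.1, n_nodes=4):
    """One synchronous data-parallel gradient step for linear least squares."""
    # Distribute the global batch: one shard per simulated compute node.
    X_shards = np.array_split(X, n_nodes)
    y_shards = np.array_split(y, n_nodes)
    # Each node computes the gradient of its local loss on its shard.
    grads = [Xi.T @ (Xi @ w - yi) / len(yi)
             for Xi, yi in zip(X_shards, y_shards)]
    # All-reduce: average the local gradients (shards are equal-sized here),
    # then apply the same update on every replica.
    return w - lr * np.mean(grads, axis=0)
```

With equal shard sizes the averaged gradient equals the full-batch gradient, so the replicas stay bitwise-identical after each step; real frameworks implement the same averaging with ring or tree all-reduce over the interconnect.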
Embedded GPU Cluster Computing Framework for Inference of Convolutional Neural Networks
Evan Kain (SHREC @ Pitt), Diego Wildenstein (Arizona State Univ.), Andrew C. Pineda (AFRL)
The growing need for on-board image processing on space vehicles requires computing solutions that are both low-power and high-performance. Parallel computation using low-power embedded Graphics Processing Units (GPUs) satisfies both requirements. Our experiment uses OpenMPI domain decomposition of an image processing algorithm based on a pre-trained convolutional neural network (CNN) developed by the U.S. Air Force Research Laboratory (AFRL). Our testbed consists of six NVIDIA Jetson TX2 development boards operating in parallel. This parallel framework achieves a speedup of 4.3× on six processing nodes, with a roughly linear decay in parallel efficiency as more processing nodes are added to the network. By replicating the data across processors in addition to distributing it, we also characterize the best-case impact of adding triple modular redundancy (TMR) to our application.
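Row-wise domain decomposition and elementwise TMR majority voting can be illustrated as follows. This sketch is our own: it stands in a trivial threshold filter for the CNN inference stage and simulates the MPI ranks in one process; all function names are hypothetical.

```python
import numpy as np

def decompose_rows(img, n_nodes):
    # Domain decomposition: split the image row-wise, one shard per rank.
    return np.array_split(img, n_nodes, axis=0)

def process_shard(shard):
    # Stand-in for per-node CNN inference: a simple threshold filter.
    return (shard > 128).astype(np.uint8)

def tmr_vote(a, b, c):
    # Elementwise majority vote over three replica outputs: wherever two
    # replicas agree, take that value; this masks one faulty replica.
    return np.where(a == b, a, c)
```

Distributing the shards exploits parallelism; replicating each shard onto three nodes instead trades that parallelism for the fault masking that the vote provides, which is the cost the paper characterizes.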
Heterogeneous Cache Hierarchy Management for Integrated CPU-GPU Architecture
Hao Wen, Wei Zhang (Virginia Commonwealth University)
Unlike traditional CPU-GPU heterogeneous architectures, in which the CPU and GPU have separate DRAM and memory address spaces, current integrated CPU-GPU architectures place the CPU and GPU on the same die, sharing the same last-level cache (LLC) and memory. In a two-level cache hierarchy where the CPU and GPU have their own private L1 caches but share the LLC, conflict misses in the LLC between CPU and GPU requests may degrade the performance of both. In addition, how the CPU and GPU memory request flows (the write-back flow from L1 and the cache-fill flow from main memory) are managed can also affect performance. In this work, we study three cache request flow management policies. The first policy is selective GPU LLC fill, which selectively fills GPU requests in the LLC. The second policy is selective GPU L1 write back, which selectively writes back GPU blocks from the L1 cache to the L2 cache. The final policy is a hybrid of the first two that also selectively replaces CPU blocks in the LLC. Our experimental results indicate that the hybrid policy is the best of the three: on average, it improves CPU performance by about 10%, with a maximum CPU improvement of 22%, at an average GPU performance overhead of 0.8%.
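The intuition behind selective GPU LLC fill can be shown with a toy LRU cache simulator: if streaming GPU misses are not filled into the shared LLC, they stop evicting the CPU's reused blocks. This is our own minimal sketch, not the paper's simulator or policy logic (the paper fills GPU requests selectively, whereas this sketch bypasses all GPU fills under the policy); the class and trace are hypothetical.

```python
from collections import OrderedDict

class LLC:
    """Toy fully-associative LRU last-level cache shared by CPU and GPU."""

    def __init__(self, capacity, gpu_bypass=False):
        self.capacity = capacity
        self.gpu_bypass = gpu_bypass  # selective-fill policy: GPU misses skip the LLC
        self.lines = OrderedDict()    # address -> present, in LRU order
        self.hits = {"cpu": 0, "gpu": 0}
        self.accesses = {"cpu": 0, "gpu": 0}

    def access(self, addr, who):
        self.accesses[who] += 1
        if addr in self.lines:
            self.lines.move_to_end(addr)  # refresh LRU position
            self.hits[who] += 1
            return True
        if who == "gpu" and self.gpu_bypass:
            return False  # miss, but do not fill: protect CPU blocks
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the LRU line
        self.lines[addr] = True
        return False

def run_trace(cache):
    """Interleave a small reused CPU working set with a GPU streaming flow."""
    for t in range(100):
        cache.access(t % 4, "cpu")      # CPU reuses 4 blocks
        cache.access(1000 + t, "gpu")   # GPU streams, never reuses
    return cache.hits["cpu"] / cache.accesses["cpu"]
```

On this trace the shared (always-fill) cache lets the GPU stream evict every CPU block before reuse, while the selective-fill cache preserves the CPU working set, the same effect the paper measures as CPU speedup at small GPU cost.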