2015 IEEE High Performance
Extreme Computing Conference
(HPEC ’15)
Nineteenth Annual HPEC Conference
15 - 17 September 2015
Westin Hotel, Waltham, MA USA
GPU 2
3:00-4:40 in Eden Vale A3
Chair: Hahn Kim / Lockheed Martin
Algorithm Flattening: Complete branch elimination
for GPU requires a paradigm shift from CPU
thinking
Lucas Vespa, Alexander Bauman, Jenny Wells, University of
Illinois Springfield
Graphics processing units (GPUs) have inadvertently become supercomputers in their own right, to the benefit of applications outside of graphics. Acceleration of multiple orders of magnitude has been achieved in scientific computing, co-processing, and the like. However, the Single Instruction Multiple Data (SIMD) design of GPUs is extremely sensitive to thread divergence, so much so that for many applications the performance improvement from GPUs is all but eviscerated by it. This problem has driven general-purpose GPU computing toward finding "appropriate" applications to accelerate, rather than accelerating applications that actually need performance improvements. Thread divergence is generally caused by branches. Previous research has addressed reducing branches, but none of it aims to eliminate branches entirely, because the methods required for complete branch elimination are a drastic de-optimization on a CPU. We present such a CPU de-optimization, which completely removes all branches and yields a significant optimization for GPU-accelerated applications: it eliminates thread divergence, substantially decreases execution time, enables the implementation on GPU of algorithms that previously could not fully utilize GPU capability, and produces deterministic performance. Our technique replaces branches with a reduced equation and achieves a substantial speedup of algorithms and applications that are already GPU accelerated.
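The abstract does not spell out what the reduced-equation replacement looks like, but the general idea of trading a data-dependent branch for arithmetic can be sketched in a few lines. This is an illustrative toy, not the authors' method; the clamp example and both function names are our own:

```python
def clamp_branchy(x, lo, hi):
    # CPU-style: data-dependent branches are cheap on a scalar core,
    # but on a GPU each lane of a warp may take a different path,
    # serializing the warp (thread divergence).
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

def clamp_flat(x, lo, hi):
    # Branch-free: the same result expressed as a single equation,
    # so every SIMD lane executes an identical instruction stream.
    below = int(x < lo)          # 0 or 1, computed without a jump
    above = int(x > hi)
    inside = 1 - below - above
    return below * lo + above * hi + inside * x

# Both forms agree on every input.
for x in (-3.0, 0.5, 7.0):
    assert clamp_flat(x, 0.0, 1.0) == clamp_branchy(x, 0.0, 1.0)
```

In a real kernel the comparisons would compile to predicate registers rather than Python ints, but the structure is the same: select among candidate results arithmetically instead of jumping.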
GPU Implementation of Reverse Coordinate
Conversion for Proteins
Mahsa Bayati, Jaydeep P. Bardhan, Miriam Leeser,
Northeastern Univ.
We describe the first parallel algorithm to obtain
Cartesian coordinates of a molecular structure
from internal coordinates such as bond lengths
and angles. This algorithm is important for
problems such as protein engineering and fitting
protein structures to experimental data, because
different stages of molecular simulation can
require one structure representation or the other,
and many such conversions are required during
the course of a calculation such as molecular
dynamics. Proteins contain thousands to millions
of atoms. Their positions can be represented using
one of two methods: Cartesian or internal
coordinates (bond lengths, angles, etc.). In
molecular dynamics and modeling of proteins in
different conformational states, it is often
necessary to transform one coordinate system to
another. In addition, since proteins change over
time, any computation must be done over
successive time frames, increasing the
computational load. To lessen this computational
load we have applied different parallel techniques
to the protein conversion problem. Converting
Cartesian coordinates to internal ones is easily
parallelized and we have previously reported a
GPU implementation. On the other hand, the reverse direction, used in molecular simulations for tasks such as protein engineering and fitting atomic structures to experimental data, is much more challenging to parallelize [1]. This computation has inherent
dependency in the data structures because bond
lengths and angles are relative to neighboring
atoms. Existing implementations walk over a
protein structure in a sequential fashion. This
paper presents the first parallel implementation of
internal to Cartesian coordinates, in which
substructures of the protein backbone are
converted into their own local Cartesian coordinate
spaces and then combined using a reduction
technique to find global Cartesian coordinates. We
observe an order-of-magnitude speedup from parallel processing on a GPU.
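The reduction over local coordinate frames can be illustrated with a toy sketch (our own simplification, not the paper's implementation): rigid-transform composition is associative, so per-substructure transforms can be combined in a log-depth tree, which is the property a GPU reduction exploits. The 2-D transforms and function names below are hypothetical stand-ins for the paper's 3-D bond geometry:

```python
import math

def compose(a, b):
    # Compose two 2-D rigid transforms (angle, tx, ty): apply b, then a.
    # Composition is associative, which is what permits a tree reduction.
    ang_a, ax, ay = a
    ang_b, bx, by = b
    c, s = math.cos(ang_a), math.sin(ang_a)
    return (ang_a + ang_b, ax + c * bx - s * by, ay + s * bx + c * by)

def tree_reduce(transforms):
    # Log-depth pairwise reduction; on a GPU each level runs in parallel,
    # so n transforms combine in O(log n) steps instead of a serial walk.
    while len(transforms) > 1:
        nxt = [compose(transforms[i], transforms[i + 1])
               for i in range(0, len(transforms) - 1, 2)]
        if len(transforms) % 2:     # carry the odd element up a level
            nxt.append(transforms[-1])
        transforms = nxt
    return transforms[0]
```

By associativity the tree reduction yields the same global placement as a sequential left-to-right walk over the chain, up to floating-point rounding.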
Bisection and Twisted SVD on GPU
Lu He, Yan Luo, University of Massachusetts Lowell, Rui Liu,
Wake Forest University, Hengyong Yu, Yu Cao, University of
Massachusetts Lowell, Xuzhou Chen, Fitchburg State
University, Seung Woo Son, University of Massachusetts
Lowell
Singular value decomposition (SVD) is one of the
most important factorizations in matrix
computation. However, computing SVD is still
time-consuming, especially when the dimension of
matrices exceeds tens of thousands. In this paper,
we present a high performance approach called
“Bisection and Twisted” (BT) for solving bidiagonal
SVD. Given the computational advantages that modern general-purpose GPUs offer for parallel computing, we implement the BT algorithm on single and multiple GPUs. With carefully designed GPU kernels, the BT algorithm is about 10 times faster than the MKL divide-and-conquer routine DBDSDC on an 8-core 2.53 GHz CPU, and 36 times faster than the CULA QR routine DBDSQR on the same GPUs. Additionally, the BT algorithm
is able to compute SVD for matrices of size 1
million by 1 million with only two GPUs. To the best
of our knowledge, no implementation has achieved
such a scale.
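A minimal sketch of the bisection half of the BT approach (our own serial illustration, not the authors' kernels; the twisted-factorization step that recovers singular vectors is omitted): for an upper bidiagonal B, the singular values are the square roots of the eigenvalues of the tridiagonal T = BᵀB, and each eigenvalue can be isolated by an independent bisection driven by a Sturm-sequence count, which is what makes the method embarrassingly parallel. Both function names here are our own:

```python
def count_less(diag, off, x):
    # Sturm-sequence count: number of eigenvalues of the symmetric
    # tridiagonal matrix (diag, off) strictly below x, via the signs
    # of the pivots of an LDL^T factorization of T - x*I.
    count, q = 0, 1.0
    for i in range(len(diag)):
        q = diag[i] - x - (off[i - 1] ** 2 / q if i else 0.0)
        if q == 0.0:
            q = 1e-300              # nudge off an exact eigenvalue
        if q < 0.0:
            count += 1
    return count

def kth_singular_value(d, e, k, iters=100):
    # Upper bidiagonal B (diagonal d, superdiagonal e) -> tridiagonal
    # T = B^T B, then bisect for the k-th smallest eigenvalue lambda_k;
    # sigma_k = sqrt(lambda_k). Each k can be bisected independently,
    # so on a GPU every singular value gets its own thread.
    n = len(d)
    diag = [d[i] ** 2 + (e[i - 1] ** 2 if i else 0.0) for i in range(n)]
    off = [d[i] * e[i] for i in range(n - 1)]
    # Gershgorin-style upper bound on the spectrum of T.
    lo, hi = 0.0, max(diag) + 2.0 * max([abs(o) for o in off] + [0.0])
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if count_less(diag, off, mid) >= k + 1:
            hi = mid
        else:
            lo = mid
    return (0.5 * (lo + hi)) ** 0.5
```

Because each bisection touches only the two tridiagonal arrays and its own interval, the per-value work is independent, which is the property the multi-GPU scaling in the abstract relies on.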
GPU accelerated geometric multigrid method:
comparison with preconditioned conjugate
gradient
Iulian Stroia, Lucian Itu, Cosmin Niță, Laszlo Lazăr,
Constantin Suciu, Siemens
Full-Chain Benchmarking for Open Architecture
Airborne ISR Systems: A Case Study for GMTI
Radar Applications
Matthias Beebe, Matthew Alexander, Paul Foley, Denise
Galejs, Stephen Mooney, Iulian Popescu, Kevin
Rottman, Meryl Stav
MIT Lincoln Laboratory
As the airborne ISR application space evolves, the
quantities of data acquired by remote sensing
systems such as radar, electro-optical, and
infrared systems are growing larger, and advanced
algorithms are imposing more challenging
computational requirements for real-time
processing. While the difficulties of processing sensor data in real time are the topic of extensive research, rapidly shifting technology and application complexity have led to pronounced
system lifecycle challenges, including the constant
threat of technology obsolescence and
unsustainable maintenance costs. One way for
government programs to address this reality
economically is to shift the ISR system acquisition
strategy to facilitate timely, cost-effective insertion
and upgrading of technologies through the use of an open-architecture (OA) approach to system design standards for application-ready
processors (ARPs). OA design leverages industry-
standard hardware and middleware, thus engaging
a broader development community and lowering
barriers for third-party application development.
For this approach to succeed without sacrificing
functional capabilities and real-time performance,
effective benchmarks are necessary to ensure that
an ARP system can meet mission constraints and
performance requirements of real-world
applications. This work investigates the
measurement of real-time performance of
commodity high-speed processing solutions and
middleware for airborne systems using OA
composite benchmarks, i.e., benchmarks that
characterize computational performance of the
system as a whole, while also validating OA
principles. For ISR systems, processing
performance can be counterbalanced by size,
weight and power (SWaP) constraints that often
necessitate application-specific configurations (e.g.
mapping and scheduling) for system-level
optimization. Using ground moving target
indicator (GMTI) radar as an example, we
demonstrate the use of an open architecture
benchmarking framework using industry-standard
middleware to indicate the suitability of candidate
systems for ISR applications under constrained
SWaP.
Thursday, September 17