2015 IEEE High Performance
Extreme Computing Conference
(HPEC ’15)
Nineteenth Annual HPEC Conference
15 - 17 September 2015
Westin Hotel, Waltham, MA USA

GPU 2
3:00-4:40 in Eden Vale A3
Chair: Hahn Kim / Lockheed Martin
Algorithm Flattening: Complete branch elimination for GPU requires a paradigm shift from CPU thinking
Lucas Vespa, Alexander Bauman, Jenny Wells, University of Illinois Springfield
Graphics processing units (GPUs) have inadvertently become supercomputers in and of themselves, to the benefit of
applications outside of graphics. Speedups of multiple orders of magnitude have been achieved in scientific computing, co-processing, and the like. However, the Single Instruction Multiple Data (SIMD) design of GPUs is so sensitive to thread divergence that, for many applications, the performance gains from GPUs are all but eliminated. This problem has driven general-purpose GPU computing toward finding "appropriate" applications to accelerate, rather than accelerating the applications that most need performance improvements. Thread divergence is generally
caused by branches. Previous research has addressed the issue of reducing branches, but none of this work aims to entirely
eliminate branches, because the methods required for complete branch elimination are a drastic de-optimization for CPU. We present a CPU de-optimization that completely removes all branches and yields a significant optimization for GPU-accelerated applications: it eliminates thread divergence, substantially decreases execution time, enables the implementation on GPU of algorithms that previously could not fully utilize GPU capability, and produces deterministic performance. Our optimization removes branches, replacing them with a reduced equation, and achieves a substantial speedup of already GPU-accelerated algorithms and applications.
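The abstract does not reproduce the paper's reduced equations. As a minimal illustration of the flattening idea (the functions and constants below are invented for the example, not taken from the paper), a data-dependent branch can be replaced by an arithmetic blend so that every thread executes the same instruction stream:

```python
def branchy(x, c):
    """Branchy version: threads in a SIMD warp diverge when c differs."""
    if c:
        return x * 2.0
    else:
        return x + 1.0

def flattened(x, c):
    """Flattened version: the branch becomes an arithmetic blend, so both
    arms are evaluated and the predicate selects between them. Every
    thread executes the same instructions regardless of c."""
    m = float(c)  # predicate as 0.0 / 1.0
    return m * (x * 2.0) + (1.0 - m) * (x + 1.0)
```

On a CPU this is a de-optimization (both arms are always computed), which matches the paper's framing of the trade-off.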
GPU Implementation of Reverse Coordinate Conversion for Proteins
Mahsa Bayati, Jaydeep P. Bardhan, Miriam Leeser, Northeastern Univ.
We describe the first parallel algorithm to obtain Cartesian coordinates of a molecular structure from internal coordinates such
as bond lengths and angles. This algorithm is important for problems such as protein engineering and fitting protein structures to
experimental data, because different stages of molecular simulation can require one structure representation or the other, and
many such conversions are required during the course of a calculation such as molecular dynamics. Proteins contain thousands
to millions of atoms. Their positions can be represented using one of two methods: Cartesian or internal coordinates (bond
lengths, angles, etc.). In molecular dynamics and modeling of proteins in different conformational states, it is often necessary to
transform one coordinate system to another. In addition, since proteins change over time, any computation must be done over
successive time frames, increasing the computational load. To lessen this computational load we have applied different parallel
techniques to the protein conversion problem. Converting Cartesian coordinates to internal ones is easily parallelized and we
have previously reported a GPU implementation. The reverse direction, however, is much more challenging to parallelize [1]. This computation has an inherent dependency in its data structures because bond lengths and angles are relative to neighboring
atoms. Existing implementations walk over a protein structure in a sequential fashion. This paper presents the first parallel
implementation of internal to Cartesian coordinates, in which substructures of the protein backbone are converted into their own
local Cartesian coordinate spaces and then combined using a reduction technique to find global Cartesian coordinates. We
observe order of magnitude speedup using parallel processing on a GPU.
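The combining step described above relies on rigid-body transforms being associative, so local frames can be merged pairwise instead of walked serially. A minimal 2D serial sketch of the reduction pattern (the real algorithm works on 3D backbone substructures; everything here is illustrative):

```python
import math

def rigid2d(theta, tx, ty):
    """A 2D rigid-body transform as a 3x3 homogeneous matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]

def matmul3(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def sequential_compose(ts):
    """The serial walk used by existing implementations: fold left to right."""
    out = ts[0]
    for t in ts[1:]:
        out = matmul3(out, t)
    return out

def tree_compose(ts):
    """Order-preserving pairwise reduction: because composition is
    associative, adjacent segment transforms can be combined in parallel,
    halving the chain length at every level (log-depth on a GPU)."""
    while len(ts) > 1:
        nxt = [matmul3(ts[i], ts[i + 1]) for i in range(0, len(ts) - 1, 2)]
        if len(ts) % 2:
            nxt.append(ts[-1])
        ts = nxt
    return ts[0]
```

Both orderings produce the same global transform; the tree form is what maps onto a GPU reduction.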
Bisection and Twisted SVD on GPU
Lu He, Yan Luo, University of Massachusetts Lowell, Rui Liu, Wake Forest University, Hengyong Yu, Yu Cao, University of Massachusetts
Lowell, Xuzhou Chen, Fitchburg State University, Seung Woo Son, University of Massachusetts Lowell
Singular value decomposition (SVD) is one of the most important factorizations in matrix computation. However, computing SVD
is still time-consuming, especially when the dimension of matrices exceeds tens of thousands. In this paper, we present a high
performance approach called “Bisection and Twisted” (BT) for solving bidiagonal SVD. As modern general purpose GPUs have
shown their extreme computational advantages in parallel computing, we implement the BT algorithm on single and multiple
GPUs. With our carefully designed GPU kernels, the BT algorithm is about 10 times faster than MKL divide-and-conquer routine
DBDSDC on an 8-core 2.53GHz CPU, and 36 times faster than CULA QR routine DBDSQR on the same GPUs. Additionally, the
BT algorithm is able to compute SVD for matrices of size 1 million by 1 million with only two GPUs. To the best of our
knowledge, no implementation has achieved such a scale.
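For a bidiagonal matrix B, the singular values are the square roots of the eigenvalues of the symmetric tridiagonal matrix BᵀB, and the bisection half of the BT approach locates each one independently. A minimal serial sketch of the tridiagonal bisection core (the twisted-factorization step for singular vectors is omitted, and the paper's GPU kernels are not reproduced):

```python
def sturm_count(d, e, x):
    """Count eigenvalues of the symmetric tridiagonal matrix (diagonal d,
    off-diagonal e) strictly below x, using the sequence of pivots from
    the LDL^T factorization of T - x*I (a Sturm count)."""
    count, q = 0, d[0] - x
    if q < 0.0:
        count += 1
    for i in range(1, len(d)):
        if q == 0.0:          # guard against a zero pivot
            q = -1e-300
        q = d[i] - x - e[i - 1] ** 2 / q
        if q < 0.0:
            count += 1
    return count

def bisect_eig(d, e, k, lo, hi, tol=1e-12):
    """Find the k-th smallest eigenvalue (0-indexed) in [lo, hi] by
    bisection. Each eigenvalue can be located independently, which is
    what makes bisection embarrassingly parallel across GPU threads."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sturm_count(d, e, mid) <= k:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Because each thread needs only the diagonals and a search interval, the memory footprint stays small, which is consistent with the million-by-million scale reported above.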
GPU accelerated geometric multigrid method: comparison with preconditioned conjugate gradient
Iulian Stroia, Lucian Itu, Cosmin Niță, Laszlo Lazăr, Constantin Suciu, Siemens
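The title names the geometric multigrid method. As a compact reference point, a minimal 1D V-cycle for the Poisson problem -u'' = f on [0,1] (grid sizes, smoother, and weighting are illustrative textbook choices, not details from the talk):

```python
def jacobi(u, f, iters=2, w=2.0 / 3.0):
    """Weighted-Jacobi smoothing with fixed endpoint values."""
    n = len(u) - 1
    h2 = (1.0 / n) ** 2
    for _ in range(iters):
        u = [u[0]] + [(1 - w) * u[i] + w * 0.5 * (u[i - 1] + u[i + 1] + h2 * f[i])
                      for i in range(1, n)] + [u[n]]
    return u

def residual(u, f):
    n = len(u) - 1
    h2 = (1.0 / n) ** 2
    return [0.0] + [f[i] - (2 * u[i] - u[i - 1] - u[i + 1]) / h2
                    for i in range(1, n)] + [0.0]

def restrict(r):
    """Full-weighting restriction onto a grid with half as many intervals."""
    n = len(r) - 1
    return [r[0]] + [0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1]
                     for i in range(1, n // 2)] + [r[n]]

def prolong(e):
    """Linear interpolation of the coarse correction back to the fine grid."""
    n = (len(e) - 1) * 2
    u = [0.0] * (n + 1)
    for i, v in enumerate(e):
        u[2 * i] = v
    for i in range(1, n, 2):
        u[i] = 0.5 * (u[i - 1] + u[i + 1])
    return u

def v_cycle(u, f):
    if len(u) - 1 <= 2:
        return jacobi(u, f, iters=50)   # near-exact coarse-grid solve
    u = jacobi(u, f)                    # pre-smooth
    e = v_cycle([0.0] * (len(u) // 2 + 1), restrict(residual(u, f)))
    u = [ui + ei for ui, ei in zip(u, prolong(e))]
    return jacobi(u, f)                 # post-smooth
```

The smoothing and residual loops are pointwise and grid-parallel, which is what makes geometric multigrid a natural candidate for GPU acceleration alongside preconditioned conjugate gradient.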
Full-Chain Benchmarking for Open Architecture Airborne ISR Systems: A Case Study for GMTI Radar Applications
Matthias Beebe, Matthew Alexander, Paul Foley, Denise Galejs, Stephen Mooney, Iulian Popescu, Kevin Rottman, Meryl Stav
MIT Lincoln Laboratory
As the airborne ISR application space evolves, the quantities of data acquired by remote sensing systems such as radar,
electro-optical, and infrared systems are growing larger, and advanced algorithms are imposing more challenging computational
requirements for real-time processing. While the difficulty of processing sensor data in real time is the topic of extensive research, rapidly shifting technology and growing application complexity have led to pronounced system lifecycle challenges, including
the constant threat of technology obsolescence and unsustainable maintenance costs. One way for government programs to
address this reality economically is to shift the ISR system acquisition strategy to facilitate timely, cost-effective insertion and
upgrading of technologies through the utilization of an open architecture (OA) approach to system design standards for
application ready processors (ARPs). OA design leverages industry-standard hardware and middleware, thus engaging a
broader development community and lowering barriers for third-party application development. For this approach to succeed
without sacrificing functional capabilities and real-time performance, effective benchmarks are necessary to ensure that an ARP
system can meet mission constraints and performance requirements of real-world applications. This work investigates the
measurement of real-time performance of commodity high-speed processing solutions and middleware for airborne systems
using OA composite benchmarks, i.e., benchmarks that characterize the computational performance of the system as a whole, while
also validating OA principles. For ISR systems, processing performance can be counterbalanced by size, weight and power
(SWaP) constraints that often necessitate application-specific configurations (e.g. mapping and scheduling) for system-level
optimization. Using ground moving target indicator (GMTI) radar as an example, we demonstrate the use of an open
architecture benchmarking framework using industry-standard middleware to indicate the suitability of candidate systems for ISR
applications under constrained SWaP.
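Full-chain benchmarking of the kind described above can be sketched as a harness that times each stage of a processing chain as well as the end-to-end pipeline. The harness below is a generic illustration; the stage functions are caller-supplied placeholders, not stages from the paper's GMTI chain:

```python
import time

def benchmark_chain(stages, data, runs=5):
    """Time each (name, function) stage of a processing chain, plus the
    end-to-end chain, averaged over several runs. Nothing here is tied to
    a particular middleware or processor."""
    per_stage = {name: 0.0 for name, _ in stages}
    t_start = time.perf_counter()
    for _ in range(runs):
        x = data
        for name, fn in stages:
            t0 = time.perf_counter()
            x = fn(x)                      # each stage consumes the previous output
            per_stage[name] += time.perf_counter() - t0
    total = (time.perf_counter() - t_start) / runs
    return {name: t / runs for name, t in per_stage.items()}, total
```

A GMTI-style chain would plug in its real stages (e.g. pulse compression, Doppler processing, detection) in place of the placeholders, letting per-stage and whole-chain timings be compared across candidate ARP systems.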

Thursday, September 17
