2015 IEEE High Performance
Extreme Computing Conference
(HPEC ’15)
Nineteenth Annual HPEC Conference
15 - 17 September 2015
Westin Hotel, Waltham, MA USA

GPU 1
1:00-2:40 in Eden Vale A3
Chair: Dan Campbell / GTRI
Sorting Sixteen Numbers
Ming Ouyang, University of Massachusetts Boston
Sorting is a basic operation in data processing. A common direction in research is to design fast sorting methods for millions of
numbers. The focus of this article is to sort 16 numbers. Let a hextuple be an unordered tuple of 16 numbers. Although the
data may consist of thousands to millions of hextuples, the task is to sort the numbers in each hextuple. GPUs have become
powerful coprocessors to CPUs. Sorting networks, originally meant for hardware implementation, are suitable for sorting many
hextuples on GPUs. Batcher's sorting network for 16 numbers has ten parallel steps, whereas Van Voorhis's network has nine
parallel steps. Software implementations of the former are well-known and publicly available, whereas the latter seems to
remain on paper. The main results in this article are implementations of Van Voorhis's network. After several iterations of
improvement, the final implementation of Van Voorhis's network is more than ten percent faster than the existing code for
Batcher's network. Lastly, insights gained in implementing Van Voorhis's network lead to improved implementation of Batcher's
network. The last result is useful for sorting more than 16 numbers.
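Batcher's network lends itself to a compact sketch. The following illustrative Python (not the authors' code) generates the compare-exchange pairs of Batcher's odd-even merge sort and applies them to one tuple of 16 numbers; for n = 16 the network has 63 comparators arranged in the ten parallel steps the abstract mentions. Because the comparison sequence is data-independent, the same pairs can be applied to many tuples in lockstep, which is what makes the approach GPU-friendly.

```python
def batcher_pairs(n):
    """Comparator (i, j) pairs for Batcher's odd-even merge sort; n must be a power of 2."""
    pairs = []
    p = 1
    while p < n:
        k = p
        while k >= 1:
            for j in range(k % p, n - k, 2 * k):
                for i in range(min(k, n - j - k)):
                    # Only compare elements that lie in the same 2p-sized block.
                    if (i + j) // (2 * p) == (i + j + k) // (2 * p):
                        pairs.append((i + j, i + j + k))
            k //= 2
        p *= 2
    return pairs

def network_sort(values):
    """Apply the fixed comparator sequence to one tuple (data-independent control flow)."""
    a = list(values)
    for i, j in batcher_pairs(len(a)):
        if a[i] > a[j]:
            a[i], a[j] = a[j], a[i]
    return a
```

Comparators that share a step touch disjoint indices, so on a GPU each of the ten steps can execute as one round of parallel compare-exchanges per tuple.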
GPU Acceleration of Iterative Physical Optics-based Electromagnetic Simulations
Vivek Venugopalan, Cagatay Tokgoz, UTRC
High fidelity prediction of the link budget between a pair of transmitting and receiving antennas in dense and complex
environments is computationally very intensive at high frequencies. Iterative physical optics (IPO) is a scalable solution for
electromagnetic (EM) simulations with complex geometry. In this paper, an efficient and robust solution is presented to predict
the link budget between antennas in a dense environment. The IPO algorithm is accelerated using Nvidia GPUs providing a
speedup of 3584× for a dense geometry with 141,000 triangles at 2.4 GHz. Two Nvidia GPUs with different numbers of
cores and device memory were targeted for benchmarking the performance of the IPO algorithm. The results indicate that the
GPU implementation of the IPO algorithm is memory bound. Also, the K40c GPU provides only a 2× speedup over the
GTX650M, although it has 7.5× more cores than the GTX650M.
An Energy-Efficient Abstraction for Simultaneous Breadth-First Searches
Adam McLaughlin, Jason Riedy, David A. Bader, Georgia Institute of Technology
Optimized GPU kernels are sufficiently complicated to write that they often are specialized to specific input data, target
architectures, or applications. This paper presents a multi-search abstraction for computing multiple breadth-first searches in
parallel and demonstrates a high-performance, general implementation. Our abstraction removes the burden of orchestrating
graph traversal from the user while providing high performance and low energy usage, an often overlooked component of
algorithm design. Energy consumption has become a first-class hardware design constraint for both massive and embedded
computing platforms. Our abstraction can be applied to such problems as the all-pairs shortest-path problem, community
detection, reachability querying, and others. To map graph traversal efficiently to the GPU, our hybrid implementation chooses
between processing active vertices with a single thread or an entire warp based on vertex outdegree. For a set of twelve varied
graphs, the implementation of our abstraction saves 42% time and 62% energy on average compared to representative
implementations of specific applications from existing literature.
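The multi-search pattern can be illustrated without GPU machinery. The sketch below (plain Python, illustrative names, not the paper's implementation) runs several breadth-first searches in lockstep over one shared graph, advancing each search's frontier level by level; in the paper's hybrid scheme, each active vertex's edge list would be expanded by a single thread or a whole warp depending on its outdegree.

```python
def multi_bfs(adj, sources):
    """Level-synchronous BFS from several sources at once over one adjacency list.

    Returns dist[s][v]: hops from sources[s] to vertex v, or -1 if unreachable.
    """
    n = len(adj)
    dist = [[-1] * n for _ in sources]
    frontiers = []
    for si, s in enumerate(sources):
        dist[si][s] = 0
        frontiers.append([s])
    level = 0
    while any(frontiers):
        level += 1
        next_frontiers = []
        for si, frontier in enumerate(frontiers):
            nxt = []
            for u in frontier:       # in the paper's scheme, a high-outdegree u
                for v in adj[u]:     # gets a warp, a low-outdegree u a single thread
                    if dist[si][v] == -1:
                        dist[si][v] = level
                        nxt.append(v)
            next_frontiers.append(nxt)
        frontiers = next_frontiers
    return dist
```

Running many searches over the same graph amortizes the traversal machinery, which is why the abstraction covers all-pairs shortest paths, reachability queries, and similar batched workloads.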
Accelerating K-Means Clustering with Parallel Implementations and GPU Computing
Janki Bhimani, Miriam Leeser, Ningfang Mi, Northeastern University
K-Means clustering is a popular unsupervised machine learning method which has been used in diverse applications including
image processing, information retrieval, social sciences and weather forecasting. However, clustering is computationally
expensive especially when applied to large datasets. In this paper, we explore accelerating the performance of K-means
clustering using three approaches: 1) shared memory using OpenMP, 2) distributed memory with message passing (MPI), and
3) heterogeneous computing with NVIDIA Graphics Processing Units (GPUs) programmed with CUDA-C. While others have
looked at accelerating K-means clustering, this is the first study that compares these different approaches. In addition, K-means
performance is very sensitive to the initial means chosen. We evaluate different initializations in parallel and chose the best one
to use for the entire algorithm. We evaluate results on a range of images from small (300x300 pixels) to large (1164x1200 pixel)
images. Our results show that all three parallel programming approaches give speedup, with the best results obtained with
OpenMP for smaller images and CUDA-C for larger images. Each of these approaches gives approximately thirty times overall
speedup compared to a sequential implementation of K-means. In addition, our parallel initialization gives an additional 1.5 to
2.5 times speedup over the accelerated parallel versions.
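The initialization strategy can be sketched in a few lines. The following 1-D toy version (illustrative Python, not the authors' OpenMP/MPI/CUDA code) runs Lloyd's iterations from several random initializations and keeps the run with the lowest within-cluster cost; in the paper the candidate initializations are evaluated in parallel rather than serially.

```python
import random

def kmeans(points, means, iters=20):
    """Lloyd's algorithm in 1-D from the given initial means; returns (means, cost)."""
    k = len(means)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - means[c]) ** 2)
            clusters[nearest].append(p)
        # Recompute each mean; keep the old mean if a cluster went empty.
        means = [sum(cl) / len(cl) if cl else means[i] for i, cl in enumerate(clusters)]
    cost = sum(min((p - m) ** 2 for m in means) for p in points)
    return means, cost

def best_of_initializations(points, k, tries=4, seed=0):
    """Try several random initializations (serial here, parallel in the paper)
    and keep the result with the lowest within-cluster cost."""
    rng = random.Random(seed)
    runs = [kmeans(points, rng.sample(points, k)) for _ in range(tries)]
    return min(runs, key=lambda r: r[1])
```

Since K-means can converge to a poor local optimum from an unlucky start, evaluating several starts and discarding all but the best is a cheap way to stabilize the result, and the candidate runs are independent, so they parallelize trivially.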

Thursday, September 17
