2015 IEEE High Performance
Extreme Computing Conference
(HPEC ’15)
Nineteenth Annual HPEC Conference
15 - 17 September 2015
Westin Hotel, Waltham, MA USA
GPU 1
1:00-2:40 in Eden Vale A3
Chair: Dan Campbell / GTRI
Sorting Sixteen Numbers
Ming Ouyang, University of Massachusetts Boston
Sorting is a basic operation in data processing. A
common direction in research is to design fast
sorting methods for millions of numbers. The
focus of this article is to sort 16 numbers. Let a
hextuple be an unordered tuple of 16 numbers.
Although the data may consist of thousands to
millions of hextuples, the task is to sort the
numbers in each hextuple. GPUs have become
powerful coprocessors to CPUs. Sorting
networks, originally meant for hardware
implementation, are suitable for sorting many
hextuples on GPUs. Batcher's sorting network
for 16 numbers has ten parallel steps, whereas
Van Voorhis's network has nine parallel steps.
Software implementations of the former are well-
known and publicly available, whereas the latter
seems to remain on paper. The main results in
this article are implementations of Van Voorhis's
network. After several iterations of improvement,
the final implementation of Van Voorhis's network
is more than ten percent faster than the existing
code for Batcher's network. Lastly, insights
gained in implementing Van Voorhis's network
lead to improved implementation of Batcher's
network. The last result is useful for sorting more
than 16 numbers.
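As a concrete illustration (not the paper's code, which targets GPUs), a sorting network is a fixed sequence of compare-exchange operations. Batcher's odd-even mergesort network for n = 2^k inputs can be generated with the standard iterative construction; for n = 16 it yields 63 comparators arranged in ten parallel steps, the baseline the abstract compares against. Van Voorhis's nine-step network is not reproduced here.

```python
import random

def batcher_network(n):
    """Compare-exchange pairs for Batcher's odd-even mergesort
    on n = 2**k inputs (standard iterative construction)."""
    pairs = []
    p = 1
    while p < n:
        k = p
        while k >= 1:
            for j in range(k % p, n - k, 2 * k):
                for i in range(min(k, n - j - k)):
                    # only compare wires within the same 2p-sized block
                    if (i + j) // (2 * p) == (i + j + k) // (2 * p):
                        pairs.append((i + j, i + j + k))
            k //= 2
        p *= 2
    return pairs

def apply_network(pairs, values):
    """Run the data-independent comparator sequence on one tuple."""
    a = list(values)
    for i, j in pairs:
        if a[i] > a[j]:
            a[i], a[j] = a[j], a[i]
    return a

net = batcher_network(16)
print(len(net))  # 63 comparators for 16 inputs
print(apply_network(net, [5, 3, 9, 1, 7, 0, 8, 2,
                          6, 4, 15, 11, 13, 10, 14, 12]))
```

Because the comparator sequence is the same for every input, one such network can be applied to thousands of hextuples in parallel, which is what makes sorting networks a natural fit for GPUs.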
GPU Acceleration of Iterative Physical Optics-
based Electromagnetic Simulations
Vivek Venugopalan, Cagatay Tokgoz, UTRC
High fidelity prediction of the link budget between
a pair of transmitting and receiving antennas in
dense and complex environments is
computationally very intensive at high
frequencies. Iterative physical optics (IPO) is a
scalable solution for electromagnetic (EM)
simulations with complex geometry. In this paper,
an efficient and robust solution is presented to
predict the link budget between antennas in a
dense environment. The IPO algorithm is
accelerated using Nvidia GPUs, providing a
speedup of 3584× for a dense geometry
with 141,000 triangles at 2.4 GHz. Two Nvidia
GPUs with different numbers of cores and device
memory sizes were targeted for benchmarking the
performance of the IPO algorithm. The results
indicate that the GPU implementation of the IPO
algorithm is memory bound. Notably, the K40c
GPU provides only a 2× speedup over the
GTX650M, although it has 7.5× more cores
than the GTX650M.
An Energy-Efficient Abstraction for Simultaneous
Breadth-First Searches
Adam McLaughlin, Jason Riedy, David A. Bader, Georgia
Institute of Technology
Optimized GPU kernels are sufficiently
complicated to write that they often are
specialized to specific input data, target
architectures, or applications. This paper
presents a multi-search abstraction for computing
multiple breadth-first searches in parallel and
demonstrates a high-performance, general
implementation. Our abstraction removes the
burden of orchestrating graph traversal from the
user while providing high performance and low
energy usage, an often overlooked component of
algorithm design. Energy consumption has
become a first-class hardware design constraint
for both massive and embedded computing
platforms. Our abstraction can be applied to
such problems as the all-pairs shortest-path
problem, community detection, reachability
querying, and others. To map graph traversal
efficiently to the GPU, our hybrid implementation
chooses between processing active vertices with
a single thread or an entire warp based on vertex
outdegree. For a set of twelve varied graphs, the
implementation of our abstraction saves 42%
time and 62% energy on average compared to
representative implementations of specific
applications from existing literature.
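As an illustrative sketch (not the authors' GPU implementation), the multi-search abstraction amounts to running several level-synchronous breadth-first searches in lockstep, each advancing its own frontier one level per sweep; the function names below are hypothetical.

```python
def multi_bfs(adj, sources):
    """Level-synchronous BFS from several sources at once.

    adj: dict mapping each vertex to a list of neighbours
    sources: one root vertex per concurrent search
    Returns dist[s][v] = hops from source s to v (unreached omitted).
    """
    dist = {s: {s: 0} for s in sources}
    # one frontier per search; all searches advance one level per sweep
    frontiers = {s: [s] for s in sources}
    level = 0
    while any(frontiers.values()):
        level += 1
        for s in sources:
            nxt = []
            for u in frontiers[s]:
                for v in adj.get(u, []):
                    if v not in dist[s]:  # first visit settles the distance
                        dist[s][v] = level
                        nxt.append(v)
            frontiers[s] = nxt
    return dist

# small 4-cycle: 0-1, 0-2, 1-3, 2-3
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(multi_bfs(adj, [0, 3]))
```

On a GPU, the paper's hybrid maps each active vertex to either a single thread or a whole warp depending on its outdegree; the sketch above only captures the multi-source traversal pattern, not that scheduling decision.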
Accelerating K-Means Clustering with Parallel
Implementations and GPU computing
Janki Bhimani, Miriam Leeser, Ningfang Mi, Northeastern
University
K-Means clustering is a popular unsupervised
machine learning method that has been used in
diverse applications, including image processing,
information retrieval, the social sciences, and
weather forecasting. However, clustering is
computationally expensive, especially when
applied to large datasets. In this paper, we
explore accelerating the performance of K-means
clustering using three approaches: 1) shared
memory using OpenMP, 2) distributed memory
with message passing (MPI), and 3)
heterogeneous computing with NVIDIA Graphics
Processing Units (GPUs) programmed with
CUDA-C. While others have looked at
accelerating K-means clustering, this is the first
study that compares these different approaches.
In addition, K-means performance is very
sensitive to the initial means chosen. We
evaluate different initializations in parallel and
choose the best one to use for the entire
algorithm. We evaluate results on images
ranging from small (300x300 pixels) to large
(1164x1200 pixels). Our results show that
all three parallel programming approaches yield
speedups, with the best results obtained with
OpenMP for smaller images and CUDA-C for
larger images. Each of these approaches gives
an overall speedup of approximately thirty times
over a sequential implementation of K-
means. In addition, our parallel initialization
gives a further 1.5 to 2.5 times speedup over
the accelerated parallel versions.
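As an illustrative sketch (not the authors' OpenMP/MPI/CUDA code), the initialization idea can be mimicked sequentially: run one Lloyd's iteration from several candidate seedings, keep the seeding with the lowest within-cluster cost, then iterate it to convergence. The example uses 1-D points for brevity, and all function names are hypothetical.

```python
import random

def assign(points, means):
    """Label each point with the index of its nearest mean."""
    return [min(range(len(means)),
                key=lambda j: (p - means[j]) ** 2) for p in points]

def update(points, labels, k):
    """Recompute each mean as the centroid of its cluster."""
    means = []
    for j in range(k):
        cluster = [p for p, l in zip(points, labels) if l == j]
        means.append(sum(cluster) / len(cluster)
                     if cluster else random.choice(points))
    return means

def cost(points, means, labels):
    """Sum of squared distances from each point to its assigned mean."""
    return sum((p - means[l]) ** 2 for p, l in zip(points, labels))

def kmeans(points, k, trials=5, iters=20, seed=0):
    rng = random.Random(seed)
    # try several initializations; keep the cheapest after one Lloyd step
    best = None
    for _ in range(trials):
        means = rng.sample(points, k)
        means = update(points, assign(points, means), k)
        c = cost(points, means, assign(points, means))
        if best is None or c < best[0]:
            best = (c, means)
    means = best[1]
    for _ in range(iters):  # refine the chosen start to convergence
        means = update(points, assign(points, means), k)
    return means

data = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
print(sorted(kmeans(data, 2)))  # two means, near 1.0 and 5.0
```

In the paper the candidate initializations are evaluated in parallel rather than in a loop, which is why the initialization step contributes its own 1.5 to 2.5 times speedup on top of the parallel clustering itself.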
Thursday, September 17