2015 IEEE High Performance Extreme Computing Conference (HPEC ‘15) Nineteenth Annual HPEC Conference 15 - 17 September 2015 Westin Hotel, Waltham, MA USA
GPU 1 1:00-2:40 in Eden Vale A3 Chair: Dan Campbell / GTRI Sorting Sixteen Numbers Ming Ouyang, University of Massachusetts Boston Sorting is a basic operation in data processing.  A common direction in research is to design fast sorting methods for millions of numbers.  The focus of this article is to sort 16 numbers.  Let a hextuple be an unordered tuple of 16 numbers.  Although the data may consist of thousands to millions of hextuples, the task is to sort the numbers in each hextuple.  GPUs have become powerful coprocessors to CPUs.  Sorting networks, originally meant for hardware implementation, are suitable for sorting many hextuples on GPUs.  Batcher's sorting network for 16 numbers has ten parallel steps, whereas Van Voorhis's network has nine parallel steps.  Software implementations of the former are well-known and publicly available, whereas the latter seems to remain on paper.  The main results in this article are implementations of Van Voorhis's network.  After several iterations of improvement, the final implementation of Van Voorhis's network is more than ten percent faster than the existing code for Batcher's network.  Lastly, insights gained in implementing Van Voorhis's network lead to improved implementation of Batcher's network.  The last result is useful for sorting more than 16 numbers. GPU Acceleration of Iterative Physical Optics-based Electromagnetic Simulations Vivek Venugopalan, Cagatay Tokgoz, UTRC High fidelity prediction of the link budget between a pair of transmitting and receiving antennas in dense and complex environments is computationally very intensive at high frequencies. Iterative physical optics (IPO) is a scalable solution for electromagnetic (EM) simulations with complex geometry. In this paper, an efficient and robust solution is presented to predict the link budget between antennas in a dense environment. The IPO algorithm is accelerated using Nvidia GPUs providing a speedup of 3584$\times$ for a dense geometry with 141000 triangles at 2.4 GHz. Two Nvidia GPUs with different number of cores and device memory were targeted for benchmarking the performance of the IPO algorithm. The results indicate that the GPU implementation of the IPO algorithm is memory bound. Also, the K40c GPU only provides 2$\times$ speedup over the GTX650M, although it has 7.5$\times$ more cores than the GTX650M.   An Energy-Efficient Abstraction for Simultaneous Breadth-First Searches Adam McLaughlin Jason Riedy David A. Bader, Georgia Institute of Technology Optimized GPU kernels are sufficiently complicated to write that they often are specialized to specific input data, target architectures, or applications. This paper presents a multi-search abstraction for computing multiple breadth-first searches in parallel and demonstrates a high-performance, general implementation. Our abstraction removes the burden of orchestrating graph traversal from the user while providing high performance and low energy usage, an often overlooked component of algorithm design. Energy consumption has become a first-class hardware design constraint for both massive and embedded computing platforms.  Our abstraction can be applied to such problems as the all-pairs shortest-path problem, community detection, reachability querying, and others. To map graph traversal efficiently to the GPU, our hybrid implementation chooses between processing active vertices with a single thread or an entire warp based on vertex outdegree. For a set of twelve varied graphs, the implementation of our abstraction saves 42% time and 62% energy on average compared to representative implementations of specific applications from existing literature.   Accelerating K-Means Clustering with Parallel Implementations and GPU computing Janki Bhimani, Miriam Leeser, Ningfang Mi, Northeastern University K-Means clustering is a popular unsupervised machine learning method which has been used in diverse applications including image processing, information retrieval, social sciences and weather forecasting. However, clustering is computationally expensive especially when applied to large datasets.  In this paper, we explore accelerating the performance of K-means clustering using three approaches:  1) shared memory using  OpenMP, 2) distributed memory with message passing (MPI), and 3) heterogeneous computing with NVIDIA Graphics Processing Units (GPUs) programmed with CUDA-C.  While others have looked at accelerating K-means clustering, this is the first study that compares these different approaches.  In addition, K-means performance is very sensitive to the initial means chosen.  We evaluate different initializations in parallel and chose the best one to use for the entire algorithm.  We evaluate results on a range of images from small (300x300 pixels) to large (1164x1200 pixel) images.  Our results show that all three parallel programming approaches give speedup, with the best results obtained with OpenMP for smaller images and CUDA-C for larger images.  Each of these approaches gives approximately thirty times overall speedup compared to a sequential implementation of K-means.  In addition, our parallel initialization gives an additional 1.5 to 2.5 times speedup over the accelerated parallel versions.
Thursday, September 17
2015 IEEE High Performance Extreme Computing Conference (HPEC ‘15) Nineteenth Annual HPEC Conference 15 - 17 September 2015 Westin Hotel, Waltham, MA USA
GPU 1 1:00-2:40 in Eden Vale A3 Chair: Dan Campbell / GTRI Sorting Sixteen Numbers Ming Ouyang, University of Massachusetts Boston Sorting is a basic operation in data processing.  A common direction in research is to design fast sorting methods for millions of numbers.  The focus of this article is to sort 16 numbers.  Let a hextuple be an unordered tuple of 16 numbers.  Although the data may consist of thousands to millions of hextuples, the task is to sort the numbers in each hextuple.  GPUs have become powerful coprocessors to CPUs.  Sorting networks, originally meant for hardware implementation, are suitable for sorting many hextuples on GPUs.  Batcher's sorting network for 16 numbers has ten parallel steps, whereas Van Voorhis's network has nine parallel steps.  Software implementations of the former are well- known and publicly available, whereas the latter seems to remain on paper.  The main results in this article are implementations of Van Voorhis's network.  After several iterations of improvement, the final implementation of Van Voorhis's network is more than ten percent faster than the existing code for Batcher's network.  Lastly, insights gained in implementing Van Voorhis's network lead to improved implementation of Batcher's network.  The last result is useful for sorting more than 16 numbers. GPU Acceleration of Iterative Physical Optics- based Electromagnetic Simulations Vivek Venugopalan, Cagatay Tokgoz, UTRC High fidelity prediction of the link budget between a pair of transmitting and receiving antennas in dense and complex environments is computationally very intensive at high frequencies. Iterative physical optics (IPO) is a scalable solution for electromagnetic (EM) simulations with complex geometry. In this paper, an efficient and robust solution is presented to predict the link budget between antennas in a dense environment. The IPO algorithm is accelerated using Nvidia GPUs providing a speedup of 3584$\times$ for a dense geometry with 141000 triangles at 2.4 GHz. Two Nvidia GPUs with different number of cores and device memory were targeted for benchmarking the performance of the IPO algorithm. The results indicate that the GPU implementation of the IPO algorithm is memory bound. Also, the K40c GPU only provides 2$\times$ speedup over the GTX650M, although it has 7.5$\times$ more cores than the GTX650M.   An Energy-Efficient Abstraction for Simultaneous Breadth-First Searches Adam McLaughlin Jason Riedy David A. Bader, Georgia Institute of Technology Optimized GPU kernels are sufficiently complicated to write that they often are specialized to specific input data, target architectures, or applications. This paper presents a multi-search abstraction for computing multiple breadth-first searches in parallel and demonstrates a high-performance, general implementation. Our abstraction removes the burden of orchestrating graph traversal from the user while providing high performance and low energy usage, an often overlooked component of algorithm design. Energy consumption has become a first-class hardware design constraint for both massive and embedded computing platforms.  Our abstraction can be applied to such problems as the all-pairs shortest-path problem, community detection, reachability querying, and others. To map graph traversal efficiently to the GPU, our hybrid implementation chooses between processing active vertices with a single thread or an entire warp based on vertex outdegree. For a set of twelve varied graphs, the implementation of our abstraction saves 42% time and 62% energy on average compared to representative implementations of specific applications from existing literature.   Accelerating K-Means Clustering with Parallel Implementations and GPU computing Janki Bhimani, Miriam Leeser, Ningfang Mi, Northeastern University K-Means clustering is a popular unsupervised machine learning method which has been used in diverse applications including image processing, information retrieval, social sciences and weather forecasting. However, clustering is computationally expensive especially when applied to large datasets.  In this paper, we explore accelerating the performance of K-means clustering using three approaches:  1) shared memory using  OpenMP, 2) distributed memory with message passing (MPI), and 3) heterogeneous computing with NVIDIA Graphics Processing Units (GPUs) programmed with CUDA-C.  While others have looked at accelerating K-means clustering, this is the first study that compares these different approaches.  In addition, K-means performance is very sensitive to the initial means chosen.  We evaluate different initializations in parallel and chose the best one to use for the entire algorithm.  We evaluate results on a range of images from small (300x300 pixels) to large (1164x1200 pixel) images.  Our results show that all three parallel programming approaches give speedup, with the best results obtained with OpenMP for smaller images and CUDA-C for larger images.  Each of these approaches gives approximately thirty times overall speedup compared to a sequential implementation of K- means.  In addition, our parallel initialization gives an additional 1.5 to 2.5 times speedup over the accelerated parallel versions.
Thursday, September 17