2015 IEEE High Performance Extreme Computing Conference (HPEC ‘15) Nineteenth Annual HPEC Conference 15 - 17 September 2015 Westin Hotel, Waltham, MA USA
GPU 2, 3:00-4:40 in Eden Vale A3. Chair: Hahn Kim / Lockheed Martin

Algorithm Flattening: Complete branch elimination for GPU requires a paradigm shift from CPU thinking
Lucas Vespa, Alexander Bauman, Jenny Wells, University of Illinois Springfield

Graphics processing units (GPUs) have inadvertently become supercomputers in their own right, to the benefit of applications outside of graphics. Acceleration of multiple orders of magnitude has been achieved in scientific computing, co-processing, and the like. However, the Single Instruction Multiple Data (SIMD) design of GPUs is extremely sensitive to thread divergence, so much so that for many applications the performance improvement from GPUs is all but eviscerated by it. This problem has driven general-purpose GPU computing toward finding "appropriate" applications to accelerate, rather than accelerating applications with a genuine need for performance improvement. Thread divergence is generally caused by branches. Previous research has addressed reducing branches, but none of this work aims to eliminate branches entirely, because the methods required for complete branch elimination are a drastic de-optimization for the CPU. We present a CPU de-optimization that completely removes all branches, replacing them with a reduced equation. For GPU-accelerated applications, this eliminates thread divergence, substantially decreases execution time, yields deterministic performance, and enables GPU implementation of algorithms that previously could not fully utilize GPU capability, achieving a substantial speedup of already GPU-accelerated algorithms and applications.

GPU Implementation of Reverse Coordinate Conversion for Proteins
Mahsa Bayati, Jaydeep P. Bardhan, Miriam Leeser, Northeastern Univ.

We describe the first parallel algorithm to obtain Cartesian coordinates of a molecular structure from internal coordinates such as bond lengths and angles. This algorithm is important for problems such as protein engineering and fitting protein structures to experimental data, because different stages of molecular simulation can require one structure representation or the other, and many such conversions are required during the course of a calculation such as molecular dynamics. Proteins contain thousands to millions of atoms, and their positions can be represented in one of two ways: Cartesian or internal coordinates (bond lengths, angles, etc.). In molecular dynamics and in modeling proteins in different conformational states, it is often necessary to transform one coordinate system into the other. In addition, since proteins change over time, any computation must be repeated over successive time frames, increasing the computational load. To lessen this load we have applied parallel techniques to the protein conversion problem. Converting Cartesian coordinates to internal ones is easily parallelized, and we have previously reported a GPU implementation. The reverse direction, used for tasks such as protein engineering and fitting structures to experimental data, is much more challenging to parallelize [1]: the computation has inherent data dependencies because bond lengths and angles are relative to neighboring atoms, so existing implementations walk over a protein structure sequentially. This paper presents the first parallel implementation of internal-to-Cartesian conversion, in which substructures of the protein backbone are converted into their own local Cartesian coordinate spaces and then combined using a reduction technique to find global Cartesian coordinates. We observe an order of magnitude speedup using parallel processing on a GPU.

Bisection and Twisted SVD on GPU
Lu He, Yan Luo, University of Massachusetts Lowell; Rui Liu, Wake Forest University; Hengyong Yu, Yu Cao, University of Massachusetts Lowell; Xuzhou Chen, Fitchburg State University; Seung Woo Son, University of Massachusetts Lowell

Singular value decomposition (SVD) is one of the most important factorizations in matrix computation. However, computing the SVD is still time-consuming, especially when the dimension of the matrix exceeds tens of thousands. In this paper, we present a high-performance approach called "Bisection and Twisted" (BT) for solving the bidiagonal SVD. As modern general-purpose GPUs have shown extreme computational advantages in parallel computing, we implement the BT algorithm on single and multiple GPUs. With our carefully designed GPU kernels, the BT algorithm is about 10 times faster than the MKL divide-and-conquer routine DBDSDC on an 8-core 2.53 GHz CPU, and 36 times faster than the CULA QR routine DBDSQR on the same GPUs. Additionally, the BT algorithm is able to compute the SVD of matrices of size 1 million by 1 million with only two GPUs; to the best of our knowledge, no other implementation has achieved such a scale.

GPU accelerated geometric multigrid method: comparison with preconditioned conjugate gradient
Iulian Stroia, Lucian Itu, Cosmin Nita, Laszlo Lazar, Constantin Suciu, Siemens
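The branch-elimination idea described in the Algorithm Flattening abstract above can be illustrated with a small sketch: a data-dependent if/else is replaced by a single arithmetic expression evaluated uniformly across all elements, so SIMD threads never diverge. This is a hypothetical NumPy example of the general predication technique, not the authors' actual method or code:

```python
import numpy as np

def branched(x):
    # Divergent version: a per-element branch, the pattern that
    # causes SIMD thread divergence on a GPU.
    out = np.empty_like(x)
    for i, v in enumerate(x):
        if v > 0:
            out[i] = v * 2.0
        else:
            out[i] = -v
    return out

def flattened(x):
    # Branch-free version: both outcomes are computed and blended by
    # a 0/1 mask, so every element follows the same instruction path.
    mask = (x > 0).astype(x.dtype)
    return mask * (x * 2.0) + (1.0 - mask) * (-x)
```

Both functions compute the same result; the flattened form trades a little redundant arithmetic for a uniform instruction stream, which is the favorable trade on SIMD hardware.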
Full-Chain Benchmarking for Open Architecture Airborne ISR Systems: A Case Study for GMTI Radar Applications
Matthias Beebe, Matthew Alexander, Paul Foley, Denise Galejs, Stephen Mooney, Iulian Popescu, Kevin Rottman, Meryl Stav, MIT Lincoln Laboratory

As the airborne ISR application space evolves, the quantities of data acquired by remote sensing systems such as radar, electro-optical, and infrared sensors are growing larger, and advanced algorithms are imposing more challenging computational requirements for real-time processing. While the difficulty of processing sensor data in real time is the topic of extensive research, rapidly shifting technology and growing application complexity have led to pronounced system lifecycle challenges, including the constant threat of technology obsolescence and unsustainable maintenance costs. One way for government programs to address this reality economically is to shift the ISR system acquisition strategy to facilitate timely, cost-effective insertion and upgrading of technologies through an open architecture (OA) approach to system design standards for application ready processors (ARPs). OA design leverages industry-standard hardware and middleware, engaging a broader development community and lowering barriers to third-party application development. For this approach to succeed without sacrificing functional capability or real-time performance, effective benchmarks are necessary to ensure that an ARP system can meet the mission constraints and performance requirements of real-world applications. This work investigates the measurement of real-time performance of commodity high-speed processing solutions and middleware for airborne systems using OA composite benchmarks, i.e., benchmarks that characterize the computational performance of the system as a whole while also validating OA principles. For ISR systems, processing performance must be balanced against size, weight, and power (SWaP) constraints that often necessitate application-specific configurations (e.g., mapping and scheduling) for system-level optimization. Using ground moving target indicator (GMTI) radar as an example, we demonstrate the use of an open architecture benchmarking framework with industry-standard middleware to assess the suitability of candidate systems for ISR applications under constrained SWaP.
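The composite-benchmark notion in the abstract above, timing the full processing chain end-to-end rather than individual kernels, can be sketched minimally as follows. The stage names and the harness are hypothetical illustrations, not the Lincoln Laboratory framework:

```python
import time

def stage_filter(data):
    # Placeholder for a front-end processing stage (e.g. filtering).
    return [x * 0.5 for x in data]

def stage_detect(data):
    # Placeholder for a back-end stage (e.g. thresholding/detection).
    return [x for x in data if x > 0.1]

def composite_benchmark(stages, data, repeats=3):
    """Time the whole chain end-to-end, taking the best of several
    runs, so inter-stage data movement is included in the measurement."""
    best = float("inf")
    out = data
    for _ in range(repeats):
        start = time.perf_counter()
        out = data
        for stage in stages:
            out = stage(out)
        best = min(best, time.perf_counter() - start)
    return out, best
```

Timing the chain as a unit, rather than summing isolated kernel times, captures the data-movement and scheduling costs that dominate on SWaP-constrained systems.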
Thursday, September 17