2015 IEEE High Performance Extreme Computing Conference (HPEC ‘15)
Nineteenth Annual HPEC Conference
15 - 17 September 2015
Westin Hotel, Waltham, MA USA
Wednesday September 16

Manycore Computing 1
1:00-2:40 in Eden Vale A1-A2
Chair: Patrick Dreher / MIT

[Best Paper Finalist] Boosting Irregular Array Reductions through In-lined Block-ordering on Fast Processors
Jan Ciesko, Sergi Mateo, Xavier Teruel, Vicenc Beltran, Xavier Martorell, Jesus Labarta, Barcelona Supercomputing Center
Array-type reductions represent a frequently occurring algorithmic pattern in many scientific applications. A special case occurs when array elements are accessed in an irregular, often random manner, making their concurrent and scalable execution difficult. In this work we present a new approach consisting of language and runtime support that targets popular parallel programming models such as OpenMP. Its runtime support implements Privatization with In-lined, Block-ordered Reductions (PIBOR), a new approach that trades processor cycles for increased locality and bandwidth efficiency in such algorithms. A reference implementation in OmpSs, a task-parallel programming model, shows promising results on current multi-core systems.

[Best Paper Finalist] MAGMA Embedded: Towards a Dense Linear Algebra Library for Energy Efficient Extreme Computing
Azzam Haidar, Stanimire Tomov, Piotr Luszczek, Jack Dongarra, University of Tennessee Knoxville
Embedded computing, not only in large systems like drones and hybrid vehicles, but also in small portable devices like smart phones and watches, is becoming more extreme to meet ever-increasing demands for extended and improved functionality. This, combined with the typical constraints of low power consumption and small size, makes the design of numerical libraries for embedded systems challenging. In this paper, we present the design and implementation of embedded-system-aware algorithms that target these challenges in the area of dense linear algebra. We consider the fundamental problems of solving linear systems of equations and least squares problems, using the LU, QR, and Cholesky factorizations, and illustrate our results, both in terms of performance and energy efficiency, on the Jetson TK1 development kit. We developed performance optimizations for both small and large problems. In contrast to the corresponding LAPACK algorithms, the new designs target the use of many-cores, readily available now even in mobile devices like the Jetson TK1, which features 192 CUDA cores. The implementations presented will form the core of a MAGMA Embedded library, to be released as part of the MAGMA libraries.
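The irregular reduction pattern targeted by the first paper above (PIBOR) is easy to state in code: concurrent, randomly indexed accumulations into a shared array. A minimal sketch in C using OpenMP's array-section reduction clause follows; the sizes and random index distribution are illustrative assumptions, and the clause's full per-thread privatization is the baseline behavior that the paper's block-ordered runtime aims to improve on.

```c
/* Irregular array reduction: histogram-style accumulation through an
 * index array.  The scattered writes to "bins" make locality poor,
 * which is the access pattern the PIBOR work addresses.
 * Requires OpenMP 4.5+ for array-section reductions. */
#include <stdio.h>
#include <stdlib.h>

#define N_UPDATES 10000000L
#define N_BINS    1000000L

int main(void) {
    double *bins = calloc(N_BINS, sizeof(double));
    int *idx = malloc(N_UPDATES * sizeof(int));
    for (long i = 0; i < N_UPDATES; i++)
        idx[i] = rand() % N_BINS;            /* irregular, random indices */

    /* Each thread gets a private copy of the whole array; the runtime
     * combines the copies when the loop ends. */
    #pragma omp parallel for reduction(+ : bins[0:N_BINS])
    for (long i = 0; i < N_UPDATES; i++)
        bins[idx[i]] += 1.0;

    printf("bins[0] = %.1f\n", bins[0]);
    free(bins);
    free(idx);
    return 0;
}
```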
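The MAGMA Embedded abstract above positions its designs against the corresponding LAPACK algorithms. For reference, a baseline LU-based linear solve through the standard LAPACKE C interface looks like the sketch below; the 3x3 system is an illustrative assumption, and MAGMA Embedded provides equivalents of this functionality tuned for many-core devices such as the Jetson TK1.

```c
/* Baseline LAPACK usage: solve A x = b via LU factorization with
 * partial pivoting (dgesv = dgetrf factorization + dgetrs solves). */
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    double A[9] = { 4.0, 1.0, 0.0,       /* 3x3 matrix, row-major */
                    1.0, 3.0, 1.0,
                    0.0, 1.0, 2.0 };
    double b[3] = { 1.0, 2.0, 3.0 };     /* right-hand side; overwritten by x */
    lapack_int ipiv[3];                  /* pivot indices from the LU step */

    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 3, 1, A, 3, ipiv, b, 1);
    if (info != 0) {
        fprintf(stderr, "dgesv failed: info=%d\n", (int)info);
        return 1;
    }
    printf("x = (%f, %f, %f)\n", b[0], b[1], b[2]);
    return 0;
}
```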
[Best Paper Finalist] Optimizing Space Time Adaptive Processing Through Accelerating Memory-bounded Operations
Tze Meng Low, Qi Guo, Franz Franchetti, Carnegie Mellon University
Space-Time Adaptive Processing (STAP) is a technique for processing signals from multiple antenna elements over multiple time periods for target detection. As STAP algorithms are typically run on airborne platforms, they need to be both high-performance and energy-efficient. Due to the high rate of processing required, many existing algorithms focus on reducing the dimensionality of the data, or on exploiting structure in the underlying mathematical formulation, in order to reduce the total number of floating-point operations (FLOPs) and, consequently, the time for computation. While such algorithms target the FLOPs-intensive operations within the STAP algorithm, a significant portion of the compute time for most STAP algorithms is actually spent in low-FLOPs, memory-bounded operations.
In this paper, we address the computation of these memory-bounded operations within the STAP algorithm using a 3D stacked Logic-in-Memory system. The imminent arrival of 3D stacked memory makes high memory bandwidth available, which opens up a new and orthogonal dimension for optimizing STAP algorithms. We show that more than an 11x improvement in time and a 77x improvement in energy efficiency can be expected when a 3D stack is used together with memory-side accelerators to target the memory-bounded operations within STAP.

[Best Student Paper Finalist] A Near-Real-Time, Parallel and Distributed Adaptive Object Detection and Re-training Framework based on AdaBoost Algorithm
Munther Abualkibash, Ausif Mahmood, Saeid Moslehpour, University of Bridgeport
Object detection (e.g., face detection) using supervised learning often requires extensive training, resulting in long execution times. If the system requires retraining to accommodate a missed detection, waiting several hours or even days before the system is ready may not be acceptable in practical implementations. This paper presents a generalized object detection framework such that the system can efficiently adapt to misclassified data and be retrained within a few minutes. The methodology developed here is based on the popular AdaBoost algorithm for object detection. To reduce the learning time, we develop a highly efficient, parallel, and distributed AdaBoost algorithm that achieves a training execution time of only 1.4 seconds per feature on 25 workstations. Further, we incorporate this parallel object detection algorithm into an adaptive framework in which a much smaller, optimized training subset is used to yield high detection rates while further reducing the retraining execution time. We demonstrate the usefulness of our adaptive framework on face and car detection.
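The STAP abstract above does not enumerate its memory-bounded operations, but a classic example of the kind in radar pipelines is the corner turn, a matrix transpose between processing stages: it performs no floating-point work at all, so its runtime is set entirely by memory bandwidth. A minimal sketch, with illustrative dimensions not taken from the paper:

```c
/* Corner turn (matrix transpose) over complex radar samples: zero
 * FLOPs, pure data movement, so memory bandwidth is the only limit.
 * The strided stores are what make it expensive on cache hierarchies. */
#include <complex.h>
#include <stdlib.h>

static void corner_turn(const float complex *in, float complex *out,
                        size_t nrows, size_t ncols) {
    for (size_t r = 0; r < nrows; r++)
        for (size_t c = 0; c < ncols; c++)
            out[c * nrows + r] = in[r * ncols + c];   /* strided store */
}

int main(void) {
    size_t nrows = 512, ncols = 2048;                 /* illustrative sizes */
    float complex *in  = calloc(nrows * ncols, sizeof *in);
    float complex *out = calloc(nrows * ncols, sizeof *out);
    corner_turn(in, out, nrows, ncols);
    free(in);
    free(out);
    return 0;
}
```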
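For context on the training loop that the AdaBoost paper above parallelizes: one boosting round scores a weak classifier by its weighted error and then re-weights the samples so that mistakes count more in the next round. A minimal sketch of the standard update follows; the hard-coded labels and predictions are illustrative assumptions, not data from the paper.

```c
/* One AdaBoost round: compute the weighted error of a weak classifier,
 * derive its vote weight alpha, then boost the weights of misclassified
 * samples and renormalize.  Compile with -lm. */
#include <math.h>
#include <stdio.h>

#define N 8

int main(void) {
    int label[N] = { 1, 1, -1, 1, -1, -1, 1, -1 };   /* ground truth */
    int pred[N]  = { 1, 1, -1, -1, -1, -1, 1, 1 };   /* weak classifier output */
    double w[N];
    for (int i = 0; i < N; i++) w[i] = 1.0 / N;      /* uniform start */

    double err = 0.0;                                /* weighted error */
    for (int i = 0; i < N; i++)
        if (pred[i] != label[i]) err += w[i];

    double alpha = 0.5 * log((1.0 - err) / err);     /* classifier vote weight */

    double z = 0.0;                                  /* re-weight, then normalize */
    for (int i = 0; i < N; i++) {
        w[i] *= exp(-alpha * label[i] * pred[i]);    /* mistakes grow, hits shrink */
        z += w[i];
    }
    for (int i = 0; i < N; i++) w[i] /= z;

    printf("error=%.3f alpha=%.3f w[3]=%.3f\n", err, alpha, w[3]);
    return 0;
}
```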
Implementing Image Processing Algorithms for the Epiphany Many-Core Coprocessor with Threaded MPI
James Ross, U.S. Army Research Laboratory; David Richie, Brown Deer Technology; Song Park, U.S. Army Research Laboratory; Dale Shires, U.S. Army Research Laboratory
The Adapteva Epiphany MIMD architecture is a scalable 2D array of RISC cores with minimal un-core functionality, connected by a fast 2D mesh Network-on-Chip (NoC). Each mesh node contains a RISC CPU core, 32 KB of shared local memory, a mesh network interface, and a dual-channel DMA engine. The 16-core Epiphany III coprocessor has been integrated into the Parallella minicomputer platform, where the RISC array is supported by a dual-core ARM CPU and asymmetric shared-memory access to off-chip global memory. Peak single-precision performance for the Epiphany III is 19.2 GFLOPS with an energy efficiency of 32.3 GFLOPS per watt. The raw performance of the Epiphany III is relatively low compared to modern high-performance CPUs and GPUs; however, the Epiphany architecture provides greater energy efficiency and is designed to be highly scalable. The published road map specifies a scale-out of the existing architecture to exceed 1,000 cores in the near future. Within this context it is a competitive processor technology comparable to other emerging architectures. Processors based on this architecture exhibit good energy efficiency and scalability via the 2D mesh network, but require a suitable programming model to fully exploit the architecture.
Key to performance with the Epiphany architecture is data re-use, which requires precise control of inter-core communication since the architecture does not provide a hardware cache at any level. The cores can access off-chip mapped memory only with a significant performance penalty in both latency and bandwidth relative to accessing neighboring core memory.
In previous work we demonstrated an efficient parallel programming model for the Epiphany architecture based on the Message Passing Interface (MPI) standard. Using MPI exploits the similarities between the Epiphany architecture and a conventional parallel distributed cluster, and our approach enables MPI code to execute on the RISC array processor with little modification and achieve high performance. The MPI programming model is a better choice for Epiphany than APIs designed for SMP processors, such as OpenMP and OpenCL, since the latter lack good semantics for controlling inter-core data movement, which is critical to achieving high performance for anything but trivially parallel applications on this processor. Threaded MPI was developed to provide an extremely lightweight implementation of MPI appropriate for threads executing within the restricted context of the Epiphany RISC cores. Threaded MPI is distinguished from conventional MPI implementations by two critical differences, driven by the fact that the device must be accessed as a coprocessor and that each core executes threads within a highly constrained set of resources. As a result, the cores cannot support a full process image or program in the conventional sense, so the conventional MPI model of associating MPI processes with concurrently executing programs is not possible. Instead, coprocessor offload semantics must be used to launch concurrent threads that then employ conventional MPI semantics for inter-thread communication. Threaded MPI has exhibited the highest performance reported to date for non-trivially parallel algorithms using a standard programming model for the Epiphany architecture.
We apply threaded MPI programming to image processing kernels including a 2D Fast Fourier Transform (FFT) with a high-pass filter for edge detection, local operators for Gaussian blur and a Sobel filter, Canny edge detection, and Harris corner detection. Conventional MPI parallelization is employed in the implementations, demonstrating the applicability of this parallel programming model to the Epiphany architecture. Benchmark performance is analyzed to understand the relative performance of computation and communication. The impact of the results on performance projections is discussed for RISC arrays on the current Epiphany roadmap scaled to thousands of cores.
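The local operators mentioned above (Gaussian blur, Sobel) map naturally onto MPI-style domain decomposition: each core owns an image tile and exchanges one-row halos with its neighbors before applying the stencil. A minimal sketch using standard MPI calls follows, with an illustrative tile size; per the abstract, threaded MPI on Epiphany exposes these same conventional MPI semantics inside an offloaded kernel.

```c
/* Row-wise domain decomposition for a 3x3 stencil: each rank swaps its
 * boundary rows with the ranks above and below, then filters only the
 * interior rows it owns.  This is the communication pattern behind
 * stencil-style local operators such as blur and Sobel. */
#include <mpi.h>
#include <string.h>

#define W    64              /* illustrative tile width */
#define ROWS 16              /* rows owned per rank */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float tile[ROWS + 2][W];          /* rows 1..ROWS owned, 0 and ROWS+1 are halos */
    memset(tile, 0, sizeof(tile));
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Exchange halo rows; MPI_PROC_NULL turns edge transfers into no-ops. */
    MPI_Sendrecv(tile[1],        W, MPI_FLOAT, up,   0,
                 tile[ROWS + 1], W, MPI_FLOAT, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(tile[ROWS],     W, MPI_FLOAT, down, 1,
                 tile[0],        W, MPI_FLOAT, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* ... apply the 3x3 stencil to rows 1..ROWS here ... */

    MPI_Finalize();
    return 0;
}
```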