2015 IEEE High Performance
Extreme Computing Conference
(HPEC ’15)
Nineteenth Annual HPEC Conference
15 - 17 September 2015
Westin Hotel, Waltham, MA USA
Manycore Computing 2
3:00-4:40 in Eden Vale A1 - A2
Chair: David Cousins / BBN
Heterogeneous Work-stealing across CPU and DSP
Cores
Vivek Kumar, Alina Sbîrlea, Zoran Budimlic, Deepak
Majeti, Vivek Sarkar, Rice University
Due to increasing power constraints and ever-higher performance demands, many vendors have shifted their focus from designing high-performance compute nodes using powerful multi-core general-purpose CPUs to nodes containing a smaller number of general-purpose CPUs aided by a larger number of more power-efficient special-purpose processing units, such as GPUs, FPGAs, or DSPs. While offering a better power-to-performance ratio, such heterogeneous systems are unfortunately notoriously hard to program, forcing users to resort to low-level direct programming of the special-purpose processors and to manually manage data transfer and synchronization between the parts of the program running on general-purpose CPUs and special-purpose processors. In this paper, we present
HC-K2H, a programming model and runtime
system for the Texas Instruments Keystone II
Hawking platform, consisting of 4 ARM CPUs and
8 TI DSP processors. This System-on-a-Chip (SoC) offers a high rate of floating-point operations per second. We present the design and implementation of a hybrid
programming model and work-stealing runtime that
allows tasks to be created and executed on both the ARM and DSP cores, and enables the seamless execution and synchronization of tasks regardless of whether they run on the ARM or the DSP.
The design of our programming model and runtime
is based on the Habanero C programming system.
We evaluate our implementation using task-parallel
benchmarks on a Hawking board, and demonstrate
excellent scaling compared to sequential
implementations on a single ARM processor.
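The work-stealing idea behind such runtimes can be sketched in a few lines: each worker keeps its own double-ended task queue, pops tasks from its own end, and, when idle, steals from the opposite end of another worker's queue. The Python sketch below is a hypothetical illustration only (the actual HC-K2H runtime is built on Habanero C and the Keystone II hardware); the class and function names are invented.

```python
# Hypothetical work-stealing sketch (not the HC-K2H runtime): each worker
# owns a deque, pops tasks from its own tail, and steals from the head of
# another worker's deque when its own is empty.
import threading
from collections import deque

class WorkStealingPool:
    def __init__(self, num_workers=4):
        self.num_workers = num_workers
        self.deques = [deque() for _ in range(num_workers)]
        self.locks = [threading.Lock() for _ in range(num_workers)]
        self.results = []
        self.results_lock = threading.Lock()
        self.total_tasks = 0
        self.completed = 0

    def submit(self, worker_id, fn, *args):
        """Push a task onto one worker's deque (all tasks created up front)."""
        with self.locks[worker_id]:
            self.deques[worker_id].append((fn, args))
        self.total_tasks += 1

    def _next_task(self, wid):
        # Prefer local work: pop from our own tail (LIFO, cache-friendly).
        with self.locks[wid]:
            if self.deques[wid]:
                return self.deques[wid].pop()
        # Otherwise steal from another worker's head (FIFO, oldest task).
        for victim in range(self.num_workers):
            if victim == wid:
                continue
            with self.locks[victim]:
                if self.deques[victim]:
                    return self.deques[victim].popleft()
        return None

    def run(self):
        def worker(wid):
            while True:
                task = self._next_task(wid)
                if task is None:
                    with self.results_lock:
                        if self.completed == self.total_tasks:
                            return  # all work done, exit
                    continue
                fn, args = task
                value = fn(*args)
                with self.results_lock:
                    self.results.append(value)
                    self.completed += 1

        threads = [threading.Thread(target=worker, args=(w,))
                   for w in range(self.num_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return self.results

# All tasks start on worker 0; the other workers obtain work only by stealing.
pool = WorkStealingPool(num_workers=4)
for i in range(100):
    pool.submit(0, lambda x: x * x, i)
squares = pool.run()
print(sum(squares))  # same total as the sequential loop
```

Popping locally from the tail keeps recently created (cache-warm) tasks on their creating worker, while thieves take the oldest tasks from the head; this LIFO/FIFO split is the standard work-stealing design choice.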
Achieving Low Latency, Reduced Memory
Footprint and Low Power Consumption with Data
Streaming
Olivier Bockenbach, ContextVision, Murtaza Ali, Texas
Instruments, Ian Wainwright, High Performance
Consulting, Mark Nadeski, Texas Instruments
In addition to its patient-friendly properties, ultrasound imaging has become attractive because of its ability to provide images in real time. Low-latency implementations allow for fast scanning and a quick, precise diagnosis using medical imaging. This study presents a framework aimed at stream-based processing of images, with a twofold goal. The first goal is to keep the latency as low
as possible by processing the data as soon as
there are enough samples available. The second
goal is to reduce the required processing power
per image. To achieve these goals, the framework
allows several images to be processed
simultaneously, albeit in sequence, which makes it possible to exploit periods when the processor is not fully loaded. This study shows how the latency is kept to a strict minimum while the required processing power is reduced compared to a traditional image-based implementation. The application runs a temporal adaptive filter on a hardware platform based on a Digital Signal Processor (DSP).
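The sample-level streaming idea, namely starting to process data as soon as enough samples are available rather than buffering a complete image, can be sketched as below. This is a hypothetical Python illustration, not ContextVision's framework; the simple recursive temporal average merely stands in for the paper's temporal adaptive filter, and all names are invented.

```python
# Illustrative sketch: process ultrasound scanlines in arrival order and emit
# each filtered scanline immediately, so latency is one scanline, not one
# full frame. A recursive temporal average stands in for the real filter.

def stream_filter(scanline_stream, alpha=0.5):
    """Consume (frame_id, line_id, samples) tuples in arrival order and
    yield filtered scanlines right away, keeping only per-line state."""
    state = {}  # line_id -> previously filtered samples for that line
    for frame_id, line_id, samples in scanline_stream:
        prev = state.get(line_id)
        if prev is None:
            filtered = list(samples)  # first frame: nothing to blend with
        else:
            # Blend the new samples with the previous frame's filtered line.
            filtered = [alpha * s + (1 - alpha) * p
                        for s, p in zip(samples, prev)]
        state[line_id] = filtered
        yield frame_id, line_id, filtered

# Two 3-line "frames" arriving line by line.
stream = [(f, l, [float(f + l)] * 4) for f in range(2) for l in range(3)]
out = list(stream_filter(stream))
print(len(out))  # one filtered scanline per input scanline
```

Because only one scanline of state per line position is kept, several frames can be in flight at once while the memory footprint stays far below a whole-image buffer, which is the footprint and latency argument the abstract makes.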
Embedded Second-Order Cone Programming with
Radar Applications
Paul Mountcastle, Tom Henretty, Aale Naqvi, Richard
Lethin, Reservoir Labs
Second-order cone programming (SOCP) is
required for the solution of under-determined
systems of linear equations with complex
coefficients, subject to the minimization of a
convex objective function. This type of
computational problem appears in compressed
radar sensing, where the goal is to reconstruct a
sparse image in a projective space whose
dimension is higher than the number of complex
measurements. In order to enforce sparsity in the
final rectified radar image, the sum of moduli of a
complex vector, called the L1-norm, must be
minimized. This norm differs from what is ordinarily
encountered in compressed sensing for digital
photographic data and video, in that the convex
optimization that must be performed involves an
SOCP rather than a linear program. We illustrate
the role of this type of optimization in radar signal
processing by means of examples. The examples
point to a significant generalization that
encompasses and unifies a wide class of radar
signal processing algorithms that can be
implemented in software by means of SOCP
solvers. Finally, we show how modern SOCP
solvers are optimized for efficient solution of these
problems in the context of embedded signal
processing on small autonomous platforms.
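Concretely, the sparse reconstruction problem the abstract describes can be posed as an SOCP by introducing one epigraph variable per complex coefficient; this is the standard reformulation, written here for an under-determined system Ax = b:

```latex
\begin{aligned}
\min_{x \in \mathbb{C}^n,\; t \in \mathbb{R}^n} \quad & \sum_{i=1}^{n} t_i \\
\text{subject to} \quad & A x = b, \\
& \sqrt{\operatorname{Re}(x_i)^2 + \operatorname{Im}(x_i)^2} \le t_i,
\qquad i = 1, \dots, n.
\end{aligned}
```

At the optimum each t_i equals |x_i|, so the objective is exactly the L1-norm (the sum of moduli). Each modulus bound is a second-order (Lorentz) cone constraint on the triple (Re x_i, Im x_i, t_i); it is these cone constraints that make the complex-valued problem an SOCP rather than the linear program that suffices for real-valued compressed sensing of photographic data.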
Efficient Parallelization of Path Planning Workload
on Single-chip Shared-memory Multicores
Masab Ahmad, Omer Khan, University of Connecticut
Path planning problems arise in many applications where the objective is to find the shortest path from a given source to a destination. In this paper, we compare programming languages in the context of parallel workload analysis. We implement and characterize parallel versions of path planning algorithms, such as Dijkstra's algorithm, across the C/C++ and
Python languages. Programming language
comparisons are done for a single-socket real
machine setup over shared memory to analyze
fine-grained scalability and efficiency. Our results show that the right parallelization strategy for path planning yields scalability for C/C++ codes executing on a commercial multicore CPU. However, several shortcomings in Python's support for parallelism must be accounted for by HPC researchers.
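As a point of reference, sequential Dijkstra in Python takes only a few lines with the standard-library heapq; the function and graph below are invented for illustration and are not the paper's implementation. A thread-parallel version of this code would gain little in CPython, since the Global Interpreter Lock serializes interpreter execution, which is one concrete instance of the shortcomings noted above.

```python
# Sequential Dijkstra sketch using Python's standard-library binary heap.
import heapq

def dijkstra(adj, source):
    """adj: {node: [(neighbor, weight), ...]}.
    Returns shortest distances from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry for an already-improved node
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2), ("D", 6)],
    "C": [("D", 3)],
}
print(dijkstra(graph, "A"))  # shortest A->D path costs 6, via B and C
```

The shared priority queue and the fine-grained distance updates are exactly the structures that make shared-memory parallelization of this algorithm non-trivial in any language.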
Monte Carlo Simulations on Intel Xeon Phi: Offload and Native Mode
Bryar M. Shareef, Elise de Doncker, Western Michigan
University
In high performance computing, Monte Carlo
methods are widely used to solve problems in
various areas of computational physics, finance,
mathematics, electrical engineering and many
other fields. We present Monte Carlo methods for
the Intel Xeon Phi coprocessor, to compute
integrals for applications in high energy physics
and in stochastic geometry. The Intel Xeon Phi is
based on a Many Integrated Core (MIC)
architecture to gain extreme performance. We use
two modes, "offload" and "native", to implement the
simulations. In offload mode, the main program
resides on the host system and supporting
functions are executed on the MIC; in native mode,
the program is fully executed on the MIC card. We
compare the parallel performance of our
applications running on Intel Xeon Phi, in terms of
time and speedup, with a sequential execution on
the CPU. In addition, the applications are implemented in both single and double precision.
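A minimal host-side sketch of such an estimator, assuming NumPy and using an invented function name, runs the same Monte Carlo kernel in float32 and float64 to mirror the single/double precision comparison; the true value of the integral of x^2 over [0, 1] is 1/3.

```python
# Plain Monte Carlo estimate of the integral of x^2 over [0, 1], run in both
# single and double precision. This is a host-side analogue for illustration,
# not the paper's Xeon Phi offload or native kernels.
import numpy as np

def mc_integral(n, dtype, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.random(n, dtype=dtype)    # uniform samples in [0, 1)
    return (x * x).mean(dtype=dtype)  # sample mean approximates the integral

n = 1_000_000
est32 = mc_integral(n, np.float32)   # single precision
est64 = mc_integral(n, np.float64)   # double precision
print(abs(est64 - 1/3) < 1e-2)  # True: estimate is close to 1/3
```

With a million samples the standard error is on the order of 3e-4, so both precisions land well within 1e-2 of 1/3; differences between the two appear only at much tighter tolerances, which is the kind of precision trade-off the paper's single/double designs expose.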
Wednesday September 16