2018 IEEE High Performance Extreme Computing Conference (HPEC '18)
Twenty-second Annual HPEC Conference
25-27 September 2018, Westin Hotel, Waltham, MA USA
Wednesday, September 26, 2018
ASIC & FPGA 1, 1:00-2:40 in Eden Vale A1/A2. Chair: David Cousins / BBN

Evaluating an OpenCL FPGA Platform for HPC: a Case Study with the HACCmk Kernel
Zheming Jin (ANL)*; Hal Finkel (ANL)
The field-programmable gate array (FPGA) is a promising choice as a heterogeneous computing component for energy-aware applications in high-performance computing. Emerging high-level synthesis tools such as the Intel OpenCL SDK offer a streamlined design flow that makes FPGAs accessible to scientists and researchers. Focusing on the HACCmk kernel routine as a case study, we explore the kernel optimization space and its performance implications. We describe the resource usage, performance, and performance per watt of the kernel implementations in OpenCL. Using directives for accelerator programming, the performance per watt on an Intel Arria 10-based FPGA platform achieves a 2.5X improvement over an Intel Xeon 16-core CPU and a 2.1X improvement over an Nvidia K80 GPU, while trading off 50% of raw performance.
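The core of HACCmk is a short-range n-body force loop. For context, the following is a minimal OpenCL C sketch of such a kernel, assuming one work-item per particle; the coefficient and parameter names (ma0-ma5, fsrrmax2, mp_rsm2, fcoeff) follow the public CORAL HACCmk benchmark, and the unroll factor is one illustrative point in the optimization space, not the authors' tuned design.

```c
/* Minimal sketch of a HACCmk-style short-range force kernel in OpenCL C.
   Coefficients follow the public CORAL HACCmk source; the unroll factor
   and one-work-item-per-particle partitioning are illustrative only. */
__kernel void haccmk(const int n, const float fsrrmax2, const float mp_rsm2,
                     const float fcoeff,
                     __global const float *restrict xx,
                     __global const float *restrict yy,
                     __global const float *restrict zz,
                     __global const float *restrict mass,
                     __global float *restrict vx,
                     __global float *restrict vy,
                     __global float *restrict vz)
{
    const float ma0 = 0.269327f, ma1 = -0.0750978f, ma2 = 0.0114808f,
                ma3 = -0.00109313f, ma4 = 0.0000605491f, ma5 = -0.00000147177f;
    int i = get_global_id(0);          /* one work-item per particle i */
    float xi = 0.0f, yi = 0.0f, zi = 0.0f;
    float xxi = xx[i], yyi = yy[i], zzi = zz[i];

    #pragma unroll 8                   /* one axis of the optimization space */
    for (int j = 0; j < n; j++) {
        float dx = xx[j] - xxi, dy = yy[j] - yyi, dz = zz[j] - zzi;
        float r2 = dx * dx + dy * dy + dz * dz;
        /* particles beyond the cutoff radius contribute zero force */
        float m = (r2 < fsrrmax2) ? mass[j] : 0.0f;
        /* (r2 + rsm^2)^(-3/2) minus a polynomial long-range correction */
        float r = rsqrt(r2 + mp_rsm2);
        float f = r * r * r
            - (ma0 + r2 * (ma1 + r2 * (ma2 + r2 * (ma3 + r2 * (ma4 + r2 * ma5)))));
        f = (r2 > 0.0f) ? m * f : 0.0f;
        xi += f * dx; yi += f * dy; zi += f * dz;
    }
    vx[i] += xi * fcoeff;
    vy[i] += yi * fcoeff;
    vz[i] += zi * fcoeff;
}
```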
Exploring Parallel Bitonic Sort on a Migratory Thread Architecture
Kaushik Velusamy (University of Maryland, Baltimore County); Thomas Rolinger (University of Maryland, College Park)*; Janice McMahon (Emu Technologies); Tyler Simon (University of Maryland, Baltimore County)
Large-scale, data-intensive applications pose challenges to systems with a traditional memory hierarchy due to their unstructured data sources and irregular memory access patterns. In response, systems that employ migratory threads have been proposed to mitigate memory access bottlenecks and reduce energy consumption. One such system is the Emu Chick, which migrates a small program context to the data being referenced in a memory access. Sorting an unordered list of elements is a critical kernel for countless applications, such as graph processing and tensor decomposition. As such applications can be considered highly suitable for a migratory thread architecture, it is imperative to understand the performance of sorting algorithms on these systems. In this paper, we implement parallel bitonic sort and target the Emu Chick system. We investigate the performance of an explicit comparison-based approach as well as a sorting network implementation. Furthermore, we explore two different data layouts for the parallel bitonic sorting network, namely cyclic and blocked. From the results of our performance study, we find that while thread migrations can dictate the overall performance of an application, the cost of thread creation and management can outgrow the cost of thread migration.
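To make the kernel concrete, here is a minimal serial C sketch of the bitonic sorting network. The stage structure is the standard one; the paper's Emu-specific concerns (where threads are spawned, and whether the array uses a cyclic or blocked layout) are noted only in comments and are not reproduced here.

```c
/* Minimal serial sketch of the bitonic sorting network in C99.
   On the Emu Chick the paper parallelizes the innermost loop across
   migratory threads and varies the data layout (cyclic vs. blocked);
   here the array is a single blocked buffer and the loop is serial.
   n must be a power of two. */
#include <stddef.h>

static void compare_exchange(long *a, size_t i, size_t j, int ascending)
{
    /* swap so that a[i] <= a[j] when ascending, a[i] >= a[j] otherwise */
    if ((a[i] > a[j]) == ascending) {
        long t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

void bitonic_sort(long *a, size_t n)
{
    /* k: length of the bitonic sequences merged at this stage */
    for (size_t k = 2; k <= n; k <<= 1) {
        /* j: comparison distance within the merge step */
        for (size_t j = k >> 1; j > 0; j >>= 1) {
            /* every (i, i^j) pair is independent: this loop is the
               natural site for thread spawns on a migratory system */
            for (size_t i = 0; i < n; i++) {
                size_t partner = i ^ j;
                if (partner > i)
                    compare_exchange(a, i, partner, (i & k) == 0);
            }
        }
    }
}
```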
Unlocking Performance-Programmability by Penetrating the Intel FPGA OpenCL Toolflow
Ahmed Sanaullah (Boston University)*; Martin Herbordt (Boston University)
Improved support for OpenCL has been an important step towards the mainstream adoption of FPGAs as compute resources. Current research has shown, however, that the programmability gained from OpenCL typically comes at a significant cost in performance, with the latter falling below that of hand-coded HDL, GPU, and even CPU designs. This can primarily be attributed to 1) constrained deployment opportunities, 2) long testing time frames, and 3) limitations of the Board Support Package (BSP). We address these challenges by penetrating the toolflow and utilizing OpenCL-generated HDL (OpenCL-HDL), which is created as an initial step during the full compilation. OpenCL-HDL can be used as an intermediate stage in the design process to obtain better resource/latency estimates and perform RTL simulations. It can also be carved out and used as a building block for an existing HDL system. In this work, we present the process of generating, isolating, and re-interfacing OpenCL-HDL. We first propose a kernel template that reliably exploits parallelism opportunities and ensures all compute pipelines are implemented as a single HDL module. We then outline the process of identifying this module among the thousands of lines of compiler-generated code. Finally, we categorize the different types of interfaces and present methods for connecting or bypassing them in order to support integration into an existing HDL shell. We evaluate our approach using a number of benchmarks from the Rodinia suite and Molecular Dynamics simulations. Our OpenCL-HDL implementations of all benchmarks show an average of 37x, 4.8x, and 3.5x speedup over existing FPGA/OpenCL, GPU, and FPGA/Verilog designs, respectively. We demonstrate that OpenCL-HDL is able to deliver performance comparable to hand-coded HDL with significantly less development effort and with competitive resource overhead.
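The "kernel template" is easiest to picture with an example. Below is a minimal, hypothetical single-work-item OpenCL kernel in that spirit: the whole datapath is kept inside one fully unrolled, inlined helper function so that the compiler can emit it as a single coherent block of HDL. The FIR computation and all names are illustrative, not the paper's template; max_global_work_dim is an Intel FPGA SDK for OpenCL kernel attribute.

```c
/* Hedged sketch of a single-work-item kernel template: keeping the
   datapath in one fully unrolled, inlined function encourages the
   Intel OpenCL compiler to emit it as one HDL module that can later
   be identified and carved out. The 8-tap FIR is a toy example. */
#define TAPS 8

/* the "compute pipeline": one inlined function = one candidate module */
inline float datapath(const float window[TAPS], const float coeff[TAPS])
{
    float acc = 0.0f;
    #pragma unroll            /* full unroll: TAPS parallel multiplies */
    for (int t = 0; t < TAPS; t++)
        acc += window[t] * coeff[t];
    return acc;
}

__attribute__((max_global_work_dim(0)))  /* single work-item kernel */
__kernel void fir(__global const float *restrict in,
                  __global float *restrict out,
                  __constant const float *restrict coeff,
                  const int n)
{
    float window[TAPS] = {0};
    float c[TAPS];
    #pragma unroll
    for (int t = 0; t < TAPS; t++)
        c[t] = coeff[t];

    for (int i = 0; i < n; i++) {       /* streaming loop, ideally II=1 */
        #pragma unroll                  /* shift register holds the window */
        for (int t = TAPS - 1; t > 0; t--)
            window[t] = window[t - 1];
        window[0] = in[i];
        out[i] = datapath(window, c);
    }
}
```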
Application Aware Tuning of Reconfigurable Multi-Layer Perceptron Architectures
Ahmed Sanaullah (Boston University)*; Chen Yang (Boston University); Yuri Alexeev (Argonne National Laboratory); Kazutomo Yoshii (Argonne National Laboratory); Martin Herbordt (Boston University)
Production FPGA implementations of Multi-Layer Perceptron (MLP) inference typically address growing performance demands by (i) storing neuron weights on-chip to address the memory bound, e.g., Microsoft Brainwave, and (ii) generating the largest possible arrays of multipliers and accumulators to address the compute bound. This approach of maximizing device utilization, irrespective of the application model, can actually result in higher latencies due to the tight coupling of different function modules: sub-optimal or generic parameter sizing of a given component, chosen to reduce its latency, can force an increase in the complexity and latency of multiple other modules in the design. In real-time applications, this can result in the overall computation failing to meet its deadline. In our work, we begin by creating a testbed for low-latency MLP inference, which we then use to explore the application-aware optimization space for compute-bound MLP inference engines. The optimization process begins by identifying modules on the critical path and their connectivity. We then use this information to determine key parameters and their ideal values. We also automate hardware generation using OpenCL to ensure standard optimizations are applied. We find that correct parameter sizing can reduce latency by 20% on average. For the MNIST, Poker, and ECP-Candle benchmarks, we implement inference models on the Arria10X115 FPGA and achieve an average speedup of 1.47x over the NVIDIA Tesla P100 GPU.
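As an illustration of the kind of knob being sized, the following is a minimal OpenCL C sketch of a compute-bound fully connected layer with on-chip activations. The layer dimensions and the unroll factor (the number of parallel multiply-accumulates per output neuron) are illustrative stand-ins for the per-application parameters the paper derives, not the paper's testbed.

```c
/* Minimal sketch of one fully connected MLP layer in the style of a
   compute-bound FPGA inference engine. N_IN, N_OUT, and the unroll
   factor are illustrative; the unroll factor (16 here) is the kind of
   per-application parameter sized to minimize latency. */
#define N_IN  64
#define N_OUT 64

__kernel void dense_layer(__global const float *restrict x,
                          __global const float *restrict w,   /* N_OUT x N_IN */
                          __global const float *restrict bias,
                          __global float *restrict y)
{
    /* stage input activations into on-chip (private) memory */
    float xl[N_IN];
    for (int i = 0; i < N_IN; i++)
        xl[i] = x[i];

    for (int o = 0; o < N_OUT; o++) {
        float acc = bias[o];
        #pragma unroll 16      /* 16 parallel MACs per output: the tunable knob */
        for (int i = 0; i < N_IN; i++)
            acc += w[o * N_IN + i] * xl[i];
        y[o] = fmax(acc, 0.0f);    /* ReLU activation */
    }
}
```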