2018 IEEE High Performance
Extreme Computing Conference
(HPEC ’18)
Twenty-second Annual HPEC Conference
25 - 27 September 2018
Westin Hotel, Waltham, MA USA
Evaluating an OpenCL FPGA Platform for HPC: a Case Study with the HACCmk Kernel
Zheming Jin (ANL)*; Hal Finkel (ANL)
Field-programmable gate arrays (FPGAs) are a promising choice as heterogeneous computing components for energy-aware
applications in high-performance computing. Emerging high-level synthesis tools such as the Intel OpenCL SDK offer a
streamlined design flow that makes FPGAs accessible to scientists and researchers. Using the HACCmk kernel routine
as a case study, we explore the kernel optimization space and its performance implications. We describe the resource
usage, performance, and performance per watt of the kernel implementations in OpenCL. Using directives for accelerator
programming, the performance per watt on an Intel Arria 10-based FPGA platform achieves a 2.5X improvement over an
Intel Xeon 16-core CPU and a 2.1X improvement over an Nvidia K80 GPU, while trading off 50% of performance.
Exploring Parallel Bitonic Sort on a Migratory Thread Architecture
Kaushik Velusamy (University of Maryland, Baltimore County); Thomas Rolinger (University of Maryland, College Park)*;
Janice McMahon (Emu Technologies); Tyler Simon (University of Maryland, Baltimore County)
Large scale, data-intensive applications pose challenges to systems with a traditional memory hierarchy due to their
unstructured data sources and irregular memory access patterns. In response, systems that employ migratory threads have
been proposed to mitigate memory access bottlenecks as well as reduce energy consumption. One such system is the Emu
Chick, which migrates a small program context to the data being referenced in a memory access. Sorting an unordered list of
elements is a critical kernel for countless applications, such as graph processing and tensor decomposition. As such
applications can be considered highly suitable for a migratory thread architecture, it is imperative to understand the
performance of sorting algorithms on these systems. In this paper, we implement parallel bitonic sort and target the Emu Chick
system. We investigate the performance of an explicit comparison-based approach as well as a sorting network
implementation. Furthermore, we explore two different data layouts for the parallel bitonic sorting network, namely cyclic and
blocked. From the results of our performance study, we find that while thread migrations can dictate the overall performance of
an application, the cost of thread creation and management can outgrow the cost of thread migration.
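For readers unfamiliar with the kernel under study, the textbook bitonic sorting network can be sketched as follows (this is the standard algorithm only, not the Emu Chick implementation or either of the cyclic/blocked data layouts the paper evaluates):

```python
def bitonic_sort(data):
    """Sort a sequence whose length is a power of two using a bitonic network."""
    a = list(data)
    n = len(a)
    assert n & (n - 1) == 0, "network size must be a power of two"
    k = 2
    while k <= n:            # stage: merge bitonic sequences of length k
        j = k // 2
        while j >= 1:        # substage: compare-exchange at distance j
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # swap when the pair violates the direction for this block
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

Every compare-exchange within a substage touches a disjoint pair of elements, so all pairs can proceed in parallel; which physical memory node each element resides on (and hence how many thread migrations the compare-exchanges trigger) is exactly what the cyclic versus blocked layouts change.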
Unlocking Performance-Programmability by Penetrating the Intel FPGA OpenCL Toolflow
Ahmed Sanaullah (Boston University)*; Martin Herbordt (Boston University)
Improved support for OpenCL has been an important step towards the mainstream adoption of FPGAs as compute resources.
Current research has shown, however, that programmability derived from use of OpenCL typically comes at a significant
expense of performance, with the latter falling below that of hand-coded HDL, GPU, and even CPU designs. This can primarily
be attributed to 1) constrained deployment opportunities, 2) high testing time-frames, and 3) limitations of the Board Support
Package (BSP). We address these challenges by penetrating the toolflow and utilizing OpenCL-generated HDL (OpenCL-HDL),
which is created as an initial step during the full compilation. OpenCL-HDL can be used as an intermediate stage in the
design process to get better resource/latency estimates and perform RTL simulations. It can also be carved out and used as a
building block for an existing HDL system. In this work, we present the process of generating, isolating, and re-interfacing
OpenCL-HDL. We first propose a kernel template which reliably exploits parallelism opportunities and ensures all compute
pipelines are implemented as a single HDL module. We then outline the process of identifying this module from the thousands
of lines of compiler generated code. Finally, we categorize the different types of interfaces and present methods for
connecting/bypassing them in order to support integration into an existing HDL shell. We evaluate our approach using a
number of benchmarks from the Rodinia suite and Molecular Dynamics simulations. Our OpenCL-HDL implementations of all
benchmarks show an average of 37x, 4.8x, and 3.5x speedup over existing FPGA/OpenCL, GPU, and FPGA/Verilog designs,
respectively. We demonstrate that OpenCL-HDL is able to deliver hand-coded HDL-like performance with significantly less
development effort and with competitive resource overhead.
Application Aware Tuning of Reconfigurable Multi-Layer Perceptron Architectures
Ahmed Sanaullah (Boston University)*; Chen Yang (Boston University); Yuri Alexeev (Argonne National Laboratory);
Kazutomo Yoshii (Argonne National Laboratory); Martin Herbordt (Boston University)
Production FPGA implementations of Multi-Layer Perceptron (MLP) inference typically address the growing performance
demands by (i) storing neuron weights on-chip to address the memory bound, e.g., Microsoft Brainwave, and (ii) by generating
the largest possible arrays of multipliers and accumulators to address the compute bound. This approach of maximizing
device utilization, irrespective of application model, can actually result in higher latencies due to the tight coupling of different
function modules. Sub-optimal/generic parameter sizing of a given component, to reduce its latency, can force an increase in
complexity and latency of multiple other modules in the design. In real-time applications, this can result in the overall
computation failing to make the deadline. In our work we begin by creating a testbed for low-latency MLP inference, which we
then use to explore the application-aware optimization space for compute-bound MLP inference engines. The optimization
process begins by identifying modules in the critical path and their connectivity. We then use this information to determine key
parameters and their ideal values. Also, we automate hardware generation using OpenCL to ensure standard optimizations
are applied. We find that correct parameter sizing can reduce latency by 20% on average. For the MNIST, Poker, and
ECP-Candle benchmarks, we implement inference models using the Arria10X115 FPGA and achieve an average speedup of 1.47x
over the NVIDIA Tesla P100 GPU.
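For context, the per-layer compute pattern that the abstract's multiplier/accumulator arrays implement is a dense matrix-vector multiply-accumulate followed by an activation. A minimal NumPy sketch (layer sizes and activation choice are illustrative assumptions, not the paper's models):

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Inference through fully connected layers with ReLU activations.

    weights[i] has shape (out_i, in_i); the hardware arrays in the abstract
    map to the W @ x multiply-accumulate in each layer.
    """
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(W @ x + b, 0.0)   # MAC array + ReLU per hidden layer
    return weights[-1] @ x + biases[-1]  # final layer left linear (logits)
```

In an FPGA engine of the kind described, each `W @ x` maps onto a fixed array of multipliers and accumulators, which is why the sizing of that array relative to the application's actual layer dimensions drives the end-to-end latency.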
Wednesday, September 26, 2018
ASIC & FPGA 1
1:00-2:40 in Eden Vale A1/A2
Chair: David Cousins / BBN