2019 IEEE High Performance
Extreme Computing Conference
(HPEC '19)
Twenty-third Annual HPEC Conference
24 - 26 September 2019
Westin Hotel, Waltham, MA USA
Thursday, September 26, 2019
FPGA
1:00-2:40 in Eden Vale C3
Chair: Karen Gettings / MIT-LL
Artificial Neural Network and Accelerator Co-design using Evolutionary Algorithms
Philip Colangelo (Intel), Oren Segal (Hofstra Univ.), Alex Speicher (Hofstra Univ.), Martin Margala (UMass Lowell)
Multilayer feed-forward Artificial Neural Networks (ANNs) are universal function approximators capable of modeling measurable functions to
any desired degree of accuracy. In practice, designing efficient neural network architectures requires significant effort and expertise.
Further, designing efficient neural network architectures that fit optimally on hardware for the benefit of acceleration adds yet another degree of
complexity. In this paper, we use Evolutionary Cell Aided Design (ECAD), a framework capable of searching the design spaces for ANN
structures and reconfigurable hardware to find solutions based on a set of constraints and fitness functions. A modular, scalable machine learning accelerator based on a 2D systolic array, built for an Arria 10 GX 1150 FPGA device using OpenCL, enables results to be tested and deployed on real hardware. Along with the hardware, a software model of the architecture was developed to speed up the evolutionary process. We present results from the ECAD framework showing the effect that various optimization objectives, including accuracy, images per second, effective giga-operations per second, and latency, have on both ANN and hardware configurations. Through this work we show that each objective can yield a unique best-performing solution. This work lays the foundation for finding machine-learning-based solutions for a wide range of applications with different system constraints.
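The evolutionary search over joint ANN/hardware configurations that the abstract describes can be illustrated with a minimal Python sketch. This is not the ECAD framework itself: the genome layout, fitness function, and mutation scheme below are hypothetical stand-ins for the real objectives (accuracy, images per second, GOPS, latency).

```python
import random

def random_config(rng):
    # Hypothetical genome: ANN layer widths plus systolic-array dimensions.
    return {
        "layers": [rng.choice([32, 64, 128]) for _ in range(rng.randint(1, 4))],
        "pe_rows": rng.choice([4, 8, 16]),
        "pe_cols": rng.choice([4, 8, 16]),
    }

def fitness(cfg):
    # Toy stand-in for the real fitness functions: reward model capacity,
    # penalize work that exceeds the processing-element array.
    capacity = sum(cfg["layers"])
    pes = cfg["pe_rows"] * cfg["pe_cols"]
    work = sum(a * b for a, b in zip(cfg["layers"], cfg["layers"][1:])) or cfg["layers"][0]
    return capacity - work / pes

def mutate(cfg, rng):
    # Perturb either one layer width or the array shape.
    child = {k: (v[:] if isinstance(v, list) else v) for k, v in cfg.items()}
    if rng.random() < 0.5:
        i = rng.randrange(len(child["layers"]))
        child["layers"][i] = rng.choice([32, 64, 128])
    else:
        child["pe_rows"] = rng.choice([4, 8, 16])
    return child

def evolve(generations=20, pop_size=8, seed=0):
    # Standard generational loop: keep the fitter half, refill with mutants.
    rng = random.Random(seed)
    pop = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        pop = survivors + [mutate(rng.choice(survivors), rng) for _ in survivors]
    return max(pop, key=fitness)
```

In the real framework the fitness evaluation would run the software model of the accelerator (or the hardware itself) rather than an analytic formula.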
IP Cores for Graph Kernels on FPGAs
Sanmukh R. Kuppannagari, Rachit Rajat, Rajgopal Kannan (USC), Aravind Dasu (Intel), Viktor K. Prasanna (USC)
Graphs are a powerful abstraction for representing networked data in many real-world applications. The need for performing large scale graph
analytics has led to the widespread adoption of dedicated hardware accelerators such as FPGAs. In this work, we develop IP cores
for several key graph kernels. Our IP cores use graph processing over partitions (GPOP) programming paradigm to perform computations over
graph partitions. Partitioning the input graph into non-overlapping partitions improves on-chip data reuse. Additional optimizations to exploit
intra- and inter-partition parallelism and to reduce external memory accesses are also discussed. We generate FPGA designs for general
graph algorithms with various vertex attributes and update propagation functions, such as Sparse Matrix Vector Multiplication (SpMV),
PageRank (PR), Single Source Shortest Path (SSSP), and Weakly Connected Component (WCC). We target a platform consisting of large
external DDR4 memory to store the graph data and Intel Stratix FPGA to accelerate the processing. Experimental results show that our
accelerators sustain a high throughput of up to 2250, 2300, 3378, and 2178 Million Traversed Edges Per Second (MTEPS) for SpMV, PR,
SSSP, and WCC, respectively. Compared with several highly optimized multi-core designs, our FPGA framework achieves up to 20.5× speedup for SpMV, 16.4× speedup for PR, 3.5× speedup for SSSP, and 35.1× speedup for WCC; compared with two state-of-the-art FPGA frameworks, our designs demonstrate up to 5.3× speedup for SpMV, 1.64× speedup for PR, and 1.8× speedup for WCC. We develop a performance model for our GPOP paradigm. We then perform performance
predictions of our designs with HBM2 instead of DDR to store the graph. We further discuss extensions to our optimizations to improve the
throughput.
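The GPOP idea of computing over non-overlapping partitions can be sketched in software. This is a simplified single-threaded analogue, not the authors' IP cores: it only illustrates how bucketing edges by destination partition confines each gather phase to one vertex range, which models the on-chip data reuse the abstract describes, using PageRank as the example kernel.

```python
def partition_vertices(num_vertices, num_parts):
    # Contiguous, non-overlapping vertex partitions (GPOP-style).
    size = -(-num_vertices // num_parts)  # ceiling division
    return [range(p * size, min((p + 1) * size, num_vertices))
            for p in range(num_parts)]

def pagerank_gpop(edges, num_vertices, num_parts=2, d=0.85, iters=20):
    out_deg = [0] * num_vertices
    for u, _ in edges:
        out_deg[u] += 1
    # Pre-bucket edges by destination partition so each gather phase
    # touches only one partition's slice of the rank vector.
    parts = partition_vertices(num_vertices, num_parts)
    buckets = [[] for _ in parts]
    for u, v in edges:
        for i, p in enumerate(parts):
            if v in p:
                buckets[i].append((u, v))
                break
    rank = [1.0 / num_vertices] * num_vertices
    for _ in range(iters):
        new = [(1 - d) / num_vertices] * num_vertices
        for bucket in buckets:  # gather phase, one partition at a time
            for u, v in bucket:
                new[v] += d * rank[u] / out_deg[u]
        rank = new
    return rank
```

Dangling vertices (no outgoing edges) are not handled here; the hardware kernels additionally exploit intra- and inter-partition parallelism that this sequential sketch omits.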
FPGA-Accelerated Spreading for Global Placement
Shounak Dhar (Univ. Texas Austin), Love Singhal (Intel), Mahesh A. Iyer (Intel), David Z. Pan (Univ. Texas Austin)
Placement takes a large part of the runtime in an Electronic Design Automation design implementation flow. In modern industrial and academic
physical design implementation tools, global placement consumes a significant part of the overall placement runtime. Many of these global placers decompose the placement problem into two main parts: numerical optimization and spreading. In this paper, we propose a new and
massively parallel spreading algorithm and also accelerate a part of this algorithm on FPGA. Our algorithm produces placements with
comparable quality when integrated into a state-of-the-art academic placer. We formulate the spreading problem as a system of fluid flows
across reservoirs and mathematically prove that this formulation produces flows without cycles when solved as a continuous-time system. We
also propose a flow correction algorithm to make the flows monotonic, reduce total cell displacement and remove cycles which may arise
during the discretization process. Our new flow correction algorithm has a better time complexity for cycle removal than previous algorithms for
finding cycles in a generic graph. When compared to our previously published linear programming based spreading algorithm, our new fluid-
flow based multi-threaded spreading algorithm is 3.44x faster, and the corresponding FPGA-accelerated version is 5.15x faster.
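As a loose illustration of the reservoir/flow intuition (not the authors' algorithm, which works on a 2-D placement region and includes flow correction), here is a toy 1-D spreading step in which overloaded bins push overflow toward emptier neighbors:

```python
def spread(util, capacity, iters=100):
    """Toy 1-D spreading: each over-capacity bin pushes half of its
    overflow to its less-full neighbors, loosely analogous to flow
    between connected fluid reservoirs."""
    util = list(util)
    n = len(util)
    for _ in range(iters):
        flows = [0.0] * n
        for i in range(n):
            excess = util[i] - capacity
            if excess <= 0:
                continue
            # Send flow only "downhill", so no cyclic flows arise in 1-D.
            nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n and util[j] < util[i]]
            for j in nbrs:
                flows[i] -= excess / (2 * len(nbrs))
                flows[j] += excess / (2 * len(nbrs))
        if all(f == 0.0 for f in flows):
            break  # no bin over capacity: spreading has converged
        util = [u + f for u, f in zip(util, flows)]
    return util
```

Total utilization is conserved because every unit of flow leaving a bin enters a neighbor; the paper's flow-correction step addresses the cycles and extra displacement that can appear when the analogous continuous-time system is discretized in 2-D.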
An FPGA Decision Tree Classifier to Supervise Communication SoC
Abdelrahman Elkanishy (NMSU), Derrick Rivera (NMSU), Abdel-Hameed A. Badawy (NMSU, LANL), Paul M. Furth (NMSU), Z.M. Saifullah
(NMSU), and Christopher P. Michael (Sandia)
Wireless communication protocols are used in all smart devices and systems. This work is part of a proposed supervisory circuit that classifies
the operation of a communication SoC, in particular, a Bluetooth (BT) SoC, at a low sampling frequency by monitoring the RF output power and
input supply current. In essence, the goal is to inexpensively fabricate an RF envelope detector, power supply current monitor, and classifier on
a low-cost, low-frequency integrated circuit. When the supervisory circuit detects abnormal behavior, it can shut off power to the BT chip. We
extract simple descriptive features from the input and output power signals. Then, we train a machine learning (ML) model to classify the
different BT operation modes, such as advertising and transmit/receive modes. In this work, we implemented the ML classifier and feature
extraction on an FPGA with 100% matching with the corresponding MATLAB code. In the experimental setup, which included a function
generator and an on-board ADC, errors in the FPGA-sampled values degraded the match slightly to 99.26%. Finally, a low-power ASIC is
synthesized from the Verilog code in 0.18 µm CMOS, with an estimated area of 0.0152 mm² and power of 9.43 µW.
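The feature-extraction-plus-classifier pipeline can be sketched as follows. The features and thresholds here are illustrative placeholders, not the descriptive features or trained decision-tree parameters from the paper:

```python
def extract_features(samples):
    # Simple descriptive features of a sampled power trace
    # (hypothetical choices for illustration).
    n = len(samples)
    mean = sum(samples) / n
    peak = max(samples)
    duty = sum(1 for s in samples if s > mean) / n  # fraction above mean
    return {"mean": mean, "peak": peak, "duty": duty}

def classify_mode(feat, peak_thresh=0.5, duty_thresh=0.3):
    """Tiny fixed decision tree over the features; thresholds are
    made up, standing in for values learned from training data."""
    if feat["peak"] < peak_thresh:
        return "idle"
    if feat["duty"] < duty_thresh:
        return "advertising"  # short, sparse bursts
    return "tx_rx"            # sustained activity
```

Such threshold-compare trees map naturally onto an FPGA or ASIC, which is what makes a bit-exact match against the reference MATLAB model feasible.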
AH-CNN: Adaptive and Hierarchical Convolutional Neural Networks using Partial Reconfiguration on FPGA
Mohammad Farhadi, Mehdi Ghasemi, Yezhou Yang (Arizona State Univ.)
Nowadays most research in visual recognition using Convolutional Neural Networks (CNNs) follows the “deeper model with deeper confidence”
belief to gain higher recognition accuracy. At the same time, a deeper model brings heavier computation. On the other hand, for a large class of recognition challenges, a system can classify images correctly using simple models, or so-called shallow networks. Moreover, the implementation of CNNs faces size, weight, and energy constraints on embedded devices. In this paper, we implement
adaptive switching between shallow and deep networks to reach the highest throughput on a resource-constrained MPSoC with CPU and
FPGA. To this end, we develop and present a novel architecture for the CNNs in which a gate decides whether using the deeper model is beneficial. Due to resource limitations on the FPGA, partial reconfiguration is used to accommodate deep CNNs
on the FPGA resources. We report experimental results on CIFAR-10, CIFAR-100, and SVHN datasets to validate our approach. Using
confidence as the decision-making factor, only 69.8%, 71.8%, and 43.8% of the computation in the deepest network is performed for CIFAR-10, CIFAR-100, and SVHN, respectively, while the desired accuracy is maintained, with a throughput of around 400 images per second on the SVHN dataset. Code: https://github.com/mfarhadi/AHCNN.
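The confidence-gated switch between shallow and deep networks can be sketched as follows; the softmax-confidence gate and the threshold value are illustrative, not the paper's trained gate:

```python
import math

def softmax_confidence(logits):
    # Confidence = largest softmax probability (shifted for stability).
    exps = [math.exp(x - max(logits)) for x in logits]
    return max(exps) / sum(exps)

def adaptive_classify(x, shallow, deep, threshold=0.9):
    """Run the shallow network first; fall back to the deep network only
    when the shallow prediction's confidence is below the threshold."""
    logits = shallow(x)
    if softmax_confidence(logits) >= threshold:
        return max(range(len(logits)), key=logits.__getitem__), "shallow"
    logits = deep(x)
    return max(range(len(logits)), key=logits.__getitem__), "deep"
```

In the paper's setting, the "deep" branch is what triggers partial reconfiguration of the FPGA, so the fraction of inputs that pass the gate directly determines the computation saved.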