2019 IEEE High Performance Extreme Computing Conference (HPEC ’19)
Twenty-third Annual HPEC Conference
24 - 26 September 2019
Westin Hotel, Waltham, MA USA
Thursday, September 26, 2019
FPGA
1:00-2:40 in Eden Vale C3
Chair: Karen Gettings / MIT-LL

Artificial Neural Network and Accelerator Co-design using Evolutionary Algorithms
Philip Colangelo (Intel), Oren Segal (Hofstra Univ.), Alex Speicher (Hofstra Univ.), Martin Margala (UMass Lowell)
Multilayer feed-forward Artificial Neural Networks (ANNs) are universal function approximators capable of modeling measurable functions to any desired degree of accuracy. In practice, designing practical, efficient neural network architectures requires significant effort and expertise. Further, designing efficient neural network architectures that fit optimally on hardware for the benefit of acceleration adds yet another degree of complexity. In this paper, we use Evolutionary Cell Aided Design (ECAD), a framework capable of searching the design spaces for ANN structures and reconfigurable hardware to find solutions based on a set of constraints and fitness functions. A modular and scalable 2D systolic-array-based machine learning accelerator, built for an Arria 10 GX 1150 FPGA device using OpenCL, enables results to be tested and deployed in real hardware. Along with the hardware, a software model of the architecture was developed to speed up the evolutionary process. We present results from the ECAD framework showing the effect that various optimization targets, including accuracy, images per second, effective giga-operations per second, and latency, have on both ANN and hardware configurations. Through this work we show that unique solutions can exist for each optimization target, each resulting in the best performance for that target. This work lays the foundation for finding machine learning based solutions for a wide range of applications having different system constraints.

IP Cores for Graph Kernels on FPGAs
Sanmukh R. Kuppannagari, Rachit Rajat, Rajgopal Kannan (USC), Aravind Dasu (Intel), Viktor K. Prasanna (USC)
Graphs are a powerful abstraction for representing networked data in many real-world applications. The need for performing large scale graph analytics has led to widespread adoption of dedicated hardware accelerators such as FPGAs for this purpose. In this work, we develop IP cores for several key graph kernels. Our IP cores use the graph processing over partitions (GPOP) programming paradigm to perform computations over graph partitions. Partitioning the input graph into non-overlapping partitions improves on-chip data reuse. Additional optimizations to exploit intra- and inter-partition parallelism and to reduce external memory accesses are also discussed. We generate FPGA designs for general graph algorithms with various vertex attributes and update propagation functions, such as Sparse Matrix Vector Multiplication (SpMV), PageRank (PR), Single Source Shortest Path (SSSP), and Weakly Connected Component (WCC). We target a platform consisting of a large external DDR4 memory to store the graph data and an Intel Stratix FPGA to accelerate the processing. Experimental results show that our accelerators sustain a high throughput of up to 2250, 2300, 3378, and 2178 Million Traversed Edges Per Second (MTEPS) for SpMV, PR, SSSP, and WCC, respectively. Compared with several highly-optimized multi-core designs, our FPGA framework achieves up to 20.5x speedup for SpMV, 16.4x speedup for PR, 3.5x speedup for SSSP, and 35.1x speedup for WCC. Compared with two state-of-the-art FPGA frameworks, our designs demonstrate up to 5.3x speedup for SpMV, 1.64x speedup for PR, and 1.8x speedup for WCC. We develop a performance model for our GPOP paradigm and use it to predict the performance of our designs with HBM2 instead of DDR to store the graph. We further discuss extensions to our optimizations to improve the throughput.

FPGA-Accelerated Spreading for Global Placement
Shounak Dhar (Univ. Texas Austin), Love Singhal (Intel), Mahesh A. Iyer (Intel), David Z. Pan (Univ. Texas Austin)
Placement takes a large part of the runtime in an Electronic Design Automation design implementation flow. In modern industrial and academic physical design implementation tools, global placement consumes a significant part of the overall placement runtime. Many of these global placers decouple the placement problem into two main parts: numerical optimization and spreading. In this paper, we propose a new and massively parallel spreading algorithm and also accelerate a part of this algorithm on FPGA. Our algorithm produces placements with comparable quality when integrated into a state-of-the-art academic placer. We formulate the spreading problem as a system of fluid flows across reservoirs and mathematically prove that this formulation produces flows without cycles when solved as a continuous-time system. We also propose a flow correction algorithm to make the flows monotonic, reduce total cell displacement, and remove cycles which may arise during the discretization process. Our new flow correction algorithm has a better time complexity for cycle removal than previous algorithms for finding cycles in a generic graph. When compared to our previously published linear programming based spreading algorithm, our new fluid-flow based multi-threaded spreading algorithm is 3.44x faster, and the corresponding FPGA-accelerated version is 5.15x faster.

An FPGA Decision Tree Classifier to Supervise Communication SoC
Abdelrahman Elkanishy (NMSU), Derrick Rivera (NMSU), Abdel-Hameed A. Badawy (NMSU, LANL), Paul M. Furth (NMSU), Z.M. Saifullah (NMSU), and Christopher P. Michael (Sandia)
Wireless communication protocols are used in all smart devices and systems. This work is part of a proposed supervisory circuit that classifies the operation of a communication SoC, in particular, a Bluetooth (BT) SoC, at a low sampling frequency by monitoring the RF output power and input supply current. In essence, the goal is to inexpensively fabricate an RF envelope detector, power supply current monitor, and classifier on a low-cost, low-frequency integrated circuit. When the supervisory circuit detects abnormal behavior, it can shut off power to the BT chip. We extract simple descriptive features from the input and output power signals. Then, we train a machine learning (ML) model to classify the different BT operation modes, such as advertising and transmit/receive modes. In this work, we implemented the ML classifier and feature extraction on an FPGA, with a 100% match against the corresponding MATLAB code. In the experimental setup, which included a function generator and an on-board ADC, errors in the FPGA-sampled values degraded the match slightly to 99.26%. Finally, a low-power ASIC is synthesized from the Verilog code in 0.18 µm CMOS, with an estimated area of 0.0152 mm² and power of 9.43 µW.

AH-CNN: Adaptive and Hierarchical Convolutional Neural Networks using Partial Reconfiguration on FPGA
Mohammad Farhadi, Mehdi Ghasemi, Yezhou Yang (Arizona State Univ.)
Nowadays most research in visual recognition using Convolutional Neural Networks (CNNs) follows the “deeper model with deeper confidence” belief to gain a higher recognition accuracy. At the same time, a deeper model brings heavier computation. On the other hand, for a large share of recognition challenges, a system can classify images correctly using simple models, or so-called shallow networks. Moreover, the implementation of CNNs faces size, weight, and energy constraints on embedded devices. In this paper, we implement adaptive switching between shallow and deep networks to reach the highest throughput on a resource-constrained MPSoC with CPU and FPGA. To this end, we develop and present a novel architecture for CNNs where a gate decides whether using the deeper model is beneficial or not. Due to resource limitations on the FPGA, partial reconfiguration is used to accommodate deep CNNs on the FPGA resources. We report experimental results on the CIFAR-10, CIFAR-100, and SVHN datasets to validate our approach. Using a confidence metric as the decision-making factor, only 69.8%, 71.8%, and 43.8% of the computation in the deepest network is done for CIFAR-10, CIFAR-100, and SVHN, respectively, while maintaining the desired accuracy at a throughput of around 400 images per second on the SVHN dataset. https://github.com/mfarhadi/AHCNN.
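The confidence-gated switching described in the AH-CNN abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which confidence metric or gate design is used, so the sketch assumes the maximum softmax probability of the shallow network as the confidence and a fixed threshold of 0.9. The names `gated_predict`, `toy_shallow`, and `toy_deep` are hypothetical, and FPGA partial reconfiguration is not modeled.

```python
import math
from typing import Callable, List, Sequence, Tuple

# A "model" here is any callable mapping an input to a list of class logits.
Model = Callable[[Sequence[float]], Sequence[float]]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_predict(
    image: Sequence[float],
    shallow: Model,
    deep: Model,
    threshold: float = 0.9,  # assumed value, not taken from the abstract
) -> Tuple[int, str]:
    """Return (predicted class, which network answered).

    Assumption: confidence is the largest softmax probability produced by
    the shallow network. The deep network runs only when that confidence
    falls below the threshold.
    """
    probs = softmax(shallow(image))
    confidence = max(probs)
    if confidence >= threshold:
        return probs.index(confidence), "shallow"
    deep_probs = softmax(deep(image))
    return deep_probs.index(max(deep_probs)), "deep"


# Toy stand-ins for the two networks (purely illustrative).
def toy_shallow(image: Sequence[float]) -> List[float]:
    return [sum(image), 0.0]


def toy_deep(image: Sequence[float]) -> List[float]:
    return [0.0, 1.0]


print(gated_predict([5.0], toy_shallow, toy_deep))  # (0, 'shallow')
print(gated_predict([0.1], toy_shallow, toy_deep))  # (1, 'deep')
```

In this sketch the deep network is skipped whenever the shallow one is confident enough, which is the general mechanism by which a gate can reduce the share of inputs that pay for the deepest computation.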