2020 IEEE High Performance Extreme Computing Virtual Conference
21-25 September 2020
Friday, September 25
5-1: Fault-Tolerant Computing Session (11:00-12:15 EDT)
Hybrid Approach to HPC Cluster Telemetry and Hardware Log Analytics
Steven Roberts (IBM)*; Woong Shin (Oak Ridge National Laboratory); Justin Thaler (IBM); Todd Rosedahl (IBM)
The number of computer processing nodes and processor cores in cluster systems is growing rapidly. Discovering, and reacting to,
a hardware or environmental issue in a timely manner enables proper fault isolation, improves quality of service, and improves
system up-time. In the case of performance impacts and node outages, RAS policies can direct actions such as job quiescence or
migration. Additionally, power consumption, thermal information, and utilization metrics can be used to provide cluster energy and
cooling efficiency improvements as well as optimized job placement. This paper describes a highly scalable telemetry architecture
that allows event aggregation, application of RAS policies, and provides the ability for cluster control system feedback. The
architecture advances existing approaches by including both programmable policies, which are applied as events stream through
the hierarchical network to persistence storage, and treatment of sensor telemetry in an extensible framework. This implementation
has proven robust and is in use in both cloud and HPC environments including the Summit system of 4,608 nodes at Oak Ridge
National Laboratory.
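The programmable streaming-policy idea can be illustrated with a small sketch: policies are applied to events as they flow toward persistent storage, emitting RAS actions such as job quiescence. The event fields, threshold, and action names below are hypothetical illustrations, not the paper's actual schema.

```python
# Minimal sketch of a programmable policy applied to a telemetry event
# stream before persistence. Event fields and thresholds are hypothetical.

def over_temp_policy(event, limit_c=85.0):
    """Flag nodes whose CPU temperature exceeds a limit."""
    if event.get("sensor") == "cpu_temp" and event["value"] > limit_c:
        return {"action": "quiesce_jobs", "node": event["node"]}
    return None

def apply_policies(events, policies):
    """Run each event through every policy as it streams by."""
    for event in events:
        for policy in policies:
            action = policy(event)
            if action is not None:
                yield action

stream = [
    {"node": "n001", "sensor": "cpu_temp", "value": 72.0},
    {"node": "n002", "sensor": "cpu_temp", "value": 91.5},
]
actions = list(apply_policies(stream, [over_temp_policy]))
```

In an architecture like the one described, such policies would run inside the hierarchical aggregation network itself, so actions fire before events ever reach the database.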
Identifying Execution Anomalies for Data Intensive Workflows Using Lightweight ML Techniques
Cong Wang (RENCI/UNC Chapel Hill)*; George Papadimitriou (USC ISI); Mariam Kiran (ESnet, LBNL); Anirban Mandal
(RENCI/UNC Chapel Hill); Ewa Deelman (USC Information Sciences Institute)
Today's computational science applications are increasingly dependent on many complex, data-intensive operations on distributed
datasets that originate from a variety of scientific instruments and repositories. To manage this complexity, science workflows are
created to automate the execution of these computational and data transfer tasks, which significantly improves scientific productivity.
As the scale of workflows rapidly increases, detecting anomalous behaviors in workflow executions has become critical to ensure
timely and accurate science products. In this paper, we present a set of lightweight machine learning-based techniques, including
both supervised and unsupervised algorithms, to identify anomalous workflow behaviors. We perform anomaly analysis on both
workflow-level and task-level datasets collected from real workflow executions on a distributed cloud testbed. Results show that the
workflow-level analysis employing k-means clustering can accurately cluster anomalous (i.e., failure-prone and poorly performing)
workflows into statistically similar classes with a reasonable quality of clustering, achieving scores above 0.7 for both Normalized
Mutual Information and Completeness. These results affirm the selection of the workflow-level features for workflow anomaly
analysis. For task-level analysis, the Decision Tree classifier achieves >80% accuracy, while other tested classifiers can achieve
>50% accuracy in most cases. We believe that these promising results can be a foundation for future research on anomaly detection
and failure prediction for scientific workflows running in production environments.
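The workflow-level clustering idea can be sketched with a minimal pure-Python k-means on two synthetic features; the feature names and data below are illustrative assumptions, not the paper's dataset or implementation.

```python
# Toy k-means separating well-behaved from anomalous workflows using two
# synthetic per-workflow features. Features and data are illustrative only.

def kmeans(points, iters=10):
    """Two-cluster k-means on 2-D points; initial centers are the first and
    last points, which is adequate for this toy illustration."""
    centers = [points[0], points[-1]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, (x, y) in enumerate(points):
            d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers]
            labels[i] = d.index(min(d))
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return labels

# Hypothetical per-workflow features: (runtime slowdown, task failure rate).
normal = [(1.0 + 0.1 * i, 0.01) for i in range(5)]
anomalous = [(5.0 + 0.1 * i, 0.40) for i in range(5)]
labels = kmeans(normal + anomalous)
```

Metrics such as Normalized Mutual Information then score how well the recovered clusters match the known normal/anomalous labels.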
Total Ionizing Dose Radiation Testing of NVIDIA Jetson Nano GPUs
Windy Slater (University of New Mexico); Nayana Tiwari (California Polytechnic State University); Tyler Lovelly (U.S. Air Force
Research Laboratory)*; Jesse Mee (U.S. Air Force Research Laboratory)
On-board electronics for small satellites can achieve high performance and power efficiency by using state-of-the-art commercial
processors such as graphical processing units (GPUs). However, because commercial GPUs are not designed to operate in a space
environment, they must be evaluated to determine their tolerance to radiation effects including Total Ionizing Dose (TID). In this
research, TID radiation testing is performed on NVIDIA Jetson Nano GPUs using the U.S. Air Force Research Laboratory’s Cobalt-
60 panoramic irradiator. Preliminary results suggest operation beyond 20 krad(Si), which is sufficient radiation tolerance for short
duration small satellite missions.
An Efficient LP Rounding Scheme for Replica Placement
Zhihui Du (New Jersey Institute of Technology)*; Sen Zhang (State University of New York, College at Oneonta); David Bader
(New Jersey Institute of Technology); Jingkun Hu (Worldmoney Blockchain Management Limited)
Large fault-tolerant network systems with high Quality of Service (QoS) guarantee are critical in many real world applications and
entail diverse replica placement problems.
In this paper, the replica placement problem in terms of minimizing the replica placement cost subject to both QoS and fault-tolerant
constraints is formulated as a binary integer linear programming problem first and then relaxed as a linear programming problem.
Given the optimal fractional linear programming solution, we propose a two-step rounding algorithm to obtain its integer solution. In
the first step, a half-rounding algorithm is used to simplify the problem. In the second step, a cheapest-amortized-cost
rounding algorithm uses a novel metric, named amortized cost, to make locally optimal rounding decisions for the remaining
vertices independently. Furthermore, a conflict resolution algorithm is presented to tackle the situations when different vertices make
conflicting rounding decisions. Finally, we prove that the proposed two-step rounding algorithm has a 2-approximation ratio when the
additional conflict cost meets a given constraint.
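The half-rounding step can be sketched in isolation. The threshold rule below is an illustrative reading of the classic technique (fixing any fractional variable at or above 1/2 to 1 at most doubles its cost contribution), not the paper's full two-step algorithm; variable names are hypothetical.

```python
# Illustrative half-rounding over a fractional LP solution x in [0,1]^n.
# Variables >= 0.5 are fixed to 1 (cost at most doubles), zeros are fixed
# to 0, and the rest are left for a second, more careful rounding step.

def half_round(x, threshold=0.5):
    fixed, remaining = {}, {}
    for var, val in x.items():
        if val >= threshold:
            fixed[var] = 1          # contribution at most 2x its LP cost
        elif val == 0.0:
            fixed[var] = 0
        else:
            remaining[var] = val    # handled by the second rounding step
    return fixed, remaining

x_frac = {"r1": 0.8, "r2": 0.5, "r3": 0.2, "r4": 0.0}
fixed, remaining = half_round(x_frac)
```

In the paper's scheme, the remaining variables are then rounded by the cheapest-amortized-cost step, with a conflict-resolution pass when vertices disagree.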
A Fault Tolerant Implementation for a Massively Parallel Seismic Framework
Suha Kayum (Saudi Aramco)*; Hussain Salim (Saudi Aramco); Thierry Tonellot (Saudi Aramco); Ali Almomin (Saudi Aramco)
An increase in the acquisition of seismic data volumes has resulted in applications processing seismic data running for weeks or
months on large supercomputers. A fault occurring during processing would jeopardize the fidelity and quality of the results, hence
necessitating a resilient application. GeoDRIVE is a High-Performance Computing (HPC) software framework tailored to massive
seismic applications and supercomputers. A fault tolerance mechanism that capitalizes on Boost.asio for network communication is
presented and tested quantitatively and qualitatively by simulating faults using fault injection. Resource provisioning is also
illustrated by adding more resources to a job during simulation. Finally, a large-scale job of 2,500 seismic experiments and 358
billion grid elements is executed on 32,000 cores. Subsets of nodes are killed at different times, validating the resilience of the
mechanism at large scale. While the implementation is demonstrated in a seismic application context, it can be tailored to any HPC
application with embarrassingly parallel properties.
5-2: High Performance & Secure Hardware 1 Session (12:30-13:45 EDT)
Discrete Integrated Circuit Electronics (DICE)
Zach Fredin (MIT); Jiri Zemanek (MIT); Camron Blackburn (MIT); Erik Strand (MIT); Amira Abdel-Rahman (MIT); Premila Rowles
(MIT); Neil Gershenfeld (MIT)*
We introduce DICE (Discrete Integrated Circuit Electronics). Rather than separately develop chips, packages, boards, blades, and
systems, DICE spans these scales in a direct-write process with the three-dimensional assembly of computational building blocks.
We present DICE parts, discuss their assembly, programming, and design workflow, illustrate applications in machine learning and
high performance computing, and project performance.
Arithmetic and Boolean Secret Sharing MPC on FPGAs in the Data Center
Rushi Patel (Boston University)*; Pierre-Francois Wolfe (Boston University); Robert Munafo (Boston University); Mayank Varia
(Boston University); Martin Herbordt (Boston University)
Multi-Party Computation (MPC) is an important technique used to enable computation over confidential data from several sources.
The public cloud provides a
unique opportunity to enable MPC in a low latency environment. Field Programmable Gate Array (FPGA) hardware adoption allows
for both MPC acceleration and utilization of low latency, high bandwidth communication networks that substantially improve the
performance of MPC applications. In this work, we show how designing arithmetic and Boolean Multi-Party Computation gates for
FPGAs in the cloud provides improvements over current MPC offerings and benefits their use in applications such as machine learning.
We base our FPGA MPC on the Secret Sharing MPC scheme first designed by Araki et al., and provide a comparison with
approaches utilizing Garbled Circuits for MPC. We show that Secret Sharing MPC makes better use of cloud resources, specifically
FPGA acceleration, than Garbled Circuits, and uses at least 10x fewer compute resources than the original CPU-based design.
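The secret-sharing style of MPC referenced here can be illustrated with a toy sketch of three-party additive sharing over a ring, the setting used by Araki et al.; the modulus and share layout below are assumptions for illustration, not the paper's FPGA design.

```python
import random

MOD = 2 ** 32  # arithmetic shares over the ring Z_{2^32} (illustrative)

def share(x, rng):
    """Split x into three additive shares; in replicated sharing
    (Araki et al.), party i would hold the pair (s[i], s[(i+1) % 3])."""
    s0 = rng.randrange(MOD)
    s1 = rng.randrange(MOD)
    s2 = (x - s0 - s1) % MOD
    return [s0, s1, s2]

def reconstruct(shares):
    return sum(shares) % MOD

def add_shares(a, b):
    """Addition is local: each party adds its own shares, no communication."""
    return [(x + y) % MOD for x, y in zip(a, b)]

rng = random.Random(42)
a, b = share(1234, rng), share(5678, rng)
c = add_shares(a, b)
```

Additions like this are free of communication; multiplications require one round of message exchange per gate, which is where low-latency cloud networks and FPGA acceleration pay off.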
Evaluating Cryptographic Performance of Raspberry Pi Clusters
Daniel Hawthorne (US Military Academy); Michael Kapralos (US Military Academy); Raymond Blaine (US Military Academy);
Suzanne Matthews (US Military Academy)*
ARM-based single board computers (SBCs) such as the Raspberry Pi capture the imaginations of hobbyists and scientists due to
their low cost and versatility. With the deluge of data produced in edge environments, SBCs and SBC clusters have emerged as
low-cost platforms for data collection and analysis. Simultaneously, security is a growing concern as new regulations require secure
communication for data collected from the edge. In this paper, we compare the performance of a Raspberry Pi cluster to a power-
efficient next unit of computing (NUC) and a mid-range desktop (MRD) on three leading cryptographic algorithms (AES, Twofish, and
Serpent) and assess the general-purpose performance of the three systems using the HPL benchmark. Our results suggest that
hardware-level instruction sets for all three cryptographic algorithms should be implemented on single board computers to aid with
secure data transfer on the edge.
MetaCL: Automated “Meta” OpenCL Code Generation for High-Level Synthesis on FPGA
Paul Sathre (Virginia Tech); Atharva Gondhalekar (Virginia Tech); Mohamed Hassan (Virginia Tech); Wu-chun Feng (Virginia Tech)*
Traditionally, FPGA programming has been done via a hardware description language (HDL). An HDL provides fine-grained control
over reconfigurable hardware but with limited productivity due to a steep learning curve and tedious design cycle. Thus, high-level
synthesis (HLS) approaches have been a significant boon to productivity, and in recent years, OpenCL has emerged as a vendor-
agnostic HLS language that offers the added benefit of interoperation with other OpenCL platforms (e.g., CPU, GPU, DSP) and
existing OpenCL software. However, OpenCL’s productivity can also suffer from tedious boilerplate code and the need to manually
coordinate the host (i.e., CPU) and device (i.e., FPGA or other accelerator). So, we present MetaCL, a compiler-assisted interface that
takes OpenCL kernel functions as input and automatically generates OpenCL host-side code as output. MetaCL produces more
efficient and readable host-side code, ensures portability, and introduces minimal additional runtime overhead compared to
unassisted OpenCL development.
A High Throughput Parallel Hash Table on FPGA using XOR-based Memory
Ruizhi Zhang (University of Southern California); Sasindu Wijeratne (University of Southern California); Yang Yang (University of
Southern California)*; Sanmukh Rao Kuppannagari (University of Southern California); Viktor Prasanna (University of Southern
California)
A hash table is a fundamental data structure for quick search and retrieval of data. It is a key component in complex graph analytics
and AI/ML applications. State-of-the-art parallel hash table implementations either make some simplifying assumptions such as
supporting only a subset of hash table operations or employ optimizations that lead to performance that is highly data dependent
and in the worst case can be similar to a sequential implementation. In contrast, in this work we develop a dynamic hash table that
supports all the hash table queries (search, insert, delete, and update) while sustaining p parallel queries (p>1) per clock
cycle via p processing engines (PEs) even in the worst case, i.e., the performance is data agnostic. We achieve this by implementing
novel XOR-based multi-ported block memories on FPGAs. Additionally, we develop a technique to optimize the memory requirement of the
hash table if the ratio of search to insert/update/delete queries is known beforehand. We implement our design on state-of-the-art
Intel and Xilinx FPGA devices. Our design is scalable to 16 PEs and supports throughput up to 5926 MOPS. It matches the
throughput of the state-of-the-art hash table design, FASTHash, which only supports search and insert operations. Compared with
the best FPGA design that supports the same set of operations, our hash table achieves up to 12.3x speedup.
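The XOR trick behind multi-ported memories can be modeled in software. The two-write-port class below is a simplified illustration of the general technique (each write folds in the other bank's word so that a read, which XORs all banks, sees the latest value); it is not the paper's FPGA design, and real hardware must also arbitrate same-address, same-cycle writes.

```python
# Toy model of an XOR-based memory with two write ports built from two
# single-write-port banks. Simplified illustration, not the paper's design.

class XorTwoPortMemory:
    def __init__(self, size):
        self.bank = [[0] * size, [0] * size]

    def write(self, port, addr, data):
        # XOR the new data with the other bank's word so the read below
        # (XOR of both banks) recovers the data just written.
        other = self.bank[1 - port][addr]
        self.bank[port][addr] = data ^ other

    def read(self, addr):
        return self.bank[0][addr] ^ self.bank[1][addr]

mem = XorTwoPortMemory(16)
mem.write(0, 3, 0xAB)   # port 0 writes address 3
mem.write(1, 3, 0xCD)   # port 1 overwrites the same address
```

On an FPGA, each logical port maps to physical block-RAM ports, which is how p independent processing engines can issue queries every cycle regardless of the data.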
5-3: High Performance & Secure Hardware 2 Session (14:15-15:30 EDT)
Homomorphic Encryption Based Secure Sensor Data Processing
Vijay Gadepally (MIT Lincoln Laboratory); Mihailo Isakov (Boston University); Karen Gettings (MIT Lincoln Laboratory); Jeremy
Kepner (MIT Lincoln Laboratory); Michel Kinsy (Boston University)*
Novel sensor processing algorithms face many hurdles to their adoption. Sensor processing environments have become
increasingly difficult with an ever increasing array of threats. These threats have, in turn, raised the bar on deploying new
capabilities. Many novel sensor processing algorithms exploit or induce randomness to boost algorithm performance. Co-designing
this randomness with cryptographic features could be a powerful combination providing both improved algorithm performance and
increased resiliency. The emerging field of signal processing in the encrypted domain has begun to explore such approaches. The
development of this new class of algorithms will require new classes of tools. In particular, the foundational linear algebraic
mathematics will need to be enhanced with cryptographic concepts to allow researchers to explore this new domain. This work
highlights a relatively low overhead method that uses homomorphic encryption to enhance the resiliency of a part of a larger sensor
processing pipeline.
Accelerator Design and Performance Modeling for Homomorphic Encrypted CNN Inference
Tian Ye (University of Southern California)*; Rajgopal Kannan (Army Research Lab-West); Viktor Prasanna (University of Southern
California)
The rapid advent of cloud computing has brought with it concerns on data security and privacy. Fully Homomorphic Encryption
(FHE) is a technique for enabling data security that allows arbitrary computations to be performed directly on encrypted data. In
particular, FHE can be used with convolutional neural networks (CNN) to perform inference as a service on homomorphic encrypted
input data. However, the high computational demands of FHE inference require a careful understanding of the tradeoffs between
various parameters such as security level, hardware resources and performance. In this paper, we propose a parameterized
accelerator for homomorphic encrypted CNN inference. We first develop parallel algorithms to implement CNN operations via FHE
primitives. We then develop a parameterized model to evaluate the performance of our CNN design.
The model accepts inputs in terms of available hardware resources and security parameters and outputs performance estimates. As
an illustration, for a typical image classification task on the CIFAR-10 dataset with a seven-layer CNN model, we show that a batch of 4K
encrypted images can be classified within 1 second on a device operating at 2 GHz clock rate with 16K MACs, 64 MB on-chip
memory and 256 GB/s external memory bandwidth.
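Computing directly on ciphertexts can be demonstrated with a toy additively homomorphic scheme (Paillier). The FHE schemes used for encrypted CNN inference are far more capable (supporting the multiplications CNNs need), and the key sizes below are insecure toys chosen purely for illustration.

```python
import math
import random

# Toy Paillier cryptosystem: multiplying ciphertexts adds plaintexts.
# Insecure parameter sizes; for illustration of the homomorphic property only.

def keygen(p=10007, q=10009):
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1
    n2 = n * n
    l_g = (pow(g, lam, n2) - 1) // n
    mu = pow(l_g, -1, n)               # modular inverse (Python 3.8+)
    return (n, g), (lam, mu, n)

def encrypt(pub, m):
    n, g = pub
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(priv, c):
    lam, mu, n = priv
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n * mu) % n

pub, priv = keygen()
# Multiplying two ciphertexts adds the underlying plaintexts: 20 + 22.
c = (encrypt(pub, 20) * encrypt(pub, 22)) % (pub[0] ** 2)
```

A homomorphic CNN accelerator performs convolutions and dot products through exactly this kind of ciphertext arithmetic, which is why the operation counts (and the MAC/memory budgets modeled in the paper) are so large.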
FPGAs in the Network and Novel Communicator Support Accelerate MPI Collectives
Martin Herbordt (Boston University)*; Pouya Haghi (Boston University); Anqi Guo (Boston University); Qingqing Xiong (Boston
university); Chen Yang (Boston University); Rushi Patel (Boston University); Anthony Skjellum (UTC); Ryan Marshall (UTC); Justin
Broaddus (UTC)
MPI collective operations can often be performance killers in HPC applications; we seek to solve this bottleneck by offloading them
to reconfigurable hardware within the switch itself, rather than, e.g., the NIC. We have designed a hardware accelerator, MPI-FPGA,
to implement six MPI collectives in the network. Preliminary results show that MPI-FPGA achieves on average 3.9x speedup over
conventional clusters in the most likely scenarios. Essential to this work is providing support for sub-communicator collectives. We
introduce a novel mechanism that enables the hardware to support a large number of communicators of arbitrary shape, and that is
scalable to very large systems. We show how communicator support can be integrated easily into an in-switch hardware accelerator
to implement MPI communicators and so enable full offload of MPI collectives. While this mechanism is universally applicable, we
implement it in an FPGA cluster; FPGAs provide the ability to couple communication and computation and so are an ideal testbed
and have a number of other architectural benefits. MPI-FPGA is fully integrated into MPICH and so transparently usable by MPI
applications.
Hardware Design and Performance Evaluation of Optimizations for OpenCL FPGA Kernels
Anthony Cabrera (Washington University in St. Louis)*; Roger Chamberlain (Washington University in St. Louis)
FPGAs are valuable in heterogeneous systems because they can be used to architect custom hardware to accelerate a
particular application or domain. However, they are notoriously difficult to program. The development of high-level synthesis tools like
OpenCL makes FPGA development more accessible, but not without its own challenges. The synthesized hardware comes from a
description that is semantically closer to the application, which leaves the underlying hardware implementation unclear. Moreover,
the interaction of the hardware tuning knobs exposed using a higher level specification increases the challenge of finding the most
performant hardware configuration. In this work, we address these aforementioned challenges by describing how to approach the
design space, using both information from the literature as well as by describing a methodology to better
visualize the resulting hardware from the high level specification. Finally, we present an empirical evaluation of the impact of
vectorizing data types as a tunable knob and its interaction among other coarse-grained hardware knobs.
Hardware Foundation for Secure Computing
Donato Kava (MIT Lincoln Laboratory)*; Alice Lee (MIT Lincoln Laboratory); Michael Vai (MIT Lincoln Laboratory); Aaron Mills (MIT
Lincoln Laboratory)
Software security solutions are often considered to be more adaptable than their hardware counterparts. However, software has to
work within the limitations of the system hardware platform, of which the selection is often dictated by functionality rather than
security. Performance issues of security solutions without proper hardware support are easy to understand. The real challenge,
however, is in the dilemma of “what should be done?” vs. “what could be done?” Security software could become ineffective if its
“liberal” assumptions, e.g., the availability of a substantial trusted computing base (TCB) on the hardware platform, are violated. To
address this dilemma, we have been developing and prototyping a security-by-design hardware foundation platform that enhances
mainstream microprocessors with proper hardware security primitives to support and enhance software security solutions. This
paper describes our progress in the use of a customized security co-processor to provide security services.
5-4: High Performance & Secure Hardware 3 Session (15:45-17:00 EDT)
How to Efficiently Train Your AI Agent? Characterizing and Evaluating Deep Reinforcement Learning on Heterogeneous
Platforms
Yuan Meng (University of Southern California)*; Yang Yang (University of Southern California); Sanmukh Rao Kuppannagari (University of Southern California);
Rajgopal Kannan (USC); Viktor Prasanna (University of Southern California)
Deep Reinforcement Learning (Deep RL) is a key technology in several domains such as self-driving cars, robotics, surveillance,
etc. In Deep RL, using a Deep Neural Network model, an agent learns how to interact with the environment to achieve a certain
goal. The efficiency of running a Deep RL algorithm on a hardware architecture is dependent upon several factors including (1) the
suitability of the hardware architecture for kernels and computation patterns which are fundamental to Deep RL; (2) the capability of
the hardware architecture's memory hierarchy to minimize data-communication latency; and (3) the ability of the hardware
architecture to hide overheads introduced by the deeply nested highly irregular computation characteristics in Deep RL algorithms.
GPUs have been popular for accelerating RL algorithms; however, they fail to optimally satisfy the above-mentioned requirements. A
few recent works have developed highly customized accelerators for specific Deep RL algorithms. However, they cannot be
generalized easily to the plethora of Deep RL algorithms and DNN model choices that are available. In this paper, we explore the
possibility of developing a unified framework that can accelerate a wide range of Deep RL algorithms including variations in training
methods or DNN model structures. We take one step towards this goal by defining a domain-specific high-level abstraction for a
widely used broad class of Deep RL algorithms - on-policy Deep RL. Furthermore, we provide a systematic analysis of the
performance of state-of-the-art on-policy Deep RL algorithms on CPU-GPU and CPU-FPGA platforms. We target two representative
algorithms, PPO and A2C, for two application areas, robotics and games. We show that an FPGA-based custom accelerator achieves
up to 24x (PPO) and 8x (A2C) speedups on training tasks, and 17x (PPO) and 2.1x (A2C) improvements in overall throughput,
respectively.
A Hardware Root-of-Trust Design for Low-Power SoC Edge Devices
Alan Ehret (Texas A&M University)*; Eliakin Del Rosario (Texas A&M University); Karen Gettings (MIT Lincoln Laboratory); Michel
Kinsy (Texas A&M University)
In this work, we introduce a hardware root-of-trust architecture for low-power edge devices. An accelerator-based SoC design that
includes the hardware root-of-trust architecture is developed. An example application for the device is presented. We examine
attacks based on physical access given the significant threat they pose to unattended edge systems. The hardware root-of-trust
provides security features to ensure the integrity of the SoC execution environment when deployed in uncontrolled, unattended
locations. E-fused boot memory ensures the boot code and other security critical software is not compromised after deployment.
Digitally signed programmable instruction memory prevents execution of code from untrusted sources. A programmable finite state
machine is used to enforce access policies to device resources even if the application software on the device is compromised.
Access policies isolate the execution states of application and security-critical software. The hardware root-of-trust architecture
saves energy with a lower hardware overhead than a separate secure enclave while eliminating software attack surfaces for access
control policies.
Dynamic Computational Diversity with Multi-Radix Logic and Memory
Paul Flikkema (Northern Arizona University)*; James Palmer (Northern Arizona University); Tolga Yalcin (Northern Arizona University);
Bertrand Cambou (Northern Arizona University)
Today's computing systems are highly vulnerable to attacks, in large part because nearly all computers are part of a hardware and
software monoculture of machines within their market, industry, or sector. This is of special concern in mission-critical networked systems
upon which our civil, industrial, and defense infrastructures increasingly rely. One approach to tackle this challenge is to endow
these systems with dynamic computational diversity, wherein each processor assumes a sequence of unique variants, such that it
executes only machine code encoded for a variant during the time interval of that variant's existence. The variants are drawn from a
very large set, all adhering to a computational diversity architecture, which is based on an underlying instruction set architecture.
Thus any population of machines belonging to a specific diversity architecture consists of a temporally dynamic set of essentially
unique variants. However, the shared underlying ISA enables a common development toolchain for the diversity architecture.
Our approach is hardware-centric, relying on the rapidly developing microelectronics technologies of ternary computing, resistive
RAM (ReRAM) memory, and physical unclonable functions. This paper describes our on-going work in dynamic computational
diversity, which targets the principled design of a secure processor for embedded applications.
OpenCL Performance on the Intel Heterogeneous Architecture Research Platform
Steven Harris (Washington University in St. Louis)*; Roger Chamberlain (Washington University in St. Louis); Christopher Gill
(Washington University in St. Louis)
The fundamental operation of matrix multiplication is ubiquitous across a myriad of disciplines. Yet, the identification of new
optimizations for matrix multiplication remains relevant for emerging hardware architectures and heterogeneous systems.
Frameworks such as OpenCL enable computation orchestration on existing systems, and OpenCL's availability through the Intel High
Level Synthesis compiler allows users to architect new designs for reconfigurable hardware using C/C++. Using the HARPv2 as a vehicle
for exploration, we investigate the utility of several traditional matrix multiplication optimizations to better understand the
performance portability of OpenCL and the implications for such optimizations on cache coherent heterogeneous architectures. Our
results give targeted insights into the applicability of best practices that were designed for existing architectures when used on
emerging heterogeneous systems.
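One of the classic matrix-multiplication optimizations investigated in such studies is loop tiling (blocking). The Python sketch below only demonstrates the access pattern; on an FPGA or a cache-coherent platform like HARPv2, the payoff comes from keeping each tile resident in fast local memory while it is reused.

```python
# Tiled (blocked) matrix multiply: operands are processed tile by tile so
# each tile pair stays "hot" in fast memory while it is fully reused.

def matmul_tiled(A, B, n, tile=4):
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # Multiply one tile of A by one tile of B into a tile of C.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a * B[k][j]
    return C

n = 6
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i * j % 5 for j in range(n)] for i in range(n)]
C = matmul_tiled(A, B, n)
```

The tile size is exactly the kind of tunable knob whose interaction with other hardware knobs (vector width, unrolling, replication) such performance-portability studies evaluate.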