2020 IEEE High Performance Extreme Computing Virtual Conference
21-25 September 2020
Friday, September 25
5-1: Fault-Tolerant Computing Session (11:00-12:15 EDT)
Hybrid Approach to HPC Cluster Telemetry and Hardware Log Analytics
Steven Roberts (IBM)*; Woong Shin (Oak Ridge National Laboratory); Justin Thaler (IBM); Todd Rosedahl (IBM)
The number of computer processing nodes and processor cores in cluster systems is growing rapidly. Discovering, and reacting to,
a hardware or environmental issue in a timely manner enables proper fault isolation, improves quality of service, and improves
system up-time. In the case of performance impacts and node outages, RAS policies can direct actions such as job quiescence or
migration. Additionally, power consumption, thermal information, and utilization metrics can be used to provide cluster energy and
cooling efficiency improvements as well as optimized job placement. This paper describes a highly scalable telemetry architecture
that allows event aggregation, application of RAS policies, and provides the ability for cluster control system feedback. The
architecture advances existing approaches by including both programmable policies, which are applied as events stream through
the hierarchical network to persistence storage, and treatment of sensor telemetry in an extensible framework. This implementation
has proven robust and is in use in both cloud and HPC environments including the Summit system of 4,608 nodes at Oak Ridge
National Laboratory.
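The programmable streaming-policy idea can be illustrated with a small sketch: policies are applied to events as they flow toward persistent storage, emitting RAS actions such as job quiescence. The event fields, threshold, and action names below are hypothetical illustrations, not the paper's actual schema.

```python
# Minimal sketch of a programmable policy applied to a telemetry event
# stream before persistence. Event fields and thresholds are hypothetical.

def over_temp_policy(event, limit_c=85.0):
    """Flag nodes whose CPU temperature exceeds a limit."""
    if event.get("sensor") == "cpu_temp" and event["value"] > limit_c:
        return {"action": "quiesce_jobs", "node": event["node"]}
    return None

def apply_policies(events, policies):
    """Run each event through every policy as it streams by."""
    for event in events:
        for policy in policies:
            action = policy(event)
            if action is not None:
                yield action

stream = [
    {"node": "n001", "sensor": "cpu_temp", "value": 72.0},
    {"node": "n002", "sensor": "cpu_temp", "value": 91.5},
]
actions = list(apply_policies(stream, [over_temp_policy]))
```

In an architecture like the one described, such policies would run inside the hierarchical aggregation network itself, so actions fire before events ever reach the database.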
Identifying Execution Anomalies for Data Intensive Workflows Using Lightweight ML Techniques
Cong Wang (RENCI/UNC Chapel Hill)*; George Papadimitriou (USC ISI); Mariam Kiran (ESnet, LBNL); Anirban Mandal
(RENCI/UNC Chapel Hill); Ewa Deelman (USC Information Sciences Institute)
Today's computational science applications are increasingly dependent on many complex, data-intensive operations on distributed
datasets that originate from a variety of scientific instruments and repositories. To manage this complexity, science workflows are
created to automate the execution of these computational and data transfer tasks, which significantly improves scientific productivity.
As the scale of workflows rapidly increases, detecting anomalous behaviors in workflow executions has become critical to ensure
timely and accurate science products. In this paper, we present a set of lightweight machine learning-based techniques, including
both supervised and unsupervised algorithms, to identify anomalous workflow behaviors. We perform anomaly analysis on both
workflow-level and task-level datasets collected from real workflow executions on a distributed cloud testbed. Results show that the
workflow-level analysis employing k-means clustering can accurately cluster anomalous (i.e., failure-prone and poorly performing)
workflows into statistically similar classes with a reasonable quality of clustering, achieving scores above 0.7 for both Normalized
Mutual Information and Completeness. These results affirm the selection of the workflow-level features for workflow anomaly
analysis. For task-level analysis, the Decision Tree classifier achieves >80% accuracy, while other tested classifiers can achieve
>50% accuracy in most cases. We believe that these promising results can be a foundation for future research on anomaly detection
and failure prediction for scientific workflows running in production environments.
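The workflow-level clustering idea can be sketched with a minimal pure-Python k-means on two synthetic features; the feature names and data below are illustrative assumptions, not the paper's dataset or implementation.

```python
# Toy k-means separating well-behaved from anomalous workflows using two
# synthetic per-workflow features. Features and data are illustrative only.

def kmeans(points, iters=10):
    """Two-cluster k-means on 2-D points; initial centers are the first and
    last points, which is adequate for this toy illustration."""
    centers = [points[0], points[-1]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, (x, y) in enumerate(points):
            d = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centers]
            labels[i] = d.index(min(d))
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = (sum(x for x, _ in members) / len(members),
                              sum(y for _, y in members) / len(members))
    return labels

# Hypothetical per-workflow features: (runtime slowdown, task failure rate).
normal = [(1.0 + 0.1 * i, 0.01) for i in range(5)]
anomalous = [(5.0 + 0.1 * i, 0.40) for i in range(5)]
labels = kmeans(normal + anomalous)
```

Metrics such as Normalized Mutual Information then score how well the recovered clusters match the known normal/anomalous labels.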
Total Ionizing Dose Radiation Testing of NVIDIA Jetson Nano GPUs
Windy Slater (University of New Mexico); Nayana Tiwari (California Polytechnic State University); Tyler Lovelly (U.S. Air Force
Research Laboratory)*; Jesse Mee (U.S. Air Force Research Laboratory)
On-board electronics for small satellites can achieve high performance and power efficiency by using state-of-the-art commercial
processors such as graphical processing units (GPUs). However, because commercial GPUs are not designed to operate in a space
environment, they must be evaluated to determine their tolerance to radiation effects including Total Ionizing Dose (TID). In this
research, TID radiation testing is performed on NVIDIA Jetson Nano GPUs using the U.S. Air Force Research Laboratory’s Cobalt-
60 panoramic irradiator. Preliminary results suggest operation beyond 20 krad(Si), which is sufficient radiation tolerance for short
duration small satellite missions.
An Efficient LP Rounding Scheme for Replica Placement
Zhihui Du (New Jersey Institute of Technology)*; Sen Zhang (State University of New York, College at Oneonta); David Bader
(New Jersey Institute of Technology); Jingkun Hu (Worldmoney Blockchain Management Limited)
Large fault-tolerant network systems with high Quality of Service (QoS) guarantee are critical in many real world applications and
entail diverse replica placement problems.
In this paper, the replica placement problem in terms of minimizing the replica placement cost subject to both QoS and fault-tolerant
constraints is formulated as a binary integer linear programming problem first and then relaxed as a linear programming problem.
Given the optimal fractional linear programming solution, we propose a two-step rounding algorithm to obtain its integer solution. In
the first step, a half-rounding algorithm is used to simplify the problem. In the second step, a cheapest-amortized-cost
rounding algorithm uses a novel metric, named amortized cost, to make locally optimal rounding decisions for the remaining
vertices independently. Furthermore, a conflict resolution algorithm is presented to tackle the situations when different vertices make
conflicting rounding decisions. Finally, we prove that the proposed two-step rounding algorithm has a 2-approximation ratio when the
additional conflict cost meets a given constraint.
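The half-rounding step can be sketched in isolation. The threshold rule below is an illustrative reading of the classic technique (fixing any fractional variable at or above 1/2 to 1 at most doubles its cost contribution), not the paper's full two-step algorithm; variable names are hypothetical.

```python
# Illustrative half-rounding over a fractional LP solution x in [0,1]^n.
# Variables >= 0.5 are fixed to 1 (cost at most doubles), zeros are fixed
# to 0, and the rest are left for a second, more careful rounding step.

def half_round(x, threshold=0.5):
    fixed, remaining = {}, {}
    for var, val in x.items():
        if val >= threshold:
            fixed[var] = 1          # contribution at most 2x its LP cost
        elif val == 0.0:
            fixed[var] = 0
        else:
            remaining[var] = val    # handled by the second rounding step
    return fixed, remaining

x_frac = {"r1": 0.8, "r2": 0.5, "r3": 0.2, "r4": 0.0}
fixed, remaining = half_round(x_frac)
```

In the paper's scheme, the remaining variables are then rounded by the cheapest-amortized-cost step, with a conflict-resolution pass when vertices disagree.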
A Fault Tolerant Implementation for a Massively Parallel Seismic Framework
Suha Kayum (Saudi Aramco)*; Hussain Salim (Saudi Aramco); Thierry Tonellot (Saudi Aramco); Ali Almomin (Saudi Aramco)
An increase in the acquisition of seismic data volumes has resulted in applications processing seismic data running for weeks or
months on large supercomputers. A fault occurring during processing would jeopardize the fidelity and quality of the results, hence
necessitating a resilient application. GeoDRIVE is a High-Performance Computing (HPC) software framework tailored to massive
seismic applications and supercomputers. A fault tolerance mechanism that capitalizes on Boost.asio for network communication is
presented and tested quantitatively and qualitatively by simulating faults using fault injection. Resource provisioning is also
illustrated by adding more resources to a job during simulation. Finally, a large-scale job of 2,500 seismic experiments and 358
billion grid elements is executed on 32,000 cores. Subsets of nodes are killed at different times, validating the resilience of the
mechanism at large scale. While the implementation is demonstrated in a seismic application context, it can be tailored to any HPC
application with embarrassingly parallel properties.
5-2: High Performance & Secure Hardware 1 Session (12:30-13:45 EDT)
Discrete Integrated Circuit Electronics (DICE)
Zach Fredin (MIT); Jiri Zemanek (MIT); Camron Blackburn (MIT); Erik Strand (MIT); Amira Abdel-Rahman (MIT); Premila Rowles
(MIT); Neil Gershenfeld (MIT)*
We introduce DICE (Discrete Integrated Circuit Electronics). Rather than separately develop chips, packages, boards, blades, and
systems, DICE spans these scales in a direct-write process with the three-dimensional assembly of computational building blocks.
We present DICE parts, discuss their assembly, programming, and design workflow, illustrate applications in machine learning and
high performance computing, and project performance.
Arithmetic and Boolean Secret Sharing MPC on FPGAs in the Data Center
Rushi Patel (Boston University)*; Pierre-Francois Wolfe (Boston University); Robert Munafo (Boston University); Mayank Varia
(Boston University); Martin Herbordt (Boston University)
Multi-Party Computation (MPC) is an important technique used to enable computation over confidential data from several sources.
The public cloud provides a
unique opportunity to enable MPC in a low latency environment. Field Programmable Gate Array (FPGA) hardware adoption allows
for both MPC acceleration and utilization of low latency, high bandwidth communication networks that substantially improve the
performance of MPC applications. In this work, we show how designing arithmetic and Boolean Multi-Party Computation gates for
FPGAs in the cloud provides improvements over current MPC offerings and benefits their use in applications such as machine learning.
We base our FPGA MPC on the Secret Sharing MPC scheme first designed by Araki et al., and provide a comparison with
approaches utilizing Garbled Circuits for MPC. We show that Secret Sharing MPC makes better use of cloud resources, specifically
FPGA acceleration, than Garbled Circuits, and uses at least 10x fewer compute resources than the original CPU-based design.
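The secret-sharing style of MPC referenced here can be illustrated with a toy sketch of three-party additive sharing over a ring, the setting used by Araki et al.; the modulus and share layout below are assumptions for illustration, not the paper's FPGA design.

```python
import random

MOD = 2 ** 32  # arithmetic shares over the ring Z_{2^32} (illustrative)

def share(x, rng):
    """Split x into three additive shares; in replicated sharing
    (Araki et al.), party i would hold the pair (s[i], s[(i+1) % 3])."""
    s0 = rng.randrange(MOD)
    s1 = rng.randrange(MOD)
    s2 = (x - s0 - s1) % MOD
    return [s0, s1, s2]

def reconstruct(shares):
    return sum(shares) % MOD

def add_shares(a, b):
    """Addition is local: each party adds its own shares, no communication."""
    return [(x + y) % MOD for x, y in zip(a, b)]

rng = random.Random(42)
a, b = share(1234, rng), share(5678, rng)
c = add_shares(a, b)
```

Additions like this are free of communication; multiplications require one round of message exchange per gate, which is where low-latency cloud networks and FPGA acceleration pay off.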
Evaluating Cryptographic Performance of Raspberry Pi Clusters
Daniel Hawthorne (US Military Academy); Michael Kapralos (US Military Academy); Raymond Blaine (US Military Academy);
Suzanne Matthews (US Military Academy)*
ARM-based single board computers (SBCs) such as the Raspberry Pi capture the imaginations of hobbyists and scientists due to
their low cost and versatility. With the deluge of data produced in edge environments, SBCs and SBC clusters have emerged as
low-cost platforms for data collection and analysis. Simultaneously, security is a growing concern as new regulations require secure
communication for data collected from the edge. In this paper, we compare the performance of a Raspberry Pi cluster to a power-
efficient next unit of computing (NUC) and a mid-range desktop (MRD) on three leading cryptographic algorithms (AES, Twofish, and
Serpent) and assess the general-purpose performance of the three systems using the HPL benchmark. Our results suggest that
hardware-level instruction sets for all three cryptographic algorithms should be implemented on single board computers to aid with
secure data transfer on the edge.
MetaCL: Automated “Meta” OpenCL Code Generation for High-Level Synthesis on FPGA
Paul Sathre (Virginia Tech); Atharva Gondhalekar (Virginia Tech); Mohamed Hassan (Virginia Tech); Wu-chun Feng (Virginia Tech)*
Traditionally, FPGA programming has been done via a hardware description language (HDL). An HDL provides fine-grained control
over reconfigurable hardware but with limited productivity due to a steep learning curve and tedious design cycle. Thus, high-level
synthesis (HLS) approaches have been a significant boon to productivity, and in recent years, OpenCL has emerged as a vendor-
agnostic HLS language that offers the added benefit of interoperation with other OpenCL platforms (e.g., CPU, GPU, DSP) and
existing OpenCL software. However, OpenCL’s productivity can also suffer from tedious boilerplate code and the need to manually
coordinate the host (i.e., CPU) and device (i.e., FPGA or other accelerator). So, we present MetaCL, a compiler-assisted interface that
takes OpenCL kernel functions as input and automatically generates OpenCL host-side code as output. MetaCL produces more
efficient and readable host-side code, ensures portability, and introduces minimal additional runtime overhead compared to
unassisted OpenCL development.
A High Throughput Parallel Hash Table on FPGA using XOR-based Memory
Ruizhi Zhang (University of Southern California); Sasindu Wijeratne (University of Southern California); Yang Yang (University of
Southern California)*; Sanmukh Rao Kuppannagari (University of Southern California); Viktor Prasanna (University of Southern
California)
A hash table is a fundamental data structure for quick search and retrieval of data. It is a key component in complex graph analytics
and AI/ML applications. State-of-the-art parallel hash table implementations either make some simplifying assumptions such as
supporting only a subset of hash table operations or employ optimizations that lead to performance that is highly data dependent
and in the worst case can be similar to a sequential implementation. In contrast, in this work we develop a dynamic hash table that
supports all the hash table queries (search, insert, delete, and update) while sustaining p parallel queries (p>1) per clock
cycle via p processing engines (PEs) even in the worst case, i.e., the performance is data agnostic. We achieve this by implementing
novel XOR-based multi-ported block memories on FPGAs. Additionally, we develop a technique to optimize the memory requirement of the
hash table if the ratio of search to insert/update/delete queries is known beforehand. We implement our design on state-of-the-art
Intel and Xilinx FPGA devices. Our design is scalable to 16 PEs and supports throughput up to 5926 MOPS. It matches the
throughput of the state-of-the-art hash table design, FASTHash, which only supports search and insert operations. Compared with
the best FPGA design that supports the same set of operations, our hash table achieves up to 12.3x speedup.
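The XOR trick behind multi-ported memories can be modeled in software. The two-write-port class below is a simplified illustration of the general technique (each write folds in the other bank's word so that a read, which XORs all banks, sees the latest value); it is not the paper's FPGA design, and real hardware must also arbitrate same-address, same-cycle writes.

```python
# Toy model of an XOR-based memory with two write ports built from two
# single-write-port banks. Simplified illustration, not the paper's design.

class XorTwoPortMemory:
    def __init__(self, size):
        self.bank = [[0] * size, [0] * size]

    def write(self, port, addr, data):
        # XOR the new data with the other bank's word so the read below
        # (XOR of both banks) recovers the data just written.
        other = self.bank[1 - port][addr]
        self.bank[port][addr] = data ^ other

    def read(self, addr):
        return self.bank[0][addr] ^ self.bank[1][addr]

mem = XorTwoPortMemory(16)
mem.write(0, 3, 0xAB)   # port 0 writes address 3
mem.write(1, 3, 0xCD)   # port 1 overwrites the same address
```

On an FPGA, each logical port maps to physical block-RAM ports, which is how p independent processing engines can issue queries every cycle regardless of the data.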
5-3: High Performance & Secure Hardware 2 Session (14:15-15:30 EDT)
Homomorphic Encryption Based Secure Sensor Data Processing
Vijay Gadepally (MIT Lincoln Laboratory); Mihailo Isakov (Boston University); Karen Gettings (MIT Lincoln Laboratory); Jeremy
Kepner (MIT Lincoln Laboratory); Michel Kinsy (Boston University)*
Novel sensor processing algorithms face many hurdles to their adoption. Sensor processing environments have become
increasingly difficult with an ever increasing array of threats. These threats have, in turn, raised the bar on deploying new
capabilities. Many novel sensor processing algorithms exploit or induce randomness to boost algorithm performance. Co-designing
this randomness with cryptographic features could be a powerful combination providing both improved algorithm performance and
increased resiliency. The emerging field of signal processing in the encrypted domain has begun to explore such approaches. The
development of this new class of algorithms will require new classes of tools. In particular, the foundational linear algebraic
mathematics will need to be enhanced with cryptographic concepts to allow researchers to explore this new domain. This work
highlights a relatively low overhead method that uses homomorphic encryption to enhance the resiliency of a part of a larger sensor
processing pipeline.
Accelerator Design and Performance Modeling for Homomorphic Encrypted CNN Inference
Tian Ye (University of Southern California)*; Rajgopal Kannan (Army Research Lab-West); Viktor Prasanna (University of Southern
California)
The rapid advent of cloud computing has brought with it concerns on data security and privacy. Fully Homomorphic Encryption
(FHE) is a technique for enabling data security that allows arbitrary computations to be performed directly on encrypted data. In
particular, FHE can be used with convolutional neural networks (CNN) to perform inference as a service on homomorphic encrypted
input data. However, the high computational demands of FHE inference require a careful understanding of the tradeoffs between
various parameters such as security level, hardware resources and performance. In this paper, we propose a parameterized
accelerator for homomorphic encrypted CNN inference. We first develop parallel algorithms to implement CNN operations via FHE
primitives. We then develop a parameterized model to evaluate the performance of our CNN design.
The model accepts inputs in terms of available hardware resources and security parameters and outputs performance estimates. As
an illustration, for a typical image classification task on the CIFAR-10 dataset with a seven-layer CNN model, we show that a batch of 4K
encrypted images can be classified within 1 second on a device operating at 2 GHz clock rate with 16K MACs, 64 MB on-chip
memory and 256 GB/s external memory bandwidth.
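Computing directly on ciphertexts can be demonstrated with a toy additively homomorphic scheme (Paillier). The FHE schemes used for encrypted CNN inference are far more capable (supporting the multiplications CNNs need), and the key sizes below are insecure toys chosen purely for illustration.

```python
import math
import random

# Toy Paillier cryptosystem: multiplying ciphertexts adds plaintexts.
# Insecure parameter sizes; for illustration of the homomorphic property only.

def keygen(p=10007, q=10009):
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1
    n2 = n * n
    l_g = (pow(g, lam, n2) - 1) // n
    mu = pow(l_g, -1, n)               # modular inverse (Python 3.8+)
    return (n, g), (lam, mu, n)

def encrypt(pub, m):
    n, g = pub
    n2 = n * n
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(priv, c):
    lam, mu, n = priv
    n2 = n * n
    return ((pow(c, lam, n2) - 1) // n * mu) % n

pub, priv = keygen()
# Multiplying two ciphertexts adds the underlying plaintexts: 20 + 22.
c = (encrypt(pub, 20) * encrypt(pub, 22)) % (pub[0] ** 2)
```

A homomorphic CNN accelerator performs convolutions and dot products through exactly this kind of ciphertext arithmetic, which is why the operation counts (and the MAC/memory budgets modeled in the paper) are so large.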
FPGAs in the Network and Novel Communicator Support Accelerate MPI Collectives
Martin Herbordt (Boston University)*; Pouya Haghi (Boston University); Anqi Guo (Boston University); Qingqing Xiong (Boston
university); Chen Yang (Boston University); Rushi Patel (Boston University); Anthony Skjellum (UTC); Ryan Marshall (UTC); Justin
Broaddus (UTC)
MPI collective operations can often be performance killers in HPC applications; we seek to solve this bottleneck by offloading them
to reconfigurable hardware within the switch itself, rather than, e.g., the NIC. We have designed a hardware accelerator, MPI-FPGA,
to implement six MPI collectives in the network. Preliminary results show that MPI-FPGA achieves on average 3.9x speedup over
conventional clusters in the most likely scenarios. Essential to this work is providing support for sub-communicator collectives. We
introduce a novel mechanism that enables the hardware to support a large number of communicators of arbitrary shape, and that is
scalable to very large systems. We show how communicator support can be integrated easily into an in-switch hardware accelerator
to implement MPI communicators and so enable full offload of MPI collectives. While this mechanism is universally applicable, we
implement it in an FPGA cluster; FPGAs provide the ability to couple communication and computation and so are an ideal testbed
and have a number of other architectural benefits. MPI-FPGA is fully integrated into MPICH and so transparently usable by MPI
applications.
Hardware Design and Performance Evaluation of Optimizations for OpenCL FPGA Kernels
Anthony Cabrera (Washington University in St. Louis)*; Roger Chamberlain (Washington University in St. Louis)
FPGAs are valuable in heterogeneous systems because they can be used to architect custom hardware to accelerate a
particular application or domain. However, they are notoriously difficult to program. The development of high-level synthesis tools like
OpenCL makes FPGA development more accessible, but not without its own challenges. The synthesized hardware comes from a
description that is semantically closer to the application, which leaves the underlying hardware implementation unclear. Moreover,
the interaction of the hardware tuning knobs exposed using a higher level specification increases the challenge of finding the most
performant hardware configuration. In this work, we address these aforementioned challenges by describing how to approach the
design space, using both information from the literature as well as by describing a methodology to better
visualize the resulting hardware from the high level specification. Finally, we present an empirical evaluation of the impact of
vectorizing data types as a tunable knob and its interaction among other coarse-grained hardware knobs.
Hardware Foundation for Secure Computing
Donato Kava (MIT Lincoln Laboratory)*; Alice Lee (MIT Lincoln Laboratory); Michael Vai (MIT Lincoln Laboratory); Aaron Mills (MIT
Lincoln Laboratory)
Software security solutions are often considered to be more adaptable than their hardware counterparts. However, software has to
work within the limitations of the system hardware platform, of which the selection is often dictated by functionality rather than
security. Performance issues of security solutions without proper hardware support are easy to understand. The real challenge,
however, is in the dilemma of “what should be done?” vs. “what could be done?” Security software could become ineffective if its
“liberal” assumptions, e.g., the availability of a substantial trusted computing base (TCB) on the hardware platform, are violated. To
address this dilemma, we have been developing and prototyping a security-by-design hardware foundation platform that enhances
mainstream microprocessors with proper hardware security primitives to support and enhance software security solutions. This
paper describes our progress in the use of a customized security co-processor to provide security services.
5-4: High Performance & Secure Hardware 3 Session (15:45-17:00 EDT)
How to Efficiently Train Your AI Agent? Characterizing and Evaluating Deep Reinforcement Learning on Heterogeneous
Platforms
Yuan Meng (University of Southern California)*; Yang Yang (University of Southern California); Sanmukh Rao Kuppannagari (University of Southern California);
Rajgopal Kannan (USC); Viktor Prasanna (University of Southern California)
Deep Reinforcement Learning (Deep RL) is a key technology in several domains such as self-driving cars, robotics, surveillance,
etc. In Deep RL, using a Deep Neural Network model, an agent learns how to interact with the environment to achieve a certain
goal. The efficiency of running a Deep RL algorithm on a hardware architecture is dependent upon several factors including (1) the
suitability of the hardware architecture for kernels and computation patterns which are fundamental to Deep RL; (2) the capability of
the hardware architecture's memory hierarchy to minimize data-communication latency; and (3) the ability of the hardware
architecture to hide overheads introduced by the deeply nested highly irregular computation characteristics in Deep RL algorithms.
GPUs have been popular for accelerating RL algorithms; however, they fail to optimally satisfy the above-mentioned requirements. A
few recent works have developed highly customized accelerators for specific Deep RL algorithms. However, they cannot be
generalized easily to the plethora of Deep RL algorithms and DNN model choices that are available. In this paper, we explore the
possibility of developing a unified framework that can accelerate a wide range of Deep RL algorithms including variations in training
methods or DNN model structures. We take one step towards this goal by defining a domain-specific high-level abstraction for a
widely used broad class of Deep RL algorithms - on-policy Deep RL. Furthermore, we provide a systematic analysis of the
performance of state-of-the-art on-policy Deep RL algorithms on CPU-GPU and CPU-FPGA platforms. We target two representative
algorithms, PPO and A2C, for two application areas, robotics and games. We show that an FPGA-based custom accelerator achieves
up to 24x (PPO) and 8x (A2C) speedups on training tasks, and 17x (PPO) and 2.1x (A2C) improvements in overall throughput,
respectively.
A Hardware Root-of-Trust Design for Low-Power SoC Edge Devices
Alan Ehret (Texas A&M University)*; Eliakin Del Rosario (Texas A&M University); Karen Gettings (MIT Lincoln Laboratory); Michel
Kinsy (Texas A&M University)
In this work, we introduce a hardware root-of-trust architecture for low-power edge devices. An accelerator-based SoC design that
includes the hardware root-of-trust architecture is developed. An example application for the device is presented. We examine
attacks based on physical access given the significant threat they pose to unattended edge systems. The hardware root-of-trust
provides security features to ensure the integrity of the SoC execution environment when deployed in uncontrolled, unattended
locations. E-fused boot memory ensures the boot code and other security critical software is not compromised after deployment.
Digitally signed programmable instruction memory prevents execution of code from untrusted sources. A programmable finite state
machine is used to enforce access policies to device resources even if the application software on the device is compromised.
Access policies isolate the execution states of application and security-critical software. The hardware root-of-trust architecture
saves energy with a lower hardware overhead than a separate secure enclave while eliminating software attack surfaces for access
control policies.
Dynamic Computational Diversity with Multi-Radix Logic and Memory
Paul Flikkema (Northern Arizona University)*; James Palmer (Northern Arizona University); Tolga Yalcin (Northern Arizona University);
Bertrand Cambou (Northern Arizona University)
Today's computing systems are highly vulnerable to attacks, in large part because nearly all computers are part of a hardware and
software monoculture of machines within their market, industry, or sector. This is of special concern in mission-critical networked systems
upon which our civil, industrial, and defense infrastructures increasingly rely. One approach to tackle this challenge is to endow
these systems with dynamic computational diversity, wherein each processor assumes a sequence of unique variants, such that it
executes only machine code encoded for a variant during the time interval of that variant's existence. The variants are drawn from a
very large set, all adhering to a computational diversity architecture, which is based on an underlying instruction set architecture.
Thus any population of machines belonging to a specific diversity architecture consists of a temporally dynamic set of essentially
unique variants. However, the shared underlying ISA enables a common development toolchain for the diversity architecture.
Our approach is hardware-centric, relying on the rapidly developing microelectronics technologies of ternary computing, resistive
RAM (ReRAM) memory, and physical unclonable functions. This paper describes our on-going work in dynamic computational
diversity, which targets the principled design of a secure processor for embedded applications.
OpenCL Performance on the Intel Heterogeneous Architecture Research Platform
Steven Harris (Washington University in St. Louis)*; Roger Chamberlain (Washington University in St. Louis); Christopher Gill
(Washington University in St. Louis)
The fundamental operation of matrix multiplication is ubiquitous across a myriad of disciplines. Yet, the identification of new
optimizations for matrix multiplication remains relevant for emerging hardware architectures and heterogeneous systems.
Frameworks such as OpenCL enable computation orchestration on existing systems, and OpenCL's availability through the Intel High
Level Synthesis compiler allows users to architect new designs for reconfigurable hardware using C/C++. Using the HARPv2 as a vehicle
for exploration, we investigate the utility of several traditional matrix multiplication optimizations to better understand the
performance portability of OpenCL and the implications for such optimizations on cache coherent heterogeneous architectures. Our
results give targeted insights into the applicability of best practices that were designed for existing architectures when used on
emerging heterogeneous systems.
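One of the classic matrix-multiplication optimizations investigated in such studies is loop tiling (blocking). The Python sketch below only demonstrates the access pattern; on an FPGA or a cache-coherent platform like HARPv2, the payoff comes from keeping each tile resident in fast local memory while it is reused.

```python
# Tiled (blocked) matrix multiply: operands are processed tile by tile so
# each tile pair stays "hot" in fast memory while it is fully reused.

def matmul_tiled(A, B, n, tile=4):
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # Multiply one tile of A by one tile of B into a tile of C.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a * B[k][j]
    return C

n = 6
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i * j % 5 for j in range(n)] for i in range(n)]
C = matmul_tiled(A, B, n)
```

The tile size is exactly the kind of tunable knob whose interaction with other hardware knobs (vector width, unrolling, replication) such performance-portability studies evaluate.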