2022
IEEE High Performance Extreme Computing
Virtual Conference
19 - 23 September 2022
1-1: Advanced Multicore Software Technologies Session (11:00-12:15)
Co-Chairs: Franz Franchetti & Brian Sroka
Performance speedup of Quantum Espresso using optimized AOCL-FFTW
S. Biplab Raut (AMD)
Quantum Espresso (QE) is an open-source software suite for electronic-structure calculations and materials modeling at the
nanoscale. QE depends upon multiple libraries including an internal or external library for FFT computations. The iterative
diagonalization process and the computation of charge density in QE use forward and inverse 3D FFTs that account for a large
portion of the total application runtime. AOCL-FFTW is the FFT library recommended for QE on AMD CPU systems. QE currently
uses the FFTW library in a sub-optimal manner thereby not achieving the best performance. This paper presents a new set of
design and implementation strategies applied in AOCL-FFTW to overcome the major limitations of QE in its use of FFTW without
requiring any code changes in QE. Results showcasing the performance benefits of the proposed optimizations in AOCL-FFTW
are presented in this paper. Speedups are achieved in single-node and multi-node test executions that help to accelerate the QE
application.
Task-Parallel Programming with Constrained Parallelism
Tsung-Wei Huang (University of Utah); Leslie Hwang (Synopsys)
The task graph programming model (TGPM) has become central to a wide range of scientific computing applications because it enables top-down optimization of parallelism that governs macro-scale performance. Existing TGPMs focus on expressing the tasks and dependencies of a workload and leave the scheduling details to a library runtime. While maximizing task concurrency is a typical scheduling goal, many applications require task parallelism to be constrained during graph execution, for example to limit the number of worker threads in a subgraph or to express a conflict between two tasks. However, mainstream TGPMs have largely ignored this important feature of constrained parallelism in a task graph. Users have no choice but to implement a separate and often sophisticated scheduling solution that is neither generalizable nor scalable. In this paper, we propose a semaphore programming model and a scheduling method, both of which can be easily integrated into an existing TGPM to support constrained parallelism. We have demonstrated the effectiveness and efficiency of our approach in real applications. As an example, our semaphore model speeds up an industrial circuit placement workload by up to 28%.
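For intuition, here is a minimal Python sketch of the semaphore idea (the task names, the limit of two, and the thread-pool runtime are illustrative assumptions; the paper integrates its model into an existing task-graph runtime): a counting semaphore caps how many tasks of a subgraph run concurrently, independent of how many workers the runtime owns.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Subgraph-wide constraint: at most 2 of these tasks may run at once,
# no matter how many worker threads the pool provides.
limit = threading.Semaphore(2)

def constrained_task(name):
    with limit:                      # acquire on entry, release on exit
        print(f"{name} running under the constraint")

with ThreadPoolExecutor(max_workers=8) as pool:
    for i in range(8):
        pool.submit(constrained_task, f"task{i}")
```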
HashTag: Fast Lookup in a Persistent Memory Filesystem
Matthew Curtis-Maury; Yash Trivedi (NetApp)
Persistent Memory (PM) offers byte-addressability and persistence on the memory bus, and delivers dramatic performance
improvements over traditional storage media. While many filesystems have been optimized for PM, a large fraction of processing
time is generally spent locating the required data in PM due to the standard use of extent-trees for location indexing. This paper
presents HashTag, a cache of PM locations for use in PM filesystems with support for snapshot creation. We evaluate HashTag
across a range of configurations to determine the impact of various location caching options on filesystem performance. These
lessons can inform the design of future caching solutions in PM filesystems.
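The abstract does not give implementation details, but conceptually such a location cache sits in front of the extent-tree index. Below is a toy Python sketch under assumed semantics (the class, its keys, and the snapshot-driven invalidation policy are hypothetical, not NetApp's design):

```python
class LocationCache:
    """Hash-map cache in front of a slow extent-tree lookup."""
    def __init__(self, extent_tree_lookup):
        self.lookup = extent_tree_lookup       # authoritative but slow index
        self.cache = {}                        # (inode, block) -> PM offset

    def resolve(self, inode, block):
        key = (inode, block)
        if key not in self.cache:              # miss: walk the extent tree
            self.cache[key] = self.lookup(inode, block)
        return self.cache[key]                 # hit: one hash probe

    def invalidate(self, inode):               # e.g., a snapshot makes the
        self.cache = {k: v for k, v in         # file's blocks copy-on-write
                      self.cache.items() if k[0] != inode}

cache = LocationCache(lambda ino, blk: ino * 1_000_000 + blk)  # fake tree
assert cache.resolve(7, 42) == 7_000_042
```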
Computing In-Place FFTs with SIMD Lane Slicing
Benoît Dupont de Dinechin (Kalray)
We present an approach for implementing in-place FFTs on cores fitted with SIMD units and non-temporal load-store units. Loading the input samples with SIMD instructions decimates them in time across the SIMD lanes. A classic FFT implementation is extended to operate on SIMD data rather than scalar data and computes the sub-transforms concurrently. This enables efficient exploitation of the SIMD arithmetic and memory access instructions while involving little SIMD lane shuffling. A last FFT stage then recombines the sub-transform results in place to produce the output. We illustrate this approach on a Cooley-Tukey radix-4 decimated-in-frequency FFT implementation, which also integrates the two-inner-loop collapsing optimization of the TI C6x DSP_fft32x32 code that enables software pipelining, and the Burrus technique for using bit-reversal in high-radix FFT implementations. Performance evaluations are performed on the Kalray KV3 core, which implements a 64-bit vector-scalar VLIW architecture with level-1 cache bypass load instructions.
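To make the lane-slicing idea concrete, here is a small numpy sketch (an illustration of the time-decimated split only, not the paper's in-place radix-4 DIF kernel): the input is decimated across four "lanes", each lane gets an independent sub-transform, and a final stage recombines them with twiddle factors.

```python
import numpy as np

N = 1024
x = np.random.randn(N) + 1j * np.random.randn(N)

# Strided SIMD loads put sample n into lane n % 4: the input is
# decimated in time across 4 lanes.
lanes = x.reshape(N // 4, 4).T                  # lanes[r] = x[4m + r]

# One length-N/4 sub-transform per lane, computed "concurrently".
sub = np.fft.fft(lanes, axis=1)                 # shape (4, N/4)

# Last stage: recombine the sub-transforms with twiddle factors.
k = np.arange(N // 4)
X = np.empty(N, dtype=complex)
for j in range(4):                              # output quarter j
    acc = np.zeros(N // 4, dtype=complex)
    for r in range(4):                          # contribution of lane r
        acc += (np.exp(-2j * np.pi * r * k / N)
                * np.exp(-2j * np.pi * r * j / 4) * sub[r])
    X[j * (N // 4):(j + 1) * (N // 4)] = acc

assert np.allclose(X, np.fft.fft(x))            # matches a direct FFT
```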
Applying the Midas Touch of Reproducibility to High-Performance Computing
Austin Minor; Wu-chun Feng (Virginia Tech)
"With the exponentially improving serial performance of CPUs from the 1980s and 1990s slowing to a standstill by the 2010s, the
high-performance computing (HPC) community has seen parallel computing
become ubiquitous, which, in turn, has led to a proliferation of parallel programming models, including CUDA, OpenACC, OpenCL,
OpenMP, and SYCL. This diversity in hardware platform and programming model has forced application users to port their codes
from one hardware platform to another (e.g., CUDA on NVIDIA GPU to HIP or OpenCL on AMD GPU) and demonstrate
reproducibility via ad-hoc testing. To more rigorously ensure reproducibility between codes, we propose Midas, a system to ensure
that the results of the original code match the results of the ported code by leveraging the power of snapshots to capture the state
of a system before and after the execution of a kernel. "
1-P: Poster Session 1 (12:15-14:15)
Chair(s)/Host(s): TBD & TBD
Resource-Constrained Optimizations For Synthetic Aperture Radar On-Board Image Processing [Outstanding Paper Award]
Maron Schlemon (German Aerospace Center); Martin Schulz (TU Munich); Rolf Scheiber (German Aerospace Center)
Synthetic Aperture Radar (SAR) can be used to create realistic and high-resolution 2D or 3D reconstructions of landscapes. The data capture is typically deployed using radar instruments in specially equipped, low-flying planes, resulting in a large amount of raw data, which needs to be processed for image reconstruction. However, due to limited on-board processing capacities on the plane (power, size, weight, cooling, communication bandwidth to ground stations, etc.) and the need to capture many images during a single flight, the raw data must be processed on-board and then sent to the ground station efficiently as image products. In this paper we describe the processing architecture of the digital beamforming SAR (DBFSAR) of the German Aerospace Center (DLR) and the special steps that had to be taken to enable on-board processing. We explain the required software optimizations and under which conditions their integration into the SAR imaging process leads to (near) real-time capability. We further describe the lessons learned in our work and discuss how they can be applied to other processing scenarios with limited resource availability.
Enhancing the Performance Portability of Heterogeneous Circuit Analysis Programs
Tsung-Wei Huang (University of Utah)
Recently, CPU-GPU heterogeneous parallelism has brought transformational performance milestones to static timing analysis (STA) algorithms. As the computing ecosystem continues to proliferate, performance portability has emerged as a new challenge when deploying the result to diverse heterogeneous computing platforms. Specifically, the optimal code written for one CPU-GPU architecture may not be optimal for other CPU-GPU architectures, due to various performance, interoperability, and availability constraints. As a result, we introduce in this paper a learning-based framework to enhance the performance portability of a GPU-accelerated STA program. We parameterize important performance parameters and leverage a neural network model to adapt performance optimization to any given computing platform. We have demonstrated the effectiveness of our framework in real STA applications.
Kv2vec: A Distributed Representation Method for Key-value Pairs from Metadata Attributes
Chenxu Niu; Wei Zhang (Texas Tech Univ.); Suren Byna (LBNL); Yong Chen (Texas Tech Univ.)
Distributed representation methods for words have been developed for years, and numerous methods exist, such as word2vec, GloVe, and fastText. However, they are not designed for key-value pairs, which are an important data pattern widely used in many scenarios. For example, metadata attributes of scientific files consist of collections of key-value pairs. In this research, we propose kv2vec, a method that captures relationships between keys and values and represents key-value pairs as dense vectors. The fundamental idea of the kv2vec method is to utilize recurrent neural networks (RNNs) with long short-term memory (LSTM) hidden units to convert each key-value pair to a distributed vector representation. This new method overcomes the weaknesses of existing embedding models for representing key-value pairs as vectors. Moreover, it can be integrated into dataset search solutions through querying metadata attributes for the self-describing file formats that are widely used in HPC systems. We evaluate the kv2vec method with multiple real-world datasets, and the results show that kv2vec outperforms existing models.
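As a rough sketch of the stated idea (the vocabulary, tokenization, and dimensions below are invented for illustration, and the network is untrained; the paper's architecture and training objective are not shown): feed the tokens of a key-value pair through an LSTM and take the final hidden state as the pair's dense vector.

```python
import torch
import torch.nn as nn

# Tiny invented vocabulary; a real system would tokenize file metadata.
vocab = {"author": 0, "=": 1, "curie": 2, "year": 3, "1903": 4}
emb = nn.Embedding(len(vocab), 16)
lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

def encode(tokens):
    ids = torch.tensor([[vocab[t] for t in tokens]])   # (1, seq_len)
    _, (h, _) = lstm(emb(ids))                         # run the RNN
    return h[-1, 0]                                    # final hidden state

v1 = encode(["author", "=", "curie"])                  # one key-value pair
v2 = encode(["year", "=", "1903"])                     # another pair
print(torch.cosine_similarity(v1, v2, dim=0))          # compare embeddings
```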
Unsupervised Adaptation of Spiking Networks in a Gradual Changing Environment
Zaidao Mei (Syracuse Univ.); Mark Barnell (Air Force Research Laboratory); Qinru Qiu (Syracuse Univ.)
"Spiking neural networks(SNNs) have drawn broad research interests in recent years due to their high energy efficiency and
biologically-plausibility. They have proven to be competitive in many machine learning tasks. Similar to all Artificial Neural
Network(ANNs) machine learning models, the SNNs rely on the assumption that the training and testing data are drawn from the
same distribution. As the environment changes gradually, the input distribution will shift over time, and the performance of SNNs
turns out to be brittle. To this end, we propose a unified framework that can adapt non-stationary streaming data by exploiting
unlabeled intermediate domain, and fits with the in-hardware SNN learning algorithm Error-modulated STDP. Specifically, we
propose a unique self-training framework to generate pseudo labels to retrain the model for intermediate and target domains. In
addition, we develop an online-normalization method with an auxiliary neuron to normalize the output of the hidden layers. By
combining the normalization with self-training, our approach gains average classification improvements over 10% on MNIST,
NMINST, and two other datasets."
Predicting Ankle Moment Trajectory with Adaptive Weighted Ensemble of LSTM Networks
Emilia A Grzesiak; Ho Chit Siu; Jennifer Sloboda (MIT Lincoln Laboratory)
"Estimations of ankle moments can provide clinically helpful information on the function of lower extremities and further lead to
insight on patient rehabilitation and assistive wearable exoskeleton design. Current methods for estimating ankle moments leave
room for improvement, with most recent cutting-edge methods relying on machine learning models trained on wearable sEMG and
IMU data. While machine learning eliminates many practical challenges that troubled more traditional human body models for this
application, we aim to expand on prior work that showed the feasibility of using LSTM models by employing an ensemble of LSTM
networks. We present an adaptive weighted LSTM ensemble network and demonstrate its performance during standing, walking,
running, and sprinting. Our result show that the LSTM ensemble outperformed every single LSTM model component within the
ensemble. Across every activity, the ensemble reduced median root mean squared error (RMSE) by 0.0017-0.0053 K*m/kg, which
is 2.7-10.3% lower than the best performing single LSTM model. Hypothesis testing revealed that most reductions in RMSE were
statistically significant between the ensemble and other single models across all activities and subjects. Future work may analyze
different trajectory lengths and different combinations of LSTM submodels within the ensemble. This study improves on an existing
approach to joint moment prediction from wearable sensors, which may be used to obtain clinically-useful information about joint
kinetics outside of a motion capture space. "
Interval Arithmetic-based FFT for Large Integer Multiplication
Zibo Gong; Nathan Zhu; Matt Ngaw (Carnegie Mellon Univ.); Joao Rivera (ETH Zurich); Larry Tang; Eric Tang; Het Mankad; Franz
Franchetti (Carnegie Mellon Univ.)
In this work we propose an interval arithmetic Fast Fourier Transform (FFT) algorithm for large integer multiplication on both CPUs and GPUs. We utilize techniques of double-double precision, shared memory, and thread parallelization to improve both the efficiency and accuracy of our implementation. Early results show that on CPUs, we can achieve correctness on factors billions of digits in size. On GPUs, we see performance speedups compared to existing software libraries, lowering computation cost without adversely impacting the accuracy of the result.
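Below is a simplified Python sketch of FFT-based integer multiplication with a crude accuracy check: plain double precision plus a nearest-integer distance test stands in for the paper's rigorous interval arithmetic and double-double techniques, and the digit size is a toy assumption.

```python
import numpy as np

def bigmul_fft(a_digits, b_digits, base=10):
    """Multiply two big integers (little-endian digit lists) via FFT."""
    n = 1 << (len(a_digits) + len(b_digits) - 1).bit_length()
    fa = np.fft.fft(a_digits, n)
    fb = np.fft.fft(b_digits, n)
    raw = np.fft.ifft(fa * fb).real                # linear convolution
    coeffs = np.rint(raw).astype(np.int64)
    # Stand-in for the paper's interval check: the distance to the nearest
    # integer bounds the rounding error; large values signal lost accuracy.
    assert np.max(np.abs(raw - coeffs)) < 0.25, "precision exhausted"
    carry, out = 0, []
    for c in coeffs:                               # resolve carries
        carry, d = divmod(int(c) + carry, base)
        out.append(d)
    while carry:
        carry, d = divmod(carry, base)
        out.append(d)
    while len(out) > 1 and out[-1] == 0:           # trim leading zeros
        out.pop()
    return out

# 123 * 456 = 56088 (digits stored little-endian)
assert bigmul_fft([3, 2, 1], [6, 5, 4]) == [8, 8, 0, 6, 5]
```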
Machine Learning for Accurate and Fast Bandgap Prediction of Solid-State Materials
Shomik Verma; Shivam Kajale; Rafael Gomez-Bombarelli (MIT)
Semi-local DFT tends to vastly underestimate the bandgap of materials. Here we propose a machine learning calibration workflow to improve the accuracy of cheap DFT calculations. We first compile a dataset of 25k materials with completed PBE and HSE calculations. Using this dataset, we benchmark various machine learning architectures and features to determine which yield the highest accuracy. The best technique is able to improve the accuracy of PBE 10-fold. We then expand the generalizability of the model by utilizing active learning to intelligently sample chemical space. Because HSE data is not available for these new materials, we develop an optimized high-throughput parallelized workflow to calculate HSE bandgaps of 10k additional materials. The result is a cheap, accurate, and generalizable ML model for bandgap prediction.
Systolic Array based FPGA accelerator for Yolov3-tiny
Prithvi Velicheti; Sivani Pentapati; Suresh Purini (IIIT Hyderabad)
"FPGAs are increasingly significant for deploying convolutional neural network (CNN) inference models because of performance
demands and power constraints in embedded and data centre applications. Object detection and classification are
essential tasks in computer vision. You Only Look Once (YOLO) is a very efficient algorithm for object detection and classification
with its variant Yolov3-tiny specially designed for embedded applications. This paper presents the FPGA accelerator for
multiple precisions (FIXED-8, FIXED-16, FLOAT32) of YoloV3-tiny. We use a homogenous systolic array architecture with a
synchronized pipeline adder tree for convolution, allowing it to be scalable for multiple variants of Yolo with a change in host driver.
We evaluated the design on Terasic DE5a-Net-DDR4. The
Fixed point (FP-8, FP-16) implementations attain a throughput of 57 GOPs/s (> 23%) and 46.16 GOPs/s (> 340 %). We
synthesized the first FLOAT32 implementation attaining 11.22 GFLOPs/s."
Epigenetics and Transcriptomics Quality Control Pipelines in an HPC Environment
Darrell O Ricke (MIT Lincoln Laboratory); Derek Ng (Northeastern Univ.); Philip Fremont-Smith; Adam Michaleas; Rafael Jaimes
(MIT Lincoln Laboratory)
"Chemical and pathogen exposures can modify an individual’s epigenome and transcriptome. These modifications can persist over
time and may provide distinctive signatures and timelines of exposure. These signatures may be distinctive for different viral
pathogens, bacterial pathogens, and chemical exposures. Exposure signature discovery is enabled by improved transcriptomic
and epigenomic assaying techniques to detect RNA expression, DNA base modifications, histone modifications, and chromatin
accessibility. However, there is a paucity of quality control (QC) guidelines and software to ensure data integrity and accuracy. We
developed analytical pipelines to validate QC of data generated by twelve different transcriptomic and epigenomic assays. These
QC pipelines were containerized using Singularity to ensure portability and scalability across high performance computing
environments. We deployed the pipelines on the MIT SuperCloud high performance computing system and report execution time.
Quality thresholds and metrics are also proposed across the broad set of assays, which may serve as a comprehensive reference
guide. These tools and associated metrics are available as open source resources."
1-2: Cloud HPEC Session (12:30-13:45)
Co-Chairs: Brian Sroka & Laura Brattain
Invited Talk: HPC Matters! How Supercomputing Supports NASA’s Mission
Dr. Piyush Mehrotra (NASA)
Scalable Interactive Autonomous Navigation Simulations on HPC
Wesley Brewer; Joel Bretheim (HPCMP PET/GDIT); John Kaniarz (DEVCOM Ground Vehicle Systems Center); Peilin Song;
Burhman Gates (Engineer Research & Development Center)
We present our work on enabling HPC in an interactive real-time autonomy loop. The workflow consists of many different software components deployed within Singularity containers and communicating using both the Robot Operating System's (ROS) publish-subscribe system and the Message Passing Interface (MPI). We use Singularity's container networking interface (CNI) to enable virtual networking within the containers, so that multiple containers can run the various components using different IP addresses on the same compute node. The Virtual Autonomous Navigation Environment Environmental Sensor Engine (VANE:ESE) is used for physically realistic simulation of LIDAR, along with the Autonomous Navigation Virtual Environment Laboratory (ANVEL) for vehicle simulation. VANE:ESE sends Velodyne UDP LIDAR packets directly to the Robotic Technology Kernel (RTK) and is distributed across multiple compute nodes via MPI, along with OpenMP for shared-memory parallelism within each compute node. The user interfaces with the navigation environment using an XFCE desktop with virtual workspaces running over a containerized VNC deployment through a double-hop ssh tunnel, which uses noVNC (a JavaScript-based VNC client) to provide a browser-based client interface. We automate the complete launch process using a custom iLauncher plugin. We benchmark scalable performance with multiple vehicle simulations on four different HPC systems and discuss our findings.
Parallelizing Explicit and Implicit Extrapolation Methods for Ordinary Differential Equations
Utkarsh (IIT Kanpur); Chris Elrod; Yingbo Ma; Christopher Rackauckas (Julia Computing)
Numerically solving ordinary differential equations (ODEs) is a naturally serial process, and as a result the vast majority of ODE solver software is serial. In this manuscript we develop a set of parallelized ODE solvers using extrapolation methods which exploit "parallelism within the method" so that arbitrary user ODEs can be parallelized. We describe the specific choices made in the implementation of the explicit and implicit extrapolation methods which allow for generating low-overhead static schedules that are then exploited with optimized multi-threaded implementations. We demonstrate that while multi-threading gives a noticeable acceleration on both explicit and implicit problems, the explicit parallel extrapolation methods gave no significant improvement over the state of the art, even with a multi-threading advantage against current optimized high-order Runge-Kutta tableaus. However, we demonstrate that the implicit parallel extrapolation methods are able to achieve state-of-the-art performance (2x-4x speedup) on standard multicore x86 CPUs for systems of < 200 stiff ODEs solved at low tolerance, a typical setup for the vast majority of users of high-level language equation solver suites. The resulting method is distributed as the first widely available open-source software for within-method parallel acceleration targeting typical modest compute architectures.
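A toy Python sketch of "parallelism within the method" follows (explicit Euler with Richardson extrapolation; the paper's methods, static schedules, and Julia implementation are far more sophisticated): the first column of the extrapolation tableau uses independent step subdivisions, so those integrations can be dispatched concurrently even though each ODE solve is serial.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def euler(f, y0, t0, dt, n):                 # base method: n explicit
    y, t = np.asarray(y0, float), t0         # Euler substeps over dt
    for _ in range(n):
        y = y + dt / n * f(t, y)
        t += dt / n
    return y

def extrapolated_step(f, y0, t0, dt, order=4):
    ns = [2 ** i for i in range(order)]      # independent subdivisions:
    with ThreadPoolExecutor() as pool:       # these entries are parallel
        T = list(pool.map(lambda n: euler(f, y0, t0, dt, n), ns))
    for j in range(1, order):                # Aitken-Neville recurrence
        for i in range(order - 1, j - 1, -1):
            T[i] = T[i] + (T[i] - T[i - 1]) / (ns[i] / ns[i - j] - 1)
    return T[-1]

f = lambda t, y: -y                          # y' = -y, y(0) = 1
y1 = extrapolated_step(f, [1.0], 0.0, 0.1)
assert abs(y1[0] - np.exp(-0.1)) < 1e-6      # high-order accurate step
```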
SuperCloud Lite in the Cloud – Lightweight, Secure, Self-Service, On-Demand Mechanisms for Creating Customizable
Research Computing Environments
Kelsie Edie (US Military Academy); Kurt Keville; Lauren Milechin; Chris N Hill (MIT)
We describe and examine an automation for deploying on-demand, OAuth2-secured virtual machine instances. Our approach does not require any expert security or web-service knowledge to create a secure instance. The approach allows non-experts to launch web-accessible virtual machine services that are automatically secured through OAuth2 authentication, an authentication standard widely employed in academic and enterprise environments. We demonstrate the approach through an example of creating secure commercial cloud instances of the MIT SuperCloud modern research-computing-oriented software stack. A small example use case is examined and compared with the native MIT SuperCloud experience as a preliminary evaluation.
The example illustrates several useful features. It retains OAuth2 security guarantees and leverages a simple OAuth2 proxy architecture that in turn employs simple DNS-based service limits to manage access to the proxy service. The system has the potential to provide a default secure environment in which access is, in theory, limited to a narrow trust circle. It leverages WebSockets to provide a pure-browser, zero-install service. For the user, it is entirely self-service, so that a non-expert, non-privileged user can launch instances, while supporting access to a familiar environment on a broad selection of hardware, including high-end GPUs and isolated bare-metal resources. The environment includes pre-configured browser-based desktop GUI and notebook configurations.
It can provide the option of end-user privileged access to the VM for flexible customization. It integrates with a simplified cost-monitoring and machine-management framework that provides visibility into commercial cloud charges and some budget guard rails, and it supports instance stop, restart, and pausing features to allow intermittent use and cost reduction.
Site-Wide HPC Data Center Demand Response
Daniel C Wilson; Ioannis Paschalidis; Ayse K. Coskun (Boston Univ.)
"As many electricity markets are trending towards greater renewable energy generation, there will be an increased need for
electrical grids to cooperatively balance electricity supply and demand. Data centers are one large consumer of electricity on a
global scale, and they are well-suited to act as a grid load stabilizer via performing ""demand response.""
Prior investigations in this space have demonstrated how data centers can continue to meet their users' quality of service (QoS)
needs by modeling relationships between cluster job queues, server power properties, and application performance. While server
power is a major factor in data center power consumption, other components such as cooling systems contribute a non-negligible
amount of electricity demand.
This work proposes using a simple site-wide (i.e., including all components of the data center) power model on top of QoS-aware
demand response solutions to achieve the QoS benefits of those solutions while improving the cost-saving opportunities in
demand response. We demonstrate 1.3x cost savings compared to QoS-aware demand response policies that do not utilize site-
wide power models, and show similar savings in cases of severely under-predicted site-wide power consumption if 1.5x relaxed
QoS constraints are allowed."
1-3: Quantum and Non-Deterministic Computing Session (14:15-15:30)
Co-Chairs: Patrick Dreher & Donato Kava
C2QA – Bosonic Qiskit [Outstanding Paper Award]
Timothy Stavenger (PNNL); Eleanor Crane (JQI, QuICS); Kevin Smith (Brookhaven National Laboratory, Yale Univ.); Christopher T
Kang (Univ. of Washington); Steven Girvin (Yale Univ.); Nathan Wiebe (Univ. of Toronto, PNNL)
The practical benefits of hybrid quantum information processing hardware that contains continuous-variable objects (bosonic
modes such as mechanical or electromagnetic oscillators) in addition to traditional (discrete-variable) qubits have recently been
demonstrated by experiments with bosonic codes that reach the break-even point for quantum error correction [1]–[5] and by
efficient Gaussian boson sampling simulation of the Franck-Condon spectra of triatomic molecules [6] that is well beyond the
capabilities of current qubit-only hardware. The goal of this Co-design Center for Quantum Advantage (C2QA) project is to develop
an instruction set architecture (ISA) for hybrid qubit/bosonic mode systems that contains an inventory of the fundamental
operations and measurements that are possible in such hardware. The corresponding abstract machine model (AMM) could also
contain a description of the appropriate error models associated with the gates, measurements and time evolution of the hardware.
This information has been implemented as an extension of IBM Qiskit, an open-source software development kit (SDK) for simulating the quantum state of a quantum circuit and for running the same circuits on prototype hardware within the IBM Quantum Experience. We introduce the Bosonic Qiskit software to enable the simulation of hybrid qubit/bosonic systems using the existing Qiskit software development kit [7]. This implementation can be used for simulating new hybrid systems, verifying proposed physical systems, and modeling systems larger than can currently be constructed. We also cover tutorials and example use cases included within the software for studying Jaynes-Cummings models and bosonic Hubbard models, plotting Wigner functions and animations, and calculating maximum likelihood estimations using Wigner functions.
Constructing Optimal Contraction Trees for Tensor Network Quantum Circuit Simulation [Outstanding Student Paper Award]
Cameron A Ibrahim (Univ. of Delaware); Danylo Lykov (Argonne National Laboratory); Zichang He (UC Santa Barbara); Yuri
Alexeev (Argonne National Laboratory); Ilya Safro (Univ. of Delaware)
One of the key problems in tensor network based quantum circuit simulation is the construction of a contraction tree which minimizes the cost of the simulation, where the cost can be expressed as the number of operations, a proxy for the simulation running time. This same problem arises in a variety of application areas, such as combinatorial scientific computing, marginalization in probabilistic graphical models, and solving constraint satisfaction problems. In this paper, we reduce the computationally hard portion of this problem to one of graph linear ordering, and demonstrate how existing approaches in this area can be utilized to achieve results up to several orders of magnitude better than existing state-of-the-art methods for the same running time. To do so, we introduce a novel polynomial-time algorithm for constructing an optimal contraction tree from a given order. Furthermore, we introduce a fast and high-quality linear ordering solver, and demonstrate its applicability as a heuristic for providing orderings for contraction trees. Finally, we compare our solver with competing methods for constructing contraction trees in quantum circuit simulation on a collection of randomly generated Quantum Approximate Optimization Algorithm Max-Cut circuits and show that our method achieves superior results on a majority of the tested quantum circuits.
Reproducibility: Our source code and data are available at https://github.com/cameton/HPEC2022_ContractionTrees.
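To see why the contraction tree matters, consider this toy Python example (hypothetical shapes, a matrix chain rather than a real circuit tensor network): two valid trees produce identical results at very different operation counts.

```python
import numpy as np

# A tensor-network "chain" A-B-C-D with hypothetical bond dimensions.
rng = np.random.default_rng(0)
A = rng.random((8, 4));  B = rng.random((4, 16))
C = rng.random((16, 2)); D = rng.random((2, 8))

def cost(left, right):            # multiply-adds for one pairwise contraction
    return left[0] * left[1] * right[1]

# Two contraction trees for the same network:
t1 = cost((8, 4), (4, 16)) + cost((8, 16), (16, 2)) + cost((8, 2), (2, 8))
t2 = cost((8, 4), (4, 16)) + cost((16, 2), (2, 8)) + cost((8, 16), (16, 8))
print(t1, t2)                     # 896 vs 1792 operations
assert np.allclose(((A @ B) @ C) @ D, (A @ B) @ (C @ D))  # same result
```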
Quantum Netlist Compiler (QNC)
Shamminuj Aktar; Abdel-Hameed A. Badawy (New Mexico State Univ.); Nandakishore Santhi (Los Alamos National Laboratory)
Over the last decade, quantum computing hardware has undergone rapid development and has become a very intriguing, promising, and active research field among scientists worldwide. To achieve the desired quantum functionalities, quantum algorithms require translation from a high-level description to a machine-specific physical operation sequence. In contrast to classical compilers, state-of-the-art quantum compilers are in their infancy. There is a research need for a quantum compiler that can deal with generic unitary operators and generate basic unitary operations according to quantum machines' diverse underlying technologies and characteristics. In this work, we introduce the Quantum Netlist Compiler (QNC), which converts arbitrary unitary operators or desired initial states of quantum algorithms to OpenQASM-2.0 circuits, enabling them to run on actual quantum hardware. Extensive simulations on IBM quantum systems and analysis of the results show that QNC is well suited for quantum circuit optimization and produces circuits with competitive success rates in practice.
Hardware Design and Implementation of Classic McEliece Post-Quantum Cryptosystem Based on FPGA
Shaofen Chen; Haiyan Lin; Wenjin Huang; Yihua Huang (Sun Yat-sen Univ.)
With the development of the information age, the security of data transmission has attracted more attention. In addition, quantum computers pose a great threat to widely used cryptography algorithms. The Classic McEliece algorithm is a post-quantum algorithm that offers high security and has stood firm against all kinds of attacks for decades. The wide application of a cryptosystem is inseparable from its hardware implementation scheme, so this paper proposes a Classic McEliece implementation scheme based on an FPGA platform. To achieve a balance between resources and speed, a variety of implementation methods are adopted. First, using the random-access characteristics of RAM, the clock-cycle consumption of the error-vector generation module is reduced by 95.1%. Second, multiple computing units are employed inside the module for parallel computing, which reduces the number of computing cycles by about 22.4%. Finally, this paper proposes a multiplexed syndrome decoding module; compared to the non-multiplexed scheme, LUT resource consumption is reduced by about 24.2% and FF resource consumption by about 15.4%.
Hardware Design and Implementation of Post-Quantum Cryptography Kyber
Qingru Zeng; Quanxin Li; Baoze Zhao; Han Jiao; Yihua Huang (Sun Yat-sen Univ.)
In order to resist quantum attacks, post-quantum cryptographic algorithms have become the focus of cryptography research. As a lattice-based key algorithm, the Kyber protocol has great advantages in the selection of post-quantum algorithms. This paper proposes an efficient hardware design scheme for Kyber512, whose security level is L1. The paper first designs a general hash module that reuses computing cores to improve resource utilization. A ping-pong RAM and a pipeline structure are used to design a general-purpose NTT processor that supports all operations on polynomial multiplication. Finally, the inter-module cooperation and data scheduling are compactly designed to shorten the working cycle. The top-level key generation, public key encryption, and private key decryption modules are implemented on an Artix-7 FPGA at a frequency of 204 MHz. The execution times of the corresponding modules are 11.5 μs, 17.3 μs, and 23.5 μs, respectively. Compared with the leading hardware implementation, this design reduces the area-delay product by 10.2%, achieving an effective balance between resources and speed.
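For context, the primitive such an NTT processor accelerates is negacyclic polynomial multiplication. Below is a naive Python sketch using Kyber's modulus q = 3329 but a toy length N = 8 (Kyber itself uses N = 256 with an incomplete NTT and fast butterfly kernels rather than this O(N^2) transform):

```python
# Negacyclic polynomial multiplication mod (x^N + 1, Q) via the
# number-theoretic transform with a psi pre-twist.
Q, N = 3329, 8                                  # Kyber's prime modulus
psi = next(w for w in range(2, Q) if pow(w, N, Q) == Q - 1)  # psi^N = -1

def dft(a, root):                               # naive transform over Z_Q
    return [sum(a[j] * pow(root, i * j, Q) for j in range(N)) % Q
            for i in range(N)]

def poly_mul(a, b):                             # a*b mod (x^N + 1, Q)
    w = pow(psi, 2, Q)                          # N-th root of unity
    at = [x * pow(psi, i, Q) % Q for i, x in enumerate(a)]    # pre-twist
    bt = [x * pow(psi, i, Q) % Q for i, x in enumerate(b)]
    ct = [x * y % Q for x, y in zip(dft(at, w), dft(bt, w))]  # pointwise
    c = dft(ct, pow(w, Q - 2, Q))               # inverse transform
    n_inv, psi_inv = pow(N, Q - 2, Q), pow(psi, Q - 2, Q)
    return [x * n_inv % Q * pow(psi_inv, i, Q) % Q for i, x in enumerate(c)]

# x * x^(N-1) = x^N = -1 under the negacyclic wrap-around
assert poly_mul([0, 1] + [0] * 6, [0] * 7 + [1]) == [Q - 1] + [0] * 7
```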
1-4: BRAIDS – Boosting Resilience through Artificial Intelligence and Decision Support Session (15:45-17:00)
Co-Chairs: Courtland VanDam & Sandeep Pisharody
Invited Talk: Welcome to CyberWar: Long Term Ramifications Unleashed by Russia’s War
Barry Greene (Akamai)
Distributed Hardware Accelerated Secure Joint Computation on the COPA Framework [Outstanding Student Paper Award]
Rushi Patel; Pouya Haghi (Boston Univ.); Shweta Jain; Andriy Kot; Venkata Krishnan (Intel); Mayank Varia; Martin Herbordt
(Boston Univ.)
Performance of distributed data center applications can be improved through the use of FPGA-based SmartNICs, which provide additional functionality and enable higher-bandwidth communication and lower latency. Until recently, however, the lack of a simple approach for customizing SmartNICs to application requirements has limited the potential benefits. Intel's Configurable Network Protocol Accelerator (COPA) provides a customizable FPGA framework that integrates both hardware and software development to improve computation and communication performance. In this first case study, we demonstrate the capabilities of the COPA framework with an application from cryptography -- secure Multi-Party Computation (MPC) -- that utilizes hardware accelerators connected directly to host memory and the COPA network. We find that using the COPA framework gives significant improvements to both computation and communication compared to traditional implementations of MPC that use CPUs and NICs. A single MPC accelerator running on COPA enables more than 17 Gb/s of communication bandwidth while using only 3% of Stratix 10 resources. We show that the COPA framework enables multiple MPC accelerators running in parallel to fully saturate a 100 Gbps link, enabling higher performance compared to traditional NICs.
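For background, the arithmetic core of MPC is computation over secret shares. Here is a toy Python sketch of additive secret sharing (the modulus, share count, and functions are illustrative assumptions, not COPA's protocol):

```python
import secrets

# Additive secret sharing over Z_{2^32}: each party holds a random-looking
# share; no single share reveals anything about the secret.
MOD = 2 ** 32

def share(x, parties=3):
    r = [secrets.randbelow(MOD) for _ in range(parties - 1)]
    return r + [(x - sum(r)) % MOD]                 # shares sum to x mod MOD

def add_shares(xs, ys):                             # addition is local:
    return [(a + b) % MOD for a, b in zip(xs, ys)]  # no communication needed

x, y = 1234, 5678
assert sum(add_shares(share(x), share(y))) % MOD == (x + y) % MOD
```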
Large Scale Enrichment and Statistical Cyber Characterization of Network Traffic
Ivan Kawaminami; Arminda Estrada; Youssef Elsakkary (Univ. of Arizona); Hayden Jananthan (MIT LLSC); Aydin Buluc (LBNL);
Tim Davis (Texas A&M Univ.); Daniel Grant (GreyNoise); Michael Jones (MIT LLSC); Chad Meiners (MIT Lincoln Laboratory);
Andrew Morris (GreyNoise); Sandeep Pisharody (MIT Lincoln Laboratory); Jeremy Kepner (MIT LLSC)
Modern network sensors continuously produce enormous quantities of raw data that are beyond the capacity of human analysts.
Cross-correlation of network sensors increases this challenge by enriching every network event with additional metadata. These
large volumes of enriched network data present opportunities to statistically characterize network traffic and quickly answer a key
question: “What are the primary cyber characteristics of my network data?” The Python GraphBLAS and PyD4M analysis
frameworks enable anonymized statistical analysis to be performed quickly and efficiently on very large network data sets. This
approach is tested using billions of anonymized network data samples from the largest Internet observatory (CAIDA Telescope)
and tens of millions of anonymized records from the largest commercially available background enrichment capability (GreyNoise).
The analysis confirms that most of the enriched variables follow expected heavy-tail distributions and that a large fraction of the
network traffic is due to a small number of cyber activities. This information can simplify the cyber analysts’ task by enabling
prioritization of cyber activities based on statistical prevalence.
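As a toy illustration of such a heavy-tail characterization (synthetic Zipf-distributed data standing in for the CAIDA/GreyNoise corpora; not the paper's GraphBLAS/D4M pipeline):

```python
import numpy as np

# Synthetic stand-in for enriched traffic records: Zipf-distributed
# "source IDs" reproduce the heavy-tail shape described above.
rng = np.random.default_rng(1)
sources = rng.zipf(2.0, size=100_000)
_, counts = np.unique(sources, return_counts=True)
counts = np.sort(counts)[::-1]                  # busiest sources first
top = counts[: max(1, len(counts) // 100)].sum()
print(f"top 1% of sources account for {100 * top / counts.sum():.1f}% "
      "of all records")
```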
Edge Computing Security for a Multi-Agent System
Alice Lee; Karen Gettings; Matthias Beebe; Paul Monticciolo; Michael Vai (MIT Lincoln Laboratory)
We examine the computational energy requirements of different systems driven by the geometrical scaling law (known as Moore's law, or Dennard scaling for geometry) and the increasing use of Artificial Intelligence/Machine Learning (AI/ML) over the last decade. With more scientific and technology applications based on data-driven discovery, machine learning methods, especially deep neural networks, have become widely used. In order to enable such applications, both hardware accelerators and advanced AI/ML methods have led to the introduction of new architectures, system designs, algorithms, and software. Our analysis of energy trends indicates three important observations: 1) energy efficiency due to geometrical scaling is slowing down; 2) energy efficiency at the bit level does not translate into efficiency at the instruction level, or at the system level, for a variety of systems, especially large-scale supercomputers; 3) at the application level, general-purpose ML/AI methods can be computationally energy-intensive, offsetting the gains in energy from geometrical scaling and special-purpose accelerators. Further, our analysis provides specific pointers for integrating energy efficiency with performance analysis for enabling ML/AI-driven and high-performance computing applications in the future.
Invited Talk: Proposed Empirical Assessment of Remote Workers’ Cyberslacking and Computer Security Posture to
Assess Organizational Cybersecurity Risks
Ariel Luna; Yair Levy; Greg Simco; Wei Li (Nova Southeastern University)
Cyberslacking occurs when employees use their companies' equipment and network for personal purposes instead of working during work hours. Cyberslacking has a significant adverse effect on overall employee productivity; moreover, with the recent COVID-19-driven move to remote working, it also poses a cybersecurity risk to organizations' networks and infrastructure. In this work-in-progress research study, we are developing, validating, and will empirically test a taxonomy to assess an organization's remote workers' risk level of cybersecurity threats. The study takes a three-phase developmental approach to constructing the Remote Worker Cyberslacking Security Risk Taxonomy, in collaboration with cybersecurity Subject Matter Experts (SMEs). The taxonomy assesses an organization's remote workers' cybersecurity risk level by using actual system indicators of productivity measures to estimate cyberslacking, along with organizational information about the computer security posture of the remote device being used to access corporate resources. Anticipated results from 125 anonymous employees at one organization will then be assessed with the cybersecurity risk taxonomy, and recommendations will be provided to the organization's cybersecurity leadership.
1-S1: Sky Computing – Toward Efficient Computing on the Cloud Special Session (17:30-19:30)
Organizers: Marco Montes de Oca, Luna Xu, Erica Lin, Suraj Bramhavar, Jeffrey Chou (Sync Computing)
Running Spark Applications In Large Scale On K8s: Challenges and Solutions
Bo Yang (Stealth Startup)
Taming High-Performance Computing Platform Heterogeneity with Machine Learning
Prasanna Balaprakash (Argonne National Laboratory)
Optimizing Heterogeneous Computing Resources Based Only on Cost and Time
Suraj Bramhavar (Sync Computing)
AI-Powered Acceleration of Deep Learning Inference on the Cloud
Glenn Ko (Stochastic)
Cost-Effective Batch Scheduling in the Cloud
Chaoran Yu (Apple)