2022
IEEE High Performance Extreme Computing
Virtual Conference
19 - 23 September 2022
1-1: Advanced Multicore Software Technologies Session (11:00-12:15)
Co-Chairs: Franz Franchetti & Brian Sroka
Performance speedup of Quantum Espresso using optimized AOCL-FFTW
S. Biplab Raut (AMD)
Quantum Espresso (QE) is an open-source software suite for electronic-structure calculations and materials modeling at the
nanoscale. QE depends upon multiple libraries including an internal or external library for FFT computations. The iterative
diagonalization process and the computation of charge density in QE use forward and inverse 3D FFTs that account for a large
portion of the total application runtime. AOCL-FFTW is the FFT library recommended for QE on AMD CPU systems. QE currently
uses the FFTW library in a sub-optimal manner thereby not achieving the best performance. This paper presents a new set of
design and implementation strategies applied in AOCL-FFTW to overcome the major limitations of QE in its use of FFTW without
requiring any code changes in QE. Results showcasing the performance benefits of the proposed optimizations in AOCL-FFTW
are presented in this paper. Speedups are achieved in single-node and multi-node test executions that help to accelerate the QE
application.
Task-Parallel Programming with Constrained Parallelism
Tsung-Wei Huang (University of Utah); Leslie Hwang (Synopsys)
The task graph programming model (TGPM) has become central to a wide range of scientific computing applications because it enables top-down optimization of parallelism that governs macro-scale performance. Existing TGPMs focus on expressing the tasks and dependencies of a workload and leave the scheduling details to a library runtime. While maximizing task concurrency is a typical scheduling goal, many applications require task parallelism to be constrained during graph execution, for example to limit the number of worker threads in a subgraph or to express a conflict between two tasks. However, mainstream TGPMs have largely ignored this important feature of constrained parallelism in a task graph. Users have no choice but to implement a separate and often sophisticated scheduling solution that is neither generalizable nor scalable. In this paper, we propose a semaphore programming model and a scheduling method, both of which can be easily integrated into an existing TGPM to support constrained parallelism. We have demonstrated the effectiveness and efficiency of our approach in real applications. As an example, our semaphore model speeds up an industrial circuit placement workload by up to 28%.
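For intuition, here is a minimal Python sketch of the semaphore idea (the task names, the limit of two, and the thread-pool runtime are illustrative assumptions; the paper integrates its model into an existing task-graph runtime): a counting semaphore caps how many tasks of a subgraph run concurrently, independent of how many workers the runtime owns.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Subgraph-wide constraint: at most 2 of these tasks may run at once,
# no matter how many worker threads the pool provides.
limit = threading.Semaphore(2)

def constrained_task(name):
    with limit:                      # acquire on entry, release on exit
        print(f"{name} running under the constraint")

with ThreadPoolExecutor(max_workers=8) as pool:
    for i in range(8):
        pool.submit(constrained_task, f"task{i}")
```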
HashTag: Fast Lookup in a Persistent Memory Filesystem
Matthew Curtis-Maury; Yash Trivedi (NetApp)
Persistent Memory (PM) offers byte-addressability and persistence on the memory bus, and delivers dramatic performance
improvements over traditional storage media. While many filesystems have been optimized for PM, a large fraction of processing
time is generally spent locating the required data in PM due to the standard use of extent-trees for location indexing. This paper
presents HashTag, a cache of PM locations for use in PM filesystems with support for snapshot creation. We evaluate HashTag
across a range of configurations to determine the impact of various location caching options on filesystem performance. These
lessons can inform the design of future caching solutions in PM filesystems.
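The abstract does not give implementation details, but conceptually such a location cache sits in front of the extent-tree index. Below is a toy Python sketch under assumed semantics (the class, its keys, and the snapshot-driven invalidation policy are hypothetical, not NetApp's design):

```python
class LocationCache:
    """Hash-map cache in front of a slow extent-tree lookup."""
    def __init__(self, extent_tree_lookup):
        self.lookup = extent_tree_lookup       # authoritative but slow index
        self.cache = {}                        # (inode, block) -> PM offset

    def resolve(self, inode, block):
        key = (inode, block)
        if key not in self.cache:              # miss: walk the extent tree
            self.cache[key] = self.lookup(inode, block)
        return self.cache[key]                 # hit: one hash probe

    def invalidate(self, inode):               # e.g., a snapshot makes the
        self.cache = {k: v for k, v in         # file's blocks copy-on-write
                      self.cache.items() if k[0] != inode}

cache = LocationCache(lambda ino, blk: ino * 1_000_000 + blk)  # fake tree
assert cache.resolve(7, 42) == 7_000_042
```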
Computing In-Place FFTs with SIMD Lane Slicing
Benoît Dupont de Dinechin (Kalray)
We present an approach for implementing in-place FFTs on cores fitted with SIMD units and non-temporal load-store units. Loading the input samples with SIMD instructions decimates them in time across the SIMD lanes. A classic FFT implementation is extended to operate on SIMD data rather than scalar data and computes the sub-transforms concurrently. This enables efficient exploitation of the SIMD arithmetic and memory access instructions while involving little SIMD lane shuffling. A last FFT stage then recombines the sub-transform results in place to produce the output. We illustrate this approach on a Cooley-Tukey radix-4 decimated-in-frequency FFT implementation, which also integrates the two-inner-loop collapsing optimization of the TI C6x DSP_fft32x32 code that enables software pipelining, and the Burrus technique for using bit-reversal in high-radix FFT implementations. Performance evaluations are performed on the Kalray KV3 core, which implements a 64-bit vector-scalar VLIW architecture with level-1 cache bypass load instructions.
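To make the lane-slicing idea concrete, here is a small numpy sketch (an illustration of the time-decimated split only, not the paper's in-place radix-4 DIF kernel): the input is decimated across four "lanes", each lane gets an independent sub-transform, and a final stage recombines them with twiddle factors.

```python
import numpy as np

N = 1024
x = np.random.randn(N) + 1j * np.random.randn(N)

# Strided SIMD loads put sample n into lane n % 4: the input is
# decimated in time across 4 lanes.
lanes = x.reshape(N // 4, 4).T                  # lanes[r] = x[4m + r]

# One length-N/4 sub-transform per lane, computed "concurrently".
sub = np.fft.fft(lanes, axis=1)                 # shape (4, N/4)

# Last stage: recombine the sub-transforms with twiddle factors.
k = np.arange(N // 4)
X = np.empty(N, dtype=complex)
for j in range(4):                              # output quarter j
    acc = np.zeros(N // 4, dtype=complex)
    for r in range(4):                          # contribution of lane r
        acc += (np.exp(-2j * np.pi * r * k / N)
                * np.exp(-2j * np.pi * r * j / 4) * sub[r])
    X[j * (N // 4):(j + 1) * (N // 4)] = acc

assert np.allclose(X, np.fft.fft(x))            # matches a direct FFT
```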
Applying the Midas Touch of Reproducibility to High-Performance Computing
Austin Minor; Wu-chun Feng (Virginia Tech)
"With the exponentially improving serial performance of CPUs from the 1980s and 1990s slowing to a standstill by the 2010s, the
high-performance computing (HPC) community has seen parallel computing
become ubiquitous, which, in turn, has led to a proliferation of parallel programming models, including CUDA, OpenACC, OpenCL,
OpenMP, and SYCL. This diversity in hardware platform and programming model has forced application users to port their codes
from one hardware platform to another (e.g., CUDA on NVIDIA GPU to HIP or OpenCL on AMD GPU) and demonstrate
reproducibility via ad-hoc testing. To more rigorously ensure reproducibility between codes, we propose Midas, a system to ensure
that the results of the original code match the results of the ported code by leveraging the power of snapshots to capture the state
of a system before and after the execution of a kernel. "
1-P: Poster Session 1 (12:15-14:15)
Chair(s)/Host(s): TBD & TBD
Resource-Constrained Optimizations For Synthetic Aperture Radar On-Board Image Processing [Outstanding Paper Award]
Maron Schlemon (German Aerospace Center); Martin Schulz (TU Munich); Rolf Scheiber (German Aerospace Center)
Synthetic Aperture Radar (SAR) can be used to create realistic and high-resolution 2D or 3D reconstructions of landscapes. The data capture is typically deployed using radar instruments in specially equipped, low-flying planes, resulting in a large amount of raw data, which needs to be processed for image reconstruction. However, due to limited on-board processing capacities on the plane (power, size, weight, cooling, communication bandwidth to ground stations, etc.) and the need to capture many images during a single flight, the raw data must be processed on-board and then sent to the ground station efficiently as image products. In this paper we describe the processing architecture of the digital beamforming SAR (DBFSAR) of the German Aerospace Center (DLR) and the special steps that had to be taken to enable on-board processing. We explain the required software optimizations and under which conditions their integration into the SAR imaging process leads to (near) real-time capability. We further describe the lessons learned in our work and discuss how they can be applied to other processing scenarios with limited resource availability.
Enhancing the Performance Portability of Heterogeneous Circuit Analysis Programs
Tsung-Wei Huang (University of Utah)
Recently, CPU-GPU heterogeneous parallelism has brought transformational performance milestones to static timing analysis (STA) algorithms. As the computing ecosystem continues to proliferate, performance portability has emerged as a new challenge when deploying the result to diverse heterogeneous computing platforms. Specifically, the optimal code written for one CPU-GPU architecture may not be optimal for other CPU-GPU architectures, due to various performance, interoperability, and availability constraints. As a result, we introduce in this paper a learning-based framework to enhance the performance portability of a GPU-accelerated STA program. We parameterize important performance parameters and leverage a neural network model to adapt performance optimization to any given computing platform. We have demonstrated the effectiveness of our framework in real STA applications.
Kv2vec: A Distributed Representation Method for Key-value Pairs from Metadata Attributes
Chenxu Niu; Wei Zhang (Texas Tech Univ.); Suren Byna (LBNL); Yong Chen (Texas Tech Univ.)
Distributed representation methods for words have been developed for years, and numerous methods exist, such as word2vec, GloVe, and fastText. However, they are not designed for key-value pairs, which are an important data pattern widely used in many scenarios. For example, metadata attributes of scientific files consist of collections of key-value pairs. In this research, we propose kv2vec, a method that captures relationships between keys and values and represents key-value pairs as dense vectors. The fundamental idea of the kv2vec method is to utilize recurrent neural networks (RNNs) with long short-term memory (LSTM) hidden units to convert each key-value pair to a distributed vector representation. This new method overcomes the weaknesses of existing embedding models for representing key-value pairs as vectors. Moreover, it can be integrated into dataset search solutions through querying metadata attributes for the self-describing file formats that are widely used in HPC systems. We evaluate the kv2vec method with multiple real-world datasets, and the results show that kv2vec outperforms existing models.
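As a rough sketch of the stated idea (the vocabulary, tokenization, and dimensions below are invented for illustration, and the network is untrained; the paper's architecture and training objective are not shown): feed the tokens of a key-value pair through an LSTM and take the final hidden state as the pair's dense vector.

```python
import torch
import torch.nn as nn

# Tiny invented vocabulary; a real system would tokenize file metadata.
vocab = {"author": 0, "=": 1, "curie": 2, "year": 3, "1903": 4}
emb = nn.Embedding(len(vocab), 16)
lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

def encode(tokens):
    ids = torch.tensor([[vocab[t] for t in tokens]])   # (1, seq_len)
    _, (h, _) = lstm(emb(ids))                         # run the RNN
    return h[-1, 0]                                    # final hidden state

v1 = encode(["author", "=", "curie"])                  # one key-value pair
v2 = encode(["year", "=", "1903"])                     # another pair
print(torch.cosine_similarity(v1, v2, dim=0))          # compare embeddings
```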
Unsupervised Adaptation of Spiking Networks in a Gradual Changing Environment
Zaidao Mei (Syracuse Univ.); Mark Barnell (Air Force Research Laboratory); Qinru Qiu (Syracuse Univ.)
"Spiking neural networks(SNNs) have drawn broad research interests in recent years due to their high energy efficiency and
biologically-plausibility. They have proven to be competitive in many machine learning tasks. Similar to all Artificial Neural
Network(ANNs) machine learning models, the SNNs rely on the assumption that the training and testing data are drawn from the
same distribution. As the environment changes gradually, the input distribution will shift over time, and the performance of SNNs
turns out to be brittle. To this end, we propose a unified framework that can adapt non-stationary streaming data by exploiting
unlabeled intermediate domain, and fits with the in-hardware SNN learning algorithm Error-modulated STDP. Specifically, we
propose a unique self-training framework to generate pseudo labels to retrain the model for intermediate and target domains. In
addition, we develop an online-normalization method with an auxiliary neuron to normalize the output of the hidden layers. By
combining the normalization with self-training, our approach gains average classification improvements over 10% on MNIST,
NMINST, and two other datasets."
Predicting Ankle Moment Trajectory with Adaptive Weighted Ensemble of LSTM Networks
Emilia A Grzesiak; Ho Chit Siu; Jennifer Sloboda (MIT Lincoln Laboratory)
"Estimations of ankle moments can provide clinically helpful information on the function of lower extremities and further lead to
insight on patient rehabilitation and assistive wearable exoskeleton design. Current methods for estimating ankle moments leave
room for improvement, with most recent cutting-edge methods relying on machine learning models trained on wearable sEMG and
IMU data. While machine learning eliminates many practical challenges that troubled more traditional human body models for this
application, we aim to expand on prior work that showed the feasibility of using LSTM models by employing an ensemble of LSTM
networks. We present an adaptive weighted LSTM ensemble network and demonstrate its performance during standing, walking,
running, and sprinting. Our result show that the LSTM ensemble outperformed every single LSTM model component within the
ensemble. Across every activity, the ensemble reduced median root mean squared error (RMSE) by 0.0017-0.0053 K*m/kg, which
is 2.7-10.3% lower than the best performing single LSTM model. Hypothesis testing revealed that most reductions in RMSE were
statistically significant between the ensemble and other single models across all activities and subjects. Future work may analyze
different trajectory lengths and different combinations of LSTM submodels within the ensemble. This study improves on an existing
approach to joint moment prediction from wearable sensors, which may be used to obtain clinically-useful information about joint
kinetics outside of a motion capture space. "
Interval Arithmetic-based FFT for Large Integer Multiplication
Zibo Gong; Nathan Zhu; Matt Ngaw (Carnegie Mellon Univ.); Joao Rivera (ETH Zurich); Larry Tang; Eric Tang; Het Mankad; Franz
Franchetti (Carnegie Mellon Univ.)
In this work we propose an interval arithmetic Fast Fourier Transform (FFT) algorithm for large integer multiplication on both CPUs and GPUs. We utilize techniques of double-double precision, shared memory, and thread parallelization to improve both the efficiency and accuracy of our implementation. Early results show that on CPUs, we can achieve correctness on factors billions of digits in size. On GPUs, we see performance speedups compared to existing software libraries, lowering computation cost without adversely impacting the accuracy of the result.
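Below is a simplified Python sketch of FFT-based integer multiplication with a crude accuracy check: plain double precision plus a nearest-integer distance test stands in for the paper's rigorous interval arithmetic and double-double techniques, and the digit size is a toy assumption.

```python
import numpy as np

def bigmul_fft(a_digits, b_digits, base=10):
    """Multiply two big integers (little-endian digit lists) via FFT."""
    n = 1 << (len(a_digits) + len(b_digits) - 1).bit_length()
    fa = np.fft.fft(a_digits, n)
    fb = np.fft.fft(b_digits, n)
    raw = np.fft.ifft(fa * fb).real                # linear convolution
    coeffs = np.rint(raw).astype(np.int64)
    # Stand-in for the paper's interval check: the distance to the nearest
    # integer bounds the rounding error; large values signal lost accuracy.
    assert np.max(np.abs(raw - coeffs)) < 0.25, "precision exhausted"
    carry, out = 0, []
    for c in coeffs:                               # resolve carries
        carry, d = divmod(int(c) + carry, base)
        out.append(d)
    while carry:
        carry, d = divmod(carry, base)
        out.append(d)
    while len(out) > 1 and out[-1] == 0:           # trim leading zeros
        out.pop()
    return out

# 123 * 456 = 56088 (digits stored little-endian)
assert bigmul_fft([3, 2, 1], [6, 5, 4]) == [8, 8, 0, 6, 5]
```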
Machine Learning for Accurate and Fast Bandgap Prediction of Solid-State Materials
Shomik Verma; Shivam Kajale; Rafael Gomez-Bombarelli (MIT)
Semi-local DFT tends to vastly underestimate the bandgap of materials. Here we propose a machine learning calibration workflow to improve the accuracy of cheap DFT calculations. We first compile a dataset of 25k materials with completed PBE and HSE calculations. Using this dataset, we benchmark various machine learning architectures and features to determine which yield the highest accuracy. The best technique is able to improve the accuracy of PBE 10-fold. We then expand the generalizability of the model by utilizing active learning to intelligently sample chemical space. Because HSE data is not available for these new materials, we develop an optimized high-throughput parallelized workflow to calculate HSE bandgaps of 10k additional materials. The result is a cheap, accurate, and generalizable ML model for bandgap prediction.
Systolic Array based FPGA accelerator for Yolov3-tiny
Prithvi Velicheti; Sivani Pentapati; Suresh Purini (IIIT Hyderabad)
"FPGAs are increasingly significant for deploying convolutional neural network (CNN) inference models because of performance
demands and power constraints in embedded and data centre applications. Object detection and classification are
essential tasks in computer vision. You Only Look Once (YOLO) is a very efficient algorithm for object detection and classification
with its variant Yolov3-tiny specially designed for embedded applications. This paper presents the FPGA accelerator for
multiple precisions (FIXED-8, FIXED-16, FLOAT32) of YoloV3-tiny. We use a homogenous systolic array architecture with a
synchronized pipeline adder tree for convolution, allowing it to be scalable for multiple variants of Yolo with a change in host driver.
We evaluated the design on Terasic DE5a-Net-DDR4. The
Fixed point (FP-8, FP-16) implementations attain a throughput of 57 GOPs/s (> 23%) and 46.16 GOPs/s (> 340 %). We
synthesized the first FLOAT32 implementation attaining 11.22 GFLOPs/s."
Epigenetics and Transcriptomics Quality Control Pipelines in an HPC Environment
Darrell O Ricke (MIT Lincoln Laboratory); Derek Ng (Northeastern Univ.); Philip Fremont-Smith; Adam Michaleas; Rafael Jaimes
(MIT Lincoln Laboratory)
"Chemical and pathogen exposures can modify an individual’s epigenome and transcriptome. These modifications can persist over
time and may provide distinctive signatures and timelines of exposure. These signatures may be distinctive for different viral
pathogens, bacterial pathogens, and chemical exposures. Exposure signature discovery is enabled by improved transcriptomic
and epigenomic assaying techniques to detect RNA expression, DNA base modifications, histone modifications, and chromatin
accessibility. However, there is a paucity of quality control (QC) guidelines and software to ensure data integrity and accuracy. We
developed analytical pipelines to validate QC of data generated by twelve different transcriptomic and epigenomic assays. These
QC pipelines were containerized using Singularity to ensure portability and scalability across high performance computing
environments. We deployed the pipelines on the MIT SuperCloud high performance computing system and report execution time.
Quality thresholds and metrics are also proposed across the broad set of assays, which may serve as a comprehensive reference
guide. These tools and associated metrics are available as open source resources."
1-2: Cloud HPEC Session (12:30-13:45)
Co-Chairs: Brian Sroka & Laura Brattain
Invited Talk: HPC Matters! How Supercomputing Supports NASA’s Mission
Dr. Piyush Mehrotra (NASA)
Scalable Interactive Autonomous Navigation Simulations on HPC
Wesley Brewer; Joel Bretheim (HPCMP PET/GDIT); John Kaniarz (DEVCOM Ground Vehicle Systems Center); Peilin Song;
Burhman Gates (Engineer Research & Development Center)
We present our work on enabling HPC in an interactive real-time autonomy loop. The workflow consists of many different software components deployed within Singularity containers and communicating using both the Robot Operating System's (ROS) publish-subscribe system and the Message Passing Interface (MPI). We use Singularity's container networking interface (CNI) to enable virtual networking within the containers, so that multiple containers can run the various components using different IP addresses on the same compute node. The Virtual Autonomous Navigation Environment Environmental Sensor Engine (VANE:ESE) is used for physically realistic simulation of LIDAR, along with the Autonomous Navigation Virtual Environment Laboratory (ANVEL) for vehicle simulation. VANE:ESE sends Velodyne UDP LIDAR packets directly to the Robotic Technology Kernel (RTK) and is distributed across multiple compute nodes via MPI, along with OpenMP for shared-memory parallelism within each compute node. The user interfaces with the navigation environment using an XFCE desktop with virtual workspaces running over a containerized VNC deployment through a double-hop ssh tunnel, which uses noVNC (a JavaScript-based VNC client) to provide a browser-based client interface. We automate the complete launch process using a custom iLauncher plugin. We benchmark scalable performance with multiple vehicle simulations on four different HPC systems and discuss our findings.
Parallelizing Explicit and Implicit Extrapolation Methods for Ordinary Differential Equations
Utkarsh (IIT Kanpur); Chris Elrod; Yingbo Ma; Christopher Rackauckas (Julia Computing)
Numerically solving ordinary differential equations (ODEs) is a naturally serial process, and as a result the vast majority of ODE solver software is serial. In this manuscript we develop a set of parallelized ODE solvers using extrapolation methods which exploit "parallelism within the method" so that arbitrary user ODEs can be parallelized. We describe the specific choices made in the implementation of the explicit and implicit extrapolation methods which allow for generating low-overhead static schedules that are then exploited with optimized multi-threaded implementations. We demonstrate that while multi-threading gives a noticeable acceleration on both explicit and implicit problems, the explicit parallel extrapolation methods gave no significant improvement over the state of the art, even with a multi-threading advantage against current optimized high-order Runge-Kutta tableaus. However, we demonstrate that the implicit parallel extrapolation methods are able to achieve state-of-the-art performance (2x-4x speedup) on standard multicore x86 CPUs for systems of < 200 stiff ODEs solved at low tolerance, a typical setup for the vast majority of users of high-level language equation solver suites. The resulting method is distributed as the first widely available open-source software for within-method parallel acceleration targeting typical modest compute architectures.
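A toy Python sketch of "parallelism within the method" follows (explicit Euler with Richardson extrapolation; the paper's methods, static schedules, and Julia implementation are far more sophisticated): the first column of the extrapolation tableau uses independent step subdivisions, so those integrations can be dispatched concurrently even though each ODE solve is serial.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def euler(f, y0, t0, dt, n):                 # base method: n explicit
    y, t = np.asarray(y0, float), t0         # Euler substeps over dt
    for _ in range(n):
        y = y + dt / n * f(t, y)
        t += dt / n
    return y

def extrapolated_step(f, y0, t0, dt, order=4):
    ns = [2 ** i for i in range(order)]      # independent subdivisions:
    with ThreadPoolExecutor() as pool:       # these entries are parallel
        T = list(pool.map(lambda n: euler(f, y0, t0, dt, n), ns))
    for j in range(1, order):                # Aitken-Neville recurrence
        for i in range(order - 1, j - 1, -1):
            T[i] = T[i] + (T[i] - T[i - 1]) / (ns[i] / ns[i - j] - 1)
    return T[-1]

f = lambda t, y: -y                          # y' = -y, y(0) = 1
y1 = extrapolated_step(f, [1.0], 0.0, 0.1)
assert abs(y1[0] - np.exp(-0.1)) < 1e-6      # high-order accurate step
```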
SuperCloud Lite in the Cloud – Lightweight, Secure, Self-Service, On-Demand Mechanisms for Creating Customizable
Research Computing Environments
Kelsie Edie (US Military Academy); Kurt Keville; Lauren Milechin; Chris N Hill (MIT)
We describe and examine an automation for deploying on-demand, OAuth2-secured virtual machine instances. Our approach does not require any expert security or web-service knowledge to create a secure instance. The approach allows non-experts to launch web-accessible virtual machine services that are automatically secured through OAuth2 authentication, an authentication standard widely employed in academic and enterprise environments. We demonstrate the approach through an example of creating secure commercial cloud instances of the MIT SuperCloud modern research-computing-oriented software stack. A small example use case is examined and compared with the native MIT SuperCloud experience as a preliminary evaluation.
The example illustrates several useful features. It retains OAuth2 security guarantees and leverages a simple OAuth2 proxy architecture that in turn employs simple DNS-based service limits to manage access to the proxy service. The system has the potential to provide a default secure environment in which access is, in theory, limited to a narrow trust circle. It leverages WebSockets to provide a pure-browser, zero-install service. For the user, it is entirely self-service, so that a non-expert, non-privileged user can launch instances, while supporting access to a familiar environment on a broad selection of hardware, including high-end GPUs and isolated bare-metal resources. The environment includes pre-configured browser-based desktop GUI and notebook configurations.
It can provide the option of end-user privileged access to the VM for flexible customization. It integrates with a simplified cost-monitoring and machine-management framework that provides visibility into commercial cloud charges and some budget guard rails, and it supports instance stop, restart, and pausing features to allow intermittent use and cost reduction.
Site-Wide HPC Data Center Demand Response
Daniel C Wilson; Ioannis Paschalidis; Ayse K. Coskun (Boston Univ.)
"As many electricity markets are trending towards greater renewable energy generation, there will be an increased need for
electrical grids to cooperatively balance electricity supply and demand. Data centers are one large consumer of electricity on a
global scale, and they are well-suited to act as a grid load stabilizer via performing ""demand response.""
Prior investigations in this space have demonstrated how data centers can continue to meet their users' quality of service (QoS)
needs by modeling relationships between cluster job queues, server power properties, and application performance. While server
power is a major factor in data center power consumption, other components such as cooling systems contribute a non-negligible
amount of electricity demand.
This work proposes using a simple site-wide (i.e., including all components of the data center) power model on top of QoS-aware
demand response solutions to achieve the QoS benefits of those solutions while improving the cost-saving opportunities in
demand response. We demonstrate 1.3x cost savings compared to QoS-aware demand response policies that do not utilize site-
wide power models, and show similar savings in cases of severely under-predicted site-wide power consumption if 1.5x relaxed
QoS constraints are allowed."
1-3: Quantum and Non-Deterministic Computing Session (14:15-15:30)
Co-Chairs: Patrick Dreher & Donato Kava
C2QA – Bosonic Qiskit [Outstanding Paper Award]
Timothy Stavenger (PNNL); Eleanor Crane (JQI, QuICS); Kevin Smith (Brookhaven National Laboratory, Yale Univ.); Christopher T
Kang (Univ. of Washington); Steven Girvin (Yale Univ.); Nathan Wiebe (Univ. of Toronto, PNNL)
The practical benefits of hybrid quantum information processing hardware that contains continuous-variable objects (bosonic
modes such as mechanical or electromagnetic oscillators) in addition to traditional (discrete-variable) qubits have recently been
demonstrated by experiments with bosonic codes that reach the break-even point for quantum error correction [1]–[5] and by
efficient Gaussian boson sampling simulation of the Franck-Condon spectra of triatomic molecules [6] that is well beyond the
capabilities of current qubit-only hardware. The goal of this Co-design Center for Quantum Advantage (C2QA) project is to develop
an instruction set architecture (ISA) for hybrid qubit/bosonic mode systems that contains an inventory of the fundamental
operations and measurements that are possible in such hardware. The corresponding abstract machine model (AMM) could also
contain a description of the appropriate error models associated with the gates, measurements and time evolution of the hardware.
This information has been implemented as an extension of IBM Qiskit, an open-source software development kit (SDK) for simulating the quantum state of a quantum circuit and for running the same circuits on prototype hardware within the IBM Quantum Experience. We introduce the Bosonic Qiskit software to enable the simulation of hybrid qubit/bosonic systems using the existing Qiskit software development kit [7]. This implementation can be used for simulating new hybrid systems, verifying proposed physical systems, and modeling systems larger than can currently be constructed. We also cover tutorials and example use cases included within the software for studying Jaynes-Cummings models and bosonic Hubbard models, plotting Wigner functions and animations, and calculating maximum likelihood estimations using Wigner functions.
Constructing Optimal Contraction Trees for Tensor Network Quantum Circuit Simulation [Outstanding Student Paper Award]
Cameron A Ibrahim (Univ. of Delaware); Danylo Lykov (Argonne National Laboratory); Zichang He (UC Santa Barbara); Yuri
Alexeev (Argonne National Laboratory); Ilya Safro (Univ. of Delaware)
One of the key problems in tensor network based quantum circuit simulation is the construction of a contraction tree which minimizes the cost of the simulation, where the cost can be expressed as the number of operations, a proxy for the simulation running time. This same problem arises in a variety of application areas, such as combinatorial scientific computing, marginalization in probabilistic graphical models, and solving constraint satisfaction problems. In this paper, we reduce the computationally hard portion of this problem to one of graph linear ordering, and demonstrate how existing approaches in this area can be utilized to achieve results up to several orders of magnitude better than existing state-of-the-art methods for the same running time. To do so, we introduce a novel polynomial-time algorithm for constructing an optimal contraction tree from a given order. Furthermore, we introduce a fast and high-quality linear ordering solver, and demonstrate its applicability as a heuristic for providing orderings for contraction trees. Finally, we compare our solver with competing methods for constructing contraction trees in quantum circuit simulation on a collection of randomly generated Quantum Approximate Optimization Algorithm Max-Cut circuits and show that our method achieves superior results on a majority of the tested quantum circuits.
Reproducibility: Our source code and data are available at https://github.com/cameton/HPEC2022_ContractionTrees.
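To see why the contraction tree matters, consider this toy Python example (hypothetical shapes, a matrix chain rather than a real circuit tensor network): two valid trees produce identical results at very different operation counts.

```python
import numpy as np

# A tensor-network "chain" A-B-C-D with hypothetical bond dimensions.
rng = np.random.default_rng(0)
A = rng.random((8, 4));  B = rng.random((4, 16))
C = rng.random((16, 2)); D = rng.random((2, 8))

def cost(left, right):            # multiply-adds for one pairwise contraction
    return left[0] * left[1] * right[1]

# Two contraction trees for the same network:
t1 = cost((8, 4), (4, 16)) + cost((8, 16), (16, 2)) + cost((8, 2), (2, 8))
t2 = cost((8, 4), (4, 16)) + cost((16, 2), (2, 8)) + cost((8, 16), (16, 8))
print(t1, t2)                     # 896 vs 1792 operations
assert np.allclose(((A @ B) @ C) @ D, (A @ B) @ (C @ D))  # same result
```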
Quantum Netlist Compiler (QNC)
Shamminuj Aktar; Abdel-Hameed A. Badawy (New Mexico State Univ.); Nandakishore Santhi (Los Alamos National Laboratory)
Over the last decade, quantum computing hardware has undergone rapid development and has become a very intriguing, promising, and active research field among scientists worldwide. To achieve the desired quantum functionalities, quantum algorithms require translation from a high-level description to a machine-specific physical operation sequence. In contrast to classical compilers, state-of-the-art quantum compilers are in their infancy. There is a research need for a quantum compiler that can deal with generic unitary operators and generate basic unitary operations according to quantum machines' diverse underlying technologies and characteristics. In this work, we introduce the Quantum Netlist Compiler (QNC), which converts arbitrary unitary operators or desired initial states of quantum algorithms to OpenQASM-2.0 circuits, enabling them to run on actual quantum hardware. Extensive simulations on IBM quantum systems and analysis of the results show that QNC is well suited for quantum circuit optimization and produces circuits with competitive success rates in practice.
Hardware Design and Implementation of Classic McEliece Post-Quantum Cryptosystem Based on FPGA
Shaofen Chen; Haiyan Lin; Wenjin Huang; Yihua Huang (Sun Yat-sen Univ.)
With the development of the information age, the security of data transmission has attracted more attention. In addition, quantum computers pose a great threat to widely used cryptography algorithms. The Classic McEliece algorithm is a post-quantum algorithm that offers high security and has stood firm against all kinds of attacks for decades. The wide application of a cryptosystem is inseparable from its hardware implementation scheme, so this paper proposes a Classic McEliece implementation scheme based on an FPGA platform. To achieve a balance between resources and speed, a variety of implementation methods are adopted. First, using the random-access characteristics of RAM, the clock-cycle consumption of the error-vector generation module is reduced by 95.1%. Second, multiple computing units are employed inside the module for parallel computing, which reduces the number of computing cycles by about 22.4%. Finally, this paper proposes a multiplexed syndrome decoding module; compared to the non-multiplexed scheme, LUT resource consumption is reduced by about 24.2% and FF resource consumption by about 15.4%.
Hardware Design and Implementation of Post-Quantum Cryptography Kyber
Qingru Zeng; Quanxin Li; Baoze Zhao; Han Jiao; Yihua Huang (Sun Yat-sen Univ.)
In order to resist quantum attacks, post-quantum cryptographic algorithms have become the focus of cryptography research. As a lattice-based key algorithm, the Kyber protocol has great advantages in the selection of post-quantum algorithms. This paper proposes an efficient hardware design scheme for Kyber512, whose security level is L1. The paper first designs a general hash module that reuses computing cores to improve resource utilization. A ping-pong RAM and a pipeline structure are used to design a general-purpose NTT processor that supports all operations on polynomial multiplication. Finally, the inter-module cooperation and data scheduling are compactly designed to shorten the working cycle. The top-level key generation, public key encryption, and private key decryption modules are implemented on an Artix-7 FPGA at a frequency of 204 MHz. The execution times of the corresponding modules are 11.5 μs, 17.3 μs, and 23.5 μs, respectively. Compared with the leading hardware implementation, this design reduces the area-delay product by 10.2%, achieving an effective balance between resources and speed.
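For context, the primitive such an NTT processor accelerates is negacyclic polynomial multiplication. Below is a naive Python sketch using Kyber's modulus q = 3329 but a toy length N = 8 (Kyber itself uses N = 256 with an incomplete NTT and fast butterfly kernels rather than this O(N^2) transform):

```python
# Negacyclic polynomial multiplication mod (x^N + 1, Q) via the
# number-theoretic transform with a psi pre-twist.
Q, N = 3329, 8                                  # Kyber's prime modulus
psi = next(w for w in range(2, Q) if pow(w, N, Q) == Q - 1)  # psi^N = -1

def dft(a, root):                               # naive transform over Z_Q
    return [sum(a[j] * pow(root, i * j, Q) for j in range(N)) % Q
            for i in range(N)]

def poly_mul(a, b):                             # a*b mod (x^N + 1, Q)
    w = pow(psi, 2, Q)                          # N-th root of unity
    at = [x * pow(psi, i, Q) % Q for i, x in enumerate(a)]    # pre-twist
    bt = [x * pow(psi, i, Q) % Q for i, x in enumerate(b)]
    ct = [x * y % Q for x, y in zip(dft(at, w), dft(bt, w))]  # pointwise
    c = dft(ct, pow(w, Q - 2, Q))               # inverse transform
    n_inv, psi_inv = pow(N, Q - 2, Q), pow(psi, Q - 2, Q)
    return [x * n_inv % Q * pow(psi_inv, i, Q) % Q for i, x in enumerate(c)]

# x * x^(N-1) = x^N = -1 under the negacyclic wrap-around
assert poly_mul([0, 1] + [0] * 6, [0] * 7 + [1]) == [Q - 1] + [0] * 7
```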
1-4: BRAIDS – Boosting Resilience through Artificial Intelligence and Decision Support Session (15:45-17:00)
Co-Chairs: Courtland VanDam & Sandeep Pisharody
Invited Talk: Welcome to CyberWar: Long Term Ramifications Unleashed by Russia’s War
Barry Greene (Akamai)
Distributed Hardware Accelerated Secure Joint Computation on the COPA Framework [Outstanding Student Paper Award]
Rushi Patel; Pouya Haghi (Boston Univ.); Shweta Jain; Andriy Kot; Venkata Krishnan (Intel); Mayank Varia; Martin Herbordt
(Boston Univ.)
Performance of distributed data center applications can be improved through the use of FPGA-based SmartNICs, which provide additional functionality and enable higher-bandwidth communication and lower latency. Until recently, however, the lack of a simple approach for customizing SmartNICs to application requirements has limited the potential benefits. Intel's Configurable Network Protocol Accelerator (COPA) provides a customizable FPGA framework that integrates both hardware and software development to improve computation and communication performance. In this first case study, we demonstrate the capabilities of the COPA framework with an application from cryptography -- secure Multi-Party Computation (MPC) -- that utilizes hardware accelerators connected directly to host memory and the COPA network. We find that using the COPA framework gives significant improvements to both computation and communication compared to traditional implementations of MPC that use CPUs and NICs. A single MPC accelerator running on COPA enables more than 17 Gb/s of communication bandwidth while using only 3% of Stratix 10 resources. We show that the COPA framework enables multiple MPC accelerators running in parallel to fully saturate a 100 Gbps link, enabling higher performance compared to traditional NICs.
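For background, the arithmetic core of MPC is computation over secret shares. Here is a toy Python sketch of additive secret sharing (the modulus, share count, and functions are illustrative assumptions, not COPA's protocol):

```python
import secrets

# Additive secret sharing over Z_{2^32}: each party holds a random-looking
# share; no single share reveals anything about the secret.
MOD = 2 ** 32

def share(x, parties=3):
    r = [secrets.randbelow(MOD) for _ in range(parties - 1)]
    return r + [(x - sum(r)) % MOD]                 # shares sum to x mod MOD

def add_shares(xs, ys):                             # addition is local:
    return [(a + b) % MOD for a, b in zip(xs, ys)]  # no communication needed

x, y = 1234, 5678
assert sum(add_shares(share(x), share(y))) % MOD == (x + y) % MOD
```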
Large Scale Enrichment and Statistical Cyber Characterization of Network Traffic
Ivan Kawaminami; Arminda Estrada; Youssef Elsakkary (Univ. of Arizona); Hayden Jananthan (MIT LLSC); Aydin Buluc (LBNL);
Tim Davis (Texas A&M Univ.); Daniel Grant (GreyNoise); Michael Jones (MIT LLSC); Chad Meiners (MIT Lincoln Laboratory);
Andrew Morris (GreyNoise); Sandeep Pisharody (MIT Lincoln Laboratory); Jeremy Kepner (MIT LLSC)
Modern network sensors continuously produce enormous quantities of raw data that are beyond the capacity of human analysts.
Cross-correlation of network sensors increases this challenge by enriching every network event with additional metadata. These
large volumes of enriched network data present opportunities to statistically characterize network traffic and quickly answer a key
question: “What are the primary cyber characteristics of my network data?” The Python GraphBLAS and PyD4M analysis
frameworks enable anonymized statistical analysis to be performed quickly and efficiently on very large network data sets. This
approach is tested using billions of anonymized network data samples from the largest Internet observatory (CAIDA Telescope)
and tens of millions of anonymized records from the largest commercially available background enrichment capability (GreyNoise).
The analysis confirms that most of the enriched variables follow expected heavy-tail distributions and that a large fraction of the
network traffic is due to a small number of cyber activities. This information can simplify the cyber analysts’ task by enabling
prioritization of cyber activities based on statistical prevalence.
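As a toy illustration of such a heavy-tail characterization (synthetic Zipf-distributed data standing in for the CAIDA/GreyNoise corpora; not the paper's GraphBLAS/D4M pipeline):

```python
import numpy as np

# Synthetic stand-in for enriched traffic records: Zipf-distributed
# "source IDs" reproduce the heavy-tail shape described above.
rng = np.random.default_rng(1)
sources = rng.zipf(2.0, size=100_000)
_, counts = np.unique(sources, return_counts=True)
counts = np.sort(counts)[::-1]                  # busiest sources first
top = counts[: max(1, len(counts) // 100)].sum()
print(f"top 1% of sources account for {100 * top / counts.sum():.1f}% "
      "of all records")
```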
Edge Computing Security for a Multi-Agent System
Alice Lee; Karen Gettings; Matthias Beebe; Paul Monticciolo; Michael Vai (MIT Lincoln Laboratory)
We examine the computational energy requirements of different systems driven by the geometrical scaling law (known as Moore's law, or Dennard scaling for geometry) and the increasing use of Artificial Intelligence/Machine Learning (AI/ML) over the last decade. With more scientific and technology applications based on data-driven discovery, machine learning methods, especially deep neural networks, have become widely used. In order to enable such applications, both hardware accelerators and advanced AI/ML methods have led to the introduction of new architectures, system designs, algorithms, and software. Our analysis of energy trends indicates three important observations: 1) energy efficiency due to geometrical scaling is slowing down; 2) energy efficiency at the bit level does not translate into efficiency at the instruction level, or at the system level, for a variety of systems, especially large-scale supercomputers; 3) at the application level, general-purpose ML/AI methods can be computationally energy-intensive, offsetting the gains in energy from geometrical scaling and special-purpose accelerators. Further, our analysis provides specific pointers for integrating energy efficiency with performance analysis for enabling ML/AI-driven and high-performance computing applications in the future.
Invited Talk: Proposed Empirical Assessment of Remote Workers’ Cyberslacking and Computer Security Posture to
Assess Organizational Cybersecurity Risks
Ariel Luna; Yair Levy; Greg Simco; Wei Li (Nova Southeastern University)
Cyberslacking occurs when employees use their companies' equipment and network for personal purposes instead of working during work hours. Cyberslacking has a significant adverse effect on overall employee productivity; moreover, with the recent COVID-19-driven move to remote working, it also poses a cybersecurity risk to organizations' networks and infrastructure. In this work-in-progress research study, we are developing, validating, and will empirically test a taxonomy to assess an organization's remote workers' risk level of cybersecurity threats. The study takes a three-phase developmental approach to constructing the Remote Worker Cyberslacking Security Risk Taxonomy, in collaboration with cybersecurity Subject Matter Experts (SMEs). The taxonomy assesses an organization's remote workers' cybersecurity risk level by using actual system indicators of productivity measures to estimate cyberslacking, along with organizational information about the computer security posture of the remote device being used to access corporate resources. Anticipated results from 125 anonymous employees at one organization will then be assessed with the cybersecurity risk taxonomy, and recommendations will be provided to the organization's cybersecurity leadership.
1-S1: Sky Computing – Toward Efficient Computing on the Cloud Special Session (17:30-19:30)
Organizers: Marco Montes de Oca, Luna Xu, Erica Lin, Suraj Bramhavar, Jeffrey Chou (Sync Computing)
Running Spark Applications In Large Scale On K8s: Challenges and Solutions
Bo Yang (Stealth Startup)
Taming High-Performance Computing Platform Heterogeneity with Machine Learning
Prasanna Balaprakash (Argonne National Laboratory)
Optimizing Heterogeneous Computing Resources Based Only on Cost and Time
Suraj Bramhavar (Sync Computing)
AI-Powered Acceleration of Deep Learning Inference on the Cloud
Glenn Ko (Stochastic)
Cost-Effective Batch Scheduling in the Cloud
Chaoran Yu (Apple)