2020 IEEE High Performance Extreme Computing Virtual Conference 21 - 25 September 2020
Monday, September 21

1-1: General Purpose GPU Computing Session (11:00-12:15 EDT)
Minesweeper: A Novel and Fast Ordered-Statistic CFAR Algorithm
Carl Colena (Lockheed Martin Advanced Technology Laboratories)*; Michael Russell (Lockheed Martin Advanced Technology Laboratories); Stephen Braun (Lockheed Martin Advanced Technology Laboratories)
A novel algorithm named 'Minesweeper' was developed for computing the Ordered-Statistic Constant False Alarm Rate (OS-CFAR) in a computationally efficient way. OS-CFAR processing chains are used in radar applications for noise-floor estimation and target discrimination. Unlike other approaches, this algorithm aims to minimize data reuse by using training-cell geometry and an accumulation matrix to compute the noise estimate. Computing the OS-CFAR in this manner affords some unique efficiencies that are novel for this application, including runtime invariance with respect to the bit depth of the input data and to the training geometry. Three implementations of Minesweeper were developed and benchmarked. The Optimized GPU Implementation (GPU-OPT) performed the best in both throughput and latency for large inputs. This algorithm has potential for use in real-time GPU-accelerated SDR applications.

Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs
Cade Brown (UTK)*; Ahmad Abdelfattah (UTK); Stanimire Tomov (University of Tennessee); Jack Dongarra (University of Tennessee)
Dense linear algebra (DLA) has historically been in the vanguard of software that must be adapted first to hardware changes. This is both because DLA is critical to the accuracy and performance of so many different types of applications, and because its routines have proved to be outstanding vehicles for finding and implementing solutions to the problems that novel architectures pose. Therefore, in this paper we investigate the portability of the MAGMA DLA library to the latest AMD GPUs. We use automated tools to convert the CUDA code in MAGMA to the Heterogeneous-Computing Interface for Portability (HIP) language. MAGMA provides LAPACK for GPUs and benchmarks for fundamental DLA routines ranging from BLAS to dense factorizations, linear systems, and eigenproblem solvers. We port these routines to HIP and quantify currently achievable performance through the MAGMA benchmarks for the main workload algorithms on MI25 and MI50 AMD GPUs. Comparisons with performance roofline models and theoretical expectations are used to identify current limitations and directions for future improvements.

A Deep Q-Learning Approach for GPU Task Scheduling
Ryan Luley (Air Force Research Laboratory)*; Qinru Qiu (Syracuse University)
Efficient utilization of resources is critical to system performance and effectiveness for high performance computing systems. In a graphics processing unit (GPU)-based system, one method for enabling higher utilization is concurrent kernel execution, which allows multiple independent kernels to execute simultaneously on the GPU. However, resource contention arising from the manner in which kernel tasks are scheduled may still lead to suboptimal task performance and utilization. In this work, we present a deep Q-learning approach to identify an ordering for a given set of tasks which achieves near-optimal average task performance and high resource utilization. Our solution outperforms other similar approaches and has the additional benefit of being adaptable to dynamic task characteristics or GPU resource configurations.
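As a rough illustration of the idea behind the Q-learning scheduler above (not the authors' implementation, which uses a deep network and measured GPU behavior), the sketch below applies plain tabular Q-learning to ordering a small set of tasks; the task names, durations, and contention penalties are invented placeholders.

```python
# Toy tabular Q-learning for ordering a small set of GPU kernel tasks. The task
# names, durations, and pairwise "contention" penalties are invented placeholders;
# a real scheduler would use measured GPU behavior (and a deep network, as in the
# paper, rather than a lookup table).
import random
from collections import defaultdict

TASKS = ("fft", "gemm", "conv", "reduce")
DURATION = {"fft": 3.0, "gemm": 5.0, "conv": 4.0, "reduce": 1.0}
# Hypothetical extra cost when task b runs immediately after task a.
CONTENTION = {("gemm", "conv"): 2.0, ("conv", "gemm"): 2.0, ("fft", "gemm"): 1.0}

Q = defaultdict(float)            # Q[(state, action)], state = (remaining tasks, last task)
alpha, gamma, eps = 0.1, 1.0, 0.2

def step_cost(last, action):
    return DURATION[action] + CONTENTION.get((last, action), 0.0)

for _ in range(20000):
    remaining, last = frozenset(TASKS), None
    while remaining:
        state = (remaining, last)
        if random.random() < eps:
            action = random.choice(sorted(remaining))
        else:
            action = max(sorted(remaining), key=lambda a: Q[(state, a)])
        nxt = (remaining - {action}, action)
        reward = -step_cost(last, action)          # minimize total completion time
        best_next = max((Q[(nxt, a)] for a in nxt[0]), default=0.0)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        remaining, last = nxt

# Greedy rollout of the learned ordering policy.
remaining, last, order = frozenset(TASKS), None, []
while remaining:
    action = max(sorted(remaining), key=lambda a: Q[((remaining, last), a)])
    order.append(action)
    remaining, last = remaining - {action}, action
print("learned order:", order,
      "total cost:", sum(step_cost(a, b) for a, b in zip((None,) + tuple(order), order)))
```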
GPU-Accelerated Discontinuous Galerkin Methods: 30x Speedup on 345 Billion Unknowns
Andrew Kirby (Massachusetts Institute of Technology Lincoln Laboratory)*; Dimitri J. Mavriplis (University of Wyoming)
A discontinuous Galerkin method for the discretization of the compressible Euler equations, the governing equations of inviscid fluid dynamics, on Cartesian meshes is developed for use on Graphics Processing Units via OCCA, a unified approach to performance portability on multi-threaded hardware architectures. A 30x time-to-solution speedup over CPU-only implementations using non-CUDA-Aware MPI communications is demonstrated on up to 1,536 NVIDIA V100 GPUs, and parallel strong scalability is shown up to 6,144 NVIDIA V100 GPUs for a problem containing 345 billion unknowns. A comparison of CUDA-Aware MPI communication to non-GPUDirect communication is performed, demonstrating an additional 24% speedup on eight nodes composed of 32 NVIDIA V100 GPUs.

Energy-Efficient Analysis of Synchrophasor Data using the NVIDIA Jetson Nano
Suzanne Matthews (US Military Academy)*; Aaron St. Leger (US Military Academy)
Smart Grid technology is an important part of increasing the resilience and reliability of power grids. Applying Phasor Measurement Units (PMUs) to obtain synchronized phasor measurements, or synchrophasors, provides more detailed, higher-fidelity data that can enhance situational awareness by rapidly detecting anomalous conditions. However, sample rates of PMUs are up to three orders of magnitude faster than traditional telemetry, resulting in large datasets that require novel computing methods to process the data quickly and efficiently. This work aims to improve the calculation speed and energy efficiency of anomaly detection by leveraging manycore computing on an NVIDIA Jetson Nano. It translates an existing PMU anomaly detection scheme into a novel GPU-compute algorithm and compares the computational performance and energy efficiency of the GPU approach to serial and multicore CPU methods. The GPU algorithm was benchmarked on a real dataset of 11.3 million measurements derived from 8 PMUs in a 1:1000-scale emulation of a power grid, and on two additional datasets derived from the original dataset. Results show that the GPU detection scheme is up to 51.91 times faster than the serial method and over 13 times faster than the multicore method. Additionally, the GPU approach exhibits up to a 92.3% run-time energy reduction compared to the serial method and a 78.4% reduction compared to the multicore approach.

1-2: High Performance Data Analysis Session (12:30-13:45 EDT)

Large-scale Sparse Tensor Decomposition Using a Damped Gauss-Newton Method
Teresa Ranadive (Laboratory for Physical Sciences)*; Muthu Baskaran (Reservoir Labs)
CANDECOMP/PARAFAC (CP) tensor decomposition is a popular unsupervised machine learning method with numerous applications. This process models a high-dimensional, multi-modal array (a tensor) as the sum of several low-dimensional components. To decompose a tensor, one must solve an optimization problem whose objective is often the sum of squared differences between the entries of the tensor and those of the decomposition model. One algorithm occasionally utilized to solve such problems is CP-OPT-DGN, a damped Gauss-Newton all-at-once optimization method for CP tensor decomposition. However, there are currently no published results that consider the decomposition of large-scale (up to billions of non-zeros), sparse tensors using this algorithm. This work considers the decomposition of large-scale tensors using an efficiently implemented CP-OPT-DGN method. It is observed that CP-OPT-DGN significantly outperforms CP-ALS (CP-Alternating Least Squares) and CP-OPT-QNR (a quasi-Newton-Raphson all-at-once optimization method for CP tensor decomposition), two other widely used tensor decomposition algorithms, in terms of accuracy and latent behavior detection.
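For context, the least-squares objective described in the CP-OPT-DGN abstract can be written as below for a third-order tensor; the damped Gauss-Newton step shown with it is the generic form of such a method, not necessarily the exact update used in the paper.

```latex
% CP least-squares objective for a third-order tensor X with rank-R factor
% matrices A, B, C (entries a_{ir}, b_{jr}, c_{kr}):
\[
  f(A,B,C) \;=\; \tfrac{1}{2} \sum_{i,j,k}
    \Bigl( x_{ijk} - \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr} \Bigr)^{2}
\]
% Generic damped Gauss-Newton step on the stacked factor variables v = (A,B,C),
% with residual vector r, Jacobian J, and damping parameter \lambda:
\[
  (J^{\top} J + \lambda I)\,\delta = -J^{\top} r,
  \qquad v \leftarrow v + \delta
\]
```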
Multiscale Data Analysis Using Binning, Tensor Decompositions, and Backtracking
Dimitri Leggas (Reservoir Labs, Inc.); Thomas Henretty (Reservoir Labs, Inc.)*; James Ezick (Reservoir Labs); Muthu Baskaran (Reservoir Labs); Brendan von Hofe (Reservoir Labs, Inc.); Grace Cimaszewski (Reservoir Labs, Inc.); M. Harper Langston (Reservoir Labs, Inc.); Richard Lethin (Reservoir Labs)
Large data sets can contain patterns at multiple scales (spatial, temporal, etc.). In practice, it is useful for data exploration techniques to detect patterns at each relevant scale. In this paper, we develop an approach to detect activities at multiple scales using tensor decomposition, an unsupervised high-dimensional data analysis technique that finds correlations between different features in the data. This method typically requires that feature values be discretized during the construction of the tensor in a process called "binning." We develop a method of constructing and decomposing tensors with different binning schemes of various features in order to uncover patterns across a set of user-defined scales. While binning is necessary to obtain interpretable results from tensor decompositions, it also decreases the specificity of the data. Thus, we develop backtracking methods that enable the recovery of the original source data corresponding to patterns found in the decomposition. These techniques are discussed in the context of spatiotemporal and network traffic data, and in particular on Automatic Identification System (AIS) data.

SparTen: Leveraging Kokkos for On-node Parallelism in a Second-Order Method for Fitting Canonical Polyadic Tensor Models to Poisson Data
Keita Teranishi (Sandia National Laboratories)*; Daniel Dunlavy (Sandia National Laboratories); Jeremy Myers (College of William & Mary, Sandia National Laboratories); Richard Barrett (Sandia National Laboratories)
Canonical Polyadic tensor decomposition using alternating Poisson regression (CP-APR) is an effective analysis tool for large sparse count datasets. One of its variants, which uses projected damped Newton optimization for row subproblems (PDNR), offers quadratic convergence and is amenable to parallelization. Despite its potential effectiveness, PDNR performance on modern high performance computing (HPC) systems is not well understood. To remedy this, we have developed a parallel implementation of PDNR using Kokkos, a performance-portable parallel programming framework supporting efficient execution of a single code base on multiple HPC systems. We demonstrate that the performance of parallel PDNR can be poor if the load imbalance associated with the irregular distribution of nonzero entries in the tensor data is not addressed. Preliminary results using tensors from the FROSTT data set indicate that using multiple kernels to address this imbalance when solving the PDNR row subproblems in parallel can improve performance, with up to 80% speedup on CPUs and 10-fold speedup on NVIDIA GPUs.
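For context, the Poisson model that CP-APR (and hence PDNR) fits minimizes the following negative log-likelihood, up to a constant; this is the standard objective from the CP-APR literature, shown here as background rather than text taken from the SparTen paper.

```latex
% Poisson (count-data) objective minimized by CP-APR for a sparse count tensor X,
% with nonnegative factor matrices A, B, C and low-rank model entries m_{ijk}:
\[
  f(A,B,C) \;=\; \sum_{i,j,k} \bigl( m_{ijk} - x_{ijk} \log m_{ijk} \bigr),
  \qquad m_{ijk} = \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr}
\]
```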
Scalable Data Generation for Evaluating Mixed-Precision Solvers
Piotr Luszczek (University of Tennessee)*; Yaohung Tsai (University of Tennessee); Neil Lindquist (University of Tennessee); Hartwig Anzt (University of Tennessee); Jack Dongarra (University of Tennessee)
We present techniques for generating data for mixed-precision solvers that allow those solvers to be tested in a scalable manner. Our techniques focus on mixed-precision hardware and software in which both the solver and the hardware can take advantage of mixing multiple floating-point formats. This makes it possible to exploit the recently released generation of hardware platforms that target ML and DNN workloads but can also be used for HPC applications, provided a new breed of algorithms is combined with the custom floating-point formats to deliver performance beyond the standard IEEE data types while maintaining comparable accuracy in the results.

Parameter Sensitivity Analysis of the SparTen High Performance Sparse Tensor Decomposition Software
Jeremy Myers (College of William & Mary, Sandia National Laboratories)*; Daniel Dunlavy (Sandia National Laboratories); Keita Teranishi (Sandia National Laboratories); D.S. Hollman (Sandia National Laboratories)
Tensor decomposition models play an increasingly important role in modern data science applications. One problem of particular interest is fitting a low-rank Canonical Polyadic (CP) tensor decomposition model when the tensor has sparse structure and the tensor elements are nonnegative count data. SparTen is a high-performance C++ library that computes a low-rank decomposition using different solvers: a first-order quasi-Newton method or a second-order damped Newton method, along with the appropriate choice of runtime parameters. Because the default parameters in SparTen are tuned to experimental results from prior published work that used MATLAB implementations of these methods on a single real-world dataset, it remains unclear whether the parameter defaults in SparTen are appropriate for general tensor data. Furthermore, it is unknown how sensitive algorithm convergence is to changes in the input parameter values. This report addresses these unresolved issues with large-scale experimentation on three benchmark tensor data sets. Experiments were conducted on several different CPU architectures and replicated with many initial states to establish generalized profiles of algorithm convergence behavior.

1-3: Multicore Software Technologies Session (14:15-15:30 EDT)

Work-Efficient Parallel Algorithms for Accurate Floating-Point Prefix Sums
Sean Fraser (MIT); Helen Xu (MIT)*; Charles E. Leiserson (MIT CSAIL)
Existing work-efficient parallel algorithms for floating-point prefix sums exhibit either good performance or good numerical accuracy, but not both. Consequently, prefix-sum algorithms cannot easily be used in scientific-computing applications that require both high performance and accuracy. We have designed and implemented two new algorithms, called CAST_BLK and PAIR_BLK, whose accuracy is significantly higher than that of the high-performing prefix-sum algorithm from the Problem Based Benchmark Suite (PBBS), while running with comparable performance on modern multicore machines. Specifically, the root mean squared error of the PBBS code on a large array of uniformly distributed 64-bit floating-point numbers is 8 times higher than that of CAST_BLK and 5.8 times higher than that of PAIR_BLK. These two codes employ the PBBS three-stage strategy for performance, but they are designed to achieve high accuracy, both theoretically and in practice. A vectorization enhancement to these two scalar codes trades off a small amount of accuracy to match or outperform the PBBS code while still maintaining lower error.
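The three-stage blocked strategy mentioned above (per-block sums, a scan of the block sums, then a local scan of each block with its offset) can be sketched as follows. This is a generic serial illustration of the structure, not the CAST_BLK or PAIR_BLK code; a parallel implementation would run stages 1 and 3 across blocks concurrently.

```python
# Generic three-stage blocked (inclusive) prefix sum, illustrating the structure
# used by parallel prefix-sum codes such as those in PBBS. Stages 1 and 3 are
# independent across blocks; they are shown serially here for clarity.
import numpy as np

def blocked_prefix_sum(x, block_size=4):
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    blocks = [x[i:i + block_size] for i in range(0, len(x), block_size)]

    # Stage 1: per-block totals (parallel across blocks).
    block_sums = np.array([b.sum() for b in blocks])

    # Stage 2: exclusive scan of the block totals (small, done serially).
    offsets = np.concatenate(([0.0], np.cumsum(block_sums)[:-1]))

    # Stage 3: local inclusive scan of each block, shifted by its offset
    # (parallel across blocks).
    for i, b in enumerate(blocks):
        out[i * block_size : i * block_size + len(b)] = offsets[i] + np.cumsum(b)
    return out

x = np.random.rand(10)
assert np.allclose(blocked_prefix_sum(x), np.cumsum(x))
```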
Machine Learning Algorithm Performance on the Lucata Computer
Paul Springer (JPL)*
A new parallel computing paradigm has recently become available, one that combines a processor-in-memory (PIM) architecture with the use of many lightweight threads, where each thread migrates automatically to the memory it uses. Our effort focuses on producing performance gains on this architecture for a key machine learning algorithm, Random Forest, that are at least linear in the number of cores. Beyond that, we show that a data distribution that groups test samples and trees by feature improves run times by a factor of more than double the number of cores in the machine.

Automatic Mapping and Optimization to Kokkos with Polyhedral Compilation
Muthu Baskaran (Reservoir Labs)*; Charles Jin (Massachusetts Institute of Technology); Benoit Meister (Reservoir Labs); Jonathan Springer (Reservoir Labs)
In the post-Moore's Law era, the quest for exascale computing has resulted in diverse hardware architecture trends, including novel custom and/or specialized processors to accelerate systems, asynchronous or self-timed computing cores, and near-memory computing architectures. To contend with such heterogeneous and complex hardware targets, advanced software solutions have emerged in the form of new programming models and runtimes. However, using these advanced programming models poses productivity and performance-portability challenges. This work takes a significant step towards addressing the performance, productivity, and performance-portability challenges faced by the high-performance computing and exascale community. We present an automatic mapping and optimization framework that takes sequential code and automatically generates high-performance parallel code in Kokkos, a performance-portable parallel programming model targeted at exascale computing. We demonstrate the productivity and performance benefits of optimized mapping to Kokkos using kernels from a critical climate-modeling application, the Energy Exascale Earth System Model (E3SM) project. This work thus shows that automatic generation of Kokkos code enhances the productivity of application developers and enables them to fully utilize the benefits of a programming model such as Kokkos.

Implementing Sparse Linear Algebra Kernels on the Lucata Pathfinder-A Computer
Geraud Krawezik (Emu Technology)*; Shannon Kuntz (Emu Technology); Peter Kogge (University of Notre Dame)
We present the implementation of two sparse linear algebra kernels on a migratory memory-side processing architecture. The first is Sparse Matrix-Vector (SpMV) multiplication, and the second is the Symmetric Gauss-Seidel (SymGS) method. Both were chosen because they account for the largest share of the run time of the HPCG benchmark. We introduce the system used for the experiments, as well as its programming model and the key aspects of getting the most performance from it. We describe the data distribution used to enable an efficient parallelization of the algorithms, and their actual implementations. We then present hardware results and simulator traces to explain their behavior. We show almost linear strong scaling for the code, and discuss future work and improvements.
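For readers unfamiliar with the kernels named above, a compressed sparse row (CSR) matrix-vector multiply, the SpMV kernel at the heart of HPCG, looks like the following. This is a generic reference implementation, not the Lucata Pathfinder code, which distributes rows and vector elements across memory-side processors.

```python
# Generic CSR sparse matrix-vector multiply (SpMV), the kernel the Lucata work
# parallelizes. values/col_idx/row_ptr are the standard CSR arrays.
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        # Accumulate the dot product of row i's nonzeros with x.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# 3x3 example:  [[2, 0, 1],
#                [0, 3, 0],
#                [4, 0, 5]]
values  = np.array([2.0, 1.0, 3.0, 4.0, 5.0])
col_idx = np.array([0, 2, 1, 0, 2])
row_ptr = np.array([0, 2, 3, 5])
x = np.array([1.0, 1.0, 1.0])
print(spmv_csr(values, col_idx, row_ptr, x))   # -> [3. 3. 9.]
```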
A Scalable Architecture for CNN Accelerators Leveraging High-Performance Memories
Maarten Hattink (Columbia University)*; Giuseppe Di Guglielmo (Columbia University); Luca Carloni (Columbia University); Keren Bergman (Columbia University)
As FPGA-based accelerators become ubiquitous and more powerful, the demand for integration with High-Performance Memory (HPM) grows. Although HPMs offer much greater bandwidth than standard DDR4 DRAM, they introduce new design challenges such as increased latency and a greater bandwidth mismatch between memory and FPGA cores. This paper presents a scalable architecture for convolutional neural network accelerators conceived specifically to address these challenges and make full use of the memory's high bandwidth. The accelerator, which was designed using high-level synthesis, is highly configurable. The intrinsic parallelism of its architecture allows near-perfect scaling until the available memory bandwidth is saturated.

1-4: Quantum & Novel Computing Session (15:45-17:00 EDT)

Invited Talk: The Need for Hardware-Accelerated Combinatorial Optimization
Dr. Jeffrey Chou (Sync Computing); Dr. Suraj Bramhavar (Sync Computing)
Abstract Not Available

Invited Talk: Advances in Algorithms for Near-Term Quantum Computers
Dr. Yudong Cao (Zapata Computing)
Abstract Not Available

Invited Talk: Post Quantum Cryptography (PQC) - An Overview
Manoj Kumar (IBM)*; Pratap Pattnaik (IBM)
We discuss the Post Quantum Cryptography algorithms for key establishment under consideration by NIST for standardization. Three of these, CRYSTALS-Kyber, Classic McEliece, and Supersingular Isogeny Key Encapsulation (SIKE), are representatives of the three classes of hard problems underlying the security of almost all 69 candidate algorithms accepted by NIST for consideration in round 1 of evaluation. For each algorithm, we briefly describe the hard problem underlying its cryptographic strength, the algebraic structure (i.e., the groups or finite fields) underlying the computations, the basic computations performed in the algorithm, the algorithm itself, and the performance considerations for efficient implementation of the basic algorithm on conventional many-core processors. For CRYSTALS-Kyber and SIKE, we discuss potential solutions to improve their performance on many-core processors.

Homomorphic Encryption for Quantum Annealing with Spin Reversal Transformations
Daniel O'Malley (Los Alamos National Laboratory)*; John Golden (Los Alamos National Laboratory)
Homomorphic encryption has been an area of study in classical computing for decades. The fundamental goal of homomorphic encryption is to enable an untrusted party (Oscar) to perform a computation for Alice without Oscar knowing either the input to the computation or its output. Alice encrypts the input before sending it to Oscar, and Oscar performs the computation directly on the encrypted data, producing an encrypted result. Oscar then sends the encrypted result of the computation back to Alice, who can decrypt it. We describe an approach to homomorphic encryption for quantum annealing based on spin reversal transformations and show that it comes with little or no performance penalty. This is in contrast to approaches to homomorphic encryption for classical computing, which incur a significant additional computational cost. This implies that the performance gap between quantum annealing and classical computing is reduced when both paradigms use homomorphic encryption. Further, homomorphic encryption is critical for quantum annealing because quantum annealers are native to the cloud: a third party (such as untrusted Oscar) performs the computation. If sensitive information, such as health-related data subject to the Health Insurance Portability and Accountability Act, is to be processed with quantum annealers, such a technique could be useful.
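The spin-reversal (gauge) transformation the scheme above builds on is easy to state: flipping a chosen subset of spins, together with matching sign changes of the local fields and couplers, leaves the Ising energy unchanged. The check below is a generic illustration of that identity, not the authors' encryption protocol.

```python
# Spin-reversal (gauge) transformation on a random Ising instance:
# s_i -> g_i s_i with g_i in {-1,+1}, h_i -> g_i h_i, J_ij -> g_i g_j J_ij
# leaves the energy E(s) = sum_i h_i s_i + sum_{i<j} J_ij s_i s_j unchanged.
import numpy as np

rng = np.random.default_rng(0)
n = 6
h = rng.normal(size=n)
J = np.triu(rng.normal(size=(n, n)), k=1)        # upper-triangular couplers
s = rng.choice([-1, 1], size=n)                  # a spin configuration
g = rng.choice([-1, 1], size=n)                  # random spin-reversal mask

def energy(h, J, s):
    return h @ s + s @ J @ s

h2 = g * h                      # transformed fields
J2 = np.outer(g, g) * J         # transformed couplers
s2 = g * s                      # transformed configuration

assert np.isclose(energy(h, J, s), energy(h2, J2, s2))
print("energies match:", energy(h, J, s))
```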
Constrained-optimization Approach Delivers Superior Classical Performance for Graph Partitioning via Quantum-ready Method
Steven Reinhardt (Quantum Computing Inc.)*; Uchenna Chukwu (Quantum Computing Inc.); Raouf Dridi (Quantum Computing Inc.); Jesse Berwald (Quantum Computing Inc.); Michael Booth (Quantum Computing Inc.); John Dawson (Quantum Computing Inc.); DeYung Le (Quantum Computing Inc.); Mark Wainger (Quantum Computing Inc.)
Graph partitioning is one of an important set of well-known compute-intense (NP-hard) graph problems that devolve to discrete constrained optimization. We sampled solutions to the problem via two different quantum-ready methods to understand the strengths and weaknesses of each method. First, we formulated and sampled the problem as a quadratic unconstrained binary optimization (QUBO) problem, via the best known QUBO formulation, using a best-in-class QUBO sampler running purely classically. Second, we formulated the problem at a higher level, as a set of constraints and an objective function, and sampled it with a recently developed constrained-optimization sampler (which generates QUBOs and also samples them classically). We find that both approaches often deliver better partitions than the purpose-built classical graph partitioners. Further, we find that the constrained-optimization approach is often able to deliver better partitions in less time than the bespoke-QUBO approach, without specific knowledge of the graph-partitioning problem. Stepping back from graph partitioning itself, one key controversial question is whether bespoke algorithms for high-value problems or general tools for a class of problems are more likely to deliver the power of QCs to a broad market of real-world users. These results bear on that question, though they use only a few instances and require confirmation on other problems and other instances, as well as replacement of the low-level sampler by a QC instead of a classical software sampler. Still, this early evidence supports the proposition that general tools may contribute significant benefit to a range of problems, expanding the impact of QCs on industry and society. The fact that this benefit is independent of the low-level sampler employed, whether classical software or one of a variety of QC architectures, reinforces the need for further work on high-level optimization. The commercial availability in the cloud of such software today, delivering superior classical performance for some problems, enables quantum-forward organizations to migrate to quantum-ready methods now, accelerating progress toward quantum advantage and ensuring sampler software evolves to address the problems such organizations value.
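As background for the "best known QUBO formulation" mentioned above, one standard Ising-style encoding of balanced graph bipartitioning (in the spirit of well-known textbook formulations, not necessarily the exact one used in the paper) is the following.

```latex
% Balanced bipartition of a graph G = (V, E): spin s_i = +1 or -1 places vertex i
% in one of the two parts. The first term penalizes imbalance (weight \alpha > 0),
% the second counts cut edges.
\[
  H(s) \;=\; \alpha \Bigl( \sum_{i \in V} s_i \Bigr)^{2}
        \;+\; \sum_{(i,j) \in E} \frac{1 - s_i s_j}{2}
\]
```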
1-S2: BRAIDS Special (17:30-19:30 EDT)

AI at the Tactical Edge
Alexia Schulz (MIT Lincoln Laboratory)*; Pierre Trepagnier (MIT Lincoln Laboratory)
Artificial intelligence (AI) and machine learning (ML) are assuming an increasingly prominent role in the modern armamentarium of the warfighter, ideally acting as an always-on-duty assistant. In this extended abstract we explore aspects of AI/ML that are particularly characteristic of its deployment at the tactical edge, by which we mean warfighters directly involved in executing the mission at the "tip of the spear." AI intrinsically depends on compute power and communications. At the tactical edge, both of these resources are generally in short supply, expensive to provision, and shared among contending needs at the best of times, let alone in more critical situations. Here we enumerate a number of possible applications of AI/ML at the tactical edge, characterizing them by features such as the compute power and data required, both at training time and at run time. From these illustrative examples we generalize a set of suitability characteristics for tactical-edge AI applications to best ensure that they contribute to warfighter resilience rather than fail at the time of greatest need.

Multi-Temporal Analysis and Scaling Relations of 100,000,000,000 Network Packets
Jeremy Kepner (MIT Lincoln Laboratory)*; Chad Meiners (MIT); Chansup Byun (MIT Lincoln Laboratory); Sarah McGuire (MIT); Timothy Davis (Texas A&M University); William Arcand (MIT); Jonathan Bernays (MIT); David Bestor (MIT); William Bergeron (MIT); Vijay Gadepally (MIT Lincoln Laboratory); Raul Harnasch (MIT); Matthew Hubbell (MIT Lincoln Laboratory); Michael Hurray (MIT); Michael Jones (MIT Lincoln Laboratory); Anna Klein (MIT); Lauren Milechin (MIT); Julie Mullen (MIT Lincoln Laboratory); Andrew Prout (MIT); Albert Reuther (MIT Lincoln Laboratory); Antonio Rosa (MIT); Siddharth Samsi (MIT Lincoln Laboratory); Doug Stetson (MIT); Adam Tse (MIT); Peter Michaleas (MIT Lincoln Laboratory)
Our society has never been more dependent on computer networks. Effective utilization of networks requires a detailed understanding of the normal background behaviors of network traffic. Large-scale measurements of networks are computationally challenging.
Building on prior work in interactive supercomputing and GraphBLAS hypersparse hierarchical traffic matrices, we have developed an efficient method for computing a wide variety of streaming network quantities on diverse time scales. Applying these methods to 100,000,000,000 anonymized source-destination pairs collected at a network gateway reveals many previously unobserved scaling relationships. These observations provide new insights into normal network background traffic that could be used for anomaly detection, AI feature engineering, and testing theoretical models of streaming networks.
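A toy illustration of the kind of source-destination traffic matrix underlying this analysis follows; it is a generic sparse-matrix sketch with invented packet data, not the GraphBLAS hypersparse pipeline used in the paper, where each packet increments one (source, destination) entry and simple network quantities fall out of row and column reductions.

```python
# Toy source-destination traffic matrix built from a packet stream, illustrating
# the kind of quantities computed at scale with hypersparse matrices. The packet
# list here is invented example data.
import numpy as np
from scipy.sparse import coo_matrix

# (source, destination) pairs for a handful of "packets" over a time window.
packets = [(1, 7), (1, 7), (2, 7), (1, 3), (5, 3), (2, 7), (1, 7)]
src, dst = np.array(packets).T
n = 8                                   # size of the (anonymized) address space
A = coo_matrix((np.ones(len(packets)), (src, dst)), shape=(n, n)).tocsr()

print("total packets      :", A.sum())
print("unique links       :", A.nnz)
print("max source fan-out :", (A > 0).sum(axis=1).max())   # distinct destinations per source
print("max dest fan-in    :", (A > 0).sum(axis=0).max())   # distinct sources per destination
```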