IEEE High Performance Extreme Computing

2022 IEEE High Performance Extreme Computing Virtual Conference 19 - 23 September 2022

5-V: Sponsor Showcase – Dell Session (10:30-11:00)
Co-Chairs: Albert Reuther

How to Wrestle with Global Scale ML and Win
Dr. Ben Fauber (Dell Technologies)

5-1: High Performance Data Analysis 1 Session (11:00-12:15)
Co-Chairs: Darrell Ricke & Ken Cain

Optimizing Performance and Storage of Memory-Mapped Persistent Data Structures [Outstanding Student Paper Award]
Karim Youssef (Virginia Tech); Abdullah Al Raqibul Islam (Univ. of North Carolina at Charlotte); Keita Iwabuchi (Lawrence Livermore National Laboratory); Wu-chun Feng (Virginia Tech); Roger Pearce (Lawrence Livermore National Laboratory)

Persistent data structures represent a core component of high-performance data analytics. Multiple data processing systems persist data structures using memory-mapped files. Memory-mapped file I/O provides a productive and unified programming interface to different types of storage systems. However, it suffers from multiple limitations, including performance bottlenecks caused by system-wide configurations and a lack of support for efficient incremental versioning. Therefore, many such systems only support versioning via full-copy snapshots, resulting in poor performance and storage capacity bottlenecks. To address these limitations, we present Privateer 2.0, a virtual memory and storage interface that optimizes performance and storage capacity for versioned persistent data structures. Privateer 2.0 improves over the previous version by supporting userspace virtual memory management and block compression. We integrated Privateer 2.0 into Metall, a C++ persistent data structure allocator, and LMDB, a widely-used key-value store database. Privateer 2.0 yielded up to 7.5× speedup and up to 300× storage space reduction for Metall incremental snapshots and 1.25× speedup with 11.7× storage space reduction for LMDB incremental snapshots.
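
The contrast between full-copy snapshots and incremental, block-compressed snapshots can be illustrated with a small sketch. This is not Privateer's design or API: the `BlockStore` class, the `BLOCK_SIZE` value, and the use of SHA-256 digests with zlib are all assumptions chosen for illustration. Each version is recorded as a list of block digests, and a block is compressed and stored only the first time its content appears.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for illustration


class BlockStore:
    """Toy content-addressed store: each unique block is compressed and kept once."""

    def __init__(self):
        self.blocks = {}  # SHA-256 hex digest -> zlib-compressed block bytes

    def snapshot(self, data):
        """Record one version of `data`; return the list of block digests describing it."""
        manifest = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = bytes(data[offset:offset + BLOCK_SIZE])
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                # Only blocks whose content has not been seen before cost storage.
                self.blocks[digest] = zlib.compress(block)
            manifest.append(digest)
        return manifest

    def restore(self, manifest):
        """Rebuild a version from its manifest by decompressing each block in order."""
        return b"".join(zlib.decompress(self.blocks[d]) for d in manifest)


store = BlockStore()

# Version 1: four distinct 4 KiB blocks.
v1 = b"".join(bytes([i]) * BLOCK_SIZE for i in range(4))
m1 = store.snapshot(v1)

# Version 2: identical except for a single byte inside the third block.
v2 = bytearray(v1)
v2[2 * BLOCK_SIZE] = 255
m2 = store.snapshot(v2)
```

In this toy run the second snapshot adds one stored block rather than four, and both versions remain restorable from their manifests. A full-copy scheme would have stored all eight blocks.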

Processing Particle Data Flows with SmartNICs [Outstanding Student Paper Award]
Jianshen Liu; Carlos Maltzahn (UC Santa Cruz); Matthew Curry; Craig Ulmer (Sandia National Laboratories)

Many distributed applications implement complex data flows and need a flexible mechanism for routing data between producers and consumers. Recent advances in programmable network interface cards, or SmartNICs, represent an opportunity to offload data-flow tasks into the network fabric, thereby freeing the hosts to perform other work. System architects in this space face multiple questions about the best way to leverage SmartNICs as processing elements in data flows. In this paper, we advocate the use of Apache Arrow as a foundation for implementing data-flow tasks on SmartNICs. We report on our experiences adapting a partitioning algorithm for particle data to Apache Arrow and measure the on-card processing performance for the BlueField-2 SmartNIC. Our experiments confirm that the BlueField-2's (de)compression hardware can have a significant impact on in-transit workflows where data must be unpacked, processed, and repacked.

AUTOPAGER: Auto-tuning Memory-Mapped I/O Parameters in Userspace [Outstanding Student Paper Award]
Karim Youssef; Niteya Shah (Virginia Tech); Maya B Gokhale; Roger Pearce (Lawrence Livermore National Laboratory); Wu-chun Feng (Virginia Tech)

The exponential growth in dataset sizes has shifted the bottleneck of high-performance data analytics from the compute subsystem toward the memory and storage subsystems. This bottleneck has led to the proliferation of non-volatile memory (NVM). To bridge the performance gap between the Linux I/O subsystem and NVM, userspace memory-mapped I/O enables application-specific I/O optimizations. Specifically, UMap, an open-source userspace memory-mapping tool, exposes tunable paging parameters to application users, such as page size and degree of paging concurrency.
Tuning these parameters is computationally intractable due to the vast search space and the cost of evaluating each parameter combination. To address this challenge, we present AUTOPAGER, a tool for auto-tuning userspace paging parameters. Our evaluation, using five data-intensive applications with UMap, shows that AUTOPAGER automatically achieves comparable performance to exhaustive tuning with 10x less tuning overhead, and 16.3x and 1.52x speedups over UMap with default parameters and UMap with page-size-only tuning, respectively.

An SSD-Based Accelerator for Singular Value Decomposition Recommendation Algorithm on Edge
Wei Wu; Letian Zhao; Qizhe Wu; Xiaotian Wang; Teng Tian; Xi Jin (Univ. of Science and Technology of China)

Recommender systems (RSs) are widely used in social networks, computational advertising, video platforms, and many other Internet applications. Most RSs are based on the cloud-to-edge framework. Recommended item lists are computed in the cloud server and then transmitted to the edge device. Network bandwidth and latency between the cloud server and the edge may delay recommendations. Edge computing could help capture users' real-time preferences and thus improve the performance of recommendation. However, the increasing complexity of recommendation algorithms and the growing data scale pose challenges to real-time recommendation on the edge. To solve these problems, in this paper we focus on the Jacobi-based singular value decomposition (SVD) algorithm because of its high parallel processing potential and cost-effective NVM storage. We propose an SSD-based accelerator for the one-sided Jacobi transformation algorithm. We implement a hardware prototype on a real Xilinx FPGA development board. Experimental results show that the proposed SVD engine can achieve 3.4x to 5.8x speedup compared with software SVD solvers such as MATLAB running on a high-performance CPU.
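
For readers unfamiliar with the one-sided Jacobi transformation named in the abstract above, the following is a minimal pure-Python sketch of the textbook (Hestenes) form of the method. It is not the authors' accelerator design; the function name, tolerance, and sweep limit are assumptions made for illustration. Each step rotates one pair of columns until they are orthogonal, and the singular values are the column norms once every pair is orthogonal. Rotations on disjoint column pairs do not touch each other's data, which is the usual source of the parallelism the abstract mentions.

```python
import math


def jacobi_singular_values(a, tol=1e-12, max_sweeps=60):
    """Singular values of a dense matrix (list of rows) via one-sided Jacobi rotations."""
    m, n = len(a), len(a[0])
    u = [row[:] for row in a]  # work on a copy so the caller's matrix is untouched
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                # Squared norms of columns p and q, and their inner product.
                alpha = sum(u[i][p] * u[i][p] for i in range(m))
                beta = sum(u[i][q] * u[i][q] for i in range(m))
                gamma = sum(u[i][p] * u[i][q] for i in range(m))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue  # columns already (numerically) orthogonal
                rotated = True
                # Rotation angle that zeroes the inner product of the two columns.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for i in range(m):
                    up, uq = u[i][p], u[i][q]
                    u[i][p] = c * up - s * uq
                    u[i][q] = s * up + c * uq
        if not rotated:
            break  # a full sweep without rotations means convergence
    norms = [math.sqrt(sum(u[i][j] ** 2 for i in range(m))) for j in range(n)]
    return sorted(norms, reverse=True)


sigma = jacobi_singular_values([[1.0, 2.0], [3.0, 4.0]])
```

A quick sanity check for the 2×2 example: the squared singular values must sum to the squared Frobenius norm (30), and their product must equal the absolute determinant (2).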

Hardware Software Codesign of Applications on the Edge: Accelerating Digital PreDistortion for Wireless Communications
Zhaoyang Han; Yiyue Jiang (Northeastern Univ.); Rahul Mushini; John Dooley (Maynooth Univ.); Miriam Leeser (Northeastern Univ.)

We present a real-time adaptive Digital PreDistortion (DPD) system developed on a System-on-Chip (SoC) platform with integrated RF front end, namely the AMD/Xilinx RFSoC. The design utilizes the heterogeneity of the RFSoC and is carefully partitioned. The control logic and training algorithm are implemented on the embedded ARM processor, while the computationally expensive predistorter module is placed on the FPGA fabric. To better coordinate both the hardware and software implementations, the training algorithm has been optimized for a shorter training time, which results in a system that adapts to current environmental conditions with a shorter latency. Specifically, the number of signal samples used in training is reduced by applying the probability distribution information from the input signal in order to reduce the training time while retaining the important data samples. Results show that this reduced training set maintains the accuracy of the full data set. The implemented design balances the processing on the ARM processor and FPGA fabric, resulting in a computationally efficient solution which makes good use of the different resources available. It has been experimentally validated on an AMD/Xilinx Gen3 RFSoC board with an external GaN Power Amplifier (PA).

5-2: High Performance Data Analysis 2 Session (12:30-13:45)
Co-Chairs: Darrell Ricke & David Cousins

Im2win: Memory Efficient Convolution On SIMD Architectures
Shuai Lu; Jun Chu (Nanchang Hangkong Univ.); Xu T. Liu (Univ. of Washington)

Convolution is the most expensive operation among neural network operations, thus its performance is critical to the overall performance of neural networks.
Commonly used convolution approaches, including general matrix multiplication (GEMM)-based convolution and direct convolution, rely on im2col for data transformation or do not use data transformation at all, respectively. However, the im2col data transformation can lead to at least 2X memory footprint compared to not using data transformation at all, thus limiting the size of neural network models running on memory-limited systems. Meanwhile, not using data transformation usually performs poorly due to nonconsecutive memory access although it consumes less memory. To solve those problems, we propose a new memory-efficient data transformation algorithm, called im2win. This algorithm refactorizes a row of square or rectangular dot product windows of the input image and flattens unique elements within these windows into a row in the output tensor, which enables consecutive memory access and data reuse, and thus greatly reduces the memory overhead. Furthermore, we propose a high-performance im2win-based convolution algorithm with various optimizations, including vectorization, loop reordering, etc. Our experimental results show that our algorithm reduces the memory overhead to 41.6% on average compared to PyTorch's convolution implementation based on im2col, and achieves on average 3.6X and 5.3X speedup in performance compared to the im2col-based convolution and not using data transformation, respectively.
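
The memory cost of im2col that motivates the abstract above can be seen in a small sketch. This shows only the im2col baseline and direct convolution, not the im2win algorithm itself. The function names, the single-channel 6×6 input, and the stride of 1 with no padding are assumptions made for illustration.

```python
def im2col(image, k):
    """Unroll every k x k window (stride 1, no padding) into one row of a matrix."""
    h, w = len(image), len(image[0])
    rows = []
    for r in range(h - k + 1):
        for c in range(w - k + 1):
            rows.append([image[r + dr][c + dc] for dr in range(k) for dc in range(k)])
    return rows


def conv_im2col(image, kernel):
    """Convolution (cross-correlation, as in deep learning) as a matrix-vector product."""
    k = len(kernel)
    flat_kernel = [v for row in kernel for v in row]
    out_w = len(image[0]) - k + 1
    flat = [sum(x * y for x, y in zip(row, flat_kernel)) for row in im2col(image, k)]
    return [flat[i:i + out_w] for i in range(0, len(flat), out_w)]


def conv_direct(image, kernel):
    """Same result with no data transformation; reads the input with strided accesses."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    return [
        [
            sum(image[r + dr][c + dc] * kernel[dr][dc] for dr in range(k) for dc in range(k))
            for c in range(w - k + 1)
        ]
        for r in range(h - k + 1)
    ]


image = [[r * 6 + c for c in range(6)] for r in range(6)]  # 6 x 6 input, 36 values
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]

cols = im2col(image, 3)
input_elements = len(image) * len(image[0])
im2col_elements = len(cols) * len(cols[0])  # 16 windows of 9 values each
```

Both routines return the same output, but the unrolled matrix here holds 144 values against 36 in the input, because overlapping windows duplicate most elements. That duplication is the overhead a window-based layout aims to avoid.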

Towards Full-Stack Acceleration for Fully Homomorphic Encryption
Naifeng Zhang (Carnegie Mellon Univ.); Homer Gamil (New York Univ.); Patrick Brinich (Drexel Univ.); Benedict Reynwar (USC ISI); Ahmad Al Badawi (Duality Technologies); Negar Neda; Deepraj Soni (New York Univ.); Yuriy Polyakov (Duality Technologies); Patrick Broderick (SpiralGen, Inc.); Michail Maniatakos (New York Univ.); Andrew Schmidt (USC ISI); Mike Franusich (SpiralGen, Inc.); Jeremy Johnson (Drexel Univ.); Brandon Reagen (New York Univ.); David Bruce Cousins (Duality Technologies); Franz Franchetti (Carnegie Mellon Univ.)

This paper provides a first look at the end-to-end Fully Homomorphic Encryption (FHE) accelerator, which is optimized by PALISADE on the algorithmic level, by NTTX from SPIRAL on the code generation level, and by TILE on the microarchitecture level. Our work exhibits the necessary structure and components for an integrated end-to-end system for FHE acceleration.

Python Implementation of the Dynamic Distributed Dimensional Data Model
Hayden R Jananthan (MIT LLSC); Lauren Milechin (MIT); Michael Jones; William Arcand; William Bergeron; David Bestor; Chansup Byun; Michael Houle; Matthew Hubbell; Vijay Gadepally; Anna Klein; Peter Michaleas; Guillermo Morales; Julie Mullen; Andrew Prout; Albert Reuther; Antonio Rosa; Siddharth Samsi; Charles Yee; Jeremy Kepner (MIT LLSC)

Python has become a standard scientific computing language with fast-growing support of machine learning and data analysis modules, as well as an increasing usage of big data. The Dynamic Distributed Dimensional Data Model (D4M) offers a highly composable, unified data model with strong performance built to handle big data fast and efficiently. In this work we present an implementation of D4M in Python. D4M.py implements all foundational functionality of D4M and includes Accumulo and SQL database support via Graphulo.
We describe the mathematical background and motivation, an explanation of the approaches made for its fundamental functions and building blocks, and performance results which compare D4M.py’s performance to D4M-MATLAB and D4M.jl.

Powering Practical Performance: Accelerated Numerical Computing in Pure Python
Matthew Penn; Christopher Milroy (NVIDIA)

In this paper, we tackle a generic n-dimensional numerical computing problem to compare performance and analyze tradeoffs between popular frameworks using open source Jupyter notebook examples. Most data science practitioners perform their work in Python because of its high-level abstraction and rich set of numerical computing libraries. However, the choice of library and methodology is driven by complexity-impacting constraints like problem size, latency, memory, physical size, weight, power, hardware, and others. To that end, we demonstrate that a wide selection of GPU-accelerated libraries (RAPIDS, CuPy, Numba, Dask), including the development of hand-tuned CUDA kernels, are accessible to data scientists without ever leaving Python. We address the Python developer community by showing C/C++ is not necessary to access single/multi-GPU acceleration for data science applications. We solve a common numerical computing problem – finding the closest point in array B from every point (and its index) in array A, requiring up to 8.8 trillion distance comparisons – on a GPU-equipped workstation without writing a line of C/C++.

Parallel Computing with DNA Forensics Data
Adam Michaleas; Philip Fremont-Smith; Chelsea Lennartz; Darrell O. Ricke (MIT Lincoln Laboratory)

High-throughput sequencing (HTS) of single nucleotide polymorphisms (SNPs) provides advanced DNA forensics capabilities including complex mixture analysis. This paper describes a scalable pipeline for large DNA forensics data which can be used either on a standalone system or on high performance computing systems.
This pipeline enables parallelization of processing of multiple samples. Surveillance modules detect completed sequencing datasets on both Illumina and Ion Torrent platforms. GrigoraSNPs is used for automated SNP allele calling from FASTQ files. These results are automatically loaded into the IdPrism DNA mixture analysis system. HTS SNP data analysis typically completes in roughly 7 minutes for 100M sequences, including SNP allele calling, enabling rapid access to the results within the IdPrism system for identification and complex mixture analysis of multiplexed samples.

5-3: Big Data and Distributed Computing 1 Session (14:15-15:30)
Co-Chairs: Sadas Shankar & Chansup Byun

Invited Talk: Data-Driven Precision Neuroscience
Dr. John Reynders (Neumora)

Distributed Out-of-Memory SVD on CPU/GPU Architectures [Outstanding Paper Award]
Ismael Boureima; Manish Bhattarai; Maksim E Eren; Nick Solovyev; Hristo Djidjev; Boian Alexandrov (Los Alamos National Laboratory)

We propose an efficient, distributed, out-of-memory implementation of the truncated singular value decomposition (t-SVD) for heterogeneous high performance computing (HPC) systems. Various implementations of SVD have been proposed, but most only estimate the singular values, as estimating the singular vectors can significantly increase the time and memory complexity of the algorithm. In this work, we propose an implementation of SVD based on the power method, which is a truncated singular values and singular vectors estimation method. Memory utilization bottlenecks in the power method used to decompose a matrix A are typically associated with the computation of the Gram matrix A^T A, which can be significant when A is large and dense, or when A is super-large and sparse. The proposed implementation is optimized for out-of-memory problems where the memory required to factorize a given matrix is greater than the available GPU memory.
We reduce the memory complexity of A^T A by using a batching strategy where the intermediate factors are computed block by block, and we hide I/O latency associated with both host-to-device (H2D) and device-to-host (D2H) batch copies by overlapping each batch copy with compute using CUDA streams. Furthermore, we use optimized NCCL-based communicators to reduce the latency associated with collective communications (both intra-node and inter-node). In addition, sparse and dense matrix multiplications are significantly accelerated with GPU cores (or tensor cores when available), resulting in an implementation with good scaling. We demonstrate the scalability of our distributed out-of-core SVD algorithm by successfully decomposing a dense matrix of size 1 TB and a sparse matrix of 1e-6 sparsity with a size of 128 PB in dense format.

HuGraph: Acceleration of GCN Training on Heterogeneous FPGA Clusters with Quantization [Outstanding Student Paper Award]
Letian Zhao; Qizhe Wu; Xiaotian Wang; Teng Tian; Wei Wu; Xi Jin (Univ. of Science and Technology of China)

Graph convolutional networks (GCNs) have succeeded significantly in numerous fields, but the need for higher-performance and more energy-efficient GCN training on larger graphs continues unabated. At the same time, since reconfigurable accelerators can customize computing modules and data movement at a fine grain, FPGAs can solve problems such as irregular memory access for GCN computing. Furthermore, to scale GCN computation, the use of heterogeneous FPGAs is inevitable due to the constant iteration of new FPGAs. In this paper, we propose a novel framework, HuGraph, which automatically maps GCN training on heterogeneous FPGA clusters. With HuGraph, FPGAs work in synchronous data parallelism using a simple ring 1D topology that is suitable for most off-the-shelf FPGA clusters. HuGraph uses three approaches to advance performance and energy efficiency.
First, HuGraph applies full-process quantization for neighbor-sampling-based data parallel training, thereby reducing computation and memory consumption. Second, a novel balanced sampler is used to balance workloads among heterogeneous FPGAs so that FPGAs with fewer resources do not become bottlenecks in the cluster. Third, HuGraph schedules the execution order of GCN training to minimize time overhead. We implement a prototype on a single FPGA and evaluate cluster-level performance with a cycle-accurate simulator. Experiments show that HuGraph achieves up to 102.3x, 4.62x, and 11.1x speedup compared with the state-of-the-art works on CPU, GPU, and FPGA platforms, respectively, with negligible accuracy loss.

A Scalable Inference Pipeline for 3D Axon Tracing Algorithms
Benjamin M Fenelon; Lars Gjesteby (MIT Lincoln Laboratory); Webster Guan; Juhyuk Park; Kwanghun Chung (MIT); Laura Brattain (MIT Lincoln Laboratory)

High inference times of machine learning-based axon tracing algorithms pose a significant challenge to the practical analysis and interpretation of large-scale brain imagery. This paper explores a distributed data pipeline that employs a SLURM-based job array to run multiple machine learning algorithm predictions simultaneously. Image volumes were split into N (1-16) equal chunks that are each handled by a unique compute node and stitched back together into a single 3D prediction. Preliminary results comparing the inference speed of 1 versus 16 node job arrays demonstrated a 90.95% decrease in compute time for 32 GB input volume and 88.41% for 4 GB input volume. The general pipeline may serve as a baseline for future improved implementations on larger input volumes which can be tuned to various application domains.

Exploring the Impacts of Software Cache Configuration for In-line Data Compression
Sansriti Ranjan; Dakota Fulp; Jon C Calhoun (Clemson Univ.)

In order to compute on or analyze large data sets, applications need access to large amounts of memory. Increasing the amount of physical memory requires costly hardware upgrades. Compressing large arrays stored in an application's memory does not require hardware upgrades, while enabling the appearance of more physical memory. In-line compressed arrays compress and decompress data needed by the application as it moves in and out of its working set that resides in main memory. Naive compressed arrays require a compression or decompression operation for each store or load, respectively, which significantly hurts performance. Caching decompressed values in a software-managed cache limits the number of compression/decompression operations, improving performance. The structure of the software cache impacts the performance of the application. In this paper, we build and utilize a compression cache simulator to analyze and simulate various cache configurations for an application. Our simulator is able to leverage and model the multidimensional nature of high-performance computing (HPC) data and compressors. We evaluate both direct-mapped and set-associative caches on five HPC kernels. Finally, we construct a performance model to explore runtime impacts of cache configurations. Results show that cache policy tuning by increasing the block size, associativity and cache size improves the hit rate significantly for all applications. Incorporating dimensionality further improves locality and hit rate, achieving speedup in the performance of an application by up to 28.25%.
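
The effect of cache structure on hit rate described above can be sketched with a tiny trace-driven simulator. This is not the authors' simulator: the `hit_rate` function, the LRU replacement policy, the one-dimensional block mapping, and the synthetic trace are all assumptions made for illustration. The cache holds decompressed blocks, so every miss stands for one decompression (and, on eviction, possibly one compression).

```python
from collections import OrderedDict


def hit_rate(trace, block_size, num_sets, ways):
    """Fraction of accesses served by a set-associative LRU cache of decompressed blocks.

    trace      -- sequence of array element indices touched by the application
    block_size -- elements per compressed block
    num_sets   -- number of cache sets (ways=1 gives a direct-mapped cache)
    ways       -- blocks that each set can hold
    """
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for index in trace:
        block = index // block_size
        lines = sets[block % num_sets]
        if block in lines:
            hits += 1
            lines.move_to_end(block)  # mark as most recently used
        else:
            if len(lines) >= ways:
                # Evict the least recently used block; a real compressed array
                # would recompress it here if it had been modified.
                lines.popitem(last=False)
            lines[block] = True  # miss: the block would be decompressed here
    return hits / len(trace)


# Synthetic trace: alternate between element i and element i + 64, as a kernel
# combining two regions of the same array might do.
trace = []
for i in range(64):
    trace.extend([i, i + 64])

# Two configurations with the same capacity (4 blocks of 8 elements each).
direct_mapped = hit_rate(trace, block_size=8, num_sets=4, ways=1)
two_way = hit_rate(trace, block_size=8, num_sets=2, ways=2)
```

With this trace the two regions always map to the same set, so the direct-mapped cache thrashes and never hits. The two-way cache of equal capacity keeps both blocks resident and misses only on the first touch of each block.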

5-4: Big Data and Distributed Computing 2 Session (15:45-17:00)
Co-Chairs: Rich Vuduc & Nikos Pitsianis

Invited Talk: HPC Graphs in the AWS Cloud
Roger Pearce (LLNL)

pPython for Parallel Python Programming
Chansup Byun; William Arcand; David Bestor; Bill Bergeron; Vijay Gadepally; Michael Houle; Matthew Hubbell; Hayden Jananthan; Michael Jones (MIT LLSC); Kurt Keville (MIT); Anna Klein; Peter Michaleas (MIT LLSC); Lauren Milechin (MIT); Guillermo Morales; Julie Mullen; Andrew Prout; Albert Reuther; Antonio Rosa; Siddharth Samsi; Charles Yee; Jeremy Kepner (MIT LLSC)

pPython seeks to provide a parallel capability that provides good speed-up without sacrificing the ease of programming in Python by implementing partitioned global array semantics (PGAS) on top of a simple file-based messaging library (PythonMPI) in pure Python. The core data structure in pPython is a distributed numerical array whose distribution onto multiple processors is specified with a ‘map’ construct. Communication operations between distributed arrays are abstracted away from the user and pPython transparently supports redistribution between any block-cyclic-overlapped distributions in up to four dimensions. pPython follows an SPMD (single program multiple data) model of computation. pPython runs on any combination of heterogeneous systems that support Python, including Windows, Linux, and MacOS operating systems. In addition to running transparently on a single node (e.g., a laptop), pPython provides a scheduler interface, so that pPython can be executed in a massively parallel computing environment. The initial implementation uses the Slurm scheduler. Performance of pPython on the HPC Challenge benchmark suite demonstrates both ease of programming and scalability.

Arachne: An Arkouda Package for Large-Scale Graph Analytics
Oliver A Alvarado Rodriguez; Zhihui Du; Joseph T Patchett; Fuhuan Li; David Bader (New Jersey Inst. of Tech.)

Due to the emergence of massive real-world graphs, whose sizes may extend to terabytes, new tools must be developed to enable data scientists to handle such graphs efficiently. These graphs may include social networks, computer networks, and genomes. In this paper, we propose a novel graph package, Arachne, to make large-scale graph analytics more effortless and efficient based on the open-source Arkouda framework. Arkouda has been developed to allow users to perform massively parallel computations on distributed data with an interface similar to NumPy. In this package, we developed a fundamental sparse graph data structure and then built several useful graph algorithms around our data structure to form a basic algorithmic library. Benchmarks and tools were also developed to evaluate and demonstrate the use of our graph algorithms. The graph algorithms we have implemented thus far include breadth-first search (BFS), connected components (CC), k-Truss (KT), Jaccard coefficients (JC), triangle counting (TC), and triangle centrality (TCE). Their corresponding experimental results based on real-world and synthetic graphs are presented. Arachne is organized as an Arkouda extension package and is publicly available on GitHub (https://github.com/Bears-R-Us/arkouda-njit).

The Viability of Using Online Prediction to Perform Extra Work while Executing BSP Applications
Po Hao Chen; Pouya Haghi; Jae Yoon Chung (Boston Univ.); Tong Geng (Univ. of Rochester); Richard West (Boston Univ.); Anthony Skjellum (UTC); Martin Herbordt (Boston Univ.)

A fundamental problem in parallel processing is the difficulty in efficiently partitioning work, with the result that much of a parallel program’s execution time is often spent idle or performing overhead operations. We propose to improve the efficiency of system resource utilization by having idle processes execute extra work.
We develop a method whereby the execution of extra work is optimized through performance prediction and the setting of limits (a deadline) on the duration of the extra work execution. In our preliminary experiments with proxy BSP applications on a production supercomputer, we find that this approach is promising, with two applications benefiting significantly.

Real-Time Software Architecture for EM-Based Radar Signal Processing and Tracking
Alan W Nussbaum (Georgia Tech, GTRI); Byron Keel (GTRI); William Dale Blair (GTRI, Georgia Tech); Umakishore Ramachandran (Georgia Tech)

While a radar tracks the kinematic state (position, velocity, and acceleration) of the target, optimal signal processing requires knowledge of the target's range rate and radial acceleration, which are derived from the tracking function in real time. High-precision tracks are achieved through precise range and angle measurements whose precision is determined by the signal-to-noise ratio (SNR) of the received signal. The SNR is maximized by minimizing the matched filter loss due to uncertainties in the radial velocity and acceleration of the target. In this paper, the Expectation-Maximization (EM) algorithm is proposed as an iterative signal processing scheme for maximizing the SNR by executing enhanced range walk compensation (i.e., correction for errors in the radial velocity and acceleration) in the real-time control loop software architecture. Maintaining a stringent timeline and adhering to latency requirements are essential for real-time sensor signal processing. This research aims to examine existing methods and explore new approaches and technologies to mitigate the harmful effects of range walk in tracking radar systems with an EM-based iterative algorithm and implement the new control loop steering methods in a real-time computing environment.
