2022 IEEE High Performance Extreme Computing Conference
Virtual Conference
19-23 September 2022
5-V: Sponsor Showcase – Dell Session (10:30-11:00)
Co-Chairs: Albert Reuther
How to Wrestle with Global Scale ML and Win
Dr. Ben Fauber (Dell Technologies)
5-1: High Performance Data Analysis 1 Session (11:00-12:15)
Co-Chairs: Darrell Ricke & Ken Cain
Optimizing Performance and Storage of Memory-Mapped Persistent Data Structures [Outstanding Student Paper Award]
Karim Youssef (Virginia Tech); Abdullah Al Raqibul Islam (Univ. of North Carolina at Charlotte); Keita Iwabuchi (Lawrence Livermore
National Laboratory); Wu-chun Feng (Virginia Tech); Roger Pearce (Lawrence Livermore National Laboratory)
Persistent data structures represent a core component of high-performance data analytics. Multiple data processing systems persist
data structures using memory-mapped files. Memory-mapped file I/O provides a productive and unified programming interface to
different types of storage systems.
However, it suffers from multiple limitations, including performance bottlenecks caused by system-wide configurations and a lack of
support for efficient incremental versioning. Therefore, many such systems only support versioning via full-copy snapshots, resulting in
poor performance and storage capacity bottlenecks. To address these limitations, we present Privateer 2.0, a virtual memory and
storage interface that optimizes performance and storage capacity for versioned persistent data structures. Privateer 2.0 improves over
the previous version by supporting userspace virtual memory management and block compression. We integrated Privateer 2.0 into
Metall, a C++ persistent data structure allocator, and LMDB, a widely-used key-value store database. Privateer 2.0 yielded up to 7.5×
speedup and up to 300× storage space reduction for Metall incremental snapshots and 1.25× speedup with 11.7× storage space
reduction for LMDB incremental snapshots.
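The key idea behind the incremental snapshots described above, persisting only the blocks that changed since the last version and compressing them, can be illustrated with a short sketch. This is not Privateer 2.0's code or API; the block size, hashing scheme, and in-memory store are assumptions made for the example.

```python
"""Minimal sketch of block-level incremental snapshots with compression."""
import hashlib
import zlib

BLOCK_SIZE = 4096  # hypothetical block granularity

def snapshot(data: bytes, prev_hashes: dict, store: dict):
    """Persist only changed blocks; return the new block-hash index."""
    new_hashes = {}
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        new_hashes[i] = digest
        # Write the block only if it is new or changed since the last version.
        if prev_hashes.get(i) != digest and digest not in store:
            store[digest] = zlib.compress(block)
    return new_hashes

# Usage: two versions that differ in a single block share all other blocks.
store = {}
v1 = bytes(BLOCK_SIZE * 4)
h1 = snapshot(v1, {}, store)
v2 = bytearray(v1); v2[0] = 1
h2 = snapshot(bytes(v2), h1, store)
print(len(store))  # 2 unique compressed blocks instead of 8 full-copy blocks
```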
Processing Particle Data Flows with SmartNICs [Outstanding Student Paper Award]
Jianshen Liu; Carlos Maltzahn (UC Santa Cruz); Matthew Curry; Craig Ulmer (Sandia National Laboratories)
Many distributed applications implement complex data flows and need a flexible mechanism for routing data between producers and
consumers. Recent advances in programmable network interface cards, or SmartNICs, represent an opportunity to offload data-flow
tasks into the network fabric, thereby freeing the hosts to perform other work. System architects in this space face multiple questions
about the best way to leverage SmartNICs as processing elements in data flows. In this paper, we advocate the use of Apache Arrow as
a foundation for implementing data-flow tasks on SmartNICs. We report on our experiences adapting a partitioning algorithm for particle
data to Apache Arrow and measure the on-card processing performance for the BlueField-2 SmartNIC. Our experiments confirm that
the BlueField-2's (de)compression hardware can have a significant impact on in-transit workflows where data must be unpacked,
processed, and repacked.
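As a host-side illustration of the kind of data-flow task described above, here is a minimal sketch that partitions a small particle table by spatial bin using Apache Arrow's Python bindings. It is not the authors' SmartNIC code; the column names and bin width are assumptions.

```python
"""Sketch: partitioning particle data by spatial bin with Apache Arrow."""
import pyarrow as pa
import pyarrow.compute as pc

BIN_WIDTH = 10.0  # hypothetical spatial bin width

# A tiny "particle" table: one position coordinate plus a payload column.
particles = pa.table({
    "x": pa.array([1.5, 12.0, 25.3, 7.7, 14.2], type=pa.float64()),
    "energy": pa.array([0.1, 0.9, 0.4, 0.2, 0.7], type=pa.float64()),
})

# Assign each particle to a bin, then split the table into per-bin partitions.
bins = pc.floor(pc.divide(particles["x"], BIN_WIDTH))
partitions = {
    int(b.as_py()): particles.filter(pc.equal(bins, b))
    for b in pc.unique(bins)
}
for bin_id, part in sorted(partitions.items()):
    print(bin_id, part.num_rows)
```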
AUTOPAGER: Auto-tuning Memory-Mapped I/O Parameters in Userspace [Outstanding Student Paper Award]
Karim Youssef; Niteya Shah (Virginia Tech); Maya B Gokhale; Roger Pearce (Lawrence Livermore National Laboratory); Wu-chun Feng (Virginia Tech)
The exponential growth in dataset sizes has shifted the bottleneck of high-performance data analytics from the compute subsystem
toward the memory and storage subsystems. This bottleneck has led to the proliferation of non-volatile memory (NVM). To bridge the
performance gap between the Linux I/O subsystem and NVM, userspace memory-mapped I/O enables application-specific I/O
optimizations. Specifically, UMap, an open-source userspace memory-mapping tool, exposes tunable paging parameters to application
users, such as page size and degree of paging concurrency. Tuning these parameters is computationally intractable due to the vast
search space and the cost of evaluating each parameter combination. To address this challenge, we present AUTOPAGER, a tool for
auto-tuning userspace paging parameters. Our evaluation, using five data-intensive applications with UMap, shows that AUTOPAGER
automatically achieves performance comparable to exhaustive tuning with 10× less tuning overhead, and 16.3× and 1.52× speedups over UMap with default parameters and UMap with page-size-only tuning, respectively.
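A minimal sketch of the tuning problem described above: choosing a page size and paging concurrency by measuring the workload, here with a simple random search rather than AUTOPAGER's actual strategy. The parameter ranges and placeholder workload are hypothetical.

```python
"""Sketch: random search over userspace paging parameters."""
import itertools
import random
import time

PAGE_SIZES = [4096 * (2 ** i) for i in range(6)]   # 4 KiB .. 128 KiB
CONCURRENCY = [1, 2, 4, 8, 16, 32]                 # paging worker threads

def run_workload(page_size: int, concurrency: int) -> float:
    """Stand-in for configuring the userspace pager and timing the real application."""
    start = time.perf_counter()
    time.sleep(0.001)  # placeholder for the actual run
    return time.perf_counter() - start

def tune(budget: int = 10):
    """Evaluate only a random subset of the 36-point space instead of all of it."""
    space = list(itertools.product(PAGE_SIZES, CONCURRENCY))
    return min(random.sample(space, budget), key=lambda cfg: run_workload(*cfg))

print("best (page_size, concurrency):", tune())
```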
An SSD-Based Accelerator for Singular Value Decomposition Recommendation Algorithm on Edge
Wei Wu; Letian Zhao; Qizhe Wu; Xiaotian Wang; Teng Tian; Xi Jin (Univ. of Science and Technology of China)
Recommender systems (RSs) are widely used in social networks, computational advertising, video platforms, and many other Internet applications. Most RSs are based on a cloud-to-edge framework: recommended item lists are computed on the cloud server and then transmitted to edge devices. Network bandwidth and latency between the cloud server and the edge can delay recommendations. Edge computing can capture a user's real-time preferences and thus improve recommendation quality. However, the increasing complexity of recommendation algorithms and growing data scale pose challenges to real-time recommendation at the edge. To address these problems, we focus on the Jacobi-based singular value decomposition (SVD) algorithm because of its high parallel-processing potential and cost-effective NVM storage. We propose an SSD-based accelerator for the one-sided Jacobi transformation algorithm and implement a hardware prototype on a real Xilinx FPGA development board. Experimental results show that the proposed SVD engine achieves 3.4× to 5.8× speedup compared with software SVD solvers such as MATLAB running on a high-performance CPU.
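The one-sided Jacobi transformation at the heart of the accelerator can be written compactly in software. The sketch below is a plain NumPy reference for small dense matrices, not the SSD/FPGA design described in the paper.

```python
import numpy as np

def one_sided_jacobi_svd(A, tol=1e-12, max_sweeps=30):
    """Hestenes one-sided Jacobi SVD: rotate column pairs until orthogonal."""
    A = np.array(A, dtype=np.float64)
    n = A.shape[1]
    V = np.eye(n)
    for _ in range(max_sweeps):
        converged = True
        for i in range(n - 1):
            for j in range(i + 1, n):
                alpha = A[:, i] @ A[:, i]
                beta = A[:, j] @ A[:, j]
                gamma = A[:, i] @ A[:, j]
                if abs(gamma) <= tol * np.sqrt(alpha * beta):
                    continue                      # pair already (nearly) orthogonal
                converged = False
                zeta = (beta - alpha) / (2.0 * gamma)
                sgn = 1.0 if zeta >= 0 else -1.0
                t = sgn / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = c * t
                # Apply the plane rotation to the column pair of A and of V.
                ai, aj = A[:, i].copy(), A[:, j].copy()
                A[:, i], A[:, j] = c * ai - s * aj, s * ai + c * aj
                vi, vj = V[:, i].copy(), V[:, j].copy()
                V[:, i], V[:, j] = c * vi - s * vj, s * vi + c * vj
        if converged:
            break
    sigma = np.linalg.norm(A, axis=0)             # singular values = column norms
    return A / sigma, sigma, V

# Sanity check against the definition A = U * diag(sigma) * V^T.
M = np.random.default_rng(0).standard_normal((6, 4))
U, s, V = one_sided_jacobi_svd(M)
print(np.allclose((U * s) @ V.T, M))
```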
Hardware Software Codesign of Applications on the Edge: Accelerating Digital PreDistortion for Wireless Communications
Zhaoyang Han; Yiyue Jiang (Northeastern Univ.); Rahul Mushini; John Dooley (Maynooth Univ.); Miriam Leeser (Northeastern Univ.)
We present a real-time adaptive Digital PreDistortion (DPD) system developed on a System-on-Chip (SoC) platform with integrated RF
front end, namely the AMD/Xilinx RFSoC. The design utilizes the heterogeneity of the RFSoC and is carefully partitioned. The control
logic and training algorithm are implemented on the embedded ARM processor, while the computationally expensive predistorter module is placed on the FPGA fabric. To better coordinate the hardware and software implementations, the training
algorithm has been optimized for a shorter training time which results in a system that adapts to current environmental conditions with a
shorter latency. Specifically, the number of signal samples used in training are reduced by applying the probability distribution
information from the input signal in order to reduce the training time while retaining the important data samples. Results show that this
reduced training set maintains the accuracy of the full data set. The implemented design balances the processing on the ARM
processor and FPGA fabric resulting in a computationally efficient solution which makes good use of the different resources available. It
has been experimentally validated on an AMD/Xilinx Gen3 RFSoC board with an external GaN Power Amplifier (PA).
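The sample-reduction idea, keeping fewer samples from crowded amplitude regions while retaining the rare high-amplitude ones that matter for predistortion, can be sketched as below. The binning and per-bin cap are illustrative assumptions, not the paper's selection rule.

```python
"""Sketch: amplitude-distribution-based reduction of a DPD training set."""
import numpy as np

def reduce_training_set(x, n_bins=64, cap_per_bin=200, seed=0):
    """Return indices of a reduced training set for a complex baseband signal x."""
    rng = np.random.default_rng(seed)
    mag = np.abs(x)
    edges = np.linspace(0.0, mag.max(), n_bins + 1)[1:-1]
    bins = np.digitize(mag, edges)
    keep = []
    for b in range(n_bins):
        idx = np.flatnonzero(bins == b)
        if idx.size > cap_per_bin:
            idx = rng.choice(idx, cap_per_bin, replace=False)  # thin crowded bins
        keep.append(idx)                                       # keep rare bins whole
    return np.sort(np.concatenate(keep))

# Usage: a synthetic complex Gaussian signal is mostly low-amplitude, so most
# of the reduction comes from the crowded low-magnitude bins.
rng = np.random.default_rng(1)
x = (rng.standard_normal(100_000) + 1j * rng.standard_normal(100_000)) / np.sqrt(2)
sel = reduce_training_set(x)
print(len(sel), "of", len(x))
```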
5-2: High Performance Data Analysis 2 Session (12:30-13:45)
Co-Chairs: Darrell Ricke & David Cousins
Im2win: Memory Efficient Convolution On SIMD Architectures
Shuai Lu; Jun Chu (Nanchang Hangkong Univ.); Xu T. Liu (Univ. of Washington)
Convolution is the most expensive operation among neural network operations, so its performance is critical to the overall performance of neural networks. Commonly used convolution approaches, including general matrix multiplication (GEMM)-based convolution and direct convolution, either rely on im2col for data transformation or use no data transformation at all. However, the im2col data transformation can lead to at least a 2× memory footprint compared to not using data transformation at all, thus limiting the size of neural network models running on memory-limited systems. Meanwhile, not using data transformation usually performs poorly due to nonconsecutive memory access, although it consumes less memory. To solve these problems, we propose a new memory-efficient data transformation algorithm, called im2win. This algorithm refactorizes a row of square or rectangular dot-product windows of the input image and flattens the unique elements within these windows into a row in the output tensor, which enables consecutive memory access and data reuse, and thus greatly reduces the memory overhead. Furthermore, we propose a high-performance im2win-based convolution algorithm with various optimizations, including vectorization and loop reordering. Our experimental results show that our algorithm reduces the memory overhead by 41.6% on average compared to PyTorch's im2col-based convolution implementation, and achieves average speedups of 3.6× and 5.3× compared to im2col-based convolution and convolution without data transformation, respectively.
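To make the memory comparison concrete, the sketch below contrasts a textbook im2col buffer with an im2win-style buffer that stores, per row of output windows, only the unique input elements those windows touch (single channel, stride 1, no padding). This is one plausible reading of the layout described above, not the authors' implementation.

```python
import numpy as np

def im2col(x, k):
    """Copy every k-by-k window in full: (out_h*out_w, k*k) elements."""
    H, W = x.shape
    oh, ow = H - k + 1, W - k + 1
    cols = np.empty((oh * ow, k * k), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = x[i:i + k, j:j + k].ravel()
    return cols

def im2win(x, k):
    """Keep only the unique elements per row of windows: (out_h, k*W) elements."""
    H, W = x.shape
    return np.stack([x[i:i + k, :].ravel() for i in range(H - k + 1)])

def conv_from_win(win, w, W):
    """Slide the kernel inside each contiguous row of windows."""
    k = w.shape[0]
    oh, ow = win.shape[0], W - k + 1
    rows = win.reshape(oh, k, W)
    out = np.empty((oh, ow))
    for j in range(ow):
        out[:, j] = np.einsum('rab,ab->r', rows[:, :, j:j + k], w)
    return out

x = np.random.default_rng(0).standard_normal((64, 64))
w = np.random.default_rng(1).standard_normal((3, 3))
cols, win = im2col(x, 3), im2win(x, 3)
print(np.allclose((cols @ w.ravel()).reshape(62, 62), conv_from_win(win, w, 64)))
print("im2col elements:", cols.size, "  im2win elements:", win.size)
```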
Towards Full-Stack Acceleration for Fully Homomorphic Encryption
Naifeng Zhang (Carnegie Mellon Univ.); Homer Gamil (New York Univ.); Patrick Brinich (Drexel Univ.); Benedict Reynwar (USC ISI);
Ahmad Al Badawi (Duality Technologies); Negar Neda; Deepraj Soni (New York Univ.); Yuriy Polyakov (Duality Technologies); Patrick
Broderick (SpiralGen, Inc.); Michail Maniatakos (New York Univ.); Andrew Schmidt (USC ISI); Mike Franusich (SpiralGen, Inc.); Jeremy
Johnson (Drexel Univ.); Brandon Reagen (New York Univ.); David Bruce Cousins (Duality Technologies); Franz Franchetti (Carnegie
Mellon Univ.)
This paper provides a first look at the end-to-end Fully Homomorphic Encryption (FHE) accelerator, which is optimized by PALISADE at the algorithmic level, by NTTX from SPIRAL at the code-generation level, and by TILE at the microarchitecture level. Our work exhibits the
necessary structure and components for an integrated end-to-end system for FHE acceleration.
Python Implementation of the Dynamic Distributed Dimensional Data Model
Hayden R Jananthan (MIT LLSC); Lauren Milechin (MIT); Michael Jones; William Arcand; William Bergeron; David Bestor; Chansup Byun; Michael Houle; Matthew Hubbell; Vijay Gadepally; Anna Klein; Peter Michaleas; Guillermo Morales; Julie Mullen; Andrew Prout;
Albert Reuther; Antonio Rosa; Siddharth Samsi; Charles Yee; Jeremy Kepner (MIT LLSC)
Python has become a standard scientific computing language with fast-growing support for machine learning and data analysis modules,
as well as an increasing usage of big data. The Dynamic Distributed Dimensional Data Model (D4M) offers a highly composable, unified
data model with strong performance built to handle big data fast and efficiently. In this work we present an implementation of D4M in
Python.
D4M.py implements all foundational functionality of D4M and includes Accumulo and SQL database support via Graphulo. We describe the mathematical background and motivation, explain the approaches taken for its fundamental functions and building blocks, and present performance results comparing D4M.py to D4M-MATLAB and D4M.jl.
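The associative-array abstraction at the core of D4M can be illustrated with a small string-keyed sparse matrix. The sketch below uses SciPy and is not D4M.py's actual API or internal layout; the class and example data are hypothetical.

```python
"""Sketch: the associative-array idea behind D4M, using scipy.sparse."""
from scipy import sparse

class Assoc:
    """Sparse matrix indexed by string row/column keys."""
    def __init__(self, rows, cols, vals):
        self.row_keys = sorted(set(rows))
        self.col_keys = sorted(set(cols))
        r = [self.row_keys.index(k) for k in rows]
        c = [self.col_keys.index(k) for k in cols]
        self.mat = sparse.coo_matrix(
            (vals, (r, c)),
            shape=(len(self.row_keys), len(self.col_keys))).tocsr()

    def __getitem__(self, key):
        row, col = key
        return self.mat[self.row_keys.index(row), self.col_keys.index(col)]

# Usage: an edge list becomes an adjacency-like associative array, and
# matrix products over it express graph/database-style operations.
A = Assoc(["alice", "alice", "bob"], ["item1", "item2", "item1"], [1.0, 2.0, 3.0])
print(A["bob", "item1"])            # 3.0
print((A.mat @ A.mat.T).toarray())  # user co-occurrence via shared items
```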
Powering Practical Performance: Accelerated Numerical Computing in Pure Python
Matthew Penn; Christopher Milroy (NVIDIA)
In this paper, we tackle a generic n-dimensional numerical computing problem to compare performance and analyze tradeoffs between
popular frameworks using open source Jupyter notebook examples. Most data science practitioners perform their work in Python
because of its high-level abstraction and rich set of numerical computing libraries. However, the choice of library and methodology is
driven by complexity-impacting constraints like problem size, latency, memory, physical size, weight, power, hardware, and others. To
that end, we demonstrate that a wide selection of GPU-accelerated libraries (RAPIDS, CuPy, Numba, Dask), including the development
of hand-tuned CUDA kernels, are accessible to data scientists without ever leaving Python. We address the Python developer
community by showing that C/C++ is not necessary to access single/multi-GPU acceleration for data science applications. We solve a common numerical computing problem, finding for every point in array A the closest point (and its index) in array B, requiring up to 8.8 trillion distance comparisons, on a GPU-equipped workstation without writing a line of C/C++.
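In the same spirit, here is a minimal CuPy sketch of the benchmark problem, finding for every point in A the index of the closest point in B, batched so the distance matrix fits in GPU memory. The batch size and array shapes are assumptions; this is not the notebook code from the paper.

```python
"""Sketch: batched nearest-neighbor search in pure Python with CuPy."""
import numpy as np
import cupy as cp

def nearest_in_b(A, B, batch=4096):
    """Return, for each row of A, the index of the closest row of B."""
    A = cp.asarray(A, dtype=cp.float32)
    B = cp.asarray(B, dtype=cp.float32)
    b_sq = cp.sum(B * B, axis=1)                  # ||b||^2, reused every batch
    out = cp.empty(A.shape[0], dtype=cp.int64)
    for start in range(0, A.shape[0], batch):
        a = A[start:start + batch]
        # Squared Euclidean distance via ||a||^2 - 2 a.b + ||b||^2.
        d = cp.sum(a * a, axis=1)[:, None] - 2.0 * a @ B.T + b_sq[None, :]
        out[start:start + batch] = cp.argmin(d, axis=1)
    return cp.asnumpy(out)

# Usage on random 3-D points.
rng = np.random.default_rng(0)
idx = nearest_in_b(rng.standard_normal((100_000, 3)),
                   rng.standard_normal((50_000, 3)))
print(idx[:5])
```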
Parallel Computing with DNA Forensics Data
Adam Michaleas; Philip Fremont-Smith; Chelsea Lennartz; Darrell O. Ricke (MIT Lincoln Laboratory)
High-throughput sequencing (HTS) of single nucleotide polymorphisms (SNPs) provides advanced DNA forensics capabilities including
complex mixture analysis. This paper describes a scalable pipeline for large DNA forensics data that can be used either on a standalone system or on high-performance computing systems. This pipeline enables parallel processing of
multiple samples. Surveillance modules detect completed sequencing datasets on both Illumina and Ion Torrent platforms.
GrigoraSNPs is used for automated SNP allele calling from FASTQ files. These results are automatically loaded into the IdPrism DNA
mixture analysis system. HTS SNP data analysis typically completes in roughly 7 minutes for 100M sequences, including SNP allele
calling, enabling rapid access to the results within the IdPrism system for identification and complex mixture analysis of multiplexed
samples.
5-3: Big Data and Distributed Computing 1 Session (14:15-15:30)
Co-Chairs: Sadas Shankar & Chansup Byun
Invited Talk: Data-Driven Precision Neuroscience
Dr. John Reynders (Neumora)
Distributed Out-of-Memory SVD on CPU/GPU Architectures [Outstanding Paper Award]
Ismael Boureima; Manish Bhattarai; Maksim E Eren; Nick Solovyev; Hristo Djidjev; Boian Alexandrov (Los Alamos National Laboratory)
We propose an efficient, distributed, out-of-memory implementation of the truncated singular value decomposition (t-SVD) for heterogeneous high performance computing (HPC) systems. Various implementations of SVD have been proposed, but most only estimate the singular values, as estimating the singular vectors can significantly increase the time and memory complexity of the algorithm. In this work, we propose an implementation of SVD based on the power method, which estimates both the truncated singular values and the singular vectors. Memory utilization bottlenecks in the power method used to decompose a matrix A are typically associated with the computation of the Gram matrix A^T A, which can be significant when A is large and dense, or when A is super-large and sparse. The proposed implementation is optimized for out-of-memory problems where the memory required to factorize a given matrix is greater than the available GPU memory. We reduce the memory complexity of A^T A by using a batching strategy where the intermediate factors are computed block by block, and we hide the I/O latency associated with both host-to-device (H2D) and device-to-host (D2H) batch copies by overlapping each batch copy with compute using CUDA streams. Furthermore, we use optimized NCCL-based communicators to reduce the latency associated with collective communications (both intra-node and inter-node). In addition, sparse and dense matrix multiplications are significantly accelerated with GPU cores (or tensor cores when available), resulting in an implementation with good scaling. We demonstrate the scalability of our distributed out-of-core SVD algorithm by successfully decomposing a dense matrix of size 1 TB and a sparse matrix of 1e-6 sparsity whose size is 128 PB in dense format.
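The batching idea for the Gram matrix can be sketched in a few lines: accumulate A^T A one row block at a time so that only a single block is resident on the GPU. This single-GPU, single-stream sketch omits the paper's CUDA-stream overlap and NCCL communicators.

```python
"""Sketch: out-of-core Gram matrix A^T A accumulated over row blocks."""
import numpy as np
import cupy as cp

def gram_out_of_core(A_host, block_rows=8192):
    """A_host: (m, n) array on the host (possibly a np.memmap)."""
    n = A_host.shape[1]
    gram = cp.zeros((n, n), dtype=cp.float64)
    for start in range(0, A_host.shape[0], block_rows):
        blk = cp.asarray(A_host[start:start + block_rows])  # H2D copy of one block
        gram += blk.T @ blk          # partial A_k^T A_k accumulated on the GPU
    return gram

# Usage: the result matches the in-core product.
A = np.random.default_rng(0).standard_normal((100_000, 64))
print(cp.allclose(gram_out_of_core(A), cp.asarray(A.T @ A)))
```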
HuGraph: Acceleration of GCN Training on Heterogeneous FPGA Clusters with Quantization [Outstanding Student Paper
Award]
Letian Zhao; Qizhe Wu; Xiaotian Wang; Teng Tian; Wei Wu; Xi Jin (Univ. of Science and Technology of China)
Graph convolutional networks (GCNs) have achieved significant success in numerous fields, but the need for higher performance and energy efficiency when training GCNs on larger graphs continues unabated. At the same time, because reconfigurable accelerators allow fine-grained customization of computing modules and data movement, FPGAs can address problems such as the irregular memory accesses of GCN computation. Furthermore, to scale GCN computation, the use of heterogeneous FPGAs is inevitable due to the constant iteration of new
FPGAs. In this paper, we propose a novel framework, HuGraph, which automatically maps GCN training on heterogeneous FPGA
clusters. With HuGraph, FPGAs work in synchronous data parallelism using a simple ring 1D topology that is suitable for most off-the-
shelf FPGA clusters. HuGraph uses three approaches to advance performance and energy efficiency. First, HuGraph applies full-
process quantization for neighbor-sampling-based data parallel training, thereby reducing computation and memory consumption.
Second, a novel balanced sampler is used to balance workloads among heterogeneous FPGAs so that FPGAs with fewer resources do
not become bottlenecks in the cluster. Third, HuGraph schedules the execution order of GCN training to minimize time overhead. We
implement a prototype on a single FPGA and evaluate cluster-level performance with a cycle-accurate simulator. Experiments show that
HuGraph achieves up to 102.3x, 4.62x, and 11.1x speedup compared with the state-of-the-art works on CPU, GPU, and FPGA
platforms, respectively, with negligible accuracy loss.
A Scalable Inference Pipeline for 3D Axon Tracing Algorithms
Benjamin M Fenelon; Lars Gjesteby (MIT Lincoln Laboratory); Webster Guan; Juhyuk Park; Kwanghun Chung (MIT); Laura Brattain
(MIT Lincoln Laboratory)
High inference times of machine learning-based axon tracing algorithms pose a significant challenge to the practical analysis and
interpretation of large-scale brain imagery. This paper explores a distributed data pipeline that employs a SLURM-based job array to run
multiple machine learning algorithm predictions simultaneously. Image volumes were split into N (1-16) equal chunks, each handled by a unique compute node, and stitched back together into a single 3D prediction. Preliminary results comparing the inference speed of 1-node versus 16-node job arrays demonstrated a 90.95% decrease in compute time for a 32 GB input volume and an 88.41% decrease for a 4 GB input volume. The general pipeline may serve as a baseline for future improved implementations on larger input volumes that can be
tuned to various application domains.
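A minimal sketch of the chunk-and-stitch pattern described above, driven by a SLURM job array. File names, the chunk axis, and run_model() are placeholders, not the paper's pipeline.

```python
"""Sketch: splitting a 3-D volume across a SLURM job array and stitching results."""
import os
import sys
import numpy as np

N_CHUNKS = 16
VOLUME = "volume.npy"            # hypothetical input volume on shared storage

def run_model(chunk):
    """Placeholder for the axon-tracing model's per-chunk inference."""
    return chunk > chunk.mean()  # stand-in "prediction"

def worker():
    """One SLURM array task: process chunk SLURM_ARRAY_TASK_ID and save it."""
    task = int(os.environ["SLURM_ARRAY_TASK_ID"])          # 0 .. N_CHUNKS-1
    volume = np.load(VOLUME, mmap_mode="r")
    chunk = np.array_split(volume, N_CHUNKS, axis=0)[task]
    np.save(f"pred_{task:02d}.npy", run_model(chunk))

def stitch():
    """Run once after the array finishes: concatenate chunks into one volume."""
    parts = [np.load(f"pred_{t:02d}.npy") for t in range(N_CHUNKS)]
    np.save("prediction.npy", np.concatenate(parts, axis=0))

if __name__ == "__main__":
    # Submit workers with e.g.:  sbatch --array=0-15 --wrap "python pipeline.py worker"
    worker() if sys.argv[1:] == ["worker"] else stitch()
```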
Exploring the Impacts of Software Cache Configuration for In-line Data Compression
Sansriti Ranjan; Dakota Fulp; Jon C Calhoun (Clemson Univ.)
In order to compute on or analyze large data sets, applications need access to large amounts of memory. To increase the amount of
physical memory requires costly hardware upgrades. Compressing large arrays stored in an application's memory does not require
hardware upgrades, while enabling the appearance of more physical memory. In-line compressed arrays compress and decompress
data needed by the application as it moves in and out of its working set that resides in main memory. Naive compressed arrays require
a compression or decompression operation for each store or load, respectively, which significantly hurts performance. Caching
decompressed values in a software managed cache limits the number of compression/decompression operations, improving
performance. The structure of the software cache impacts the performance of the application. In this paper, we build and utilize a
compression cache simulator to analyze and simulate various cache configurations for an application. Our simulator is able to leverage
and model the multidimensional nature of high-performance computing (HPC) data and compressors. We evaluate both direct-mapped
and set-associative caches on five HPC kernels. Finally, we construct a performance model to explore runtime impacts of cache
configurations. Results show that cache-policy tuning by increasing the block size, associativity, and cache size improves the hit rate significantly for all applications. Incorporating dimensionality further improves locality and hit rate, speeding up an application by up to 28.25%.
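The kind of experiment the simulator enables can be sketched with a tiny set-associative cache model that counts which array accesses would hit a cached decompressed block. The parameters and LRU policy here are generic assumptions, not the authors' simulator.

```python
"""Sketch: a tiny set-associative software-cache simulator for compressed arrays."""
from collections import OrderedDict

class CacheSim:
    def __init__(self, n_sets=64, ways=4, block_elems=16):   # ways=1 is direct-mapped
        self.n_sets, self.ways, self.block_elems = n_sets, ways, block_elems
        self.sets = [OrderedDict() for _ in range(n_sets)]    # tag -> None (LRU order)
        self.hits = self.misses = 0

    def access(self, index):
        block = index // self.block_elems          # which compressed block
        s, tag = block % self.n_sets, block // self.n_sets
        cache_set = self.sets[s]
        if tag in cache_set:
            cache_set.move_to_end(tag)             # LRU update on a hit
            self.hits += 1
        else:
            self.misses += 1                       # would trigger a decompression
            if len(cache_set) >= self.ways:
                cache_set.popitem(last=False)      # evict least recently used
            cache_set[tag] = None

# Usage: strided access patterns show how block size and associativity change hit rate.
sim = CacheSim()
for i in range(100_000):
    sim.access((i * 7) % 50_000)
print(sim.hits, sim.misses, sim.hits / (sim.hits + sim.misses))
```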
5-4: Big Data and Distributed Computing 2 Session (15:45-17:00)
Co-Chairs: Rich Vuduc & Nikos Pitsianis
Invited Talk: HPC Graphs in the AWS Cloud
Roger Pearce (LLNL)
pPython for Parallel Python Programming
Chansup Byun; William Arcand; David Bestor; Bill Bergeron; Vijay Gadepally; Michael Houle; Matthew Hubbell; Hayden Jananthan;
Michael Jones (MIT LLSC); Kurt Keville (MIT); Anna Klein; Peter Michaleas (MIT LLSC); Lauren Milechin (MIT); Guillermo Morales;
Julie Mullen; Andrew Prout; Albert Reuther; Antonio Rosa; Siddharth Samsi; Charles Yee; Jeremy Kepner (MIT LLSC)
pPython seeks to provide a parallel capability that delivers good speedup without sacrificing the ease of programming in Python by
implementing partitioned global array semantics (PGAS) on top of a simple file-based messaging library (PythonMPI) in pure Python.
The core data structure in pPython is a distributed numerical array whose distribution onto multiple processors is specified with a ‘map’
construct. Communication operations between distributed arrays are abstracted away from the user and pPython transparently supports
redistribution between any block-cyclic-overlapped distributions in up to four dimensions. pPython follows a SPMD (single program
multiple data) model of computation. pPython runs on any combination of heterogeneous systems that support Python, including
Windows, Linux, and MacOS operating systems. In addition to running transparently on a single node (e.g., a laptop), pPython provides
a scheduler interface, so that pPython can be executed in a massively parallel computing environment. The initial implementation uses
the Slurm scheduler. Performance results of pPython on the HPC Challenge benchmark suite demonstrate both ease of programming and scalability.
Arachne: An Arkouda Package for Large-Scale Graph Analytics
Oliver A Alvarado Rodriguez; Zhihui Du; Joseph T Patchett; Fuhuan Li; David Bader (New Jersey Inst. of Tech.)
Due to the emergence of massive real-world graphs, whose sizes may extend to terabytes, new tools must be developed to enable data
scientists to handle such graphs efficiently. These graphs may include social networks, computer networks, and genomes. In this paper,
we propose a novel graph package, Arachne, to make large-scale graph analytics more effortless and efficient based on the open-
source Arkouda framework. Arkouda has been developed to allow users to perform massively parallel computations on distributed data
with an interface similar to NumPy. In this package, we developed a fundamental sparse graph data structure and then built several
useful graph algorithms around our data structure to form a basic algorithmic library. Benchmarks and tools were also developed to
evaluate and demonstrate the use of our graph algorithms. The graph algorithms we have implemented thus far include breadth-first
search (BFS), connected components (CC), k-Truss (KT), Jaccard coefficients (JC), triangle counting (TC), and triangle centrality
(TCE). Their corresponding experimental results based on real-world and synthetic graphs are presented. Arachne is organized as an
Arkouda extension package and is publicly available on GitHub (https://github.com/Bears-R-Us/arkouda-njit).
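For a flavor of the array-oriented algorithms in the library, here is a level-synchronous BFS over a CSR graph in plain NumPy. Arachne executes the equivalent computation server-side on Arkouda's distributed arrays; this sketch only shows the algorithmic pattern, not Arachne's API.

```python
"""Sketch: level-synchronous BFS on a CSR graph with NumPy."""
import numpy as np

def bfs_levels(indptr, indices, src, n):
    """Return the BFS level of each vertex (-1 if unreachable)."""
    level = np.full(n, -1, dtype=np.int64)
    level[src] = 0
    frontier = np.array([src], dtype=np.int64)
    depth = 0
    while frontier.size:
        # Gather all neighbors of the frontier in one array-wide step.
        starts, ends = indptr[frontier], indptr[frontier + 1]
        neigh = np.concatenate([indices[s:e] for s, e in zip(starts, ends)])
        neigh = np.unique(neigh[level[neigh] == -1])   # keep unvisited vertices
        depth += 1
        level[neigh] = depth
        frontier = neigh
    return level

# Usage on a small undirected path graph 0-1-2-3 stored in CSR form.
indptr = np.array([0, 1, 3, 5, 6])
indices = np.array([1, 0, 2, 1, 3, 2])
print(bfs_levels(indptr, indices, 0, 4))   # [0 1 2 3]
```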
The Viability of Using Online Prediction to Perform Extra Work while Executing BSP Applications
Po Hao Chen; Pouya Haghi; Jae Yoon Chung (Boston Univ.); Tong Geng (Univ. of Rochester); Richard West (Boston Univ.); Anthony
Skjellum (UTC); Martin Herbordt (Boston Univ.)
A fundamental problem in parallel processing is the difficulty in efficiently partitioning work with the result that much of a parallel
program’s execution time is often spent idle or performing overhead operations. We propose to improve the efficiency of system
resource utilization by having idle processes execute extra work. We develop a method whereby the execution of extra work is
optimized through performance prediction and the setting of limits (a deadline) on the duration of the extra work. In our preliminary experiments with proxy BSP applications on a production supercomputer, we find this approach promising, with two of the applications benefiting significantly.
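A minimal sketch of the mechanism described: predict the upcoming idle window from past supersteps and run chunks of extra work only up to a deadline inside it. The moving-average predictor and the simulated barrier are illustrative assumptions, not the paper's method.

```python
"""Sketch: deadline-bounded extra work sized by an online idle-time prediction."""
import time

class IdlePredictor:
    """Exponential moving average of observed idle time per superstep."""
    def __init__(self, alpha=0.3):
        self.alpha, self.estimate = alpha, 0.0

    def update(self, observed_idle):
        self.estimate = self.alpha * observed_idle + (1 - self.alpha) * self.estimate

def do_extra_work_until(deadline, work_chunk):
    """Run small chunks of background work, stopping before the deadline."""
    done = 0
    while time.perf_counter() < deadline:
        work_chunk()
        done += 1
    return done

predictor = IdlePredictor()
for superstep in range(5):
    budget = 0.9 * predictor.estimate                 # leave slack before the deadline
    chunks = do_extra_work_until(time.perf_counter() + budget,
                                 lambda: sum(range(10_000)))
    wait_start = time.perf_counter()
    time.sleep(max(0.0, 0.01 - budget))               # stand-in for the remaining barrier wait
    predictor.update(budget + time.perf_counter() - wait_start)
    print(superstep, "extra-work chunks:", chunks)
```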
Real-Time Software Architecture for EM-Based Radar Signal Processing and Tracking
Alan W Nussbaum (Georgia Tech, GTRI); Byron Keel (GTRI); William Dale Blair (GTRI, Georgia Tech); Umakishore Ramachandran
(Georgia Tech)
While a radar tracks the kinematic state (position, velocity, and acceleration) of the target, optimal signal processing requires knowledge of the target's range rate and radial acceleration, which are derived from the tracking function in real time. High-precision tracks are achieved through precise range and angle measurements whose precision is determined by the signal-to-noise ratio (SNR)
of the received signal. The SNR is maximized by minimizing the matched filter loss due to uncertainties in the radial velocity and
acceleration of the target. In this paper, the Expectation-Maximization (EM) algorithm is proposed as an iterative signal processing
scheme for maximizing the SNR by executing enhanced range walk compensation (i.e., correction for errors in the radial velocity and
acceleration) in the real-time control loop software architecture. Maintaining a stringent timeline and adhering to latency requirements
are essential for real-time sensor signal processing. This research aims to examine existing methods and explore new approaches and
technologies to mitigate the harmful effects of range walk in tracking radar systems with an EM-based iterative algorithm and implement
the new control loop steering methods in a real-time computing environment.