2022
IEEE High Performance Extreme Computing
Virtual Conference
19 - 23 September 2022
Wednesday, September 21
3-V Session (10:30-11:00)
Co-Chairs: Albert Reuther
3-1: AI / Machine Learning 1 Session (11:00-12:15)
Co-Chairs: Ashok Krishnamurthy & Julie Mullen
Invited Talk: Making AI Real: Insights from the Lab to Operations
Maj. Michael Kanaan (USAF)
Benchmarking Resource Usage for Efficient Distributed Deep Learning [Outstanding Paper Award]
Nathan C Frey (MIT LLSC); Baolin Li (Northeastern Univ.); Joseph P McDonald; Dan Zhao; Michael S Jones; David Bestor (MIT
LLSC); Devesh Tiwari (Northeastern Univ.); Vijay Gadepally; Siddharth Samsi (MIT LLSC)
Deep learning (DL) workflows demand an ever-increasing budget of compute and energy in order to achieve outsized gains. As
such, it becomes essential to understand how different deep neural networks (DNNs) and their training leverage increasing compute and energy resources---especially specialized, computationally intensive models across different domains and applications. In this paper,
we conduct over 3,400 experiments training an array of deep networks representing various domains/tasks---natural language
processing, computer vision, and chemistry---on up to 424 graphics processing units (GPUs). During training, our experiments
systematically vary compute resource characteristics and energy-saving mechanisms such as power utilization and GPU clock rate
limits to capture and illustrate the different trade-offs and scaling behaviors each representative model exhibits under various
resource and energy-constrained regimes. We fit power law models that describe how training time scales with available compute
resources and energy constraints. We anticipate that these findings will help inform and guide high-performance computing
providers in optimizing resource utilization, by selectively reducing energy consumption for different deep learning tasks/workflows
with minimal impact on training.
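A minimal sketch of the kind of power-law fit described above, assuming hypothetical (GPU count, training time) measurements rather than the paper's data; scipy.optimize.curve_fit recovers the scaling exponent.

```python
# Sketch: fit a power-law scaling model t(n) = a * n^b to hypothetical
# (GPU count, training time) measurements. The data below are illustrative
# placeholders, not results from the paper.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b):
    return a * np.power(n, b)

gpus = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
hours = np.array([96.0, 52.0, 28.5, 16.0, 9.5, 6.0, 4.1])  # hypothetical training times

(a, b), _ = curve_fit(power_law, gpus, hours, p0=(100.0, -1.0))
print(f"t(n) ~ {a:.1f} * n^{b:.2f}")  # an exponent near -1 indicates near-linear scaling
```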
Ultra Low-Power Deep Learning Applications at the Edge with Jetson Orin AGX Hardware
Mark Barnell (AFRL); Courtney Raymond (AFRL); Steven Smiley; Darrek Isereau; Daniel Brown (SRC, Inc.)
The latest NVIDIA Jetson Orin AGX hardware provides new capabilities for “at the edge” processing, where sensor information is collected. The computing architecture does this by delivering massive computational performance in a small form factor at low power consumption. The recently released (2022) Orin and the novel research completed on this effort were combined to
accelerate development and demonstration of a new concept of operation for machine learning at the edge. This research included
the development of a concept that uses the deep learning object detector YOLOv4-tiny on the Jetson Orin AGX, which obtains data through a video feed from a drone to emulate autonomous capabilities for onboard embedded computing. Further, this research included the development of model-based solutions on both the public (VisDrone) and newly collected optical datasets. Building on this, the technical approach included applying these concepts through experiments and demonstrations. Specifically, a data collection and processing plan was developed and implemented. Importantly, our technical approach allowed us to rapidly move from non-real-time processing to successfully demonstrating real-time, in-flight capabilities. In summary, this
research included the use of new compute hardware, novel processing algorithms, and a unique concept of operation. This technical
approach resulted in the real-time detection of targets (vehicles) from various flight altitudes (nominally 400 ft) using newly collected
electro-optical (EO) data obtained in real time through the drone’s High-Definition Multimedia Interface (HDMI).
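As a rough illustration of the inference loop such a concept of operation implies, the sketch below runs YOLOv4-tiny through OpenCV's DNN module on frames from a capture device; the file names, device index, and thresholds are assumptions, not the flight system's configuration.

```python
# Sketch of real-time YOLOv4-tiny inference on a captured video feed
# (e.g., an HDMI capture device enumerated as a camera). File names,
# device index, and thresholds are placeholders, not the paper's setup.
import cv2

net = cv2.dnn.readNetFromDarknet("yolov4-tiny.cfg", "yolov4-tiny.weights")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)   # GPU path if OpenCV was built with CUDA
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1.0 / 255, swapRB=True)

cap = cv2.VideoCapture(0)  # assumed: HDMI capture device exposed as camera 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    class_ids, scores, boxes = model.detect(frame, confThreshold=0.4, nmsThreshold=0.4)
    for cls, score, (x, y, w, h) in zip(class_ids, scores, boxes):
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)  # draw detection
    cv2.imshow("detections", frame)
    if cv2.waitKey(1) == 27:  # Esc to quit
        break
cap.release()
```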
Performance Estimation for Efficient Image Segmentation Training of Weather Radar Algorithms
Joseph P McDonald (MIT LLSC); James Kurdzo; Phillip Stepanian; Mark Veillette (MIT Lincoln Laboratory); David Bestor; Michael
Jones; Vijay Gadepally; Siddharth Samsi (MIT LLSC)
Deep learning places a dramatically increasing demand on compute resources, with a corresponding increase in the energy required to develop, explore, and test model architectures for various applications. Parameter tuning for networks customarily involves training
multiple models in a search over a grid of parameter choices either randomly or exhaustively, and strategies applying complex
search methods to identify candidate model architectures require significant computation for each possible architecture sampled in
the model spaces. However, these approaches of extensively training many individual models in order to choose a single best-performing model for future inference can seem unnecessarily wasteful at a time when energy efficiency and minimizing computing’s environmental impact are increasingly important. Techniques and algorithms that reduce the computational budget needed to identify and train accurate deep networks among many options are therefore greatly needed.
This work considers one recently proposed approach, Training Speed Estimation, alongside deep learning approaches for a common hydrometeor classification problem, hail prediction through semantic image segmentation. We apply this method to the training of a variety of segmentation models and evaluate its effectiveness as a performance tracking approach for energy-aware
neural network applications. This approach, together with early-stopping, offers a straightforward strategy for minimizing energy
expenditure. By measuring consumption and estimating the level of energy savings, we are able to characterize this strategy as a
practical method for minimizing deep learning’s energy and carbon impact.
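A minimal sketch of the ranking idea behind Training Speed Estimation, accumulating early training losses as a cheap proxy for final model quality; the models, data loader, and budget below are synthetic placeholders rather than the paper's segmentation setup.

```python
# Sketch of Training Speed Estimation (TSE)-style ranking: the summed training
# loss over a short budget is used as a cheap proxy for final model quality.
# Models, data, and budget are synthetic placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def tse_score(model, loader, budget_epochs=2, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    total = 0.0
    for _ in range(budget_epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
            total += loss.item()        # running sum of training losses
    return total                         # lower sum -> faster training -> better rank

# Synthetic stand-in data and two candidate models of different capacity.
X = torch.randn(512, 20)
y = (X[:, 0] > 0).long()
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)
candidates = [nn.Sequential(nn.Linear(20, 8), nn.ReLU(), nn.Linear(8, 2)),
              nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))]

best = min(candidates, key=lambda m: tse_score(m, loader))  # train only this one fully
print("selected candidate:", best)
```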
Deep Gaussian Process with Multitask and Transfer Learning for Performance Optimization
Wissam Sid-Lakhdar (Univ. of Tennessee Knoxville); Mohsen Aznaveh (Texas A&M Univ.); Piotr Luszczek (Univ. of Tennessee
Knoxville); Jack Dongarra (Univ. of Tennessee Knoxville, ORNL)
This paper combines deep Gaussian processes with multitask and transfer learning for the performance modeling and optimization
of HPC applications. Deep Gaussian processes combine the uncertainty quantification advantage of Gaussian processes with the
predictive power of deep learning. Multitask and transfer learning allow for improved learning, respectively, when several similar tasks are to be learned simultaneously and when previous learning is leveraged to help in learning new tasks. A comparison
with state-of-the-art autotuners shows the advantage of our approach on two application problems.
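The deep, multitask formulation is not reproduced here; as a simplified illustration of Gaussian-process performance modeling for autotuning, the sketch below fits a plain single-task GP to hypothetical (tuning parameter, runtime) samples with scikit-learn and uses its uncertainty to pick the next candidate.

```python
# Simplified sketch of GP-based performance modeling for autotuning: fit a GP
# to a handful of (tuning-parameter, runtime) observations and use the
# predictive mean/uncertainty to pick the next candidate. This is a plain
# single-task GP, not the paper's deep/multitask model; the data are made up.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

X = np.array([[32], [64], [128], [256], [512]], dtype=float)   # e.g., tile size
y = np.array([4.1, 2.7, 1.9, 2.2, 3.5])                         # e.g., runtime (s), hypothetical

gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(length_scale=100.0),
                              normalize_y=True)
gp.fit(X, y)

candidates = np.arange(16, 513, 16, dtype=float).reshape(-1, 1)
mean, std = gp.predict(candidates, return_std=True)
best = candidates[np.argmin(mean - std)]   # optimistic (lower-confidence-bound) pick
print("next tile size to try:", int(best[0]))
```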
Tutorial Session: 3-T (12:15-15:45): SPIRAL Tutorial
Organizer(s): Franz Franchetti & Mike Franusich
3-2: AI / Machine Learning 2 Session (12:30-13:45)
Co-Chairs: Janice Gordon & Sanmukh Rao Kuppannagari
Invited Talk: Challenges in Geospatial Computing
Prof. Taylor Perron (MIT Geology)
A High-performance Deployment Framework for Pipelined CNN Accelerators with Flexible DSE Strategy
Conghui Luo; Wenjin Huang; Dehao Xiang; Yihua Huang (Sun Yat-sen Univ.)
The pipelined DCNN (deep convolutional neural network) accelerator can effectively take advantage of inter-layer parallelism, so it is widely used, e.g., in video stream processing. However, the large volume of intermediate results generated in the pipelined accelerator imposes a considerable burden on the on-chip storage resources of FPGAs. To ease this storage burden, a storage-optimized design space exploration (DSE) method is proposed at the cost of a slight drop in computing resource utilization
ratio. The experimental results show that the DSE strategy can achieve 98.49% and 98.00% CE (Computation Engines) utilization
ratio on VGG16 and ResNet101, respectively. In addition, the resource optimization strategy can save 27.84% of BRAM resources
on VGG16, while the CE utilization ratio dropped by only 3.04%. An automated deployment framework that is adaptable to different
networks with high computing resource utilization ratio is also proposed in this paper, which can achieve workload balancing
automatically by optimizing the computing resource allocation of each layer.
Enabling Transformers to Understand Low-Level Programs
Zifan Guo; William S Moses (MIT)
Unlike prior approaches to machine learning, Transformer models can first be trained on a large corpus of unlabeled data with a
generic objective and then on a smaller task-specific dataset. This versatility has led to both larger models and datasets.
Consequently, Transformers have led to breakthroughs in the field of natural language processing. Generic program optimization
presently operates on low-level programs such as LLVM IR. Unlike high-level languages (e.g., C, Python, Java), which have seen initial success in machine-learning analyses, lower-level languages tend to be more verbose and repetitive in order to precisely specify program behavior, provide more details about microarchitecture, and expose properties necessary for optimization, all of which makes them difficult for machine learning to handle.
In this work, we apply transfer learning to low-level (LLVM) programs and study how low-level programs can be made more
amenable to Transformer models through various techniques, including preprocessing, infix/prefix operators, and information
deduplication. We evaluate the effectiveness of these techniques through a series of ablation studies on the task of translating C to
both unoptimized (-O0) and optimized (-O1) LLVM IR. On the AnghaBench dataset, our model achieves a 49.57% verbatim match
and BLEU score of 87.68 against Clang -O0 and 38.73% verbatim match and BLEU score of 77.03 against Clang -O1.
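As a hypothetical illustration of the kind of preprocessing that makes LLVM IR less verbose and repetitive for a sequence model (this is not the paper's exact pipeline), the sketch below strips comments and metadata attachments and renames numbered SSA registers to a canonical scheme.

```python
# Hypothetical illustration of normalizing LLVM IR text before tokenization:
# drop comments and metadata attachments, then rename numbered SSA registers
# in order of first appearance so equivalent programs tokenize alike.
# This is an example of the general idea, not the paper's pipeline.
import re

def normalize_ir(ir_text: str) -> str:
    lines = []
    for line in ir_text.splitlines():
        line = line.split(";")[0]                    # remove comments
        line = re.sub(r", !\w+ !\d+", "", line)      # remove attachments like ", !tbaa !2"
        if line.strip():
            lines.append(line.rstrip())
    mapping = {}
    def rename(match):
        reg = match.group(0)
        mapping.setdefault(reg, f"%v{len(mapping)}")
        return mapping[reg]
    return re.sub(r"%\d+", rename, "\n".join(lines))

example = "%1 = load i32, i32* %p, align 4, !tbaa !2 ; load x\n%2 = add i32 %1, 1"
print(normalize_ir(example))   # "%v0 = load i32, i32* %p, align 4" / "%v1 = add i32 %v0, 1"
```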
Online Detection and Classification of State Transitions of Multivariate Shock and Vibration Data
Nicklaus Przybylski; William M Jones (Coastal Carolina Univ.); Nathan DeBardeleben (Los Alamos National Laboratory)
The US Department of Energy’s (DOE) Los Alamos National Laboratory (LANL) is interested in automatic anomaly detection and
classification applied to highly instrumented flight shock and vibration data for the purpose of providing insight into operational safety.
For example, the safe and secure transport of materials and devices during a variety of conditions is particularly of interest. In this
work, we apply well-known Machine Learning (ML) techniques to a publicly available motor vibration data set that serves as a proxy
to the actual LANL data. We successfully train a random forest to classify anomalous motor states using the signal data set, and use
this model to simulate real-time anomaly detection and event classification on multi-variate time series data. Furthermore, we
perform an extensive suite of computational studies on a large cluster computer to determine optimal parametric settings for our
framework and evaluate the cost-benefit of these parameters.
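A minimal sketch of the general window-and-classify approach with a random forest; the synthetic signals and summary features below are stand-ins for the proxy motor vibration dataset used in the paper.

```python
# Sketch of window-based state classification with a random forest: slice a
# multivariate vibration signal into fixed windows, compute simple per-channel
# summary features, and train a classifier. Signals are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_signal(anomalous, n=1024, channels=3):
    base = rng.normal(0, 1.0, size=(n, channels))
    if anomalous:
        base += np.sin(np.linspace(0, 40 * np.pi, n))[:, None] * 2.0  # injected vibration mode
    return base

def window_features(sig, win=128):
    feats = []
    for start in range(0, sig.shape[0] - win + 1, win):
        w = sig[start:start + win]
        feats.append(np.concatenate([w.mean(0), w.std(0), np.abs(w).max(0)]))
    return np.array(feats)

X, y = [], []
for label in (0, 1):
    for _ in range(50):
        f = window_features(make_signal(label == 1))
        X.append(f)
        y.extend([label] * len(f))
X, y = np.vstack(X), np.array(y)

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(Xtr, ytr)
print("held-out accuracy:", clf.score(Xte, yte))
```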
Surrogate ML/AI Model Benchmarking for FAIR Principles’ Conformance
Piotr Luszczek; Cade E Brown (Univ. of Tennessee Knoxville)
We present a benchmarking platform for surrogate ML/AI models that enables the essential properties for open science and allows them to be findable, accessible, interoperable, and reusable. We also present a use case of cloud cover modeling, analysis, and experimental testing based on a large dataset of multi-spectral satellite sensor data. We use this particular evaluation to highlight the plethora of choices that must be resolved across the life cycle of supporting scientific workflows with data-driven models, which need to be first trained to satisfactory accuracy and later monitored during field usage for proper feedback into both computational results and future data-model improvements. Unlike traditional testing, performance, or analysis efforts, we focus exclusively on science-oriented metrics as the relevant figures of merit.
3-3: AI / Machine Learning 3 Session (14:15-15:30)
Co-Chairs: Janice Gordon & Sanmukh Rao Kuppannagari
Invited Talk: Trends in Energy Estimates for Computing in AI/Machine Learning Accelerators, Supercomputers, and
Compute-Intensive Applications
Albert Reuther (MIT LLSC); Sadasivan Shankar (Stanford Univ.)
We examine the computational energy requirements of different systems, driven by the geometrical scaling law (known as Moore’s law, or Dennard scaling for geometry) and the increasing use of Artificial Intelligence/Machine Learning (AI/ML) over the last decade. With more scientific and technology applications based on data-driven discovery, machine learning methods, especially deep neural networks, have become widely used. In order to enable such applications, both hardware accelerators and advanced AI/ML methods have led to the introduction of new architectures, system designs, algorithms, and software. Our analysis of energy trends
indicates three important observations: 1) Energy efficiency due to geometrical scaling is slowing down; 2) The energy efficiency at
the bit-level does not translate into efficiency at the instruction level, or at the system level for a variety of systems, especially for
large-scale supercomputers; 3) At the application level, general-purpose ML/AI methods can be computationally energy intensive,
offsetting the gains in energy from geometrical scaling and special-purpose accelerators. Further, our analysis provides specific
pointers for integrating energy efficiency with performance analysis for enabling ML/AI-driven and high-performance computing
applications in the future.
Walker Activity Tracking Using Machine Learning
Maxwell A. Huang; Edward A. Clancy (Worcester Polytechnic Institute)
An accurate, economical, and reliable algorithm for detecting falls in persons ambulating with the assistance of an orthopedic walker
is crucially important for the elderly and patients recovering from surgery, but existing tracking devices largely fail in these aspects.
This project proposes a novel solution that employs motion tracking by attaching a wireless inertial measurement unit (IMU) sensor
directly to a walker. Collected IMU data are transferred to a computer through a wireless link for processing. Data augmentation and
machine learning are applied to train a convolutional neural network (CNN) to classify the movements as standing, walking, or
(possible) falling. Preliminary testing shows that the CNN can produce a classification accuracy of 99.8% and can consistently detect
falls. The machine learning algorithm can potentially be targeted to an on-board embedded processor in the future.
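A compact sketch of the kind of 1-D CNN that could classify fixed-length IMU windows into standing, walking, or falling; the channel count, window length, and layer sizes are assumptions, not the project's trained network.

```python
# Sketch of a small 1-D CNN for classifying fixed-length IMU windows
# (e.g., 3-axis accelerometer + 3-axis gyroscope) into standing / walking /
# falling. Layer sizes and the 200-sample window are assumptions.
import torch
import torch.nn as nn

class WalkerCNN(nn.Module):
    def __init__(self, channels=6, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(channels, 32, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):              # x: (batch, channels, window)
        return self.classifier(self.features(x).squeeze(-1))

model = WalkerCNN()
dummy = torch.randn(8, 6, 200)         # a batch of 8 synthetic IMU windows
print(model(dummy).shape)              # torch.Size([8, 3]) class logits
```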
An Evaluation of Low Overhead Time Series Preprocessing Techniques for Downstream Machine Learning
Matthew L. Weiss; Joseph McDonald; David Bestor; Charles Yee (MIT LLSC); Daniel Edelman (MIT); Michael S Jones (MIT LLSC);
Andrew Prout (MIT LLSC); Andrew Bowne; Lindsey McEvoy (US Air Force); Vijay Gadepally; Siddharth Samsi (MIT LLSC)
In this paper we address the application of preprocessing techniques to multi-channel time series data with varying lengths, which
we refer to as the alignment problem, for downstream machine learning. The misalignment of multi-channel time series data may
occur for a variety of reasons, such as missing data, varying sampling rates, or inconsistent collection times. We consider multi-
channel time series data collected from the MIT SuperCloud High Performance Computing (HPC) center, where different job start
times and varying run times of HPC jobs result in misaligned data. This misalignment makes it challenging to build AI/ML approaches
for tasks such as compute workload classification. Building on previous supervised classification work with the MIT SuperCloud
Dataset, we address the alignment problem via three broad, low overhead approaches: sampling a fixed subset from a full time
series, performing summary statistics on a full time series, and sampling a subset of coefficients from time series mapped to the
frequency domain. Our best performing models achieve a classification accuracy greater than 95%, outperforming previous
approaches to multi-channel time series classification with the MIT SuperCloud Dataset by 5%. These results indicate our low
overhead approaches to solving the alignment problem, in conjunction with standard machine learning techniques, are able to
achieve high levels of classification accuracy, and serve as a baseline for future approaches to addressing the alignment problem,
such as kernel methods.
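A small numpy sketch of the three low-overhead strategies named above, applied to one variable-length, multi-channel series: fixed-size subsampling, per-channel summary statistics, and a fixed number of FFT coefficients; the shapes and example series are illustrative.

```python
# Sketch of three low-overhead alignment strategies on a single variable-length,
# multi-channel time series (channels x time): fixed-size subsampling,
# per-channel summary statistics, and truncated FFT coefficients.
import numpy as np

def subsample(ts, k=64):
    idx = np.linspace(0, ts.shape[1] - 1, k).astype(int)
    return ts[:, idx].ravel()                        # fixed length: channels * k

def summary_stats(ts):
    return np.concatenate([ts.mean(1), ts.std(1), ts.min(1), ts.max(1)])

def fft_coeffs(ts, k=32):
    spec = np.abs(np.fft.rfft(ts, axis=1))[:, :k]    # first k frequency magnitudes
    return spec.ravel()

rng = np.random.default_rng(0)
series = rng.normal(size=(4, rng.integers(500, 3000)))   # 4 channels, arbitrary length
for f in (subsample, summary_stats, fft_coeffs):
    print(f.__name__, f(series).shape)               # each yields a fixed-size vector
```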
Deep Learning For Tissue Classification
Kimberly Robasky (RENCI/UNC Chapel Hill)
Although classification of human tissue samples by gene expression is a valuable application, whole genome expression data
include tens of thousands of features and clinical samples are relatively scarce due to cost of collection and concerns for patient
privacy and confidentiality. Analytical methods risk over-fitting and are challenged by the highly nonlinear nature of gene expression
covariates. Deep learning has gained attention in radiology, pathology, and clinical informatics for facilitating decision support with
image classification and natural language processing. Biomolecular deep learning applications are likewise growing quickly in the
areas of drug design and diagnostics. Deep learning has opened opportunities for understanding and deploying best-of-breed
analytical tools, especially for tissue classification, which would be difficult to implement to such a degree of accuracy and reusability
without the benefit of high-performance computing. Presented here is a new tissue classification model built with deep learning tools
on high performance computing hardware. Twenty-six (26) classes from the GTEx dataset, each with at least 100 samples, were
used to train a simple multi-layer perceptron (MLP) on only 2,080 samples and 18,964 features. Despite the model having close to
nineteen (19) million trainable parameters, a weighted F-score of 0.98 and 98% accuracy were achieved, as compared with a
multinomial regression that achieved 95% accuracy after 1000 training steps on the same data. The difference in performance
between these models is expected, given the nature of variance exhibited by gene expression data. While the neural network
method requires resources beyond standard consumer-grade electronics for training, the trained model is suitable for deployment on
hardware readily available in most labs. Furthermore, these deep learning methods are robust to various gene expression
technologies, including bulk RNASeq and gene expression microarrays.
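A sketch of a simple MLP with the input and output dimensions quoted above (18,964 features, 26 classes); the hidden-layer sizes are assumptions, and a random batch stands in for GTEx expression data.

```python
# Sketch of a simple multi-layer perceptron with the dimensions quoted in the
# abstract (18,964 expression features, 26 tissue classes). Hidden-layer sizes
# are assumptions; a random batch stands in for normalized GTEx expression values.
import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(18_964, 1_000), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(1_000, 256), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(256, 26),
)
print(sum(p.numel() for p in mlp.parameters()))   # on the order of 19 million parameters

batch = torch.randn(32, 18_964)                   # stand-in expression data
logits = mlp(batch)                               # (32, 26) class scores
```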
3-4: General Purpose GPU Computing 1 Session (15:45-17:00)
Co-Chairs: Sadas Shankar & Hameed Badawy
Invited Talk: New Frontiers in Performance at Wafer Scale
Dr. Rob Schreiber (Cerebras)
AI and ML Accelerator Survey and Trends
Albert Reuther; Peter Michaleas; Michael S Jones; Vijay Gadepally; Siddharth Samsi; Jeremy Kepner (MIT LLSC)
This paper updates the survey of AI accelerators and processors from the past three years. It collects and summarizes the current commercial accelerators that have been publicly announced with peak performance and power consumption numbers. The performance and power values are plotted on a scatter graph, and a number of dimensions and observations from the trends on this plot are again discussed and analyzed. Two new trend plots based on accelerator release dates are included in this year’s paper, along with the additional trends of some neuromorphic, photonic, and memristor-based inference accelerators.
A Multi-GPU Parallel Genetic Algorithm For Large-Scale Vehicle Routing Problems
Marwan Abdelatti; Manbir S Sodhi; Resit Sendag (Univ. of Rhode Island)
The Vehicle Routing Problem (VRP) is fundamental to logistics operations. Finding optimal solutions for VRPs related to large, real-
world operations is computationally expensive. Genetic algorithms (GA) have been used to find good solutions for different types of
VRPs but are slow to converge. This work utilizes high-performance computing (HPC) platforms to design a parallel GA (PGA)
algorithm for solving large-scale VRPs. The algorithm is implemented on an eight-GPU NVIDIA DGX-1 server. Maximum
parallelism is achieved by mapping all algorithm arrays into block threads to achieve high throughput and reduced latency for full
GPU utilization. Tests with VRP benchmark problems of up to 20,000 nodes compare the algorithm performance (speed) with
different GPU counts and a multi-CPU implementation. The developed algorithm provides the following improvements over CPU or
single-GPU-based algorithms: (i) larger problem sizes up to 20,000 nodes are handled, (ii) execution time is reduced over the CPU
by a factor of 1,700, and (iii) for the range tested, the performance increases monotonically with the number of GPUs.
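The multi-GPU mapping and genetic operators are not reproduced here; the PyTorch sketch below only illustrates the data-parallel core of such a design, evaluating the tour lengths of an entire population in one batched GPU computation.

```python
# Sketch of data-parallel GA fitness evaluation for routing: tour lengths for an
# entire population of candidate routes are computed at once on the GPU with a
# batched gather. Selection/crossover and the paper's multi-GPU mapping are omitted.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
n_cities, pop_size = 1_000, 4_096

coords = torch.rand(n_cities, 2, device=device)                       # random city locations
pop = torch.stack([torch.randperm(n_cities, device=device)
                   for _ in range(pop_size)])                         # population of tours

def tour_lengths(pop):
    pts = coords[pop]                                                 # (pop, cities, 2)
    nxt = torch.roll(pts, shifts=-1, dims=1)                          # next city on each tour
    return (pts - nxt).norm(dim=2).sum(dim=1)                         # (pop,) total distances

fitness = tour_lengths(pop)
elite = pop[fitness.argsort()[:64]]                                   # keep the best candidates
print("best tour length:", fitness.min().item())
```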
DASH: Scheduling Deep Learning Workloads on Multi-Generational GPU-Accelerated Clusters
Baolin Li; Tirthak Patel (Northeastern Univ.); Vijay Gadepally (MIT LLSC); Karen Gettings (MIT Lincoln Laboratory); Siddharth Samsi
(MIT LLSC); Devesh Tiwari (Northeastern Univ.)
Two notable characteristics of modern GPU-accelerated HPC clusters are: (1) they increasingly run deep learning (DL) model-
training workloads, and (2) they consist of multiple generations of GPUs, i.e., they are heterogeneous. However, existing works in
GPU cluster scheduling for DL workloads have not addressed the GPU multi-generation problem. We propose DASH, a GPU cluster
scheduler designed to optimally make a match between different DL workloads and GPU types in a multi-generational GPU
environment. By leveraging execution characteristics of co-scheduled DL workloads, DASH can improve the average job runtime by
17% and the average job completion time by 14% compared to the traditional heterogeneity-unaware job scheduler.
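DASH itself is not reproduced here; the sketch below only illustrates the underlying matching idea, greedily assigning queued DL jobs to free GPU generations using a hypothetical table of per-generation runtime estimates.

```python
# Illustration of heterogeneity-aware matching (not the DASH algorithm itself):
# greedily assign queued DL jobs to the free GPU generation with the lowest
# estimated runtime. Job names and the runtime table are hypothetical.
est_runtime_hours = {                      # job -> {GPU generation: estimated runtime}
    "bert_finetune": {"V100": 6.0, "A100": 2.5},
    "resnet50":      {"V100": 3.0, "A100": 2.0},
    "gnn_train":     {"V100": 4.0, "A100": 3.8},
}
free_gpus = {"V100": 2, "A100": 1}

# Assign jobs with the largest generation-to-generation runtime gap first.
assignments = {}
for job in sorted(est_runtime_hours,
                  key=lambda j: min(est_runtime_hours[j].values())
                                - max(est_runtime_hours[j].values())):
    options = {g: t for g, t in est_runtime_hours[job].items() if free_gpus.get(g, 0) > 0}
    gpu = min(options, key=options.get)    # pick the fastest available generation
    assignments[job] = gpu
    free_gpus[gpu] -= 1
print(assignments)
```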
Evaluation of a Novel Scratchpad Memory Through Compiler Supported Simulation
Essa Imhmed (Eastern New Mexico Univ.); Jonathan Cook; Hameed Badawy (New Mexico State Univ.)
Local Memory Store (LMStore) is a novel hardware-controlled, compiler-managed scratchpad memory (SPM) design, with an initial research evaluation that showed its potential for improving program performance. This initial evaluation was performed over memory traces prior to the development of compiler support for LMStore. In this paper, we present compiler support for the LMStore design and present experimental results that better evaluate LMStore performance. Experimental results on benchmarks from the Malardalen benchmark suite executing on the LMStore architecture modeled in Multi2Sim demonstrate that a hybrid LMStore-Cache architecture improves execution time by an average of 19.8%, compared to a conventional cache-only architecture.
3-S1: AI Challenges Special (17:30-19:30)
Organizer: Vijay Gadepally
AIA Challenges – Status Update
Vijay Gadepally (MIT LLSC); Andy Bowne (US Air Force)
SEVIR Challenge
Mark Veillette; Esther Wolff (MIT Lincoln Laboratory)
Datacenter Challenge
Siddharth Samsi; Matthew Weiss (MIT LLSC)
Rainforest Challenge
Miriam Cha (MIT Lincoln Laboratory)
ManeuverID Challenge
Kaira Samuel (MIT)
MagNav Challenge
Jonathan Taylor (MIT Lincoln Laboratory)
CogPilot Challenge
Sophia Yuditskaya; Laura Brattain (MIT Lincoln Laboratory)
3-S2: Emerging Technologies Special (17:30-19:30)
Organizers: Kurt Keville, Donato Kava, Po Hao Chen
An HPC Watershed – Next Generation Arithmetic
John Gustafson (Arizona State Univ.)