2020 IEEE High Performance Extreme Computing Virtual Conference 21 - 25 September 2020
Monday, September 21

1-1: General Purpose GPU Computing Session (11:00-12:15 EDT)
Minesweeper: A Novel and Fast Ordered-Statistic CFAR Algorithm
Carl Colena (Lockheed Martin Advanced Technology Laboratories)*; Michael Russell (Lockheed Martin Advanced Technology Laboratories); Stephen Braun (Lockheed Martin Advanced Technology Laboratories)
A novel algorithm named 'Minesweeper' was developed for computing the Ordered-Statistic Constant False Alarm Rate (OS-CFAR) in a computationally efficient way. OS-CFAR processing chains are used in radar applications for noise-floor estimation and target discrimination. Unlike other approaches, this algorithm aims to minimize data reuse by using training-cell geometry and an accumulation matrix to compute the noise estimate. Computing the OS-CFAR in this manner affords some unique efficiencies that are novel for this application, including runtime invariance with respect to the bit depth of the input data and to the training geometry. Three implementations of Minesweeper were developed and benchmarked. The Optimized GPU Implementation (GPU-OPT) performed the best in both throughput and latency for large inputs. This algorithm has potential for use in real-time GPU-accelerated SDR applications.

Design, Optimization, and Benchmarking of Dense Linear Algebra Algorithms on AMD GPUs
Cade Brown (UTK)*; Ahmad Abdelfattah (UTK); Stanimire Tomov (University of Tennessee); Jack Dongarra (University of Tennessee)
Dense linear algebra (DLA) has historically been in the vanguard of software that must be adapted first to hardware changes. This is both because DLA is critical to the accuracy and performance of so many different types of applications, and because its routines have proved to be outstanding vehicles for finding and implementing solutions to the problems that novel architectures pose. Therefore, in this paper we investigate the portability of the MAGMA DLA library to the latest AMD GPUs. We use automated tools to convert the CUDA code in MAGMA to the Heterogeneous-Computing Interface for Portability (HIP) language. MAGMA provides LAPACK for GPUs and benchmarks for fundamental DLA routines ranging from BLAS to dense factorizations, linear systems, and eigenproblem solvers. We port these routines to HIP and quantify currently achievable performance through the MAGMA benchmarks for the main workload algorithms on MI25 and MI50 AMD GPUs. Comparisons with performance roofline models and theoretical expectations are used to identify current limitations and directions for future improvements.

A Deep Q-Learning Approach for GPU Task Scheduling
Ryan Luley (Air Force Research Laboratory)*; Qinru Qiu (Syracuse University)
Efficient utilization of resources is critical to system performance and effectiveness for high performance computing systems. In a graphics processing unit (GPU)-based system, one method for enabling higher utilization is concurrent kernel execution, which allows multiple independent kernels to execute simultaneously on the GPU. However, resource contention arising from the manner in which kernel tasks are scheduled may still lead to suboptimal task performance and utilization. In this work, we present a deep Q-learning approach to identify an ordering for a given set of tasks which achieves near-optimal average task performance and high resource utilization. Our solution outperforms other similar approaches and has the additional benefit of being adaptable to dynamic task characteristics or GPU resource configurations.
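As a rough illustration of the idea behind the Q-learning scheduler above (not the authors' implementation, which uses a deep network and measured GPU behavior), the sketch below applies plain tabular Q-learning to ordering a small set of tasks; the task names, durations, and contention penalties are invented placeholders.

```python
# Toy tabular Q-learning for ordering a small set of GPU kernel tasks. The task
# names, durations, and pairwise "contention" penalties are invented placeholders;
# a real scheduler would use measured GPU behavior (and a deep network, as in the
# paper, rather than a lookup table).
import random
from collections import defaultdict

TASKS = ("fft", "gemm", "conv", "reduce")
DURATION = {"fft": 3.0, "gemm": 5.0, "conv": 4.0, "reduce": 1.0}
# Hypothetical extra cost when task b runs immediately after task a.
CONTENTION = {("gemm", "conv"): 2.0, ("conv", "gemm"): 2.0, ("fft", "gemm"): 1.0}

Q = defaultdict(float)            # Q[(state, action)], state = (remaining tasks, last task)
alpha, gamma, eps = 0.1, 1.0, 0.2

def step_cost(last, action):
    return DURATION[action] + CONTENTION.get((last, action), 0.0)

for _ in range(20000):
    remaining, last = frozenset(TASKS), None
    while remaining:
        state = (remaining, last)
        if random.random() < eps:
            action = random.choice(sorted(remaining))
        else:
            action = max(sorted(remaining), key=lambda a: Q[(state, a)])
        nxt = (remaining - {action}, action)
        reward = -step_cost(last, action)          # minimize total completion time
        best_next = max((Q[(nxt, a)] for a in nxt[0]), default=0.0)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        remaining, last = nxt

# Greedy rollout of the learned ordering policy.
remaining, last, order = frozenset(TASKS), None, []
while remaining:
    action = max(sorted(remaining), key=lambda a: Q[((remaining, last), a)])
    order.append(action)
    remaining, last = remaining - {action}, action
print("learned order:", order,
      "total cost:", sum(step_cost(a, b) for a, b in zip((None,) + tuple(order), order)))
```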
GPU-Accelerated Discontinuous Galerkin Methods: 30x Speedup on 345 Billion Unknowns
Andrew Kirby (Massachusetts Institute of Technology Lincoln Laboratory)*; Dimitri J. Mavriplis (University of Wyoming)
A discontinuous Galerkin method for the discretization of the compressible Euler equations, the governing equations of inviscid fluid dynamics, on Cartesian meshes is developed for use on Graphics Processing Units via OCCA, a unified approach to performance portability on multi-threaded hardware architectures. A 30x time-to-solution speedup over CPU-only implementations using non-CUDA-Aware MPI communications is demonstrated on up to 1,536 NVIDIA V100 GPUs, and parallel strong scalability is shown up to 6,144 NVIDIA V100 GPUs for a problem containing 345 billion unknowns. A comparison of CUDA-Aware MPI communication to non-GPUDirect communication is performed, demonstrating an additional 24% speedup on eight nodes composed of 32 NVIDIA V100 GPUs.

Energy-Efficient Analysis of Synchrophasor Data using the NVIDIA Jetson Nano
Suzanne Matthews (US Military Academy)*; Aaron St. Leger (US Military Academy)
Smart Grid technology is an important part of increasing the resilience and reliability of power grids. Applying Phasor Measurement Units (PMUs) to obtain synchronized phasor measurements, or synchrophasors, provides more detailed, higher-fidelity data that can enhance situational awareness by rapidly detecting anomalous conditions. However, sample rates of PMUs are up to three orders of magnitude faster than traditional telemetry, resulting in large datasets that require novel computing methods to process the data quickly and efficiently. This work aims to improve the calculation speed and energy efficiency of anomaly detection by leveraging manycore computing on an NVIDIA Jetson Nano. It translates an existing PMU anomaly detection scheme into a novel GPU-compute algorithm and compares the computational performance and energy efficiency of the GPU approach to serial and multicore CPU methods. The GPU algorithm was benchmarked on a real dataset of 11.3 million measurements derived from 8 PMUs in a 1:1000-scale emulation of a power grid, and on two additional datasets derived from the original dataset. Results show that the GPU detection scheme is up to 51.91 times faster than the serial method and over 13 times faster than the multicore method. Additionally, the GPU approach exhibits up to a 92.3% run-time energy reduction compared to the serial method and a 78.4% reduction compared to the multicore approach.

1-2: High Performance Data Analysis Session (12:30-13:45 EDT)

Large-scale Sparse Tensor Decomposition Using a Damped Gauss-Newton Method
Teresa Ranadive (Laboratory for Physical Sciences)*; Muthu Baskaran (Reservoir Labs)
CANDECOMP/PARAFAC (CP) tensor decomposition is a popular unsupervised machine learning method with numerous applications. This process models a high-dimensional, multi-modal array (a tensor) as the sum of several low-dimensional components. To decompose a tensor, one must solve an optimization problem whose objective is often the sum of squared differences between the entries of the tensor and those of the decomposition model. One algorithm occasionally utilized to solve such problems is CP-OPT-DGN, a damped Gauss-Newton all-at-once optimization method for CP tensor decomposition. However, there are currently no published results that consider the decomposition of large-scale (up to billions of non-zeros), sparse tensors using this algorithm. This work considers the decomposition of large-scale tensors using an efficiently implemented CP-OPT-DGN method. It is observed that CP-OPT-DGN significantly outperforms CP-ALS (CP-Alternating Least Squares) and CP-OPT-QNR (a quasi-Newton-Raphson all-at-once optimization method for CP tensor decomposition), two other widely used tensor decomposition algorithms, in terms of accuracy and latent behavior detection.
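For context, the least-squares objective described in the CP-OPT-DGN abstract can be written as below for a third-order tensor; the damped Gauss-Newton step shown with it is the generic form of such a method, not necessarily the exact update used in the paper.

```latex
% CP least-squares objective for a third-order tensor X with rank-R factor
% matrices A, B, C (entries a_{ir}, b_{jr}, c_{kr}):
\[
  f(A,B,C) \;=\; \tfrac{1}{2} \sum_{i,j,k}
    \Bigl( x_{ijk} - \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr} \Bigr)^{2}
\]
% Generic damped Gauss-Newton step on the stacked factor variables v = (A,B,C),
% with residual vector r, Jacobian J, and damping parameter \lambda:
\[
  (J^{\top} J + \lambda I)\,\delta = -J^{\top} r,
  \qquad v \leftarrow v + \delta
\]
```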
Multiscale Data Analysis Using Binning, Tensor Decompositions, and Backtracking
Dimitri Leggas (Reservoir Labs, Inc.); Thomas Henretty (Reservoir Labs, Inc.)*; James Ezick (Reservoir Labs); Muthu Baskaran (Reservoir Labs); Brendan von Hofe (Reservoir Labs, Inc.); Grace Cimaszewski (Reservoir Labs, Inc.); M. Harper Langston (Reservoir Labs, Inc.); Richard Lethin (Reservoir Labs)
Large data sets can contain patterns at multiple scales (spatial, temporal, etc.). In practice, it is useful for data exploration techniques to detect patterns at each relevant scale. In this paper, we develop an approach to detect activities at multiple scales using tensor decomposition, an unsupervised high-dimensional data analysis technique that finds correlations between different features in the data. This method typically requires that feature values be discretized during the construction of the tensor in a process called "binning." We develop a method of constructing and decomposing tensors with different binning schemes of various features in order to uncover patterns across a set of user-defined scales. While binning is necessary to obtain interpretable results from tensor decompositions, it also decreases the specificity of the data. Thus, we develop backtracking methods that enable the recovery of the original source data corresponding to patterns found in the decomposition. These techniques are discussed in the context of spatiotemporal and network traffic data, and in particular on Automatic Identification System (AIS) data.

SparTen: Leveraging Kokkos for On-node Parallelism in a Second-Order Method for Fitting Canonical Polyadic Tensor Models to Poisson Data
Keita Teranishi (Sandia National Laboratories)*; Daniel Dunlavy (Sandia National Laboratories); Jeremy Myers (College of William & Mary, Sandia National Laboratories); Richard Barrett (Sandia National Laboratories)
Canonical Polyadic tensor decomposition using alternating Poisson regression (CP-APR) is an effective analysis tool for large sparse count datasets. One of its variants, which uses projected damped Newton optimization for row subproblems (PDNR), offers quadratic convergence and is amenable to parallelization. Despite its potential effectiveness, PDNR performance on modern high performance computing (HPC) systems is not well understood. To remedy this, we have developed a parallel implementation of PDNR using Kokkos, a performance-portable parallel programming framework supporting efficient execution of a single code base on multiple HPC systems. We demonstrate that the performance of parallel PDNR can be poor if the load imbalance associated with the irregular distribution of nonzero entries in the tensor data is not addressed. Preliminary results using tensors from the FROSTT data set indicate that using multiple kernels to address this imbalance when solving the PDNR row subproblems in parallel can improve performance, with up to 80% speedup on CPUs and 10-fold speedup on NVIDIA GPUs.
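For context, the Poisson model that CP-APR (and hence PDNR) fits minimizes the following negative log-likelihood, up to a constant; this is the standard objective from the CP-APR literature, shown here as background rather than text taken from the SparTen paper.

```latex
% Poisson (count-data) objective minimized by CP-APR for a sparse count tensor X,
% with nonnegative factor matrices A, B, C and low-rank model entries m_{ijk}:
\[
  f(A,B,C) \;=\; \sum_{i,j,k} \bigl( m_{ijk} - x_{ijk} \log m_{ijk} \bigr),
  \qquad m_{ijk} = \sum_{r=1}^{R} a_{ir}\, b_{jr}\, c_{kr}
\]
```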
Scalable Data Generation for Evaluating Mixed-Precision Solvers
Piotr Luszczek (University of Tennessee)*; Yaohung Tsai (University of Tennessee); Neil Lindquist (University of Tennessee); Hartwig Anzt (University of Tennessee); Jack Dongarra (University of Tennessee)
We present techniques for generating data for mixed-precision solvers that allow those solvers to be tested in a scalable manner. Our techniques focus on mixed-precision hardware and software in which both the solver and the hardware can take advantage of mixing multiple floating-point formats. This makes it possible to exploit the recently released generation of hardware platforms that target ML and DNN workloads but can also be used for HPC applications, provided a new breed of algorithms is combined with the custom floating-point formats to deliver performance beyond the standard IEEE data types while maintaining comparable accuracy in the results.

Parameter Sensitivity Analysis of the SparTen High Performance Sparse Tensor Decomposition Software
Jeremy Myers (College of William & Mary, Sandia National Laboratories)*; Daniel Dunlavy (Sandia National Laboratories); Keita Teranishi (Sandia National Laboratories); D.S. Hollman (Sandia National Laboratories)
Tensor decomposition models play an increasingly important role in modern data science applications. One problem of particular interest is fitting a low-rank Canonical Polyadic (CP) tensor decomposition model when the tensor has sparse structure and the tensor elements are nonnegative count data. SparTen is a high-performance C++ library that computes a low-rank decomposition using different solvers: a first-order quasi-Newton method or a second-order damped Newton method, along with the appropriate choice of runtime parameters. Because the default parameters in SparTen are tuned to experimental results from prior published work that used MATLAB implementations of these methods on a single real-world dataset, it remains unclear whether the parameter defaults in SparTen are appropriate for general tensor data. Furthermore, it is unknown how sensitive algorithm convergence is to changes in the input parameter values. This report addresses these unresolved issues with large-scale experimentation on three benchmark tensor data sets. Experiments were conducted on several different CPU architectures and replicated with many initial states to establish generalized profiles of algorithm convergence behavior.

1-3: Multicore Software Technologies Session (14:15-15:30 EDT)

Work-Efficient Parallel Algorithms for Accurate Floating-Point Prefix Sums
Sean Fraser (MIT); Helen Xu (MIT)*; Charles E. Leiserson (MIT CSAIL)
Existing work-efficient parallel algorithms for floating-point prefix sums exhibit either good performance or good numerical accuracy, but not both. Consequently, prefix-sum algorithms cannot easily be used in scientific-computing applications that require both high performance and accuracy. We have designed and implemented two new algorithms, called CAST_BLK and PAIR_BLK, whose accuracy is significantly higher than that of the high-performing prefix-sum algorithm from the Problem Based Benchmark Suite (PBBS), while running with comparable performance on modern multicore machines. Specifically, the root mean squared error of the PBBS code on a large array of uniformly distributed 64-bit floating-point numbers is 8 times higher than that of CAST_BLK and 5.8 times higher than that of PAIR_BLK. These two codes employ the PBBS three-stage strategy for performance, but they are designed to achieve high accuracy, both theoretically and in practice. A vectorization enhancement to these two scalar codes trades off a small amount of accuracy to match or outperform the PBBS code while still maintaining lower error.
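The three-stage blocked strategy mentioned above (per-block sums, a scan of the block sums, then a local scan of each block with its offset) can be sketched as follows. This is a generic serial illustration of the structure, not the CAST_BLK or PAIR_BLK code; a parallel implementation would run stages 1 and 3 across blocks concurrently.

```python
# Generic three-stage blocked (inclusive) prefix sum, illustrating the structure
# used by parallel prefix-sum codes such as those in PBBS. Stages 1 and 3 are
# independent across blocks; they are shown serially here for clarity.
import numpy as np

def blocked_prefix_sum(x, block_size=4):
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    blocks = [x[i:i + block_size] for i in range(0, len(x), block_size)]

    # Stage 1: per-block totals (parallel across blocks).
    block_sums = np.array([b.sum() for b in blocks])

    # Stage 2: exclusive scan of the block totals (small, done serially).
    offsets = np.concatenate(([0.0], np.cumsum(block_sums)[:-1]))

    # Stage 3: local inclusive scan of each block, shifted by its offset
    # (parallel across blocks).
    for i, b in enumerate(blocks):
        out[i * block_size : i * block_size + len(b)] = offsets[i] + np.cumsum(b)
    return out

x = np.random.rand(10)
assert np.allclose(blocked_prefix_sum(x), np.cumsum(x))
```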
Machine Learning Algorithm Performance on the Lucata Computer
Paul Springer (JPL)*
A new parallel computing paradigm has recently become available, one that combines a processor-in-memory (PIM) architecture with the use of many lightweight threads, where each thread migrates automatically to the memory it uses. Our effort focuses on producing performance gains on this architecture for a key machine learning algorithm, Random Forest, that are at least linear in the number of cores. Beyond that, we show that a data distribution that groups test samples and trees by feature improves run times by a factor of more than double the number of cores in the machine.

Automatic Mapping and Optimization to Kokkos with Polyhedral Compilation
Muthu Baskaran (Reservoir Labs)*; Charles Jin (Massachusetts Institute of Technology); Benoit Meister (Reservoir Labs); Jonathan Springer (Reservoir Labs)
In the post-Moore's Law era, the quest for exascale computing has resulted in diverse hardware architecture trends, including novel custom and/or specialized processors to accelerate systems, asynchronous or self-timed computing cores, and near-memory computing architectures. To contend with such heterogeneous and complex hardware targets, advanced software solutions have emerged in the form of new programming models and runtimes. However, using these advanced programming models poses productivity and performance-portability challenges. This work takes a significant step towards addressing the performance, productivity, and performance-portability challenges faced by the high-performance computing and exascale community. We present an automatic mapping and optimization framework that takes sequential code and automatically generates high-performance parallel code in Kokkos, a performance-portable parallel programming model targeted at exascale computing. We demonstrate the productivity and performance benefits of optimized mapping to Kokkos using kernels from a critical climate-modeling application, the Energy Exascale Earth System Model (E3SM) project. This work thus shows that automatic generation of Kokkos code enhances the productivity of application developers and enables them to fully utilize the benefits of a programming model such as Kokkos.

Implementing Sparse Linear Algebra Kernels on the Lucata Pathfinder-A Computer
Geraud Krawezik (Emu Technology)*; Shannon Kuntz (Emu Technology); Peter Kogge (University of Notre Dame)
We present the implementation of two sparse linear algebra kernels on a migratory memory-side processing architecture. The first is Sparse Matrix-Vector (SpMV) multiplication, and the second is the Symmetric Gauss-Seidel (SymGS) method. Both were chosen because they account for the largest share of the run time of the HPCG benchmark. We introduce the system used for the experiments, as well as its programming model and the key aspects of getting the most performance from it. We describe the data distribution used to enable an efficient parallelization of the algorithms, and their actual implementations. We then present hardware results and simulator traces to explain their behavior. We show almost linear strong scaling for the code, and discuss future work and improvements.
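For readers unfamiliar with the kernels named above, a compressed sparse row (CSR) matrix-vector multiply, the SpMV kernel at the heart of HPCG, looks like the following. This is a generic reference implementation, not the Lucata Pathfinder code, which distributes rows and vector elements across memory-side processors.

```python
# Generic CSR sparse matrix-vector multiply (SpMV), the kernel the Lucata work
# parallelizes. values/col_idx/row_ptr are the standard CSR arrays.
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    n_rows = len(row_ptr) - 1
    y = np.zeros(n_rows)
    for i in range(n_rows):
        # Accumulate the dot product of row i's nonzeros with x.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# 3x3 example:  [[2, 0, 1],
#                [0, 3, 0],
#                [4, 0, 5]]
values  = np.array([2.0, 1.0, 3.0, 4.0, 5.0])
col_idx = np.array([0, 2, 1, 0, 2])
row_ptr = np.array([0, 2, 3, 5])
x = np.array([1.0, 1.0, 1.0])
print(spmv_csr(values, col_idx, row_ptr, x))   # -> [3. 3. 9.]
```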
A Scalable Architecture for CNN Accelerators Leveraging High-Performance Memories
Maarten Hattink (Columbia University)*; Giuseppe Di Guglielmo (Columbia University); Luca Carloni (Columbia University); Keren Bergman (Columbia University)
As FPGA-based accelerators become ubiquitous and more powerful, the demand for integration with High-Performance Memory (HPM) grows. Although HPMs offer much greater bandwidth than standard DDR4 DRAM, they introduce new design challenges such as increased latency and a greater bandwidth mismatch between memory and FPGA cores. This paper presents a scalable architecture for convolutional neural network accelerators conceived specifically to address these challenges and make full use of the memory's high bandwidth. The accelerator, which was designed using high-level synthesis, is highly configurable. The intrinsic parallelism of its architecture allows near-perfect scaling until the available memory bandwidth is saturated.

1-4: Quantum & Novel Computing Session (15:45-17:00 EDT)

Invited Talk: The Need for Hardware-Accelerated Combinatorial Optimization
Dr. Jeffrey Chou (Sync Computing); Dr. Suraj Bramhavar (Sync Computing)
Abstract Not Available

Invited Talk: Advances in Algorithms for Near-Term Quantum Computers
Dr. Yudong Cao (Zapata Computing)
Abstract Not Available

Invited Talk: Post Quantum Cryptography (PQC) - An Overview
Manoj Kumar (IBM)*; Pratap Pattnaik (IBM)
We discuss the Post Quantum Cryptography algorithms for key establishment under consideration by NIST for standardization. Three of these, CRYSTALS-Kyber, Classic McEliece, and Supersingular Isogeny Key Encapsulation (SIKE), are representatives of the three classes of hard problems underlying the security of almost all 69 candidate algorithms accepted by NIST for consideration in round 1 of evaluation. For each algorithm, we briefly describe the hard problem underlying its cryptographic strength, the algebraic structure (i.e., the groups or finite fields) underlying the computations, the basic computations performed in the algorithm, the algorithm itself, and the performance considerations for efficient implementation of the basic algorithm on conventional many-core processors. For CRYSTALS-Kyber and SIKE, we discuss potential solutions to improve their performance on many-core processors.

Homomorphic Encryption for Quantum Annealing with Spin Reversal Transformations
Daniel O'Malley (Los Alamos National Laboratory)*; John Golden (Los Alamos National Laboratory)
Homomorphic encryption has been an area of study in classical computing for decades. The fundamental goal of homomorphic encryption is to enable an untrusted party (Oscar) to perform a computation for Alice without Oscar knowing either the input to the computation or its output. Alice encrypts the input before sending it to Oscar, and Oscar performs the computation directly on the encrypted data, producing an encrypted result. Oscar then sends the encrypted result of the computation back to Alice, who can decrypt it. We describe an approach to homomorphic encryption for quantum annealing based on spin reversal transformations and show that it comes with little or no performance penalty. This is in contrast to approaches to homomorphic encryption for classical computing, which incur a significant additional computational cost. This implies that the performance gap between quantum annealing and classical computing is reduced when both paradigms use homomorphic encryption. Further, homomorphic encryption is critical for quantum annealing because quantum annealers are native to the cloud: a third party (such as untrusted Oscar) performs the computation. If sensitive information, such as health-related data subject to the Health Insurance Portability and Accountability Act, is to be processed with quantum annealers, such a technique could be useful.
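The spin-reversal (gauge) transformation the scheme above builds on is easy to state: flipping a chosen subset of spins, together with matching sign changes of the local fields and couplers, leaves the Ising energy unchanged. The check below is a generic illustration of that identity, not the authors' encryption protocol.

```python
# Spin-reversal (gauge) transformation on a random Ising instance:
# s_i -> g_i s_i with g_i in {-1,+1}, h_i -> g_i h_i, J_ij -> g_i g_j J_ij
# leaves the energy E(s) = sum_i h_i s_i + sum_{i<j} J_ij s_i s_j unchanged.
import numpy as np

rng = np.random.default_rng(0)
n = 6
h = rng.normal(size=n)
J = np.triu(rng.normal(size=(n, n)), k=1)        # upper-triangular couplers
s = rng.choice([-1, 1], size=n)                  # a spin configuration
g = rng.choice([-1, 1], size=n)                  # random spin-reversal mask

def energy(h, J, s):
    return h @ s + s @ J @ s

h2 = g * h                      # transformed fields
J2 = np.outer(g, g) * J         # transformed couplers
s2 = g * s                      # transformed configuration

assert np.isclose(energy(h, J, s), energy(h2, J2, s2))
print("energies match:", energy(h, J, s))
```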
Constrained-optimization Approach Delivers Superior Classical Performance for Graph Partitioning via Quantum-ready Method
Steven Reinhardt (Quantum Computing Inc.)*; Uchenna Chukwu (Quantum Computing Inc.); Raouf Dridi (Quantum Computing Inc.); Jesse Berwald (Quantum Computing Inc.); Michael Booth (Quantum Computing Inc.); John Dawson (Quantum Computing Inc.); DeYung Le (Quantum Computing Inc.); Mark Wainger (Quantum Computing Inc.)
Graph partitioning is one of an important set of well-known compute-intense (NP-hard) graph problems that devolve to discrete constrained optimization. We sampled solutions to the problem via two different quantum-ready methods to understand the strengths and weaknesses of each method. First, we formulated and sampled the problem as a quadratic unconstrained binary optimization (QUBO) problem, via the best known QUBO formulation, using a best-in-class QUBO sampler running purely classically. Second, we formulated the problem at a higher level, as a set of constraints and an objective function, and sampled it with a recently developed constrained-optimization sampler (which generates QUBOs and also samples them classically). We find that both approaches often deliver better partitions than the purpose-built classical graph partitioners. Further, we find that the constrained-optimization approach is often able to deliver better partitions in less time than the bespoke-QUBO approach, without specific knowledge of the graph-partitioning problem. Stepping back from graph partitioning itself, one key controversial question is whether bespoke algorithms for high-value problems or general tools for a class of problems are more likely to deliver the power of QCs to a broad market of real-world users. These results bear on that question, though they use only a few instances and require confirmation on other problems and other instances, as well as replacement of the low-level sampler by a QC instead of a classical software sampler. Still, this early evidence supports the proposition that general tools may contribute significant benefit to a range of problems, expanding the impact of QCs on industry and society. The fact that this benefit is independent of the low-level sampler employed, whether classical software or one of a variety of QC architectures, reinforces the need for further work on high-level optimization. The commercial availability in the cloud of such software today, delivering superior classical performance for some problems, enables quantum-forward organizations to migrate to quantum-ready methods now, accelerating progress toward quantum advantage and ensuring sampler software evolves to address the problems such organizations value.
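As background for the "best known QUBO formulation" mentioned above, one standard Ising-style encoding of balanced graph bipartitioning (in the spirit of well-known textbook formulations, not necessarily the exact one used in the paper) is the following.

```latex
% Balanced bipartition of a graph G = (V, E): spin s_i = +1 or -1 places vertex i
% in one of the two parts. The first term penalizes imbalance (weight \alpha > 0),
% the second counts cut edges.
\[
  H(s) \;=\; \alpha \Bigl( \sum_{i \in V} s_i \Bigr)^{2}
        \;+\; \sum_{(i,j) \in E} \frac{1 - s_i s_j}{2}
\]
```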
1-S2: BRAIDS Special (17:30-19:30 EDT)

AI at the Tactical Edge
Alexia Schulz (MIT Lincoln Laboratory)*; Pierre Trepagnier (MIT Lincoln Laboratory)
Artificial intelligence (AI) and machine learning (ML) are assuming an increasingly prominent role in the modern armamentarium of the warfighter, ideally acting as an always-on-duty assistant. In this extended abstract we explore aspects of AI/ML that are particularly characteristic of its deployment at the tactical edge, by which we mean warfighters directly involved in executing the mission at the "tip of the spear." AI intrinsically depends on compute power and communications. At the tactical edge, both of these resources are generally in short supply, expensive to provision, and shared among contending needs at the best of times, let alone in more critical situations. Here we enumerate a number of possible applications of AI/ML at the tactical edge, characterizing them by features such as the compute power and data required, both at training time and at run time. From these illustrative examples we generalize a set of suitability characteristics for tactical-edge AI applications to best ensure that they contribute to warfighter resilience rather than fail at the time of greatest need.

Multi-Temporal Analysis and Scaling Relations of 100,000,000,000 Network Packets
Jeremy Kepner (MIT Lincoln Laboratory)*; Chad Meiners (MIT); Chansup Byun (MIT Lincoln Laboratory); Sarah McGuire (MIT); Timothy Davis (Texas A&M University); William Arcand (MIT); Jonathan Bernays (MIT); David Bestor (MIT); William Bergeron (MIT); Vijay Gadepally (MIT Lincoln Laboratory); Raul Harnasch (MIT); Matthew Hubbell (MIT Lincoln Laboratory); Michael Hurray (MIT); Michael Jones (MIT Lincoln Laboratory); Anna Klein (MIT); Lauren Milechin (MIT); Julie Mullen (MIT Lincoln Laboratory); Andrew Prout (MIT); Albert Reuther (MIT Lincoln Laboratory); Antonio Rosa (MIT); Siddharth Samsi (MIT Lincoln Laboratory); Doug Stetson (MIT); Adam Tse (MIT); Peter Michaleas (MIT Lincoln Laboratory)
Our society has never been more dependent on computer networks. Effective utilization of networks requires a detailed understanding of the normal background behaviors of network traffic. Large-scale measurements of networks are computationally challenging.
Building on prior work in interactive supercomputing and GraphBLAS hypersparse hierarchical traffic matrices, we have developed an efficient method for computing a wide variety of streaming network quantities on diverse time scales. Applying these methods to 100,000,000,000 anonymized source-destination pairs collected at a network gateway reveals many previously unobserved scaling relationships. These observations provide new insights into normal network background traffic that could be used for anomaly detection, AI feature engineering, and testing theoretical models of streaming networks.
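A toy illustration of the kind of source-destination traffic matrix underlying this analysis follows; it is a generic sparse-matrix sketch with invented packet data, not the GraphBLAS hypersparse pipeline used in the paper, where each packet increments one (source, destination) entry and simple network quantities fall out of row and column reductions.

```python
# Toy source-destination traffic matrix built from a packet stream, illustrating
# the kind of quantities computed at scale with hypersparse matrices. The packet
# list here is invented example data.
import numpy as np
from scipy.sparse import coo_matrix

# (source, destination) pairs for a handful of "packets" over a time window.
packets = [(1, 7), (1, 7), (2, 7), (1, 3), (5, 3), (2, 7), (1, 7)]
src, dst = np.array(packets).T
n = 8                                   # size of the (anonymized) address space
A = coo_matrix((np.ones(len(packets)), (src, dst)), shape=(n, n)).tocsr()

print("total packets      :", A.sum())
print("unique links       :", A.nnz)
print("max source fan-out :", (A > 0).sum(axis=1).max())   # distinct destinations per source
print("max dest fan-in    :", (A > 0).sum(axis=0).max())   # distinct sources per destination
```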