Friday, September 24

5-V: Sponsor Spotlight Session (10:30-11:00)
Session Chair(s): Albert Reuther

5-1: Big Data & Distributed Computing 1 Session (11:00-12:15)
Session Co-Chairs: Ken Cain & Rich Vuduc

Invited Talk: The Open Cloud Testbed: A resource for FPGA and Cloud Researchers
Prof. Miriam Leeser (Northeastern Univ.) & Prof. Martin Herbordt (Boston Univ.)

Workload Imbalance in HPC Applications: Effect on Performance of In-Network Processing
Pouya Haghi (Boston University)*; Anqi Guo (Boston University); Tong Geng (Pacific Northwest National Laboratory); Anthony Skjellum (UTC); Martin Herbordt (Boston University)
As HPC systems advance to exascale, communication networks are becoming ever more complex, including, e.g., support for in-network processing. While critical in facilitating scalability, this network complexity is rendered ineffectual when there is workload imbalance. The problem we address here is to measure and characterize workload imbalance, and to do so in a way that is useful in network design. We characterize five proxy applications where in-network processing is likely to be effective. Our results reveal that, on average, 45% of the total execution time of these applications is wasted due to workload imbalance and other types of performance variability when running on the Stampede2 compute cluster with up to 3072 processes.

Distributed and Heterogeneous SAR Backprojection with Halide
Connor Imes (Information Sciences Institute, USC)*; Tzu-Mao Li (University of California, San Diego); Mark Glines (Extreme Scale Solutions); Rishi Khan (Extreme Scale Solutions); John Paul Walters (Information Sciences Institute, USC)
Writing efficient, scalable, and portable HPC synthetic aperture radar (SAR) applications is increasingly challenging due to the growing diversity and heterogeneity of distributed systems.
Considerable developer and computational resources are often spent porting applications to new HPC platforms and architectures, which is both time consuming and expensive. Domain-specific languages have been shown to be highly productive in terms of development effort, but additionally achieving both scalable computational efficiency and platform portability remains challenging. The Halide programming language is both productive and efficient for dense data processing, supports common CPU architectures and heterogeneous resources like GPUs, and has previously been extended for distributed processing. We propose to use a distributed Halide implementation for scalable and heterogeneous HPC SAR processing. We implement a backprojection algorithm for SAR image reconstruction and demonstrate scalability on the OLCF Summit supercomputer up to 1,024 compute nodes (43,008 cores, each with 4 hardware threads) with a large 32,768×32,768 dataset, and up to 8 distributed GPUs with an 8,192×8,192 dataset. Our results show excellent scaling and portability to heterogeneous resources, and motivate additional improvements in Halide to better support distributed high-performance signal processing.

Low-Communication Asynchronous Distributed Generalized Canonical Polyadic Tensor Decomposition
Cannada A. Lewis (Sandia National Laboratories)*; Eric Phipps (Sandia National Laboratories)
In this work, we show that reduced-communication algorithms for distributed stochastic gradient descent improve the time per epoch and strong scaling of the Generalized Canonical Polyadic (GCP) tensor decomposition, but at a cost: achieving convergence becomes more difficult. The implementation, based on MPI, shows that while one-sided algorithms offer a path to asynchronous execution, the performance benefits of an optimized allreduce are difficult to best.
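The communication pattern at issue can be illustrated with a toy synchronous SGD loop, in which an allreduce-style average of all worker gradients precedes every shared update. The sizes, learning rate, and least-squares objective below are illustrative assumptions, not the authors' MPI-based GCP implementation:

```python
import numpy as np

# Toy illustration of synchronous distributed SGD: every worker's gradient
# is averaged (an allreduce) before each shared update. Illustrative only.

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 8))
x_true = rng.standard_normal(8)
b = A @ x_true  # synthetic targets so the optimum is known

def worker_grad(x, rows):
    # Gradient of the local least-squares loss on this worker's data shard
    Ai, bi = A[rows], b[rows]
    return 2.0 * Ai.T @ (Ai @ x - bi) / len(rows)

def sync_sgd(workers=4, steps=300, lr=0.05):
    x = np.zeros(8)
    shards = np.array_split(np.arange(len(b)), workers)
    for _ in range(steps):
        # The allreduce step: all worker gradients are averaged before the
        # update; this is the synchronization whose cost the paper targets.
        g = np.mean([worker_grad(x, s) for s in shards], axis=0)
        x -= lr * g
    return x

print(np.linalg.norm(sync_sgd() - x_true) < 1e-2)  # prints True
```

An asynchronous, one-sided variant would let each worker apply its shard gradient without waiting for the global average, trading the allreduce cost for stale updates, which is exactly where the convergence difficulty noted in the abstract arises.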
3D Real-Time Supercomputer Monitoring
William Bergeron (MIT)*; Jeremy Kepner (MIT Lincoln Laboratory); Matthew Hubbell (MIT Lincoln Laboratory); Dylan Sequeira (MIT); Winter Williams (UMASS/Amherst)
Supercomputers are complex systems producing vast quantities of performance data from multiple sources and of varying types. Performance data from each of the thousands of nodes in a supercomputer tracks multiple forms of storage, memory, networks, processors, and accelerators. Optimization of application performance is critical for cost-effective usage of a supercomputer and requires efficient methods for effectively viewing performance data. The combination of supercomputing analytics and 3D gaming visualization enables real-time processing and visual display of massive amounts of information that humans can process quickly with little training. Our system fully utilizes the capabilities of modern 3D gaming environments to create novel representations of computing hardware which intuitively represent the physical attributes of the supercomputer while displaying real-time alerts and component utilization. This system allows operators to quickly assess how the supercomputer is being used, gives users visibility into the resources they are consuming, and provides instructors new ways to interactively teach the computing architecture concepts necessary for efficient computing.

5-2: Big Data & Distributed Computing 2 Session (12:30-13:45)
Session Co-Chairs: David Cousins & Plamen Krastev

Invited Talk: DBOS: Database Operating System
Prof. Mike Stonebraker (MIT CSAIL)

WASP: A Wearable Super-Computing Platform for Distributed Intelligence in Multi-Agent Systems
Chinmaya Patnayak (Virginia Tech)*; James E McClure (Virginia Tech); Ryan Williams (Virginia Tech)
Autonomous unmanned aerial vehicle (UAV) systems have broad applications in surveillance, disaster management, and search and rescue (SAR) operations. Field deployments of intelligent multi-UAV systems are heavily constrained by available power and networking capabilities, and by the limited computational resources needed to reduce large volumes of on-board sensor data in real time. In this work, we design a WearAble Super-Computing Platform (WASP) to address such challenges associated with multi-UAV deployments in remote field environments, based on a human-in-the-loop (HITL) design. The WASP system is an advanced edge computing instrument built from commodity embedded processing devices interconnected through an on-board Ethernet network. Networking is further extended through wireless capabilities to communicate with UAVs. Computational workloads and storage are orchestrated as discrete containers across WASP and the UAVs, which accounts for processor heterogeneity and time-varying workloads that must adapt dynamically to unpredictable failures of wireless networking in the field. We use our prototype to demonstrate advantages in terms of power management, redundancy, robustness, and human-robot collaboration in challenging field environments.

A Survey and Taxonomy of Blockchain-based Payment Channel Networks
Haleh Khojasteh (Bridgewater State University)*; Hirad Tabatabaei (University of Massachusetts Amherst)
Blockchain technology has the potential to become mainstream in many enterprises, including finance, personal identity security, and real-time operating systems.
The usage of public, or permissionless, blockchains, in which any participant has the option of joining and leaving at any moment, is growing rapidly. However, the decentralization of permissionless blockchains comes with significant costs, one of which is limited scalability: the transaction loads managed by blockchains are markedly lower than those managed by traditional financial systems. Among many proposals to address the scaling issue, one of the most promising solutions is the Payment Channel Network (PCN), which settles transactions off-chain with minimal involvement of expensive on-chain blockchain operations. This work explores the existing PCN solutions, their challenges, and the suggested improvements to PCNs, especially routing mechanisms.

Detection of Multiple Crop Diseases using Image Processing Techniques
Akanksha Soni (University Institute of Technology)*
Crops are the heart of the agricultural field, but various diseases can arise on a plant, reducing both productivity and product quality. Manual health monitoring and identification of plant pathogens is a laborious and time-consuming task; because farmers are often unable to identify diseases themselves, they require outside expertise, which increases the cost of farming. Hence, the proposed idea illustrates a competent approach to automatic recognition of crop diseases at an early stage. Automatic identification is extremely helpful for reducing the screening workload in large crop farms. The main goal of our model is to recognize the leaf diseases of cucumber, guava, groundnut, pumpkin, and cowpea efficiently, with very little processing time and robust output, helping to save manpower, money, and precious time.
The proposed novel approach follows these steps: pre-processing, thresholding, Sobel filtering, morphological dilation, and ROI extraction. All processing is performed in MATLAB, achieving 98.27% accuracy when tested on 406 healthy and diseased crop leaf images.

Efficiently Building a Large Scale Dataset for Program Induction
Lauren Milechin (MIT); Javier Lopez-Contreras (MIT); Ferran Alet (MIT)*
One of the most promising applications of machine learning is speeding up expensive computational processes, i.e., learning to think. To do so, one first generates a large-scale dataset via a compute-intensive process and then trains a model to approximate the distribution. High performance computing (HPC) is a perfect fit for these processes, as one may efficiently deploy large amounts of computation to generate a dataset in a reasonable amount of time, and then learn a computationally efficient solution. Here, we focus on generating a program synthesis dataset. Finding the program that fits a given input-output specification is very expensive, but generating the input-output pairs for a given program is a well-defined process. In this work, we show how we efficiently ran hundreds of thousands of C++ programs line-by-line and used intermediate variable states to generate a large-scale program synthesis dataset.

5-3: High Performance & Secure Hardware 1 Session (14:15-15:30)
Session Co-Chairs: Frank Pietryka & Michael Vai

HARDROID: Transparent Integration of Crypto Accelerators in Android
Luca Piccolboni (Columbia University)*; Giuseppe Di Guglielmo (Columbia University); Simha Sethumadhavan (Columbia University); Luca Carloni (Columbia University)
Accelerators have become fundamental building blocks of any modern architecture. Accelerators are often deployed on a platform by evaluating performance and energy consumption, while assuming that the software applications can be modified to invoke them. In some contexts, however, this is impractical.
For instance, in an Android-based platform, changing the applications to invoke an accelerator can affect their portability. We present Hardroid, a heterogeneous platform that allows an Android application to offload tasks to loosely-coupled accelerators on an FPGA in a transparent way, i.e., without modifying the code of the application. To demonstrate Hardroid's capabilities, we design four accelerators for cryptography with high-level synthesis (HLS) and compare their efficiency with two cryptography libraries, executing 29 Android applications. While we use FPGAs to implement and evaluate Hardroid, our accelerators are designed so that they can be integrated in a system-on-chip (SoC), and we also report their energy efficiency for an ASIC implementation. The experimental results show that Hardroid is an effective platform for evaluating the costs and benefits of integrating accelerators when they are called by real-world Android applications. We show that invoking accelerators without modifying the code of the applications can affect the energy efficiency of the accelerators.
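The cost-benefit question evaluated above can be framed with a back-of-the-envelope offload model: offloading helps only when the accelerator's compute savings exceed the data-movement and invocation overheads. All rates and costs below are illustrative assumptions, not Hardroid's measured results:

```python
# Simple offload break-even model: compare CPU-only time against
# accelerator time plus link transfer plus a fixed invocation cost.
# Illustrative numbers only, not measurements from the paper.

def offload_speedup(nbytes, cpu_gbps=1.0, acc_gbps=10.0,
                    link_gbps=2.0, invoke_us=50.0):
    """Ratio of CPU-only time to offloaded time for one task."""
    cpu_time = nbytes / (cpu_gbps * 1e9)
    acc_time = (nbytes / (acc_gbps * 1e9)     # compute on the accelerator
                + nbytes / (link_gbps * 1e9)  # move data over the link
                + invoke_us * 1e-6)           # fixed driver/invocation cost
    return cpu_time / acc_time

# Small buffers are dominated by the fixed invocation cost; large ones win.
print(offload_speedup(4 * 1024) < 1.0)          # prints True
print(offload_speedup(16 * 1024 * 1024) > 1.0)  # prints True
```

The same arithmetic explains the abstract's closing observation: a transparent invocation path adds fixed per-call overhead, which erodes efficiency for small workloads even when the accelerator itself is fast.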
Spatial Temporal Analysis of 40,000,000,000,000 Internet Darkspace Packets
Jeremy Kepner (MIT Lincoln Laboratory)*; Michael S Jones (MIT Lincoln Laboratory); Daniel Andersen (CAIDA); Aydin Buluc (Lawrence Berkeley National Laboratory); Chansup Byun (MIT Lincoln Laboratory); kc claffy (CAIDA/UC San Diego); Timothy A Davis (Texas A&M University); William Arcand (MIT); Jonathan Bernays (MIT); David Bestor (MIT); William Bergeron (MIT); Vijay Gadepally (MIT Lincoln Laboratory); Michael Hurray (MIT); Matthew Hubbell (MIT Lincoln Laboratory); Anna Klein (MIT); Chad Meiners (MIT); Lauren Milechin (MIT); Julie S Mullen (MIT Lincoln Laboratory); Sandeep Pisharody (MIT Lincoln Laboratory); Andrew Prout (MIT); Albert Reuther (MIT Lincoln Laboratory); Antonio Rosa (MIT); Siddharth Samsi (MIT Lincoln Laboratory); Doug Stetson (MIT); Adam Tse (MIT); Charles Yee (MIT Lincoln Laboratory); Peter Michaleas (MIT Lincoln Laboratory)
The Internet has never been more important to our society, and understanding its behavior is essential. The Center for Applied Internet Data Analysis (CAIDA) Telescope observes a continuous stream of packets from an unsolicited darkspace representing 1/256 of the Internet. During 2019 and 2020, over 40,000,000,000,000 unique packets were collected, representing the largest public corpus of Internet traffic ever assembled. Using the combined resources of the supercomputing centers at UC San Diego, Lawrence Berkeley National Laboratory, and MIT, the spatial temporal structure of anonymized source-destination pairs from the CAIDA Telescope data has been analyzed with GraphBLAS hierarchical hypersparse matrices. These analyses provide unique insight into this unsolicited Internet darkspace traffic, with the discovery of many previously unseen scaling relations.
The data show a significant sustained increase in unsolicited traffic corresponding to the start of the COVID-19 pandemic, but relatively little change in the underlying scaling relations associated with unique sources, source fan-outs, unique links, destination fan-ins, and unique destinations. This work demonstrates the practical feasibility and benefit of the safe collection and analysis of significant quantities of anonymized Internet traffic.

Exploring the Tradeoff Between Reliability and Performance in HPC Systems
Craig Walker (Coastal Carolina University); Braeden Slade (Los Alamos National Laboratory); Gavin Bailey (Coastal Carolina University); Nicklaus Przybylski (Coastal Carolina University); William M Jones (Coastal Carolina University)*; Nathan DeBardeleben (Los Alamos National Laboratory)
Evaluating the trade-off space between performance and reliability is important for data center operators as part of their supercomputer procurement, planning, and acceptance testing. While some simple systems can be modeled with tractable analytic methods, a simulation-based approach is required to capture with fidelity the interaction among factors such as individual component reliability, processor speeds, checkpointing behaviors and effects, workload characteristics (capacity versus capability), and scheduling policies, to name a few. This paper extends Batsim, a flexible and modular batch scheduling and cluster simulation framework built on top of the grid simulation framework SimGrid. These extensions add a fault model and a simulated job checkpoint-restart capability. The enhancements are detailed, and experiments are performed using this new capability on varying workloads, both synthetic and real, to evaluate impacts on prospective HPC systems with varying levels of per-node reliability.
A basic analytical model of this trade-off is constructed and contrasted with the experimental results to illustrate the utility of the simulator. It is shown that the toolkit enables one to see not only how much worse a workload will perform on a prospective, less reliable system, but also how much larger that system would have to be to achieve the same makespan. This can be used to evaluate tradeoffs and is vitally useful in system procurement. The value of the simulation environment is further demonstrated with a complex, real-world workload that performs differently than analytically expected.

Towards Combining Compression and Cryptography for Scientific Data
Ruiwen Shan (Clemson University)*; Sheng Di (Argonne National Laboratory, Lemont, IL); Jon C Calhoun (Clemson University); Franck Cappello (Argonne National Laboratory, Lemont, IL)
In the scientific domain, extremely large amounts of data are generated by large-scale high performance computing (HPC) simulations. Storing and sending such vast volumes of data poses serious scalability and performance issues, which can be considerably mitigated by data compression techniques that significantly reduce storage size and data movement burdens. Since scientific data are shared by scientists more and more frequently, data security methods that ensure the confidentiality, integrity, and availability of the data itself are becoming increasingly important. As such, combining compression and encryption is critical to storing large-scale datasets securely. In this work, we explore how to integrate data compression and cryptography techniques as efficiently as possible for big scientific datasets in the HPC field. We perform thorough experiments using different scientific datasets with the state-of-the-art error-bounded lossy compressor, SZ, in a real-world supercomputing environment.
Experiments verify that performing encryption before lossy compression (the encr-cmpr method) can invalidate the advantage of compression algorithms. By contrast, executing encryption after lossy compression (the cmpr-encr method) preserves not only high compression ratios but also high overall execution speed. Experiments also verify that the encryption overhead under the cmpr-encr method decreases with increasing compression ratios, indicating very good scalability.

Design of Asynchronous Polymorphic Logic Gates for Hardware Security
Chandler Bernard (University of Arkansas); William Bryant (University of Arkansas); Richard Becker (University of Arkansas); Jia Di (University of Arkansas)*
Polymorphic circuits are circuits that perform two or more different functions under varying operating conditions. This paper presents dual-function asynchronous polymorphic logic gates with functionalities selected by a change in supply voltage. Unlike previous polymorphic gates generated using an evolutionary approach for use in synchronous circuits, these asynchronous polymorphic gates are designed with a deliberate methodology that has been applied across multiple process nodes and verified in silicon using the TSMC 65nm process. The asynchronous polymorphic gates were designed to achieve aspects of "hidden" reconfigurability without the typical area and security drawbacks of field-programmable gate array (FPGA) hardware implementations.

5-4: High Performance & Secure Hardware 2 Session (15:45-17:00)
Session Co-Chairs: Frank Pietryka & Michael Vai

Reconfigurable Hardware Root-of-Trust for Secure Edge Processing
Karen Gettings (MIT Lincoln Laboratory)*; Alan Ehret (Texas A&M University); Michel Kinsy (Texas A&M University)
In this work, we introduce key security primitives for secure edge processing based on a reconfigurable hardware Root-of-Trust. We present a reference architecture, named RECORD SoC, that makes use of these security primitives.
These modules can be configured to support a variety of security features, including isolated firmware, I/O access policies, and digital signature verification of an initially untrusted application. We demonstrate that a hardware root-of-trust can be implemented flexibly and efficiently for an edge system vulnerable to physical access-based attacks, requiring only a 16.8% area overhead. Except for a one-time application verification at startup, the security features we examine represent only 0.08% of the latency required to process a sample of sensor data.

A Novel Approach to Cyber Situational Awareness in Embedded Systems
Kyle W Denney (MIT Lincoln Laboratory)*; Robert Lychev (MIT Lincoln Laboratory); Michael M Vai (MIT Lincoln Laboratory); Alice Lee (MIT Lincoln Laboratory); Donato Kava (MIT Lincoln Laboratory); Nicholas Evancich (Trusted S&T); Richard Clark (Trusted S&T); David Lide (Trusted S&T); KJ Kwak (Trusted S&T); Jason Li (Siege Technologies); Michael Lynch (Air Force Research Laboratory); Kyle Tillotson (Air Force Research Laboratory); Walt Tirenin (Air Force Research Laboratory); Douglas Schafer (Cohere Technology)
The impacts of cyberattacks on the mission capability of an embedded system, and their mitigations, vary according to the system's application and constituents. To respond effectively to a cyberattack, the operator must be given proper context for how that attack affects mission capability and performance. In this paper, we describe a novel approach to developing and visualizing cyber situational awareness on embedded systems. The approach uses a mission decomposition model to derive component-level dependencies on mission functionality. This allows us to calculate a mission capability metric that indicates the relative ability to perform mission functionality while under cyber duress in operation.
Our approach provides the operator with traceable, concise information about system components, how they are affected by cyberattacks, and the impact on mission capability, allowing the operator to make informed decisions and take timely courses of action.

Using Monitoring Data to Improve HPC Performance via Network-Data-Driven Allocation
Yijia Zhang (Boston University)*; Burak Aksar (Boston University); Omar Aaziz (Sandia National Laboratories); Benjamin Schwaller (Sandia National Laboratories); Jim Brandt (Sandia National Laboratories); Vitus J Leung (Sandia National Laboratories); Manuel Egele (Boston University); Ayse K Coskun (Boston University)
On high-performance computing (HPC) systems, job allocation strategies control the placement of a job among available nodes. Because placement changes a job's communication performance, allocation can significantly affect the execution times of many HPC applications. Existing allocation strategies typically make decisions based on resource limits, network topology, communication patterns, etc. However, system network performance at runtime is seldom consulted during allocation, even though it significantly affects job execution times. In this work, we demonstrate the use of monitoring data to improve HPC system performance by proposing a Network-Data-Driven (NeDD) job allocation framework, which monitors the network performance of an HPC system at runtime and allocates resources based on both network performance and job characteristics. NeDD characterizes system network performance by collecting network traffic statistics on each router link, and it characterizes a job's sensitivity to network congestion by collecting Message Passing Interface (MPI) statistics. During allocation, NeDD pairs network-sensitive (network-insensitive) jobs with nodes whose parent routers have low (high) network traffic.
Through experiments on a large HPC system, we demonstrate that NeDD reduces the execution time of parallel applications by 11% on average and by up to 34%.

Timing-based side-channel attack and mitigation on PCIe connected distributed embedded systems
Salman Khaliq (University of Connecticut); Usman Ali (University of Connecticut); Omer Khan (University of Connecticut)*
PCIe-connected peripheral devices are increasingly deployed in distributed embedded systems. For example, a GPU accelerator connected to a host CPU via a PCIe interconnect brings massive performance improvements for artificial intelligence applications. These peripheral devices benefit from the shared memory of the host CPU for performance gains, but sharing the host CPU's resources brings security challenges. The shared PCIe interconnect hardware of the host CPU can be exploited to create a timing-based information leakage side-channel between multiple connected peripheral devices that are isolated at the software level. This paper proposes an attack setup consisting of GPU and FPGA peripheral devices accessing their data from the host CPU's main memory. Both covert communication and information leakage attacks are demonstrated at a throughput of 13.02 kbps. A temporal isolation-based mitigation scheme is proposed that uses time-division multiplexing between the peripheral devices to mitigate the attacks. The paper primarily focuses on demonstrating the security context of the proposed attack and mitigation.
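The temporal-isolation idea behind the mitigation can be sketched as a time-division-multiplexed arbiter: slot ownership is a pure function of time, never of demand, so one device's traffic cannot modulate the latency another device observes. Device names and slot sizes below are hypothetical, not the paper's hardware scheme:

```python
# Minimal sketch of TDM arbitration for a shared interconnect: the arbiter
# grants the link to exactly one device per fixed time slot, in round-robin
# order, independent of what any device requests. Illustrative only.

class TDMArbiter:
    def __init__(self, devices, slot_ns=1000):
        self.devices = list(devices)
        self.slot_ns = slot_ns

    def owner(self, time_ns):
        # Slot ownership depends only on wall-clock time, never on demand,
        # which removes the contention signal the covert channel relies on.
        idx = (time_ns // self.slot_ns) % len(self.devices)
        return self.devices[idx]

arb = TDMArbiter(["gpu", "fpga"])
print([arb.owner(t) for t in (0, 500, 1000, 2500)])
# prints ['gpu', 'gpu', 'fpga', 'gpu']
```

The security-versus-throughput trade-off is visible even in this sketch: a device with pending traffic must idle through slots it does not own, which is the performance cost any temporal-isolation mitigation pays.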
2021 IEEE High Performance Extreme Computing Virtual Conference 20 - 24 September 2021
2021 Abstract Book