2019 IEEE High Performance Extreme Computing Conference (HPEC '19)
Twenty-third Annual HPEC Conference
24 - 26 September 2019
Westin Hotel, Waltham, MA USA
Wednesday, September 25, 2019

Cloud
3:00-4:40 in Eden Vale C1
Chair: Douglass Enright / Aerospace

[Best Paper Finalist] Using Container Migration for HPC Workloads Resilience
Mohamad Sindi, John R. Williams (MIT)
We share experiences in implementing a container-based HPC environment that could help sustain running HPC workloads on clusters. By running workloads inside containers, we are able to migrate them from cluster nodes anticipating hardware problems to healthy nodes while the workloads are running. Migration is done using the CRIU tool with no application modification. No major interruption or overhead is introduced to the workload. Various real HPC applications are tested. Tests are done with different hardware node specs, network interconnects, and MPI implementations. We also benchmark the applications on containers and compare performance to native. Results demonstrate successful migration of HPC workloads inside containers with minimal interruption, while maintaining the integrity of the results produced. We provide several YouTube videos demonstrating the migration tests. Benchmarks also show that application performance on containers is close to native. We discuss some of the challenges faced during implementation and the solutions adopted. To the best of our knowledge, this work is the first to demonstrate successful migration of real MPI-based HPC workloads using CRIU and containers.

Singularity for Machine Learning Applications - Analysis of Performance Impact
Bruce R. Jordan Jr., David Barrett, David Burke, Patrick Jardin, Amelia Littrell, Paul Monticciolo, Michael Newey, Jean Piou, Kara Warner (MIT-LL)
Software deployments in general, and deep learning applications in particular, suffer from difficulty in reproducing results. The use of containers to mitigate these issues is becoming a common practice. Singularity is a container technology which targets the unique issues present in High Performance Computing (HPC) Centers.
This paper characterizes the impact of using Singularity for both Training and Inference on deep learning applications.

Introducing DyMonDS-as-a-Service (DyMaaS) for Internet of Things
Marija Ilic, Rupamathi Jaddivada (MIT)
With recent trends in computation and communication architecture, it is becoming possible to simulate complex networked dynamical systems by employing high-fidelity models. The inherent spatial and temporal complexity of these systems, however, still acts as a roadblock. It is thus desirable to have an adaptive platform design facilitating zooming-in and out of the models to emulate time-evolution of processes at a desired spatial and temporal granularity. In this paper, we propose new computing and networking abstractions that can embrace physical dynamics and computations in a unified manner, by taking advantage of the inherent structure. We further design multi-rate numerical methods that can be implemented by computing architectures to facilitate adaptive zooming-in and out of the models spanning multiple spatial and temporal layers. These methods are all embedded in a platform called Dynamic Monitoring and Decision Systems (DyMonDS). We introduce a new service model of cloud computing called DyMonDS-as-a-Service (DyMaaS), for use by operators at different spatial granularities to efficiently emulate the interconnection of IoT devices. The usage of this platform is described in the context of an electric microgrid system emulation.

[Best Student Paper Finalist] Design and Implementation of Knowledge Base for Runtime Management of Software Defined Hardware
Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, Viktor Prasanna (USC)
Runtime-reconfigurable software coupled with reconfigurable hardware is highly desirable as a means towards maximizing runtime efficiency without compromising programmability. Compilers for such software systems are extremely difficult to design as they must leverage different types of hardware at runtime.
To address the need for static and dynamic compiler optimization of workflows matched to dynamically reconfigurable hardware, we propose a novel design of the central component of a dynamic software compiler for software defined hardware. Our comprehensive design focuses not just on static knowledge but also on semi-supervised extraction of knowledge from program executions and developing their performance models. Specifically, our novel dynamic and extensible knowledge base 1) continuously gathers knowledge during execution of workflows and 2) identifies optimal implementations of workflows on optimal (available) hardware configurations. It plays a hub role in storing information from, and providing information to, other components of the compiler, as well as the human analyst. Through a rich tripartite graph representation, the knowledge base captures and learns extensive information on decomposition and mapping of code steps to kernels and mapping of kernels to available hardware configurations. The knowledge base is implemented using the C++ Boost Library and is capable of quickly processing offline and online queries and updates. We show that our knowledge base can answer queries in 1 ms regardless of the number of workflows it stores. To the best of our knowledge, this is the first design of a dynamic and extensible knowledge base to support compilation of high-level languages to leverage arbitrary reconfigurable platforms.

COMET: A Distributed Metadata Service for Federated Cloud Infrastructures
Cong Wang, Komal Thareja, Michael Stealey, Paul Ruth, Ilya Baldin (RENCI, Univ. North Carolina)
The majority of today's cloud services are independently operated by individual cloud service providers. In this approach, the locations of cloud resources are strictly constrained by the distribution of cloud service providers' sites.
As the popularity and scale of cloud services increase, we believe this traditional paradigm is about to change toward further federated services, a.k.a. multi-cloud, due to the improved performance, the reduced cost of compute, storage and network resources, as well as increased user demands. In this paper, we present COMET, a lightweight, distributed storage system for managing metadata on large-scale, federated cloud infrastructure providers, end users, and their applications. We use two use cases from NSF's ExoGENI and Chameleon research cloud testbeds to show the effectiveness of the COMET design and deployment.