2019 IEEE High Performance
Extreme Computing Conference
(HPEC ’19)
Twenty-third Annual HPEC Conference
24 - 26 September 2019
Westin Hotel, Waltham, MA USA
Thursday, September 26, 2019
Cloud
3:00-4:40 in Eden Vale C1/C2
Chair: Douglass Enright / Aerospace
[Best Paper Finalist] Using Container Migration for HPC Workloads Resilience
Mohamad Sindi, John R. Williams (MIT)
We share experiences in implementing a container-based HPC environment that could help sustain running HPC workloads on clusters. By running
workloads inside containers, we are able to migrate them from cluster nodes where hardware problems are anticipated to healthy nodes while the workloads
are running. Migration is done using the CRIU tool with no application modification. No major interruption or overhead is introduced to the workload.
Various real HPC applications are tested. Tests are done with different hardware node specs, network interconnects, and MPI implementations. We
also benchmark the applications on containers and compare performance to native. Results demonstrate successful migration of HPC workloads
inside containers with minimal interruption, while maintaining the integrity of the results produced. We provide several YouTube videos
demonstrating the migration tests. Benchmarks also show that application performance on containers is close to native. We discuss some of the
challenges faced during implementation and solutions adopted. To the best of our knowledge, this work is the first to demonstrate
successful migration of real MPI-based HPC workloads using CRIU and containers.
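For readers unfamiliar with CRIU, a generic checkpoint, copy, and restore cycle can be sketched as below. This is only an illustration under stated assumptions, not the authors' procedure: the abstract does not list the commands used, the paper migrates whole containers rather than a bare process tree, and the function name, the use of rsync and ssh, and the shared image directory path are all hypothetical. The helper only builds the command lines and does not execute anything.

```python
def criu_migration_commands(pid, image_dir, dest_host):
    """Build (but do not run) the three commands of a simple CRIU migration.

    Assumptions for illustration only: the process tree is rooted at `pid`,
    the checkpoint images are copied with rsync, and the restore is started
    on `dest_host` over ssh using the same directory path.
    """
    # Checkpoint the process tree into image_dir on the source node.
    dump = ["criu", "dump", "--tree", str(pid),
            "--images-dir", image_dir, "--shell-job"]
    # Copy the checkpoint images to the destination node.
    copy = ["rsync", "-a", image_dir + "/", f"{dest_host}:{image_dir}/"]
    # Restore the process tree from the copied images on the destination.
    restore = ["ssh", dest_host, "criu", "restore",
               "--images-dir", image_dir, "--shell-job"]
    return dump, copy, restore


if __name__ == "__main__":
    for cmd in criu_migration_commands(4242, "/tmp/ckpt", "node02"):
        print(" ".join(cmd))
```

A real MPI workload with open network connections would need additional handling that this sketch leaves out.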
[Best Student Paper Finalist] Design and Implementation of Knowledge Base for Runtime Management of Software Defined Hardware
Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, Viktor Prasanna (USC)
Runtime-reconfigurable software coupled with reconfigurable hardware is highly desirable as a means towards maximizing runtime efficiency
without compromising programmability. Compilers for such software systems are extremely difficult to design as they must leverage different types
of hardware at runtime. To address the need for static and dynamic compiler optimization of workflows matched to dynamically reconfigurable
hardware, we propose a novel design of the central component of a dynamic software compiler for software defined hardware. Our comprehensive
design focuses not just on static knowledge but also on semi-supervised extraction of knowledge from program executions and developing their
performance models. Specifically, our novel dynamic and extensible knowledge base 1) continuously gathers knowledge during execution of
workflows and 2) identifies optimal implementations of workflows on optimal (available) hardware configurations. It plays a hub role in storing
information from, and providing information to, other components of the compiler, as well as the human analyst. Through a rich tripartite graph
representation, the knowledge base captures and learns extensive information on decomposition and mapping of code steps to kernels and
mapping of kernels to available hardware configurations. The knowledge base is implemented using the C++ Boost Library and is capable of
quickly processing offline and online queries and updates. We show that our knowledge base can answer queries in 1 ms regardless of the number
of workflows it stores. To the best of our knowledge, this is the first design of a dynamic and extensible knowledge base to support compilation of
high-level languages to leverage arbitrary reconfigurable platforms.
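The tripartite structure described in this abstract (code steps, kernels, hardware configurations) can be illustrated with a small sketch. This is a toy model under stated assumptions, not the authors' C++ Boost implementation: the class name, method names, and the choice of "lowest observed runtime" as the optimality criterion are all hypothetical.

```python
from collections import defaultdict


class KnowledgeBase:
    """Toy tripartite graph: code steps -> kernels -> hardware configurations."""

    def __init__(self):
        # Edges from a code step to the kernels that can implement it.
        self.step_to_kernels = defaultdict(set)
        # Edges from a kernel to hardware configs, weighted by best runtime (s).
        self.kernel_to_configs = defaultdict(dict)

    def add_mapping(self, step, kernel):
        """Record that `kernel` is one possible implementation of `step`."""
        self.step_to_kernels[step].add(kernel)

    def record_run(self, kernel, config, runtime_s):
        """Update the graph with a runtime observed during execution."""
        best = self.kernel_to_configs[kernel].get(config)
        if best is None or runtime_s < best:
            self.kernel_to_configs[kernel][config] = runtime_s

    def best_implementation(self, step, available):
        """Return (kernel, config, runtime) with the lowest recorded runtime
        among the available hardware configs, or None if nothing is known."""
        candidates = [
            (runtime, kernel, config)
            for kernel in self.step_to_kernels.get(step, ())
            for config, runtime in self.kernel_to_configs.get(kernel, {}).items()
            if config in available
        ]
        if not candidates:
            return None
        runtime, kernel, config = min(candidates)
        return kernel, config, runtime


if __name__ == "__main__":
    kb = KnowledgeBase()
    kb.add_mapping("matmul", "gemm_tiled")
    kb.add_mapping("matmul", "gemm_systolic")
    kb.record_run("gemm_tiled", "cpu", 0.9)
    kb.record_run("gemm_tiled", "gpu", 0.4)
    kb.record_run("gemm_systolic", "fpga", 0.2)
    print(kb.best_implementation("matmul", {"cpu", "gpu"}))
```

Here the query cost depends on the number of kernels and configurations attached to one step, not on the total number of stored workflows, which is one plausible way to obtain query times independent of knowledge base size.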
Singularity for Machine Learning Applications - Analysis of Performance Impact
Bruce R. Jordan Jr., David Barrett, David Burke, Patrick Jardin, Amelia Littrell, Paul Monticciolo, Michael Newey, Jean Piou, Kara Warner (MIT-LL)
Software deployments in general, and deep learning applications in particular, suffer from difficulty in reproducing results. The use of containers to
mitigate these issues is becoming a common practice. Singularity is a container technology which targets the unique issues present in High
Performance Computing (HPC) centers. This paper characterizes the impact of using Singularity for both training and inference on deep learning
applications.
COMET: A Distributed Metadata Service for Federated Cloud Infrastructures
Cong Wang, Komal Thareja, Michael Stealey, Paul Ruth, Ilya Baldin (RENCI, Univ. North Carolina)
The majority of today's cloud services are independently operated by individual cloud service providers. In this approach, the locations of cloud
resources are strictly constrained by the distribution of cloud service providers' sites. As the popularity and scale of cloud services increase, we
believe this traditional paradigm is about to change toward further federated services, a.k.a. multi-cloud, due to the improved performance, reduced
cost of compute, storage and network resources, as well as increased user demands. In this paper, we present COMET, a lightweight,
distributed storage system for managing metadata on large-scale, federated cloud infrastructure providers, end users, and their applications. We
use two use cases from NSF's ExoGENI and Chameleon research cloud testbeds to show the effectiveness of the COMET design and deployment.
Introducing DyMonDS-as-a-Service (DyMaaS) for Internet of Things
Marija Ilic, Rupamathi Jaddivada (MIT)
With recent trends in computation and communication architecture, it is becoming possible to simulate complex networked dynamical systems by
employing high-fidelity models. The inherent spatial and temporal complexity of these systems, however, still acts as a roadblock. It is thus desirable
to have an adaptive platform design that facilitates zooming in and out of the models to emulate the time evolution of processes at a desired spatial and
temporal granularity. In this paper, we propose new computing and networking abstractions that can embrace physical dynamics and computations
in a unified manner, by taking advantage of the inherent structure. We further design multi-rate numerical methods that can be implemented by
computing architectures to facilitate adaptive zooming-in and out of the models spanning multiple spatial and temporal layers. These methods are
all embedded in a platform called Dynamic Monitoring and Decision Systems (DyMonDS). We introduce a new service model of cloud computing
called DyMonDS-as-a-Service (DyMaaS), for use by operators at various spatial granularities to efficiently emulate the interconnection of
IoT devices. The usage of this platform is described in the context of an electric microgrid system emulation.
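The idea of a multi-rate numerical method mentioned in this abstract can be illustrated with a generic two-time-scale integrator. This is a minimal sketch under stated assumptions and not the DyMonDS method: the function name, the forward Euler scheme, and the example system with its rate constants are all hypothetical. The fast state is advanced with many small steps while the slow state is held fixed, and the slow state is then advanced once with a large step.

```python
def multirate_euler(x_fast, x_slow, fast_rhs, slow_rhs,
                    big_step, substeps, n_big_steps):
    """Forward Euler with two step sizes.

    The fast state takes `substeps` small steps of size big_step / substeps
    per large step, with the slow state frozen; the slow state then takes
    one large step using the updated fast state.
    """
    small_step = big_step / substeps
    for _ in range(n_big_steps):
        for _ in range(substeps):
            x_fast += small_step * fast_rhs(x_fast, x_slow)
        x_slow += big_step * slow_rhs(x_fast, x_slow)
    return x_fast, x_slow


if __name__ == "__main__":
    # Illustrative system: the fast state relaxes quickly toward the slow
    # state, while the slow state decays slowly toward zero.
    def fast(xf, xs):
        return -50.0 * (xf - xs)

    def slow(xf, xs):
        return -0.5 * xs

    xf, xs = multirate_euler(0.0, 1.0, fast, slow,
                             big_step=0.1, substeps=20, n_big_steps=100)
    print(xf, xs)
```

Changing `substeps` is a crude stand-in for zooming in or out in temporal granularity: more substeps resolve the fast dynamics more finely at higher cost.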