2019 IEEE High Performance
Extreme Computing Conference
(HPEC ’19)
Twenty-third Annual HPEC Conference
24 - 26 September 2019
Westin Hotel, Waltham, MA USA
Thursday, September 26, 2019
AI 2
10:20-12:00 in Eden Vale A1/A2
Chair: Siddharth Samsi / MIT-LL
Deep Learning-Based Nuclei Segmentation of Cleared Brain Tissue
Pooya Khorrami (MIT-LL), Kevin Brady (MIT-LL), Mark Hernandez (MIT-LL), Lars Gjesteby (MIT-LL), Sara Nicole Burke (Univ. Florida), Damon
Lamb (Univ. Florida), Matthew A Melton (Univ. Florida), Kevin Otto (Univ. Florida), Laura J. Brattain (MIT-LL)
We present a deep learning approach for nuclei segmentation at scale. Our algorithm aims to address the challenge of segmentation in dense
scenes with limited annotated data. Annotation in this domain is highly manual, requiring time-consuming markup of neurons and extensive
expertise, and it often results in errors. For these reasons, our approach employs methods adapted from transfer learning. This approach can
also be extended to segment other components of the neurons.
Deploying AI Frameworks on Secure HPC Systems with Containers
David Brayford (Leibniz Rechenzentrum), Sofia Vallecorsa (CERN), Atanas Atanasov (Intel), Fabio Baruffa (Intel), Walter Rivera (Intel)
The increasing interest from the research community and industry in using Artificial Intelligence (AI) techniques to tackle “real world” problems
requires High-Performance Computing (HPC) resources to efficiently compute and scale complex algorithms across thousands of nodes.
Unfortunately, typical data scientists are not familiar with the unique requirements and characteristics of HPC environments. They usually
develop their applications with high-level scripting languages or frameworks such as TensorFlow, and the installation process often requires
connecting to external systems to download open-source software during the build. HPC environments, on the other hand, are often based on
closed-source applications that incorporate parallel and distributed computing APIs such as MPI and OpenMP, while users have restricted
administrator privileges and face security restrictions such as being denied access to external systems. In this paper, we discuss the issues
associated with deploying AI frameworks in a secure HPC environment and how we successfully deploy AI frameworks on SuperMUC-NG
with Charliecloud.
Target-based Resource Allocation for Deep Learning Applications in a Multi-tenancy System
Wenjia Zheng (Fordham Univ.), Yun Song (Fordham Univ.), Zihao Guo (Fordham Univ.), Yongchen Cui (Fordham Univ.), Suwen Gu (Fordham
Univ.), Ying Mao (Fordham Univ.), Long Cheng (Univ. College Dublin)
Neural network-based deep learning is a key technology enabling many powerful applications, including self-driving vehicles, computer vision,
and natural language processing. Although various algorithms focus on different directions, they generally employ an iteration-by-iteration
training and evaluation process. Each iteration aims to find a parameter set that minimizes a loss function defined by the learning model. When
the training process completes, a minimum of the loss is reached with a set of optimized parameters. At this stage, deep learning applications
can be shipped with a trained model to provide services. While deep learning applications are reshaping our daily life, obtaining a good learning
model is an expensive task. Training deep learning models is usually time-consuming and requires substantial resources, e.g., CPUs and GPUs.
In a multi-tenancy system, however, limited resources are shared by multiple clients, which leads to severe resource contention. Therefore, a
carefully designed resource management scheme is required to improve overall performance. In this project, we propose a target-based
scheduling scheme named TRADL. In TRADL, developers have the option to specify a two-tier target. If the model's accuracy reaches a target,
the model can be delivered to clients while training continues to further improve its quality. The experiments show that TRADL is able to
significantly reduce the time required to reach the target, by as much as 48.2%.
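The two-tier target described in the abstract can be sketched as follows. This is a hedged illustration only, not the paper's implementation: the class name TwoTierTarget and the method update() are hypothetical, as are the tier names deliver_at and final_at.

```python
class TwoTierTarget:
    """Hypothetical sketch of a two-tier accuracy target in the spirit of TRADL.

    Tier 1 (deliver_at): once reached, a model snapshot can be shipped to
    clients while training keeps running in the background.
    Tier 2 (final_at): once reached, training stops and resources are freed.
    """

    def __init__(self, deliver_at, final_at):
        self.deliver_at = deliver_at
        self.final_at = final_at
        self.delivered = None  # snapshot currently served to clients, if any

    def update(self, model_snapshot, accuracy):
        # Ship (or refresh) the served snapshot once tier 1 is met.
        if accuracy >= self.deliver_at:
            self.delivered = model_snapshot
        # Returning True signals the scheduler that tier 2 is reached
        # and the job's resources can be released.
        return accuracy >= self.final_at
```

A scheduler would call update() after each evaluation round; clients always see the latest snapshot that cleared the first tier, which is how early delivery and continued training coexist.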
[Best Paper Finalist] Training Behavior of Sparse Neural Network Topologies
Simon Alford, Ryan Robinett, Lauren Milechin, Jeremy Kepner (MIT)
Improvements in the performance of deep neural networks have often come through the design of larger and more complex networks. As a
result, fast memory is a significant limiting factor in our ability to improve network performance. One approach to overcoming this limit is the
design of sparse neural networks, which can be both very large and efficiently trained. In this paper we experiment with training on sparse neural
network topologies. We test pruning-based topologies, which are derived from an initially dense network whose connections are pruned, as well
as RadiX-Nets, a class of network topologies with proven connectivity and sparsity properties. Results show that sparse networks obtain
accuracies comparable to dense networks, but extreme levels of sparsity cause instability in training, which merits further study.
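For concreteness, a minimal sketch of deriving a sparse topology by magnitude pruning, one common criterion for pruning a dense network's connections; the paper's exact pruning procedure may differ, and the function name is hypothetical:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with smallest magnitude.

    Simple one-shot magnitude criterion for illustration; pruning-based
    topologies in practice are often built via iterative prune/retrain cycles.
    """
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    # The k-th smallest absolute value becomes the pruning threshold.
    thresh = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    # Keep only weights strictly above the threshold.
    return np.where(np.abs(weights) > thresh, weights, 0.0)
```

The zeroed positions define the fixed sparse topology; training then updates only the surviving connections.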
[Best Paper Finalist] Increasing Accuracy of Iterative Refinement in Limited Floating-Point Arithmetic on Half-Precision Accelerators
Piotr Luszczek (Univ. Tennessee), Ichitaro Yamazaki (Sandia), Jack Dongarra (Univ. Tennessee, ORNL, Univ. Manchester)
The emergence of Deep Learning as a leading computational workload for Machine Learning tasks at large-scale cloud infrastructure
installations has led to a plethora of accelerator hardware releases. However, the reduced precision and range of the floating-point numbers
on these new platforms make it a non-trivial task to leverage these unprecedented advances in computational power for numerical linear
algebra operations that come with a guarantee of robust error bounds. To address these concerns, we present a number of strategies
that can be used to increase the accuracy of limited-precision iterative refinement. By limited precision, we mean 16-bit floating-point formats
that are implemented in modern hardware accelerators and are not necessarily compliant with the IEEE half-precision specification. We include
an explanation of the broader context and connections to established IEEE floating-point standards and existing HPC benchmarks. We also
present a new formulation of LU factorization, which we call signed square root LU, that produces more numerically balanced L and U factors
and directly addresses the problems caused by the limited range of low-precision storage formats. The experimental results indicate that it is
possible to recover a substantial amount of accuracy in the system solution that would otherwise be lost. Previously, this could only be
achieved by using iterative refinement based on single-precision floating-point arithmetic. The discussion also explores the numerical stability
issues that are important for robust linear solvers on these new hardware platforms.
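The core idea of iterative refinement, solving cheaply at low precision while computing residuals at high precision, can be sketched as below. This is a simplified illustration, not the paper's signed square root LU: NumPy's float16 stands in for accelerator half precision, and the low-precision factorization is emulated by solving against the half-precision rounded copy of the matrix.

```python
import numpy as np

def refine(A, b, iters=10):
    """Iterative refinement: low-precision solves, high-precision residuals."""
    A16 = A.astype(np.float16)        # emulate half-precision storage of A
    A_lo = A16.astype(np.float64)     # the rounded matrix used for all solves
    x = np.linalg.solve(A_lo, b)      # initial low-accuracy solution
    for _ in range(iters):
        r = b - A @ x                 # residual in full (double) precision
        d = np.linalg.solve(A_lo, r)  # correction via the rounded matrix
        x = x + d                     # refine the solution
    return x
```

In practice the O(n^3) factorization is done once in half precision on the accelerator and only O(n^2) triangular solves and residual computations repeat; the emulation above captures the numerical behavior, not the performance.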