2019 IEEE High Performance Extreme Computing Conference (HPEC '19)
Twenty-third Annual HPEC Conference
24-26 September 2019
Westin Hotel, Waltham, MA USA
Thursday, September 26, 2019
AI 2: 10:20-12:00 in Eden Vale A1/A2
Chair: Siddharth Samsi / MIT-LL

Deep Learning-Based Nuclei Segmentation of Cleared Brain Tissue
Pooya Khorrami (MIT-LL), Kevin Brady (MIT-LL), Mark Hernandez (MIT-LL), Lars Gjesteby (MIT-LL), Sara Nicole Burke (Univ. Florida), Damon Lamb (Univ. Florida), Matthew A Melton (Univ. Florida), Kevin Otto (Univ. Florida), Laura J. Brattain (MIT-LL)
We present a deep learning approach for nuclei segmentation at scale. Our algorithm addresses the challenge of segmentation in dense scenes with limited annotated data. Annotation in this domain is highly manual, requiring time-consuming markup of neurons and extensive expertise, and often results in errors. For these reasons, our approach employs methods adopted from transfer learning. The approach can also be extended to segment other components of neurons.

Deploying AI Frameworks on Secure HPC Systems with Containers
David Brayford (Leibniz Rechenzentrum), Sofia Vallecorsa (CERN), Atanas Atanasov (Intel), Fabio Baruffa (Intel), Walter Rivera (Intel)
The increasing interest from the research community and industry in using Artificial Intelligence (AI) techniques to tackle "real-world" problems requires High-Performance Computing (HPC) resources to efficiently compute and scale complex algorithms across thousands of nodes. Unfortunately, typical data scientists are not familiar with the unique requirements and characteristics of HPC environments: they usually develop their applications with high-level scripting languages or frameworks such as TensorFlow, and the installation processes often require connections to external systems to download open-source software during the build. HPC environments, on the other hand, are often based on closed-source applications that incorporate parallel and distributed computing APIs such as MPI and OpenMP, while users have restricted administrator privileges and face security restrictions such as having no access to external systems. In this paper, we discuss the issues associated with deploying AI frameworks in a secure HPC environment and how we successfully deploy AI frameworks on SuperMUC-NG with Charliecloud.

Target-based Resource Allocation for Deep Learning Applications in a Multi-tenancy System
Wenjia Zheng (Fordham Univ.), Yun Song (Fordham Univ.), Zihao Guo (Fordham Univ.), Yongchen Cui (Fordham Univ.), Suwen Gu (Fordham Univ.), Ying Mao (Fordham Univ.), Long Cheng (Univ. College Dublin)
Neural network-based deep learning is the key technology behind many powerful applications, including self-driving vehicles, computer vision, and natural language processing. Although different algorithms focus on different problems, they generally employ an iteration-by-iteration training and evaluation process. Each iteration aims to find a parameter set that minimizes a loss function defined by the learning model; when training completes, an optimized set of parameters is obtained. At this stage, deep learning applications can be shipped with a trained model to provide services. While deep learning applications are reshaping our daily lives, obtaining a good model is expensive: training is usually time-consuming and requires substantial resources, e.g., CPUs and GPUs. In a multi-tenancy system, however, limited resources are shared by multiple clients, which leads to severe resource contention. Therefore, a carefully designed resource-management scheme is required to improve overall performance. In this project, we propose a target-based scheduling scheme named TRADL. In TRADL, developers can specify a two-tier target: if the model's accuracy reaches a target, the model can be delivered to clients while training continues to improve its quality. Experiments show that TRADL reduces the time needed to reach the target by as much as 48.2%.
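The abstract does not describe TRADL's internals; the following is a minimal, hypothetical sketch of how a two-tier accuracy target could trigger early delivery of a model while training continues. The callback names (train_step, evaluate, deliver) and the tier semantics are illustrative assumptions, not the authors' implementation.

```python
def train_with_two_tier_target(model, train_step, evaluate, deliver,
                               tier1=0.90, tier2=0.95, max_iters=10_000):
    """Hypothetical two-tier target loop: ship an early snapshot once
    accuracy passes tier1, keep training until tier2 (or the budget)."""
    delivered = False
    for _ in range(max_iters):
        train_step(model)                 # one training iteration
        accuracy = evaluate(model)        # validation accuracy in [0, 1]
        if not delivered and accuracy >= tier1:
            deliver(model)                # early delivery to clients
            delivered = True
        if accuracy >= tier2:             # final target reached: stop
            break
    return model
```

Under this reading, a scheduler could also deprioritize a job's resources once the first tier has been delivered, which is one way a target-based allocation policy could trade residual accuracy for time and contention.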
[Best Paper Finalist] Training Behavior of Sparse Neural Network Topologies
Simon Alford, Ryan Robinett, Lauren Milechin, Jeremy Kepner (MIT)
Improvements in the performance of deep neural networks have often come through the design of larger and more complex networks. As a result, fast memory is a significant limiting factor in our ability to improve network performance. One approach to overcoming this limit is the design of sparse neural networks, which can be both very large and efficiently trained. In this paper we experiment with training sparse neural network topologies. We test pruning-based topologies, which are derived from an initially dense network whose connections are pruned, as well as RadiX-Nets, a class of network topologies with proven connectivity and sparsity properties. Results show that sparse networks obtain accuracies comparable to those of dense networks, but extreme levels of sparsity cause instability in training, which merits further study.
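As a rough illustration of the pruning-based topologies mentioned above, the sketch below zeroes out the smallest-magnitude connections of a dense weight matrix. It is not the authors' code; the magnitude-threshold criterion and layer shape are assumptions, and RadiX-Nets are not reproduced here.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the entries
    of a dense weight matrix, yielding a pruning-based sparse topology."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    flat = np.abs(weights).ravel()
    threshold = np.partition(flat, k - 1)[k - 1]     # k-th smallest magnitude
    return np.where(np.abs(weights) > threshold, weights, 0.0)

# Example: prune a hypothetical 256x128 layer to roughly 90% sparsity.
W = np.random.default_rng(1).standard_normal((256, 128))
W_sparse = magnitude_prune(W, sparsity=0.9)
print("achieved sparsity:", 1.0 - np.count_nonzero(W_sparse) / W_sparse.size)
```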
[Best Paper Finalist] Increasing Accuracy of Iterative Refinement in Limited Floating-Point Arithmetic on Half-Precision Accelerators
Piotr Luszczek (Univ. Tennessee), Ichitaro Yamazaki (Sandia), Jack Dongarra (Univ. Tennessee, ORNL, Univ. Manchester)
The emergence of Deep Learning as a leading computational workload for Machine Learning tasks at large-scale cloud infrastructure installations has led to a plethora of accelerator hardware releases. However, the reduced precision and range of the floating-point numbers on these new platforms make it a non-trivial task to leverage these unprecedented advances in computational power for numerical linear algebra operations that come with a guarantee of robust error bounds. To address these concerns, we present a number of strategies that can be used to increase the accuracy of limited-precision iterative refinement. By limited precision, we mean 16-bit floating-point formats implemented in modern hardware accelerators that are not necessarily compliant with the IEEE half-precision specification. We also explain the broader context and connections to established IEEE floating-point standards and existing HPC benchmarks. In addition, we present a new formulation of LU factorization, which we call signed square root LU, that produces more numerically balanced L and U factors and directly addresses the limited range of the low-precision storage formats. The experimental results indicate that it is possible to recover a substantial amount of the accuracy in the system solution that would otherwise be lost; previously, this could only be achieved by using iterative refinement based on single-precision floating-point arithmetic. The discussion also explores the numerical stability issues that are important for robust linear solvers on these new hardware platforms.
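To make limited-precision iterative refinement concrete, here is a minimal NumPy sketch in which the factorization/solve step is emulated by rounding operands to float16 while residuals are accumulated in float64. The dtype-casting emulation and the well-conditioned test matrix are illustrative assumptions, not the authors' method or the signed square root LU formulation.

```python
import numpy as np

def lowprec_solve(A, b):
    # Stand-in for a solve on a half-precision accelerator: round the
    # operands to float16, then solve in float64 on the host.
    A16 = A.astype(np.float16).astype(np.float64)
    b16 = b.astype(np.float16).astype(np.float64)
    return np.linalg.solve(A16, b16)

def iterative_refinement(A, b, iterations=5):
    """Mixed-precision iterative refinement: corrections come from the
    low-precision solver, residuals are formed in float64."""
    x = lowprec_solve(A, b)             # initial low-precision solution
    for _ in range(iterations):
        r = b - A @ x                   # high-precision residual
        x = x + lowprec_solve(A, r)     # low-precision correction
    return x

# Hypothetical demo on a well-conditioned, diagonally dominant matrix.
rng = np.random.default_rng(0)
n = 64
A = rng.standard_normal((n, n)) + n * np.eye(n)
x_true = rng.standard_normal(n)
b = A @ x_true
x = iterative_refinement(A, b)
print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```

Each refinement step removes most of the error left by the rounded solve, which is why a low-precision factorization can still yield a solution close to full float64 accuracy when residuals are computed in higher precision.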