2019 IEEE High Performance Extreme Computing Conference (HPEC '19)
Twenty-third Annual HPEC Conference
24-26 September 2019
Westin Hotel, Waltham, MA USA
Wednesday, September 25, 2019
AI 1: 1:00-2:40 in Eden Vale A3
Chair: Paul Monticciolo / MIT LL

Survey and Benchmarking of Machine Learning Accelerators
Albert Reuther, Peter Michaleas, Michael Jones, Vijay Gadepally, Siddharth Samsi, and Jeremy Kepner (MIT-LL)
Advances in multicore processors and accelerators have opened the floodgates to greater exploration and application of machine learning techniques in a variety of applications. These advances, along with the breakdown of several trends, including Moore's Law, have prompted an explosion of processors and accelerators that promise even greater computational and machine learning capabilities. These processors and accelerators come in many forms, from CPUs and GPUs to ASICs, FPGAs, and dataflow accelerators. This paper surveys the current state of those processors and accelerators that have been publicly announced with performance and power consumption numbers. The performance and power values are plotted on a scatter graph, and a number of dimensions and observations from the trends on this plot are discussed and analyzed. For instance, there are interesting trends in the plot regarding power consumption, numerical precision, and inference versus training. We then choose and benchmark two commercially available low size, weight, and power (SWaP) accelerators, as these processors are the most interesting for the embedded and mobile machine learning inference applications most relevant to the DoD and other SWaP-constrained users. We determine how they actually perform with real-world images and neural network models, compare those results to the reported performance and power consumption values, and compare both accelerators to an Intel CPU that is used in some embedded applications.

Application of Approximate Matrix Multiplication to Neural Networks and Distributed SLAM
Brian Plancher (Harvard University); Camelia Brumar (Worcester Polytechnic Institute); Iulian Brumar (Harvard University); Lillian Pentecost (Harvard University); Saketh Rama (Harvard University)*; David Brooks (Harvard University)
Computational efficiency is a critical constraint for a variety of cutting-edge applications. In this work, we identify an opportunity to speed up the end-to-end runtime of two such applications by incorporating approximate linear algebra techniques. In particular, we apply approximate matrix multiplication to artificial Neural Networks (NNs) for image classification and to the robotics problem of Distributed Simultaneous Localization and Mapping (DSLAM). Expanding upon recent sampling-based Monte Carlo approximation strategies for matrix multiplication, we develop updated theoretical bounds and an adaptive error prediction strategy. We then apply these techniques in the context of NNs and DSLAM, increasing the speed of both applications by 15-20% while maintaining 97% classification accuracy for NNs running on the MNIST dataset and keeping the average robot position error under 1 meter (vs. 0.32 meters for the exact solution). However, both applications experience variance in their results. This suggests that Monte Carlo matrix multiplication may be an effective technique for reducing the memory and computational burden of certain algorithms when used carefully, but more research is needed before these techniques can be widely used in practice.
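As a point of reference for the abstract above, sampling-based Monte Carlo matrix multiplication is usually built on a column-row sampling estimator: draw a small number of column-row pairs with probability proportional to their norms, rescale, and sum the outer products. The NumPy sketch below is illustrative only (the function approx_matmul, the matrix sizes, and the sample count are hypothetical) and does not reproduce the paper's updated bounds or its adaptive error prediction strategy.

    import numpy as np

    def approx_matmul(A, B, c, rng=None):
        """Monte Carlo estimate of A @ B from c sampled column-row outer
        products, using the classic norm-proportional sampling distribution.
        Illustrative sketch; not the paper's implementation."""
        rng = np.random.default_rng() if rng is None else rng
        n = A.shape[1]
        # Sample index k with probability proportional to |A[:, k]| * |B[k, :]|.
        weights = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
        p = weights / weights.sum()
        idx = rng.choice(n, size=c, p=p)
        # Rescale each sampled outer product by 1 / (c * p_k) so the sum is
        # an unbiased estimator of the exact product.
        scale = 1.0 / (c * p[idx])
        return (A[:, idx] * scale) @ B[idx, :]

    # Quick check against the exact product on random matrices.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((128, 512))
    B = rng.standard_normal((512, 64))
    approx = approx_matmul(A, B, c=128, rng=rng)
    exact = A @ B
    print(np.linalg.norm(approx - exact, "fro") / np.linalg.norm(exact, "fro"))

Larger sample counts c trade runtime for lower variance, which is the kind of knob such methods tune against accuracy targets like the 97% MNIST figure quoted above.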
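For the first paper in this session (Survey and Benchmarking of Machine Learning Accelerators), the scatter graph the abstract describes can be pictured as peak performance plotted against power on log-log axes. The matplotlib sketch below uses made-up placeholder entries rather than the survey's data; the names, numbers, and marker coding for inference versus training are illustrative assumptions only.

    import matplotlib.pyplot as plt

    # Placeholder entries only -- not values from the survey.
    # (label, peak ops/s, power in watts, marker: o = inference, ^ = training)
    accelerators = [
        ("embedded ASIC, int8 inference", 4e12, 2.0, "o"),
        ("mobile GPU, fp16 inference", 1e13, 15.0, "o"),
        ("datacenter GPU, fp16 training", 1e14, 300.0, "^"),
        ("datacenter ASIC, bf16 training", 4e14, 450.0, "^"),
    ]

    fig, ax = plt.subplots()
    for label, ops, watts, marker in accelerators:
        ax.scatter(watts, ops, marker=marker, label=label)

    # Log-log axes, since announced parts span orders of magnitude in both
    # power draw and peak throughput.
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel("Power (W)")
    ax.set_ylabel("Peak performance (ops/s)")
    ax.legend(fontsize="small")
    plt.show()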
Low Power Computing and Simultaneous Electro-Optical/Radar Data Processing Using IBM's NS16e 16-chip Neuromorphic Hardware
Mark Barnell (Air Force Research Laboratory); Courtney Raymond (-); Daniel Brown (SRC, Inc.); Matthew Wilson (SRC, Inc.); Eric Cote (SRC, Inc.)*
For the first time, advanced machine learning (ML) compute architectures, techniques, and methods were demonstrated simultaneously on United States Geological Survey (USGS) optical imagery and Department of Defense (DoD) Synthetic Aperture Radar (SAR) imagery, using IBM's new NS16e neurosynaptic processor board comprising 16 TrueNorth chips. The Air Force Research Laboratory (AFRL) Information Directorate Advanced Computing and Communications Division continues to develop and demonstrate new bio-inspired computing algorithms and architectures designed to provide advanced, ultra-low-power, ground and airborne High-Performance Computing (HPC) solutions that meet operational and tactical real-time processing needs for Intelligence, Surveillance, and Reconnaissance (ISR) missions on small-form-factor hardware and in Size, Weight, and Power (SWaP) constrained environments. With an average throughput of 16,000 inferences per second, the system provided a processing efficiency of 1,066 inferences per watt. The NS16e power utilization never exceeded 15 watts for this application, and the contribution of the TrueNorth processors to the total power consumption was bounded below 5.5 watts.

TapirXLA: Embedding Fork-Join Parallelism into the XLA Compiler in TensorFlow Using Tapir
Siddharth Samsi (MIT Lincoln Laboratory)*; Tao Schardl (MIT CSAIL)
This work introduces TapirXLA, a replacement for TensorFlow's XLA compiler that embeds recursive fork-join parallelism into XLA's low-level representation of code. Machine-learning applications rely on efficient parallel processing to achieve performance, and they employ a variety of technologies to improve performance, including compiler technology. But compilers in machine-learning frameworks lack a deep understanding of parallelism, causing them to lose performance by missing optimizations on parallel computation. This work studies how Tapir, a compiler intermediate representation (IR) that embeds parallelism into a mainstream compiler IR, can be incorporated into a compiler for machine learning to remedy this problem. TapirXLA modifies the XLA compiler in TensorFlow to employ the Tapir/LLVM compiler to optimize low-level parallel computation. TapirXLA encodes the parallelism within high-level TensorFlow operations using Tapir's representation of fork-join parallelism, and it also exposes to the compiler implementations of linear-algebra library routines whose parallel operations are encoded using Tapir's representation. We compared the performance of TensorFlow using TapirXLA against TensorFlow using an unmodified XLA compiler. On four neural-network benchmarks, TapirXLA speeds up the parallel running time of the network by a geometric-mean multiplicative factor of 30% to 100% across four CPU architectures.
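Tapir's fork-join constructs live inside the LLVM-level IR that TapirXLA targets, not in user-facing Python, so the sketch below is only a structural illustration of the recursive fork-join pattern the TapirXLA abstract refers to: spawn one branch, keep computing the other on the current thread, then sync. The function parallel_sum and the grain size are hypothetical, and because of Python's GIL this shows the pattern rather than the speedups reported in the paper.

    import threading
    import numpy as np

    def parallel_sum(x, grain=1 << 16):
        """Recursive fork-join reduction: fork the left half onto a new
        thread, compute the right half on the current thread, then join."""
        if x.size <= grain:
            return float(np.sum(x))        # serial base case
        mid = x.size // 2
        result = {}

        def left_task():
            result["left"] = parallel_sum(x[:mid], grain)

        t = threading.Thread(target=left_task)  # fork
        t.start()
        right = parallel_sum(x[mid:], grain)    # continue on this thread
        t.join()                                # join (Tapir's "sync")
        return result["left"] + right

    x = np.arange(1_000_000, dtype=np.float64)
    # The two results agree up to floating-point rounding.
    print(parallel_sum(x), float(np.sum(x)))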