2019 IEEE High Performance
Extreme Computing Conference
(HPEC ’19)
Twenty-third Annual HPEC Conference
24 - 26 September 2019
Westin Hotel, Waltham, MA USA
Wednesday September 25, 2019
AI 1
1:00-2:40 in Eden Vale A3
Chair: Paul Monticciolo / MIT LL
Survey and Benchmarking of Machine Learning Accelerators
Albert Reuther, Peter Michaleas, Michael Jones, Vijay Gadepally, Siddharth Samsi, and Jeremy Kepner (MIT-LL)
Advances in multicore processors and accelerators have opened the floodgates to greater exploration and application of machine learning
techniques to a variety of applications. These advances, along with breakdowns of several trends including Moore's Law, have prompted an
explosion of processors and accelerators that promise even greater computational and machine learning capabilities. These processors and
accelerators are coming in many forms, from CPUs and GPUs to ASICs, FPGAs, and dataflow accelerators. This paper surveys the current
state of these processors and accelerators that have been publicly announced with performance and power consumption numbers. The
performance and power values are plotted on a scatter graph and a number of dimensions and observations from the trends on this plot are
discussed and analyzed. For instance, there are interesting trends in the plot regarding power consumption, numerical precision, and
inference versus training. We then choose and benchmark two commercially-available low size, weight, and power (SWaP) accelerators as
these processors are the most interesting for embedded and mobile machine learning inference applications that are most applicable to the
DoD and other SWaP-constrained users. We determine how they actually perform with real-world images and neural network models,
compare those results to the reported performance and power consumption values, and compare them to an Intel CPU that is used in some
embedded applications.
Application of Approximate Matrix Multiplication to Neural Networks and Distributed SLAM
Brian Plancher (Harvard University); Camelia Brumar (Worcester Polytechnic Institute); Iulian Brumar (Harvard University); Lillian Pentecost
(Harvard University); Saketh Rama (Harvard University)*; David Brooks (Harvard University)
Computational efficiency is a critical constraint for a variety of cutting-edge applications. In this work, we identify an opportunity to speed up
the end-to-end runtime of two such applications by incorporating approximate linear algebra techniques. Particularly, we apply approximate
matrix multiplication to artificial Neural Networks (NNs) for image classification and to the robotics problem of Distributed Simultaneous
Localization and Mapping (DSLAM). Expanding upon recent sampling-based Monte Carlo approximation strategies for matrix multiplication,
we develop updated theoretical bounds, and an adaptive error prediction strategy. We then apply these techniques in the context of NNs and
DSLAM, increasing the speed of both applications by 15-20% while maintaining a 97% classification accuracy for NNs running on the MNIST
dataset and keeping the average robot position error under 1 meter (vs 0.32 meters for the exact solution). However, both applications
experience variance in their results. This suggests that Monte Carlo matrix multiplication may be an effective technique to reduce the memory
and computational burden of certain algorithms when used carefully, but more research is needed before these techniques can be widely used
in practice.
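
The sampling idea behind Monte Carlo matrix multiplication can be sketched as follows. This is a minimal illustration of the classical column-row sampling estimator, not the authors' method: the abstract's updated bounds and adaptive error prediction are not reproduced, and the names `approx_matmul`, `exact_matmul`, and `frobenius_error` are hypothetical. It assumes matrices stored as lists of rows and uses only the Python standard library.

```python
import math
import random


def exact_matmul(A, B):
    """Reference product of an (m x n) and an (n x p) matrix, as lists of rows."""
    n, p = len(B), len(B[0])
    return [[sum(row[k] * B[k][j] for k in range(n)) for j in range(p)] for row in A]


def approx_matmul(A, B, num_samples, rng):
    """Estimate A*B by averaging num_samples rescaled column-row outer products."""
    m, n, p = len(A), len(B), len(B[0])
    # Importance weight for index k: norm of column k of A times norm of row k of B.
    weights = [
        math.sqrt(sum(A[i][k] ** 2 for i in range(m)))
        * math.sqrt(sum(B[k][j] ** 2 for j in range(p)))
        for k in range(n)
    ]
    total = sum(weights)
    if total == 0.0:
        return [[0.0] * p for _ in range(m)]  # the product is exactly zero
    probs = [w / total for w in weights]
    C = [[0.0] * p for _ in range(m)]
    for k in rng.choices(range(n), weights=probs, k=num_samples):
        # Dividing by (num_samples * probs[k]) keeps the estimator unbiased.
        scale = 1.0 / (num_samples * probs[k])
        for i in range(m):
            a = A[i][k] * scale
            for j in range(p):
                C[i][j] += a * B[k][j]
    return C


def frobenius_error(C, D):
    """Frobenius norm of the difference between two equally shaped matrices."""
    return math.sqrt(sum((c - d) ** 2 for rc, rd in zip(C, D) for c, d in zip(rc, rd)))


if __name__ == "__main__":
    A = [[1.0, 2.0, 0.5, -1.0], [0.0, 1.5, -2.0, 1.0], [2.0, -1.0, 1.0, 0.5]]
    B = [[1.0, -1.0], [0.5, 2.0], [-1.5, 1.0], [2.0, 0.0]]
    estimate = approx_matmul(A, B, num_samples=2000, rng=random.Random(0))
    print("Frobenius error:", frobenius_error(estimate, exact_matmul(A, B)))
```

The estimate varies from run to run unless the random generator is seeded, which matches the variance the abstract reports. A speedup is only possible when the number of samples is well below the inner dimension, so the tiny matrices above illustrate correctness rather than performance.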
Low Power Computing and Simultaneous Electro-Optical/Radar Data Processing using IBM’s NS16e 16-chip Neuromorphic
Hardware
Mark Barnell (Air Force Research Laboratory); Courtney Raymond (-); Daniel Brown (SRC, Inc.); Matthew Wilson (SRC, Inc.); Eric Cote (SRC,
Inc.)*
For the first time ever, advanced machine learning (ML) compute architectures, techniques, and methods were demonstrated on United States
Geological Survey (USGS) optical imagery and Department of Defense (DoD) Synthetic Aperture Radar (SAR) imagery, simultaneously, using
IBM’s new NS16e neurosynaptic processor board comprised of 16 TrueNorth chips. The Air Force Research Laboratory (AFRL) Information
Directorate Advanced Computing and Communications Division continues to develop and demonstrate new bio-inspired computing algorithms
and architectures, designed to provide advanced, ultra-low power, ground and airborne High-Performance Computing (HPC) solutions to meet
operational and tactical, real-time processing needs for Intelligence, Surveillance, and Reconnaissance (ISR) missions on small form factor
hardware, and in Size, Weight and Power (SWaP) constrained environments. With an average throughput of 16,000 inferences per second,
the system provided a processing efficiency of 1,066 inferences per Watt. The NS16e power utilization never exceeded 15 Watts for this
application. The contribution of power consumption from TrueNorth processors was bound to less than 5.5 Watts.
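
As a quick check on the efficiency figure above, the sketch below divides the average throughput by the 15 Watt board power ceiling. Pairing those two numbers is an assumption about how the figure was derived, although it does reproduce the stated value. The function name is hypothetical, and the result is strictly inferences per second per Watt.

```python
def inferences_per_second_per_watt(throughput_ips, power_watts):
    """Processing efficiency: throughput (inferences/s) divided by power (W)."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return throughput_ips / power_watts


# Figures from the abstract: 16,000 inferences/s and a 15 W ceiling (assumed pairing).
efficiency = inferences_per_second_per_watt(16_000, 15)
print(f"{efficiency:.1f} inferences per second per Watt")
```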
TapirXLA: Embedding Fork-Join Parallelism into the XLA Compiler in TensorFlow Using Tapir
Siddharth Samsi (MIT Lincoln Laboratory)*; Tao Schardl (MIT CSAIL)
This work introduces TapirXLA, a replacement for TensorFlow’s XLA compiler that embeds recursive fork-join parallelism into XLA’s low-level
representation of code. Machine-learning applications rely on efficient parallel processing to achieve performance, and they employ a variety
of technologies to improve performance, including compiler technology. But compilers in machine-learning frameworks lack a deep
understanding of parallelism, causing them to lose performance by missing optimizations on parallel computation. This work studies how Tapir,
a compiler intermediate representation (IR) that embeds parallelism into a mainstream compiler IR, can be incorporated into a compiler for
machine learning to remedy this problem. TapirXLA modifies the XLA compiler in TensorFlow to employ the Tapir/LLVM compiler to optimize
low-level parallel computation. TapirXLA encodes the parallelism within high-level TensorFlow operations using Tapir’s representation of fork-
join parallelism. TapirXLA also exposes to the compiler implementations of linear-algebra library routines whose parallel operations are
encoded using Tapir’s representation. We compared the performance of TensorFlow using TapirXLA against TensorFlow using an unmodified
XLA compiler. On four neural-network benchmarks, TapirXLA speeds up the parallel running time of the network by a geometric-mean
multiplicative factor of 30% to 100%, across four CPU architectures.
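
For readers unfamiliar with recursive fork-join parallelism, the control structure can be sketched in Python as below. This only illustrates the pattern and is not how TapirXLA or Tapir represent it: the function name `fork_join_sum` and the `grain` parameter are hypothetical, and CPython threads will not speed up this CPU-bound sum because of the global interpreter lock.

```python
import threading


def fork_join_sum(values, lo=0, hi=None, grain=1024):
    """Sum values[lo:hi] by recursively forking the left half and joining it."""
    if hi is None:
        hi = len(values)
    if hi - lo <= grain:
        return sum(values[lo:hi])  # small range: run serially

    mid = (lo + hi) // 2
    result = {}

    def left_task():
        result["left"] = fork_join_sum(values, lo, mid, grain)

    child = threading.Thread(target=left_task)
    child.start()                                   # fork: left half runs in a child
    right = fork_join_sum(values, mid, hi, grain)   # parent continues with right half
    child.join()                                    # join: wait for the child
    return result["left"] + right


print(fork_join_sum(list(range(10_000)), grain=1_000))
```

The `grain` cutoff bounds the number of threads created. A compiler that understands the fork and join points can, in principle, optimize across them, which is the kind of opportunity the abstract describes.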