2019 IEEE High Performance
Extreme Computing Conference
(HPEC ‘19)
Twenty-third Annual HPEC Conference
24 - 26 September 2019
Westin Hotel, Waltham, MA USA
Lunch, View Posters & Demos 1
12:00-1:00 in Foyer
Deep-Learning Inferencing with High-Performance Hardware Accelerators
Luke Kljucaric, Alan D. George (SHREC @ Pitt)
FPGAs are commonly employed to accelerate applications because of their performance-per-watt advantages over general-purpose architectures. With the exponential growth of available data, machine-learning apps have drawn greater interest as a means to understand that data more comprehensively and to increase autonomous processing. As FPGAs become more readily available through cloud services like the Amazon Web Services F1 platform, it is worth studying how well machine-learning apps accelerate on FPGAs compared to traditional fixed-logic devices, like CPUs and GPUs. FPGA frameworks for accelerating convolutional neural networks (CNNs), which are used in many machine-learning apps, have begun to emerge for accelerated-application development. This research compares the performance of these emerging frameworks on two commonly used CNNs, GoogLeNet and AlexNet. Specifically, handwritten Chinese character recognition is benchmarked across multiple FPGA frameworks on Xilinx and Intel FPGAs and compared against multiple CPU and GPU architectures featured on AWS, Google Cloud Platform, the University of Pittsburgh's Center for Research Computing (CRC), and Intel's vLab Academic Cluster. The NVIDIA GPUs outperformed every other device in this study. The Zebra framework available for Xilinx FPGAs showed average improvements of 8.3 times in performance and 9.3 times in efficiency over the OpenVINO framework available for Intel FPGAs. Although the Zebra framework on the Xilinx VU9P showed greater efficiency than the Pascal-based GPUs, the NVIDIA Tesla V100 proved to be the most efficient device, at 125.9 and 47.2 images-per-second-per-watt for AlexNet and GoogLeNet, respectively. Although they currently lag, FPGA frameworks and devices have the potential to compete with GPUs in both performance and efficiency.
ECG Feature Processing Performance Acceleration on SLURM Compute Systems
Michael Nolan, Mark Hernandez, Philip Fremont-Smith, Albert Swiston, Kajal Claypool (MIT-LL)
Electrocardiogram (ECG) signal features (e.g., heart rate, inter-peak interval times) are commonly used in physiological assessment. Commercial off-the-shelf (COTS) software solutions for ECG data processing are available, but they are often developed for serialized data processing, which scales poorly for large datasets. To address this issue, we've developed a MATLAB code library for parallelized ECG feature generation. This library uses the pMatlab and MatMPI interfaces to distribute computing tasks over supercomputing clusters using the Simple Linux Utility for Resource Management (SLURM). To profile its performance as a function of
parallelization scale, the ECG processing code was executed on a non-human primate dataset on the Lincoln Laboratory
Supercomputing TXGreen cluster. Feature processing jobs were deployed over a range of processor counts and processor types to
assess the overall reduction in job computation time. We show that individual process times decrease according to a 1/n relationship to
the number of processors used, while total computation times accounting for deployment and data aggregation impose diminishing
returns of time against processor count. A maximum mean reduction in overall file processing time of 99% is shown.
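The scaling behavior described can be illustrated with a simple analytic model (the parameter values here are hypothetical, not taken from the paper): per-process time falls as 1/n, while a fixed deployment-and-aggregation overhead caps the achievable reduction in total job time.

```python
def total_time(n_procs, t_work=1000.0, t_overhead=10.0):
    """Total job time (arbitrary units) when the feature-processing work
    is split evenly over n_procs processes: the parallel portion shrinks
    as 1/n, but deployment and data aggregation cost a fixed overhead."""
    return t_work / n_procs + t_overhead

def time_reduction(n_procs, **kw):
    """Fractional reduction in overall processing time vs. one process."""
    return 1.0 - total_time(n_procs, **kw) / total_time(1, **kw)
```

With these illustrative numbers, 100 processes already yield roughly a 98% reduction, and additional processes approach, but never exceed, the overhead-limited ceiling — the diminishing-returns pattern the abstract reports.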
Embedded Processor-In-Memory Architecture for Accelerating Arithmetic Operations
Richard Muri, Paul Fortier (UMass Dartmouth)
A processor-in-memory (PIM) computer architecture is any design that performs some subset of logical operations in the
same location as memory. The traditional model of computing involves a processor loading data from memory to perform operations,
with a bus connecting the processor and memory. While this technique works well in many situations, a growing gap between memory
performance and processor performance has led some researchers to develop alternative architectures. This paper details the
implementation of a PIM architecture in a soft core microcontroller used to accelerate applications limited by register file size. Using an
Artix-7 FPGA, an ATmega103 microcontroller soft core is modified to include a PIM core as an accelerator. The sample application of
AES encryption provides a comparison between the baseline processor and the PIM-enhanced machine. AES encryption on the modified microcontroller requires 38% fewer clock cycles without relying on application-specific improvements, at the expense of increased program memory size and FPGA fabric utilization.
[Graph Challenge Honorable Mention] Fast Triangle Counting on GPU
Chuangyi Gui, Long Zheng, Pengcheng Yao, Xiaofei Liao, Hai Jin (Huazhong Univ. of Science and Technology)
Triangle counting is one of the most basic graph applications, used to solve real-world problems in a wide variety of domains. Exploiting the massive parallelism of the Graphics Processing Unit (GPU) to accelerate triangle counting is prevalent. We observe that state-of-the-art GPU-based studies, which focus on improving load balancing, still inherently exhibit a large number of random accesses that degrade performance. In this paper, we design a prefetching scheme that buffers the neighbor list of the vertex being processed in fast shared memory ahead of time, avoiding the high latency of random global-memory accesses. We also adopt a degree-based graph-reordering technique and design a simple heuristic to distribute the workload evenly. Compared to last year's HPEC Graph Challenge champion, we improve the performance of triangle counting by up to 5.9x, achieving > 10^9 TEPS on a single GPU for many large real-world graphs from the Graph Challenge datasets.
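The degree-ordering heuristic underlying such approaches can be sketched serially in pure Python (an illustration only; the paper's implementation is a GPU kernel with shared-memory prefetching, and the intersection step below is precisely what that prefetching accelerates):

```python
from collections import defaultdict

def count_triangles(edges):
    """Count triangles in an undirected simple graph (edge list without
    duplicates or self-loops) by orienting each edge from its lower-degree
    to its higher-degree endpoint (ties broken by vertex id), then
    intersecting out-neighbor sets. This is the classic degree-ordered
    scheme that GPU triangle-counting kernels build on."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    rank = lambda x: (deg[x], x)        # degree-based total order
    out = {}
    for u, v in edges:
        a, b = (u, v) if rank(u) < rank(v) else (v, u)
        out.setdefault(a, set()).add(b)  # orient a -> b
    total = 0
    for u, nbrs in out.items():
        for w in nbrs:                   # each triangle counted exactly once
            total += len(nbrs & out.get(w, set()))
    return total
```

Orienting edges this way bounds the out-degree of every vertex, which both balances work and shrinks the neighbor lists being intersected.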
FFTX for Micromechanical Stress-Strain Analysis
Anuva Kulkarni, Daniele G. Spampinato, Franz Franchetti (CMU)
Porting scientific simulations to heterogeneous platforms requires complex algorithmic and optimization strategies to overcome memory
and communication bottlenecks. Such operations are inexpressible using traditional libraries (e.g., FFTW for spectral methods) and
difficult to optimize by hand for various hardware platforms. In this work, we use our GPU-adapted stress-strain analysis method to
show how FFTX, a new API that extends FFTW, can be used to express our algorithm without worrying about code optimization, which
is handled by a back-end code generator.
Large Scale Organization and Inference of an Imagery Dataset for Public Safety
Jeffrey Liu, David Strohschein, Siddharth Samsi, Andrew Weinert (MIT-LL)
Video applications and analytics are routinely projected as a stressing and significant service of the Nationwide Public Safety
Broadband Network. As part of a NIST PSCR funded effort, the New Jersey Office of Homeland Security and Preparedness and MIT
Lincoln Laboratory have been developing a computer vision dataset of operational and representative public safety scenarios. The scale
and scope of this dataset necessitates a hierarchical organization approach for efficient compute and storage. We overview architectural
considerations using the Lincoln Laboratory Supercomputing Cluster as a test architecture. We then describe how we intelligently
organized the dataset across LLSC and evaluated it with large scale imagery inference across terabytes of data.
Optimizing the Visualization Pipeline of a 3D Monitoring and Management System
Rebecca Wild, Matthew Hubbell, Jeremy Kepner (MIT-LL)
Monitoring and managing High Performance Computing (HPC) systems and environments generates an ever-growing amount of data. To make sense of this data, and to give system administrators and management a platform for visualizing it so they can proactively identify system failures or understand the state of the system, the platform must be as efficient and scalable as the underlying database tools used to store and analyze the data. In this paper we show how we leverage Accumulo, D4M, and Unity to build a 3D visualization platform for monitoring and managing the Lincoln Laboratory Supercomputer systems, and how we have had to retool our approach to scale with our systems.
Overcoming Limitations of GPGPU-Computing in Scientific Applications
Connor Kenyon, Glenn Volkema, Gaurav Khanna (UMass Dartmouth)
The performance of discrete general purpose graphics processing units (GPGPUs) has been improving at a rapid pace. The PCIe
interconnect that controls the communication of data between the system host memory and the GPU has not improved as quickly,
leaving a performance gap due to GPU downtime while waiting for PCIe data transfers. In this article, we explore two alternatives to the limited PCIe bandwidth: the NVIDIA NVLink interconnect, and zero-copy algorithms for shared-memory Heterogeneous System Architecture (HSA) devices. The OpenCL SHOC benchmark suite is used to measure the performance of each device on various
scientific application kernels.
[Graph Challenge Finalist] Performance of Training Sparse Deep Neural Networks on GPUs
Jianzong Wang, Zhangcheng Huang, Lingwei Kong, Jing Xiao (Ping An Technology (Shenzhen)), Pengyu Wang, Lu Zhang, Chao Li
(Shanghai Jiao Tong University)
Deep neural networks have revolutionized the field of machine learning by dramatically improving the state-of-the-art in various domains. The sizes of deep neural networks (DNNs) are rapidly outgrowing the capacity of hardware to store and train them quickly. Over the past few decades, researchers have explored the prospect of sparsifying DNNs before, during, and after training by pruning edges from the underlying topology; the resulting network is known as a sparse neural network. More recent works have demonstrated the remarkable result that certain sparse DNNs can train to the same precision as dense DNNs at lower runtime and storage cost. Such methods ease the high demand for computation resources that severely hinders the deployment of large-scale DNNs on resource-constrained devices, allowing DNNs to be trained at a faster speed and lower cost. In this work, we propose a Fine-tune Structured Sparsity Learning (FSSL) method to regularize the structures of DNNs and accelerate their training. FSSL can: (1) learn a compact structure from a large sparse DNN to reduce computation cost; and (2) obtain a hardware-friendly structure that accelerates DNN evaluation efficiently. Experimental results on training time and compression rate show superior performance and efficiency compared to the MATLAB example code, with speedups roughly twice those of non-structured sparsity.
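The distinction between structured and unstructured sparsity that such methods exploit can be illustrated with a generic NumPy sketch (not the authors' FSSL method; the function names and the magnitude-based criterion are hypothetical choices): zeroing whole rows preserves dense, hardware-friendly blocks, while element-wise pruning scatters zeros irregularly.

```python
import numpy as np

def prune_rows(w, keep_ratio=0.5):
    """Structured pruning: zero entire rows with the smallest L2 norms,
    so the surviving weights remain a dense, hardware-friendly block."""
    norms = np.linalg.norm(w, axis=1)
    k = int(len(norms) * keep_ratio)
    keep = np.argsort(norms)[-k:]        # indices of the k largest rows
    mask = np.zeros_like(w)
    mask[keep] = 1.0
    return w * mask

def prune_elements(w, keep_ratio=0.5):
    """Unstructured pruning: zero the individually smallest weights,
    leaving an irregular sparsity pattern that is harder to accelerate."""
    flat = np.abs(w).ravel()
    thresh = np.sort(flat)[int(len(flat) * (1 - keep_ratio))]
    return np.where(np.abs(w) >= thresh, w, 0.0)
```

At equal sparsity, the structured variant lets hardware skip whole rows (or filters, in a convolutional layer), which is the source of the evaluation speedups structured methods report.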
Projecting Quantum Computational Advantage versus Classical State-of-the-Art
Jason Larkin, Daniel Justice (CMU SEI)
A major milestone in quantum computing research is to demonstrate quantum supremacy, where some computation is performed by a quantum computer that is infeasible classically [1]. But while quantum supremacy may be demonstrable in the near-term noisy intermediate-scale quantum (NISQ) era [2], [3], such a demonstration does not bring advantage for practical applications. A common practical application used in benchmarking classical and quantum computing is Maxcut, with applications in domains such as machine scheduling [4], image recognition [5], electronic circuit layout [6], and software verification/validation [7], [8]. For practical applications, quantum computational advantage has to be measured against the best performance that classical computing offers.(cite) For Maxcut, we project the performance of the hybrid quantum-classical Quantum Approximate Optimization Algorithm (QAOA) [9] and compare it with the best-performing classical solvers in terms of time to solution and solution quality [10]–[12]. Performance data is discussed in the context of projecting NISQ-era quantum advantage. The results demonstrate the importance of algorithm and problem selection and, in particular, the challenges that quantum computing algorithms and hardware face from classical computing.
[1] John Preskill. Quantum Computing in the NISQ era and beyond. Quantum, 2:79, 2018.
[2] Sergio Boixo, Sergei V. Isakov, Vadim N. Smelyanskiy, Ryan Babbush, Nan Ding, Zhang Jiang, Michael J. Bremner, John M. Martinis, and Hartmut Neven. Characterizing quantum supremacy in near-term devices. Nature Physics, 14(6):595–600, June 2018.
[3] C. Neill, P. Roushan, K. Kechedzhi, S. Boixo, S. V. Isakov, V. Smelyanskiy, A. Megrant, B. Chiaro, A. Dunsworth, K. Arya, R. Barends, B. Burkett, Y. Chen, Z. Chen, A. Fowler, B. Foxen, M. Giustina, R. Graff, E. Jeffrey, T. Huang, J. Kelly, P. Klimov, E. Lucero, J. Mutus, M. Neeley, C. Quintana, D. Sank, A. Vainsencher, J. Wenner, T. C. White, H. Neven, and J. M. Martinis. A blueprint for demonstrating quantum supremacy with superconducting qubits. Science, 360(6385):195–199, April 2018.
[4] Bahram Alidaee, Gary A. Kochenberger, and Ahmad Ahmadian. 0-1 Quadratic programming approach for optimum solutions of two scheduling problems. International Journal of Systems Science, 25(2):401–408, February 1994.
[5] Hartmut Neven, Geordie Rose, and William G. Macready. Image recognition with an adiabatic quantum computer I. Mapping to quadratic unconstrained binary optimization. arXiv:0804.4457 [quant-ph], April 2008.
[6] Michel Deza and Monique Laurent. Applications of cut polyhedra — II. Journal of Computational and Applied Mathematics, 55(2):217–247, November 1994.
[7] Manu Jose and Rupak Majumdar. Cause Clue Clauses: Error Localization Using Maximum Satisfiability. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '11, pages 437–446, New York, NY, USA, 2011. ACM.
[8] L. Guo, A. S. Vincentelli, and A. Pinto. A complexity metric for concurrent finite state machine based embedded software. In 2013 8th IEEE International Symposium on Industrial Embedded Systems (SIES), pages 189–195, June 2013.
[9] Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. A Quantum Approximate Optimization Algorithm Applied to a Bounded Occurrence Constraint Problem. arXiv:1412.6062 [quant-ph], December 2014.
[10] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42(6):1115–1145, November 1995.
[11] Adrian Kügel. Improved Exact Solver for the Weighted Max-SAT Problem.
[12] Blaine Keetch and Yves van Gennip. A Max-Cut approximation using a graph based MBO scheme. arXiv:1711.02419 [math], November 2017.
[13] Gavin E. Crooks. Performance of the Quantum Approximate Optimization Algorithm on the Maximum Cut Problem. arXiv:1811.08419 [quant-ph], November 2018.
[14] Joran van Apeldoorn and András Gilyén. Improvements in Quantum SDP-Solving with Applications. arXiv:1804.05058 [quant-ph], April 2018.
[15] Fernando G. S. L. Brandão, Amir Kalev, Tongyang Li, Cedric Yen-Yu Lin, Krysta M. Svore, and Xiaodi Wu. Quantum SDP Solvers: Large Speed-ups, Optimality, and Applications to Quantum Learning. arXiv:1710.02581 [quant-ph], October 2017.
[16] Nai-Hui Chia, Tongyang Li, Han-Hsuan Lin, and Chunhao Wang. Quantum-inspired classical sublinear-time algorithm for solving low-rank semidefinite programming via sampling approaches. arXiv:1901.03254 [quant-ph], January 2019.
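For reference, Maxcut seeks a bipartition of a graph's vertices that maximizes the number of edges crossing the cut. A brute-force solver (feasible only for tiny graphs, shown purely to make the benchmark objective concrete; QAOA and the classical heuristics above approximate this optimum at scale) looks like:

```python
from itertools import product

def cut_value(edges, assignment):
    """Number of edges whose endpoints land on opposite sides of the cut;
    assignment maps each vertex index to side 0 or 1."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])

def maxcut_bruteforce(n_vertices, edges):
    """Exhaustively search all 2^n bipartitions; returns the best cut value."""
    return max(cut_value(edges, bits)
               for bits in product((0, 1), repeat=n_vertices))
```

The exponential search space is why both approximate quantum algorithms (QAOA) and classical heuristics (e.g., Goemans-Williamson SDP rounding) are benchmarked on solution quality as well as time to solution.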
Resilience-Aware Decomposition and Monitoring of Large-Scale Embedded Systems
Miguel Mark (Boston Univ.), David Whelihan, Michael Vai, Haley Whitman (MIT-LL), Michel Kinsy (Boston Univ.)
With the inherent complexity of large scale embedded systems and the lack of proper design tools, it is difficult for system engineers to
verify that functional specifications adhere to design requirements. Applying formal verification to such large scale embedded systems is
challenging due to the expertise required in formal methods. It then becomes a daunting task to achieve mission assurance for
embedded systems deployed in hostile environments. In this work, we introduce a monitoring-based approach and develop a new tool, called Formal Resilience Decomposition and Monitoring (FOREDEM), to assist system engineers in improving the mission assurance of
their designs. FOREDEM implements a workflow allowing engineers to assess the overall resilience of a design and understand the
associated costs through trade-off analysis.
Road Traffic Anomaly Detection using Functional Data Analysis
George Tsitsopoulos (Northeastern); Eric Truslow, Dimitris Manolakis (MIT-LL)
Streets and highways provide a ubiquitous data source, vehicle traffic volume, that can be exploited to gain insight into what is
happening on roadways. Traffic patterns generally fluctuate in a consistent manner throughout the week, making them relatively
predictable. However, holidays and unforeseeable anomalies such as accidents can cause significant deviations from the norm.
Detecting these irregularities can be a difficult task due to the general noisiness of the count data. Nonetheless, knowledge of these
traffic anomalies is important to many parties, making it a critical problem to solve. Awareness of an anomaly can ensure a timely arrival
to work or alert agencies when something unusual is occurring in an area of interest. Although traffic volume data is readily available, it
is not exploited to the extent we believe it should be when it comes to detecting anomalies. We can divide traffic anomalies into two
categories: short-term and long-term. A short-term anomaly is generally an accident that causes a change in traffic
pattern for a few hours or less. For example, a rear-end collision on the highway during the early afternoon may impact traffic for only 30
minutes. A long-term anomaly is typically a holiday, road closure, or extreme weather -- events that cause a large deviation from the
expected pattern for a sustained period. Our research focused on the long-term anomalies, aiming to automatically process and detect
all holidays and major events that impact a day's traffic profile. Many approaches have been developed to detect traffic anomalies,
each with varying success. A majority treat the count data as a discrete set of measurements. An alternative approach is to model the
volume as a function of time and represent it using a smooth, continuous function. Typical traffic exhibits peaks in volume on weekdays
during the morning and evening rush hours, with dips coming during the midday and nighttime hours. Weekends display different
behavior, with diminished rush-hour peaks. In this work, we utilize Functional Data Analysis (FDA) to smooth traffic counts into continuous functions and to detect long-term traffic anomalies based on single-sensor count data. Functional principal component analysis (FPCA) was used to identify the principal components of traffic variation, and these components were compared against new data to determine whether an anomaly occurred. Three detection methods were contrasted: modified functional bagplots, high-density-region (HDR) boxplots, and Mahalanobis distance. We gathered our data from the California Department of Transportation Performance Measurement System (PeMS), which contains thousands of inductive-loop sensors throughout the state's roads and highways. These sensors have continuously recorded data sampled at 30-second intervals for several years, providing a large source of traffic count
information. Additionally, this dataset allows us to verify that holidays occurred, something that simulated traffic counts cannot do.
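A discretized sketch of this detection pipeline in NumPy (illustrative only: the function names are hypothetical, plain PCA on sampled daily profiles stands in for true FPCA on the smoothed functional representation, and the detection threshold is omitted):

```python
import numpy as np

def fit_profile_model(profiles, n_components=2):
    """profiles: (n_days, n_samples) matrix of daily traffic counts.
    Returns the mean daily profile, the top principal components, and
    the inverse covariance of the training-day scores."""
    mean = profiles.mean(axis=0)
    centered = profiles - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    comps = vt[:n_components]                       # principal directions
    scores = centered @ comps.T                     # per-day PC scores
    cov = np.atleast_2d(np.cov(scores, rowvar=False))
    return mean, comps, np.linalg.inv(cov)

def anomaly_score(day, mean, comps, cov_inv):
    """Squared Mahalanobis distance of a day's PC scores; large values
    flag days that deviate from the learned traffic pattern."""
    s = (day - mean) @ comps.T
    return float(s @ cov_inv @ s)
```

Scoring each new day against the learned components mirrors the Mahalanobis-distance detector contrasted in the paper; the bagplot and HDR-boxplot alternatives replace the distance with depth- and density-based orderings of the same scores.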
Skip the Intersection: Quickly Counting Common Neighbors on Shared-Memory Systems
Xiaojing An, Kasimir Gabert, James Fox, Oded Green, David A. Bader (Georgia Tech)
Counting common neighbors between all vertex pairs in a graph is a fundamental operation, with uses in similarity measures, link
prediction, graph compression, community detection, and more. Current shared-memory approaches either rely on set intersections or
are not readily parallelizable. We introduce a new efficient and parallelizable algorithm to count common neighbors: starting at a wedge
endpoint, we iterate through all wedges in the graph, and increment the common neighbor count for each endpoint pair. This exactly
counts the common neighbors between all pairs without using set intersections, and as such attains an asymptotic improvement in
runtime. Furthermore, our algorithm is simple to implement and only slight modifications are required for existing implementations to use
our results. We provide an OpenMP implementation and evaluate it on real-world and synthetic graphs, demonstrating no loss of
scalability and an asymptotic improvement. We show intersections are neither necessary nor helpful for computing all pairs common
neighbor counts.
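The wedge-iteration scheme reduces to a few lines of serial Python (an illustrative sketch; the paper's contribution is its parallel OpenMP realization and the resulting asymptotic improvement):

```python
from collections import defaultdict
from itertools import combinations

def common_neighbor_counts(adj):
    """adj: dict mapping each vertex to a sorted list of its neighbors.
    For every wedge u - v - w (v is the shared endpoint), increment the
    count for the endpoint pair (u, w). No set intersections are used,
    and each pair's count equals its number of common neighbors."""
    counts = defaultdict(int)
    for v, nbrs in adj.items():
        for u, w in combinations(nbrs, 2):  # u < w since nbrs is sorted
            counts[(u, w)] += 1
    return counts
```

Because every wedge is touched exactly once, total work is proportional to the number of wedges rather than to the cost of intersecting neighbor lists for all candidate pairs.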
Wednesday, September 25, 2019