2019 IEEE High Performance
Extreme Computing Conference
(HPEC '19)
Twenty-third Annual HPEC Conference
24 - 26 September 2019
Westin Hotel, Waltham, MA USA
12:00 - 1:00 in Emerson
Lunch
View Posters & Demos 1
12:00-1:00 in Foyer
Embedded Processor-In-Memory Architecture for Accelerating Arithmetic Operations
Richard Muri, Paul Fortier (UMass Dartmouth)
Abstract—A processor-in-memory (PIM) computer architecture is any design that performs some subset of logical operations in the
same location as memory. The traditional model of computing involves a processor loading data from memory to perform
operations, with a bus connecting the processor and memory. While this technique works well in many situations, a growing gap
between memory performance and processor performance has led some researchers to develop alternative architectures. This
paper details the implementation of a PIM architecture in a soft core microcontroller used to accelerate applications limited by
register file size. Using an Artix-7 FPGA, an ATmega103 microcontroller soft core is modified to include a PIM core as an
accelerator. The sample application of AES encryption provides a comparison between the baseline processor and the PIM
enhanced machine. AES encryption using the modified microcontroller requires 38% fewer clock cycles without relying on
application-specific improvements, at the expense of increased program memory size and FPGA fabric utilization.
FFTX for Micromechanical Stress-Strain Analysis
Anuva Kulkarni (Carnegie Mellon University)*; Daniele Giuseppe Spampinato (Carnegie Mellon University); Franz Franchetti
(Carnegie Mellon University)
Porting scientific simulations to heterogeneous platforms requires complex algorithmic and optimization strategies to overcome
memory and communication bottlenecks. Such operations are inexpressible using traditional libraries (e.g., FFTW for spectral
methods) and difficult to optimize by hand for various hardware platforms. In this work, we use our GPU-adapted stress-strain
analysis method to show how FFTX, a new API that extends FFTW, can be used to express our algorithm without worrying about
code optimization, which is handled by a back-end code generator.
ECG Feature Processing Performance Acceleration on SLURM Compute Systems
Michael Nolan; Kajal Claypool; Mark Hernandez; Philip Fremont-Smith (MIT Lincoln Laboratory)*; Albert Swiston (Merck)
Electrocardiogram (ECG) signal features (e.g. heart rate, intrapeak interval times) are data commonly used in physiological
assessment. Commercial off-the-shelf (COTS) software solutions for ECG data processing are available, but are often developed
for serialized data processing, which scales poorly for large datasets. To address this issue, we've developed a Matlab code library for
parallelized ECG feature generation. This library uses the pMatlab and MatMPI interfaces to distribute computing tasks over
supercomputing clusters using the Simple Linux Utility for Resource Management (SLURM). To profile its performance as a function
of parallelization scale, the ECG processing code was executed on a non-human primate dataset on the Lincoln Laboratory
Supercomputing TXGreen cluster. Feature processing jobs were deployed over a range of processor counts and processor types to
assess the overall reduction in job computation time. We show that individual process times decrease according to a 1/n
relationship with the number of processors used, while total computation times, which account for deployment and data
aggregation, show diminishing returns as processor count grows. A maximum mean reduction in overall file processing time of 99% is
shown.
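The diminishing-returns behavior described in this abstract is essentially Amdahl's law: the parallelizable feature-generation work shrinks as 1/n, while job deployment and data aggregation contribute a fixed serial cost. A minimal Python sketch of that model (the timing constants below are illustrative, not the paper's measurements):

```python
# Illustrative model (not the paper's data): total job time with a fixed
# deployment/aggregation overhead plus work that divides across n processors.
def job_time(work_s, overhead_s, n_procs):
    """Total wall time when per-process work shrinks as 1/n but the
    serial overhead (job deployment, data aggregation) does not."""
    return overhead_s + work_s / n_procs

if __name__ == "__main__":
    work, overhead = 3600.0, 30.0  # hypothetical seconds
    for n in (1, 8, 64, 512):
        t = job_time(work, overhead, n)
        print(f"{n:4d} procs: {t:8.1f} s  (speedup {job_time(work, overhead, 1) / t:5.1f}x)")
```

In this model the speedup can never exceed (overhead + work) / overhead, which is why total computation time flattens out at high processor counts.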
Emerging Applications of 3D Integration and Approximate Computing in High-Performance Computing Systems: Unique
Security Vulnerabilities
Pruthvy Yellu, Zhiming Zhang, Mohammad Mezanur Rahman Monjur, Ranuli Abeysinghe, Qiaoyan Yu (UNH)
High-performance computing (HPC) systems rely on new technologies such as emerging devices, advanced integration techniques,
and computing architecture to continue advancing performance. The adoption of new techniques could potentially leave high-
performance computing systems vulnerable to new security threats. This work analyzes the security challenges in HPC systems
that employ three-dimensional integrated circuits and approximate computing. Case studies are provided to show the impact of
new security threats on the system integrity and highlight the urgent need for new security measures.
Large Scale Organization and Inference of an Imagery Dataset for Public Safety
Jeffrey Liu, David Strohschein, Siddharth Samsi, Andrew Weinert (MIT-LL)
Video applications and analytics are routinely projected as a stressing and significant service of the Nationwide Public Safety
Broadband Network. As part of a NIST PSCR funded effort, the New Jersey Office of Homeland Security and Preparedness and
MIT Lincoln Laboratory have been developing a computer vision dataset of operational and representative public safety scenarios.
The scale and scope of this dataset necessitates a hierarchical organization approach for efficient compute and storage. We
overview architectural considerations using the Lincoln Laboratory Supercomputing Cluster as a test architecture. We then describe
how we intelligently organized the dataset across LLSC and evaluated it with large scale imagery inference across terabytes of
data.
Deep-Learning Inferencing with High-Performance Hardware Accelerators
Luke Kljucaric*; Alan George (NSF SHREC/CHREC)
FPGAs are commonly employed to accelerate applications because of their performance-per-watt advantages over general-purpose
architectures. With the exponential growth of available data, machine-learning apps have generated greater interest as a means to
understand that data more comprehensively and to increase autonomous processing. As FPGAs become more readily
available on cloud services like the Amazon Web Services F1 platform, it is worth studying the performance of accelerating machine-
learning apps on FPGAs against traditional fixed-logic devices, like CPUs and GPUs. FPGA frameworks for accelerating convolutional
neural networks (CNN), which are used in many machine-learning apps, have begun to emerge for accelerated-application
development. This research aims to compare the performance of these forthcoming frameworks on two commonly used CNNs,
GoogLeNet and AlexNet. Specifically, handwritten Chinese character recognition is benchmarked across multiple FPGA frameworks
on Xilinx and Intel FPGAs and compared against multiple CPU and GPU architectures featured on AWS, Google’s Cloud platform,
the University of Pittsburgh’s Center for Research Computing (CRC), and Intel’s vLab Academic Cluster. The NVIDIA GPUs
outperformed every other device in this study. The Zebra framework available for Xilinx FPGAs showed an average 8.3-times
performance and 9.3-times efficiency improvement over the OpenVINO framework available for Intel FPGAs. Although the Zebra
framework on the Xilinx VU9P showed greater efficiency than the Pascal-based GPUs, the NVIDIA Tesla V100 proved to be the most
efficient device at 125.9 and 47.2 images-per-second-per-Watt for AlexNet and GoogLeNet, respectively. Although they currently lag
behind GPUs, FPGA frameworks and devices have the potential to compete in terms of performance and efficiency.
Projecting Quantum Computational Advantage versus Classical State-of-the-Art
Jason Larkin (Carnegie Mellon University Software Engineering Institute)*; Daniel Justice (CMU SEI)
A major milestone in quantum computing research is to demonstrate quantum supremacy, where a quantum computer performs
some computation that is infeasible classically.
Resilience-Aware Decomposition and Monitoring of Large-Scale Embedded Systems
Miguel Mark*; Michel Kinsy (Boston University); Haley Whitman; David Whelihan; Michael Vai (MIT Lincoln Laboratory)
With the inherent complexity of large scale embedded systems and the lack of proper design tools, it is difficult for system engineers
to verify that functional specifications adhere to design requirements. Applying formal verification to such large scale embedded
systems is challenging due to the expertise required in formal methods. It then becomes a daunting task to achieve mission
assurance for embedded systems deployed in hostile environments. In this work, we introduce a monitoring-based approach and
develop a new tool, called Formal Resilience Decomposition and Monitoring (FOREDEM), to assist system engineers in improving
the mission assurance of their designs. FOREDEM implements a workflow allowing engineers to assess the overall resilience of a
design and understand the associated costs through trade-off analysis.
Road Traffic Anomaly Detection using Functional Data Analysis
George Tsitsopoulos (Northeastern University)*
Streets and highways provide a ubiquitous data source, vehicle traffic volume, that can be exploited to gain insight into what is
happening on roadways. Traffic patterns generally fluctuate in a consistent manner throughout the week, making them relatively
predictable. However, holidays and unforeseeable anomalies such as accidents can cause significant deviations from the norm.
Detecting these irregularities can be a difficult task due to the general noisiness of the count data. Nonetheless, knowledge of these
traffic anomalies is important to many parties, making it a critical problem to solve. Awareness of an anomaly can ensure a timely
arrival to work or alert agencies when something unusual is occurring in an area of interest. Although traffic volume data is readily
available, it is not exploited to the extent we believe it should be when it comes to detecting anomalies. We can divide traffic
anomalies into two categories: short-term anomalies and long-term anomalies. A short-term anomaly is generally an accident that
causes a change in traffic pattern for a few hours or less. For example, a rear-end collision on the highway during the early
afternoon may impact traffic for only 30 minutes. A long-term anomaly is typically a holiday, road closure, or extreme weather --
events that cause a large deviation from the expected pattern for a sustained period. Our research focused on the long-term
anomalies, aiming to automatically process and detect all holidays and major events that impact a day's traffic profile. Many
approaches have been developed to detect traffic anomalies, each with varying success. A majority treat the count data as a
discrete set of measurements. An alternative approach is to model the volume as a function of time and represent it using a smooth,
continuous function. Typical traffic exhibits peaks in volume on weekdays during the morning and evening rush hours, with dips
coming during the midday and nighttime hours. Weekends display different behavior, with diminished rush hour peaks. In this work,
we utilize Functional Data Analysis (FDA) to smooth traffic counts into continuous functions and to detect long-term traffic anomalies
based on single-sensor count data. Functional principal component analysis (FPCA) was used to identify the principal components
of traffic variation, and these components were compared to new data to determine whether an anomaly occurred. Three
detection methods were contrasted: modified functional bagplots, high-density region (HDR) boxplots, and Mahalanobis distance.
We gathered our data from the California Department of Transportation Performance Measurement System (PeMS), which contains
thousands of inductive-loop sensors throughout the state's roads and highways. These sensors have continuously recorded data
sampled in 30 second intervals for several years, providing us a large source of traffic count information. Additionally, this dataset
allows us to verify that holidays occurred, something that simulated traffic counts cannot do.
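As a rough illustration of the detection pipeline this abstract describes, the sketch below runs ordinary PCA on discretized daily traffic profiles as a simplified stand-in for FPCA on smoothed curves, then flags days whose component scores lie at a large Mahalanobis distance from the bulk. All data here is synthetic, and the function names and threshold are illustrative, not the authors' code:

```python
import numpy as np

def pca_scores(curves, k=2):
    """PCA on a day-by-time matrix of traffic profiles -- a simplified
    stand-in for FPCA on smoothed functional data."""
    mean = curves.mean(axis=0)
    centered = curves - mean
    # principal directions via SVD of the centered matrix
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T, mean, vt[:k]

def mahalanobis_flags(scores, threshold=3.5):
    """Flag days whose component scores are far from the bulk."""
    cov = np.cov(scores, rowvar=False)
    inv = np.linalg.inv(cov)
    centered = scores - scores.mean(axis=0)
    d = np.sqrt(np.einsum("ij,jk,ik->i", centered, inv, centered))
    return d > threshold

rng = np.random.default_rng(0)
t = np.linspace(0, 24, 96)                              # 15-minute bins
base = 200 + 120 * np.exp(-(t - 8) ** 2) + 150 * np.exp(-(t - 17) ** 2)
days = base + rng.normal(0, 5, size=(60, t.size))       # ordinary weekdays
days[13] *= 0.4                                          # a holiday-like low-volume day
scores, _, _ = pca_scores(days)
flags = mahalanobis_flags(scores)
print("flagged days:", np.flatnonzero(flags))
```

A real FDA pipeline would first smooth the counts with a basis expansion (e.g. B-splines) before extracting components; the PCA-on-bins shortcut above keeps the example short.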
Overcoming Limitations of GPGPU-Computing in Scientific Applications
Gaurav Khanna*; Connor Kenyon; Glenn Volkema (UMass Dartmouth)
The performance of discrete general purpose graphics processing units (GPGPUs) has been improving at a rapid pace. The PCIe
interconnect that controls the communication of data between the system host memory and the GPU has not improved as quickly,
leaving a gap in performance due to GPU downtime while waiting for PCIe data transfer. In this article, we explore two alternatives
to the limited PCIe bandwidth: the NVIDIA NVLink interconnect and zero-copy algorithms for shared-memory Heterogeneous System
Architecture (HSA) devices. The OpenCL SHOC benchmark suite is used to measure the performance of each device on various
scientific application kernels.
Optimizing the Visualization Pipeline of a 3D Monitoring and Management System
Rebecca Wild (Johns Hopkins APL), Matthew Hubbell, Jeremy Kepner (MIT-LL)
Monitoring and managing High Performance Computing (HPC) systems and environments generates an ever-growing amount of
data. Making sense of this data, and providing a platform where it can be visualized so that system administrators and management
can proactively identify system failures or understand the state of the system, requires the platform to be as efficient and
scalable as the underlying database tools used to store and analyze the data. In this paper we show how we leverage
Accumulo, D4M, and Unity to build a 3D visualization platform that monitors and manages the Lincoln Laboratory supercomputing
systems, and how we have had to retool our approach to scale with our systems.
Skip the Intersection: Quickly Counting Common Neighbors on Shared-Memory Systems
Xiaojing An; Kasimir Gabert*; James Fox (Georgia Institute of Technology); Oded Green (NVIDIA); David Bader (Georgia Institute of
Technology)
Counting common neighbors between all vertex pairs in a graph is a fundamental operation, with uses in similarity measures, link
prediction, graph compression, community detection, and more. Current shared-memory approaches either rely on set intersections
or are not readily parallelizable. We introduce a new efficient and parallelizable algorithm to count common neighbors: starting at a
wedge endpoint, we iterate through all wedges in the graph, and increment the common neighbor count for each endpoint pair. This
exactly counts the common neighbors between all pairs without using set intersections, and as such attains an asymptotic
improvement in runtime. Furthermore, our algorithm is simple to implement and only slight modifications are required for existing
implementations to use our results. We provide an OpenMP implementation and evaluate it on real-world and synthetic graphs,
demonstrating no loss of scalability and an asymptotic improvement. We show intersections are neither necessary nor helpful for
computing all pairs common neighbor counts.
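The wedge-iteration idea is easy to see in a serial sketch: every wedge u–c–v witnesses one common neighbor (c) for the pair (u, v), so iterating over wedges enumerates all common-neighbor counts without a single set intersection. The Python version below is a minimal serial illustration, not the authors' parallel OpenMP implementation:

```python
from collections import defaultdict
from itertools import combinations

def common_neighbors_by_wedges(adj):
    """Count common neighbors for every vertex pair by iterating wedges:
    each center vertex c with neighbors u and v contributes one wedge
    u-c-v, i.e. one common neighbor (c) for the pair (u, v)."""
    counts = defaultdict(int)
    for center, nbrs in adj.items():
        for u, v in combinations(sorted(nbrs), 2):
            counts[(u, v)] += 1          # no set intersection needed
    return dict(counts)

# A 4-cycle 0-1-2-3-0 with chord 0-2.
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
print(common_neighbors_by_wedges(adj))
```

Loosely speaking, the work done is proportional to the number of wedges in the graph rather than to repeated neighbor-list intersections, which is where the asymptotic improvement the abstract claims comes from.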
[Graph Challenge Finalist] Fast BFS-Based Triangle Counting on GPUs
Leyuan Wang*; John D Owens (University of California, Davis)
In this paper, we propose a novel method to compute triangle counting on GPUs. Unlike previous formulations of graph matching,
our approach is BFS-based by traversing the graph in an all-source-BFS manner and thus can be mapped onto GPUs in a
massively parallel fashion. Our implementation uses the Gunrock programming model, and we evaluate our implementation's
runtime and memory consumption against previous state-of-the-art work. We sustain a peak traversed-edges-per-second
(TEPS) rate of nearly 10 GTEPS. Our algorithm is the most scalable and parallel among all existing GPU implementations and
also outperforms all existing CPU distributed implementations. This work specifically focuses on leveraging our implementation on
the triangle counting problem for the Subgraph Isomorphism Graph Challenge 2019, demonstrating a geometric mean speedup
over the 2018 champion of 3.84×.
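A serial CPU sketch of the traversal idea (not the authors' Gunrock implementation): from every source u, walk two BFS levels u → v → w and count a triangle whenever the edge w–u closes the path. This assumes an undirected graph stored as neighbor sets with no self-loops:

```python
def triangles_two_hop(adj):
    """Count triangles by a two-level traversal from every source vertex.
    Each triangle is discovered 6 times (3 starting vertices x 2
    traversal directions), hence the final division."""
    closed = 0
    for u in adj:
        for v in adj[u]:
            for w in adj[v]:
                if w != u and u in adj[w]:
                    closed += 1
    return closed // 6

# A small graph with exactly one triangle (0-1-2) and a pendant vertex 3.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}}
print(triangles_two_hop(adj))
```

The GPU formulation maps each frontier expansion to massively parallel threads; the nested loops here are only meant to make the all-source two-hop structure visible.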
[Graph Challenge Finalist] Performance of Training Sparse Deep Neural Networks on GPUs
Jianzong Wang (Ping An Technology (Shenzhen) Co., Ltd)*; Zhangcheng Huang (Ping An Technology (Shenzhen) Co., Ltd);
Lingwei Kong (Ping An Technology (Shenzhen) Co., Ltd); Jing Xiao (Ping An Insurance (Group) Company of China); Pengyu Wang
(Shanghai Jiao Tong University); Lu Zhang (Shanghai Jiao Tong University); Chao Li (Shanghai Jiao Tong University)
Deep neural networks have revolutionized the field of machine learning by dramatically improving the state-of-the-art in various
domains. The sizes of deep neural networks (DNNs) are rapidly outgrowing the capacity of hardware to store and train them quickly.
Over the past few decades, researchers have explored the prospect of sparsifying DNNs before, during, and after training by pruning
edges from the underlying topology; the resulting network is known as a sparse neural network.
More recent works have demonstrated the remarkable result that certain sparse DNNs can train to the same precision as dense
DNNs at lower runtime and storage cost. Such methods ease the high demand for computation resources that severely hinders
the deployment of large-scale DNNs on resource-constrained devices, and allow DNNs to be trained at a faster speed and
lower cost. In this work, we propose a Fine-tune Structured Sparsity Learning (FSSL) method to regularize the structures of DNNs
and accelerate their training. FSSL can: (1) learn a compact structure from a large sparse DNN to reduce computation cost;
and (2) obtain a hardware-friendly structure to accelerate DNN evaluation efficiently. Experimental results on training time and
compression rate show superior performance and efficiency compared with the Matlab example code, with speedups about twice
those of non-structured sparsity.
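The difference between structured and non-structured sparsity can be illustrated with a small sketch: rather than zeroing individual weights, structured pruning removes whole rows (e.g. neurons or filters) whose L2 norm is small, leaving a dense sub-matrix that hardware can process efficiently. This is a generic illustration of structured pruning, not the FSSL method itself:

```python
import math

def prune_rows_by_norm(w, keep_ratio=0.5):
    """Structured-sparsity sketch: drop whole rows (e.g. output neurons)
    with the smallest L2 norms, leaving a dense sub-matrix that maps
    well to hardware, instead of scattered element-wise zeros."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in w]
    k = max(1, round(keep_ratio * len(w)))
    # keep the k strongest rows, preserving their original order
    keep = sorted(sorted(range(len(w)), key=lambda i: norms[i])[-k:])
    return [w[i] for i in keep]

weights = [[1.0, 0.0], [0.001, 0.002], [0.5, 0.5], [0.9, -0.2]]
print(prune_rows_by_norm(weights, keep_ratio=0.75))
```

Because the surviving weights stay contiguous, the pruned layer is an ordinary dense matrix multiply at a smaller size, whereas element-wise (non-structured) sparsity needs indexed gather operations to realize any speedup.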
[Graph Challenge Honorable Mention] Fast Triangle Counting on GPU
Chuangyi Gui (Huazhong University of Science and Technology); Long Zheng (Huazhong University of Science and Technology)*;
Pengcheng Yao (Huazhong University of Science and Technology); Xiaofei Liao (HUST); Hai Jin (Huazhong University of Science
and Technology)
Triangle counting is one of the most basic graph applications, used to solve many real-world problems in a wide variety of domains.
Exploiting the massive parallelism of the Graphics Processing Unit (GPU) to accelerate triangle counting is prevalent. We identify
that the state-of-the-art GPU-based studies, which focus on improving load balancing, still inherently exhibit a large number of
random accesses that degrade performance. In this paper, we design a prefetching scheme that buffers the neighbor list of the
processed vertex in advance in fast shared memory to avoid the high latency of random global memory accesses. We also adopt a
degree-based graph reordering technique and design a simple heuristic to distribute the workload evenly. Compared to the state-of-
the-art HPEC Graph Challenge champion from last year, we improve the performance of triangle counting by up to
5.9x, with > 10^9 TEPS on a single GPU, for many large real graphs from the Graph Challenge datasets.
Wednesday, September 25, 2019