2019 IEEE High Performance Extreme Computing Conference (HPEC '19)
Twenty-third Annual HPEC Conference
24-26 September 2019
Westin Hotel, Waltham, MA USA
Lunch, View Posters & Demos 1
12:00-1:00 in Foyer

Deep-Learning Inferencing with High-Performance Hardware Accelerators
Luke Kljucaric, Alan D. George (SHREC @ Pitt)
FPGAs are commonly employed to accelerate applications in order to improve performance-per-watt over general-purpose architectures. With the exponential growth of available data, machine-learning apps have generated greater interest, both to understand that data more comprehensively and to increase autonomous processing. As FPGAs become more readily available through cloud services like the Amazon Web Services F1 platform, it is worth studying how well FPGAs accelerate machine-learning apps relative to traditional fixed-logic devices such as CPUs and GPUs. FPGA frameworks for accelerating convolutional neural networks (CNNs), which are used in many machine-learning apps, have begun to emerge for accelerated-application development. This research compares the performance of these forthcoming frameworks on two commonly used CNNs, GoogLeNet and AlexNet. Specifically, handwritten Chinese character recognition is benchmarked across multiple FPGA frameworks on Xilinx and Intel FPGAs and compared against multiple CPU and GPU architectures featured on AWS, Google Cloud Platform, the University of Pittsburgh's Center for Research Computing (CRC), and Intel's vLab Academic Cluster. The NVIDIA GPUs achieved the best raw performance of every other device in this study. The Zebra framework for Xilinx FPGAs showed average improvements of 8.3x in performance and 9.3x in efficiency over the OpenVINO framework for Intel FPGAs. Although the Zebra framework on the Xilinx VU9P showed greater efficiency than the Pascal-based GPUs, the NVIDIA Tesla V100 proved to be the most efficient device, at 125.9 and 47.2 images-per-second-per-Watt for AlexNet and GoogLeNet, respectively. Although they currently trail GPUs, FPGA frameworks and devices have the potential to compete in both performance and efficiency.

ECG Feature Processing Performance Acceleration on SLURM Compute Systems
Michael Nolan, Mark Hernandez, Philip Fremont-Smith, Albert Swiston, Kajal Claypool (MIT-LL)
Electrocardiogram (ECG) signal features (e.g., heart rate, interpeak interval times) are commonly used in physiological assessment. Commercial off-the-shelf (COTS) software for ECG data processing is available, but it is often built for serialized processing, which scales poorly to large datasets. To address this issue, we have developed a Matlab code library for parallelized ECG feature generation. The library uses the pMatlab and MatMPI interfaces to distribute computing tasks over supercomputing clusters managed by the Simple Linux Utility for Resource Management (SLURM). To profile its performance as a function of parallelization scale, the ECG processing code was executed on a non-human primate dataset on the Lincoln Laboratory Supercomputing TX-Green cluster. Feature processing jobs were deployed over a range of processor counts and processor types to assess the overall reduction in job computation time. We show that individual process times decrease as 1/n with the number of processors, while total computation time, which also accounts for deployment and data aggregation, shows diminishing returns as processor count grows. A maximum mean reduction in overall file processing time of 99% is shown.
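To make the reported scaling concrete, here is a minimal Python sketch (ours, not the authors' pMatlab library) of the time model described above; the constants are illustrative assumptions standing in for measured deployment, per-file, and aggregation costs:

    # Per-file work parallelizes as 1/n, while deployment and data-aggregation
    # overheads do not, so total time sees diminishing returns with n.
    # All constants are illustrative assumptions, not values from the paper.
    def total_time(n_procs, serial_work=3600.0, t_deploy=60.0, t_aggregate=120.0):
        """Estimated wall time (seconds) for a feature run on n_procs workers."""
        return t_deploy + serial_work / n_procs + t_aggregate

    for n in (1, 16, 64, 256):
        t = total_time(n)
        print(f"{n:4d} procs: {t:8.1f} s (speedup {total_time(1) / t:5.2f}x)")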
Embedded Processor-In-Memory Architecture for Accelerating Arithmetic Operations
Richard Muri, Paul Fortier (UMass Dartmouth)
A processor-in-memory (PIM) computer architecture is any design that performs some subset of logical operations in the same location as memory. The traditional model of computing involves a processor loading data from memory over a connecting bus in order to perform operations. While this technique works well in many situations, a growing gap between memory performance and processor performance has led some researchers to develop alternative architectures. This paper details the implementation of a PIM architecture in a soft-core microcontroller, used to accelerate applications limited by register file size. Using an Artix-7 FPGA, an ATmega103 microcontroller soft core is modified to include a PIM core as an accelerator. The sample application of AES encryption provides a comparison between the baseline processor and the PIM-enhanced machine. AES encryption using the modified microcontroller requires 38% fewer clock cycles without relying on application-specific improvements, at the expense of increased program memory size and FPGA fabric utilization.

[Graph Challenge Honorable Mention] Fast Triangle Counting on GPU
Chuangyi Gui, Long Zheng, Pengcheng Yao, Xiaofei Liao, Hai Jin (Huazhong Univ. of Science and Technology)
Triangle counting is one of the most basic graph operations, used to solve real-world problems in a wide variety of domains. Exploiting the massive parallelism of the Graphics Processing Unit (GPU) to accelerate triangle counting has become prevalent. We identify that state-of-the-art GPU-based studies, which focus on improving load balancing, still exhibit a large number of random accesses that degrade performance. In this paper, we design a prefetching scheme that buffers the neighbor list of the vertex being processed in fast shared memory, avoiding the high latency of random global-memory accesses. We also adopt a degree-based graph-reordering technique and design a simple heuristic to distribute the workload evenly. Compared to last year's HPEC Graph Challenge champion, we improve triangle-counting performance by up to 5.9x, exceeding 10^9 TEPS on a single GPU for many large real graphs from the Graph Challenge datasets.

FFTX for Micromechanical Stress-Strain Analysis
Anuva Kulkarni, Daniele G. Spampinato, Franz Franchetti (CMU)
Porting scientific simulations to heterogeneous platforms requires complex algorithmic and optimization strategies to overcome memory and communication bottlenecks. Such operations are inexpressible using traditional libraries (e.g., FFTW for spectral methods) and difficult to optimize by hand for various hardware platforms. In this work, we use our GPU-adapted stress-strain analysis method to show how FFTX, a new API that extends FFTW, can be used to express our algorithm without worrying about code optimization, which is handled by a back-end code generator.
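The FFTX plan-specification API itself is not reproduced here; as a rough stand-in, the following NumPy sketch shows the FFT-centered pattern such spectral stress-strain solvers are built around: forward transform, pointwise frequency-space operator, inverse transform. The kernel below is a placeholder assumption, not the authors' micromechanical operator:

    import numpy as np

    def spectral_apply(field, kernel_hat):
        """Apply a frequency-space operator to a 3D field via forward/inverse FFTs."""
        field_hat = np.fft.rfftn(field)                               # forward transform
        return np.fft.irfftn(field_hat * kernel_hat, s=field.shape)  # apply and invert

    n = 64
    strain = np.random.rand(n, n, n)
    kernel_hat = np.ones((n, n, n // 2 + 1))     # placeholder (identity) operator
    stress = spectral_apply(strain, kernel_hat)  # round-trips the field unchanged here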
Large Scale Organization and Inference of an Imagery Dataset for Public Safety
Jeffrey Liu, David Strohschein, Siddharth Samsi, Andrew Weinert (MIT-LL)
Video applications and analytics are routinely projected as a stressing and significant service of the Nationwide Public Safety Broadband Network. As part of a NIST PSCR-funded effort, the New Jersey Office of Homeland Security and Preparedness and MIT Lincoln Laboratory have been developing a computer vision dataset of operational and representative public safety scenarios. The scale and scope of this dataset necessitate a hierarchical organization approach for efficient compute and storage. We overview architectural considerations using the Lincoln Laboratory Supercomputing Cluster as a test architecture. We then describe how we intelligently organized the dataset across LLSC and evaluated it with large-scale imagery inference across terabytes of data.

Optimizing the Visualization Pipeline of a 3D Monitoring and Management System
Rebecca Wild, Matthew Hubbell, Jeremy Kepner (MIT-LL)
Monitoring and managing High Performance Computing (HPC) systems and environments generates an ever-growing amount of data. Making sense of this data, and providing a platform where it can be visualized so that system administrators and management can proactively identify system failures or understand the state of the system, requires the platform to be as efficient and scalable as the underlying database tools used to store and analyze the data. In this paper we show how we leverage Accumulo, D4M, and Unity to build a 3D visualization platform for monitoring and managing the Lincoln Laboratory Supercomputing systems, and how we have had to retool our approach to scale with our systems.

Overcoming Limitations of GPGPU-Computing in Scientific Applications
Connor Kenyon, Glenn Volkema, Gaurav Khanna (UMass Dartmouth)
The performance of discrete general-purpose graphics processing units (GPGPUs) has been improving at a rapid pace. The PCIe interconnect that carries data between system host memory and the GPU has not improved as quickly, leaving a performance gap as the GPU idles while waiting for PCIe data transfers. In this article, we explore two alternatives to the limited PCIe bandwidth: the NVIDIA NVLink interconnect, and zero-copy algorithms for shared-memory Heterogeneous System Architecture (HSA) devices. The OpenCL SHOC benchmark suite is used to measure the performance of each device on various scientific application kernels.
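A minimal illustration of the zero-copy idea (our sketch, not the paper's SHOC-based benchmarks), using pyopencl; exact mapping behavior varies by OpenCL runtime, so treat this as a sketch under those assumptions:

    import numpy as np
    import pyopencl as cl

    # Copy-based vs. zero-copy buffer setup in OpenCL. On shared-memory
    # (HSA-style) devices, ALLOC_HOST_PTR memory can be read by the GPU in
    # place; on discrete GPUs the runtime may still stage it over PCIe.
    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    mf = cl.mem_flags

    host = np.random.rand(1 << 20).astype(np.float32)

    # Conventional path: explicit host-to-device copy over the interconnect.
    copied = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=host)

    # Zero-copy path: allocate host-visible memory, map it, fill it in place.
    zcopy = cl.Buffer(ctx, mf.READ_WRITE | mf.ALLOC_HOST_PTR, size=host.nbytes)
    mapped, _ = cl.enqueue_map_buffer(queue, zcopy, cl.map_flags.WRITE,
                                      0, host.shape, host.dtype)
    mapped[:] = host
    mapped.base.release(queue)   # unmap before kernels consume the buffer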
[Graph Challenge Finalist] Performance of Training Sparse Deep Neural Networks on GPUs
Jianzong Wang, Zhangcheng Huang, Lingwei Kong, Jing Xiao (Ping An Technology (Shenzhen)); Pengyu Wang, Lu Zhang, Chao Li (Shanghai Jiao Tong University)
Deep neural networks have revolutionized the field of machine learning by dramatically improving the state-of-the-art in various domains. The sizes of deep neural networks (DNNs) are rapidly outgrowing the capacity of hardware to store and train them quickly. Over the past few decades, researchers have explored the prospect of sparsifying DNNs before, during, and after training by pruning edges from the underlying topology; the resulting network is known as a sparse neural network. More recent works have demonstrated the remarkable result that certain sparse DNNs can be trained to the same precision as dense DNNs at lower runtime and storage cost. Existing methods ease the high demand for computational resources that hinders the deployment of large-scale DNNs on resource-constrained devices, but DNNs could still be trained faster and at lower cost. In this work, we propose a Fine-tune Structured Sparsity Learning (FSSL) method to regularize the structures of DNNs and accelerate their training. FSSL can: (1) learn a compact structure from a large sparse DNN to reduce computation cost; and (2) obtain a hardware-friendly structure to accelerate DNN evaluation. Experimental results on training time and compression rate show superior performance and efficiency compared to the Matlab example code, with speedups roughly twice those of non-structured sparsity.
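The abstract does not give the FSSL formulation itself, so as a generic stand-in, the sketch below shows the group-lasso style penalty that structured-sparsity-learning methods typically build on: zeroing whole groups (e.g., filters or neurons) rather than scattered individual weights, which is what makes the result hardware-friendly:

    import numpy as np

    def group_lasso_penalty(weights, axis=0):
        """Sum of L2 norms over weight groups (here, columns are the groups).

        Driving entire groups to zero yields structured sparsity that
        hardware can exploit, unlike elementwise L1's scattered zeros."""
        return np.sqrt((weights ** 2).sum(axis=axis)).sum()

    W = np.random.randn(128, 64)
    W[:, :32] = 0.0                        # half the columns pruned as whole groups
    print(group_lasso_penalty(W, axis=0))  # only surviving groups contribute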
Projecting Quantum Computational Advantage versus Classical State-of-the-Art
Jason Larkin, Daniel Justice (CMU SEI)
A major milestone in quantum computing research is to demonstrate quantum supremacy, where some computation is performed by a quantum computer that is infeasible classically [1]. But while quantum supremacy may be demonstrable in the near-term noisy intermediate-scale quantum (NISQ) era [2], [3], such a demonstration does not bring advantage for practical applications. A common practical application used in benchmarking classical and quantum computing is Maxcut, with applications in domains such as machine scheduling [4], image recognition [5], electronic circuit layout [6], and software verification/validation [7], [8]. For practical applications, quantum computational advantage has to be measured against the best performance that classical computing offers. For Maxcut, we project the performance of the hybrid quantum-classical Quantum Approximate Optimization Algorithm (QAOA) [9] and compare it with the best-performing classical solvers in terms of time to solution and solution quality [10]-[12]. Performance data is discussed in the context of projecting NISQ-era quantum advantage. The results demonstrate the importance of algorithm and problem selection and, in particular, the challenges that quantum computing algorithms and hardware face from classical computing.

[1] John Preskill. Quantum computing in the NISQ era and beyond. Quantum, 2:79, August 2018.
[2] Sergio Boixo, Sergei V. Isakov, Vadim N. Smelyanskiy, Ryan Babbush, Nan Ding, Zhang Jiang, Michael J. Bremner, John M. Martinis, and Hartmut Neven. Characterizing quantum supremacy in near-term devices. Nature Physics, 14(6):595-600, June 2018.
[3] C. Neill, P. Roushan, K. Kechedzhi, S. Boixo, S. V. Isakov, V. Smelyanskiy, A. Megrant, B. Chiaro, A. Dunsworth, K. Arya, R. Barends, B. Burkett, Y. Chen, Z. Chen, A. Fowler, B. Foxen, M. Giustina, R. Graff, E. Jeffrey, T. Huang, J. Kelly, P. Klimov, E. Lucero, J. Mutus, M. Neeley, C. Quintana, D. Sank, A. Vainsencher, J. Wenner, T. C. White, H. Neven, and J. M. Martinis. A blueprint for demonstrating quantum supremacy with superconducting qubits. Science, 360(6385):195-199, April 2018.
[4] Bahram Alidaee, Gary A. Kochenberger, and Ahmad Ahmadian. 0-1 quadratic programming approach for optimum solutions of two scheduling problems. International Journal of Systems Science, 25(2):401-408, February 1994.
[5] Hartmut Neven, Geordie Rose, and William G. Macready. Image recognition with an adiabatic quantum computer I. Mapping to quadratic unconstrained binary optimization. arXiv:0804.4457 [quant-ph], April 2008.
[6] Michel Deza and Monique Laurent. Applications of cut polyhedra — II. Journal of Computational and Applied Mathematics, 55(2):217-247, November 1994.
[7] Manu Jose and Rupak Majumdar. Cause clue clauses: Error localization using maximum satisfiability. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '11), pages 437-446, New York, NY, USA, 2011. ACM.
[8] L. Guo, A. S. Vincentelli, and A. Pinto. A complexity metric for concurrent finite state machine based embedded software. In 2013 8th IEEE International Symposium on Industrial Embedded Systems (SIES), pages 189-195, June 2013.
[9] Edward Farhi, Jeffrey Goldstone, and Sam Gutmann. A quantum approximate optimization algorithm applied to a bounded occurrence constraint problem. arXiv:1412.6062 [quant-ph], December 2014.
[10] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42(6):1115-1145, November 1995.
[11] Adrian Kügel. Improved exact solver for the weighted Max-SAT problem.
[12] Blaine Keetch and Yves van Gennip. A Max-Cut approximation using a graph based MBO scheme. arXiv:1711.02419 [math], November 2017.
[13] Gavin E. Crooks. Performance of the quantum approximate optimization algorithm on the maximum cut problem. arXiv:1811.08419 [quant-ph], November 2018.
[14] Joran van Apeldoorn and András Gilyén. Improvements in quantum SDP-solving with applications. arXiv:1804.05058 [quant-ph], April 2018.
[15] Fernando G. S. L. Brandão, Amir Kalev, Tongyang Li, Cedric Yen-Yu Lin, Krysta M. Svore, and Xiaodi Wu. Quantum SDP solvers: Large speed-ups, optimality, and applications to quantum learning. arXiv:1710.02581 [quant-ph], October 2017.
[16] Nai-Hui Chia, Tongyang Li, Han-Hsuan Lin, and Chunhao Wang. Quantum-inspired classical sublinear-time algorithm for solving low-rank semidefinite programming via sampling approaches. arXiv:1901.03254 [quant-ph], January 2019.

Resilience-Aware Decomposition and Monitoring of Large-Scale Embedded Systems
Miguel Mark (Boston Univ.), David Whelihan, Michael Vai, Haley Whitman (MIT-LL), Michel Kinsy (Boston Univ.)
With the inherent complexity of large-scale embedded systems and the lack of proper design tools, it is difficult for system engineers to verify that functional specifications adhere to design requirements. Applying formal verification to such large-scale embedded systems is challenging because of the expertise required in formal methods. It then becomes a daunting task to achieve mission assurance for embedded systems deployed in hostile environments. In this work, we introduce a monitoring-based approach and develop a new tool, called Formal Resilience Decomposition and Monitoring (FOREDEM), to assist system engineers in improving the mission assurance of their designs. FOREDEM implements a workflow that allows engineers to assess the overall resilience of a design and understand the associated costs through trade-off analysis.
Road Traffic Anomaly Detection using Functional Data Analysis
George Tsitsopoulos (Northeastern); Eric Truslow, Dimitris Manolakis (MIT-LL)
Streets and highways provide a ubiquitous data source, vehicle traffic volume, that can be exploited to gain insight into what is happening on roadways. Traffic patterns generally fluctuate in a consistent manner throughout the week, making them relatively predictable. However, holidays and unforeseeable anomalies such as accidents can cause significant deviations from the norm. Detecting these irregularities can be difficult due to the general noisiness of the count data. Nonetheless, knowledge of these traffic anomalies is important to many parties, making detection a critical problem to solve. Awareness of an anomaly can ensure a timely arrival to work or alert agencies when something unusual is occurring in an area of interest. Although traffic volume data is readily available, it is not exploited for anomaly detection to the extent we believe it should be.

Traffic anomalies can be divided into two categories: short-term and long-term. A short-term anomaly is generally an accident that changes the traffic pattern for a few hours or less; for example, a rear-end collision on the highway during the early afternoon may impact traffic for only 30 minutes. A long-term anomaly is typically a holiday, road closure, or extreme weather: events that cause a large deviation from the expected pattern for a sustained period. Our research focused on long-term anomalies, aiming to automatically process and detect all holidays and major events that impact a day's traffic profile.

Many approaches have been developed to detect traffic anomalies, with varying success. Most treat the count data as a discrete set of measurements. An alternative is to model volume as a function of time and represent it with a smooth, continuous function. Typical traffic exhibits peaks in volume on weekdays during the morning and evening rush hours, with dips during midday and nighttime. Weekends display different behavior, with diminished rush-hour peaks. In this work, we use Functional Data Analysis (FDA) to smooth traffic counts into continuous functions and to detect long-term anomalies from single-sensor count data. Functional principal component analysis (FPCA) was used to identify the principal components of traffic variation, and these components were compared against new data to determine whether an anomaly occurred. Three detection methods were compared: modified functional bagplots, high-density region (HDR) boxplots, and Mahalanobis distance. We gathered our data from the California Department of Transportation Performance Measurement System (PeMS), which comprises thousands of inductive-loop sensors throughout the state's roads and highways. These sensors have continuously recorded data at 30-second intervals for several years, providing a large source of traffic count information. The dataset also allows us to verify that holidays occurred, something simulated traffic counts cannot do.
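As a compact stand-in for the full pipeline (our sketch, with synthetic data: plain PCA on daily profiles substitutes for basis-expansion FPCA, and only the Mahalanobis-distance detector of the three is shown):

    import numpy as np

    rng = np.random.default_rng(0)
    days, samples = 200, 288                    # 5-minute bins over one day
    t = np.linspace(0.0, 24.0, samples)
    typical = np.exp(-(t - 8.0) ** 2) + np.exp(-(t - 17.0) ** 2)  # rush-hour peaks
    X = typical + 0.05 * rng.standard_normal((days, samples))
    X[42] *= 0.3                                # inject a holiday-like anomaly

    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    scores = (X - mean) @ Vt[:3].T              # leading "functional" components
    cov = np.cov(scores, rowvar=False)
    d2 = np.einsum("ij,jk,ik->i", scores, np.linalg.inv(cov), scores)
    print(np.argsort(d2)[-3:])                  # day 42 should rank near the top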
Skip the Intersection: Quickly Counting Common Neighbors on Shared-Memory Systems
Xiaojing An, Kasimir Gabert, James Fox, Oded Green, David A. Bader (Georgia Tech)
Counting common neighbors between all vertex pairs in a graph is a fundamental operation, with uses in similarity measures, link prediction, graph compression, community detection, and more. Current shared-memory approaches either rely on set intersections or are not readily parallelizable. We introduce a new efficient and parallelizable algorithm for counting common neighbors: starting at a wedge endpoint, we iterate through all wedges in the graph and increment the common-neighbor count for each endpoint pair. This exactly counts the common neighbors between all pairs without using set intersections, and thereby attains an asymptotic improvement in runtime. Furthermore, our algorithm is simple to implement, and existing implementations need only slight modifications to use our results. We provide an OpenMP implementation and evaluate it on real-world and synthetic graphs, demonstrating no loss of scalability and an asymptotic improvement. We show that intersections are neither necessary nor helpful for computing all-pairs common-neighbor counts.
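The wedge-iteration idea is simple enough to state in a few lines. A serial Python sketch (ours; the paper's parallel OpenMP implementation and its optimizations are not reproduced): for each vertex c, every pair (u, v) of c's neighbors forms a wedge u-c-v, so incrementing a counter per wedge tallies common neighbors with no set intersections at all:

    from collections import defaultdict

    def common_neighbor_counts(adj):
        """adj maps vertex -> sorted list of neighbors (undirected graph)."""
        counts = defaultdict(int)
        for center, nbrs in adj.items():        # center = wedge midpoint
            for i in range(len(nbrs)):
                for j in range(i + 1, len(nbrs)):
                    counts[(nbrs[i], nbrs[j])] += 1   # one wedge per endpoint pair
        return counts

    # Square 0-1-2-3 with diagonal 0-2: vertices 1 and 3 share neighbors 0 and 2.
    adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2]}
    print(common_neighbor_counts(adj)[(1, 3)])  # -> 2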
Wednesday, September 25, 2019