2019 IEEE High Performance Extreme Computing Conference (HPEC ’19)
Twenty-third Annual HPEC Conference
24 - 26 September 2019
Westin Hotel, Waltham, MA USA
12:00 - 1:00 in Emerson
Lunch
View Posters & Demos 1, 12:00-1:00 in Foyer

Embedded Processor-In-Memory Architecture for Accelerating Arithmetic Operations
Richard Muri, Paul Fortier (UMass Dartmouth)
A processor-in-memory (PIM) computer architecture is any design that performs some subset of logical operations in the same location as memory. The traditional model of computing involves a processor loading data from memory to perform operations, with a bus connecting the processor and memory. While this technique works well in many situations, a growing gap between memory performance and processor performance has led some researchers to develop alternative architectures. This paper details the implementation of a PIM architecture in a soft-core microcontroller used to accelerate applications limited by register file size. Using an Artix-7 FPGA, an ATmega103 microcontroller soft core is modified to include a PIM core as an accelerator. The sample application of AES encryption provides a comparison between the baseline processor and the PIM-enhanced machine. AES encryption using the modified microcontroller requires 38% fewer clock cycles without relying on application-specific improvements, at the expense of increased program memory size and FPGA fabric utilization.

FFTX for Micromechanical Stress-Strain Analysis
Anuva Kulkarni (Carnegie Mellon University)*; Daniele Giuseppe Spampinato (Carnegie Mellon University); Franz Franchetti (Carnegie Mellon University)
Porting scientific simulations to heterogeneous platforms requires complex algorithmic and optimization strategies to overcome memory and communication bottlenecks. Such operations are inexpressible using traditional libraries (e.g., FFTW for spectral methods) and difficult to optimize by hand for various hardware platforms. In this work, we use our GPU-adapted stress-strain analysis method to show how FFTX, a new API that extends FFTW, can be used to express our algorithm without worrying about code optimization, which is handled by a back-end code generator.

ECG Feature Processing Performance Acceleration on SLURM Compute Systems
Michael Nolan; Kajal Claypool; Mark Hernandez; Philip Fremont-Smith (MIT Lincoln Laboratory)*; Albert Swiston (Merck)
Electrocardiogram (ECG) signal features (e.g., heart rate, intrapeak interval times) are data commonly used in physiological assessment. Commercial off-the-shelf (COTS) software solutions for ECG data processing are available, but are often developed for serialized data processing, which scales poorly for large datasets. To address this issue, we've developed a Matlab code library for parallelized ECG feature generation. This library uses the pMatlab and MatMPI interfaces to distribute computing tasks over supercomputing clusters using the Simple Linux Utility for Resource Management (SLURM). To profile its performance as a function of parallelization scale, the ECG processing code was executed on a non-human primate dataset on the Lincoln Laboratory Supercomputing TXGreen cluster. Feature processing jobs were deployed over a range of processor counts and processor types to assess the overall reduction in job computation time. We show that individual process times decrease according to a 1/n relationship with the number of processors used, while total computation times, which account for deployment and data aggregation, impose diminishing returns of time against processor count. A maximum mean reduction in overall file processing time of 99% is shown.
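The scaling behavior reported in the ECG abstract above can be made concrete with a toy model: per-file work shrinks roughly as 1/n with processor count, while job deployment and data aggregation add a fixed cost that does not. The sketch below is a minimal Python illustration with entirely hypothetical timing constants; it is not the authors' pMatlab/MatMPI library.

```python
# Toy scaling model: per-file work parallelizes as ~1/n, but job deployment
# and result aggregation add a fixed overhead per job. All constants here
# are hypothetical and only illustrate the shape of the curve.

def total_job_time(n_procs, n_files=10_000, t_per_file=2.0,
                   t_deploy=30.0, t_aggregate=60.0):
    """Estimated wall-clock time (seconds) to process n_files on n_procs."""
    compute = (n_files * t_per_file) / n_procs   # ~1/n parallel portion
    overhead = t_deploy + t_aggregate            # does not shrink with n
    return compute + overhead

if __name__ == "__main__":
    for n in (1, 8, 64, 256, 1024):
        t = total_job_time(n)
        speedup = total_job_time(1) / t
        print(f"{n:5d} processors: {t:10.1f} s  (speedup {speedup:6.1f}x)")
```

Once the fixed overhead dominates, adding processors buys little additional speedup, which is the diminishing-returns effect the abstract describes.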
Emerging Applications of 3D Integration and Approximate Computing in High-Performance Computing Systems: Unique Security Vulnerabilities
Pruthvy Yellu, Zhiming Zhang, Mohammad Mezanur Rahman Monjur, Ranuli Abeysinghe, Qiaoyan Yu (UNH)
High-performance computing (HPC) systems rely on new technologies such as emerging devices, advanced integration techniques, and computing architectures to continue advancing performance. The adoption of new techniques could potentially leave high-performance computing systems vulnerable to new security threats. This work analyzes the security challenges in HPC systems that employ three-dimensional integrated circuits and approximate computing. Case studies are provided to show the impact of new security threats on system integrity and highlight the urgent need for new security measures.

Large Scale Organization and Inference of an Imagery Dataset for Public Safety
Jeffrey Liu, David Strohschein, Siddharth Samsi, Andrew Weinert (MIT-LL)
Video applications and analytics are routinely projected as a stressing and significant service of the Nationwide Public Safety Broadband Network. As part of a NIST PSCR funded effort, the New Jersey Office of Homeland Security and Preparedness and MIT Lincoln Laboratory have been developing a computer vision dataset of operational and representative public safety scenarios. The scale and scope of this dataset necessitate a hierarchical organization approach for efficient compute and storage. We overview architectural considerations using the Lincoln Laboratory Supercomputing Cluster as a test architecture. We then describe how we intelligently organized the dataset across LLSC and evaluated it with large-scale imagery inference across terabytes of data.

Deep-Learning Inferencing with High-Performance Hardware Accelerators
Luke Kljucaric*; Alan George (NSF SHREC/CHREC)
In order to improve their performance-per-watt capabilities over general-purpose architectures, FPGAs are commonly employed to accelerate applications. With the exponential growth of available data, machine-learning apps have generated greater interest in order to more comprehensively understand that data and increase autonomous processing. As FPGAs become more readily available on cloud services like the Amazon Web Services F1 platform, it is worth studying the performance of accelerating machine-learning apps on FPGAs over traditional fixed-logic devices, like CPUs and GPUs. FPGA frameworks for accelerating convolutional neural networks (CNNs), which are used in many machine-learning apps, have begun to emerge for accelerated-application development. This research aims to compare the performance of these forthcoming frameworks on two commonly used CNNs, GoogLeNet and AlexNet. Specifically, handwritten Chinese character recognition is benchmarked across multiple FPGA frameworks on Xilinx and Intel FPGAs and compared against multiple CPU and GPU architectures featured on AWS, Google’s Cloud platform, the University of Pittsburgh’s Center for Research Computing (CRC), and Intel’s vLab Academic Cluster. The NVIDIA GPUs proved to have the best performance of any device in this study. The Zebra framework available for Xilinx FPGAs showed an average 8.3x performance improvement and 9.3x efficiency improvement over the OpenVINO framework available for Intel FPGAs.
Although the Zebra framework on the Xilinx VU9P showed greater efficiency than the Pascal-based GPUs, the NVIDIA Tesla V100 proved to be the most efficient device at 125.9 and 47.2 images-per-second-per-Watt for AlexNet and GoogLeNet, respectively. Although they currently lag behind, FPGA frameworks and devices have the potential to compete with GPUs in terms of performance and efficiency.

Projecting Quantum Computational Advantage versus Classical State-of-the-Art
Jason Larkin (Carnegie Mellon University Software Engineering Institute)*; Daniel Justice (CMU SEI)
A major milestone in quantum computing research is to demonstrate quantum supremacy, where some computation is performed by a quantum computer that is infeasible classically.

Resilience-Aware Decomposition and Monitoring of Large-Scale Embedded Systems
Miguel Mark*; Michel Kinsy (Boston University); Haley Whitman; David Whelihan; Michael Vai (MIT Lincoln Laboratory)
With the inherent complexity of large-scale embedded systems and the lack of proper design tools, it is difficult for system engineers to verify that functional specifications adhere to design requirements. Applying formal verification to such large-scale embedded systems is challenging due to the expertise required in formal methods. It then becomes a daunting task to achieve mission assurance for embedded systems deployed in hostile environments. In this work, we introduce a monitoring-based approach and develop a new tool, called Formal Resilience Decomposition and Monitoring (FOREDEM), to assist system engineers in improving the mission assurance of their designs. FOREDEM implements a workflow allowing engineers to assess the overall resilience of a design and understand the associated costs through trade-off analysis.

Road Traffic Anomaly Detection using Functional Data Analysis
George Tsitsopoulos (Northeastern University)*
Streets and highways provide a ubiquitous data source, vehicle traffic volume, that can be exploited to gain insight into what is happening on roadways. Traffic patterns generally fluctuate in a consistent manner throughout the week, making them relatively predictable. However, holidays and unforeseeable anomalies such as accidents can cause significant deviations from the norm. Detecting these irregularities can be a difficult task due to the general noisiness of the count data. Nonetheless, knowledge of these traffic anomalies is important to many parties, making it a critical problem to solve. Awareness of an anomaly can ensure a timely arrival to work or alert agencies when something unusual is occurring in an area of interest. Although traffic volume data is readily available, it is not exploited to the extent we believe it should be when it comes to detecting anomalies. We can divide traffic anomalies into two categories: short-term anomalies and long-term anomalies. A short-term anomaly is generally an accident that causes a change in traffic pattern for a few hours or less. For example, a rear-end collision on the highway during the early afternoon may impact traffic for only 30 minutes. A long-term anomaly is typically a holiday, road closure, or extreme weather -- events that cause a large deviation from the expected pattern for a sustained period. Our research focused on long-term anomalies, aiming to automatically process and detect all holidays and major events that impact a day's traffic profile. Many approaches have been developed to detect traffic anomalies, each with varying success. A majority treat the count data as a discrete set of measurements. An alternative approach is to model the volume as a function of time and represent it using a smooth, continuous function. Typical traffic exhibits peaks in volume on weekdays during the morning and evening rush hours, with dips during the midday and nighttime hours. Weekends display different behavior, with diminished rush-hour peaks. In this work we utilize Functional Data Analysis (FDA) to smooth traffic counts into continuous functions and to detect long-term traffic anomalies based on single-sensor count data. Functional principal component analysis (FPCA) was used to identify the principal components of traffic variation. These components were compared to new data in order to determine whether or not an anomaly occurred. Three detection methods were contrasted: modified functional bagplots, high density region (HDR) boxplots, and Mahalanobis distance. We gathered our data from the California Department of Transportation Performance Measurement System (PeMS), which contains thousands of inductive-loop sensors throughout the state's roads and highways. These sensors have continuously recorded data sampled at 30-second intervals for several years, providing us a large source of traffic count information. Additionally, this dataset allows us to verify that holidays occurred, something that simulated traffic counts cannot do.
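A discrete stand-in for the FPCA-plus-Mahalanobis detector described above can be sketched in a few lines: treat each day's smoothed profile as a vector, take the leading principal components of historical day-to-day variation, and flag days whose component scores sit far from the historical cloud. The data, component count, and threshold below are all fabricated for illustration; this is not the authors' FDA pipeline or the PeMS dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for smoothed daily traffic profiles: 200 historical days,
# each sampled at 288 five-minute points, with morning and evening rush peaks.
t = np.linspace(0, 24, 288)
base = 400 * np.exp(-((t - 8) ** 2) / 2) + 500 * np.exp(-((t - 17) ** 2) / 2) + 50
history = base + 30 * rng.standard_normal((200, t.size))

# Discretized "FPCA": center the curves, then take the leading principal
# components of day-to-day variation via SVD.
mean_day = history.mean(axis=0)
_, _, Vt = np.linalg.svd(history - mean_day, full_matrices=False)
components = Vt[:3]                                  # top 3 components
scores = (history - mean_day) @ components.T
cov_inv = np.linalg.inv(np.cov(scores, rowvar=False))

def mahalanobis(day):
    """Distance of one day's profile from the historical variation model."""
    z = (day - mean_day) @ components.T
    return float(np.sqrt(z @ cov_inv @ z))

threshold = np.quantile([mahalanobis(d) for d in history], 0.99)

holiday = 0.4 * base + 20 * rng.standard_normal(t.size)  # suppressed rush hours
print(mahalanobis(base) > threshold)      # ordinary day: False expected
print(mahalanobis(holiday) > threshold)   # holiday-like day: True expected
```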
Overcoming Limitations of GPGPU-Computing in Scientific Applications
Gaurav Khanna*; Connor Kenyon; Glenn Volkema (UMass Dartmouth)
The performance of discrete general-purpose graphics processing units (GPGPUs) has been improving at a rapid pace. The PCIe interconnect that controls the communication of data between the system host memory and the GPU has not improved as quickly, leaving a gap in performance due to GPU downtime while waiting for PCIe data transfer. In this article, we explore two alternatives to the limited PCIe bandwidth: the NVIDIA NVLink interconnect, and zero-copy algorithms for shared-memory Heterogeneous System Architecture (HSA) devices. The OpenCL SHOC benchmark suite is used to measure the performance of each device on various scientific application kernels.

Optimizing the Visualization Pipeline of a 3D Monitoring and Management System
Rebecca Wild (Johns Hopkins APL), Matthew Hubbell, Jeremy Kepner (MIT-LL)
Monitoring and managing High Performance Computing (HPC) systems and environments generates an ever-growing amount of data. Making sense of this data and generating a platform where the data can be visualized for system administrators and management to proactively identify system failures or understand the state of the system requires the platform to be as efficient and scalable as the underlying database tools used to store and analyze the data. In this paper we show how we leverage Accumulo, D4M, and Unity to generate a 3D visualization platform to monitor and manage the Lincoln Laboratory Supercomputer systems, and how we have had to retool our approach to scale with our systems.

Skip the Intersection: Quickly Counting Common Neighbors on Shared-Memory Systems
Xiaojing An; Kasimir Gabert*; James Fox (Georgia Institute of Technology); Oded Green (NVIDIA); David Bader (Georgia Institute of Technology)
Counting common neighbors between all vertex pairs in a graph is a fundamental operation, with uses in similarity measures, link prediction, graph compression, community detection, and more. Current shared-memory approaches either rely on set intersections or are not readily parallelizable. We introduce a new efficient and parallelizable algorithm to count common neighbors: starting at a wedge endpoint, we iterate through all wedges in the graph and increment the common-neighbor count for each endpoint pair. This exactly counts the common neighbors between all pairs without using set intersections, and as such attains an asymptotic improvement in runtime. Furthermore, our algorithm is simple to implement, and only slight modifications are required for existing implementations to use our results. We provide an OpenMP implementation and evaluate it on real-world and synthetic graphs, demonstrating no loss of scalability and an asymptotic improvement. We show intersections are neither necessary nor helpful for computing all-pairs common neighbor counts.
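The wedge-iteration idea is easy to state in serial Python: every unordered pair of a vertex's neighbors forms one wedge, and that vertex is a common neighbor of the pair, so tallying wedges yields all common-neighbor counts without a single set intersection. The sketch below only illustrates the counting scheme; it is not the paper's parallel OpenMP implementation.

```python
from collections import defaultdict
from itertools import combinations

def common_neighbor_counts(adj):
    """Count common neighbors for every vertex pair that shares at least one,
    by enumerating wedges instead of intersecting adjacency lists.

    adj: dict mapping each vertex to the set of its neighbors (undirected).
    Returns: dict mapping a sorted pair (u, w) to |N(u) & N(w)|.
    """
    counts = defaultdict(int)
    for center, neighbors in adj.items():
        # Every unordered pair of the center's neighbors is one wedge;
        # the center is a common neighbor of that endpoint pair.
        for u, w in combinations(sorted(neighbors), 2):
            counts[(u, w)] += 1
    return counts

# Small example: a 4-cycle 1-2-3-4 with chord 2-4.
graph = {1: {2, 4}, 2: {1, 3, 4}, 3: {2, 4}, 4: {1, 2, 3}}
print(dict(common_neighbor_counts(graph)))
# Pair (1, 3) has common neighbors {2, 4}, so its count is 2.
```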
[Graph Challenge Finalist] Fast BFS-Based Triangle Counting on GPUs
Leyuan Wang*; John D Owens (University of California, Davis)
In this paper, we propose a novel method to compute triangle counting on GPUs. Unlike previous formulations of graph matching, our approach is BFS-based: we traverse the graph in an all-source-BFS manner, so the computation can be mapped onto GPUs in a massively parallel fashion. Our implementation uses the Gunrock programming model, and we evaluate its runtime and memory consumption against previous state-of-the-art work. We sustain a peak traversed-edges-per-second (TEPS) rate of nearly 10 GTEPS. Our algorithm is the most scalable and parallel among all existing GPU implementations and also outperforms all existing CPU distributed implementations. This work specifically focuses on leveraging our implementation on the triangle counting problem for the Subgraph Isomorphism Graph Challenge 2019, demonstrating a geometric mean speedup over the 2018 champion of 3.84×.
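One loose serial reading of the all-source-BFS formulation is to expand a two-level frontier from every vertex and count the expansions that close back onto a neighbor of the source. The Python below illustrates that reading only; the paper's actual contribution is a massively parallel Gunrock/GPU implementation, which this sketch does not represent.

```python
def triangle_count(adj):
    """Count triangles by a depth-2 expansion from every source vertex.

    adj: dict mapping each vertex to the set of its neighbors
         (undirected, no self-loops).
    Each triangle {u, v, w} is discovered six times (once per ordered
    source/neighbor choice), so the raw tally is divided by six.
    """
    tally = 0
    for u, nbrs_u in adj.items():      # "all-source": expand from every vertex
        for v in nbrs_u:               # frontier at depth 1
            for w in adj[v]:           # frontier at depth 2
                if w in nbrs_u:        # an edge back to the source closes a triangle
                    tally += 1
    return tally // 6

graph = {1: {2, 4}, 2: {1, 3, 4}, 3: {2, 4}, 4: {1, 2, 3}}
print(triangle_count(graph))   # the 4-cycle with one chord has 2 triangles
```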
[Graph Challenge Finalist] Performance of Training Sparse Deep Neural Networks on GPUs
Jianzong Wang (Ping An Technology (Shenzhen) Co., Ltd)*; Zhangcheng Huang (Ping An Technology (Shenzhen) Co., Ltd); Lingwei Kong (Ping An Technology (Shenzhen) Co., Ltd); Jing Xiao (Ping An Insurance (Group) Company of China); Pengyu Wang (Shanghai Jiao Tong University); Lu Zhang (Shanghai Jiao Tong University); Chao Li (Shanghai Jiao Tong University)
Deep neural networks have revolutionized the field of machine learning by dramatically improving the state of the art in various domains. The sizes of deep neural networks (DNNs) are rapidly outgrowing the capacity of hardware to store and train them quickly. Over the past few decades, researchers have explored the prospect of sparsifying DNNs before, during, and after training by pruning edges from the underlying topology; the resulting network is known as a sparse neural network. More recent works have demonstrated the remarkable result that certain sparse DNNs can train to the same precision as dense DNNs at lower runtime and storage cost. Existing methods ease the high demand for computation resources that severely hinders the deployment of large-scale DNNs on resource-constrained devices, allowing DNNs to be trained at a faster speed and lower cost. In this work, we propose a Fine-tune Structured Sparsity Learning (FSSL) method to regularize the structures of DNNs and accelerate their training. FSSL can (1) learn a compact structure from a large sparse DNN to reduce computation cost, and (2) obtain a hardware-friendly structure to accelerate DNN evaluation efficiently. Experimental results on training time and compression rate show superior performance and efficiency compared to the Matlab example code, with speedups roughly twice those of non-structured sparsity.

[Graph Challenge Honorable Mention] Fast Triangle Counting on GPU
Chuangyi Gui (Huazhong University of Science and Technology); Long Zheng (Huazhong University of Science and Technology)*; Pengcheng Yao (Huazhong University of Science and Technology); Xiaofei Liao (HUST); Hai Jin (Huazhong University of Science and Technology)
Triangle counting is one of the most basic graph applications used to solve many real-world problems in a wide variety of domains. Exploiting the massive parallelism of the Graphics Processing Unit (GPU) to accelerate triangle counting is prevalent. We identify that the state-of-the-art GPU-based studies that focus on improving load balancing still exhibit a large number of random accesses, which degrade performance. In this paper, we design a prefetching scheme that buffers the neighbor list of the processed vertex in advance in the fast shared memory to avoid the high latency of random global memory access. We also adopt a degree-based graph reordering technique and design a simple heuristic to evenly distribute the workload. Compared to the state-of-the-art HPEC Graph Challenge champion from last year, we improve the performance of triangle counting by up to a 5.9x speedup, with more than 10^9 TEPS on a single GPU, for many large real graphs from the Graph Challenge datasets.
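The degree-based reordering mentioned in the abstract above is a standard trick that can be sketched independently of the GPU-specific prefetching: rank vertices by degree, orient each edge from the lower-ranked to the higher-ranked endpoint, and count triangles over the oriented lists so that high-degree hubs do not dominate any single vertex's work. The Python below is an assumed, serial illustration of that reordering idea only; the shared-memory prefetching scheme has no analogue here.

```python
def degree_ordered_adj(adj):
    """Orient each undirected edge from the lower-degree endpoint to the
    higher-degree endpoint (ties broken by vertex id). High-degree hubs then
    keep short outgoing lists, which evens out per-vertex work.
    """
    rank = {v: (len(adj[v]), v) for v in adj}
    return {v: sorted(w for w in adj[v] if rank[v] < rank[w]) for v in adj}

def triangle_count_oriented(adj):
    """Count each triangle exactly once using the degree-ordered orientation."""
    out = degree_ordered_adj(adj)
    total = 0
    for u in out:
        for v in out[u]:
            # Common out-neighbors of u and v close a triangle u -> v -> w.
            total += len(set(out[u]) & set(out[v]))
    return total

graph = {1: {2, 4}, 2: {1, 3, 4}, 3: {2, 4}, 4: {1, 2, 3}}
print(triangle_count_oriented(graph))   # 2 triangles: {1,2,4} and {2,3,4}
```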
Wednesday, September 25, 2019