2019 IEEE High Performance Extreme Computing Conference (HPEC ’19)
Twenty-third Annual HPEC Conference
24 - 26 September 2019
Westin Hotel, Waltham, MA USA
Posters & Demos 2, 12:00-1:00 in Foyer

Applying Neuromorphic Computing to Compressive Sensing
Ronald Scrofano, Douglas P. Enright, George C. Valley (Aerospace)
As the computing community moves toward processing at the edge, there is a need for computing systems that are both high performance and power efficient. Neuromorphic computing systems have the potential to fill this need. In this abstract, we describe our initial progress toward applying neuromorphic computing to a compressive sensing problem in order to develop an efficient compressive sensing system for platforms with significant size, weight, and power (SWaP) constraints.

Context Aware Query Performance Optimization in Healthcare for Big Data Analytics
Manoj Muniswamaiah, Tilak Agerwala, and Charles Tappert (Pace Univ.)
Big data analytics is playing a critical role in the healthcare industry, both in providing better healthcare delivery to patients and in disease exploration research. New big data tools have been developed that help integrate and analyze the structured and unstructured data produced by different healthcare systems. Different databases have been used to store and process this healthcare-related data. In this paper, we propose and evaluate a cost-based, context-aware query optimizer that executes queries quickly and efficiently while improving its performance.

Evaluation of the Imbalance Evolution in Parallel Reservoir Simulation
Marcin Rogowski, Suha N. Kayum (Saudi Aramco)
Load balancing is a crucial factor affecting the performance of parallel applications. Improper work distribution leads to underutilization of computing resources and an unnecessary increase in runtime. This paper identifies the sources of imbalance in reservoir simulation and characterizes them as static or dynamic. Simulation model properties that change over time, such as well management actions, are registered and correlated with performance characteristics, thereby identifying sources of imbalance. The results are exploratory and are used to validate the current approach of static grid-to-process and well-to-process assignment widely used in commercial parallel reservoir simulators. Areas in which implementing dynamic load balancing would be worthwhile are identified.

Optimal Resource Allocation for Parallel Reservoir Simulation
Suha N. Kayum, Marcin Rogowski (Saudi Aramco)
Over the past few decades, the oil and gas (O&G) industry has become heavily dependent on parallel scientific computing. The turnaround time of such applications depends heavily on the amount of resources dedicated to the task. Increasing the number of compute processes for the same job tends to produce diminishing returns and does not always guarantee a performance gain that justifies the added resources. The point at which this happens marks the scalability limit, which this work aims to avoid surpassing. An algorithm is presented in which a reservoir simulation run automatically adjusts and finds the optimal resources, leading to improved performance and efficient utilization of compute resources, and resulting in significant cost savings.

Exploring the Efficiency of OpenCL Pipe for Hiding Memory Latency on Cloud FPGAs
Arnab A Purkayastha, Sai Raghavendran, Jhanani Thiagarajan and Hamed Tabkhi (UNC Charlotte)
OpenCL programming ability combined with OpenCL High-Level Synthesis (OpenCL-HLS) has brought tremendous improvements to the reconfigurable computing field. FPGAs' inherent pipelined parallelism provides not only faster execution times but also power-efficient solutions for the execution of massively parallel applications. A major execution bottleneck affecting FPGA performance is the high number of memory stalls exposed to the pipelined data-path, which hinders the benefits of data-path customization. This paper explores the efficiency of “OpenCL Pipe” to hide memory latency on cloud FPGAs by decoupling memory access from the computation. This paper leverages Pipe semantics to split OpenCL kernels into “read”, “compute” and “write back” sub-kernels, which work concurrently to overlap the computation of current threads with the memory access of future threads. For evaluation, we use a mix of seven massively parallel high-performance applications from the Rodinia suite version 3.1. All our tests are run on the Xilinx VU9FP FPGA platform on the Amazon cloud-based AWS EC2 F1 instance. On average, we observe a 5.2x speedup and a 2.2x increase in memory bandwidth utilization, with about a 2.5x increase in FPGA resource utilization over the baseline synthesis (Xilinx OpenCL-HLS).

A Novel Approach for Scheduling and Mapping of Real-Time Parallel Matrices Multiplication (SMPMM)
Hussam Abu Azab (Univ. de Moncton)
This paper introduces a novel parallel matrix multiplication algorithm (SMPMM) that divides the problem of matrix multiplication into smaller independent tasks. Each processor in the parallel environment executes a single task at a time; once done, the processor receives another task to process. This is in contrast to previous algorithms, such as Cannon, Fox, PUMMA, SUMMA, DIMMA and HSUMMA, where the decomposition is carried out on the data, i.e. the multiplied matrices are decomposed into small blocks, and each processor multiplies some blocks and sends the result to neighbor processors. SMPMM does not include any data decomposition. In addition, SMPMM differs from previous algorithms in that there is no data exchange and no communication among processors in the parallel environment. One more important advantage is that SMPMM multiplies non-square matrices in parallel, which is not supported by any previous parallel matrix multiplication algorithm.

Open Source Multi-functional Memory Unit and Application to Approximate Computing
Shigetoshi Nakatake (Univ. Kitakyushu)
Approximate computing is one of the promising computation techniques; it returns a possibly inaccurate result rather than a guaranteed accurate result. Conventionally, this kind of inaccurate computing is allowed for software. However, with the growth of mobile and embedded devices, the border between software and hardware implementation is no longer strict. We propose a novel multi-functional memory unit that can reconfigure a function of the memory decoder and is applicable to approximate computing. In our reconfigurable mechanism, uni-switch cells, which can act either as logic or as a wire, are introduced and embedded in an SRAM array. Hence, an extended function of the decoder is realized by PLA units inside the memory array and is used for approximate computing. Furthermore, we demonstrate an implementation of our idea on OpenRAM, an open-source SRAM array compiler.

High-Performance Computing Applications’ Transition to the Cloud in the Oil & Gas Industry
Suha N. Kayum, Marcin Rogowski (Saudi Aramco)
In the cloud platform, High-Performance Computing (HPC) is meant to provide the capability of scaling to large numbers of tasks that run in parallel on demand. However, it remains a dilemma for companies in the Oil & Gas (O&G) industry whether to transition HPC activities to the cloud or not. In this paper, we share recent research studies that shed light on some of the prevailing challenges and outlooks. The choice of which HPC applications should migrate to the cloud is shown to be application dependent, as demonstrated by a case study assessing the migration of reservoir simulation activity to the cloud. It is evident that a hybrid cloud solution for high-performance computing is a good starting point, as it mitigates a few challenges that entities in the O&G industry face today.

[Graph Challenge Honorable Mention] Multithreaded Layer-wise Training of Sparse Deep Neural Networks using Compressed Sparse Column
Mohammad Hasanzadeh Mofrad, Rami Melhem (Univ. of Pittsburgh), Yousuf Ahmad, Mohammad Hammoud (Carnegie Mellon University in Qatar)
Training a sparse Deep Neural Network (DNN) is inherently less memory-intensive and processor-intensive than training a dense (fully-connected) DNN. In this paper, we utilize Sparse Matrix-Matrix Multiplication (SpMM) to train sparsely-connected DNNs, as opposed to the dense matrix-matrix multiplication used for training dense DNNs. In our C/C++ implementation, we extensively use in-memory Compressed Sparse Column (CSC) data structures to store and traverse the neural network layers. We train the neural network layer by layer, and within each layer we use 1D-Column partitioning to divide the computation required for training among threads. To speed up the computation, we apply the bias and activation functions while executing SpMM operations. We tested our implementation using benchmarks provided by the MIT/IEEE/Amazon HPEC Graph Challenge. Based on our results, our single-thread (1 core) and multithreaded (12 cores) implementations are up to 22x and 150x faster, respectively, than the serial Matlab results provided by the challenge. We believe this speedup is due to the 1D-Column partitioning that we use to balance the computation of SpMM operations among computing threads, the efficient mechanism that we use for memory (re)allocation of sparse matrices, and the overlapping of the accumulation of SpMM results with the application of the bias and activation functions.

[Graph Challenge Honorable Mention] Accelerating Sparse Deep Neural Networks on FPGAs
Sitao Huang, Carl Pearson, Rakesh Nagi (University of Illinois at Urbana-Champaign), Jinjun Xiong (IBM Thomas J. Watson Research Center), Deming Chen, Wen-mei Hwu (University of Illinois at Urbana-Champaign)
Deep neural networks (DNNs) have been widely adopted in many domains, including computer vision, natural language processing, and medical care. Recent research reveals sparsity in DNN parameters, which can be exploited to reduce the computational complexity of inference. However, sparsity also introduces irregularity and extra complexity in data processing, which makes accelerator design challenging. In this work, we design and build a highly flexible sparse DNN inference engine to accelerate the inference of sparse DNNs. Our proposed inference engine can be easily configured for use in both mobile computing and high-performance computing scenarios. Evaluation shows that our proposed inference engine effectively accelerates sparse DNNs and outperforms a CPU solution by up to 4.7x in terms of energy efficiency.

[Graph Challenge Honorable Mention] Update on Triangle Counting on GPU
Carl Pearson, Mohammad Almasri, Omer Anjum, Vikram S. Mailthody, Zaid Qureshi, Rakesh Nagi (UIUC), Jinjun Xiong (IBM TJ Watson), and Wen-mei Hwu (UIUC)
This work presents an update to the triangle-counting portion of the subgraph isomorphism static graph challenge. This work is motivated by a desire to understand the impact of CUDA unified memory on the triangle-counting problem. First, CUDA unified memory is used to overlap reading large graph data from disk with graph data structures in GPU memory. Second, we use CUDA unified memory hints to solve multi-GPU performance scaling challenges present in our last submission. Finally, we improve the single-GPU kernel performance from our past submission by introducing a work-stealing dynamic algorithm GPU kernel with persistent threads, which makes performance adaptive for large graphs without requiring a graph analysis phase.
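
For readers unfamiliar with the triangle-counting problem named in the abstract above, the following is a minimal sequential sketch in Python. It is not the authors' CUDA kernel: the function name count_triangles, the edge-list input format, and the degree-based edge orientation are assumptions chosen for illustration, and the unified-memory and work-stealing techniques described in the abstract are not shown.

```python
from collections import defaultdict


def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs.

    Illustrative CPU sketch only; names and input format are assumptions.
    """
    # Build undirected adjacency sets, ignoring self-loops and duplicate edges.
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    # Rank vertices by (degree, id) so every edge gets a single direction,
    # from the lower-ranked endpoint to the higher-ranked one.
    def rank(n):
        return (len(adj[n]), n)

    forward = {n: {m for m in adj[n] if rank(m) > rank(n)} for n in adj}

    # Each triangle is found exactly once, at its lowest-ranked vertex u,
    # as a common forward neighbor of u and one of u's forward neighbors v.
    total = 0
    for u in forward:
        for v in forward[u]:
            total += len(forward[u] & forward[v])
    return total
```

Called on the six edges of a complete graph on four vertices, count_triangles returns 4. The per-edge set intersection in the inner loop is the unit of work that parallel implementations typically distribute across threads.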
Thursday, September 26, 2019