2019 IEEE High Performance Extreme Computing Conference (HPEC '19)
Twenty-third Annual HPEC Conference
24 - 26 September 2019
Westin Hotel, Waltham, MA USA
Wednesday September 25, 2019
Graphs, Networks & Sparse Data 2
1:00-2:40 in Eden Vale C1
Chair: Michael Wolf / Sandia

Fast and Scalable Distributed Tensor Decompositions
Muthu Baskaran, Thomas Henretty, James Ezick (Reservoir Labs)

Tensor decomposition is a prominent technique for analyzing multi-attribute data and is increasingly used for data analysis in different application areas. Tensor decomposition methods are computationally intense and involve irregular memory accesses over large-scale sparse data. It is therefore critical to optimize the execution of such data-intensive computations and the associated data movement to reduce the eventual time-to-solution in data analysis applications. With the increasing use of advanced high-performance computing (HPC) systems for data analysis, it is becoming ever more important to provide fast and scalable implementations of tensor decompositions and to execute them efficiently on modern HPC systems. In this paper, we present distributed tensor decomposition methods that achieve faster, more memory-efficient execution on HPC systems. We demonstrate that our techniques reduce the overall communication and execution time of tensor decomposition methods when they are used to analyze datasets of varied sizes from real applications. We illustrate our results on an advanced scaled-memory system, namely the HPE Superdome Flex, and on a distributed cluster of Intel multi-core nodes.
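The abstract names no specific decomposition, so purely as a point of reference, the Python sketch below shows the kernel that typically dominates sparse tensor decompositions such as CP-ALS: the matricized tensor times Khatri-Rao product (MTTKRP) over a 3-way tensor in coordinate format. This is a generic illustration, not Reservoir Labs' method; every name in it is ours.

    import numpy as np

    def mttkrp_mode0(coords, vals, B, C, n_rows):
        """Mode-0 MTTKRP for a 3-way sparse tensor in COO form.

        coords : (nnz, 3) integer array of (i, j, k) indices
        vals   : (nnz,) nonzero values
        B, C   : factor matrices for modes 1 and 2, shapes (J, R) and (K, R)
        Returns M with M[i, :] = sum over (i, j, k) of x_ijk * B[j, :] * C[k, :].
        """
        R = B.shape[1]
        M = np.zeros((n_rows, R))
        # Each nonzero contributes the Hadamard product of the matching
        # rows of B and C, scaled by the tensor value. The irregular,
        # data-dependent row accesses here are the kind of memory traffic
        # that distributed implementations must optimize.
        np.add.at(M, coords[:, 0], vals[:, None] * B[coords[:, 1]] * C[coords[:, 2]])
        return M

In a distributed setting, the tensor nonzeros and factor-matrix rows are partitioned across nodes and per-node partial sums of M are combined, which is where the communication cost that the paper targets arises.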
On Computing with Diagonally Structured Matrices
Shahadat Hossain, Mohammad Sakib Mahmud (Univ. Lethbridge)

We present a scheme for storing matrices by diagonals, along with algorithms for matrix-matrix and matrix-vector multiplication by diagonals. Matrix elements are accessed with unit stride and without indirect referencing, and access to the transposed matrix requires no additional effort. The proposed storage scheme handles dense matrices and matrices with special structure (e.g., banded, triangular, symmetric) in a uniform manner. Test results from preliminary numerical experiments with an OpenMP implementation of our method are encouraging.
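For concreteness, here is a minimal single-threaded Python sketch of this kind of scheme; it is ours, not the authors' code. Each diagonal is held as a contiguous array, the matrix-vector product walks every diagonal with unit stride and no index arrays, and transposition costs nothing because it only flips the sign of each offset. (SciPy's sparse dia_matrix format uses a related layout.)

    import numpy as np

    def to_diagonals(A):
        """Store a dense n-by-n matrix as (offset, values) pairs, one per
        nonzero diagonal; offset k > 0 is the k-th superdiagonal."""
        n = A.shape[0]
        diags = []
        for k in range(-(n - 1), n):
            d = np.diagonal(A, k).copy()   # contiguous, unit-stride values
            if np.any(d):                  # skip all-zero diagonals
                diags.append((k, d))
        return diags

    def matvec_by_diagonals(diags, x, transpose=False):
        """y = A @ x (or A.T @ x) accumulated diagonal by diagonal.
        The diagonal at offset k pairs A[i, i+k] with x[i+k]; transposing
        A merely negates every offset."""
        n = x.shape[0]
        y = np.zeros(n)
        for k, d in diags:
            if transpose:
                k = -k
            if k >= 0:
                y[:n - k] += d * x[k:]      # main and superdiagonals
            else:
                y[-k:] += d * x[:n + k]     # subdiagonals
        return y

Because only nonzero diagonals are stored, banded, triangular, and fully dense matrices are all handled by the same two routines, matching the uniformity the abstract claims.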
Spaceland Embedding of Sparse Stochastic Graphs
Nikos Pitsianis (Aristotle Univ., Duke), Alexandros-Stavros Iliopoulos (Aristotle Univ.), Dimitris Floros (Aristotle Univ.), Xiaobai Sun (Duke)

We introduce SG-t-SNE, a nonlinear method for embedding stochastic graphs/networks into d-dimensional spaces, d = 1, 2, 3, without requiring vertex features to reside in, or be transformed into, a metric space. Graphs/networks are relational data, prevalent in real-world applications. Graph embedding is fundamental to many graph analysis tasks, besides graph visualization. SG-t-SNE follows and builds upon the core principle of t-SNE, a widely used method for visualizing high-dimensional data. We also introduce SG-t-SNE-Π, high-performance software for rapid d-dimensional embedding of large, sparse, stochastic graphs on personal computers with superior efficiency. It empowers SG-t-SNE with modern computing techniques that exploit matrix structures in tandem with memory architectures. We present elucidating graph embedding results on several synthetic graphs and real-world networks in this paper and its Supplementary Material.
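The sketch below is not SG-t-SNE-Π itself; it is a naive Python reference for the t-SNE-style gradient that SG-t-SNE builds on, applied directly to a sparse stochastic affinity matrix. The rescaling of the stochastic matrix and the fast approximate kernels that make the real software practical are omitted, exact O(n^2) repulsion is kept purely for clarity, and all parameter values are illustrative.

    import numpy as np
    import scipy.sparse as sp

    def sg_tsne_sketch(P, dim=2, iters=300, lr=100.0, seed=0):
        """Naive t-SNE-style embedding of a sparse stochastic graph.

        P: sparse n-by-n stochastic affinity matrix (the graph).
        Returns an (n, dim) embedding array.
        """
        n = P.shape[0]
        P = (P + P.T) * 0.5          # symmetrize the affinities
        P = sp.coo_matrix(P / P.sum())
        rng = np.random.default_rng(seed)
        Y = rng.normal(scale=1e-2, size=(n, dim))
        for _ in range(iters):
            # Attraction acts only along graph edges, so it stays sparse.
            diff = Y[P.row] - Y[P.col]
            w = P.data / (1.0 + (diff ** 2).sum(axis=1))
            attr = np.zeros_like(Y)
            np.add.at(attr, P.row, w[:, None] * diff)
            # Repulsion uses the Student-t kernel over all pairs (dense).
            delta = Y[:, None, :] - Y[None, :, :]
            Q = 1.0 / (1.0 + (delta ** 2).sum(-1))
            np.fill_diagonal(Q, 0.0)
            rep = ((Q ** 2 / Q.sum())[:, :, None] * delta).sum(axis=1)
            Y -= lr * 4.0 * (attr - rep)   # gradient descent step
            Y -= Y.mean(axis=0)            # keep the embedding centered
        return Y

The split into a sparse attractive term and a dense repulsive term is exactly the structure that fast t-SNE variants exploit; replacing the quadratic repulsion with an approximation is what makes large graphs feasible.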
Partitioning Graphs for the Cloud using Reinforcement Learning
Mohammad Hasanzadeh Mofrad (CMU), Rami Melhem (CMU), Mohammad Hammoud (CMU Qatar)

In this paper, we propose Revolver, a parallel graph partitioning algorithm capable of partitioning large-scale graphs on a single shared-memory machine. Revolver employs an asynchronous processing framework that leverages reinforcement learning and label propagation to adaptively partition a graph. In addition, it adopts a vertex-centric view of the graph in which each vertex is assigned an autonomous agent responsible for selecting a suitable partition for it, thereby distributing the computation across all vertices. The intuition behind the vertex-centric view is that it naturally fits the graph partitioning problem: a graph can be partitioned using local information provided by each vertex's neighborhood. We fully implemented and comprehensively tested Revolver using nine real-world graphs. Our results show that Revolver is scalable and can outperform three popular, state-of-the-art graph partitioners by producing comparably localized partitions without sacrificing load balance across partitions.
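The abstract does not spell out the learning rule, so the following Python sketch is one plausible reading, ours rather than Revolver's design: label propagation in which each vertex hosts a linear reward-inaction learning automaton that is rewarded when its sampled partition agrees with its neighborhood and respects a balance cap. All names and parameters are illustrative.

    import numpy as np

    def agent_based_partition(adj, k=4, iters=50, alpha=0.1, slack=1.05, seed=0):
        """Sketch of vertex-centric partitioning with learning automata.

        adj: list of neighbor lists; k: number of partitions.
        Returns an array of partition labels, one per vertex.
        """
        n = len(adj)
        rng = np.random.default_rng(seed)
        prob = np.full((n, k), 1.0 / k)       # one automaton per vertex
        labels = rng.integers(k, size=n)
        cap = slack * n / k                   # per-partition capacity
        for _ in range(iters):
            sizes = np.bincount(labels, minlength=k).astype(float)
            for v in range(n):
                a = rng.choice(k, p=prob[v])  # sample an action
                if adj[v]:
                    counts = np.bincount(labels[adj[v]], minlength=k)
                    best = counts.argmax()    # neighborhood majority label
                else:
                    best = a
                # Reward when locality is achieved and balance is respected.
                if a == best and sizes[a] + 1 <= cap:
                    prob[v] *= (1.0 - alpha)  # linear reward-inaction update
                    prob[v, a] += alpha
                    sizes[labels[v]] -= 1
                    labels[v] = a
                    sizes[a] += 1
                # On failure, reward-inaction leaves probabilities unchanged.
        return labels

The balance cap is what distinguishes this from plain label propagation, which tends to collapse vertices into a few giant partitions.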
Improving Parallelism of Breadth First Search (BFS) Algorithm for Accelerated Performance on GPUs
Hao Wen, Wei Zhang (VCU)

Breadth-first search (BFS) is a basis for graph search and a core building block for many higher-level graph analysis applications. However, BFS is also a typical example of parallel computation that is inefficient on GPU architectures. In a graph, a small portion of nodes may have a large number of neighbors, which leads to irregular tasks on GPUs. In addition, the number of nodes in each layer of the graph is also irregular, so the number of active GPU threads differs from layer to layer. These irregularities limit the parallelism of BFS on GPUs. Unlike previous work that focuses on fine-grained task management to address the irregularity, we propose Virtual-BFS (VBFS) to virtually change the graph itself. By adding virtual vertices, the high-degree nodes in the graph are divided into groups that have an equal number of neighbors, which increases parallelism so that more GPU threads can work concurrently and also makes the data set more regular. Our experimental results show that VBFS achieves significant speedup over the current GPU implementation of BFS from the Rodinia benchmark, and that energy efficiency is also improved.
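Here is a minimal host-side Python sketch of the transformation the abstract describes: each high-degree vertex is split into a chain of virtual copies, each owning an equal-size slice of its out-neighbors, so that per-thread work becomes more uniform. The group size, chaining layout, and names are our assumptions; the paper's GPU data layout will differ, and a BFS over the result must traverse virtual links at zero depth cost.

    def add_virtual_vertices(adj, group_size=32):
        """Split high-degree vertices into virtual copies.

        adj: dict mapping integer vertex ids to neighbor lists.
        Returns (new_adj, owner), where owner maps every vertex,
        virtual or not, back to the original vertex it represents.
        """
        new_adj = {}
        owner = {}
        next_id = max(adj) + 1
        for v, nbrs in adj.items():
            owner[v] = v
            if len(nbrs) <= group_size:
                new_adj[v] = list(nbrs)
                continue
            # Carve the neighbor list into equal-size groups.
            chunks = [nbrs[i:i + group_size]
                      for i in range(0, len(nbrs), group_size)]
            new_adj[v] = list(chunks[0])
            prev = v
            for chunk in chunks[1:]:
                vid = next_id
                next_id += 1
                owner[vid] = v
                new_adj[prev].append(vid)  # zero-cost link to next copy
                new_adj[vid] = list(chunk)
                prev = vid
        return new_adj, owner

After the split, no adjacency list exceeds group_size plus one chain link, so GPU threads assigned one vertex each see nearly uniform work regardless of the original degree distribution.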