2020 IEEE High Performance Extreme Computing Virtual Conference 21 - 25 September 2020
2-1: Graph Analytics & Network Science 1 Session (11:00-12:15 EDT) A GraphBLAS solution to the SIGMOD 2014 Programming Contest using multi-source BFS Márton Elekes (Budapest University of Technology and Economics); Attila Nagy (Budapest University of Technology and Economics); Dávid Sándor (Budapest University of Technology and Economics); János Benjámin Antal (unaffiliated); Timothy Davis (Texas A&M University); Gabor Szarnyas (Budapest University of Technology and Economics)* The GraphBLAS standard defines a set of fundamental building blocks for formulating graph algorithms in the language of linear algebra. Since its first release in 2017, the expressivity of the GraphBLAS API and the performance of its implementations (such as SuiteSparse:GraphBLAS) have been studied on a number of textbook graph algorithms such as BFS, single-source shortest path, and connected components. However, less attention has been devoted to other aspects of graph processing, such as handling typed and attributed graphs (also known as property graphs) and making use of complex graph query techniques (handling paths, aggregation, and filtering). To study these problems in more detail, we have used GraphBLAS to solve the case study of the 2014 SIGMOD Programming Contest, which defines complex graph processing tasks that require a diverse set of operations. Our solution makes heavy use of multi-source BFS algorithms expressed as sparse matrix-matrix multiplications, along with other GraphBLAS techniques such as masking and submatrix extraction. While the queries can be formulated concisely in GraphBLAS, our performance evaluation shows mixed results. For some queries and data sets, the performance is competitive with the hand-optimized top solutions submitted to the contest; however, in some cases, it is currently outperformed by orders of magnitude. 
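As an illustration of the pattern the abstract above relies on (expanding many BFS frontiers at once with a single sparse matrix-matrix product while masking out visited vertices), the following is a minimal SciPy sketch. It is a stand-in for the GraphBLAS masked-SpGEMM idiom rather than the authors' code, and all names in it are ours.

    import numpy as np
    import scipy.sparse as sp

    def multi_source_bfs(A, sources):
        """Levels of a BFS started from several sources at once.
        A: n x n CSR adjacency matrix with 0/1 entries; sources: list of
        start vertices. Returns a k x n array of levels, -1 = unreachable."""
        A = sp.csr_matrix(A, dtype=np.int64)
        n, k = A.shape[0], len(sources)
        levels = np.full((k, n), -1, dtype=np.int64)
        frontier = sp.csr_matrix((np.ones(k, dtype=np.int64),
                                  (np.arange(k), sources)), shape=(k, n))
        depth = 0
        while frontier.nnz:
            r, c = frontier.nonzero()
            levels[r, c] = depth               # record the current wavefront
            nxt = frontier @ A                 # expand all k frontiers in one SpGEMM
            r, c = nxt.nonzero()
            keep = levels[r, c] < 0            # the "mask": drop visited vertices
            frontier = sp.csr_matrix((np.ones(int(keep.sum()), dtype=np.int64),
                                      (r[keep], c[keep])), shape=(k, n))
            depth += 1
        return levels

In GraphBLAS terms the masked product and the level update become single library operations; here the mask is applied by filtering the nonzero pattern explicitly.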
LessMine: Reducing Sample Space and Data Access for Dense Pattern Mining Tianyu Fu (Tsinghua University)*; Ziqian Wan (Tsinghua University); Guohao Dai (Tsinghua University); Yu Wang (Tsinghua University); Huazhong Yang (Tsinghua University) In the era of "big data", graphs have proven to be one of the most important reflections of real-world problems. To refine the core properties of large-scale graphs, dense pattern mining plays a significant role. Because of the complexity of pattern mining problems, conventional implementations often lack scalability, consuming much time and memory space. Previous work (e.g., ASAP) proposed approximate pattern mining as an efficient way to extract structural information from graphs, demonstrating dramatic performance improvements of up to two orders of magnitude. However, we observe three main flaws of ASAP in cases of dense patterns, so we propose LessMine, which reduces the sample space and data access for dense pattern mining. We introduce a reorganization of the data structure, a concurrent sampling method, and uniform close. We also provide a locality-aware partition for distributed settings. The evaluation shows that our design achieves up to 1829x speedup with a 66% lower error rate compared with ASAP. Fast Graphlet Transform of Sparse Graphs Dimitris Floros (Aristotle University of Thessaloniki); Nikos Pitsianis (Aristotle University of Thessaloniki)*; Xiaobai Sun (Duke University) We introduce the computational problem of the graphlet transform of a sparse graph. Graphlets are fundamental topology elements of all graphs/networks. They can be used as coding elements to encode graph-topological information at multiple granularity levels, for classifying vertices on the same graph/network, as well as for making differentiation or connection across different networks. Network/graph analysis using graphlets has growing applications. We recognize the universality and increased encoding capacity of using multiple graphlets, we address the arising computational complexity issues, and we present a fast method for the exact graphlet transform. The fast graphlet transform establishes a few remarkable records at once in high computational efficiency, low memory consumption, and ready translation to high-performance programs and implementations. It is intended to enable and advance network/graph analysis with graphlets, and to introduce this relatively new analysis apparatus to graph theory, high-performance graph computation, and broader applications. Half-Precision Floating-Point Formats for PageRank: Opportunities and Challenges Amir Sabbagh Molahosseini (Queen's University Belfast)*; Hans Vandierendonck (Queen's University Belfast) Mixed-precision computation has been proposed as a means to accelerate iterative algorithms, as it can reduce memory bandwidth requirements and improve cache effectiveness. This paper aims for further memory traffic reduction by introducing new half-precision (16-bit) data formats customized for PageRank. We develop two formats. The first format builds on the observation that the exponents of about 99% of PageRank values are tightly distributed around the exponent of the inverse of the number of vertices. The second format builds on the observation that 6 exponent bits are sufficient to capture the full dynamic range of PageRank values. Our floating-point formats provide less precision than standard IEEE 754 formats, but sufficient dynamic range for PageRank. Experimental results on graphs of various sizes show that the proposed formats can achieve an accuracy of 1e-4, which is an improvement over the state of the art. Due to random memory access patterns in the algorithm, performance improvements over our highly tuned baseline are 1.5% at best. 
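Since the half-precision formats above target the PageRank kernel itself, a plain double-precision power-iteration reference may help fix ideas about which values the custom exponent encodings must represent; this is a generic sketch, not the paper's mixed-precision implementation, and the names are ours.

    import numpy as np
    import scipy.sparse as sp

    def pagerank(A, d=0.85, tol=1e-6, max_iter=100):
        """Power-iteration PageRank on a CSR adjacency matrix A with
        A[i, j] = 1 for an edge i -> j. Double precision throughout."""
        A = sp.csr_matrix(A, dtype=np.float64)
        n = A.shape[0]
        out_deg = np.asarray(A.sum(axis=1)).ravel()
        inv_deg = np.divide(1.0, out_deg, out=np.zeros(n), where=out_deg > 0)
        P = sp.diags(inv_deg) @ A              # row-stochastic transition matrix
        dangling = out_deg == 0
        r = np.full(n, 1.0 / n)
        for _ in range(max_iter):
            r_new = d * (P.T @ r) + d * r[dangling].sum() / n + (1.0 - d) / n
            if np.abs(r_new - r).sum() < tol:
                return r_new
            r = r_new
        return r

With web-scale n in the billions, the converged scores cluster tightly around 1/n, which is the exponent behavior the first custom 16-bit format above exploits.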
GraphSDH: A General Graph Sampling Framework with Distribution and Hierarchy Jingbo Hu (Tsinghua University)*; Guohao Dai (Tsinghua University); Yu Wang (Tsinghua University); Huazhong Yang (Tsinghua University) Large-scale graphs play a vital role in various applications, but their use is limited by long processing times. Graph sampling is an effective way to reduce the amount of graph data and accelerate the algorithm. However, previous work usually lacks theoretical analysis related to graph algorithm models. In this study, GraphSDH (Graph Sampling with Distribution and Hierarchy), a general large-scale graph sampling framework, is established based on the vertex-centric graph model. For four common sampling techniques, we derive the sampling probability that minimizes the variance, and we optimize the design according to whether there is a pre-estimation process for the intermediate value. To further improve the accuracy of the graph algorithm, we propose a stratified sampling method based on vertex degree and a hierarchical optimization scheme based on sampling position analysis. Extensive experiments on large graphs show that GraphSDH can achieve over 95% accuracy for PageRank by sampling only 10% of the edges of the original graph, and can speed up PageRank by several times compared with the non-sampling case. Compared with random neighbor sampling, GraphSDH can reduce the mean relative error of PageRank by about 17% at a sampling neighbor ratio (sampling fraction) of 20%. Furthermore, GraphSDH can be applied to various graph algorithms, such as Breadth-First Search (BFS), Alternating Least Squares (ALS), and Label Propagation Algorithm (LPA). Poster Session: 2-P (12:15-15:45 EDT) Scalable Parallel File Write from a Large NUMA System Dong-In Kang (Information Sciences Institute, USC)* As the number of sockets and cores within a cache-coherent NUMA (Non-Uniform Memory Access) system increases, parallel file I/O performance is affected more by processor affinity and synchronization overhead within the system. In this paper, we report on the performance impact of processor affinity and page caching overhead on parallel write operations to a single shared file on a large NUMA system. We ran experiments using two configurations of an HPE Superdome Flex system – one with 12 sockets and the other with 32 sockets, both having 24 cores per socket, OpenMPI, and the Lustre parallel file system. Our results show that processor affinity and page caching overhead can result in large performance variation depending on the number of MPI processes. We observe that page caching is useful for parallel file writes with a small number of MPI processes, but it is not scalable on a large NUMA system. Parallel file writes without page caching are scalable but achieve higher performance only with a large number of MPI processes. Variable Precision Multiplication for Software-Based Neural Networks Richa Singh (Virginia Polytechnic Institute and State University)*; Tom Conroy (Virginia Polytechnic Institute and State University); Patrick Schaumont (Worcester Polytechnic Institute) As the number of applications of neural networks continues to grow, so does the need to efficiently perform inference computations on highly constrained devices. In this paper, we propose a methodology to accelerate neural networks in software. We exploit the limited-precision requirements of typical neural networks by formulating recurring operations in a bit-slice computation format. Bit-slice computation ensures that every bit of an M-bit processor word contributes useful work even while computing a limited-precision n-bit (with n < M) operation. This paper brings the following contributions. We first present an environment to efficiently create bit-slice descriptions in software, by synthesizing them from Verilog. We then develop bit-sliced designs of matrix multiplication and evaluate their performance. Our target is a small microcontroller, and we rely solely on software optimization. Our driving application is a neural network classifier for the MNIST database. Range-Based Linear Quantization in symmetric mode quantizes pre-trained 32-bit floating-point weights and activations to low-precision data widths. Experiments on RISC-V with varying levels of hardware support show that for data widths common to neural network applications, the bit-sliced code produces a speedup over traditional methods, which leads to faster and more efficient inference without incurring significant loss in accuracy. For example, 8-bit matrix multiplications are sped up by a factor of 2.62 when compared with a non-bit-sliced rv32i ISA implementation with no hardware multiplier. 
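To make the bit-slice idea above concrete, here is a small Python model of bit-sliced arithmetic: each "plane" packs one bit position of up to 32 independent lanes into a machine word, so word-wide AND/XOR/OR operations perform 32 narrow operations at once. It illustrates the general technique only, not the Verilog-synthesized kernels described in the abstract.

    MASK = (1 << 32) - 1  # 32 parallel lanes packed per bit-plane

    def bitslice(values, nbits):
        """Pack up to 32 integers into nbits bit-planes (one word per plane)."""
        planes = []
        for j in range(nbits):
            p = 0
            for lane, v in enumerate(values):
                p |= ((v >> j) & 1) << lane
            planes.append(p)
        return planes

    def unbitslice(planes, lanes):
        """Recover the per-lane integer values from the bit-planes."""
        return [sum(((planes[j] >> k) & 1) << j for j in range(len(planes)))
                for k in range(lanes)]

    def bs_add(a, b, nbits):
        """Lane-wise ripple-carry addition on bit-planes, truncated to nbits."""
        out, carry = [], 0
        for j in range(nbits):
            aj = a[j] if j < len(a) else 0
            bj = b[j] if j < len(b) else 0
            out.append((aj ^ bj ^ carry) & MASK)
            carry = (aj & bj) | (carry & (aj ^ bj))
        return out

    def bs_mul(a, b, nbits):
        """Lane-wise shift-and-add multiplication, truncated to nbits."""
        acc = [0] * nbits
        for j in range(nbits):
            # partial product: every plane of a, gated by bit-plane j of b, shifted up by j
            partial = [0] * j + [(a[i] & b[j]) & MASK for i in range(nbits - j)]
            acc = bs_add(acc, partial, nbits)
        return acc

    # Example: unbitslice(bs_mul(bitslice([3, 5], 8), bitslice([4, 7], 8), 8), 2) == [12, 35]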
Offline Machine Learning for Human Activity Recognition with Smartphone Yanjia Zhang (University of South Florida)*; Kandethody Ramachandran (University of South Florida) Wearable sensors in smartphones and wrist tracking devices are widely used for low-cost activity tracking and body monitoring. Human activity recognition (HAR) is one of their important applications, and activity identification is the core part of HAR. In this report, we present a comparison of several popular offline machine learning methods on smartphone data and try to find the most effective model for analyzing these sensor data. The methods include Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Linear Discriminant Analysis (LDA), Decision Tree (DT), Random Forest (RF), and Artificial Neural Network (ANN). Two smartphone datasets with different transition methods are used. The data cover Walking, Walking Upstairs, Walking Downstairs, Sitting, Standing, Lying, and Jogging. On the first dataset, which has the first six activities, SVM with a linear kernel and LDA have the highest test average accuracy, 96.4% and 96.2%, respectively, while Decision Tree performs worse than the others with a test average accuracy of 86.0%. The second dataset excludes Lying but includes Jogging; on it, LDA and DT are the most appropriate algorithms, each with a test average accuracy of 100%, while KNN is the worst with 74.3%. Based on all the results, LDA might be the best choice for these sensor data. Moreover, the transition method used to reduce noise and extra information in the second dataset might be better than that in the first one: it has lower dimensionality and better classification performance. To obtain these improved accuracy rates, we used grid search, multi-fold cross-validation, and dimensionality reduction. In addition to the comparison, we also propose a two-layer method for activity identification, which is more flexible in choosing classifiers for activities. 
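A compact version of the comparison harness the report above describes (grid search plus multi-fold cross-validation over a few of the classifiers) might look as follows; X and y stand for the extracted sensor features and activity labels, and the parameter grids are placeholders rather than the authors' settings.

    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.neighbors import KNeighborsClassifier

    def compare_classifiers(X, y, cv=5):
        """Return the mean cross-validated accuracy of a few HAR classifiers,
        with grid-searched hyperparameters where applicable."""
        models = {
            "SVM (linear)": GridSearchCV(SVC(kernel="linear"),
                                         {"C": [0.1, 1, 10]}, cv=cv),
            "LDA": LinearDiscriminantAnalysis(),
            "KNN": GridSearchCV(KNeighborsClassifier(),
                                {"n_neighbors": [3, 5, 9]}, cv=cv),
        }
        return {name: cross_val_score(model, X, y, cv=cv).mean()
                for name, model in models.items()}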
Execution of Complete Molecular Dynamics Simulations on Multiple FPGAs Carlo Pascoe (Silicon Therapeutics)*; Larry Stewart (Silicon Therapeutics); Woody Sherman (Silicon Therapeutics); Vipin Sachdeva (Silicon Therapeutics); Martin Herbordt (Boston University) We have modified the open source molecular dynamics (MD) simulation code OpenMM [1] to add support for running complete MD timesteps on a cluster of FPGAs. FPGA kernels are coded in OpenCL for ease of hardware development and application integration. MD proceeds by calculating forces on individual particles and integrating those forces to update velocities and positions on a per-timestep basis. A variety of forces apply to each particle, and we subdivide them into three categories based on the types of computation required: range limited (RL), long range (LR), and bonded. RL interactions comprise Lennard-Jones and electrostatic forces between all particle pairs within a radial cutoff. LR interactions primarily comprise electrostatic forces beyond the RL cutoff, where pairwise computation would be too costly. We calculate LR forces using the Smooth Particle Mesh Ewald (PME) method, which uses 3D Fast Fourier Transforms (FFTs) to accelerate computation. Bonded interactions are the focus of future work. The ultimate goal of this project is to perform MD simulation of biologically relevant systems within the context of drug discovery (i.e., periodic systems of 50,000–100,000 particles) with strong scaling performance greater than possible with other technologies such as GPUs. 
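The range-limited (RL) category above is the classic cutoff-limited pairwise kernel; a deliberately naive O(N^2) NumPy reference for that force class (Lennard-Jones only, minimum-image periodic boundaries) is sketched below. Production MD codes, and the FPGA kernels described in the abstract, use cell or neighbor lists instead of the all-pairs loop, and the constants here are placeholders.

    import numpy as np

    def range_limited_forces(pos, box, cutoff, epsilon=1.0, sigma=1.0):
        """Lennard-Jones forces between all particle pairs within a radial
        cutoff. pos: (N, 3) positions; box: (3,) periodic box lengths."""
        n = pos.shape[0]
        forces = np.zeros_like(pos)
        for i in range(n):
            d = pos[i] - pos                    # displacement to every particle
            d -= box * np.round(d / box)        # minimum-image convention
            r2 = np.einsum('ij,ij->i', d, d)
            mask = (r2 < cutoff ** 2) & (r2 > 0)
            inv_r2 = np.zeros_like(r2)
            inv_r2[mask] = 1.0 / r2[mask]
            s6 = (sigma ** 2 * inv_r2) ** 3     # (sigma / r)^6
            fmag = 24.0 * epsilon * (2.0 * s6 ** 2 - s6) * inv_r2
            forces[i] = (fmag[:, None] * d).sum(axis=0)
        return forces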
Optimizing Use of Different Types of Memory for FPGAs in High Performance Computing Kai Huang (Northeastern University); Mehmet Gungor (Northeastern University); Stratis Ioannidis (Northeastern University); Miriam Leeser (Northeastern University)* Accelerators such as Field Programmable Gate Arrays (FPGAs) are increasingly used in high performance computing, and the problems to which they are applied process larger and larger amounts of data. FPGA manufacturers have added new types of on-chip memory to help ease the memory bottleneck; however, the burden is on the designer to determine how data are allocated to the different memory types. We study the use of ultraRAM for a graph application running on Amazon Web Services (AWS) that generates a large amount of intermediate data that is not subsequently accessed sequentially. We investigate different algorithms for mapping data to ultraRAM. Our results show that use of ultraRAM can speed up overall application run time by a factor of 3 or more. Maximizing the amount of ultraRAM used produces the best results, and as the problem size grows, judiciously assigning data to ultraRAM vs. DDR results in better performance. 2-2: Graph Analytics & Network Science 2 Session (12:30-13:45 EDT) Fast GPU Graph Contraction by Combining Efficient Shallow Searches and Post-Culling Roozbeh Karimi (NVidia Corporation); David Koppelman (Louisiana State University)*; Chris Michael (Naval Research Laboratory) Efficient GPU single-source shortest-path (SSSP) queries of road network graphs can be realized by a technique called PHAST (Delling et al.), in which the graph is contracted (pre-processed using Geisberger's Contraction Hierarchies) once and the resulting contracted graph is queried as needed. PHAST accommodates GPUs' parallelism requirements well, resulting in efficient queries. For situations in which a graph is not available well in advance or changes frequently, contraction time itself becomes significant. Karimi et al. recently described a GPU contraction technique, CU-CH, which significantly reduces the contraction time of small- to medium-sized graphs, reporting a speedup of over 20 times on an NVidia P100 GPU. However, CU-CH realizes little speedup on larger graphs, such as DIMACS' USA and W. Europe graphs. The obstacle to faster contraction of larger graphs is the frequently performed witness path search (WPS). A WPS for a node determines which shortcut edges need to be added between the node's neighbors to maintain distances after the removal of the node. GPUs' strict thread convergence requirements and limited scratchpad preclude the bidirectional Dijkstra approach used in CPU implementations. Instead, CU-CH uses a two-hop-limit WPS tightly coded to fit GPU shared storage and to maintain thread convergence. Where two hops are sufficient, speedup is high, but for larger graphs the hop limit exacts a toll due to the accumulation of unneeded shortcuts. The problem is overcome here by retaining the efficient CU-CH WPS but using it both for its original purpose and to identify unnecessary shortcuts added in prior steps. The unnecessary shortcuts are culled (removed). Culling shortcuts not only dramatically reduces the time needed to contract a graph but also improves the quality of the contracted graph. For smaller graphs such as DIMACS Cal (travel time), contraction time is 61% faster compared to CU-CH. For the DIMACS Europe and USA graphs, contraction times are 40 times and 12 times faster, respectively. SSSP query times also improve dramatically, approaching those obtained on aggressively contracted graphs. The speedup over Geisberger's CPU code is over 100 times on NVidia V100 GPUs for most graphs tried. 
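A sketch of the hop-limited witness path search at the center of the abstract above: when node v is contracted, a shortcut (u, w) is proposed unless a witness path of at most two hops (avoiding v) is at least as short. This is a plain-Python illustration of the idea, using an undirected dict-of-dicts graph layout of our own choosing with orderable node identifiers, not the GPU kernel. As the abstract notes, the two-hop limit can propose shortcuts that a deeper search would reject, which is exactly what the post-culling step later removes.

    import math

    def contract_node(graph, v):
        """Shortcuts required when removing v, under a two-hop witness limit.
        graph: dict mapping each node to a dict {neighbor: edge weight}."""
        shortcuts = []
        nbrs = graph[v]
        for u, duv in nbrs.items():
            for w, dvw in nbrs.items():
                if u >= w:                       # consider each unordered pair once
                    continue
                via_v = duv + dvw
                best = graph[u].get(w, math.inf) # one-hop witness: a direct edge u-w
                for x, dux in graph[u].items():  # two-hop witnesses u-x-w avoiding v
                    if x != v and w in graph[x]:
                        best = min(best, dux + graph[x][w])
                if best > via_v:                 # no short witness found: add shortcut
                    shortcuts.append((u, w, via_v))
        return shortcuts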
Using Graphlet Spectrograms for Temporal Pattern Analysis of Virus-Research Collaboration Networks Dimitris Floros (Aristotle University of Thessaloniki); Tiancheng Liu (Duke University); Nikos Pitsianis (Aristotle University of Thessaloniki)*; Xiaobai Sun (Duke University) We introduce a new method for temporal pattern analysis of scientific collaboration networks. We investigate, in particular, virus research activities through five epidemic or pandemic outbreaks in the recent two decades and in the ongoing COVID-19 pandemic. Our method embodies two innovative components. The first is a simple model of temporal collaboration networks, with time segmented by publication date and convolved over citation history, to effectively capture and accommodate collaboration activities at mixed time scales. The second component is the novel use of graphlets to encode topological structures and to detect change and persistence in collaboration activities over time. We discover in particular two unique and universal roles of the bi-fork graphlet in (1) identifying bridges among triangle clusters and (2) quantifying grassroots as the backbone of every collaboration network. We present a number of intriguing patterns and findings about virus-research activities. Computing PageRank Scores of Web Crawl Data Using DGX A100 Clusters Seunghwa Kang (NVIDIA)*; Alex Fender (NVIDIA); Joe Eaton (NVIDIA); Brad Rees (NVIDIA) PageRank is a widely used graph analytics algorithm to rank vertices using relationship data. Large-scale PageRank is challenging due to its irregular and communication-intensive computational characteristics. We implemented PageRank on NVIDIA's newly released DGX A100 cluster and compared the performance with two recent notable large-scale PageRank computations using the Common Crawl dataset. The ShenTu framework computed PageRank scores using a large number of custom microprocessors connected with an HPC-class interconnect. The Hronos framework reported state-of-the-art performance using 3000 commodity CPU nodes and 10 Gbps Ethernet. The Common Crawl dataset captures link relationships between web pages in a graph with 3.563 billion vertices and 128.736 billion edges. Our implementation demonstrated 13x faster PageRank iteration time than the Hronos framework using a cluster of 32 NVLink-connected A100 GPUs. Triangle Counting with Cyclic Distributions Kevin Deweese (University of Washington)*; Jesun Firoz (Pacific Northwest National Laboratory); Andrew Lumsdaine (Pacific Northwest National Lab and University of Washington); Scott McMillan (CMU Software Engineering Institute); Luke D'Alessandro (Indiana University) Triangles are the simplest non-trivial subgraphs, and triangle counting is used in a number of different applications. The order in which vertices are processed in triangle counting strongly affects the amount of work that needs to be done (and thus the overall performance). Ordering vertices by degree has been shown to be one particularly effective ordering approach. However, for graphs with skewed degree distributions (such as power-law graphs), ordering by degree affects the distribution of work; parallelization must account for this distribution in order to balance work among workers. In this paper we provide an in-depth analysis of the ramifications of degree-based ordering on parallel triangle counting. We present an approach for partitioning work in triangle counting, based on cyclic distributions and some surprisingly simple C++ implementations. Experimental results demonstrate the effectiveness of our approach, particularly for power-law (and social network) graphs. Towards an Objective Metric for the Performance of Exact Triangle Count Mark Blanco (Carnegie Mellon University)*; Scott McMillan (CMU Software Engineering Institute); Tze Meng Low (Carnegie Mellon University) The performance of graph algorithms is often measured in terms of the number of traversed edges per second (TEPS). However, this performance metric is inadequate for a graph operation such as exact triangle counting. In triangle counting, execution times on graphs with a similar number of edges can be distinctly different, as demonstrated by results from past Graph Challenge entries. We discuss the need for an objective performance metric for graph operations and the desired characteristics of such a metric, such that it more accurately captures the interactions between the amount of work performed and the capabilities of the hardware on which the code is executed. Using exact triangle counting as an example, we derive a metric that captures how certain techniques employed in many implementations improve performance. We demonstrate that our proposed metric can be used to evaluate and compare multiple approaches for triangle counting, using a SIMD approach as a case study against a scalar baseline. 2-3: Graph Analytics & Network Science 3 Session (14:15-15:30 EDT) Leveraging Linear Algebra to Count and Enumerate Simple Subgraphs Vitaliy Gleyzer (MIT Lincoln Laboratory)*; Edward Kao (MIT Lincoln Laboratory); Andrew Soszynski (MIT Lincoln Laboratory) Even though subgraph counting and subgraph matching are well-known NP-hard problems, they are foundational building blocks for many scientific and commercial applications. To analyze graphs that contain millions to billions of edges, distributed systems can provide computational scalability through search parallelization. One recent approach for exposing graph algorithm parallelization is through a linear algebra formulation and the use of the matrix multiply operation, which is conceptually equivalent to a massively parallel graph traversal. This approach has several benefits, including 1) a mathematically rigorous foundation, and 2) the ability to leverage specialized linear algebra accelerators and high-performance libraries. In this paper, we explore and define a linear algebra methodology for performing exact subgraph counting and matching for 4-vertex subgraphs excluding the clique. Matches on these simple subgraphs can be joined as components for a larger subgraph. With thorough analysis, we demonstrate that the linear algebra formulation leverages path aggregation, which allows it to be up to 2x to 5x more efficient in traversing the search space and compressing the results as compared to tree-based subgraph matching techniques. 
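The linear-algebra formulation running through this session reduces, in its simplest instance, to the masked sparse matrix product used for exact triangle counting; a SciPy sketch of that kernel (ours, not any of the implementations above) is:

    import numpy as np
    import scipy.sparse as sp

    def triangle_count(A):
        """Exact triangle count from a symmetric 0/1 adjacency matrix with no
        self-loops. (A @ A)[i, j] counts length-2 paths from i to j, and
        masking with A keeps only the wedges that close into triangles."""
        A = sp.csr_matrix(A, dtype=np.int64)
        closed = (A @ A).multiply(A)       # element-wise mask by the adjacency
        return int(closed.sum()) // 6      # each triangle is counted 6 times

    def triangle_count_lower(A):
        """Same count using the strictly lower-triangular part L, so each
        triangle is counted exactly once."""
        L = sp.tril(sp.csr_matrix(A, dtype=np.int64), k=-1)
        return int((L @ L).multiply(L).sum())

How a degree-based reordering of A reshapes the work hidden in this product is precisely the load-balance question the cyclic-distribution paper above analyzes.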
GBTLX: A First Look Sanil Rao (Carnegie Mellon University)* We provide a first look at GBTLX, a code generator that translates GBTL-based graph processing programs into high-performance C graph processing programs that match hand-tuned high-performance implementations. GBTLX refactors code written using GBTL into problems, which capture the signature of algorithms, and solvers, which capture the semantics (input/output behavior) of algorithms. Users provide classes that implement these two aspects using standard GBTL functions and encapsulate the targeted algorithm. GBTLX then performs a sequence of inspection, code generation, and high-performance execution. First, the user code is traced while running against the original GBTL. Then, the trace is used to define the semantics and signature of the algorithm for use in code generation. The SPIRAL system is used to generate high-performance C code that implements the user-specified algorithm, specializing the code for algorithm-dependent optimizations. Finally, the user-provided GBTL-based implementation is replaced by the SPIRAL-generated C code. For triangle counting and k-truss enumeration, the resulting executables provide performance equivalent to hand-tuned implementations, while the source code remains maintainable as it only uses the C++ GBTL library. GraphBLAS Programmability: Python and MATLAB Interfaces Timothy Mattson (Intel)*; Michel Pelletier (Graphegon); Timothy Davis (Texas A&M University) A graph can be represented by a sparse array. In that representation, the basic operations used to construct graph algorithms are linear algebra operations over algebraic semirings. The GraphBLAS forum was established in 2013 to define a set of Basic Linear Algebra Subprograms for Graph Algorithms. They started with a mathematical definition of the GraphBLAS followed by a binding to the C programming language. The result was the GraphBLAS C specification, currently at version 1.3.1. A robust implementation of the GraphBLAS C specification is available as the SuiteSparse GraphBLAS library. The primary motivation of the GraphBLAS was to provide a high-performance foundation for graph algorithms. Many members of the GraphBLAS forum were also motivated by the simplicity of graph algorithms expressed in terms of linear algebra. This simplicity, however, is difficult to fully appreciate when the programming interface is the C programming language. To see the full expressive power of the GraphBLAS, a high-level interface is needed so that the elegance of the mathematical underpinnings of the GraphBLAS is apparent. In this paper we present Python (pygraphblas) and MATLAB (@GrB) interfaces to the SuiteSparse implementation of the GraphBLAS. We use a subset of the GAP benchmark suite to demonstrate the simplicity of these GraphBLAS interfaces and show that in most cases little performance is sacrificed by moving to the more productive Python and MATLAB interfaces. 
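The "linear algebra over a semiring" view in the abstract above is easy to state without any GraphBLAS syntax: swapping (+, x) for (min, +) turns a matrix-vector product into one round of shortest-path relaxation. A dense NumPy sketch of that identity follows (ours, not the pygraphblas or @GrB API).

    import numpy as np

    def sssp_minplus(W, src):
        """Bellman-Ford-style single-source shortest paths as repeated
        vector-matrix products over the (min, +) semiring.
        W: n x n weight matrix with np.inf where there is no edge."""
        n = W.shape[0]
        d = np.full(n, np.inf)
        d[src] = 0.0
        for _ in range(n - 1):
            # (min, +) "matvec": relax every edge once
            d_new = np.minimum(d, (d[:, None] + W).min(axis=0))
            if np.array_equal(d_new, d):
                break
            d = d_new
        return d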
On the Feasibility of Using Reduced-Precision Tensor Core Operations for Graph Analytics Jesun Firoz (Pacific Northwest National Laboratory)*; Ang Li (Pacific Northwest National Laboratory); Jiajia Li (Pacific Northwest National Laboratory); Kevin Barker (PNNL) Today's data-driven analytics and machine learning workloads have been largely driven by General-Purpose Graphics Processing Units (GPGPUs). To accelerate dense matrix multiplications on the GPUs, Tensor Core Units (TCUs) have been introduced in recent years. In this paper, we study linear-algebra-based and vertex-centric algorithms for various graph kernels on the GPUs with the objective of applying this new hardware feature to graph applications. We identify the potential stages in these graph kernels that can be executed on the Tensor Core Units. In particular, we leverage the reformulation of the reduction and scan operations in terms of matrix multiplication [1] on the TCUs. We demonstrate that executing these operations on the TCUs, where available inside different graph kernels, can assist in establishing an end-to-end pipeline on the GPGPUs without depending on hand-tuned external libraries and can still deliver comparable performance for various graph analytics. 
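The reformulation cited as [1] in the abstract above recasts reductions and prefix scans as matrix multiplications so they can run on tensor cores; the underlying identities are simply products with an all-ones vector and a lower-triangular all-ones matrix, as this NumPy sketch shows (the tensor-core tiling, precision handling, and kernel code are not reproduced here).

    import numpy as np

    def reduce_by_matmul(X):
        """Row-wise reduction of a tile expressed as a matrix-vector product."""
        return X @ np.ones(X.shape[1], dtype=X.dtype)

    def scan_by_matmul(x):
        """Inclusive prefix scan expressed as a matrix-vector product:
        y[i] = sum(x[0..i]) = (L @ x)[i], with L lower-triangular ones."""
        L = np.tril(np.ones((len(x), len(x)), dtype=x.dtype))
        return L @ x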
Efficient Sparse Matrix-Vector Multiplication on Intel PIUMA Architecture Sriram Aananthakrishnan (Intel)*; Robert Pawlowski (Intel); Joshua Fryman (Intel); Ibrahim Hur (Intel Corporation) Graph analytics has many real-world applications, such as cyber-security, detecting fraudulent activity, and understanding disease spread. Sparse matrix dense vector (SpMV) multiplication is an indispensable kernel in large-scale graph analytics. SpMV's performance is sensitive to input sparsity, and its performance has been observed to be poor on scale-free graphs. Intel PIUMA is a novel architecture tailored for graph analytics. PIUMA addresses the challenges in graph analytics through a scalable network and memory design along with novel features for offloading memory accesses and synchronization. We report on an early performance study of SpMV on the Intel PIUMA architecture. Our results show strong scalability of SpMV on scale-free graphs, with a peak simulated performance of 109 GFlops on a scale-free graph with 16 million nodes and 0.5 billion edges. 2-4: Big Data & Distributed Computing 1 Session (15:45-17:00 EDT) A Framework for Task Mapping onto Heterogeneous Platforms Ta-Yang Wang (University of Southern California)*; Ajitesh Srivastava (University of Southern California); Viktor Prasanna (University of Southern California) While heterogeneous systems provide considerable opportunities for accelerating big data applications, the variation in processing capacities and communication latency of different resources makes it challenging to effectively map applications onto the platform. To generate an optimized mapping of the input application onto a variety of heterogeneous platforms, we design a flexible annotated task interaction graph based framework which 1) allows modeling of mixed CPU and GPU architectures, and 2) identifies an efficient task-hardware mapping of the input application, given the dependencies and communication costs between the tasks that constitute the application. The annotated task interaction graph (ATIG) representation captures all the information that is necessary to execute the application, along with meta-data such as performance models for estimating runtime on a target resource and communication latencies. Our framework supports solving the problem of mapping tasks in the ATIG onto available resources by including variations of greedy algorithms and LP relaxations with rounding. We show that our framework can achieve high speedup, allowing domain experts to efficiently compile a broad set of programs to parallel and heterogeneous hardware. 
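One of the greedy variants such a framework can include is plain earliest-finish-time list scheduling over the annotated graph; the toy sketch below conveys only the shape of that heuristic. The task, resource, and cost names are hypothetical, and the ATIG framework's actual algorithms and LP relaxations are not reproduced.

    from dataclasses import dataclass, field

    @dataclass
    class Task:
        name: str
        runtime: dict                              # estimated runtime per resource
        deps: list = field(default_factory=list)   # names of predecessor tasks

    def greedy_map(tasks, comm_cost=1.0):
        """Earliest-finish-time mapping of tasks (given in topological order)
        onto heterogeneous resources, charging comm_cost whenever a
        dependency was placed on a different resource."""
        finish, placed = {}, {}
        free_at = {r: 0.0 for t in tasks for r in t.runtime}
        for t in tasks:
            best = None
            for r, rt in t.runtime.items():
                ready = max([0.0] + [finish[d] + (comm_cost if placed[d] != r else 0.0)
                                     for d in t.deps])
                start = max(ready, free_at[r])
                if best is None or start + rt < best[0]:
                    best = (start + rt, r)
            finish[t.name], placed[t.name] = best
            free_at[best[1]] = best[0]
        return placed, finish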
Best of Both Worlds: High Performance Interactive and Batch Launching Chansup Byun (MIT Lincoln Laboratory)* Rapid launch of thousands of jobs is essential for effective interactive supercomputing, big data analysis, and AI algorithm development. Achieving thousands of launches per second has required hardware to be available to receive these jobs. This paper presents a novel preemptive approach to implement “spot” jobs on MIT SuperCloud systems, allowing the resources to be fully utilized for long-running batch jobs while still providing fast launch for interactive jobs. The new approach separates the job preemption and scheduling operations and can achieve 100 times faster performance in the scheduling of a job with preemption when compared to using the standard scheduler-provided automatic preemption-based capability. The results demonstrate that the new approach can schedule interactive jobs preemptively at a performance comparable to when the required computing resources are idle and available. The spot job capability can be deployed without disrupting the interactive user experience while increasing the overall system utilization. Approximate Inverse Chain Preconditioner: Iteration Count Case Study for Spectral Support Solvers M. Harper Langston (Reservoir Labs, Inc.)*; Meifeng Lin (Brookhaven National Laboratory); Eric Papenhausen (Akai Kaeru); Pierre-David Letourneau (Reservoir Labs, Inc.); Julia Wei (Reservoir Labs, Inc.); Larry Weintraub (Reservoir Labs, Inc.); Mitchell Harris (Reservoir Labs, Inc.); Richard Lethin (Reservoir Labs) As the growing availability of computational power slows, there has been an increasing reliance on algorithmic advances. However, faster algorithms alone will not necessarily bridge the gap in allowing computational scientists to study problems at the edge of scientific discovery in the next several decades. Often, it is necessary to simplify or precondition solvers to accelerate the study of large systems of linear equations commonly seen in a number of scientific fields. Preconditioning a problem to increase efficiency is often seen as the best approach; yet preconditioners which are fast, smart, and efficient do not always exist. Following the progress of our paper last year at HPEC on the Combinatorial Multigrid, we present a new preconditioner for symmetric diagonally dominant (SDD) systems of linear equations. These systems are common in certain PDEs, network science, and supervised learning, among others. Based on spectral support graph theory, this new preconditioner builds off of the work of Richard Peng and Dan Spielman, computing and applying a V-cycle chain of approximate inverse matrices. The preconditioner is both algebraic in nature and hierarchically constrained depending on the condition number of the system to be solved. Because it generates an Approximate Inverse Chain of matrices, we refer to it as the AIC preconditioner. We further accelerate the AIC preconditioner by utilizing precomputations to simplify setup and multiplications in the context of an iterative Krylov-subspace solver. While these iterative solvers can greatly reduce solution time, the number of iterations can grow large quickly in the absence of good preconditioners. Initial results for the AIC preconditioner have shown a very large reduction in iteration counts for SDD systems as compared to standard preconditioners such as Incomplete Cholesky (ICC) and Multigrid (MG). We further show a significant reduction in iteration counts against the more advanced Combinatorial Multigrid (CMG) preconditioner. We have also developed no-fill sparsification techniques to ensure that the computational cost of applying the AIC preconditioner does not grow prohibitively large as the depth of the V-cycle grows for systems with larger condition numbers. Our numerical results show that these sparsifiers maintain the sparsity structure of our system while also displaying significant reductions in iteration counts. 
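The iteration-count comparisons described above can be reproduced in outline with SciPy's conjugate gradient solver and any preconditioner supplied as a linear operator; the snippet below is only that measurement harness, with a trivial Jacobi preconditioner as a stand-in, and does not reproduce the AIC construction itself.

    import scipy.sparse.linalg as spla

    def cg_iteration_count(A, b, M=None, maxiter=5000):
        """Count CG iterations to convergence on an SDD system A x = b,
        optionally using a preconditioner M (a LinearOperator applying
        an approximation of A^{-1})."""
        it = 0
        def count(xk):
            nonlocal it
            it += 1
        x, info = spla.cg(A, b, M=M, callback=count, maxiter=maxiter)
        return it, info

    def jacobi_preconditioner(A):
        """Diagonal (Jacobi) preconditioner as a LinearOperator, purely as a
        baseline to compare iteration counts against."""
        d = 1.0 / A.diagonal()
        return spla.LinearOperator(A.shape, matvec=lambda v: d * v)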
Accelerating Distributed Inference of Sparse Deep Neural Networks via Mitigating the Straggler Effect Mohammad Hasanzadeh Mofrad (University of Pittsburgh)*; Rami Melhem (University of Pittsburgh); Mohammad Hammoud (Carnegie Mellon University); Muhammad Yousuf Ahmad (Carnegie Mellon University in Qatar) Once a Deep Neural Network (DNN) is trained, an inference algorithm retains the learning and applies it to batches of data. The trained DNN can be sparse as a result of pruning or of following a preset sparse connectivity pattern. Inference in such sparse networks requires lower space and time complexity compared to dense ones. Similar to dense DNNs, sparse DNNs can be parallelized using model or data parallelism, whereby the former partitions the network and the latter partitions the input among multiple threads. Model parallelism efficiently utilizes the Last Level Cache (LLC) but has a heavy synchronization cost because of compulsory reductions per layer. In contrast, data parallelism allows independent execution of partitions but suffers from a straggler effect due to load imbalance among partitions. We combine data and model parallelism through a new type of parallelism that we denote data-then-model. In data-then-model parallelism, each thread starts with data parallelism, thus mitigating the per-layer synchronization cost of model parallelism. After it finishes its partition, it switches to model parallelism to support a slower active thread, hence alleviating the straggler effect of data parallelism. We compare data-then-model parallelism with data and model parallelism as well as task-based parallelism using the IEEE HPEC sparse DNN challenge dataset. On average, we achieve 10% to 65% speedup compared to these parallelisms. Distributed Non-Negative Tensor Train Decomposition Manish Bhattarai (Los Alamos National Laboratory)*; Gopinath Chennupati (Los Alamos National Laboratory); Erik Skau (Los Alamos National Laboratory); Raviteja Vangara (Los Alamos National Laboratory); Hristo Djidjev (Los Alamos National Laboratory); Boian Alexandrov (LANL) The era of exascale computing opens new venues for innovations and discoveries in many scientific, engineering, and commercial fields. However, with the exaflops also come the extra-large, high-dimensional data generated by high-performance computing. High-dimensional data are presented as multidimensional arrays, aka tensors. The presence of latent (not directly observable) structures in the tensor allows a unique representation and compression of the data by classical tensor factorization techniques. However, the classical tensor methods are not always stable, or they can be exponential in their memory requirements, which makes them unsuitable for high-dimensional tensors. Tensor train (TT) is a state-of-the-art tensor network introduced for the factorization of high-dimensional tensors. TT transforms the initial high-dimensional tensor into a network of three-dimensional tensors that requires only linear storage. Many real-world data, such as density, temperature, population, and probability, are non-negative, and for easy interpretation, algorithms preserving non-negativity are preferred. Here, we introduce a distributed non-negative tensor-train decomposition and demonstrate its scalability and compression on synthetic and real-world big datasets. 2-S2: Remote Sensing for Disaster Relief Special (17:30-19:30 EDT) Fast Mapping onto Census Blocks Jeremy Kepner (MIT Lincoln Laboratory)*; Darren Engwirda (Columbia University); Vijay Gadepally (MIT Lincoln Laboratory); Chris Hill (MIT); Tim Kraska (MIT); Michael Jones (MIT Lincoln Laboratory); Andreas Kipf (MIT); Lauren Milechin (MIT); Navin Vembar (Camber Systems) Pandemic measures such as social distancing and contact tracing can be enhanced by rapidly integrating dynamic location data and demographic data. Projecting billions of longitude and latitude locations onto hundreds of thousands of highly irregular demographic census block polygons is computationally challenging in both research and deployment contexts. This paper describes two approaches, labeled “simple” and “fast”. The simple approach can be implemented in any scripting language (Matlab/Octave, Python, Julia, R) and is easily integrated and customized to a variety of research goals. This simple approach uses a novel combination of hierarchy, sparse bounding boxes, polygon crossing-number tests, vectorization, and parallel processing to achieve 100,000,000+ projections per second on 100 servers. The simple approach is compact, does not increase data storage requirements, and is applicable to any country or region. The fast approach exploits the thread, vector, and memory optimizations that are possible using a low-level language (C++) and achieves similar performance on a single server. This paper details these approaches with the goal of enabling the broader community to quickly integrate location and demographic data. 
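The crossing-number test named in the "simple" approach above is the standard even-odd ray-casting rule; a vectorized NumPy sketch of that single ingredient (with the bounding-box hierarchy, sparsity, and parallelization omitted, and function names ours) is:

    import numpy as np

    def crossing_number(px, py, poly):
        """Even-odd point-in-polygon test: cast a ray to the right of
        (px, py) and count how many polygon edges it crosses.
        poly: (n, 2) array of vertex coordinates."""
        x, y = np.asarray(poly, dtype=float).T
        x2, y2 = np.roll(x, -1), np.roll(y, -1)
        # edges that straddle the horizontal line through py
        straddle = (y > py) != (y2 > py)
        # x-coordinate where each straddling edge crosses that line
        x_cross = x + (py - y) * (x2 - x) / np.where(straddle, y2 - y, 1.0)
        return int(np.count_nonzero(straddle & (x_cross > px))) % 2 == 1

    def locate(px, py, blocks):
        """Bounding-box prefilter before the exact test.
        blocks: list of (bbox, poly) pairs, bbox = (xmin, ymin, xmax, ymax)."""
        for (xmin, ymin, xmax, ymax), poly in blocks:
            if xmin <= px <= xmax and ymin <= py <= ymax and crossing_number(px, py, poly):
                return poly
        return None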
Train and Deploy an Image Classifier for Disaster Response Jianyu Mao (The Pennsylvania State University)*; Kiana Harris (The Pennsylvania State University); Nae-Rong Chang (The Pennsylvania State University); Caleb Pennell (The Pennsylvania State University); Yiming Ren (The Pennsylvania State University) With deep learning image classification becoming more powerful each year, it is apparent that its introduction to disaster response will increase the efficiency with which responders can work. Using several neural network models, including AlexNet, ResNet, MobileNet, DenseNets, and a 4-layer CNN, we have classified flood disaster images from a large image data set with up to 79% accuracy. Our models and tutorials for working with the data set have created a foundation for others to classify other types of disasters contained in the images. Integrating Multiple Deep Learning Models to Classify Disaster Scene Videos Yuan Li (University of North Texas)*; Haili Wang (University of North Texas); Shuo Sun (University of North Texas); Bill Buckles (University of North Texas) It is well known that the long-term trend of an increasing number of natural disasters globally, accompanied by increasing loss of life and property, continues. The fifth assessment report of the Intergovernmental Panel on Climate Change (IPCC, 2014) predicts that as global warming continues in the coming decades, its contribution to the increase in natural disaster losses will become more prominent. However, through rapid and accurate analysis of disaster scenarios, there is still an opportunity to significantly reduce the catastrophic losses caused by extreme events. From a video, we extract key frames and identify embedded objects (using YOLOv3). The densely labeled images are given a global label using various VGG16 tools. Classical quality measures (accuracy, precision, and recall) will guide subsequent development directions. A Hierarchical Auto-Labeling Deep Neural Network for Disaster Scene Videos Shuo Sun (University of North Texas)*; Yuan Li (University of North Texas); Haili Wang (University of North Texas); Bill Buckles (University of North Texas) As the climate changes, the number of extreme natural disasters that may cause great loss of life and property is continuously increasing, which is becoming a more and more severe problem in modern society. To respond faster to natural disasters, computer-aided detection and analysis methods are urgently required. However, current popular computer vision technologies cannot fully realize their capabilities due to the shortage of datasets with proper disaster-related labels, let alone the huge amount and complexity of the data. To address these problems, using the Low Altitude Disaster Imagery (LADI) dataset, we develop a hierarchical auto-labeling deep neural network that can automatically generate multiple disaster labels and that provides a foundation for disaster scene description and higher-level analysis.
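As a minimal illustration of the video-to-label pipeline shared by the two abstracts above (sample frames, classify each frame, aggregate to a video-level label), the following OpenCV sketch uses naive fixed-stride frame sampling and majority voting. The actual systems use YOLOv3/VGG16 and hierarchical label generation; classify_frame here is a hypothetical stand-in for any trained image classifier.

    from collections import Counter
    import cv2  # OpenCV

    def sample_frames(video_path, every_n=30):
        """Keep one frame out of every `every_n` frames of the video
        (a crude placeholder for proper key-frame selection)."""
        cap = cv2.VideoCapture(video_path)
        frames, idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % every_n == 0:
                frames.append(frame)
            idx += 1
        cap.release()
        return frames

    def label_video(video_path, classify_frame):
        """Aggregate per-frame labels into a video-level label by majority
        vote; classify_frame maps an image to a label string."""
        votes = Counter(classify_frame(f) for f in sample_frames(video_path))
        return votes.most_common(1)[0][0] if votes else None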
Tuesday, September 22
2020 Abstract Book
2-1: Graph Analytics & Network Science 1 Session (11:00-12:15 EDT) A GraphBLAS solution to the SIGMOD 2014 Programming Contest using multi-source BFS Márton Elekes (Budapest University of Technology and Economics); Attila Nagy (Budapest University of Technology and Economics); Dávid Sándor (Budapest University of Technology and Economics); János Benjámin Antal (unaffiliated); Timothy Davis (Texas A&M University); Gabor Szarnyas (Budapest University of Technology and Economics)* The GraphBLAS standard defines a set of fundamental building blocks for formulating graph algorithms in the language of linear algebra. Since its first release in 2017, the expressivity of the GraphBLAS API and the performance of its implementations (such as SuiteSparse:GraphBLAS) have been studied on a number of textbook graph algorithms such as BFS, single-source shortest path, and connected components. However, less attention was devoted to other aspects of graph processing such as handling typed and attributed graphs (also known as property graphs), and making use of complex graph query techniques (handling paths, aggregation, and filtering). To study these problems in more detail, we have used GraphBLAS to solve the case study of the 2014 SIGMOD Programming Contest, which defines complex graph processing tasks that require a diverse set of operations. Our solution makes heavy use of multi-source BFS algorithms expressed as sparse matrix-matrix multiplications along with other GraphBLAS techniques such as masking and submatrix extraction. While the queries can be formulated in GraphBLAS concisely, our performance evaluation shows mixed results. For some queries and data sets, the performance is competitive with the hand-optimized top solutions submitted to the contest, however, in some cases, it is currently outperformed by orders of magnitude. LessMine: Reducing Sample Space and Data Access for Dense Pattern Mining Tianyu Fu (Tsinghua University)*; Ziqian Wan (Tsinghua University); Guohao Dai (Tsinghua University); Yu Wang (Tsinghua University); Huazhong Yang (Tsinghua University) In the era of ''big data'', graph has been proven to be one of the most important reflections of real-world problems. To refine the core properties of large-scale graphs, dense pattern mining plays a significant role. Because of the complexity of pattern mining problems, conventional implementations often lack scalability, consuming much time and memory space. Previous work (e.g., ASAP) proposed approximate pattern mining as an efficient way to extract structural information from graphs. It demonstrates dramatic performance improvement by up to two orders of magnitude. However, we observe three main flaws of ASAP in cases of dense patterns, thus we propose LessMine, which reduces the sample space and data access for dense pattern mining. We introduce the reorganization of data structure, the method of concurrent sample, and uniform close. We also provide locality-aware partition for distributed settings. The evaluation shows that our design achieves up to 1829x speedup with 66% less error rate compared with ASAP. Fast Graphlet Transform of Sparse Graphs Dimitris Floros (Aristotle University of Thessaloniki); Nikos Pitsianis (Aristotle University of Thessaloniki)*; Xiaobai Sun (Duke University) We introduce the computational problem of graphlet transform of a sparse graph. Graphlets are fundamental topology elements of all graphs/networks. 
They can be used as coding elements to encode graph-topological information at multiple granularity levels, for classifying vertices on the same graph/network, as well as, for making differentiation or connection across different networks. Network/graph analysis using graphlets has growing applications. We recognize the universality and increased encoding capacity in using multiple graphlets, we address the arising computational complexity issues, and we present a fast method for exact graphlet transform. The fast graphlet transform establishes a few remarkable records at once in high computational efficiency, low memory consumption, and ready translation to high-performance program and implementation. It is intended to enable and advance network/graph analysis with graphlets, and to introduce the relatively new analysis apparatus to graph theory, high-performance graph computation, and broader applications. Half-Precision Floating-Point Formats for PageRank: Opportunities and Challenges Amir Sabbagh Molahosseini (Queen's University Belfast)*; Hans Vandierendonck (Queen's University Belfast) Mixed-precision computation has been proposed as a means to accelerate iterative algorithms as it can reduce the memory bandwidth and cache effectiveness. This paper aims for further memory traffic reduction via introducing new half-precision (16 bit) data formats customized for PageRank. We develop two formats. A first format builds on the observation that the exponents of about 99% of PageRank values are tightly distributed around the exponent of the inverse of the number of vertices. A second format builds on the observation that 6 exponent bits are sufficient to capture the full dynamic range of PageRank values. Our floating- point formats provide less precision compared to standard IEEE 754 formats, but sufficient dynamic range for PageRank. The experimental results on various size graphs show that the proposed formats can achieve an accuracy of 1e-4, which is an improvement over the state of the art. Due to random memory access patterns in the algorithm, performance improvements over our highly tuned baseline are 1.5% at best. GraphSDH: A General Graph Sampling Framework with Distribution and Hierarchy Jingbo Hu (Tsinghua University)*; Guohao Dai (Tsinghua University); Yu Wang (tsinghua university); Huazhong Yang (Tsinghua University) Large-scale graphs play a vital role in various applications, but it is limited by the long processing time. Graph sampling is an effective way to reduce the amount of graph data and accelerate the algorithm. However, previous work usually lacks theoretical analysis related to graph algorithm models. In this study, GraphSDH (Graph Sampling with Distribution and Hierarchy), a general large-scale graph sampling framework is established based on the vertex- centric graph model. According to four common sampling techniques, we derive the sampling probability to minimize the variance, and optimize the design according to whether there is a pre-estimation process for the intermediate value. In order to further improve the accuracy of the graph algorithm, we propose a stratified sampling method based on vertex degree and a hierarchical optimization scheme based on sampling position analysis. Extensive experiments on large graphs show that GraphSDH can achieve over 95% accuracy for PageRank by sampling only 10% edges of the original graph, and speed up PageRank by several times than that of the non-sampling case. 
Compared with random neighbor sampling, GraphSDH can reduce the mean relative error of PageRank by about 17% at a sampling neighbor ratio (sampling fraction) of 20%. Furthermore, GraphSDH can be applied to various graph algorithms, such as Breadth-First Search (BFS), Alternating Least Squares (ALS) and Label Propagation Algorithm (LPA). Poster Session: 2-P (12:15-15:45 EDT) Scalable Parallel File Write from a Large NUMA System Dong-In Kang (Information Sciences Institute, USC)* As the number of the sockets and cores within a cache-coherent NUMA (Non-Uniform Memory Access) system increases, parallel file I/O performance is affected more by processor affinity and synchronization overhead within the system. In this paper, we report on the performance impact of processor affinity and page caching overhead on parallel write operations to a single shared file on a large NUMA system. We did experiments using two configurations of an HPE Superdome Flex system – one with 12 sockets and the other with 32 sockets, both having 24 cores per socket, OpenMPI, and the Lustre parallel file system. Our results show that processor affinity and page caching overhead can result in large performance variation depending on the number of MPI processes. We observe that page caching is useful for parallel file writes with a small number of MPI processes, but it is not scalable on a large NUMA system. Parallel file writes without page caching are scalable but achieve higher performance only with a large number of MPI processes. Variable Precision Multiplication for Software-Based Neural Networks Richa Singh (Virginia Polytechnic Institute and State University )*; Tom Conroy (Virginia Polytechnic Institute and State University ); Patrick Schaumont (Worcester Polytechnic Institute) As the number of applications of neural networks continues to grow, so does the need to efficiently perform inference computations on highly constrained devices. In this paper, we propose a methodology to accelerate neural networks in software. We exploit the limited-precision requirements of typical neural networks by formulating recurring operations in a bit-slice computation format. Bit-slice computation ensures that every bit of an M-bit processor word contributes useful work even while computing a limited- precision n-bit (with n < M) operation. This paper brings the following contributions. We first present an environment to efficiently create bitslice descriptions in software, by synthesizing them from Verilog. We then develop bitsliced designs of matrix multiplication and evaluate their performance. Our target is a small microcontroller, and we rely solely on software optimization. Our driving application is a neural network classifier for the MNIST database. Range-Based Linear Quantization in symmetric mode quantizes pre-trained 32-bit floating point weights and activation to low-precision data-widths. Experiments on RISC-V with varying levels of hardware-support show that for data-widths common to neural network applications, the bit-sliced code produces a speedup over traditional methods, which leads to faster and efficient inference without incurring significant loss in accuracy. For example, 8-bit matrix multiplications are sped up by a factor of 2.62 times when compared with non-bitsliced rv32i ISA implementation with no hardware multiplier. 
Offline Machine Learning for Human Activity Recognition with Smartphone Yanjia Zhang (University of South Florida)*; Kandethody Ramachandran (University of South Florida) Wearable sensors in smart-phone and wrist tracking devices are widely used in the activity tracking and body monitoring with a low cost. Human activity recognition (HAR) is one of the important applications. Activities identification then is the core part of HAR. In this report, we present a comparison with several popular offline machine learning methods using smartphone data and try to find the most effective model for analyzing these sensor data. The methods include Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Linear Discriminant Analysis (LDA), Decision Tree (DT), Random Forest (RF), and Artificial Neural Network (ANN). Two datasets with different transition methods from smartphone are used. The data includes Walking, Walking Upstairs, Walking Downstairs, Sitting, Standing, Lying, and Jogging. The first dataset has the first 6 activities and shows that the SVM with linear kernel and LDA have the highest test average accuracy of 96.4% and 96.2%, respectively. Decision Tree performs worse than others with test average accuracy of 86.0%. While the second dataset excludes the Lying, but has jogging, and shows that LDA and DT are the most appropriate algorithms, the test average accuracy of these are 100%. KNN is the worse one with 74.3% test average accuracy. Based on all the results, LDA might be the best one for these sensors data. Moreover, the transition method used to reduce noise and extra information in the second data might be better than that in the first one. It has lower dimensions and better classification performance. In order to get these improved accuracy rates, in this paper, we used grid search, multi-fold cross validation, and dimensional reduction method. In addition to just doing the comparison, we also proposed a two-layer method for activity identification. This method is more flexible of choosing classifiers for activities. Execution of Complete Molecular Dynamics Simulations on Multiple FPGAs Carlo Pascoe (Silicon Therapeutics)*; Larry Stewart (Silicon Therapeutics); Woody Sherman (Silicon Therapeutics); Vipin Sachdeva (Silicon Therapeutics); Martin Herbordt (Boston University) We  have  modified  the  open  source  molecular  dynamics (MD) simulation code OpenMM [1] to add support for running complete MD timesteps on a cluster of FPGAs. FPGA kernels are coded in OpenCL for ease of hardware development and application integration. MD proceeds by calculating forces on individual particles and integrating those forces to update velocities and positions on  a  per  timestep  basis.  A  variety  of  forces  apply  to  each particle and we subdivide them into three categories based on the types of computation required: range limited (RL), long range  (LR),  and  bonded.  RL  interactions  comprise  Lennard Jones and electrostatic forces between all particle pairs within a radial cutoff. LR interactions primarily comprise electrostatic forces  beyond  the  RL  cutoff,  where  pairwise  computation would be too costly. We calculate LR forces using the Smooth Particle  Mesh  Ewald  (PME)  method,  which  uses  3D  Fast Fourier Transforms (FFTs) to accelerate computation. Bonded interactions are the focus of future work. 
The ultimate goal of this project is to perform MD simulation of  biologically  relevant  systems  within  the  context  of  drug discovery (i.e., periodic systems of 50,000–100,000 particles) with  strong  scaling  performance  greater  than  possible  with other technologies such as GPUs. Optimizing Use of Different Types of Memory for FPGAs in High Performance Computing Kai Huang (Northeastern University); Mehmet Gungor (Northeastern University); Stratis Ioannidis (Northeastern University); Miriam Leeser (Northeastern University)* Accelerators such as Field Programmable Gate Arrays (FPGAs) are increasingly used in high performance computing, and the problems they are applied to process larger and larger amounts of data.  FPGA manufacturers have added new types of memory on chip to help ease the memory bottleneck; however, the burden is on the designer to determine how data is allocated to different memory types.  We study the use of ultraRAM for a graph application running on Amazon Web Services (AWS) that generates a large amount of intermediate data that is not subsequently accessed sequentially.  We investigate different algorithms for mapping data to ultraRAM.  Our results show that use of ultraRAM can speed up overall application run time by a factor of 3 or more.  Maximizing the amount of ultraRAM used produces the best results, and as problem size grows, judiciously assigning data to ultraRAM vs. DDR results in better performance. 2-2: Graph Analytics & Network Science 2 Session (12:30-13:45 EDT) Fast GPU Graph Contraction by Combining Efficient Shallow Searches and Post-Culling Roozbeh Karimi (NVidia Corporation); David Koppelman (Louisiana State University)*; Chris Michael (Naval Research Laboratory) Efficient GPU single-source shortest-path (SSSP) queries of road network graphs can be realized by a technique called PHAST (Delling et al.) in which the graph is contracted (pre-processed using Geisberger's Contraction Hierarchies) once and the resulting contracted graph is queried as needed. PHAST accommodates GPUs' parallelism requirements well, resulting in efficient queries. For situations in which a graph is not available well in advance or changes frequently contraction time itself becomes significant. Karimi et al. recently described a GPU contraction technique, CU- CH, which significantly reduces the contraction time of small- to medium-sized graphs, reporting a speedup of over 20 times on an NVidia P100 GPU. However CU-CH realizes little speedup on larger graphs, such as DIMACS' USA and W. Europe graphs. The obstacle to faster contraction of larger graphs is the frequently performed witness path search (WPS). A WPS for a node determines which shortcut edges need to be added between the node's neighbors to maintain distances after the removal of the node. GPUs' strict thread convergence requirements and limited scratchpad preclude the bidirectional Dijkstra approach used in CPU implementations. Instead, CU-CH uses a two-hop-limit WPS tightly coded to fit GPU shared storage and to maintain thread convergence. Where two hops is sufficient speedup is high, but for larger graphs the hop limit exacts a toll due to the accumulation of unneeded shortcuts. The problem is overcome here by retaining the efficient CU-CH WPS but using it both for its original purpose and also to identify unnecessary shortcuts added in prior steps. The unnecessary shortcuts are culled (removed). 
Culling shortcuts not only dramatically reduces the time needed to contract a graph but also improves the quality of the contracted graph. For smaller graphs such as DIMACS Cal (travel time), contraction time is 61% faster compared to CU-CH. For the DIMACS Europe and USA graphs, contraction times are 40 times and 12 times faster, respectively. SSSP query times also improve dramatically, approaching those obtained on aggressively contracted graphs. The speedup over Geisberger's CPU code is over 100 times for NVidia V100 GPUs on most graphs tried. Using Graphlet Spectrograms for Temporal Pattern Analysis of Virus-Research Collaboration Networks Dimitris Floros (Aristotle University of Thessaloniki); Tiancheng Liu (Duke University); Nikos Pitsianis (Aristotle University of Thessaloniki)*; Xiaobai Sun (Duke University) We introduce a new method for temporal pattern analysis of scientific collaboration networks. We investigate in particular virus-research activities through five epidemic or pandemic outbreaks in the recent two decades and in the ongoing COVID-19 pandemic. Our method embodies two innovative components. The first is a simple model of temporal collaboration networks, with time segmented in publication time and convolved in citation history, to effectively capture and accommodate collaboration activities at mixed time scales. The second component is the novel use of graphlets to encode topological structures and to detect change and persistence in collaboration activities over time. We discover in particular two unique and universal roles of the bi-fork graphlet in (1) identifying bridges among triangle clusters and (2) quantifying grassroots as the backbone of every collaboration network. We present a number of intriguing patterns and findings about the virus-research activities. Computing PageRank Scores of Web Crawl Data Using DGX A100 Clusters Seunghwa Kang (NVIDIA)*; Alex Fender (NVIDIA); Joe Eaton (NVIDIA); Brad Rees (NVIDIA) PageRank is a widely used graph analytics algorithm to rank vertices using relationship data. Large-scale PageRank is challenging due to its irregular and communication-intensive computational characteristics. We implemented PageRank on NVIDIA’s newly released DGX A100 cluster and compared the performance with two recent notable large-scale PageRank computations using the Common Crawl dataset. The ShenTu framework computed PageRank scores using a large number of custom microprocessors connected with an HPC-class interconnect. The Hronos framework reported the state-of-the-art performance using 3000 commodity CPU nodes and 10 Gbps Ethernet. The Common Crawl dataset captures link relationships between web pages in a graph with 3.563 billion vertices and 128.736 billion edges. Our implementation demonstrated 13x faster PageRank iteration time than the Hronos framework using a cluster with 32 NVLink-connected A100 GPUs. Triangle Counting with Cyclic Distributions Kevin Deweese (University of Washington)*; Jesun Firoz (Pacific Northwest National Laboratory); Andrew Lumsdaine (Pacific Northwest National Lab and University of Washington); Scott McMillan (CMU Software Engineering Institute); Luke D'Alessandro (Indiana University) Triangles are the simplest non-trivial subgraphs, and triangle counting is used in a number of different applications. The order in which vertices are processed in triangle counting strongly affects the amount of work that needs to be done (and thus the overall performance).
Ordering vertices by degree has been shown to be one particularly effective ordering approach. However, for graphs with skewed degree distributions (such as power-law graphs), ordering by degree affects the distribution of work; parallelization must account for this distribution in order to balance work among workers. In this paper we provide an in-depth analysis of the ramifications of degree-based ordering on parallel triangle counting. We present an approach for partitioning work in triangle counting, based on cyclic distribution and some surprisingly simple C++ implementations. Experimental results demonstrate the effectiveness of our approach, particularly for power-law (and social network) graphs. Towards an Objective Metric for the Performance of Exact Triangle Count Mark Blanco (Carnegie Mellon University)*; Scott McMillan (CMU Software Engineering Institute); Tze Meng Low (Carnegie Mellon University) The performance of graph algorithms is often measured in terms of the number of traversed edges per second (TEPS). However, this performance metric is inadequate for a graph operation such as exact triangle counting. In triangle counting, execution times on graphs with a similar number of edges can be distinctly different, as demonstrated by results from past Graph Challenge entries. We discuss the need for an objective performance metric for graph operations and the desired characteristics of such a metric, such that it more accurately captures the interactions between the amount of work performed and the capabilities of the hardware on which the code is executed. Using exact triangle counting as an example, we derive a metric that captures how certain techniques employed in many implementations improve performance. We demonstrate that our proposed metric can be used to evaluate and compare multiple approaches for triangle counting, using a SIMD approach as a case study against a scalar baseline. 2-3: Graph Analytics & Network Science 3 Session (14:15-15:30 EDT) Leveraging Linear Algebra to Count and Enumerate Simple Subgraphs Vitaliy Gleyzer (MIT Lincoln Laboratory)*; Edward Kao (MIT Lincoln Laboratory); Andrew Soszynski (MIT Lincoln Laboratory) Even though subgraph counting and subgraph matching are well-known NP-hard problems, they are foundational building blocks for many scientific and commercial applications. In order to analyze graphs that contain millions to billions of edges, distributed systems can provide computational scalability through search parallelization. One recent approach for exposing graph algorithm parallelization is through a linear algebra formulation and the use of the matrix multiply operation, which conceptually is equivalent to a massively parallel graph traversal. This approach has several benefits, including 1) a mathematically rigorous foundation, and 2) the ability to leverage specialized linear algebra accelerators and high-performance libraries. In this paper, we explore and define a linear algebra methodology for performing exact subgraph counting and matching for 4-vertex subgraphs excluding the clique. Matches on these simple subgraphs can be joined as components for a larger subgraph. With thorough analysis, we demonstrate that the linear algebra formulation leverages path aggregation, which allows it to be up to 2x to 5x more efficient in traversing the search space and compressing the results as compared to tree-based subgraph matching techniques.
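As a toy illustration of the linear-algebra formulation discussed in the two preceding abstracts (matrix multiplication as a massively parallel graph traversal), the following Python/SciPy sketch counts triangles and open wedges of a small undirected graph. It is not drawn from either paper; the function name and example graph are illustrative assumptions only.

    import numpy as np
    from scipy.sparse import csr_matrix

    def count_triangles_and_open_wedges(A):
        # A is the symmetric, 0/1, zero-diagonal sparse adjacency matrix.
        # Entry (i, j) of A @ A counts length-2 paths from i to j; masking
        # with A keeps only paths whose endpoints are adjacent, so the sum
        # counts every triangle 6 times (once per ordered vertex pair).
        triangles = int((A @ A).multiply(A).sum()) // 6
        # A vertex of degree d is the center of d*(d-1)/2 wedges in total.
        deg = np.asarray(A.sum(axis=1)).ravel()
        wedges = int((deg * (deg - 1)).sum()) // 2
        return triangles, wedges - 3 * triangles  # each triangle closes 3 wedges

    # Toy usage: a 4-cycle with one chord has 2 triangles and 2 open wedges.
    edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
    dense = np.zeros((4, 4), dtype=np.int64)
    for i, j in edges:
        dense[i, j] = dense[j, i] = 1
    print(count_triangles_and_open_wedges(csr_matrix(dense)))  # -> (2, 2)

The same multiply-then-mask pattern extends to larger patterns such as the 4-vertex subgraphs treated in the paper, although the actual formulation and optimizations there differ.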
GBTLX: A First Look Sanil Rao (Carnegie Mellon University)* We provide a first look at GBTLX, a code generator that translates GBTL-based graph processing programs into high-performance C graph processing programs that match hand-tuned high-performance implementations. GBTLX refactors code written using GBTL into problems that capture the signature of algorithms and solvers that capture the semantics (input/output behavior) of algorithms. Users provide classes that implement these two aspects using standard GBTL functions and encapsulate the targeted algorithm. GBTLX then performs a sequence of inspection, code generation, and high-performance execution. First, the user code is traced while run against the original GBTL. Then, the trace is used to define the semantics and signature of the algorithm for use in code generation. The SPIRAL system is used to generate high-performance C code that implements the user-specified algorithm, specializing the code for algorithm-dependent optimizations. Finally, the user-provided GBTL-based implementation is replaced by the SPIRAL-generated C code. For triangle counting and k-truss enumeration, the resulting executables provide performance equivalent to the hand-tuned implementations, while the source code remains maintainable as it only uses the C++ GBTL library. GraphBLAS Programmability: Python and MATLAB Interfaces Timothy Mattson (Intel)*; Michel Pelletier (Graphegon); Timothy Davis (Texas A&M University) A graph can be represented by a sparse array. In that representation, the basic operations used to construct graph algorithms are linear algebra operations over algebraic semirings. The GraphBLAS forum was established in 2013 to define a set of Basic Linear Algebra Subprograms for Graph Algorithms. It started with a mathematical definition of the GraphBLAS followed by a binding to the C programming language. The result was the GraphBLAS C specification, currently at version 1.3.1. A robust implementation of the GraphBLAS C specification is available as the SuiteSparse:GraphBLAS library. The primary motivation of the GraphBLAS was to provide a high-performance foundation for graph algorithms. Many members of the GraphBLAS forum were also motivated by the simplicity of graph algorithms expressed in terms of linear algebra. This simplicity, however, is difficult to fully appreciate when the programming interface is the C programming language. To see the full expressive power of the GraphBLAS, a high-level interface is needed so that the elegance of the mathematical underpinnings of the GraphBLAS is apparent. In this paper we present Python (pygraphblas) and MATLAB (@GrB) interfaces to the SuiteSparse implementation of the GraphBLAS. We use a subset of the GAP benchmark suite to demonstrate the simplicity of these GraphBLAS interfaces and show that in most cases, little performance is sacrificed by moving to the more productive Python and MATLAB interfaces. On the Feasibility of Using Reduced-Precision Tensor Core Operations for Graph Analytics Jesun Firoz (Pacific Northwest National Laboratory)*; Ang Li (Pacific Northwest National Laboratory); Jiajia Li (Pacific Northwest National Laboratory); Kevin Barker (PNNL) Today’s data-driven analytics and machine learning workloads have been largely driven by General-Purpose Graphics Processing Units (GPGPUs). To accelerate dense matrix multiplications on the GPUs, Tensor Core Units (TCUs) have been introduced in recent years.
In this paper, we study linear-algebra-based and vertex-centric algorithms for various graph kernels on the GPUs with the objective of applying this new hardware feature to graph applications. We identify the potential stages in these graph kernels that can be executed on the Tensor Core Units. In particular, we leverage the reformulation of the reduction and scan operations in terms of matrix multiplication [1] on the TCUs. We demonstrate that executing these operations on the TCUs, where available inside different graph kernels, can assist in establishing an end-to-end pipeline on the GPGPUs without depending on hand-tuned external libraries and can still deliver comparable performance for various graph analytics. Efficient Sparse Matrix-Vector Multiplication on Intel PIUMA Architecture Sriram Aananthakrishnan (Intel)*; Robert Pawlowski (Intel); Joshua Fryman (Intel); Ibrahim Hur (Intel Corporation) Graph analytics has many real-world applications such as cyber-security, detecting fraudulent activity, and understanding disease spread. Sparse matrix dense vector (SpMV) multiplication is an indispensable kernel in large-scale graph analytics. SpMV’s performance is sensitive to input sparsity, and its performance has been observed to be poor on scale-free graphs. Intel PIUMA is a novel architecture tailored for graph analytics. PIUMA addresses the challenges in graph analytics through scalable network and memory design along with novel features for offloading memory accesses and synchronization. We report on an early performance study of SpMV on the Intel PIUMA architecture. Our results show strong scalability of SpMV on scale-free graphs, with a peak simulated performance of 109 GFlops on a scale-free graph with 16 million nodes and 0.5 billion edges. 2-4: Big Data & Distributed Computing 1 Session (15:45-17:00 EDT) A Framework for Task Mapping onto Heterogeneous Platforms Ta-Yang Wang (University of Southern California)*; Ajitesh Srivastava (University of Southern California); Viktor Prasanna (University of Southern California) While heterogeneous systems provide considerable opportunities for accelerating big data applications, the variation in processing capacities and communication latency of different resources makes it challenging to effectively map the applications onto the platform. To generate an optimized mapping of the input application on a variety of heterogeneous platforms, we design a flexible annotated task interaction graph based framework which 1) allows modeling of mixed CPU and GPU architectures, and 2) identifies an efficient task-hardware mapping of the input application, given the dependencies and communication costs between the tasks that constitute the application. The annotated task interaction graph (ATIG) representation captures all the information that is necessary to execute the application as well as meta-data, such as performance models for estimating runtime on a target resource and communication latencies. Our framework supports solving the problem of mapping tasks in the ATIG onto available resources by including variations of a greedy algorithm and LP relaxations with rounding. We show that our framework can achieve high speedup, allowing domain experts to efficiently compile a broad set of programs to parallel and heterogeneous hardware.
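To make the greedy end of the design space sketched in the preceding abstract concrete, here is a minimal, self-contained Python heuristic that maps a small task graph onto heterogeneous resources by earliest estimated finish time. The task graph, runtime estimates, and communication costs below are invented for illustration and do not reflect the authors' ATIG models, API, or implementation.

    def greedy_map(tasks, deps, runtime, comm, resources):
        # tasks: task ids in a valid topological order
        # deps[t]: list of predecessor tasks of t
        # runtime[(t, r)]: estimated execution time of task t on resource r
        # comm[(ra, rb)]: transfer latency per dependency crossing ra -> rb
        placement, finish = {}, {}
        ready = {r: 0.0 for r in resources}      # time each resource frees up
        for t in tasks:
            best = None
            for r in resources:
                # Task t can start once resource r is free and all inputs arrived.
                arrivals = [finish[p] + (comm[(placement[p], r)] if placement[p] != r else 0.0)
                            for p in deps.get(t, [])]
                end = max([ready[r]] + arrivals) + runtime[(t, r)]
                if best is None or end < best[0]:
                    best = (end, r)
            end, r = best
            placement[t], finish[t], ready[r] = r, end, end
        return placement

    # Toy example: a three-task chain on two resources ("cpu", "gpu").
    resources = ["cpu", "gpu"]
    tasks = ["load", "train", "report"]
    deps = {"train": ["load"], "report": ["train"]}
    runtime = {("load", "cpu"): 1.0, ("load", "gpu"): 2.0,
               ("train", "cpu"): 10.0, ("train", "gpu"): 2.0,
               ("report", "cpu"): 0.5, ("report", "gpu"): 1.5}
    comm = {("cpu", "gpu"): 0.5, ("gpu", "cpu"): 0.5}
    print(greedy_map(tasks, deps, runtime, comm, resources))
    # -> {'load': 'cpu', 'train': 'gpu', 'report': 'cpu'}

An LP-relaxation-with-rounding variant, as mentioned in the abstract, would instead optimize over fractional task-to-resource assignments and round the solution.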
Best of Both Worlds: High Performance Interactive and Batch Launching Chansup Byun (MIT Lincoln Laboratory)* Rapid launch of thousands of jobs is essential for effective interactive supercomputing, big data analysis, and AI algorithm development. Achieving thousands of launches per second has required hardware to be available to receive these jobs. This paper presents a novel preemptive approach to implement “spot” jobs on MIT SuperCloud systems, allowing the resources to be fully utilized for long-running batch jobs while still providing fast launch for interactive jobs. The new approach separates the job preemption and scheduling operations and can achieve 100 times faster performance in the scheduling of a job with preemption when compared to using the standard scheduler-provided automatic preemption-based capability. The results demonstrate that the new approach can schedule interactive jobs preemptively at a performance comparable to when the required computing resources are idle and available. The spot job capability can be deployed without disrupting the interactive user experience while increasing the overall system utilization. Approximate Inverse Chain Preconditioner: Iteration Count Case Study for Spectral Support Solvers M. Harper Langston (Reservoir Labs, Inc)*; Meifeng Lin (Brookhaven National Laboratory); Eric Papenhausen (Akai Kaeru); Pierre-David Letourneau (Reservoir Labs, Inc.); Julia Wei (Reservoir Labs, Inc.); Larry Weintraub (Reservoir Labs, Inc.); Mitchell Harris (Reservoir Labs, Inc.); Richard Lethin (Reservoir Labs) As the growing availability of computational power slows, there has been an increasing reliance on algorithmic advances. However, faster algorithms alone will not necessarily bridge the gap in allowing computational scientists to study problems at the edge of scientific discovery in the next several decades. Often, it is necessary to simplify or precondition solvers to accelerate the study of large systems of linear equations commonly seen in a number of scientific fields. Preconditioning a problem to increase efficiency is often seen as the best approach; yet, preconditioners which are fast, smart, and efficient do not always exist. Following the progress of our paper last year at HPEC on the Combinatorial Multigrid, we present a new preconditioner for symmetric diagonally dominant (SDD) systems of linear equations. These systems are common in certain PDEs, network science, and supervised learning, among others. Based on spectral support graph theory, this new preconditioner builds on the work of Richard Peng and Dan Spielman, computing and applying a V-cycle chain of approximate inverse matrices. This preconditioner approach is both algebraic in nature and hierarchically constrained, depending on the condition number of the system to be solved. Due to its generation of an Approximate Inverse Chain of matrices, we refer to this as the AIC preconditioner. We further accelerate the AIC preconditioner by utilizing precomputations to simplify setup and multiplications in the context of an iterative Krylov-subspace solver. While these iterative solvers can greatly reduce solution time, the number of iterations can grow large quickly in the absence of good preconditioners. Initial results for the AIC preconditioner have shown a very large reduction in iteration counts for SDD systems as compared to standard preconditioners such as Incomplete Cholesky (ICC) and Multigrid (MG).
We further show significant reduction in iteration counts against the more advanced Combinatorial Multigrid (CMG) preconditioner. We have further developed no-fill sparsification techniques to ensure that the computational cost of applying the AIC preconditioner does not grow prohibitively large as the depth of the V-cycle grows for systems with larger condition numbers. Our numerical results have shown that these sparsifiers maintain the sparsity structure of our system while also displaying significant reductions in iteration counts. Accelerating Distributed Inference of Sparse Deep Neural Networks via Mitigating the Straggler Effect Mohammad Hasanzadeh Mofrad (University of Pittsburgh)*; Rami Melhem (University of Pittsburgh); Mohammad Hammoud (Carnegie Mellon University); Muhammad Yousuf Ahmad (Carnegie Mellon University in Qatar) Once a Deep Neural Network (DNN) is trained, an inference algorithm retains the learning and applies it to batches of data. The trained DNN can be sparse as a result of pruning or following a preset sparse connectivity pattern. Inference in such sparse networks requires lower space and time complexity compared to dense ones. Similar to dense DNNs, sparse DNNs can be parallelized using model or data parallelism, whereby the former partitions the network and the latter partitions the input among multiple threads. Model parallelism efficiently utilizes the Last Level Cache (LLC) but has a heavy synchronization cost because of compulsory reductions per layer. In contrast, data parallelism allows independent execution of partitions but suffers from a straggler effect due to load imbalance among partitions. We combine data and model parallelism through a new type of parallelism that we denote data-then-model. In data-then-model, each thread starts with data parallelism, thus mitigating the per-layer synchronization cost of model parallelism. After it finishes its partition, it switches to model parallelism to support a slower active thread, hence alleviating the straggler effect of data parallelism. We compare data-then-model parallelism with data and model parallelism as well as task-based parallelism using the IEEE HPEC sparse DNN challenge dataset. On average, we achieve a 10 to 65% speedup compared to these parallelisms. Distributed Non-Negative Tensor Train Decomposition Manish Bhattarai (Los Alamos National Labs)*; Gopinath Chennupati (Los Alamos National Laboratory); Erik Skau (Los Alamos National Laboratory); Raviteja Vangara (Los Alamos National Laboratory); Hristo Djidjev (Los Alamos National Laboratory); Boian Alexandrov (LANL) The era of exascale computing opens new avenues for innovations and discoveries in many scientific, engineering, and commercial fields. However, with the exaflops also come the extra-large high-dimensional data generated by high-performance computing. High-dimensional data is presented as multidimensional arrays, aka tensors. The presence of latent (not directly observable) structures in the tensor allows a unique representation and compression of the data by classical tensor factorization techniques. However, the classical tensor methods are not always stable or they can be exponential in their memory requirements, which makes them unsuitable for high-dimensional tensors. Tensor train (TT) is a state-of-the-art tensor network introduced for the factorization of high-dimensional tensors.
TT transforms the initial high-dimensional tensor into a network of three-dimensional tensors that requires only linear storage. Many real-world data, such as density, temperature, population, and probability, are non-negative, and for easy interpretation, algorithms preserving non-negativity are preferred. Here, we introduce a distributed non-negative tensor-train method and demonstrate its scalability and compression on synthetic and real-world big datasets. 2-S2: Remote Sensing for Disaster Relief Special (17:30-19:30 EDT) Fast Mapping onto Census Blocks Jeremy Kepner (MIT Lincoln Laboratory)*; Darren Engwirda (Columbia University); Vijay Gadepally (MIT Lincoln Laboratory); Chris Hill (MIT); Tim Kraska (MIT); Michael Jones (MIT Lincoln Laboratory); Andreas Kipf (MIT); Lauren Milechin (MIT); Navin Vembar (Camber Systems) Pandemic measures such as social distancing and contact tracing can be enhanced by rapidly integrating dynamic location data and demographic data. Projecting billions of longitude and latitude locations onto hundreds of thousands of highly irregular demographic census block polygons is computationally challenging in both research and deployment contexts. This paper describes two approaches labeled “simple” and “fast”. The simple approach can be implemented in any scripting language (Matlab/Octave, Python, Julia, R) and is easily integrated and customized to a variety of research goals. This simple approach uses a novel combination of hierarchy, sparse bounding boxes, polygon crossing-number, vectorization, and parallel processing to achieve 100,000,000+ projections per second on 100 servers. The simple approach is compact, does not increase data storage requirements, and is applicable to any country or region. The fast approach exploits the thread, vector, and memory optimizations that are possible using a low-level language (C++) and achieves similar performance on a single server. This paper details these approaches with the goal of enabling the broader community to quickly integrate location and demographic data. Train and Deploy an Image Classifier for Disaster Response Jianyu Mao (The Pennsylvania State University)*; Kiana Harris (The Pennsylvania State University); Nae-Rong Chang (The Pennsylvania State University); Caleb Pennell (The Pennsylvania State University); Yiming Ren (The Pennsylvania State University) With deep learning image classification becoming more powerful each year, it is apparent that its introduction to disaster response will increase the efficiency with which responders can work. Using several neural network models, including AlexNet, ResNet, MobileNet, DenseNets, and a 4-layer CNN, we have classified flood disaster images from a large image data set with up to 79% accuracy. Our models and tutorials for working with the data set have created a foundation for others to classify other types of disasters contained in the images. Integrating Multiple Deep Learning Models to Classify Disaster Scene Videos Yuan Li (University of North Texas)*; Haili Wang (University of North Texas); Shuo Sun (University of North Texas); Bill Buckles (University of North Texas) It is well known that the long-term global trend of an increasing number of natural disasters, accompanied by increasing loss of life and property, continues.
The fifth assessment report of the Intergovernmental Panel on Climate Change (IPCC, 2014) predicts that as global warming continues in the coming decades, its contribution to the increase in natural disaster losses will become more prominent. However, through rapid and accurate analysis of disaster scenarios, there is still an opportunity to significantly reduce the catastrophic losses caused by extreme events. From a video, we extract key frames and identify embedded objects (using YOLOv3). The densely labeled images are then given a global label using various VGG16 tools. Classical quality measures (accuracy, precision, and recall) will guide subsequent development directions. A Hierarchical Auto-Labeling Deep Neural Network for Disaster Scene Videos Shuo Sun (University of North Texas)*; Yuan Li (University of North Texas); Haili Wang (University of North Texas); Bill Buckles (University of North Texas) As the climate changes, the number of extreme natural disasters that may cause heavy losses of life and property is continuously increasing, which is becoming an ever more severe problem for modern society. To respond faster to natural disasters, computer-aided detection and analysis methods are urgently required. However, current popular computer vision technologies cannot fully realize their capabilities due to the shortage of datasets with proper disaster-related labels, let alone the huge volume and complexity of the data. To address these problems, using the Low Altitude Disaster Imagery (LADI) dataset, we develop a hierarchical auto-labeling deep neural network that can automatically generate multiple disaster labels, which also provides a foundation for disaster scene description and higher-level analysis.
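As a minimal sketch of the hierarchical multi-label idea described in the preceding abstract (and not the authors' network), the following PyTorch snippet wires a small shared backbone to a coarse-category head and a fine-label head conditioned on the coarse logits; all class counts, layer sizes, and names are illustrative assumptions.

    import torch
    import torch.nn as nn

    class HierarchicalLabeler(nn.Module):
        def __init__(self, num_coarse=4, num_fine=16):
            super().__init__()
            self.backbone = nn.Sequential(        # tiny stand-in for a real CNN backbone
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.coarse_head = nn.Linear(32, num_coarse)
            # The fine head also sees the coarse logits, so fine labels are
            # conditioned on the predicted coarse disaster category.
            self.fine_head = nn.Linear(32 + num_coarse, num_fine)

        def forward(self, x):
            feats = self.backbone(x)
            coarse = self.coarse_head(feats)
            fine = self.fine_head(torch.cat([feats, coarse], dim=1))
            return coarse, fine

    model = HierarchicalLabeler()
    frames = torch.randn(2, 3, 224, 224)          # e.g. two extracted key frames
    coarse_logits, fine_logits = model(frames)
    # Multi-label outputs: apply a sigmoid and a per-label threshold downstream.
    print(coarse_logits.shape, fine_logits.shape) # torch.Size([2, 4]) torch.Size([2, 16])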
Tuesday, September 22