2022 IEEE High Performance Extreme Computing Virtual Conference 19 - 23 September 2022
Thursday, September 22 4-V Session (10:30-11:00) Co-Chairs: Albert Reuther 4-1: ASIC and FPGA Advances 1 Session (11:00-12:15) Co-Chairs: Sadas Shankar & Ken Cain Invited Talk: GNN Acceleration on FPGAs Prof. Viktor Prasanna (USC) Flexible Hardware Accelerator Design Generation with SPIRAL Guanglin Xu; James Hoe; Franz Franchetti (Carnegie Mellon Univ.) Hardware specialization has become a widely employed technique for approaching higher performance and energy efficiency in computer systems. Yet obtaining efficient custom hardware designs remains a challenging and tedious task, calling for the automated approaches. In the past, Spiral has been used for generating high-throughput streaming hardware designs for linear transform kernels. This paper is motivated by an observation that a memory-based iterative computing model may allow us to trade off throughput for algorithmic flexibility. In this paper, we present a hardware generation approach that generates and optimizes algorithms using Spiral's multi-level domain-specific languages (DSLs), targeting a scalar load-store architecture. We have incorporated this approach as a hardware backend into the Spiral system. Our evaluation of this approach on several fundamental kernels shows flexibility with reasonable performance and resource utilization. On the Characterization of the Performance-Productivity Gap for FPGA Atharva Gondhalekar; Thomas Twomey; Wu-chun Feng (Virginia Tech) Today, FPGA vendors provide a C++/C-based programming environment to enhance programmer productivity over using a hardware-description language at the register-transfer level. The common perception is that this enhanced productivity comes at the expense of significantly less performance, e.g., as much an order of magnitude worse.  To characterize this performance- productivity tradeoff, we propose a new composite metric, Π, that quantitatively captures the perceived discrepancy between the performance and productivity of any two given FPGA programming languages, e.g., Verilog vs. OpenCL. We then present the implications of our metric via a case study on the design of a Sobel filter (i.e., edge detector) using three different programming models — Verilog, OpenCL, oneAPI — on an Intel Arria 10 GX FPGA accelerator. Relative to performance, our results show that an optimized OpenCL kernel achieves 84% of the performance of an optimized Verilog version of the code on a 7680×4320 (8K) image. Conversely, relative to productivity, OpenCL offers a 6.1× improvement in productivity over Verilog, while oneAPI improves the productivity by an additional factor of 1.25× over OpenCL. Optimizing Designs Using Several Types of Memories on Modern FPGAs Mehmet Gungor; Kai Huang; Stratis Ioannidis; Miriam Leeser (Northeastern Univ.) Modern FPGAs targeting data centers are designed to accelerate problems with large data.  They offer many different types of memory including on-chip and on-board memories.  A recent addition is High Bandwidth Memory (HBM), whose advantages have been demonstrated by others.  However, there is little research that looks at how interactions among different memory types impact application performance.  We investigate how a combination of HBM and on-chip memory (BRAM or URAM) impact clock rate and overall application latency.  In these designs, the on-chip memory is used as an on-chip cache for the larger amounts of data stored in HBM.  Our experiments show that as the size of data stored in BRAM or URAM increases, the achievable clock speed is reduced.  This in turn may result in degraded performance.  We examine Garbled Circuits, an implementation of Secure Function Evaluation (SFE) with high memory demands and out-of-order data access, and examine how different choices of BRAM, URAM and HBM usage alters its performance.  Performance Modeling Sparse MTTKRP Using Optical Static Random Access Memory on FPGA Sasindu Wijeratne (Univ. of Southern California); Akhilesh Jaiswal; Ajey Jacob (USC-ISI); Bingyi Zhang; Viktor K Prasanna (Univ. of Southern California) Electrical static random memory (E-SRAM) is the current standard for internal static memory in Field Programmable Gate Array (FPGA). Despite the dramatic improvement in E-SRAM technology over the past decade, the goal of ultra-fast, energy-efficient static random memory has yet to be achieved with E-SRAM technology. However, preliminary research into optical static random access memory (O-SRAM) has shown promising results in creating energy-efficient ultra-fast static memories. This paper investigates the advantage of O-SRAM over E-SRAM in access speed and energy performance while executing sparse Matricized Tensor Times Khatri-Rao Product (spMTTKRP). spMTTKRP is an essential component of tensor decomposition algorithms which is heavily used in data science applications. The evaluation results show O-SRAMs can achieve speeds of 1.1x - 2.9x while saving 2.8x - 8.1x energy compared to conventional E-SRAM technology. Tutorial Session: 4-T (12:15-15:45): Exploring Graph Analysis for HPC with Near-Memory Accelerators Organizer(s): Jeffrey Young (Georgia Tech), Patrick Lavin (Georgia Tech), Jason Riedy (Lucata), Srinivas Eswar (Georgia Tech) 4-2: ASIC and FPGA Advances 2 Session (12:30-13:45) Co-Chairs: Darrell Ricke & Siddarth Samsi A High Throughput Hardware Accelerator for FFTW Codelets: A First Look Larry Tang; Siyuan Chen; Keshav Harisrikanth; Guanglin Xu; Ken Mai; Franz Franchetti (Carnegie Mellon Univ.) The Fast Fourier Transform (FFT) is a critical computation for numerous applications in science and engineering. Its implementation has been widely studied and optimized on various computing platforms, with the FFTW library becoming the standard interface in HPC. In this work, we propose hardware acceleration of the FFTW library by putting a software codelet into hardware. The hardware is exposed to the user through an FFTW-compatible software library while actual computation takes place behind the scenes on a custom accelerator. To demonstrate a first look at this idea, we design a high throughput accelerator for FFTW twiddle codelets. The FFT hardware is automatically generated using SPIRAL and a test chip is fabricated in a TSMC 28nm process. We provide measured results of the test chip and discuss many opportunities for future work. Modeling the Energy Efficiency of GEMM using Optical Random Access Memory Bingyi Zhang; Akhilesh Jaiswal; Clynn Mathew; Ravi Teja Lakkireddy; Ajey Jacob; Sasindu Wijeratne; Viktor K. Prasanna (Univ. of Southern California) General matrix-matrix multiplication (GEMM) is the key computation kernel in many applications. GEMM has been supported on various hardware platforms, including CPU, GPU, FPGA. To optimize the performance of GEMM, developers use on-chip electrical static random access memory (E-SRAM) to exploit the data locality of GEMM. However, intensively accessing E- SRAM for GEMM can lead to significant energy consumption, which is not energy-efficient for commercial data centers.     In this paper, we evaluate the optical static random access memory (O-SRAM) for GEMM. O-SRAM is a promising technology that has extremely low access latency and low energy consumption compared with the traditional E-SRAM.  First, we propose an O-SRAM based wafer-scale system for GEMM and a baseline E-SRAM based system. Second, we build the theoretical performance models of the two systems to analyze their energy consumption of on-chip memory accesses. Then, we conduct simulation-based experiments to evaluate the energy consumption of the two system. The evaluation results show that O-SRAM based system is $7\times$ more energy efficient than the baseline E-SRAM based system. Edge-Connected Jaccard Similarity for Graph Link Prediction on FPGA Paul Sathre; Atharva Gondhalekar; Wu-chun Feng (Virginia Tech) Graph analysis is a critical task in many fields, such as social networking, epidemiology, bioinformatics, and fraud detection. In particular, understanding and inferring relationships between graph elements lies at the core of many graph-based workloads. Real-world graph workloads and their associated data structures create irregular computational patterns that complicate the realization of high-performance kernels. Given these complications, there does not exist a de facto “best” architecture, language, or algorithmic approach that simultaneously balances performance, energy efficiency, portability, and productivity. In this paper, we realize different algorithms of edge-connected Jaccard similarity for graph link prediction and characterize their performance across a broad spectrum of graphs on an Intel Stratix 10 FPGA. By utilizing a high-level synthesis (HLS)-driven, high-productivity approach (via the C++-based SYCL language) we rapidly prototype two implementations – a from-scratch edge-centric version and a faithfully-ported commodity GPU implementation – which would have been intractable via a hardware description language. With these implementations, we further consider the benefit and necessity of four HLS-enabled optimizations, both in isolation and in concert — totaling seven distinct synthesized hardware pipelines. Leveraging real-world graphs of up to 516 million edges, we show empirically-measured speedups of up to 9.5× over the initial HLS implementations when all optimizations work in concert. Design and Implementation of a Real-Time Parallel FFT for a Wideband Radar System on an FPGA Lakshmi Pradeep Bheema; Rishu Anand; Pavan Vadakattu; Syed Azemuddin (IIIT Hyderabad); Aquibuddin Ahmed (RCI-DRDO) An FFT is an essential algorithm for radar signal processing in a radar system. Due to increase in computational power of FPGAs, it is possible to perform FFT operation onboard in an airborne vehicle. However, the FPGA resources have become a limitation for processing real-time signals using conventional methods. To address this issue, we have proposed a parallel pipelined FFT architecture that can achieve very high throughput with very low latency, making it capable of processing real-time continuous data. This architecture is implemented in a radar system, which works from L band to Ku band. In this radar system, the received RF signal is downconverted into an IF signal of 1 GHz frequency with a 500 MHz bandwidth and converted to digital data using a 10-bit ADC. On the converted digital data, a 512-point FFT is implemented on a Xilinx Virtex-7 XC7VX485T FPGA using 8 parallel channels with 64 data frames and is compared with the conventional IP core-based architecture. The proposed architecture takes 1.307µs to implement FFT, which is 5.15 times faster than the IP core-based architecture and requires fewer arithmetic computations. The overall total number of complex multiplications, complex additions, multipliers & adders were reduced by 10.42%, 30.64%, 10.42% & 23.90% respectively. Apart from very low latency and fewer arithmetic operations, the proposed parallel FFT architecture achieved a throughput of 1.350 Giga Samples per second (Gsps). Challenges Designing for FPGAs Using High-Level Synthesis Clayton J Faber; Steven D Harris; Zhili Xiao; Roger Chamberlain (Washington Univ. in St. Louis); Anthony M Cabrera (Oak Ridge National Laboratory) High-Level Synthesis (HLS) tools are aimed at enabling performant FPGA designs that are authored in a high-level language. While commercial HLS tools are available today, there is still a substantial performance gap between most designs developed via HLS relative to traditional, labor intensive approaches. We report on several cases where an anticipated performance improvement was either not realized or resulted in decreased performance. These include: programming paradigm choices between data parallel vs. pipelined designs; dataflow implementations; configuration parameter choices; and handling odd data set sizes. The results point to a number of improvements that are needed for HLS tool flows, including a strong need for performance modeling that can reliably guide the compilation optimization process. 4-3: ASIC and FPGA Advances 3 Session (14:15-15:30) Co-Chairs: Sanmukh Rao Kuppannagari & Siddarth Samsi FPGA Acceleration of Fully Homomorphic Encryption over the Torus Tian Ye (Univ. of Southern California); Rajgopal Kannan (Army Research Lab-West); Viktor K. Prasanna (Univ. of Southern California) Fully Homomorphic Encryption over the Torus (TFHE) is a promising approach for secure computing in cloud servers to perform computations directly on encrypted data. However, TFHE has much higher computation complexity than its unencrypted counterpart. In this work, we propose an FPGA accelerator for TFHE computations. We illustrate the effects of an optimization called bootstrapping key unrolling on the tradeoff between performance of bootstrapping and FPGA resource consumption. We customize the data layout of TFHE ciphertext to optimize data access and improve data reuse.  We parameterize the FPGA design for TFHE bootstrapping, which can be configured to achieve high performance for different user-specified security requirements and given FPGA resources. We implement our design on a state-of-the-art FPGA and compare it with existing results on CPUs. Our implementation for TFHE bootstrapping achieves 216x improvement in throughput and 16.5x improvement in latency compared with the software baseline on a state-of-the-art CPU server. Optimizing Open-source FPGA CAD Tools Shachi Vaman Khadilkar; Martin Margala (UMass Lowell) The development of open-source FPGA CAD tools is challenging. Vendor CAD tools exist but are closed-source and specific to each FPGA family. These tools give limited opportunity for customizing and also have the disadvantage of long compile times. Several open source tools have been built, but the quality of results from these tools is still far behind commercial tools. This research is aimed at bridging some of the gap between commercial and open-source FPGA CAD tools by tuning policies of existing implementation algorithms. We build a set of synthetic benchmarks to identify the mapping between input circuit patterns and policies of a given CAD algorithm and architecture. This work discusses the framework for policy tuning and is aimed at showing that algorithm policies need to adapt to the input circuit. We demonstrate our framework with the packing step of the FPGA CAD flow and show how our approach can be used to improve the quality of results in open-source tools. LIMA: Hardware for FFT based large integer multiplication James Nguyen; Michael Cai; Ziyi Zuo; Larry Tang; Ken Mai; Franz Franchetti (Carnegie Mellon Univ.) Applications in theorem proving and cryptography heavily rely on the multiplication of large integer values. Utilizing properties of the Fourier Transform, one can multiply two values in O(n log n) as compared to the traditional O(n^2) grade school algorithm. In this work, we propose a FFT-based multiplication algorithm that can be computed O(n log n) time, utilizing interval arithmetic to circumvent the real number uncertainty. This work will also present hardware for FFT based large integer multiplication, LIMA, which is a test chip implementing of a portion of this algorithm fabricated on the TSMC 28 nm process. We aim to demonstrate LIMA as a high throughput implementation of this portion of the algorithm, with the intent of being able to run at a 1 GHz with a area of 1.9 mm^2.  LIMA utilizes custom double precision floating point and FIFO architectures optimized for area and high clock speed operation. Towards a Generic UVM Kholoud Mahmoud; Randa Ahmed; Karim Ayman; Mostafa Ayman; Waleed Taie; Yasser Ibrahim; Hassan Mostafa (Cairo University); Khaled Salah (Siemens) In the modern era of CPU complexity advancements, Processor veriﬁcation has always been an ever-increasing challenge. The gap between what a verification plan can offer nowadays and the current technology requirements is constantly widened. Despite many efforts on perfecting “Golden-verification-models” during the design phase, and adoption of object-oriented programming into the whole process; many industry experts still consider solo verification test benches as an extreme, time- consuming barricade that leads to a longer time-to-market and a questionable continuity of the current veriﬁcation process. The Universal Verification Methodology (UVM), came in action as a literal savior to the whole verification community, by offering a merge between SystemVerilog and SystemC into one environment that is completely standardized, constrained, and reusable, allowing a powerful veriﬁcation methodology to a wide range of design sizes and types. The main contribution that this project introduces is implementing a generic UVM, in other words, building one veriﬁcation environment that can be used to accommodate many RTL designs (Soft Processors), having not only different Instruction Set Architectures (ISAs) -of the same categories-, but also different techniques/mechanisms handling the pipeline infrastructures. The proposed generic UVM (GUVM) structure permits the targeted user to attach any soft processor (core) having nearly the same micro-architecture to our test bench, and monitor both: CPU internal behavior and the complete flow of all supported instructions. How to Prevent a Sick ASIC William F. Ellersick (Analog Circuit Works) High performance computing systems increasingly require mixed-signal ASICs to achieve competitive speed, power efficiency and cost. The integration of processing, transceivers, sensors and power management results in dramatic reductions in size, which can yield great savings in power, enabling higher performance. However, few design elements demand such high quality as a mixed-signal ASIC. In this paper, actual near-disasters from decades of integrated circuit design are presented along with methods to prevent potentially severe damage to projects, careers, and even companies. Such stories of failure are rarely told, but acknowledging them is crucial to avoiding repeating the mistakes and to reducing ASIC development risk and ultimately ensuring success. Key takeaways include planning for failure with designed-in observability, controllability and workarounds; the use of simple and robust circuits; and that organizing the people can be as important and challenging as arranging the transistors. 4-4: General Purpose GPU Computing 2 Session (15:45-17:00) Co-Chairs: Hameed Badawy & Mark Barnell Optimal GPU Frequency Selection using Multi-Objective Approaches for HPC Systems Ghazanfar Ali (Texas Tech Univ.); Sridutt Bhalachandra; Nicholas Wright (Lawrence Berkeley National Laboratory); Mert Side; Yong Chen (Texas Tech Univ.) Power consumption poses a significant challenge in current and emerging GPU-enabled high-performance computing (HPC) systems. In modern GPUs, controls like dynamic voltage frequency scaling (DVFS), among others, exist to regulate power consumption. Due to varying computational intensities and the availability of a wide range of frequency settings, selecting the optimal frequency configuration for a given GPU workload is non-trivial. Applying a power control with the single objective of reducing power may cause performance degradation, leading to more energy consumption. In this study, we characterize and identify GPU utilization metrics that influence both the power and execution time of a given workload. Analytical models for power and execution time are then proposed using the characterized feature set. Multi-objective functions (i.e., energy-delay product (EDP) and ED2P) are used to select an optimal GPU DVFS configuration for a workload such that power consumption is reduced with no or negligible degradation in performance. The evaluation was conducted using SPEC ACCEL benchmarks on NVIDIA GV100 GPU. The proposed power and performance analytical models demonstrated prediction accuracies of up to 99.2% and 98.8%, respectively. On average, the benchmarks showed 28.6% and 25.2% energy savings using EDP and ED2P approaches, respectively, without performance degradation. Furthermore, the proposed models require metric collection at only the maximum frequency rather than all supported DVFS configurations. GPU-Accelerated High-Bandwidth Radar Centroiding David J Brigada; Maximilian Merfeld; Kara Warner (MIT Lincoln Laboratory) Radar signal processing is a computationally intensive task, especially for high-bandwidth systems.  Traditionally, such systems have relied on the interleaving of processing on multiple nodes of large compute clusters to achieve the necessary throughput.  Development in general-purpose GPU computing has led to a massive increase in the computational power available to highly parallel tasks.  Most parts of the radar signal processing pipeline are well suited for such a task.  This paper describes an algorithm for centroiding, a key part of the search radar pipeline that has not yet been demonstrated on a GPU.  With this centroiding algorithm, the entire high-data-rate portion of the processing pipeline can be run on the GPU, yielding a speedup factor of approximately 40.  The primary benefit of this approach is a massive reduction in data copying from the GPU to the CPU—a factor of over 1200 in this case—alleviating the main barrier to GPU-based radar processing systems. A Hierarchical Jacobi Iteration for Structured Matrices on GPUs using Shared Memory Mohammad Shafaet Islam; Qiqi Wang (MIT) This paper presents an algorithm to accelerate the Jacobi iteration for solving linear systems of equations arising from structured problems on graphics processing units (GPUs). Acceleration is achieved by utilization of on-chip GPU shared memory via a domain decomposition procedure. In particular, the problem domain is partitioned into subdomains whose data is copied to the shared memory of each GPU block. Jacobi iterations are performed internally within each block's shared memory while avoiding expensive global memory accesses every iteration, resulting in a hierarchical algorithm (which takes advantage of the GPU memory hierarchy). We investigate the algorithm performance on the linear systems arising from the discretization of Poisson’s equation in 1D and 2D, and observe an 8x speedup in convergence in the 1D problem and a nearly 6x speedup in 2D compared to a conventional GPU implementation of Jacobi iteration which only relies on global memory. Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis Hamdy Abdelkhalik (New Mexico State Univ.); Yehia Arafa (Qualcomm Technologies); Nandakishore Santhi (Los Alamos National Laboratory); Hameed Badawy (New Mexico State Univ.) Abstract—Graphics Processing Units (GPUs) are now considered the leading hardware to accelerate general-purpose workloads such as AI, data analytics, and HPC. Over the last decade, researchers have focused on demystifying and evaluating the microarchitecture features of various GPU architectures beyond what vendors reveal. This line of work is necessary to understand the hardware better and build more efficient workloads and applications. Many works have studied the recent Nvidia architectures, such as Volta and Turing, comparing them to their successor, Ampere. However, some microarchitecture features, such as the clock cycles for the different instructions, have not been extensively studied for the Ampere architecture. In this paper, we study the clock cycles per instruction with various data types found in the instruction-set architecture (ISA) of Nvidia GPUs. We measure the clock cycles for PTX ISA instructions and their SASS ISA counterparts using microbenchmarks. We further calculate the clock cycles needed to access each memory unit. Moreover, we demystify the new version of the tensor core unit found in the Ampere architecture using the WMMA API and measuring its clock cycles per instruction and throughput for the different data types and input shapes. Our results should guide software developers and hardware architects. Furthermore, the clock cycles per instructions are widely used by performance modeling simulators and tools to model and predict the performance of the hardware. Apple Silicon Performance in Scientific Computing Connor Kenyon; Collin Capano (UMass Dartmouth) With the release of the Apple Silicon System-on-a-Chip processors, and the impressive performance shown in general use by both the M1 and M1 Ultra, the potential use for Apple Silicon processors in scientific computing is explored. Both the M1 and M1 Ultra arecompared to current state-of-the-art data-center GPUs, including an NVIDIA V100 with PCIe, an NVIDIA V100 with NVLink, and an NVIDIA A100 with PCIe. The scientific performance is measured using the Scalable Heterogeneous Computing (SHOC) benchmark suite using OpenCL benchmarks. We find that both M1 processors out perform the GPUs in all benchmarks. 4-S1: Graph Challenge Special (17:30-19:30) Organizer: Jeremy Kepner Sparse Deep Neural Network Inference Using Different Programming Models Hyungro Lee; Milan Jain; Sayan Ghosh (PNNL) Sparse deep neural networks have gained increasing attention recently in achieving speedups on inference with reduced memory footprints. Real-world applications often have to deal with sparse data and irregularities in the computations, yet a wide variety of Deep Neural Network (DNN) tasks remain dense without exploiting the advantages of sparsity in networks. Recent works presented in MIT/IEEE/Amazon GraphChallenge have demonstrated significant speedups and various techniques. Still, we find that there is limited investigation of the impact of various Python and C/C++ based programming models to explore new opportunities for the general cases. In this work, we provide extensive quantitative evaluations of various contemporary GPGPU programming models such as CuPy, CUDA cuSPARSE, and OpenMP in the context of Sparse Deep Neural Network (SpDNN) implementations (derived from the Graph Challenge reference serial code) on single and multiple GPUs from NVIDIA DGX-A100 40GB/80GB platforms. Kalman Filter Driven Estimation of Community Structure in Time Varying Graphs Lisa JK Durbeck; Peter Athanas (Virginia Tech) Community detection is an NP-hard graph problem that has been the subject of decades of research. Moreover, efficient methods are needed for time-varying graphs. In this paper we propose and evaluate a method of approximating the latent block structure within a time-varying graph using a Kalman filter. The method described breaks a stream of graph updates into samples of sufficient size, each one forming a graph G_t, and has the desirable feature that it accurately updates its representation of the latent block structure using a relatively small amount of information: the prior t-1 predicted block structure and the current datastream sample G_t. This paper details the underlying system of linear equations, used here to represent community detection, that achieves 97% accuracy estimating the latent block representation as the community structure changes. This is demonstrated for synthetic graphs generated by a hybrid mixed-model stochastic block model from the DARPA/MIT Graph Challenge with time-varying block structure.   Improved Distributed-Memory Triangle Counting by Exploiting the Graph Structure Sayan Ghosh (PNNL, Washington State Univ.) " Graphs are ubiquitous in modeling complex systems and representing interactions between entities to uncover structural information of the domain. Traditionally, graph analytics workloads are challenging to efficiently scale (both strong and weak cases) on distributed memory due to the irregular memory-access driven nature (with little or no computations) of the methods. The structure of graphs and their relative distribution over the processing elements poses another level of complexity, making it difficult to attain sustainable scalability across platforms. In this paper, we discuss enhancements to TriC, a distributed-memory implementation of graph triangle counting using Message Passing Interface (MPI), which was featured in the 2020 Graph Challenge competition. We have made some incremental enhancements to TriC, primarily adopting a user-defined buffering strategy to overcome the startup problem for large graphs (by fixing the memory for intermediate data), and experimenting with probabilistic data structures such as bloom filter to improve the query response time for assessing edge existence, at the expense of increasing the overall false positive rate. These adjustments have led to a modest 2--3x improvement in certain cases, as compared to the previous version. HTC: Hybrid Vertex-parallel and Edge-parallel Triangle Counting Li Zeng; Kang Yang; Haoran Cai; Jinhua Zhou; Rongqian Zhao; Xin Chen (Huawei Technologies) Graph algorithms (e.g., triangle counting) are widely used to find the deep association of data in various real-world applications such as friend recommendation and junk mail detection. However, even if using the massive parallelism of GPU, existing methods fail to run triangle counting queries efficiently on various large graphs. In this paper, we propose a fast hybrid algorithm HTC, which can utilize both vertex-parallel and edge-parallel paradigm and deliver much better performance on GPU. Different from current GPU implementations, HTC adaptively selects different parallel paradigm for different vertices. Also, bitwise-based intersection on segmented bitmap is proposed instead of naive binary search. Furthermore, preprocessing techniques like graph reordering and recursive clipping are adopted to optimize the graph structure. Extensive experiments show that HTC outperforms all state-of-the-art triangle counting implementations on GPU by 1.2×~42×. FAST: A Scalable Subgraph Matching Framework over Large Graphs Jiezhong He; Zhouyang Liu; Yixin Chen; Hengyue Pan; Zhen Huang; Dongsheng Li (NUDT) As one of the most fundamental operations in graph analysis, subgraph matching is widely used in various fields such as social network analysis, knowledge graph query, and fraud detection. Due to its NP-complete complexity, subgraph matching is challenging on large graphs. Previous work is limited on either scalability or the types of queries that can be handled. To address these problems, we propose a fast, scalable subgraph matching framework that consists of filtering, ordering, and enumeration stages. We exploit the parallelism in the filtering stage, and design a learning-based filtering method to remove false matching candidates; propose heuristic constraint and ordering generation methods to improve the matching efficiency; devise a distributed enumeration algorithm that is further optimized with the introduction of graph cache. Our learning- based filtering method delivers over 90% accuracy for basic queries. Compared with PruneJuice, our matching framework achieves 2-8× speedup in triangle enumeration and up to 3-4 orders of magnitude higher throughput on generic query enumeration. The caching mechanism further boosts the performance by about 1.5× to 2.5× on average. Experiments also demonstrate the scalability of our framework. Towards Fast GPU-based Sparse DNN Inference: A Hybrid Compute Model Shaoxian Xu; Minkang Wu; Long Zheng; Zhiyuan Shao; Xiangyu Ye; Xiaofei Liao; Hai Jin (Huazhong Univ. of Science and Technology) Sparse Deep Neural Networks (SpDNNs) are promising to cope with this challenge by using fewer weights while preserving accuracy. However, the sparsity nature of SpDNN models makes it difficult to run efficiently on GPUs. To stimulate technical advances for improving the efficiency of SpDNN inference, the MIT/IEEE/Amazon GraphChallenge proposes the SpDNN challenge in 2019. In this paper, we present a hybrid compute model to improve the efficiency of sparse matrix multiplications (SpMMs), the core computation of SpDNN inference. First, the given sparse weight matrix will be divided to generate many (sparse and dense) submatrices. For sparse submatrices, we leverage compile-time data embedding to compile the sparse data together with their corresponding computations into instructions, and hence the number of random accesses can be reduced significantly. For dense submatrices, we follow the traditional computing model where the data is obtained from the memory to exploit the high memory bandwidth of GPU. This hybrid compute model effectively balances the memory and instruction bottlenecks and offers more scheduling opportunities to overlap computing operations and memory accesses on GPU. To determine whether a submatrix is sparse, we present a cost model to estimate its time cost under the traditional computing model and the data-embedded computing mode in an accurate and efficient manner. Once the computing mode for all submatrices is determined, customized codes will be generated for the SpDNN inference. Experimental results on the SpDNN Challenge benchmarks show that our approach achieves up to 197.86 tera-edges per second inference throughput on a single NVIDIA A100 GPU. Compared to the 2021 and 2020 champions, our approach offers up to 6.37x and 89.94x speedups on a single GPU, respectively. We also implement a 16-GPU version, showing up to 9.49x and 80.11x speedups over the former 16- GPU baselines of the 2021 and  2020 champions. Accelerating Sparse Deep Neural Network Inference Using GPU Tensor Cores Yufei Sun; Long Zheng; Qinggang Wang; Xiangyu Ye; Yu Huang; Pengcheng Yao; Xiaofei Liao; Hai Jin (Huazhong Univ. of Science and Technology) Sparse deep neural networks (SpDNN) attract a lot of research and industry attention because of their powerful learning capability, whose execution time is dominated by the sparse matrix-dense matrix multiplication (SpMM). As one of the specialized processors for matrix multiplication, NVIDIA GPU Tensor Cores can perform half-precision matrix-matrix multiplication with higher performance than CUDA Cores, which provides great opportunities for SpMM acceleration. However, performing SpMM efficiently on Tensor Cores remains tremendously challenging. First, typical Tensor Cores do not handle extremely sparse matrix computations well, delivering much lower performance compared to the dense counterparts. Second, the single-precision Challenge dataset prevents them from leveraging powerful Tensor Cores to improve performance. To this end, we first propose a similarity-based matrix transformation scheme, which polarizes the weight matrix to be either denser or sparser in local regions. Then the denser and sparser workloads are respectively processed on Tensor Cores and CUDA Cores, boosting the overall efficiency. Second, considering the half- precision limitation of Tensor Cores, we further propose a lightweight emulation algorithm to achieve the single-precision computation on Tensor Cores without affecting the correctness of the final results. To the best of our knowledge, this paper is the first to accelerate SpDNN inference on Tensor Cores without compromising the precision requirement. Extensive experiments validate that our work reaches up to 300 TeraEdges per second inference throughput on a single A100 GPU, yielding up to 89.41× and 8.12× speedups against the champions of the 2020 and 2021 Sparse Deep Neural Network Graph Challenge, respectively. Moreover, our 4-GPU version is also up to 6.56× faster over the 2021 champion running on 4 GPUs and 7.55× faster over the 2020 champion running on 768 GPUs.
2022 Abstract Book