2022
IEEE High Performance Extreme Computing
Virtual Conference
19 - 23 September 2022
Thursday, September 22
4-V Session (10:30-11:00)
Co-Chairs: Albert Reuther
4-1: ASIC and FPGA Advances 1 Session (11:00-12:15)
Co-Chairs: Sadas Shankar & Ken Cain
Invited Talk: GNN Acceleration on FPGAs
Prof. Viktor Prasanna (USC)
Flexible Hardware Accelerator Design Generation with SPIRAL
Guanglin Xu; James Hoe; Franz Franchetti (Carnegie Mellon Univ.)
Hardware specialization has become a widely employed technique for achieving higher performance and energy efficiency in
computer systems. Yet obtaining efficient custom hardware designs remains a challenging and tedious task, calling for
automated approaches. In the past, Spiral has been used for generating high-throughput streaming hardware designs for linear
transform kernels. This paper is motivated by an observation that a memory-based iterative computing model may allow us to
trade off throughput for algorithmic flexibility. In this paper, we present a hardware generation approach that generates and
optimizes algorithms using Spiral's multi-level domain-specific languages (DSLs), targeting a scalar load-store architecture. We
have incorporated this approach as a hardware backend into the Spiral system. Our evaluation of this approach on several
fundamental kernels shows flexibility with reasonable performance and resource utilization.
On the Characterization of the Performance-Productivity Gap for FPGA
Atharva Gondhalekar; Thomas Twomey; Wu-chun Feng (Virginia Tech)
Today, FPGA vendors provide a C++/C-based programming environment to enhance programmer productivity over using a
hardware-description language at the register-transfer level. The common perception is that this enhanced productivity comes at
the expense of significantly less performance, e.g., as much as an order of magnitude worse. To characterize this performance-
productivity tradeoff, we propose a new composite metric, Π, that quantitatively captures the perceived discrepancy between the
performance and productivity of any two given FPGA programming languages, e.g., Verilog vs. OpenCL. We then present the
implications of our metric via a case study on the design of a Sobel filter (i.e., edge detector) using three different programming
models — Verilog, OpenCL, oneAPI — on an Intel Arria 10 GX FPGA accelerator. Relative to performance, our results show that
an optimized OpenCL kernel achieves 84% of the performance of an optimized Verilog version of the code on a 7680×4320 (8K)
image. Conversely, relative to productivity, OpenCL offers a 6.1× improvement in productivity over Verilog, while oneAPI
improves the productivity by an additional factor of 1.25× over OpenCL.
Optimizing Designs Using Several Types of Memories on Modern FPGAs
Mehmet Gungor; Kai Huang; Stratis Ioannidis; Miriam Leeser (Northeastern Univ.)
Modern FPGAs targeting data centers are designed to accelerate problems with large data. They offer many different types of
memory including on-chip and on-board memories. A recent addition is High Bandwidth Memory (HBM), whose advantages
have been demonstrated by others. However, there is little research that looks at how interactions among different memory
types impact application performance. We investigate how a combination of HBM and on-chip memory (BRAM or URAM)
impacts clock rate and overall application latency. In these designs, the on-chip memory is used as a cache for the larger
amounts of data stored in HBM. Our experiments show that as the size of data stored in BRAM or URAM increases, the
achievable clock speed is reduced, which in turn may degrade performance. We examine Garbled Circuits, an implementation
of Secure Function Evaluation (SFE) with high memory demands and out-of-order data access, and study how different
choices of BRAM, URAM, and HBM usage alter its performance.
Performance Modeling Sparse MTTKRP Using Optical Static Random Access Memory on FPGA
Sasindu Wijeratne (Univ. of Southern California); Akhilesh Jaiswal; Ajey Jacob (USC-ISI); Bingyi Zhang; Viktor K Prasanna
(Univ. of Southern California)
Electrical static random access memory (E-SRAM) is the current standard for internal static memory in Field Programmable Gate
Arrays (FPGAs). Despite the dramatic improvement in E-SRAM technology over the past decade, the goal of ultra-fast, energy-efficient
static random access memory has yet to be achieved with E-SRAM technology. However, preliminary research into optical static random
access memory (O-SRAM) has shown promising results in creating energy-efficient, ultra-fast static memories.
This paper investigates the advantage of O-SRAM over E-SRAM in access speed and energy performance while executing
the sparse Matricized Tensor Times Khatri-Rao Product (spMTTKRP). spMTTKRP is an essential component of tensor
decomposition algorithms, which are heavily used in data science applications. The evaluation results show that O-SRAMs can achieve
speedups of 1.1x-2.9x while saving 2.8x-8.1x energy compared to conventional E-SRAM technology.
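For readers unfamiliar with the kernel being modeled, the following minimal Python sketch shows what a sparse mode-1 MTTKRP computes for a 3-way tensor in coordinate (COO) format. The function name and data layout are illustrative and not taken from the paper, which is concerned with the memory technology serving the irregular accesses this loop generates.

```python
# Illustrative sketch of the MTTKRP kernel itself (not the paper's O-SRAM model).
# For a 3-way sparse tensor X in COO form and factor matrices B, C, mode-1 MTTKRP
# computes M[i, :] += X[i, j, k] * (B[j, :] * C[k, :]) over all nonzeros.
import numpy as np

def sp_mttkrp_mode1(coords, vals, B, C, I):
    """coords: (nnz, 3) array of (i, j, k); vals: (nnz,) values; rank R = B.shape[1]."""
    R = B.shape[1]
    M = np.zeros((I, R))
    for (i, j, k), x in zip(coords, vals):
        # Each nonzero triggers irregular row reads of B and C; these are the
        # memory accesses whose latency and energy are modeled for E-SRAM vs. O-SRAM.
        M[i, :] += x * (B[j, :] * C[k, :])
    return M

# Tiny example: a 4x3x2 tensor with 3 nonzeros and rank-2 factors.
coords = np.array([[0, 1, 0], [2, 0, 1], [3, 2, 1]])
vals = np.array([1.0, 2.0, -1.0])
B, C = np.random.rand(3, 2), np.random.rand(2, 2)
print(sp_mttkrp_mode1(coords, vals, B, C, I=4))
```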
Tutorial Session: 4-T (12:15-15:45): Exploring Graph Analysis for HPC with Near-Memory Accelerators
Organizer(s): Jeffrey Young (Georgia Tech), Patrick Lavin (Georgia Tech), Jason Riedy (Lucata), Srinivas Eswar (Georgia Tech)
4-2: ASIC and FPGA Advances 2 Session (12:30-13:45)
Co-Chairs: Darrell Ricke & Siddarth Samsi
A High Throughput Hardware Accelerator for FFTW Codelets: A First Look
Larry Tang; Siyuan Chen; Keshav Harisrikanth; Guanglin Xu; Ken Mai; Franz Franchetti (Carnegie Mellon Univ.)
The Fast Fourier Transform (FFT) is a critical computation for numerous applications in science and engineering. Its
implementation has been widely studied and optimized on various computing platforms, with the FFTW library becoming the
standard interface in HPC. In this work, we propose hardware acceleration of the FFTW library by putting a software codelet into
hardware. The hardware is exposed to the user through an FFTW-compatible software library while actual computation takes
place behind the scenes on a custom accelerator. To demonstrate a first look at this idea, we design a high throughput
accelerator for FFTW twiddle codelets. The FFT hardware is automatically generated using SPIRAL and a test chip is fabricated
in a TSMC 28nm process. We provide measured results of the test chip and discuss many opportunities for future work.
Modeling the Energy Efficiency of GEMM using Optical Random Access Memory
Bingyi Zhang; Akhilesh Jaiswal; Clynn Mathew; Ravi Teja Lakkireddy; Ajey Jacob; Sasindu Wijeratne; Viktor K. Prasanna (Univ.
of Southern California)
General matrix-matrix multiplication (GEMM) is the key computation kernel in many applications. GEMM has been supported on
various hardware platforms, including CPUs, GPUs, and FPGAs. To optimize the performance of GEMM, developers use on-chip
electrical static random access memory (E-SRAM) to exploit the data locality of GEMM. However, intensively accessing E-
SRAM for GEMM can lead to significant energy consumption, which is not energy-efficient for commercial data centers.
In this paper, we evaluate optical static random access memory (O-SRAM) for GEMM. O-SRAM is a promising technology
that has extremely low access latency and low energy consumption compared with traditional E-SRAM. First, we propose
an O-SRAM-based wafer-scale system for GEMM and a baseline E-SRAM-based system. Second, we build theoretical
performance models of the two systems to analyze the energy consumption of their on-chip memory accesses. Then, we conduct
simulation-based experiments to evaluate the energy consumption of the two systems. The evaluation results show that the O-SRAM-
based system is 7× more energy efficient than the baseline E-SRAM-based system.
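As a rough illustration of the kind of on-chip-access energy model the abstract describes, the sketch below counts SRAM accesses for a tiled GEMM and multiplies by a per-access energy. The tile size, access counting, and per-access energies are placeholder assumptions, not values from the paper.

```python
# Back-of-the-envelope sketch of an on-chip-access energy model for tiled GEMM
# (C = A @ B with sizes M x K and K x N, square tiles of side T). The per-access
# energies below are invented placeholders, not measured O-SRAM / E-SRAM values.
def gemm_onchip_energy(M, N, K, T, e_access_pj):
    tiles = (M // T) * (N // T) * (K // T)
    # Per tile-multiply: read a TxT tile of A and of B, read and write a TxT tile of C.
    accesses_per_tile = 4 * T * T
    total_accesses = tiles * accesses_per_tile
    return total_accesses * e_access_pj * 1e-12      # joules

M = N = K = 4096
T = 128
for name, e_pj in [("E-SRAM (placeholder)", 5.0), ("O-SRAM (placeholder)", 0.7)]:
    print(name, gemm_onchip_energy(M, N, K, T, e_pj), "J")
```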
Edge-Connected Jaccard Similarity for Graph Link Prediction on FPGA
Paul Sathre; Atharva Gondhalekar; Wu-chun Feng (Virginia Tech)
Graph analysis is a critical task in many fields, such as social networking, epidemiology, bioinformatics, and fraud detection. In
particular, understanding and inferring relationships between graph elements lies at the core of many graph-based workloads.
Real-world graph workloads and their associated data structures create irregular computational patterns that complicate the
realization of high-performance kernels. Given these complications, there does not exist a de facto “best” architecture, language,
or algorithmic approach that simultaneously balances performance, energy efficiency, portability, and productivity.
In this paper, we realize different algorithms of edge-connected Jaccard similarity for graph link prediction and characterize their
performance across a broad spectrum of graphs on an Intel Stratix 10 FPGA. By utilizing a high-level synthesis (HLS)-driven,
high-productivity approach (via the C++-based SYCL language), we rapidly prototype two implementations – a from-scratch
edge-centric version and a faithfully-ported commodity GPU implementation – which would have been intractable via a hardware
description language. With these implementations, we further consider the benefit and necessity of four HLS-enabled
optimizations, both in isolation and in concert — totaling seven distinct synthesized hardware pipelines. Leveraging real-world
graphs of up to 516 million edges, we show empirically-measured speedups of up to 9.5× over the initial HLS implementations
when all optimizations work in concert.
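For reference, edge-connected Jaccard similarity assigns each existing edge (u, v) the Jaccard coefficient of its endpoints' neighbor sets. A minimal pure-Python version is sketched below; the graph representation and function names are illustrative, and the paper's FPGA pipelines parallelize this per-edge computation rather than running it sequentially as shown.

```python
# Minimal reference sketch of edge-connected Jaccard similarity (CPU, pure Python).
from collections import defaultdict

def edge_jaccard(edges):
    """For every undirected edge (u, v), return |N(u) & N(v)| / |N(u) | N(v)|."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    scores = {}
    for u, v in edges:
        inter = len(nbrs[u] & nbrs[v])          # shared neighbors (closed wedges)
        union = len(nbrs[u] | nbrs[v])
        scores[(u, v)] = inter / union if union else 0.0
    return scores

# Example: a 4-cycle with one chord.
print(edge_jaccard([(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]))
```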
Design and Implementation of a Real-Time Parallel FFT for a Wideband Radar System on an FPGA
Lakshmi Pradeep Bheema; Rishu Anand; Pavan Vadakattu; Syed Azemuddin (IIIT Hyderabad); Aquibuddin Ahmed (RCI-DRDO)
The FFT is an essential algorithm for radar signal processing in a radar system. Due to the increase in the computational power of
FPGAs, it is possible to perform the FFT operation onboard an airborne vehicle. However, FPGA resources have become a
limitation for processing real-time signals using conventional methods.
To address this issue, we propose a parallel pipelined FFT architecture that can achieve very high throughput with very
low latency, making it capable of processing real-time continuous data. This architecture is implemented in a radar system
that operates from L band to Ku band. In this radar system, the received RF signal is downconverted to an IF signal at 1 GHz
with a 500 MHz bandwidth and converted to digital data using a 10-bit ADC. On the converted digital data, a 512-point
FFT is implemented on a Xilinx Virtex-7 XC7VX485T FPGA using 8 parallel channels with 64 data frames and is compared with
the conventional IP core-based architecture. The proposed architecture takes 1.307 µs to compute the FFT, which is 5.15 times
faster than the IP core-based architecture and requires fewer arithmetic computations. The total numbers of complex
multiplications, complex additions, multipliers, and adders were reduced by 10.42%, 30.64%, 10.42%, and 23.90%, respectively. In
addition to very low latency and fewer arithmetic operations, the proposed parallel FFT architecture achieves a throughput of 1.350
Giga Samples per second (Gsps).
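The "8 parallel channels with 64 data frames" organization suggests a radix decomposition of the 512-point FFT into 8 independent 64-point transforms, twiddle multiplications, and 8-point transforms across channels. The NumPy sketch below only verifies that decomposition numerically; the mapping onto the paper's pipelined hardware is not reproduced here.

```python
# Illustrative check of the 512 = 8 x 64 Cooley-Tukey decomposition: 8 independent
# 64-point FFTs (one per channel), per-element twiddle factors, then 8-point FFTs
# across the channels reproduce the full 512-point FFT.
import numpy as np

N, P = 512, 8                       # total size, number of parallel channels
Q = N // P                          # 64 samples per channel
x = np.random.rand(N) + 1j * np.random.rand(N)

x2 = x.reshape(Q, P)                # x[q*P + p] -> row q, column p (decimation in time)
stage1 = np.fft.fft(x2, axis=0)     # 8 independent 64-point FFTs (the parallel channels)
s_idx = np.arange(Q)[:, None]       # output frequency within a channel
p_idx = np.arange(P)[None, :]       # channel index
twiddled = stage1 * np.exp(-2j * np.pi * s_idx * p_idx / N)   # twiddle factors
stage2 = np.fft.fft(twiddled, axis=1)                         # 8-point FFTs across channels
X = stage2.T.reshape(N)             # X[r*Q + s] = stage2[s, r]

print(np.allclose(X, np.fft.fft(x)))   # True
```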
Challenges Designing for FPGAs Using High-Level Synthesis
Clayton J Faber; Steven D Harris; Zhili Xiao; Roger Chamberlain (Washington Univ. in St. Louis); Anthony M Cabrera (Oak
Ridge National Laboratory)
High-Level Synthesis (HLS) tools are aimed at enabling performant FPGA designs that are authored in a high-level language.
While commercial HLS tools are available today, there is still a substantial performance gap between most designs developed
via HLS relative to traditional, labor-intensive approaches. We report on several cases where an anticipated performance
improvement was either not realized or resulted in decreased performance. These include: programming paradigm choices
between data parallel vs. pipelined designs; dataflow implementations; configuration parameter choices; and handling odd data
set sizes. The results point to a number of improvements that are needed for HLS tool flows, including a strong need for
performance modeling that can reliably guide the compilation optimization process.
4-3: ASIC and FPGA Advances 3 Session (14:15-15:30)
Co-Chairs: Sanmukh Rao Kuppannagari & Siddarth Samsi
FPGA Acceleration of Fully Homomorphic Encryption over the Torus
Tian Ye (Univ. of Southern California); Rajgopal Kannan (Army Research Lab-West); Viktor K. Prasanna (Univ. of Southern
California)
Fully Homomorphic Encryption over the Torus (TFHE) is a promising approach for secure computing in cloud servers to perform
computations directly on encrypted data. However, TFHE has much higher computation complexity than its unencrypted
counterpart.
In this work, we propose an FPGA accelerator for TFHE computations. We illustrate the effects of an optimization called
bootstrapping key unrolling on the tradeoff between performance of bootstrapping and FPGA resource consumption. We
customize the data layout of TFHE ciphertext to optimize data access and improve data reuse. We parameterize the FPGA
design for TFHE bootstrapping, which can be configured to achieve high performance for different user-specified security
requirements and given FPGA resources. We implement our design on a state-of-the-art FPGA and compare it with existing
results on CPUs. Our implementation for TFHE bootstrapping achieves 216x improvement in throughput and 16.5x improvement
in latency compared with the software baseline on a state-of-the-art CPU server.
Optimizing Open-source FPGA CAD Tools
Shachi Vaman Khadilkar; Martin Margala (UMass Lowell)
The development of open-source FPGA CAD tools is challenging. Vendor CAD tools exist but are closed-source and specific to
each FPGA family. These tools give limited opportunity for customization and also have the disadvantage of long compile times.
Several open-source tools have been built, but the quality of results from these tools is still far behind commercial tools. This
research is aimed at bridging some of the gap between commercial and open-source FPGA CAD tools by tuning policies of
existing implementation algorithms. We build a set of synthetic benchmarks to identify the mapping between input circuit
patterns and policies of a given CAD algorithm and architecture. This work discusses the framework for policy tuning and is
aimed at showing that algorithm policies need to adapt to the input circuit. We demonstrate our framework with the packing step
of the FPGA CAD flow and show how our approach can be used to improve the quality of results in open-source tools.
LIMA: Hardware for FFT based large integer multiplication
James Nguyen; Michael Cai; Ziyi Zuo; Larry Tang; Ken Mai; Franz Franchetti (Carnegie Mellon Univ.)
Applications in theorem proving and cryptography rely heavily on the multiplication of large integer values. Utilizing properties of
the Fourier Transform, one can multiply two values in O(n log n) time, as compared to the traditional O(n^2) grade-school algorithm. In
this work, we propose an FFT-based multiplication algorithm that can be computed in O(n log n) time, utilizing interval arithmetic to
circumvent the uncertainty of real-number arithmetic. This work also presents hardware for FFT-based large integer multiplication, LIMA,
a test chip implementing a portion of this algorithm fabricated in a TSMC 28 nm process. We aim to demonstrate
LIMA as a high-throughput implementation of this portion of the algorithm, with the intent of running at 1 GHz with an
area of 1.9 mm^2. LIMA utilizes custom double-precision floating-point and FIFO architectures optimized for area and high-clock-speed
operation.
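For context, the classic FFT-based big-integer multiply treats the digit strings as sequences, convolves them with an FFT, and then propagates carries. The Python sketch below is illustrative only; it simply rounds the floating-point results, whereas the paper's hardware bounds the rounding error with interval arithmetic.

```python
# Sketch of the classic FFT-based big-integer multiply (O(n log n)).
# The floating-point FFT introduces rounding error that must stay below 0.5 per
# coefficient; with base-10 digits and modest sizes, plain rounding is safe here.
import numpy as np

def fft_multiply(a, b, base=10):
    da = [int(d) for d in str(a)[::-1]]          # little-endian digit vectors
    db = [int(d) for d in str(b)[::-1]]
    n = 1
    while n < len(da) + len(db):                 # pad to a power of two
        n *= 2
    fa, fb = np.fft.fft(da, n), np.fft.fft(db, n)
    conv = np.fft.ifft(fa * fb).real             # linear convolution of digits
    digits = [int(round(x)) for x in conv]
    carry, out = 0, 0
    for i, d in enumerate(digits):               # carry propagation
        carry, d = divmod(d + carry, base)
        out += d * base**i
    return out + carry * base**len(digits)

print(fft_multiply(123456789, 987654321) == 123456789 * 987654321)  # True
```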
Towards a Generic UVM
Kholoud Mahmoud; Randa Ahmed; Karim Ayman; Mostafa Ayman; Waleed Taie; Yasser Ibrahim; Hassan Mostafa (Cairo
University); Khaled Salah (Siemens)
In the modern era of advancing CPU complexity, processor verification has always been an ever-increasing challenge. The
gap between what a verification plan can offer and what current technology requires is constantly widening.
Despite many efforts to perfect “golden verification models” during the design phase and to adopt object-oriented
programming into the whole process, many industry experts still consider standalone verification test benches an extreme, time-
consuming barricade that leads to a longer time-to-market and a questionable continuity of the current verification process. The
Universal Verification Methodology (UVM) came into action as a savior for the verification community by offering a
merge of SystemVerilog and SystemC into one environment that is completely standardized, constrained, and reusable,
providing a powerful verification methodology for a wide range of design sizes and types. The main contribution of this project
is a generic UVM: one verification environment that can accommodate many RTL designs (soft processors) having not only
different Instruction Set Architectures (ISAs) of the same categories, but also different techniques and mechanisms for handling
the pipeline infrastructure. The proposed generic UVM (GUVM) structure permits the targeted user to attach any soft processor
(core) having nearly the same micro-architecture to our test bench and to monitor both the CPU's internal behavior and the
complete flow of all supported instructions.
How to Prevent a Sick ASIC
William F. Ellersick (Analog Circuit Works)
High performance computing systems increasingly require mixed-signal ASICs to achieve competitive speed, power efficiency
and cost. The integration of processing, transceivers, sensors and power management results in dramatic reductions in size,
which can yield great savings in power, enabling higher performance. However, few design elements demand such high quality
as a mixed-signal ASIC. In this paper, actual near-disasters from decades of integrated circuit design are presented along with
methods to prevent potentially severe damage to projects, careers, and even companies. Such stories of failure are rarely told,
but acknowledging them is crucial to avoiding repeating the mistakes and to reducing ASIC development risk and ultimately
ensuring success. Key takeaways include planning for failure with designed-in observability, controllability and workarounds; the
use of simple and robust circuits; and that organizing the people can be as important and challenging as arranging the
transistors.
4-4: General Purpose GPU Computing 2 Session (15:45-17:00)
Co-Chairs: Hameed Badawy & Mark Barnell
Optimal GPU Frequency Selection using Multi-Objective Approaches for HPC Systems
Ghazanfar Ali (Texas Tech Univ.); Sridutt Bhalachandra; Nicholas Wright (Lawrence Berkeley National Laboratory); Mert Side;
Yong Chen (Texas Tech Univ.)
Power consumption poses a significant challenge in current and emerging GPU-enabled high-performance computing (HPC)
systems. In modern GPUs, controls like dynamic voltage frequency scaling (DVFS), among others, exist to regulate power
consumption. Due to varying computational intensities and the availability of a wide range of frequency settings, selecting the
optimal frequency configuration for a given GPU workload is non-trivial. Applying a power control with the single objective of
reducing power may cause performance degradation, leading to more energy consumption. In this study, we characterize and
identify GPU utilization metrics that influence both the power and execution time of a given workload. Analytical models for
power and execution time are then proposed using the characterized feature set. Multi-objective functions (i.e., energy-delay
product (EDP) and ED2P) are used to select an optimal GPU DVFS configuration for a workload such that power consumption is
reduced with no or negligible degradation in performance. The evaluation was conducted using SPEC ACCEL benchmarks on
an NVIDIA GV100 GPU. The proposed power and performance analytical models demonstrated prediction accuracies of up to
99.2% and 98.8%, respectively. On average, the benchmarks showed 28.6% and 25.2% energy savings using EDP and ED2P
approaches, respectively, without performance degradation. Furthermore, the proposed models require metric collection at only
the maximum frequency rather than all supported DVFS configurations.
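The selection step itself is simple once power and execution time are predicted per frequency: minimize EDP = E·t (equivalently P·t²) or ED2P = E·t² (P·t³). The sketch below uses invented numbers for a hypothetical workload; the paper obtains the power and time inputs from its analytical models.

```python
# Illustrative sketch of the multi-objective selection step over candidate GPU
# frequencies. Candidate power/time values are made up for illustration.

def best_frequency(candidates, objective="EDP"):
    """candidates: list of (freq_MHz, power_W, time_s)."""
    def score(p, t):
        energy = p * t
        return energy * t if objective == "EDP" else energy * t * t   # ED2P
    return min(candidates, key=lambda c: score(c[1], c[2]))

# Hypothetical workload: lower frequencies save power but stretch execution time.
candidates = [(1380, 250.0, 10.0), (1200, 210.0, 10.4),
              (1005, 170.0, 11.5), (810, 140.0, 13.8)]
print(best_frequency(candidates, "EDP"))    # trades a little time for lower energy
print(best_frequency(candidates, "ED2P"))   # weights delay more heavily
```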
GPU-Accelerated High-Bandwidth Radar Centroiding
David J Brigada; Maximilian Merfeld; Kara Warner (MIT Lincoln Laboratory)
Radar signal processing is a computationally intensive task, especially for high-bandwidth systems. Traditionally, such systems
have relied on the interleaving of processing on multiple nodes of large compute clusters to achieve the necessary throughput.
Development in general-purpose GPU computing has led to a massive increase in the computational power available to highly
parallel tasks. Most parts of the radar signal processing pipeline are well suited for such a task. This paper describes an
algorithm for centroiding, a key part of the search radar pipeline that has not yet been demonstrated on a GPU. With this
centroiding algorithm, the entire high-data-rate portion of the processing pipeline can be run on the GPU, yielding a speedup
factor of approximately 40. The primary benefit of this approach is a massive reduction in data copying from the GPU to the
CPU—a factor of over 1200 in this case—alleviating the main barrier to GPU-based radar processing systems.
A Hierarchical Jacobi Iteration for Structured Matrices on GPUs using Shared Memory
Mohammad Shafaet Islam; Qiqi Wang (MIT)
This paper presents an algorithm to accelerate the Jacobi iteration for solving linear systems of equations arising from structured
problems on graphics processing units (GPUs). Acceleration is achieved by utilization of on-chip GPU shared memory via a
domain decomposition procedure. In particular, the problem domain is partitioned into subdomains whose data is copied to the
shared memory of each GPU block. Jacobi iterations are performed internally within each block's shared memory while avoiding
expensive global memory accesses every iteration, resulting in a hierarchical algorithm (which takes advantage of the GPU
memory hierarchy). We investigate the algorithm performance on the linear systems arising from the discretization of Poisson’s
equation in 1D and 2D, and observe an 8x speedup in convergence in the 1D problem and a nearly 6x speedup in 2D compared
to a conventional GPU implementation of Jacobi iteration which only relies on global memory.
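A minimal NumPy sketch of the idea, assuming a 1D Poisson problem, is shown below: the unknowns are split into blocks, and each block performs several inner Jacobi sweeps against frozen halo values before writing back, mirroring the use of GPU shared memory between global-memory exchanges. Block size and sweep counts are arbitrary illustrative choices, not the paper's parameters.

```python
# Hierarchical (two-stage) Jacobi sketch for -u'' = f on (0,1), u(0) = u(1) = 0.
import numpy as np

def hierarchical_jacobi_1d(f, h, block=32, inner=8, outer=200):
    n = len(f)
    u = np.zeros(n + 2)                       # padded with boundary zeros
    for _ in range(outer):
        u_old = u.copy()                      # "global memory" snapshot
        for s in range(1, n + 1, block):
            e = min(s + block, n + 1)
            loc = u_old[s - 1:e + 1].copy()   # subdomain plus one-cell halo
            for _ in range(inner):            # sweeps using only local data
                loc[1:-1] = 0.5 * (loc[:-2] + loc[2:] + h * h * f[s - 1:e - 1])
            u[s:e] = loc[1:-1]                # write the block back
    return u[1:-1]

n = 128
h = 1.0 / (n + 1)
f = np.ones(n)                                # -u'' = 1, exact u(x) = x(1-x)/2
u = hierarchical_jacobi_1d(f, h)
x = np.linspace(h, 1 - h, n)
print(np.max(np.abs(u - 0.5 * x * (1 - x))))  # error shrinks as inner/outer grow
```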
Demystifying the Nvidia Ampere Architecture through Microbenchmarking and Instruction-level Analysis
Hamdy Abdelkhalik (New Mexico State Univ.); Yehia Arafa (Qualcomm Technologies); Nandakishore Santhi (Los Alamos
National Laboratory); Hameed Badawy (New Mexico State Univ.)
Graphics Processing Units (GPUs) are now considered the leading hardware to accelerate general-purpose
workloads such as AI, data analytics, and HPC. Over the last decade, researchers have focused on demystifying and evaluating
the microarchitecture features of various GPU architectures
beyond what vendors reveal. This line of work is necessary to understand the hardware better and build more efficient workloads
and applications. Many works have studied the recent Nvidia architectures, such as Volta and Turing, comparing them to their
successor, Ampere. However, some microarchitecture
features, such as the clock cycles for the different instructions, have not been extensively studied for the Ampere architecture. In
this paper, we study the clock cycles per instruction with various data types found in the instruction-set architecture (ISA) of
Nvidia GPUs. We measure the clock cycles for PTX ISA
instructions and their SASS ISA counterparts using microbenchmarks. We further calculate the clock cycles needed to access
each memory unit. Moreover, we demystify the new version of the tensor core unit found in the Ampere architecture using the
WMMA API and measuring its clock cycles per instruction and throughput for the different data types and input shapes. Our
results should guide software developers and hardware architects. Furthermore, the clock cycles per instruction are widely
used by performance modeling simulators and tools to model and predict the performance of the hardware.
Apple Silicon Performance in Scientific Computing
Connor Kenyon; Collin Capano (UMass Dartmouth)
With the release of the Apple Silicon System-on-a-Chip processors, and the impressive performance shown in general use by
both the M1 and M1 Ultra, the potential use of Apple Silicon processors in scientific computing is explored. Both the M1 and M1
Ultra are compared to current state-of-the-art data-center GPUs, including an NVIDIA V100 with PCIe, an NVIDIA V100 with
NVLink, and an NVIDIA A100 with PCIe. The scientific performance is measured using the Scalable Heterogeneous Computing
(SHOC) benchmark suite with its OpenCL benchmarks. We find that both M1 processors outperform the GPUs in all benchmarks.
4-S1: Graph Challenge Special (17:30-19:30)
Organizer: Jeremy Kepner
Sparse Deep Neural Network Inference Using Different Programming Models
Hyungro Lee; Milan Jain; Sayan Ghosh (PNNL)
Sparse deep neural networks have gained increasing attention recently in achieving speedups on inference with reduced
memory footprints. Real-world applications often have to deal with sparse data and irregularities in the computations, yet a wide
variety of Deep Neural Network (DNN) tasks remain dense without exploiting the advantages of sparsity in networks. Recent
works presented in the MIT/IEEE/Amazon GraphChallenge have demonstrated significant speedups and various techniques. Still,
we find that there has been limited investigation of the impact of various Python- and C/C++-based programming models in exploring
new opportunities for the general case.
In this work, we provide extensive quantitative evaluations of various contemporary GPGPU programming models such as
CuPy, CUDA cuSPARSE, and OpenMP in the context of Sparse Deep Neural Network (SpDNN) implementations (derived from
the Graph Challenge reference serial code) on single and multiple GPUs from NVIDIA DGX-A100 40GB/80GB platforms.
Kalman Filter Driven Estimation of Community Structure in Time Varying Graphs
Lisa JK Durbeck; Peter Athanas (Virginia Tech)
Community detection is an NP-hard graph problem that has been the subject of decades of research. Moreover, efficient
methods are needed for time-varying graphs. In this paper we propose and evaluate a method of approximating the latent block
structure within a time-varying graph using a Kalman filter. The method described breaks a stream of graph updates into
samples of sufficient size, each one forming a graph G_t, and has the desirable feature that it accurately updates its
representation of the latent block structure using a relatively small amount of information: the prior t-1 predicted block structure
and the current datastream sample G_t. This paper details the underlying system of linear equations, used here to represent
community detection, that achieves 97% accuracy estimating the latent block representation as the community structure
changes. This is demonstrated for synthetic graphs generated by a hybrid mixed-model stochastic block model from the
DARPA/MIT Graph Challenge with time-varying block structure.
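For readers unfamiliar with the filter itself, the standard Kalman predict/update step is sketched below. In the paper, the state encodes the latent block structure and each sample G_t supplies the measurement; the state layout and the matrices F, H, Q, R used here are purely illustrative and are not the paper's system of equations.

```python
# Generic Kalman filter predict/update step, shown only to make the mechanics concrete.
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    # Predict: propagate the previous estimate (here, a stand-in for block structure).
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: correct the prediction with the new measurement (statistics of G_t).
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)        # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Toy 2-state example tracking a slowly drifting quantity from noisy observations.
x, P = np.zeros(2), np.eye(2)
F, H = np.eye(2), np.eye(2)
Q, R = 0.01 * np.eye(2), 0.25 * np.eye(2)
for z in (np.array([0.9, 0.1]), np.array([0.8, 0.2]), np.array([0.7, 0.3])):
    x, P = kalman_step(x, P, z, F, H, Q, R)
print(x)
```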
Improved Distributed-Memory Triangle Counting by Exploiting the Graph Structure
Sayan Ghosh (PNNL, Washington State Univ.)
"
Graphs are ubiquitous in modeling complex systems and representing interactions between entities to uncover structural
information of the domain. Traditionally, graph analytics workloads are challenging to efficiently scale (both strong and weak
cases) on distributed memory due to the irregular memory-access driven nature (with little or no computations) of the methods.
The structure of graphs and their relative distribution over the processing elements poses another level of complexity, making it
difficult to attain sustainable scalability across platforms.
In this paper, we discuss enhancements to TriC, a distributed-memory implementation of graph triangle counting using Message
Passing Interface (MPI), which was featured in the 2020 Graph Challenge competition. We have made some incremental
enhancements to TriC, primarily adopting a user-defined buffering strategy to overcome the startup problem for large graphs (by
fixing the memory for intermediate data), and experimenting with probabilistic data structures such as Bloom filters to improve the
query response time for assessing edge existence, at the expense of an increased overall false-positive rate. These
adjustments have led to a modest 2-3x improvement in certain cases, as compared to the previous version.
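A Bloom filter answers "is edge (u, v) possibly present?" with no false negatives but a tunable false-positive rate, which is the trade-off the abstract refers to. The sketch below is a generic illustration; the hash functions, sizing, and edge encoding are assumptions rather than details of the TriC implementation.

```python
# Tiny Bloom-filter sketch for edge-existence queries.
# "maybe present" can be a false positive; "not present" is always correct.
import hashlib

class EdgeBloom:
    def __init__(self, m_bits=1 << 16, k=4):
        self.m, self.k, self.bits = m_bits, k, bytearray(m_bits // 8)

    def _positions(self, u, v):
        key = f"{min(u, v)}-{max(u, v)}".encode()      # undirected edge key
        for i in range(self.k):
            h = int.from_bytes(hashlib.sha256(key + bytes([i])).digest()[:8], "big")
            yield h % self.m

    def add(self, u, v):
        for p in self._positions(u, v):
            self.bits[p // 8] |= 1 << (p % 8)

    def maybe_has(self, u, v):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(u, v))

bf = EdgeBloom()
bf.add(3, 7)
print(bf.maybe_has(7, 3), bf.maybe_has(3, 8))   # True, almost certainly False
```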
HTC: Hybrid Vertex-parallel and Edge-parallel Triangle Counting
Li Zeng; Kang Yang; Haoran Cai; Jinhua Zhou; Rongqian Zhao; Xin Chen (Huawei Technologies)
Graph algorithms (e.g., triangle counting) are widely used to find deep associations in data in various real-world applications
such as friend recommendation and junk mail detection. However, even when using the massive parallelism of GPUs, existing
methods fail to run triangle counting queries efficiently on various large graphs. In this paper, we propose a fast hybrid
algorithm, HTC, which can utilize both vertex-parallel and edge-parallel paradigms and deliver much better performance on GPUs.
Different from current GPU implementations, HTC adaptively selects different parallel paradigms for different vertices.
Also, a bitwise intersection over a segmented bitmap is proposed instead of a naive binary search. Furthermore, preprocessing
techniques such as graph reordering and recursive clipping are adopted to optimize the graph structure.
Extensive experiments show that HTC outperforms all state-of-the-art triangle counting implementations on GPUs by 1.2×-42×.
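The bitmap-intersection idea can be illustrated compactly: if each adjacency list is encoded as a bitmap, the common-neighbor count for an edge becomes a bitwise AND followed by a popcount instead of repeated binary searches. The Python sketch below uses arbitrary-precision integers in place of the paper's segmented GPU bitmaps, so the layout details are only indicative.

```python
# Bitmap-based common-neighbor counting for triangle counting.

def build_bitmaps(num_vertices, edges):
    bm = [0] * num_vertices
    for u, v in edges:
        bm[u] |= 1 << v
        bm[v] |= 1 << u
    return bm

def count_triangles(edges, bm):
    # Each undirected edge (u, v) contributes |N(u) & N(v)| closing wedges;
    # every triangle is counted once per edge, hence the division by 3.
    total = sum(bin(bm[u] & bm[v]).count("1") for u, v in edges)
    return total // 3

edges = [(0, 1), (1, 2), (2, 0), (2, 3)]        # one triangle {0,1,2} plus a tail
bm = build_bitmaps(4, edges)
print(count_triangles(edges, bm))               # 1
```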
FAST: A Scalable Subgraph Matching Framework over Large Graphs
Jiezhong He; Zhouyang Liu; Yixin Chen; Hengyue Pan; Zhen Huang; Dongsheng Li (NUDT)
As one of the most fundamental operations in graph analysis, subgraph matching is widely used in various fields such as social
network analysis, knowledge graph query, and fraud detection. Due to its NP-complete complexity, subgraph matching is
challenging on large graphs. Previous work is limited in either scalability or the types of queries that can be handled. To address
these problems, we propose a fast, scalable subgraph matching framework that consists of filtering, ordering, and enumeration
stages. We exploit the parallelism in the filtering stage, and design a learning-based filtering method to remove false matching
candidates; propose heuristic constraint and ordering generation methods to improve the matching
efficiency; and devise a distributed enumeration algorithm that is further optimized with the introduction of a graph cache. Our
learning-based filtering method delivers over 90% accuracy for basic queries. Compared with PruneJuice, our matching
framework achieves a 2-8× speedup in triangle enumeration and up to 3-4 orders of magnitude higher throughput on generic
query enumeration. The caching mechanism further boosts performance by about 1.5× to 2.5× on average. Experiments
also demonstrate the scalability of our framework.
Towards Fast GPU-based Sparse DNN Inference: A Hybrid Compute Model
Shaoxian Xu; Minkang Wu; Long Zheng; Zhiyuan Shao; Xiangyu Ye; Xiaofei Liao; Hai Jin (Huazhong Univ. of Science and
Technology)
Sparse Deep Neural Networks (SpDNNs) promise to reduce the cost of DNN inference by using fewer weights while preserving
accuracy. However, the sparse nature of SpDNN models makes them difficult to run efficiently on GPUs. To stimulate technical
advances for improving the efficiency of SpDNN inference, the MIT/IEEE/Amazon GraphChallenge introduced the SpDNN
challenge in 2019. In this paper, we present a hybrid compute model to improve the efficiency of sparse matrix multiplications
(SpMMs), the core computation of SpDNN inference. First, the given sparse weight matrix is divided to generate many
(sparse and dense) submatrices. For sparse submatrices, we leverage compile-time data embedding to compile the sparse data
together with their corresponding computations into instructions, and hence the number of random accesses can be reduced
significantly. For dense submatrices, we follow the traditional computing model where the data is obtained from the memory to
exploit the high memory bandwidth of GPU. This hybrid compute model effectively balances the memory and instruction
bottlenecks and offers more scheduling opportunities to overlap computing operations and memory accesses on GPU. To
determine whether a submatrix is sparse, we present a cost model to estimate its time cost under the traditional computing
model and the data-embedded computing model in an accurate and efficient manner. Once the computing model for all
submatrices is determined, customized code is generated for the SpDNN inference. Experimental results on the SpDNN
Challenge benchmarks show that our approach achieves up to 197.86 tera-edges per second inference throughput on a single
NVIDIA A100 GPU. Compared to the 2021 and 2020 champions, our approach offers up to 6.37x and 89.94x speedups on a
single GPU, respectively. We also implement a 16-GPU version, showing up to 9.49x and 80.11x speedups over the former 16-
GPU baselines of the 2021 and 2020 champions.
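The per-submatrix decision can be pictured as a small cost comparison: estimate the cost of baking a tile's nonzeros into instructions versus loading the whole tile from memory, and take the cheaper mode. The toy sketch below uses invented cost constants; the paper derives its decisions from a calibrated GPU cost model.

```python
# Toy illustration of choosing a per-tile compute mode for a sparse weight matrix.
# Cost constants are invented for illustration only.
import numpy as np

def choose_modes(W, tile=4, c_embed=1.0, c_load=0.3, c_fixed=2.0):
    rows, cols = W.shape
    modes = {}
    for r in range(0, rows, tile):
        for c in range(0, cols, tile):
            nnz = np.count_nonzero(W[r:r + tile, c:c + tile])
            embed_cost = c_embed * nnz                   # instructions grow with nnz
            dense_cost = c_fixed + c_load * tile * tile  # loads the whole tile
            modes[(r, c)] = "embed" if embed_cost < dense_cost else "dense"
    return modes

W = np.zeros((8, 8))
W[0, 1] = W[7, 6] = 1.0              # two very sparse tiles
W[4:8, 0:4] = np.random.rand(4, 4)   # one dense tile
print(choose_modes(W))
```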
Accelerating Sparse Deep Neural Network Inference Using GPU Tensor Cores
Yufei Sun; Long Zheng; Qinggang Wang; Xiangyu Ye; Yu Huang; Pengcheng Yao; Xiaofei Liao; Hai Jin (Huazhong Univ. of
Science and Technology)
Sparse deep neural networks (SpDNNs) attract a lot of research and industry attention because of their powerful learning
capability, and their execution time is dominated by sparse matrix-dense matrix multiplication (SpMM). As one of the
specialized processors for matrix multiplication, NVIDIA GPU Tensor Cores can perform half-precision matrix-matrix
multiplication with higher performance than CUDA Cores, which provides great opportunities for SpMM acceleration. However,
performing SpMM efficiently on Tensor Cores remains tremendously challenging. First, typical Tensor Cores do not handle
extremely sparse matrix computations well, delivering much lower performance compared to their dense counterparts. Second,
the single-precision Challenge dataset prevents SpDNN implementations from leveraging powerful Tensor Cores to improve performance. To this
end, we first propose a similarity-based matrix transformation scheme, which polarizes the weight matrix to be either denser or
sparser in local regions. Then the denser and sparser workloads are respectively processed on Tensor Cores and CUDA Cores,
boosting the overall efficiency. Second, considering the half-precision limitation of Tensor Cores, we further propose a
lightweight emulation algorithm to achieve single-precision computation on Tensor Cores without affecting the correctness of
the final results. To the best of our knowledge, this paper is the first to accelerate SpDNN inference on Tensor Cores without
compromising the precision requirement. Extensive experiments validate that our work reaches up to 300 TeraEdges per second
inference throughput on a single A100 GPU, yielding up to 89.41× and 8.12× speedups against the champions of the 2020 and
2021 Sparse Deep Neural Network Graph Challenge, respectively. Moreover, our 4-GPU version is also up to 6.56× faster over
the 2021 champion running on 4 GPUs and 7.55× faster over the 2020 champion running on 768 GPUs.
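The single-precision emulation can be understood through the standard split-precision trick: write each fp32 operand as a sum of two fp16 values and accumulate the three significant cross products in fp32, as Tensor Cores do. The sketch below demonstrates the numerics with NumPy; it is a generic version of the technique and may differ in detail from the paper's emulation algorithm.

```python
# Generic split-precision emulation of an fp32 GEMM using fp16 operands with fp32
# accumulation: A = A_hi + A_lo, B = B_hi + B_lo, and A@B is approximated by
# A_hi@B_hi + A_hi@B_lo + A_lo@B_hi (the A_lo@B_lo term is negligible).
import numpy as np

def split_fp16(A):
    hi = A.astype(np.float16)
    lo = (A - hi.astype(np.float32)).astype(np.float16)   # residual fits in fp16
    return hi, lo

def emulated_sgemm(A, B):
    Ah, Al = split_fp16(A)
    Bh, Bl = split_fp16(B)
    f32 = np.float32
    return (Ah.astype(f32) @ Bh.astype(f32) +
            Ah.astype(f32) @ Bl.astype(f32) +
            Al.astype(f32) @ Bh.astype(f32))

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64), dtype=np.float32)
B = rng.standard_normal((64, 64), dtype=np.float32)
ref = A @ B
naive = (A.astype(np.float16) @ B.astype(np.float16)).astype(np.float32)
# The emulated product tracks the fp32 reference far more closely than naive fp16.
print(np.abs(naive - ref).max(), np.abs(emulated_sgemm(A, B) - ref).max())
```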