2021
IEEE High Performance Extreme Computing
Virtual Conference
20 - 24 September 2021

1-P: Poster Session (12:15-15:45)
GBTLX Code Generation: Sparse-Matrix Sparse-Vector Multiplication
Sanil Rao (Carnegie Mellon University)*; Scott McMillan (CMU Software Engineering Institute); Franz Franchetti (Carnegie
Mellon University)
GBTLX is a code generation system that takes a program written using the GraphBLAS Template Library
(GBTL) and transforms it into a high-performance implementation without human intervention. Within the GBTLX system is a preprocessing step that
captures GBTL operations, placing them into a computational trace file. The trace file is the input to the code generator, SPIRAL, which analyzes the
operations and gives an equivalent computational kernel. We highlight the stages that the SPIRAL system goes through to take a trace file and
transform it into a computational kernel. We use as an example sparse-matrix sparse-vector multiplication (SpMSpV).
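As a rough illustration of the kernel being generated (our own sketch, not GBTL code or actual SPIRAL output), SpMSpV can be expressed as a column gather over only the nonzero entries of the input vector, assuming the matrix is stored in CSC form:

```python
# Hypothetical sketch of sparse-matrix sparse-vector multiplication (SpMSpV).
# A is in CSC form: for column j, indices[indptr[j]:indptr[j+1]] hold row ids
# and data[indptr[j]:indptr[j+1]] hold the matching values.
def spmspv(indptr, indices, data, x_idx, x_val):
    """y = A @ x, where x is sparse (parallel lists of indices and values).
    Only columns of A matching a nonzero of x are ever touched."""
    y = {}  # sparse accumulator: row id -> partial sum
    for j, xv in zip(x_idx, x_val):
        for p in range(indptr[j], indptr[j + 1]):
            i = indices[p]
            y[i] = y.get(i, 0.0) + data[p] * xv
    return y
```

The key property, which masked GraphBLAS formulations exploit, is that the work is proportional to the nonzeros actually visited rather than the full matrix dimension.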
Fusing Non Element-wise Layers in DNNs
Upasana Sridhar (Carnegie Mellon University)*; Tze Meng Low (Carnegie Mellon University); Martin Schatz (Facebook)
Deep learning networks typically require a large amount of memory to store the weights associated with their many layers, as well as
the intermediate data passed between layers. Layer fusion, i.e., fusing multiple layers into a single layer, is one of many ways to reduce the
memory requirements. However, many deep learning compilers restrict layer fusion to layers that perform element-wise operations. This limitation
reduces opportunities to decrease the memory overhead and improve performance. We identify different fused implementations of non element-wise
layers, and discuss the trade-offs between them. We demonstrate the generality of our fused approach by applying layer fusion to different non
element-wise layers, and provide a performance comparison against popular machine learning frameworks.
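As a hypothetical sketch of the idea (our own illustration, not the authors' implementation), a non element-wise layer such as max-pooling can be fused with a preceding linear layer by computing only the intermediate rows each pooling window consumes, so the full intermediate result is never materialized:

```python
import numpy as np

def fused_linear_pool(x, W, pool=2):
    """Compute maxpool(W @ x) one pooling window at a time.
    Only `pool` rows of the intermediate vector W @ x exist at any moment,
    instead of the full intermediate of length W.shape[0]."""
    n = W.shape[0]
    out = np.empty(n // pool)
    for o in range(n // pool):
        rows = W[o * pool:(o + 1) * pool]  # only the rows this window needs
        out[o] = np.max(rows @ x)
    return out
```

The trade-off the abstract alludes to is visible even here: fusion trades intermediate storage for a different loop structure, which may change data reuse and vectorization behavior.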
Big Memory Servers and Modern Approaches to Disk-Based Computation
Po Hao Chen (Boston University)*; Kurt Keville (Massachusetts Institute of Technology)
The Big Memory solution is a new computing paradigm facilitated by commodity server platforms that is available today. It exposes a large RAM
subsystem to the Operating System and therefore affords application programmers a number of previously unavailable options for data management.
Additionally, certain vendor-specific solutions offer further memory management options that can improve data reliability and access speeds.
Performance of a GPU-Based Radar Processor
Mark Bolding (Georgia Tech Research Institute)*; David Ediger (Georgia Institute of Technology); Joseph Samo (Georgia Tech
Research Institute); Saul Crumpton (Georgia Tech Research Institute)
We describe a software system for performing radar processing on GPU. Such systems may appear downstream from FPGA front end processors,
hence their appeal for use in complicated radar processing tasks. The strength of such systems is their relative ease of implementation, but
questions are often raised regarding their performance. In this work we give a broad overview of the system and discuss high performance
algorithms that can be used to achieve desired performance, as well as providing benchmarks.
DMM-GAPBS: Adapting the GAP Benchmark Suite to a Distributed Memory Model
Zachary Hansen (University of Nebraska Omaha); Brody Williams (Texas Tech University)*; John Leidel (Tactical
Computing Laboratories); Xi Wang (RIOS Laboratory); Yong Chen (Texas Tech University)
Due to the ability of graphs to model diverse real-world scenarios such as social networks, roads, or biological networks, effective graph processing
techniques are of critical importance to a wide array of fields. As a consequence of the growth of data volumes, some graphs have already outgrown
the memory capacities of single servers. In such cases, it is desirable to partition and keep the entire graph in a distributed memory space in order
to bring the resources of a computing cluster to bear on the problem. This approach introduces a number of challenges, such as communication
bottlenecks and low hardware utilization. However, it is difficult to effectively measure the impact of innovations addressing these challenges due to a
lack of standardization in the domain of distributed graph processing. This research study was inspired by, and builds off of, the widely-used GAP
Benchmark Suite (GAPBS), which was developed to provide an effective baseline and consistent set of evaluation methodologies for shared memory
multiprocessor graph processing systems. We design and develop a new benchmark suite called DMM-GAPBS, a distributed-memory-model
GAPBS. We adapt the GAPBS graph building infrastructure and algorithms, but utilize OpenSHMEM to enable a distributed memory environment, in
the hope of providing a modular, extensible baseline for the distributed graph processing community. In order to showcase our design and
implementation for processing graphs that cannot fit within a single server, we present the results of executing the DMM-GAPBS benchmark kernels
on two large synthetic graphs distributed across sixteen nodes of an enterprise class system.
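The basic bookkeeping such a distributed-memory design requires can be illustrated with a minimal sketch (our own illustration; DMM-GAPBS itself targets OpenSHMEM in C++): under a 1D block distribution, every global vertex id must map to an owning PE and a local index in that PE's arrays:

```python
def block_partition(nverts, npes):
    """1D block distribution of vertices over processing elements (PEs).
    Returns (owner, to_local): functions mapping a global vertex id to
    the PE that owns it and to its index within that PE's local arrays."""
    base, extra = divmod(nverts, npes)
    # The first `extra` PEs own base+1 vertices; the rest own `base`.
    cutoff = extra * (base + 1)

    def owner(v):
        if v < cutoff:
            return v // (base + 1)
        return extra + (v - cutoff) // base

    def to_local(v):
        p = owner(v)
        start = p * base + min(p, extra)  # first global id owned by PE p
        return v - start

    return owner, to_local
```

Every edge whose endpoint is owned by another PE then implies a remote access, which is where the communication bottlenecks mentioned above arise.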
The K-Core Decomposition Algorithm Under the Framework of GraphBLAS
Longlong Li (Shandong University)*; Hu Chen (Shandong University); Ping Li (Huawei Technologies Co. Ltd); Jie Han (Huawei
Technologies Co. Ltd); Guanghui Wang (Shandong University)
A graph’s k-core is the unique largest induced subgraph in which every node has degree greater than or equal to k. The k-core decomposition algorithm
finds the coreness value of each node in a graph, i.e., the maximum value of k over all k-cores containing that node. K-core is one of the most
commonly used metrics for evaluating node importance across scientific disciplines. The widely used classical k-core decomposition
algorithm has O(n+m) complexity. However, it is not well suited to parallelization. In this paper, we propose an algebraic k-core decomposition
algorithm that is O(kmaxn+m) in computational complexity and can be efficiently parallelized on GPU under the GraphBLAS framework. We can
efficiently parallelize the computation of coreness values for graphs with billions of edges. On a 14-core CPU server with large-scale sparse datasets,
our algebraic algorithm outperforms the state-of-the-art ParK and PKC algorithms. The algebraic algorithm, in particular, achieves up to 4×
acceleration in CPU, whereas our parallel GPU implementation on several large scale graphs achieves up to 6× acceleration over our CPU version.
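For reference, the peeling formulation underlying such algorithms can be sketched in a few lines (a plain-Python illustration, not the paper's GraphBLAS implementation): in round k, repeatedly remove vertices whose remaining degree is below k; a vertex removed in round k has coreness k-1.

```python
def coreness(adj):
    """adj: dict mapping node -> set of neighbours (undirected, simple graph).
    Returns dict mapping node -> coreness, by iterative peeling:
    round k removes all vertices of remaining degree < k, repeatedly,
    and assigns them coreness k-1."""
    deg = {v: len(ns) for v, ns in adj.items()}
    alive = set(adj)
    core = {}
    k = 1
    while alive:
        removed = [v for v in alive if deg[v] < k]
        while removed:
            for v in removed:
                alive.discard(v)
                core[v] = k - 1
                for u in adj[v]:
                    if u in alive:
                        deg[u] -= 1  # peeling v lowers its neighbours' degrees
            removed = [v for v in alive if deg[v] < k]
        k += 1
    return core
```

The per-round degree updates are exactly the step that GraphBLAS formulations express as sparse matrix-vector products over masked vectors, which is what makes the O(kmax(n+m)) algebraic variant amenable to GPU parallelization.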
A Machine Learning Enabled NoC Performance Evaluation
Sajal Jain (NIT Karnataka); Prachi Kale (NIT Karnataka); Pallabi Hazarika (NIT Karnataka); Biswajit Bhowmik (NIT Karnataka)*
With a growing number of diverse applications, systems-on-chip (SoCs) have developed rapidly, integrating memory, IP cores, I/O
subsystems, etc. However, SoC-based communication becomes a significant concern, as these architectures often fail to fulfill applications'
real-time requirements due to communication bottlenecks. Networks-on-Chip (NoCs) offer a high-performance guarantee and have become an
alternative solution. NoC simulators are generally used to evaluate performance parameters such as latency and power consumption, as these
are crucial measures of NoC designs. As NoC size increases, NoC simulation becomes very time-consuming. This paper proposes a machine
learning framework
based on Support Vector Regression (SVR) to predict and analyze the NoC
performance metrics. Intensive experiments are conducted for multiple topology sizes. Results show that the proposed scheme predicts latency in the
range of 25-85 cycles, hop count in the range of 2-12, and maximum and minimum switch power consumption of 13 μW and 0.04 μW, respectively.
The minimum and maximum predicted NoC area is 0.08 μm^2 and 5 μm^2, respectively. Further, the prediction error is around 3-5%, while the
speed-up achieved is about 300-2350x.

2021 Abstract Book