2021 IEEE High Performance Extreme Computing Virtual Conference 20 - 24 September 2021
1-P: Poster Session (12:15-15:45)

GBTLX Code Generation: Sparse-Matrix Sparse-Vector Multiplication
Sanil Rao (Carnegie Mellon University)*; Scott McMillan (CMU Software Engineering Institute); Franz Franchetti (Carnegie Mellon University)
GBTLX is a code generation system that takes a program written using the GraphBLAS Template Library (GBTL) and transforms it into a high-performance implementation without human intervention. Within the GBTLX system is a preprocessing step that captures GBTL operations and places them into a computational trace file. The trace file is the input to the code generator, SPIRAL, which analyzes the operations and produces an equivalent computational kernel. We highlight the stages the SPIRAL system goes through to transform a trace file into a computational kernel, using sparse-matrix sparse-vector multiplication (SpMSpV) as an example.

Fusing Non Element-wise Layers in DNNs
Upasana Sridhar (Carnegie Mellon University)*; Tze Meng Low (Carnegie Mellon University); Martin Schatz (Facebook)
Most deep learning networks require a large amount of memory due to the storage of weights associated with the large number of layers and the intermediate data between consecutive layers. Layer fusion, i.e. fusing multiple layers into a single layer, is one of many ways to reduce these memory requirements. However, many deep learning compilers restrict layer fusion to layers that perform element-wise operations. This limitation reduces opportunities to decrease memory overhead and improve performance. We identify different fused implementations of non element-wise layers and discuss the trade-offs between them. We demonstrate the generality of our fused approach by applying layer fusion to different non element-wise layers, and provide a performance comparison against popular machine learning frameworks.
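The SpMSpV kernel targeted by the GBTLX abstract above can be illustrated with a minimal sketch. This is plain Python, not generated GBTL/SPIRAL code, and the `spmspv` name and dict-based storage are choices made here: storing the matrix column-wise lets the kernel touch only the columns matching nonzero vector entries, which is the point of a sparse-sparse product.

```python
def spmspv(cols, x):
    """Sparse-matrix sparse-vector multiply.

    cols: {j: {i: A_ij}} column-wise sparse matrix; x: {j: x_j} sparse vector.
    Returns the sparse result y = A @ x as {i: y_i}.
    """
    y = {}
    for j, xj in x.items():                    # iterate only over nonzeros of x
        for i, aij in cols.get(j, {}).items(): # scatter column j scaled by x_j
            y[i] = y.get(i, 0) + aij * xj
    return y

# 3x3 matrix with nonzeros A[0][1]=2, A[2][1]=4, A[1][0]=5
A = {1: {0: 2, 2: 4}, 0: {1: 5}}
x = {1: 3}                                     # sparse vector with one nonzero
print(spmspv(A, x))                            # {0: 6, 2: 12}
```

The work done is proportional to the nonzeros actually touched, not to the matrix dimension, which is why the kernel is attractive for graph algorithms expressed in GraphBLAS.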
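The memory-saving idea in the layer-fusion abstract above can be sketched with a toy pair of layers. The names (`fused`, `unfused`), the pooling layer, and the tiling scheme here are illustrative choices, not the authors' implementation: the consumer (max pooling) is not element-wise, yet producing the intermediate one pooling window at a time means the full intermediate buffer is never materialized.

```python
def relu(v):
    return [max(0.0, e) for e in v]

def unfused(x, pool=2):
    inter = relu(x)                              # full intermediate buffer
    return [max(inter[i:i + pool]) for i in range(0, len(inter), pool)]

def fused(x, pool=2):
    out = []
    for i in range(0, len(x), pool):             # one pooling window at a time
        out.append(max(relu(x[i:i + pool])))     # pool-sized scratch only
    return out

x = [1.0, -2.0, 3.0, 0.5]
assert fused(x) == unfused(x)                    # same result, less memory
```

The same result is produced either way; the fused version only ever holds `pool` intermediate values, which is the trade-off the abstract generalizes to other non element-wise layers.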
Big Memory Servers and Modern Approaches to Disk-Based Computation
Po Hao Chen (Boston University)*; Kurt Keville (Massachusetts Institute of Technology)
The Big Memory solution is a new computing paradigm, facilitated by commodity server platforms, that is available today. It exposes a large RAM subsystem to the operating system and therefore affords application programmers a number of previously unavailable options for data management. Additionally, certain vendor-specific solutions offer additional memory management options that result in better data reliability and access speeds.

Performance of a GPU-Based Radar Processor
Mark Bolding (Georgia Tech Research Institute)*; David Ediger (Georgia Institute of Technology); Joseph Samo (Georgia Tech Research Institute); Saul Crumpton (Georgia Tech Research Institute)
We describe a software system for performing radar processing on GPUs. Such systems may appear downstream from FPGA front-end processors, hence their appeal for use in complicated radar processing tasks. The strength of such systems is their relative ease of implementation, but questions are often raised regarding their performance. In this work we give a broad overview of the system, discuss high-performance algorithms that can be used to achieve the desired performance, and provide benchmarks.

DMM-GAPBS: Adapting the GAP Benchmark Suite to a Distributed Memory Model
Zachary Hansen (University of Nebraska Omaha); Brody Williams (Texas Tech University)*; John Leidel (Tactical Computing Laboratories); Xi Wang (RIOS Laboratory); Yong Chen (Texas Tech University)
Due to the ability of graphs to model diverse real-world scenarios such as social networks, roads, or biological networks, effective graph processing techniques are of critical importance to a wide array of fields. As a consequence of the growth of data volumes, some graphs have already outgrown the memory capacities of single servers. In such cases, it is desirable to partition and keep the entire graph in a distributed memory space in order to bring the resources of a computing cluster to bear on the problem. This approach introduces a number of challenges, such as communication bottlenecks and low hardware utilization. However, it is difficult to effectively measure the impact of innovations addressing these challenges due to a lack of standardization in the domain of distributed graph processing. This research study was inspired by, and builds on, the widely used GAP Benchmark Suite (GAPBS), which was developed to provide an effective baseline and a consistent set of evaluation methodologies for shared-memory multiprocessor graph processing systems. We design and develop a new benchmark suite called DMM-GAPBS, a distributed-memory-model GAPBS. We adapt the GAPBS graph-building infrastructure and algorithms but utilize OpenSHMEM to enable a distributed memory environment, in the hope of providing a modular, extensible baseline for the distributed graph processing community. To showcase our design and implementation for processing graphs that cannot fit within a single server, we present the results of executing the DMM-GAPBS benchmark kernels on two large synthetic graphs distributed across sixteen nodes of an enterprise-class system.

The K-Core Decomposition Algorithm Under the Framework of GraphBLAS
Longlong Li (Shandong University)*; Hu Chen (Shandong University); Ping Li (Huawei Technologies Co. Ltd); Jie Han (Huawei Technologies Co. Ltd); Guanghui Wang (Shandong University)
A graph's k-core is the unique largest induced subgraph in which every node's degree is greater than or equal to k. The k-core decomposition algorithm finds the coreness value of each node in a graph, i.e. the maximum value of k over all k-cores containing that node. K-core is one of the most commonly used measures of node importance in various scientific disciplines.
The widely used classical k-core decomposition algorithm has O(n+m) complexity. However, it is not suitable for parallelization. In this paper, we propose an algebraic k-core decomposition algorithm that has O(k_max·n + m) computational complexity and can be efficiently parallelized on GPUs under the GraphBLAS framework, allowing us to compute the coreness value for graphs with billions of edges. On a 14-core CPU server and large-scale sparse datasets, our algebraic algorithm outperforms the state-of-the-art ParK and PKC algorithms, achieving up to 4× acceleration on CPU, while our parallel GPU implementation achieves up to 6× acceleration over our CPU version on several large-scale graphs.

A Machine Learning Enabled NoC Performance Evaluation
Sajal Jain (NIT Karnataka); Prachi Kale (NIT Karnataka); Pallabi Hazarika (NIT Karnataka); Biswajit Bhowmik (NIT Karnataka)*
With a growing number of diverse applications, systems-on-chip (SoCs) have rapidly developed, integrating memory, IP cores, I/O subsystems, etc. However, SoC-based communication becomes a significant concern, as these architectures often fail to fulfill real-time requirements due to communication bottlenecks for the applications. Networks-on-Chip (NoCs) offer a high-performance guarantee and have become an alternative solution. An NoC simulator is generally used to evaluate performance parameters such as latency and power consumption, as these are the crucial measures of NoC designs. As NoC size increases, however, NoC simulation becomes very time-consuming. This paper proposes a machine learning framework based on Support Vector Regression (SVR) to predict and analyze NoC performance metrics. Intensive experiments are conducted for multiple topology sizes. Results show that the proposed scheme predicts latencies of 25-85 cycles, hop counts of 2-12, and maximum and minimum switch power consumption of 13 μW and 0.04 μW, respectively. The minimum and maximum predicted NoC area is 0.08 μm^2 and 5 μm^2, respectively, while the prediction error is around 3-5% and the speed-up achieved is about 300-2350x.
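The peeling computation that the k-core abstract above parallelizes algebraically can be sketched sequentially in a few lines. This is a plain-Python illustration of classical peeling, not the authors' GraphBLAS/GPU code; the `coreness` name and the dict-of-sets graph format are choices made here. A node removed while sweeping threshold k receives coreness k-1, and removals decrement the remaining degrees of its neighbors.

```python
def coreness(adj):
    """adj: {u: set_of_neighbors}. Returns {u: coreness(u)}."""
    deg = {u: len(nbrs) for u, nbrs in adj.items()}
    alive = set(adj)
    core = {}
    k = 0
    while alive:
        # peel every surviving node whose remaining degree is below k
        peel = {u for u in alive if deg[u] < k}
        if not peel:
            k += 1                      # nothing to peel: raise the threshold
            continue
        for u in peel:
            core[u] = k - 1             # u survived the (k-1)-core but not the k-core
            alive.remove(u)
            for v in adj[u]:
                if v in alive:
                    deg[v] -= 1         # peeling u lowers its neighbors' degrees
    return core

core = coreness({'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b'}})
print(core)                             # triangle: every node has coreness 2
```

Each threshold sweep over the degree vector is exactly the kind of data-parallel operation that maps onto GraphBLAS vector primitives, which is what makes the algebraic formulation GPU-friendly.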
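The first step a distributed-memory suite like DMM-GAPBS must take is splitting the graph across processing elements. The sketch below is a hypothetical 1-D owner-compute partition, assuming a contiguous block distribution of vertices; the `partition` function is invented for illustration, and OpenSHMEM communication itself is not modeled.

```python
def partition(edges, nverts, npes):
    """Assign each edge to the PE owning its source vertex.

    edges: list of (u, v) pairs; nverts: vertex count; npes: number of PEs.
    Returns a list of per-PE local edge lists.
    """
    per = -(-nverts // npes)              # ceil division: vertices per PE
    owner = lambda v: v // per            # contiguous block ownership
    local = [[] for _ in range(npes)]
    for u, v in edges:
        local[owner(u)].append((u, v))    # owner-compute rule on the source
    return local

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(partition(edges, 4, 2))             # [[(0, 1), (1, 2)], [(2, 3), (3, 0)]]
```

Edges whose endpoints live on different PEs are what generate the communication bottlenecks the abstract mentions: in a real OpenSHMEM implementation, those crossing edges become remote put/get traffic.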