2020 IEEE High Performance Extreme Computing Virtual Conference 21 - 25 September 2020
3-1: Big Data & Distributed Computing 2 Session (11:00-12:15 EDT) High-Throughput Image Alignment for Connectomics using Frugal Snap Judgments Tim Kaler (MIT CSAIL)*; Brian Wheatman (Johns Hopkins University); Sarah Wooders (MIT) The accuracy and computational efficiency of image alignment directly affect the advancement of connectomics, a field which seeks to understand the structure of the brain through electron microscopy. We introduce the algorithms Quilter and Stacker, which are designed to perform 2D and 3D alignment, respectively, on petabyte-scale data sets from connectomics. Quilter and Stacker are efficient, scalable, and can run on hardware ranging from a researcher's laptop to a large computing cluster. On a single 18-core cloud machine each algorithm achieves throughputs of more than 1 TB/hr; when combined, the algorithms produce an end-to-end alignment pipeline that processes data at a rate of 0.82 TB/hr, an over 10x improvement over previous systems. This efficiency comes from both traditional optimizations and from the use of "Frugal Snap Judgments" to judiciously exploit performance-accuracy trade-offs. A high-throughput image-alignment pipeline was implemented using the Quilter and Stacker algorithms and its performance was evaluated using three datasets whose sizes ranged from 550 GB to 38 TB. The full alignment pipeline achieved a throughput of 0.6-0.8 TB/hr and 1.4-1.5 TB/hr on an 18-core and a 112-core shared-memory multicore, respectively. On a supercomputing cluster with 200 nodes and 1600 total cores, the pipeline achieved a throughput of 21.4 TB/hr. DS-SHMEM: Staging-enabled PGAS Programming for Data-intensive Workflows Daihou Wang (Rutgers University)* PGAS-based programming models, represented by the OpenSHMEM standard, offer advanced performance and programmability through uniform access to the memory of local and remote nodes. However, as data-intensive analysis develops, data scales are growing beyond the memory of the processing nodes. In this paper, we propose to introduce an additional data staging area into the programming model and present the SHSPACES programming model. In SHSPACES, memory PEs can be allocated and accessed asynchronously on data staging nodes, in addition to local and remote memory. Self-Scaling Clusters and Reproducible Containers to Enable Scientific Computing Peter Vaillancourt (Cornell University)*; John Coulter (Indiana University); Richard Knepper (Cornell University); Brandon Barker (Cornell University) Container technologies such as Docker have become a crucial component of many software industry practices, especially those pertaining to reproducibility and portability. The containerization philosophy has influenced the scientific computing community, which has begun to adopt, and even develop, container technologies (such as Singularity). Leveraging containers for scientific software often poses challenges distinct from those encountered in industry, and requires different methodologies. This is especially true for HPC. With an increasing number of options for HPC in the cloud (including NSF-funded cloud projects), there is strong motivation to seek solutions that provide flexibility to develop and deploy scientific software on a variety of computational infrastructures in a portable and reproducible way. The Cyberinfrastructure Resource Integration team in the XSEDE project has developed a simple tool which provides HPC infrastructure in the cloud that scales with user demand.
We now present a solution which uses the Nix package manager in an MPI-capable Docker container that is converted to Singularity. It provides consistent installations, dependencies, and environments in each image that are reproducible and portable across scientific computing infrastructures. We demonstrate the utility of these containers with cluster benchmark runs in a self-scaling virtual cluster using the Slurm scheduler deployed in the Jetstream and Aristotle Red Cloud OpenStack clouds. We conclude this technique is useful as a template for scientific software application containers to be used in the XSEDE compute environment, other Singularity HPC environments, and cloud computing environments. mpiHDFS: High-Performance Computing over a Commodity Filesystem via Aggregation, Reordering, and Coordination of Computation and I/O Da Zhang (Virginia Tech); Jing Zhang (Virginia Tech); Kaixi Hou (Virginia Tech); Sarunya Pumma (Virginia Tech); Wu-chun Feng (Virginia Tech)* With the emerging trend of integrating high-performance computing (HPC) with big data (BIGDATA) processing, running MPI over HDFS offers a promising approach for delivering better scalability and fault tolerance to traditional HPC applications. However, it comes with challenges that discourage such an approach: (1) slow two-sided communication in MPI to support intermediate data processing, (2) a focus on enabling N-1 writes that is subject to the default (and naive) HDFS block-placement policy, and (3) a pipelined writing mode in HDFS that cannot fully utilize the underlying HPC hardware. Hence, without a holistic and systematic approach, the integration of HPC and BIGDATA falls short of delivering competitive performance. As such, we propose mpiHDFS, middleware that sits between MPI applications and HDFS, to aggregate and reorder intermediate data and to coordinate computation and I/O. Specifically, mpiHDFS provides highly-optimized merge and sort for intermediate data to enrich MPI functionality, enables the overlap of computation and I/O via one-sided communication, and provides a coordinator to improve performance by leveraging data locality and communication patterns. For disk I/O, mpiHDFS realizes a parallel write mechanism with a delayed write mode for fast data flush in HDFS. To demonstrate the efficacy of mpiHDFS, we run two MPI applications (pBWA and DIAMOND) over mpiHDFS. The experimental results show that on a 17-node cluster, mpiHDFS delivers up to 1.92× and 1.78× speedup over MPI I/O and HDFS pipelined write implementations, respectively. Optimizing Data Access with Next-Generation Storage Engine, Persistent Memory and Smart NICs Kenneth Cain (Intel)*; Venkata Krishnan (Intel); Johann Lombardi (Intel) Intel teams developing the Distributed Asynchronous Object Storage (DAOS) software and conducting HPC architecture research propose to present an introduction to DAOS and engage in a broader discussion with the HPEC community of end users, researchers, system developers/integrators, and technology providers. DAOS: DAOS provides a storage service running over a set of fabric-interconnected nodes, collectively providing a single tier of storage composed from two types of media installed in each node: non-volatile / persistent memory (PM) and NVMe solid state drives (SSDs). Applications also interact with the storage service over a high-speed fabric. 
Multiple application interfaces have been developed for use with DAOS storage including a DAOS key value API with python bindings, POSIX files/directories, MPI IO, HDF5, and Apache Spark. DAOS object storage is versioned, with the ability to create immutable snapshots of data at a particular version, and to perform aggregation, reclaiming storage associated with older versions of objects. Data splitting across servers provides opportunities for high storage bandwidth. Object replication to multiple servers and erasure coding features provide data protection against some number of faults that may occur during operation, with an online self-healing / rebuild feature triggered upon fault detection. Media selection strategies (PM and SSD) in DAOS that allow it to achieve very high IOPS and storage throughput will be discussed. Extreme levels of storage performance have been measured in recent executions of the IO-500 benchmarks integrated with DAOS. How DAOS interacts with the high-performance fabric will be discussed, including its end to end OS bypass architecture, and its use of the OpenFabrics Interface (OFI) libfabric API as a foundation for inter-node storage RPC and RDMA-based bulk data transfer. Versioned objects may be used as a vehicle to save application state (e.g., checkpoint/restart), as well as offering the possibility of a pipelined/asynchronous parallel producer/consumer communication workflow through storage. Architecture Research: COPA and DAOS Borne out of considerations for reducing footprint and power consumption for storage system infrastructure, investigations of the potential application of embedded architectures have been conducted. In particular, a proof of concept Smart NIC / accelerator FPGA framework COPA (COnfigurable network Protocol Accelerator) has been developed. COPA supports OFI that has been extended to expose various storage function acceleration modes (implemented in hardware) to middleware and applications. The acceleration functions may be invoked in the context of communication operations, for example inline with data movement, or as a separate lookaside operation. COPA has been implemented on different variants of Stratix10 FPGAs. Multiple COPA FPGAs can autonomously connect to a standard 100GigE switching network. The framework has been validated by microbenchmarks as well as proxy benchmarks that mimic the behavior of client/server flows – and achieves bandwidths close to 100Gbps when acceleration is enabled. Potential mappings of the DAOS software with the COPA communication and acceleration capabilities will also be explored. Poster Session: 3-P (12:15-15:45 EDT) Human Balance Models Optimized using a Large-scale, Parallel Architecture with Applications to Mild Traumatic Brain Injury Gregory Ciccarelli (Massachusetts Institute of Technology Lincoln Laboratory); Michael Nolan (U. Washington); Hrishikesh Rao (MIT Lincoln Laboratory); Tanya Talkar (MIT Lincoln Laboratory); Anne O'Brien (Spaulding Rehabilitation Hospital); Gloria Vergara-Diaz (Spaulding Rehabilitation Hospital); Ross Zafonte (Spaulding Rehabilitation Hospital); Thomas Quatieri (Massachusetts Institute of Technology Lincoln Laboratory); Ryan McKindles (MIT Lincoln Laboratory); Paolo Bonato (Spaulding Rehabilitation Hospital); Adam Lammert ()* Static and dynamic balance are frequently disrupted through brain injuries. The impairment can be complex and for mild traumatic brain injury (mTBI) can be undetectable by standard clinical tests.  
Therefore, neurologically relevant modeling approaches are needed for detection and inference of mechanisms of injury. The current work presents models of static and dynamic balance that have a high degree of correspondence. Emphasizing structural similarity between the domains facilitates development of both. Furthermore, particular attention is paid to components of sensory feedback and sensory integration to ground mechanisms in neurobiology. Models are adapted to fit experimentally collected data from 10 healthy control volunteers and 11 mild traumatic brain injury volunteers. Through an analysis-by-synthesis approach whose implementation was made possible by a state-of-the-art high performance computing system, we derived an interpretable, model-based feature set that could classify mTBI and controls in a static balance task with an ROC AUC of 0.72. Hardware Acceleration of Nonlocal Means-Based Speckle Noise Removal Applied to SAR Imagery Hector Li Sanchez (University of Pittsburgh)*; Alan George (NSF Center for High Performance Reconfigurable Computing) Removal of speckle noise from synthetic aperture radar (SAR) imagery remains a significant challenge for space computers. Speckle noise is substantially harder to reduce than Gaussian noise due to its multiplicative nature. The probability patch-based (PPB) filter, based on the nonlocal means filter, can reduce speckle noise while preserving fine details. However, its high computational complexity inhibits its practical use in embedded applications. This issue is especially true for conventional space platforms, where radiation-hardened processors have significantly lower performance and energy-efficiency than their commercial-off-the-shelf counterparts. Combined with ever-increasing demands for data processing requirements and an emphasis on intelligent, autonomous systems, there is a need to enhance computing capabilities of space computers for current and future missions. Recently, hybrid System-on-Chip (SoC) devices have been increasingly adopted for use in space applications. Namely, CPU+FPGA devices present several architectural opportunities for efficient HW-SW partitioning and acceleration of applications. In this paper, a detailed description of a CPU+FPGA accelerator for speckle noise removal implementing the PPB filter is presented. The proposed architecture leverages the control-flow and data-flow strengths of the CPU and FPGA, respectively, to maximize performance. Studying the dataflow and computation properties of the algorithm allows for a highly parallelized, fully pipelined, and quantized design. When evaluated on the Xilinx Zynq-7045 SoC (Z-7045), our proposed architecture shows a significant performance improvement (up to ~750×) over a software-only baseline while maintaining modest FPGA resource utilization. The filtering quality is evaluated using both artificial and real SAR images. Quantitative analysis shows that use of the hardware design only introduces a negligible loss in quality. Storage Area Networks in Embedded Processing John Holland (Northrop Grumman)*; Jeremy Horner (Northrop Grumman); Jason Harnish (Northrop Grumman); Timothy Linden (Northrop Grumman); Steve Mattson (Northrop Grumman) This paper explores the application of Storage Area Networks (SAN) to size, weight, and power (SWAP) constrained systems. The goal is to provide accessible, distributed, redundant storage with a high level of reliability and availability for multifunction sensors at the edge of the mission space.  
This paper describes the advantage of applying SAN technology to these missions and describes the use of high performance embedded processing to provide the processing and storage for decision-making processing and data sharing at a SWAP appropriate for challenging form factors. Specifically, this paper addresses how recent device-level integration advancements that bring Information Technology and Operational Technology onto the same silicon represent a paradigm shift which enables SAN in embedded processing platforms. Evaluating SEU Resilience of CNNs with Fault Injection Evan Kain (COSMIAC)*; Alan George (NSF Center for High Performance Reconfigurable Computing); Tyler Lovelly (U.S. Air Force Research Laboratory) Convolutional neural networks (CNNs) are quickly growing as a solution for advanced image processing in many mission-critical high-performance and embedded computing systems ranging from supercomputers and data centers to aircraft and spacecraft. However, the systems running CNNs are increasingly susceptible to single-event upsets (SEUs), which are bit flips that result from charged particle strikes. To better understand how to mitigate the effects of SEUs on CNNs, the behavior of CNNs when exposed to SEUs must be better understood. Software fault-injection tools allow us to emulate SEUs to analyze the effects of various CNN architectures and input data features on overall resilience. Fault injection on three combinations of CNNs and datasets yielded insights into their behavior. When focusing on a threshold of 1% error in classification accuracy, more complex CNNs tended to be less resilient to SEUs, and easier classification tasks on well-clustered input data were more resilient to SEUs. Overall, the number of bits flipped to reach this threshold ranged from 20 to 3,790 bits. Results demonstrate that CNNs are highly resilient to SEUs, but the complexity of the CNN and difficulty of the classification task will decrease that resilience. Packing Narrow-Width Operands to Improve Energy Efficiency of General-Purpose GPU Computing Xin Wang (Virginia Commonwealth University); Wei Zhang (University of Louisville)* In this paper, we study the use of OWAR, an Operand-Width-Aware Register packing mechanism for GPU energy saving. In order to efficiently use the GPU register file (RF), OWAR employs a power gating method to shut down unused register sub-arrays for reducing dynamic and leakage energy consumption of the RF. As the number of register accesses is reduced due to the packing of narrow-width operands, the dynamic energy dissipation is further decreased. Finally, with the help of RF usage optimized by register packing, OWAR allows GPUs to support more TLP (Thread Level Parallelism) by assigning additional thread blocks to SMs (Streaming Multiprocessors) for GPGPU (General-Purpose GPU) applications that suffer from a deficiency of register resources. The extra TLP opens opportunities for hiding more memory latency and thus reducing the overall execution time, which can lower the overall energy consumption. We evaluate OWAR using a set of representative GPU benchmarks. The experimental results show that compared to the baseline without optimization, OWAR can reduce the GPGPU's total energy by up to 29.6%, and by 9.5% on average. In addition, OWAR achieves a performance improvement of up to 1.97X, and of 1.18X on average. 
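Editor's aside: as a purely conceptual illustration of the idea behind narrow-width operand packing (this is not the OWAR hardware mechanism, and the helper names below are hypothetical), the short Python sketch shows two values that fit in 16 bits sharing a single 32-bit slot, which halves the number of slots and accesses needed for such values.

    # Conceptual sketch only (hypothetical helpers); it mimics, in software,
    # the idea of storing two 16-bit operands in one 32-bit register slot.
    def pack16(lo, hi):
        """Pack two unsigned 16-bit operands into one 32-bit word."""
        assert 0 <= lo < (1 << 16) and 0 <= hi < (1 << 16)
        return (hi << 16) | lo

    def unpack16(word):
        """Recover the two 16-bit operands from a packed 32-bit word."""
        return word & 0xFFFF, (word >> 16) & 0xFFFF

    # Four narrow values occupy two packed words instead of four full words,
    # so a register file would see half as many slots and accesses used.
    values = [7, 300, 42, 65535]
    packed = [pack16(values[i], values[i + 1]) for i in range(0, len(values), 2)]
    assert [v for w in packed for v in unpack16(w)] == values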
3-2: Data Intensive Computing Session (12:30-13:45 EDT) Exploiting GPU Direct Access to Non-Volatile Memory to Accelerate Big Data Processing Mahsa Bayati (Northeastern University); Miriam Leeser (Northeastern University)*; Ningfang Mi (Northeastern University) The amount of data being collected for analysis is growing at an exponential rate. Along with this growth comes an increasing need for computation and storage. Researchers are addressing these needs by building heterogeneous clusters with CPUs and computational accelerators such as GPUs equipped with high I/O bandwidth storage devices. One of the main bottlenecks of such heterogeneous systems is the data transfer bandwidth to GPUs when running I/O-intensive applications. The traditional approach gets data from storage to the host memory and then transfers it to the GPU, which can limit data throughput and processing and thus degrade the end-to-end performance. In this paper, we propose a new framework to address the above issue by exploiting Peer-to-Peer Direct Memory Access to allow the GPU direct access to the storage device and thus enhance the performance of parallel data processing applications in a heterogeneous big-data platform. Our heterogeneous cluster is supplied with CPUs and GPUs as computing resources and Non-Volatile Memory express (NVMe) drives as storage resources. We deploy an Apache Spark platform to execute representative data processing workloads over this heterogeneous cluster and then adopt Peer-to-Peer Direct Memory Access to connect GPUs to non-volatile storage directly to optimize the GPU data access. Experimental results reveal that this heterogeneous Spark platform successfully bypasses the host memory and enables GPUs to communicate directly with the NVMe drive, thus achieving higher data transfer throughput and improving both data communication time and end-to-end performance by 20%. Profiling and Optimization of CT Reconstruction on Nvidia Quadro GV100 Shekhar Dwivedi (Nvidia)*; Andreas Heumann (Nvidia) Computed Tomography (CT) imaging is a widely used technique for medical and industrial applications. Iterative reconstruction algorithms are desired for improved reconstructed image quality and lower dose, but their computational requirements limit their practical usage. The Reconstruction Toolkit (RTK) is a package of open-source GPU-accelerated algorithms for CBCT (cone beam computed tomography). GPU-based iterative algorithms give immense acceleration, but they may not be optimized to use the GPU resources efficiently. Nvidia has released several profilers (Nsight Systems, Nsight Compute) to analyze the GPU implementation of an algorithm from a compute-utilization and memory-efficiency perspective. This paper profiles and analyzes the GPU implementation of the iterative FDK algorithm in RTK and optimizes it for computation and memory usage on a Quadro GV100 GPU with 32 GB of memory and over 5,000 CUDA cores. RTK-based GPU-accelerated iterative FDK, when applied to a 4-byte-per-pixel input projection dataset of size 1.1 GB (512x512x1024) for 20 iterations to reconstruct a volume of size 440 MB (512x512x441) at 4 bytes per pixel, resulted in a total runtime of ~11.2 seconds per iteration. The optimized RTK-based iterative FDK presented in this paper took ~1.3 seconds per iteration. 
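Editor's aside: as a quick back-of-envelope check of the figures quoted in the RTK profiling abstract above (the sizes and timings are taken from that abstract; the snippet itself is illustrative and not from the paper, and the variable names are arbitrary), the stated dataset sizes and per-iteration speedup are reproduced by a few lines of Python:

    BYTES_PER_PIXEL = 4

    projections = 512 * 512 * 1024 * BYTES_PER_PIXEL  # input projection stack
    volume = 512 * 512 * 441 * BYTES_PER_PIXEL        # reconstructed volume

    print(f"projection data: {projections / 2**30:.2f} GiB")  # ~1.0 GiB, quoted as 1.1 GB
    print(f"volume data:     {volume / 2**20:.0f} MiB")       # ~441 MiB, quoted as 440 MB

    baseline_s, optimized_s, iterations = 11.2, 1.3, 20
    print(f"per-iteration speedup: {baseline_s / optimized_s:.1f}x")  # ~8.6x
    print(f"time saved over {iterations} iterations: "
          f"{(baseline_s - optimized_s) * iterations:.0f} s")         # ~198 s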
A Communication-Efficient Multi-Chip Design for Range-Limited Molecular Dynamics Chunshu Wu (Boston University)*; Tong Geng (Boston University); Chen Yang (Boston University); Vipin Sachdeva (Silicon Therapeutics); Woody Sherman (Silicon Therapeutics); Martin Herbordt (Boston University) Molecular Dynamics simulation (MD) has long been considered a promising FPGA application, especially with clusters of tightly coupled FPGAs where the large-scale, general-purpose, low-latency interconnects provide a communication capability not available with any other COTS computing technology. Parallelization of one part of the MD computation, the 3D FFT, has been studied previously; for likely FPGA cluster sizes, however, the range-limited computation (RL) is more challenging. The motivation here is that the direct replication of the single-chip design suffers from inefficient inter-board bandwidth usage. In particular, although communication in RL is local, likely bandwidth limitations will constrain performance unless great care is taken in design and analysis. In the multi-chip scenario, inter-board bandwidth is the critical constraint and the main target of this work. We analyze it with respect to three application restructurings: workload distribution, data forwarding pattern, and data locality. We describe how bandwidth can be balanced by configuring workload distribution and data forwarding paths with respect to the number of on-board transceiver ports. We also show that, by manipulating data locality, the multi-chip design is efficiently migrated from the single-chip design, and the total bandwidth required can be configured to satisfy the bandwidth limit. Bit-Error Aware Quantization for DCT-based Lossy Compression Jialing Zhang (University of Massachusetts Lowell)*; Jiaxi Chen (University of Massachusetts Lowell); Aekyeung Moon (ETRI); Xiaoyan Zhuo (University of Massachusetts Lowell); Seung Woo Son (University of Massachusetts Lowell) Scientific simulations run by high-performance computing (HPC) systems produce a large amount of data, which causes an extreme I/O bottleneck and a huge storage burden. Applying compression techniques can mitigate such overheads by reducing the data size. Unlike traditional lossless compressors, error-controlled lossy compressors, such as SZ, ZFP, and DCTZ, designed for scientists who demand not only high compression ratios but also a guarantee of a certain degree of precision, are coming into prominence. While the rate-distortion efficiency of recent lossy compressors, especially the DCT-based one, is promising due to its high-compression encoding, the overall coding architecture is still conservative, necessitating a quantization that strikes a balance between different encoding possibilities and varying rate-distortions. In this paper, we aim to improve the performance of the DCT-based compressor DCTZ by optimizing the quantization model and encoding mechanism. Specifically, we propose a bit-efficient quantizer based on the DCTZ framework, develop a unique ordering mechanism based on the quantization table, and extend the encoding index. We evaluate the performance of our optimized DCTZ in terms of rate-distortion using real-world HPC datasets. Our experimental evaluations demonstrate that, on average, our proposed approach can improve the compression ratio of the original DCTZ by 1.38x. 
Moreover, combined with the extended encoding mechanism, the optimized DCTZ shows competitive performance with the state-of-the-art lossy compressors SZ and ZFP. 3-3: Case Studies & Benchmarking Session (14:15-15:30 EDT) Accelerating MRI Reconstruction on TPUs Tianjian Lu (Google)*; Thibault Marin (Harvard Medical School); Yue Zhuo (Harvard Medical School); Yi-fan Chen (Google); Chao Ma (Massachusetts General Hospital) Advanced magnetic resonance (MR) image reconstructions, such as compressed sensing and subspace-based imaging, are large-scale, iterative optimization problems. Given the large number of reconstructions required in practical clinical usage, the computation time of these advanced reconstruction methods is often unacceptable. In this work, we propose using Google's Tensor Processing Units (TPUs) to accelerate MR image reconstruction. The TPU is an application-specific integrated circuit (ASIC) for machine learning applications, which has recently been used to solve large-scale scientific computing problems. As a proof of concept, we implement the alternating direction method of multipliers (ADMM) in TensorFlow to reconstruct images on TPUs. The reconstruction is based on multi-channel, sparsely sampled, radial-trajectory k-space data with sparsity constraints. The forward and inverse non-uniform Fourier transform operations are formulated in terms of matrix multiplications as in the discrete Fourier transform. The sparsifying transform and its adjoint operations are formulated as convolutions. Data decomposition is applied to the measured k-space data such that the aforementioned tensor operations are localized within individual TPU cores. The data decomposition and the inter-core communication strategy are designed in accordance with the TPU interconnect network topology in order to minimize the communication time. The accuracy and the high parallel efficiency of the proposed TPU-based image reconstruction method are demonstrated through numerical examples. Processing of Crowdsourced Observations of Aircraft in a High Performance Computing Environment Andrew Weinert (MIT Lincoln Laboratory)*; Ngaire Underhill (MIT Lincoln Laboratory); Bilal Gill (MIT Lincoln Laboratory); Ashley Wicks (MIT Lincoln Laboratory) As unmanned aircraft systems (UASs) continue to integrate into the U.S. National Airspace System (NAS), there is a need to quantify the risk of airborne collisions between unmanned and manned aircraft to support regulation and standards development. Both regulators and standards developing organizations have made extensive use of Monte Carlo collision risk analysis simulations using probabilistic models of aircraft flight. Northeast Cyberteam – Building an Environment for Sharing Best Practices and Solutions for Research Computing Scott Valcourt (University of New Hampshire)*; John Goodhue (MGHPCC); Julie Ma (MGHPCC); Adrian Del Maestro (University of Vermont); Sia Najafi (Worcester Polytechnic Institute); Bruce Segee (University of Maine); Ralph Zottola (University of Alabama) The Northeast Cyberteam Program is a collaborative effort across Maine, New Hampshire, Vermont, and Massachusetts that seeks to assist researchers at small and medium-sized institutions in the region with making use of cyberinfrastructure, while simultaneously building the next generation of research computing facilitators. 
Recognizing that research computing facilitators are frequently in short supply, the program also places intentional emphasis on capturing and disseminating best practices in an effort to enable opportunities to leverage and build on existing solutions whenever practical. The program combines direct assistance to computationally intensive research projects; experiential learning opportunities that pair experienced mentors with students interested in research computing facilitation; sharing of resources and knowledge across large and small institutions; and tools that enable efficient oversight and possible replication of these ideas in other regions. Each project involves a researcher seeking to better utilize cyberinfrastructure in research, a student facilitator, and a mentor with relevant domain expertise. These individuals may be at the same institution or at separate institutions. The student works with the researcher and the mentor to become a bridge between the infrastructure and the research domain. Through this model, students receive training and opportunities that otherwise would not be available, research projects get taken to a higher level, and the effectiveness of the mentor is multiplied. Providing tools to enable self-service learning is a key concept in our strategy to develop facilitators through experiential learning, recognizing that one of the most fundamental skills of successful facilitators is their ability to quickly learn enough about new domains and applications to be able to draw parallels with their existing knowledge and help to solve the problem at hand. The Cyberteam Portal is used to access the self-service learning resources developed to provide just-in-time information delivery to participants as they embark on projects in unfamiliar domains, and also serves as a receptacle for best practices, tools, and techniques developed during a project. Tools include Ask.CI, an interactive site for questions and answers; a learning resources repository used to collect online training modules vetted by cyberteam projects that provide starting points for subsequent projects or independent activities; and a Github repository. The Northeast Cyberteam was created with funding from the National Science Foundation, but has developed strategies for sustainable operations. Benchmarking network fabrics for data distributed training of deep neural networks Siddharth Samsi (MIT Lincoln Laboratory)*; Andrew Prout (MIT); Michael Jones (MIT Lincoln Laboratory); Andrew Kirby (Massachusetts Institute of Technology Lincoln Laboratory); Vijay Gadepally (MIT Lincoln Laboratory - USA); Jeremy Kepner (MIT Lincoln Laboratory); Albert Reuther (MIT Lincoln Laboratory) Artificial Intelligence/Machine Learning applications require the training of complex models on large amounts of labelled data. The large computational requirements for training deep models have necessitated the development of new methods for faster training. One such approach is the data parallel approach, where the training data is distributed across multiple compute nodes. This approach is simple to implement and supported by most of the commonly used machine learning frameworks. The data parallel approach leverages MPI for communicating gradients across all nodes. In this paper, we examine the effects of using different physical hardware interconnects and network-related software primitives for enabling data distributed deep learning. We compare the effect of using GPUDirect and NCCL on Ethernet and OmniPath fabrics. 
Our results show that using Ethernet-based networking in shared HPC systems does not have a significant effect on the training times for commonly used deep neural network architectures or traditional HPC applications such as Computational Fluid Dynamics. A congestion control mechanism for SDN-based fat-tree networks Haitham Ghalwash (University of Connecticut)*; Chun-Hsi Huang (Southern Illinois University) Nowadays, data centers are experiencing increasing application demands that mandate a software-oriented network architecture. A Software-Defined Network (SDN) is the new technology for overcoming the traditional network's limitations. QoS is one current limitation that needs to be well structured for a successful software-oriented network architecture. Several key factors affect QoS, including traffic shaping and congestion control. This paper proposes a congestion control mechanism in SDN-based networks to enhance the overall QoS. The proposed mechanism monitors and detects congested parts and reacts automatically to reduce the traffic load. The traffic load is redistributed to ensure better performance by re-routing a subset of flows. The re-routing decision is based on a passively measured QoS metric, port utilization. Experiments are conducted to demonstrate the effectiveness of the proposed mechanism in an SDN-based fat-tree network. Results showed that TCP traffic recorded a noticeable improvement, namely, 22.4% in average delay, 21.3% in throughput, 18.6% in maximum delay, and 15.3% in jitter. Moreover, the maximum monitored port utilization in the aggregation and core switches was also reduced by 22% on average. 3-4: Case Studies & Benchmarking Session (15:45-17:00 EDT) Performance Strategies for Parallel Bitonic Sort on a Migratory Thread Architecture Kaushik Velusamy (University of Maryland, Baltimore County); Thomas Rolinger (Laboratory for Physical Sciences)*; Janice McMahon (Emu Technologies) Large-scale data analytics often represent vast amounts of sparse data as a graph. As a result, the underlying kernels in data analytics can be reduced down to operations over graphs, such as searches and traversals. Graph algorithms are notoriously difficult to implement for high performance due to the irregular nature of their memory access patterns, resulting in poor utilization of a traditional cache memory hierarchy. As a result, new architectures have been proposed that specifically target irregular applications. One example is the cache-less Emu migratory thread architecture developed by Lucata Technology. While it is important to evaluate and understand irregular applications on a system such as Emu, it is equally important to explore applications which are not irregular themselves, but are often used as key pre-processing steps in irregular applications. Sorting a list of values is one such pre-processing step, as well as one of the fundamental operations in data analytics. In this paper, we extend our prior preliminary evaluation of parallel bitonic sort on the Emu architecture. We explore different performance strategies for bitonic sort by leveraging the unique features of Emu. In doing so, we implement three significant capabilities into bitonic sort: a smart data layout that periodically remaps data to avoid remote accesses, efficient thread spawning strategies, and adaptive loop parallelization to achieve proper load balancing over time. We present a performance evaluation that demonstrates speed-ups of as much as 14.26x by leveraging these capabilities. 
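Editor's aside: for readers unfamiliar with the kernel being parallelized in the bitonic sort paper above, a minimal serial reference implementation is sketched below (illustrative Python only, assuming a power-of-two input length; it is not the Emu implementation). Every compare-exchange within a merge stage is independent of the others, which is what makes the sorting network attractive for parallel and migratory-thread execution.

    def bitonic_sort(values, ascending=True):
        """Serial reference bitonic sort; len(values) must be a power of two."""
        n = len(values)
        if n <= 1:
            return list(values)
        mid = n // 2
        first = bitonic_sort(values[:mid], True)     # ascending half
        second = bitonic_sort(values[mid:], False)   # descending half -> bitonic sequence
        return _bitonic_merge(first + second, ascending)

    def _bitonic_merge(seq, ascending):
        n = len(seq)
        if n == 1:
            return list(seq)
        mid = n // 2
        seq = list(seq)
        for i in range(mid):                         # independent compare-exchanges
            if (seq[i] > seq[i + mid]) == ascending:
                seq[i], seq[i + mid] = seq[i + mid], seq[i]
        return (_bitonic_merge(seq[:mid], ascending) +
                _bitonic_merge(seq[mid:], ascending))

    assert bitonic_sort([7, 3, 9, 1, 5, 8, 2, 6]) == [1, 2, 3, 5, 6, 7, 8, 9]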
Hash Table Scalability on Intel PIUMA Balasubramanian Seshasayee (Intel Corporation)*; Joshua Fryman (Intel); Ibrahim Hur (Intel Corporation) The Intel PIUMA (Programmable and Integrated Unified Memory Architecture) is a scalable, massively multithreaded architecture designed to operate on unstructured data, with a global address space, fine-grain memory access, and various novel features for latency hiding during data movement. Hash tables are a commonly used data structure with unstructured data, hence it is imperative that the performance and scaling of hash table usage are optimized for this architecture. We study three different hash table implementations on a PIUMA simulator to show that a dual-atomics-based implementation, a unique feature in PIUMA, performs competitively both at larger scales and under hash collisions. Our implementations are able to achieve strong scaling up to 16,384 hardware threads. Enhanced Parallel Simulation for ACAS X Development Adam Gjersvik (MIT Lincoln Laboratory)* ACAS X is the next-generation airborne collision avoidance system intended to meet the demands of the rapidly evolving U.S. National Airspace System (NAS). The collision avoidance safety and operational suitability of the system are optimized and continuously evaluated by simulating billions of characteristic aircraft encounters in a fast-time Monte Carlo environment. There is therefore an inherent computational cost associated with each ACAS X design iteration, and parallelization of the simulations is necessary to keep up with rapid design cycles. This work describes an effort to profile and enhance the parallel computing infrastructure deployed on the computing resources offered by the Lincoln Laboratory Supercomputing Center. The approach to large-scale parallelization of our fast-time airspace encounter simulation tool is presented along with corresponding parallel profile data collected on different kinds of compute nodes. A simple stochastic model for distributed simulation is also presented to inform optimal work batching for improved simulation efficiency. The paper concludes with a discussion on how this high-performance parallel simulation method enables the rapid safety-critical design of ACAS X in a fast-paced iterative design process. Architectural Analysis of Deep Learning on Edge Accelerators Luke Kljucaric (NSF SHREC)*; Alex Johnson (NSF SHREC); Alan George (NSF SHREC) As computer architectures continue to integrate application-specific hardware, it is critical to understand the relative performance of devices for maximum app acceleration. The goal of benchmarking suites, such as MLPerf for analyzing machine-learning (ML) hardware performance, is to standardize a fair comparison of different hardware architectures. However, there are many apps that are not well represented by these standards and that require different workloads, such as different ML models and datasets, to achieve similar goals. Additionally, many devices feature hardware optimized for data types other than 32-bit floating-point numbers, the standard representation defined by MLPerf. Edge-computing devices often feature app-specific hardware to offload common operations found in ML apps from the constrained CPU. This research analyzes multiple low-power compute architectures that feature ML-specific hardware on a case study of handwritten Chinese character recognition. Specifically, AlexNet and a custom version of GoogLeNet are benchmarked in terms of their streaming latency for optical character recognition. 
Considering these models are custom and not the most widely used, many architectures are not specifically optimized for them. The performance of these models can stress devices in different, yet insightful, ways from which generalizations about the performance of other models can be drawn. The NVIDIA Jetson AGX Xavier (AGX), Intel Neural Compute Stick 2 (NCS2), and Google Edge TPU architectures are analyzed with respect to their performance. The design of the AGX and TPU devices showcased the lowest streaming latency for AlexNet and GoogLeNet, respectively. Additionally, the tightly-integrated NCS2 design showed the best generalizability in performance and efficiency across neural networks. 3-S1: Graph Challenge Special (17:30-19:30 EDT) Scaling Graph Clustering with Distributed Sketches Benjamin Priest (Lawrence Livermore National Laboratory)*; Alec Dunton (University of Colorado Boulder); Geoffrey Sanders (LLNL) The unsupervised learning of community structure, in particular the partitioning of vertices into clusters or communities, is a canonical and well-studied problem in exploratory graph analysis. However, like most graph analyses, the introduction of immense scale presents challenges to traditional methods. Spectral clustering in distributed memory, for example, requires hundreds of expensive bulk-synchronous communication rounds to compute an embedding of vertices onto a few eigenvectors of a graph-associated matrix. Furthermore, the whole computation may need to be repeated if the underlying graph changes by some low percentage of edge updates. We present a method inspired by spectral clustering where we instead use matrix sketches derived from random dimension-reducing projections. We show that our method produces embeddings that yield performant clustering results given a fully-dynamic stochastic block model stream using both the fast Johnson-Lindenstrauss and CountSketch transforms. We also discuss the effects of stochastic block model parameters upon the required dimensionality of the subsequent embeddings, and show how random projections could significantly improve the performance of graph clustering in distributed memory. At-Scale Sparse Deep Neural Network Inference With Efficient GPU Implementation Mert Hidayetoglu (University of Illinois at Urbana-Champaign)*; Carl Pearson (University of Illinois at Urbana-Champaign); Vikram Sharma Mailthody (University of Illinois at Urbana-Champaign); Eiman Ebrahimi (NVIDIA); Jinjun Xiong (IBM Thomas J. Watson Research Center); Rakesh Nagi (University of Illinois at Urbana-Champaign); Wen-Mei Hwu (University of Illinois at Urbana-Champaign) This paper presents GPU performance optimization and scaling results for inference models of the Sparse Deep Neural Network Challenge 2020. Demands for network quality have increased rapidly, pushing the size and thus the memory requirements of many neural networks beyond the capacity of available accelerators. Sparse deep neural networks (SpDNN) have shown promise for reining in the memory footprint of large neural networks. However, there is room for improvement in implementing SpDNN operations on GPUs. This work presents optimized sparse matrix multiplication kernels fused with the ReLU function. The optimized kernels reuse input feature maps from the shared memory and sparse weights from registers. For multi-GPU parallelism, our SpDNN implementation duplicates weights and statically partitions the feature maps across GPUs. 
Results for the challenge benchmarks show that the proposed kernel design and multi-GPU parallelization achieve up to 180 TeraEdges per second inference throughput. These results are up to 4.3X faster for a single GPU and an order of magnitude faster at full scale than those of the champion of the 2019 Sparse Deep Neural Network Graph Challenge for the same generation of NVIDIA V100 GPUs. Using the same implementation, we also show that single-GPU throughput on the NVIDIA A100 is 2.37X faster than on the V100. A Novel Inference Algorithm for Large Sparse Neural Network using Task Graph Parallelism Dian-Lun Lin (University of Utah); Tsung-Wei Huang (University of Utah)* The ever-increasing size of modern deep neural network (DNN) architectures has put increasing strain on the hardware needed to implement them. Sparsified DNNs can greatly reduce memory costs and increase throughput over standard DNNs, if the loss of accuracy can be adequately controlled. However, sparse DNNs present unique computational challenges. Efficient model or data parallelism algorithms are extremely hard to design and implement. The recent MIT/IEEE/Amazon HPEC Graph Challenge has drawn attention to high-performance inference methods for large sparse DNNs. In this paper, we introduce SNIG, an efficient inference engine for large sparse DNNs. SNIG develops highly optimized inference kernels and leverages the power of CUDA Graphs to enable efficient decomposition of model and data parallelisms. Our decomposition strategy is flexible and scalable to different partitions of data volumes, model sizes, and GPU numbers. We have evaluated SNIG on the official benchmarks of the HPEC Sparse DNN Challenge and demonstrated promising performance that scales from a single GPU to multiple GPUs. Compared to the champion of the 2019 HPEC Sparse DNN Challenge, SNIG can finish all inference workloads using only a single GPU. On the largest DNN, which has more than 4 billion parameters across 1920 layers, each with 65,536 neurons, SNIG is up to 2.3× faster than a state-of-the-art baseline on a machine with 4 GPUs. TriC: Distributed-memory Triangle Counting by Exploiting the Graph Structure Sayan Ghosh (Washington State University)*; Mahantesh Halappanavar (Pacific Northwest National Laboratory) Graph analytics has emerged as an important tool in the analysis of large-scale data from diverse application domains such as social networks, cyber security and bioinformatics. Counting the number of triangles in a graph is a fundamental kernel with several applications, such as detecting the community structure of a graph or identifying important vertices in a graph. The ubiquity of massive datasets is driving the need to scale graph analytics on parallel systems. However, numerous challenges exist in efficiently parallelizing graph algorithms, especially on distributed-memory systems. Irregular memory accesses and communication patterns, low computation-to-communication ratios, and the need for frequent synchronization are some of the leading challenges. In this paper, we present TriC, our distributed-memory implementation of triangle counting in graphs using the Message Passing Interface (MPI), as a submission to the 2020 Graph Challenge competition. Using a set of synthetic and real-world inputs from the challenge, we demonstrate a speedup of up to 90x relative to previous work on 32 processor-cores of a NERSC Cori node. We also provide details from distributed runs with up to 8192 processes along with strong scaling results. 
The observations presented in this work provide an understanding of the system-level bottlenecks at scale that specifically impact sparse-irregular workloads and will therefore benefit other efforts to parallelize graph algorithms. Combinatorial Tiling for Sparse Neural Networks Filip Pawłowski (ENS Lyon)*; Bora Uçar (CNRS and LIP (UMR5668)); Rob Bisseling (Utrecht University); Albert-Jan Yzelman (Huawei Zürich Research Center) Sparse deep neural networks (DNNs) emerged as the result of the search for networks with less storage and lower computational complexity. Sparse DNN inference is the task of using such trained networks to classify a batch of input data. We propose an efficient, hybrid model- and data-parallel DNN inference using hypergraph models and partitioners. We exploit tiling and weak synchronization to increase cache reuse, hide load imbalance, and hide synchronization costs. Finally, a blocking approach allows application of this new hybrid inference procedure to deep neural networks. We initially experiment using the hybrid tiled inference approach only, using the first five layers of networks from the IEEE HPEC 2019 Graph Challenge, and attain up to 2x speedup versus a data-parallel baseline. Studying the Effects of Hashing of Sparse Deep Neural Networks on Data and Model Parallelisms Mohammad Hasanzadeh Mofrad (University of Pittsburgh)*; Rami Melhem (University of Pittsburgh); Mohammad Hammoud (Carnegie Mellon University); Muhammad Yousuf Ahmad (Carnegie Mellon University in Qatar) Deep Neural Network (DNN) training and inference are two resource-intensive tasks that are usually scaled out using data or model parallelism, where data parallelism parallelizes over the input data and model parallelism parallelizes over the network. Also, dense matrix-matrix multiplication is the key primitive behind training/inference of dense DNNs. On the contrary, sparse DNNs are less resource-intensive compared to their dense counterparts while offering comparable accuracy. Similarly, they can be parallelized using data or model parallelism with Sparse Matrix-Matrix Multiplication (SpMM) as the key primitive. To scale out, both data and model parallelisms initially use data parallelism to partition the input data among multiple machines. This initial partitioning of the input makes the performance of data and model parallelisms prone to load imbalance, as partitions may be imbalanced. As part of this paper, we take a deeper look into data and model parallelisms and closely study the mechanics of the SpMM used for each. Moreover, to intuitively remedy their load imbalance problem, we incorporate hashing as a simple yet powerful method to address load imbalance. Finally, we use the IEEE HPEC sparse DNN challenge dataset to evaluate the performance of data and model parallelisms at scale. We scaled up to 32 machines (896 cores) and inferred a large sparse DNN with 4B parameters in 51 seconds. Results suggest that with hashing, data and model parallelisms achieve super-linear speedup due to better load balance and cache utilization. Incremental Streaming Graph Partitioning L Durbeck (Virginia Tech)*; Peter Athanas (Virginia Tech) Graph partitioning is an NP-hard problem whose efficient approximation has long been a subject of interest and progress. 
The I/O bounds of contemporary computing environments favor incremental or streaming graph partitioning methods. Methods have sought a balance between latency, simplicity, accuracy, and memory size. In this paper, we apply an incremental approach to streaming partitioning that tracks changes with a lightweight proxy, triggering repartitioning as error increases. We evaluate its performance on the DARPA/MIT Graph Challenge streaming stochastic block partition dataset, and find that it can dramatically reduce the invocation of partitioning, which can provide an order of magnitude speedup. KTrussExplorer: Exploring the Design Space of K-truss Decomposition Optimizations on GPUs Safaa Diab (American University of Beirut); Mhd Ghaith Olabi (American University of Beirut); Izzat El Hajj (American University of Beirut)* K-truss decomposition is an important method in graph analytics for finding cohesive subgraphs in a graph. Various works have accelerated k-truss decomposition on GPUs and have proposed different optimizations while doing so. The combinations of these optimizations form a large design space. However, most GPU implementations focus on a specific combination or set of combinations in this space. This paper surveys the optimizations applied to k-truss decomposition on GPUs, and presents KTrussExplorer, a framework for exploring the design space formed by the combinations of these optimizations. Our evaluation shows that the best combination depends highly on the graph of choice, and analyzes the conditions that make each optimization attractive. Some of the best combinations we find outperform previous Graph Challenge champions on many large graphs. Analysis of floating-point round-off error in linear algebra routines for graph clustering Lucia Yang (University of Colorado Boulder)*; Alyson Fox (LLNL) We explore the various ways rounding errors can impact the power method for calculating the Fiedler vector for graph clustering. A rounding error analysis reveals that the best eigenpair that is computable with a certain floating-point precision type has a worst-case error that scales with its unit round-off. Although rounding errors can accumulate in the power method at the worst-case bound, this behavior is not reflected in some practical examples. Furthermore, our numerical experiments show that rounding errors from the power method may satisfy the conditions necessary for bounding the mis-clustering rate, and that approximate eigenvectors with errors close to the half-precision unit round-off can yield sufficient clustering results for partitioning stochastic block model graphs.  
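Editor's aside: as a point of reference for the abstract above, the snippet below sketches one standard way to obtain the Fiedler vector with a power method: iterate on a shifted Laplacian while projecting out the trivial constant eigenvector. It is a minimal NumPy illustration under those assumptions, not the authors' code, and it makes no attempt to model the rounding-error behavior the paper studies.

    import numpy as np

    def fiedler_power_method(adjacency, iters=1000, tol=1e-10, seed=0):
        """Estimate the Fiedler vector of an undirected graph by power iteration
        on the shifted Laplacian B = sigma*I - L, with the constant eigenvector
        projected out so the dominant remaining direction corresponds to the
        second-smallest eigenvalue of L."""
        A = np.asarray(adjacency, dtype=float)
        degrees = A.sum(axis=1)
        L = np.diag(degrees) - A
        sigma = 2.0 * degrees.max()               # Gershgorin bound on L's spectrum
        B = sigma * np.eye(len(A)) - L

        ones = np.ones(len(A)) / np.sqrt(len(A))  # trivial eigenvector of L
        rng = np.random.default_rng(seed)
        v = rng.standard_normal(len(A))
        v -= ones * (ones @ v)                    # deflate the constant direction
        v /= np.linalg.norm(v)

        for _ in range(iters):
            w = B @ v
            w -= ones * (ones @ w)                # keep iterate orthogonal to ones
            w /= np.linalg.norm(w)
            if np.linalg.norm(w - v) < tol:
                return w
            v = w
        return v

    # Two triangles joined by a single edge: the sign of the Fiedler vector
    # splits the vertices into the two obvious communities {0,1,2} and {3,4,5}.
    A = np.zeros((6, 6))
    for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
        A[i, j] = A[j, i] = 1.0
    print(np.sign(fiedler_power_method(A)))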
Wednesday, September 23
2020 Abstract Book
3-1: Big Data & Distributed Computing 2 Session (11:00-12:15 EDT) High-Throughput Image Alignment for Connectomics using Frugal Snap Judgments Tim Kaler (MIT CSAIL)*; Brian Wheatman (Johns Hopkins University); Sarah Wooders (MIT) The accuracy and computational efficiency of image alignment directly affects the advancement of connectomics, a field which seeks to understand the structure of the brain through electron microscopy. We introduce the algorithms Quilter and Stacker that are designed to perform 2D and 3D alignment respectively on petabyte-scale data sets from connectomics. Quilter and Stacker are efficient, scalable, and can run on hardware ranging from a researcher's laptop to a large computing cluster. On a single 18-core cloud machine each algorithm achieves throughputs of more than 1 TB/hr; when combined the algorithms produce an end-to-end alignment pipeline that processes data at a rate of 0.82 TB/hr --- an over 10x improvement from previous systems. This efficiency comes from both traditional optimizations and from the use of ``Frugal Snap Judgments'' to judiciously exploit performance-- accuracy trade-offs. A high-throughput image-alignment pipeline was implemented using the Quilter and Stacker algorithms and its performance was evaluated using three datasets whose size ranged from 550 GB to 38 TB.  The full alignment pipeline achieved a throughput of 0.6-- 0.8 TB/hr and 1.4--1.5 TB/hr on an 18-core and 112-core shared- memory multicore, respectively. On a supercomputing cluster with 200 nodes and 1600 total cores, the pipeline achieved a throughput of 21.4 TB/hr. DS-SHMEM: Staging-enabled PGAS Programming for Data- intensive Workflows Daihou Wang (Rutgers University)* PGAS based programming models, represented by OpenSHMEM standard, has been bringing advanced performance and programmability via its performances uni- formly access over memory of local and remote nodes. However, as the development of data intensive analysis, data scale are growing over the memory of processing nodes. In this paper, we propose to introduce additional data staging area into the programming model, and introduce SHSPACES programming model. In SHAPCES, memory PEs can be allocated and accessed asynchronously on data staging nodes, in addition to local and remote memory. Self-Scaling Clusters and Reproducible Containers to Enable Scientific Computing Peter Vaillancourt (Cornell University)*; John Coulter (Indiana University); Richard Knepper (Cornell University); Brandon Barker (Cornell University) Container technologies such as Docker have become a crucial component of many software industry practices especially those pertaining to reproducibility and portability.  The containerization philosophy has influenced the scientific computing community, which has begun to adopt - and even develop - container technologies (such as Singularity).  Leveraging containers for scientific software often poses challenges distinct from those encountered in industry, and requires different methodologies.  This is especially true for HPC.  With an increasing number of options for HPC in the cloud (including NSF-funded cloud projects), there is strong motivation to seek solutions that provide flexibility to develop and deploy scientific software on a variety of computational infrastructures in a portable and reproducible way.  The Cyberinfrastructure Resource Integration team in the XSEDE project has developed a simple tool which provides HPC infrastructure in the cloud that scales with user demand.  
We now present a solution which uses the Nix package manager in an MPI- capable Docker container that is converted to Singularity. It provides consistent installations, dependencies, and environments in each image that are reproducible and portable across scientific computing infrastructures.  We demonstrate the utility of these containers with cluster benchmark runs in a self-scaling virtual cluster using the Slurm scheduler deployed in the Jetstream and Aristotle Red Cloud OpenStack clouds.  We conclude this technique is useful as a template for scientific software application containers to be used in the XSEDE compute environment, other Singularity HPC environments, and cloud computing environments. mpiHDFS: High-Performance Computing over a Commodity Filesystem via Aggregation, Reordering, and Coordination of Computation and I/O Da Zhang (Virginia Tech); Jing Zhang (Virginia Tech); Kaixi Hou (Virginia Tech); Sarunya Pumma (Virginia Tech); Wu-chun Feng (Virginia Tech)* With the emerging trend of integrating high-per- formance computing (HPC) with big data (BIGDATA) processing, running MPI over HDFS offers a promising approach for deliv- ering better scalability and fault tolerance to traditional HPC applications. However, it comes with challenges that discourage such an approach: (1) slow two-sided communication in MPI to support intermediate data processing, (2) a focus on enabling N-1 writes that is subject to the default (and naive) HDFS block- placement policy, and (3) a pipelined writing mode in HDFS that cannot fully utilize the underlying HPC hardware. Hence, without a holistic and systematic approach, the integration of HPC and BIGDATA falls short of delivering competitive performance.  As such, we propose mpiHDFS, middleware that sits between MPI applications and HDFS, to aggregate and reorder interme- diate data and to coordinate computation and I/O. Specifically, mpiHDFS provides highly-optimized merge and sort for inter- mediate data to enrich MPI functionality, enables the overlap of computation and I/O via one-sided communication, and provides a coordinator to improve performance by leveraging data locality and communication patterns. For disk I/O, mpiHDFS realizes a parallel write mechanism with a delayed write mode for fast data flush in HDFS. To demonstrate the efficacy of mpiHDFS, we run two MPI applications (pBWA and DIAMOND) over mpiHDFS. The experimental results show that on a 17-node cluster, mpi- HDFS delivers up to 1.92× and 1.78× speedup over MPI I/O and HDFS pipelined write implementations, respectively. Optimizing Data Access with Next-Generation Storage Engine, Persistent Memory and Smart NICs Kenneth Cain (Intel)*; Venkata Krishnan (Intel); Johann Lombardi (Intel) Intel teams developing the Distributed Asynchronous Object Storage (DAOS) software and conducting HPC architecture research propose to present an introduction to DAOS and engage in a broader discussion with the HPEC community of end users, researchers, system developers/integrators, and technology providers. DAOS: DAOS provides a storage service running over a set of fabric- interconnected nodes, collectively providing a single tier of storage composed from two types of media installed in each node: non- volatile / persistent memory (PM) and NVMe solid state drives (SSDs). Applications also interact with the storage service over a high-speed fabric. 
Multiple application interfaces have been developed for use with DAOS storage including a DAOS key value API with python bindings, POSIX files/directories, MPI IO, HDF5, and Apache Spark. DAOS object storage is versioned, with the ability to create immutable snapshots of data at a particular version, and to perform aggregation, reclaiming storage associated with older versions of objects. Data splitting across servers provides opportunities for high storage bandwidth. Object replication to multiple servers and erasure coding features provide data protection against some number of faults that may occur during operation, with an online self-healing / rebuild feature triggered upon fault detection. Media selection strategies (PM and SSD) in DAOS that allow it to achieve very high IOPS and storage throughput will be discussed. Extreme levels of storage performance have been measured in recent executions of the IO-500 benchmarks integrated with DAOS. How DAOS interacts with the high-performance fabric will be discussed, including its end to end OS bypass architecture, and its use of the OpenFabrics Interface (OFI) libfabric API as a foundation for inter-node storage RPC and RDMA-based bulk data transfer. Versioned objects may be used as a vehicle to save application state (e.g., checkpoint/restart), as well as offering the possibility of a pipelined/asynchronous parallel producer/consumer communication workflow through storage. Architecture Research: COPA and DAOS Borne out of considerations for reducing footprint and power consumption for storage system infrastructure, investigations of the potential application of embedded architectures have been conducted. In particular, a proof of concept Smart NIC / accelerator FPGA framework COPA (COnfigurable network Protocol Accelerator) has been developed. COPA supports OFI that has been extended to expose various storage function acceleration modes (implemented in hardware) to middleware and applications. The acceleration functions may be invoked in the context of communication operations, for example inline with data movement, or as a separate lookaside operation. COPA has been implemented on different variants of Stratix10 FPGAs. Multiple COPA FPGAs can autonomously connect to a standard 100GigE switching network. The framework has been validated by microbenchmarks as well as proxy benchmarks that mimic the behavior of client/server flows – and achieves bandwidths close to 100Gbps when acceleration is enabled. Potential mappings of the DAOS software with the COPA communication and acceleration capabilities will also be explored. Poster Session: 3-P (12:15-15:45 EDT) Human Balance Models Optimized using a Large-scale, Parallel Architecture with Applications to Mild Traumatic Brain Injury Gregory Ciccarelli (Massachusetts Institute of Technology Lincoln Laboratory); Michael Nolan (U. Washington); Hrishikesh Rao (MIT Lincoln Laboratory); Tanya Talkar (MIT Lincoln Laboratory); Anne O'Brien (Spaulding Rehabilitation Hospital); Gloria Vergara-Diaz (Spaulding Rehabilitation Hospital); Ross Zafonte (Spaulding Rehabilitation Hospital); Thomas Quatieri (Massachusetts Institute of Technology Lincoln Laboratory); Ryan McKindles (MIT Lincoln Laboratory); Paolo Bonato (Spaulding Rehabilitation Hospital); Adam Lammert ()* Static and dynamic balance are frequently disrupted through brain injuries. The impairment can be complex and for mild traumatic brain injury (mTBI) can be undetectable by standard clinical tests.  
Therefore, neurologically relevant modeling approaches are needed for detection and inference of mechanisms of injury. The current work presents models of static and dynamic balance that have a high degree of correspondence.  Emphasizing structural similarity between the domains facilitates development of both.  Furthermore, particular attention is paid to components of sensory feedback and sensory integration to ground mechanisms in neurobiology.  Models are adapted to fit experimentally collected data from 10 healthy control volunteers and 11 mild traumatic brain injury volunteers.  Through an analysis-by-synthesis approach whose implementation was made possible by a state-of-the-art high performance computing system, we derived an interpretable, model-based feature set that could classify mTBI and controls in a static balance task with an ROC AUC of 0.72. Hardware Acceleration of Nonlocal Means-Based Speckle Noise Removal Applied to SAR Imagery Hector Li Sanchez (University of Pittsburgh)*; Alan George (NSF Center for High Performance Reconfigurable Computing) Removal of speckle noise from synthetic aperture radar (SAR) imagery remains a significant challenge for space computers. Speckle noise is substantially harder to reduce than Gaussian noise due to its multiplicative nature. The probability patch-based (PPB) filter, based on the nonlocal means filter, can reduce speckle noise while preserving fine details. However, its high computational complexity inhibits its practical use in embedded applications. This issue is especially true for conventional space platforms, where radiation-hardened processors have significantly lower performance and energy-efficiency than their commercial-off-the-shelf counterparts. Combined with ever-increasing data-processing requirements and an emphasis on intelligent, autonomous systems, there is a need to enhance the computing capabilities of space computers for current and future missions. Recently, hybrid System-on-Chip (SoC) devices have been increasingly adopted for use in space applications. Namely, CPU+FPGA devices present several architectural opportunities for efficient HW-SW partitioning and acceleration of applications. In this paper, a detailed description of a CPU+FPGA accelerator for speckle noise removal implementing the PPB filter is presented. The proposed architecture leverages the control-flow and data-flow strengths of the CPU and FPGA, respectively, to maximize performance. Studying the dataflow and computation properties of the algorithm allows for a highly parallelized, fully pipelined, and quantized design. When evaluated on the Xilinx Zynq-7045 SoC (Z-7045), our proposed architecture shows a significant performance improvement (up to ~750×) over a software-only baseline while maintaining modest FPGA resource utilization. The filtering quality is evaluated using both artificial and real SAR images. Quantitative analysis shows that use of the hardware design introduces only a negligible loss in quality. Storage Area Networks in Embedded Processing John Holland (Northrop Grumman)*; Jeremy Horner (Northrop Grumman); Jason Harnish (Northrop Grumman); Timothy Linden (Northrop Grumman); Steve Mattson (Northrop Grumman) This paper explores the application of Storage Area Networks (SAN) to size, weight, and power (SWAP) constrained systems.  The goal is to provide accessible, distributed, redundant storage with a high level of reliability and availability for multifunction sensors at the edge of the mission space.  
This paper describes the advantages of applying SAN technology to these missions and describes the use of high performance embedded processing to provide the processing and storage for decision-making processing and data sharing at a SWAP appropriate for challenging form factors.  Specifically, this paper addresses how recent device-level integration advancements that bring Information Technology and Operational Technology onto the same silicon represent a paradigm shift which enables SAN in embedded processing platforms. Evaluating SEU Resilience of CNNs with Fault Injection Evan Kain (COSMIAC)*; Alan George (NSF Center for High Performance Reconfigurable Computing); Tyler Lovelly (U.S. Air Force Research Laboratory) Convolutional neural networks (CNNs) are quickly growing as a solution for advanced image processing in many mission-critical high-performance and embedded computing systems ranging from supercomputers and data centers to aircraft and spacecraft. However, the systems running CNNs are increasingly susceptible to single-event upsets (SEUs), which are bit flips that result from charged-particle strikes. To mitigate the effects of SEUs on CNNs, the behavior of CNNs when exposed to SEUs must first be understood. Software fault-injection tools allow us to emulate SEUs to analyze the effects of various CNN architectures and input data features on overall resilience. Fault injection on three combinations of CNNs and datasets yielded insights into their behavior. When focusing on a threshold of 1% error in classification accuracy, more complex CNNs tended to be less resilient to SEUs, and easier classification tasks on well-clustered input data were more resilient to SEUs. Overall, the number of bits flipped to reach this threshold ranged from 20 to 3,790 bits. Results demonstrate that CNNs are highly resilient to SEUs, but the complexity of the CNN and the difficulty of the classification task will decrease that resilience. Packing Narrow-Width Operands to Improve Energy Efficiency of General-Purpose GPU Computing Xin Wang (Virginia Commonwealth University); Wei Zhang (University of Louisville)* In this paper, we study the use of OWAR, an Operand-Width-Aware Register packing mechanism for GPU energy saving. In order to efficiently use the GPU register file (RF), OWAR employs a power gating method to shut down unused register sub-arrays, reducing the dynamic and leakage energy consumption of the RF. As the number of register accesses is reduced due to the packing of narrow-width operands, the dynamic energy dissipation is further decreased. Finally, with RF usage optimized by register packing, OWAR allows GPUs to support more TLP (Thread Level Parallelism) by assigning additional thread blocks to SMs (Streaming Multiprocessors) for GPGPU (General-Purpose GPU) applications that suffer from a deficiency of register resources. The extra TLP opens opportunities for hiding more memory latencies and thus reducing the overall execution time, which can lower the overall energy consumption. We evaluate OWAR using a set of representative GPU benchmarks. The experimental results show that, compared to the baseline without optimization, OWAR can reduce the GPGPU's total energy by up to 29.6%, and by 9.5% on average. In addition, OWAR achieves performance improvements of up to 1.97X, and 1.18X on average. 
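The register packing described above happens inside the GPU register file, but the underlying idea is easy to see in software. The following minimal Python sketch (the function names and the 16-bit width threshold are illustrative assumptions, not OWAR's actual hardware policy) packs pairs of narrow operands into single 32-bit words and unpacks them again, halving the number of words needed:

import numpy as np

def pack_narrow_operands(values: np.ndarray) -> np.ndarray:
    """Pack pairs of operands that fit in 16 bits into single 32-bit words.

    Illustrative only: real register packing happens in the GPU register
    file, not in software. Assumes len(values) is even and every value
    fits in an unsigned 16-bit range.
    """
    assert values.ndim == 1 and len(values) % 2 == 0
    narrow = values.astype(np.uint32)
    assert np.all(narrow < (1 << 16)), "operands must be narrow (16-bit)"
    # Low half-word holds the even-indexed operand, high half-word the odd one.
    return (narrow[1::2] << 16) | narrow[0::2]

def unpack_narrow_operands(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_narrow_operands: recover the original operand order."""
    low = packed & 0xFFFF
    high = packed >> 16
    out = np.empty(2 * len(packed), dtype=np.uint32)
    out[0::2] = low
    out[1::2] = high
    return out

if __name__ == "__main__":
    ops = np.array([3, 17, 255, 65535, 0, 42], dtype=np.uint32)
    packed = pack_narrow_operands(ops)  # 3 words instead of 6
    assert np.array_equal(unpack_narrow_operands(packed), ops)
    print(f"{len(ops)} operands stored in {len(packed)} 32-bit words")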
3-2: Data Intensive Computing Session (12:30-13:45 EDT) Exploiting GPU Direct Access to Non-Volatile Memory to Accelerate Big Data Processing Mahsa Bayati (Northeastern University); Miriam Leeser (Northeastern University)*; Ningfang Mi (Northeastern University) The amount of data being collected for analysis is growing at an exponential rate.  Along with this growth comes an increasing necessity for computation and storage.  Researchers are addressing these needs by building heterogeneous clusters with CPUs and computational accelerators such as GPUs equipped with high I/O bandwidth storage devices.  One of the main bottlenecks of such heterogeneous systems is the data transfer bandwidth to GPUs when running I/O intensive applications.  The traditional approach gets data from storage to the host memory and then transfers it to the GPU, which can limit data throughput and processing and thus degrade the end-to-end performance. In this paper, we propose a new framework to address the above issue by exploiting Peer-to-Peer Direct Memory Access to allow GPU direct access to the storage device and thus enhance the performance of parallel data processing applications in a heterogeneous big-data platform. Our heterogeneous cluster is supplied with CPUs and GPUs as computing resources and Non-Volatile Memory express (NVMe) drives as storage resources. We deploy an Apache Spark platform to execute representative data processing workloads over this heterogeneous cluster and then adopt Peer-to-Peer Direct Memory Access to connect GPUs to non-volatile storage directly to optimize GPU data access.  Experimental results reveal that this heterogeneous Spark platform successfully bypasses the host memory and enables GPUs to communicate directly with the NVMe drive, thus achieving higher data transfer throughput and improving both data communication time and end-to-end performance by 20%. Profiling and Optimization of CT Reconstruction on Nvidia Quadro GV100 Shekhar Dwivedi (Nvidia)*; Andreas Heumann (Nvidia) Computed Tomography (CT) imaging is a widely used technique for medical and industrial applications. Iterative reconstruction algorithms are desired for improved reconstructed image quality and lower dose, but their computational requirements limit their practical usage. The Reconstruction Toolkit (RTK) is a package of open-source, GPU-accelerated algorithms for CBCT (cone-beam computed tomography). GPU-based iterative algorithms give immense acceleration, but they may not be optimized to use GPU resources efficiently. Nvidia has released several profilers (Nsight Systems, Nsight Compute) to analyze the GPU implementation of an algorithm from a compute-utilization and memory-efficiency perspective. This paper profiles and analyzes the GPU implementation of the iterative FDK algorithm in RTK and optimizes it for computation and memory usage on a Quadro GV100 GPU with 32 GB of memory and over 5000 CUDA cores. The RTK-based, GPU-accelerated iterative FDK, when applied to a 4-byte-per-pixel input projection dataset of size 1.1 GB (512x512x1024) for 20 iterations to reconstruct a 440 MB volume (512x512x441) at 4 bytes per pixel, resulted in a total runtime of ~11.2 seconds per iteration. The optimized RTK-based iterative FDK presented in this paper took ~1.3 seconds per iteration. 
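For readers unfamiliar with why per-iteration runtime is the figure of merit above, iterative reconstruction repeatedly forward-projects the current volume estimate, compares it with the measured projections, and back-projects the residual to update the volume. The NumPy sketch below shows that loop structure with a dense toy operator; it is only a structural illustration under assumed names and sizes, not RTK's GPU FDK implementation (real CBCT code never forms the projector as an explicit matrix):

import numpy as np

def iterative_reconstruction(A, b, iterations=20, step=1e-2):
    """Generic Landweber-style iterative reconstruction sketch.

    A : (num_measurements, num_voxels) system matrix, a dense stand-in
        for the forward projector used here only for illustration.
    b : measured projection data, flattened.
    Returns the reconstructed volume estimate, flattened.
    """
    x = np.zeros(A.shape[1])
    for _ in range(iterations):
        residual = b - A @ x             # forward-project and compare to data
        x = x + step * (A.T @ residual)  # back-project the residual and update
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((200, 50))   # toy "projector"
    truth = rng.standard_normal(50)
    b = A @ truth
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # safe step size for convergence
    x = iterative_reconstruction(A, b, iterations=200, step=step)
    print("relative error:", np.linalg.norm(x - truth) / np.linalg.norm(truth))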
A Communication-Efficient Multi-Chip Design for Range-Limited Molecular Dynamics Chunshu Wu (Boston University)*; Tong Geng (Boston University); Chen Yang (Boston University); Vipin Sachdeva (Silicon Therapeutics); Woody Sherman (Silicon Therapeutics); Martin Herbordt (Boston University) Molecular Dynamics simulation (MD) has long been considered a promising FPGA application, especially with clusters of tightly coupled FPGAs where the large-scale, general-purpose, low-latency interconnects provide a communication capability not available with any other COTS computing technology. Parallelization of one part of the MD computation, the 3D FFT, has been studied previously; for likely FPGA cluster sizes, however, the range-limited computation (RL) is more challenging. The motivation here is that the direct replication of the single-chip design suffers from inefficient inter-board bandwidth usage. In particular, although communication in RL is local, likely bandwidth limitations will constrain performance unless great care is taken in design and analysis.  In the multi-chip scenario, inter-board bandwidth is the critical constraint and the main target of this work. We analyze it with respect to three application restructurings: workload distribution, data forwarding pattern, and data locality. We describe how bandwidth can be balanced by configuring workload distribution and data forwarding paths with respect to the number of on-board transceiver ports. We also show that, by manipulating data locality, the multi-chip design is efficiently migrated from the single-chip design, and the total bandwidth required can be configured to satisfy the bandwidth limit. Bit-Error Aware Quantization for DCT-based Lossy Compression Jialing Zhang (University of Massachusetts Lowell)*; Jiaxi Chen (University of Massachusetts Lowell); Aekyeung Moon (ETRI); Xiaoyan Zhuo (University of Massachusetts Lowell); Seung Woo Son (University of Massachusetts Lowell) Scientific simulations run by high-performance computing (HPC) systems produce a large amount of data, which causes an extreme I/O bottleneck and a huge storage burden. Applying compression techniques can mitigate such overheads by reducing the data size. Unlike traditional lossless compressors, error-controlled lossy compressors, such as SZ, ZFP, and DCTZ, designed for scientists who demand not only high compression ratios but also a guarantee of a certain degree of precision, are coming into prominence. While the rate-distortion efficiency of recent lossy compressors, especially the DCT-based one, is promising due to its high-compression encoding, the overall coding architecture is still conservative, necessitating a quantization that strikes a balance between different encoding possibilities and varying rate-distortions. In this paper, we aim to improve the performance of the DCT-based compressor DCTZ by optimizing the quantization model and encoding mechanism. Specifically, we propose a bit-efficient quantizer based on the DCTZ framework, develop a unique ordering mechanism based on the quantization table, and extend the encoding index. We evaluate the performance of our optimized DCTZ in terms of rate-distortion using real-world HPC datasets. Our experimental evaluations demonstrate that, on average, our proposed approach can improve the compression ratio of the original DCTZ by 1.38x. 
Moreover, combined with the extended encoding mechanism, the optimized DCTZ shows performance competitive with the state-of-the-art lossy compressors SZ and ZFP. 3-3: Case Studies & Benchmarking Session (14:15-15:30 EDT) Accelerating MRI Reconstruction on TPUs Tianjian Lu (Google)*; Thibault Marin (Harvard Medical School); Yue Zhuo (Harvard Medical School); Yi-fan Chen (Google); Chao Ma (Massachusetts General Hospital) Advanced magnetic resonance (MR) image reconstructions, such as compressed sensing and subspace-based imaging, are large-scale, iterative optimization problems. Given the large number of reconstructions required in practical clinical usage, the computation time of these advanced reconstruction methods is often unacceptable. In this work, we propose using Google's Tensor Processing Units (TPUs) to accelerate MR image reconstruction.  The TPU is an application-specific integrated circuit (ASIC) for machine learning applications, which has recently been used to solve large-scale scientific computing problems. As a proof of concept, we implement the alternating direction method of multipliers (ADMM) in TensorFlow to reconstruct images on TPUs. The reconstruction is based on multi-channel, sparsely sampled, and radial-trajectory k-space data with sparsity constraints. The forward and inverse non-uniform Fourier transform operations are formulated in terms of matrix multiplications, as in the discrete Fourier transform. The sparsifying transform and its adjoint operations are formulated as convolutions. The data decomposition is applied to the measured k-space data such that the aforementioned tensor operations are localized within individual TPU cores. The data decomposition and the inter-core communication strategy are designed in accordance with the TPU interconnect network topology in order to minimize the communication time. The accuracy and the high parallel efficiency of the proposed TPU-based image reconstruction method are demonstrated through numerical examples. Processing of Crowdsourced Observations of Aircraft in a High Performance Computing Environment Andrew Weinert (MIT Lincoln Laboratory)*; Ngaire Underhill (MIT Lincoln Laboratory); Bilal Gill (MIT Lincoln Laboratory); Ashley Wicks (MIT Lincoln Laboratory) As unmanned aircraft systems (UASs) continue to integrate into the U.S. National Airspace System (NAS), there is a need to quantify the risk of airborne collisions between unmanned and manned aircraft to support regulation and standards development. Both regulators and standards developing organizations have made extensive use of Monte Carlo collision risk analysis simulations using probabilistic models of aircraft flight. Northeast Cyberteam – Building an Environment for Sharing Best Practices and Solutions for Research Computing Scott Valcourt (University of New Hampshire)*; John Goodhue (MGHPCC); Julie Ma (MGHPCC); Adrian Del Maestro (University of Vermont); Sia Najafi (Worcester Polytechnic Institute); Bruce Segee (University of Maine); Ralph Zottola (University of Alabama) The Northeast Cyberteam Program is a collaborative effort across Maine, New Hampshire, Vermont, and Massachusetts that seeks to assist researchers at small and medium-sized institutions in the region with making use of cyberinfrastructure, while simultaneously building the next generation of research computing facilitators. 
Recognizing that research computing facilitators are frequently in short supply, the program also places intentional emphasis on capturing and disseminating best practices in an effort to enable opportunities to leverage and build on existing solutions whenever practical. The program combines direct assistance to computationally intensive research projects; experiential learning opportunities that pair experienced mentors with students interested in research computing facilitation; sharing of resources and knowledge across large and small institutions; and tools that enable efficient oversight and possible replication of these ideas in other regions. Each project involves a researcher seeking to better utilize cyberinfrastructure in research, a student facilitator, and a mentor with relevant domain expertise. These individuals may be at the same institution or at separate institutions.  The student works with the researcher and the mentor to become a bridge between the infrastructure and the research domain.  Through this model, students receive training and opportunities that otherwise would not be available, research projects get taken to a higher level, and the effectiveness of the mentor is multiplied.  Providing tools to enable self-service learning is a key concept in our strategy to develop facilitators through experiential learning, recognizing that one of the most fundamental skills of successful facilitators is their ability to quickly learn enough about new domains and applications to be able to draw parallels with their existing knowledge and help to solve the problem at hand. The Cyberteam Portal is used to access the self-service learning resources developed to provide just-in-time information delivery to participants as they embark on projects in unfamiliar domains, and also serves as a receptacle for best practices, tools, and techniques developed during a project. Tools include Ask.CI, an interactive site for questions and answers; a learning resources repository used to collect online training modules vetted by cyberteam projects that provide starting points for subsequent projects or independent activities; and a Github repository. The Northeast Cyberteam was created with funding from the National Science Foundation, but has developed strategies for sustainable operations. Benchmarking network fabrics for data distributed training of deep neural networks Siddharth Samsi (MIT Lincoln Laboratory)*; Andrew Prout (MIT); Michael Jones (MIT Lincoln Laboratory); Andrew Kirby (Massachusetts Institute of Technology Lincoln Laboratory); Vijay Gadepally (MIT Lincoln Laboratory - USA); Jeremy Kepner (MIT Lincoln Laboratory); Albert Reuther (MIT Lincoln Laboratory) Artificial Intelligence/Machine Learning applications require the training of complex models on large amounts of labelled data. The large computational requirements for training deep models have necessitated the development of new methods for faster training. One such approach is the data parallel approach, where the training data is distributed across multiple compute nodes. This approach is simple to implement and supported by most of the commonly used machine learning frameworks. The data parallel approach leverages MPI for communicating gradients across all nodes. In this paper, we examine the effects of using different physical hardware interconnects and network-related software primitives for enabling data distributed deep learning. We compare the effect of using GPUDirect and NCCL on Ethernet and OmniPath fabrics. 
Our results show that using Ethernet-based networking in shared HPC systems does not have a significant effect on the training times for commonly used deep neural network architectures or traditional HPC applications such as Computational Fluid Dynamics. A congestion control mechanism for SDN-based fat-tree networks Haitham Ghalwash (University of Connecticut)*; Chun-Hsi Huang (Southern Illinois University) Nowadays, data centers are experiencing increasing application demands that mandate a software-oriented network architecture. A Software-Defined Network (SDN) is the new technology for overcoming the traditional network's limitations. QoS is one current limitation that needs to be well-structured for a successful software-oriented network architecture. Several key factors affect QoS, including traffic shaping and congestion control. This paper proposes a congestion control mechanism in SDN-based networks to enhance the overall QoS. The proposed mechanism monitors and detects congested parts and reacts automatically to reduce traffic load. The traffic load is redistributed to ensure better performance by re-routing a subset of flows. The re-routing decision is based on a passively measured QoS metric, port utilization. Experiments are conducted to prove the effectiveness of the proposed mechanism in an SDN-based fat-tree network. Results showed that TCP traffic recorded a noticeable improvement, namely, 22.4% in average delay, 21.3% in throughput, 18.6% in maximum delay, and 15.3% in jitter. Moreover, the maximum monitored port utilization in the aggregation and core switches was also reduced by 22% on average. 3-4: Case Studies & Benchmarking Session (15:45-17:00 EDT) Performance Strategies for Parallel Bitonic Sort on a Migratory Thread Architecture Kaushik Velusamy (University of Maryland, Baltimore County); Thomas Rolinger (Laboratory for Physical Sciences)*; Janice McMahon (Emu Technologies) Large-scale data analytics often represent vast amounts of sparse data as a graph. As a result, the underlying kernels in data analytics can be reduced to operations over graphs, such as searches and traversals. Graph algorithms are notoriously difficult to implement for high performance due to the irregular nature of their memory access patterns, resulting in poor utilization of a traditional cache memory hierarchy. As a result, new architectures have been proposed that specifically target irregular applications. One example is the cache-less Emu migratory thread architecture developed by Lucata Technology. While it is important to evaluate and understand irregular applications on a system such as Emu, it is equally important to explore applications which are not irregular themselves, but are often used as key pre-processing steps in irregular applications. Sorting a list of values is one such pre-processing step, as well as one of the fundamental operations in data analytics. In this paper, we extend our prior preliminary evaluation of parallel bitonic sort on the Emu architecture. We explore different performance strategies for bitonic sort by leveraging the unique features of Emu. In doing so, we implement three significant capabilities into bitonic sort: a smart data layout that periodically remaps data to avoid remote accesses, efficient thread spawning strategies, and adaptive loop parallelization to achieve proper load balancing over time. We present a performance evaluation that demonstrates speed-ups of as much as 14.26x by leveraging these capabilities. 
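For reference, the compare-exchange network that the strategies above map onto Emu's migratory threads can be written down in a few lines. The serial Python sketch below (power-of-two input length assumed; the data-layout, thread-spawning, and load-balancing strategies evaluated in the paper are not modeled) shows the plain bitonic sorting network:

import random

def bitonic_sort(values):
    """Serial bitonic sort; len(values) must be a power of two."""
    a = list(values)
    n = len(a)
    assert n and (n & (n - 1)) == 0, "length must be a power of two"
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:           # compare-exchange distance within each merge step
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

if __name__ == "__main__":
    data = [random.randint(0, 99) for _ in range(16)]
    assert bitonic_sort(data) == sorted(data)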
Hash Table Scalability on Intel PIUMA Balasubramanian Seshasayee (Intel Corporation)*; Joshua Fryman (Intel); Ibrahim Hur (Intel Corporation) The Intel PIUMA (Programmable and Integrated Unified Memory Architecture) is a scalable, massively multithreaded architecture designed to operate on unstructured data, with a global address space, fine-grain memory access, and various novel features for latency hiding during data movement. Hash tables are a commonly used data structure with unstructured data, hence it is imperative that the performance and scaling of hash table usage are optimized for this architecture. We study three different hash table implementations on a PIUMA simulator to show that a dual-atomics-based implementation, a unique feature in PIUMA, performs competitively both at larger scales and under hash collisions. Our implementations are able to achieve strong scaling up to 16,384 hardware threads. Enhanced Parallel Simulation for ACAS X Development Adam Gjersvik (MIT Lincoln Laboratory)* ACAS X is the next-generation airborne collision avoidance system intended to meet the demands of the rapidly evolving U.S. National Airspace System (NAS). The collision avoidance safety and operational suitability of the system are optimized and continuously evaluated by simulating billions of characteristic aircraft encounters in a fast-time Monte Carlo environment. There is therefore an inherent computational cost associated with each ACAS X design iteration, and parallelization of the simulations is necessary to keep up with rapid design cycles. This work describes an effort to profile and enhance the parallel computing infrastructure deployed on the computing resources offered by the Lincoln Laboratory Supercomputing Center. The approach to large-scale parallelization of our fast-time airspace encounter simulation tool is presented along with corresponding parallel profile data collected on different kinds of compute nodes. A simple stochastic model for distributed simulation is also presented to inform optimal work batching for improved simulation efficiency. The paper concludes with a discussion on how this high-performance parallel simulation method enables the rapid safety-critical design of ACAS X in a fast-paced iterative design process. Architectural Analysis of Deep Learning on Edge Accelerators Luke Kljucaric (NSF SHREC)*; Alex Johnson (NSF SHREC); Alan George (NSF SHREC) As computer architectures continue to integrate application-specific hardware, it is critical to understand the relative performance of devices for maximum app acceleration. The goal of benchmarking suites, such as MLPerf for analyzing machine-learning (ML) hardware performance, is to standardize a fair comparison of different hardware architectures. However, many apps are not well represented by these standards and require different workloads, such as other ML models and datasets, to achieve similar goals. Additionally, many devices feature hardware optimized for data types other than 32-bit floating-point numbers, the standard representation defined by MLPerf. Edge-computing devices often feature app-specific hardware to offload common operations found in ML apps from the constrained CPU. This research analyzes multiple low-power compute architectures that feature ML-specific hardware on a case study of handwritten Chinese character recognition. Specifically, AlexNet and a custom version of GoogLeNet are benchmarked in terms of their streaming latency for optical character recognition. 
Considering these models are custom and not the most widely used, many architectures are not specifically optimized for them. The performance of these models can stress devices in different, yet insightful, ways from which generalizations about the performance of other models can be drawn. The NVIDIA Jetson AGX Xavier (AGX), Intel Neural Compute Stick 2 (NCS2), and Google Edge TPU architectures are analyzed with respect to their performance. The designs of the AGX and TPU devices showcased the lowest streaming latency for AlexNet and GoogLeNet, respectively. Additionally, the tightly-integrated NCS2 design showed the best generalizability in performance and efficiency across neural networks. 3-S1: Graph Challenge Special (17:30-19:30 EDT) Scaling Graph Clustering with Distributed Sketches Benjamin Priest (Lawrence Livermore National Laboratory)*; Alec Dunton (University of Colorado Boulder); Geoffrey Sanders (LLNL) The unsupervised learning of community structure, in particular the partitioning of vertices into clusters or communities, is a canonical and well-studied problem in exploratory graph analysis. However, as with most graph analyses, the introduction of immense scale presents challenges to traditional methods. Spectral clustering in distributed memory, for example, requires hundreds of expensive bulk-synchronous communication rounds to compute an embedding of vertices to a few eigenvectors of a graph-associated matrix. Furthermore, the whole computation may need to be repeated if the underlying graph changes by some low percentage of edge updates. We present a method inspired by spectral clustering where we instead use matrix sketches derived from random dimension-reducing projections. We show that our method produces embeddings that yield performant clustering results given a fully-dynamic stochastic block model stream using both the fast Johnson-Lindenstrauss and CountSketch transforms. We also discuss the effects of stochastic block model parameters upon the required dimensionality of the subsequent embeddings, and show how random projections could significantly improve the performance of graph clustering in distributed memory. At-Scale Sparse Deep Neural Network Inference With Efficient GPU Implementation Mert Hidayetoglu (University of Illinois at Urbana-Champaign)*; Carl Pearson (University of Illinois at Urbana-Champaign); Vikram Sharma Mailthody (University of Illinois at Urbana-Champaign); Eiman Ebrahimi (NVIDIA); Jinjun Xiong (IBM Thomas J. Watson Research Center); Rakesh Nagi (University of Illinois at Urbana-Champaign); Wen-Mei Hwu (University of Illinois at Urbana-Champaign) This paper presents GPU performance optimization and scaling results for inference models of the Sparse Deep Neural Network Challenge 2020. Demands for network quality have increased rapidly, pushing the size and thus the memory requirements of many neural networks beyond the capacity of available accelerators. Sparse deep neural networks (SpDNN) have shown promise for reining in the memory footprint of large neural networks. However, there is room for improvement in implementing SpDNN operations on GPUs. This work presents optimized sparse matrix multiplication kernels fused with the ReLU function. The optimized kernels reuse input feature maps from the shared memory and sparse weights from registers. For multi-GPU parallelism, our SpDNN implementation duplicates weights and statically partitions the feature maps across GPUs. 
Results for the challenge benchmarks show that the proposed kernel design and multi-GPU parallelization achieve up to 180 TeraEdges per second inference throughput. These results are up to 4.3X faster for a single GPU and an order of magnitude faster at full scale than those of the champion of the 2019 Sparse Deep Neural Network Graph Challenge for the same generation of NVIDIA V100 GPUs.  Using the same implementation, we also show that single-GPU throughput on the NVIDIA A100 is 2.37X faster than on the V100. A Novel Inference Algorithm for Large Sparse Neural Network using Task Graph Parallelism Dian-Lun Lin (University of Utah); Tsung-Wei Huang (University of Utah)* The ever-increasing size of modern deep neural network (DNN) architectures has put increasing strain on the hardware needed to implement them. Sparsified DNNs can greatly reduce memory costs and increase throughput over standard DNNs, if the loss of accuracy can be adequately controlled. However, sparse DNNs present unique computational challenges. Efficient model or data parallelism algorithms are extremely hard to design and implement. The recent MIT/IEEE/Amazon HPEC Graph Challenge has drawn attention to high-performance inference methods for large sparse DNNs. In this paper, we introduce SNIG, an efficient inference engine for large sparse DNNs. SNIG develops highly optimized inference kernels and leverages the power of CUDA Graphs to enable efficient decomposition of model and data parallelisms. Our decomposition strategy is flexible and scalable to different partitions of data volumes, model sizes, and GPU numbers. We have evaluated SNIG on the official benchmarks of the HPEC Sparse DNN Challenge and demonstrated its promising performance, scalable from a single GPU to multiple GPUs. Compared to the champion of the 2019 HPEC Sparse DNN Challenge, SNIG can finish all inference workloads using only a single GPU. At the largest DNN, which has more than 4 billion parameters across 1920 layers each of 65536 neurons, SNIG is up to 2.3× faster than a state-of-the-art baseline on a machine with 4 GPUs. TriC: Distributed-memory Triangle Counting by Exploiting the Graph Structure Sayan Ghosh (Washington State University)*; Mahantesh Halappanavar (Pacific Northwest National Laboratory) Graph analytics has emerged as an important tool in the analysis of large-scale data from diverse application domains such as social networks, cyber security, and bioinformatics. Counting the number of triangles in a graph is a fundamental kernel with several applications, such as detecting the community structure of a graph or identifying important vertices in a graph. The ubiquity of massive datasets is driving the need to scale graph analytics on parallel systems. However, numerous challenges exist in efficiently parallelizing graph algorithms, especially on distributed-memory systems. Irregular memory accesses and communication patterns, low computation-to-communication ratios, and the need for frequent synchronization are some of the leading challenges. In this paper, we present TriC, our distributed-memory implementation of triangle counting in graphs using the Message Passing Interface (MPI), as a submission to the 2020 Graph Challenge competition. Using a set of synthetic and real-world inputs from the challenge, we demonstrate a speedup of up to 90x relative to previous work on 32 processor-cores of a NERSC Cori node. We also provide details from distributed runs with up to 8192 processes along with strong scaling results. 
The observations presented in this work provide an understanding of the system-level bottlenecks at scale that specifically impact sparse-irregular workloads and will therefore benefit other efforts to parallelize graph algorithms. Combinatorial Tiling for Sparse Neural Networks Filip Pawłowski (ENS Lyon)*; Bora Uçar (CNRS and LIP (UMR5668)); Rob Bisseling (Utrecht University); Albert-Jan Yzelman (Huawei Zürich Research Center) Sparse deep neural networks (DNNs) emerged as the result of the search for networks with less storage and lower computational complexity. Sparse DNN inference is the task of using such trained networks to classify a batch of input data. We propose an efficient, hybrid model- and data-parallel DNN inference approach using hypergraph models and partitioners. We exploit tiling and weak synchronization to increase cache reuse, hide load imbalance, and hide synchronization costs. Finally, a blocking approach allows application of this new hybrid inference procedure to deep neural networks. We initially experiment using the hybrid tiled inference approach only, using the first five layers of networks from the IEEE HPEC 2019 Graph Challenge, and attain up to 2x speedup versus a data-parallel baseline. Studying the Effects of Hashing of Sparse Deep Neural Networks on Data and Model Parallelisms Mohammad Hasanzadeh Mofrad (University of Pittsburgh)*; Rami Melhem (University of Pittsburgh); Mohammad Hammoud (Carnegie Mellon University); Muhammad Yousuf Ahmad (Carnegie Mellon University in Qatar) Deep Neural Network (DNN) training and inference are two resource-intensive tasks that are usually scaled out using data or model parallelism, where data parallelism parallelizes over the input data and model parallelism parallelizes over the network. Also, dense matrix-matrix multiplication is the key primitive behind training/inference of dense DNNs. On the contrary, sparse DNNs are less resource-intensive compared to their dense counterparts while offering comparable accuracy. Similarly, they can be parallelized using data or model parallelism with Sparse Matrix-Matrix Multiplication (SpMM) as the key primitive. To scale out, both data and model parallelisms initially use data parallelism to partition the input data among multiple machines. This initial partitioning of the input makes data and model parallelism performance prone to load imbalance, as partitions may be imbalanced. As part of this paper, we take a deeper look into data and model parallelisms and closely study the mechanics of the SpMM used for each. Moreover, to remedy their load imbalance problem, we incorporate hashing as a simple yet powerful method to address load imbalance. Finally, we use the IEEE HPEC sparse DNN challenge dataset to evaluate the performance of data and model parallelisms at scale. We scaled up to 32 machines (896 cores) and inferred a large sparse DNN with 4B parameters in 51 seconds. Results suggest that with hashing, data and model parallelisms achieve super-linear speedup due to better load balance and cache utilization. Incremental Streaming Graph Partitioning L Durbeck (Virginia Tech)*; Peter Athanas (Virginia Tech) Graph partitioning is an NP-hard problem whose efficient approximation has long been a subject of interest and progress. 
The I/O bounds of contemporary computing environments favor incremental or streaming graph partitioning methods. Methods have sought a balance between latency, simplicity, accuracy, and memory size. In this paper, we apply an incremental approach to streaming partitioning that tracks changes with a lightweight proxy, triggering repartitioning as error increases. We evaluate its performance on the DARPA/MIT Graph Challenge streaming stochastic block partition dataset, and find that it can dramatically reduce the number of partitioner invocations, which can provide an order-of-magnitude speedup. KTrussExplorer: Exploring the Design Space of K-truss Decomposition Optimizations on GPUs Safaa Diab (American University of Beirut); Mhd Ghaith Olabi (American University of Beirut); Izzat El Hajj (American University of Beirut)* K-truss decomposition is an important method in graph analytics for finding cohesive subgraphs in a graph. Various works have accelerated k-truss decomposition on GPUs and have proposed different optimizations while doing so. The combinations of these optimizations form a large design space. However, most GPU implementations focus on a specific combination or set of combinations in this space. This paper surveys the optimizations applied to k-truss decomposition on GPUs, and presents KTrussExplorer, a framework for exploring the design space formed by the combinations of these optimizations. Our evaluation shows that the best combination depends highly on the graph of choice, and analyzes the conditions that make each optimization attractive. Some of the best combinations we find outperform previous Graph Challenge champions on many large graphs. Analysis of floating-point round-off error in linear algebra routines for graph clustering Lucia Yang (University of Colorado Boulder)*; Alyson Fox (LLNL) We explore the various ways rounding errors can impact the power method for calculating the Fiedler vector for graph clustering. A rounding error analysis reveals that the best eigenpair that is computable with a certain floating-point precision type has a worst-case error that scales with its unit round-off. Although rounding errors can accumulate in the power method up to the worst-case bound, this behavior is not reflected in some practical examples. Furthermore, our numerical experiments show that rounding errors from the power method may satisfy the conditions necessary for bounding the mis-clustering rate, and that approximate eigenvectors with errors close to the half-precision unit round-off can yield sufficient clustering results for partitioning stochastic block model graphs.  
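To make the object of that round-off analysis concrete, the sketch below approximates the Fiedler vector (the eigenvector of the second-smallest Laplacian eigenvalue) with a double-precision power method on a shifted Laplacian, projecting out the trivial all-ones eigenvector each step. The shift, the toy graph, and the iteration count are illustrative choices, not the paper's experimental setup:

import numpy as np

def fiedler_vector(adjacency, iterations=2000, seed=0):
    """Approximate the Fiedler vector via the power method.

    Shift-and-project idea: the largest eigenvalue of (c*I - L) restricted
    to the subspace orthogonal to the all-ones vector corresponds to the
    smallest nonzero eigenvalue of L, whose eigenvector is the Fiedler vector.
    """
    A = np.asarray(adjacency, dtype=np.float64)
    degrees = A.sum(axis=1)
    L = np.diag(degrees) - A                 # combinatorial graph Laplacian
    c = 2.0 * degrees.max() + 1.0            # upper bound on the eigenvalues of L
    M = c * np.eye(len(A)) - L
    ones = np.ones(len(A)) / np.sqrt(len(A))
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(len(A))
    for _ in range(iterations):
        v -= (v @ ones) * ones               # project out the trivial eigenvector
        v = M @ v
        v /= np.linalg.norm(v)
    return v

if __name__ == "__main__":
    # Two 3-cliques joined by a single edge: the sign pattern of the Fiedler
    # vector recovers the two communities.
    A = np.array([[0, 1, 1, 0, 0, 0],
                  [1, 0, 1, 0, 0, 0],
                  [1, 1, 0, 1, 0, 0],
                  [0, 0, 1, 0, 1, 1],
                  [0, 0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 1, 0]], dtype=float)
    v = fiedler_vector(A)
    print("cluster labels:", (v > 0).astype(int))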
Wednesday, September 23