2020
IEEE High Performance Extreme Computing
Virtual Conference
21 - 25 September 2020
3-1: Big Data & Distributed Computing 2 Session (11:00-12:15 EDT)
High-Throughput Image Alignment for Connectomics using Frugal Snap Judgments
Tim Kaler (MIT CSAIL)*; Brian Wheatman (Johns Hopkins University); Sarah Wooders (MIT)
The accuracy and computational efficiency of image alignment directly affects the advancement of connectomics, a field which seeks
to understand the structure of the brain through electron microscopy.
We introduce Quilter and Stacker, algorithms designed to perform 2D and 3D alignment, respectively, on petabyte-scale data sets from connectomics. Quilter and Stacker are efficient, scalable, and can run on hardware ranging from a researcher's laptop to a large computing cluster. On a single 18-core cloud machine, each algorithm achieves a throughput of more than 1 TB/hr; when combined, the algorithms produce an end-to-end alignment pipeline that processes data at a rate of 0.82 TB/hr, more than a 10x improvement over previous systems. This efficiency comes both from traditional optimizations and from the use of "Frugal Snap Judgments" to judiciously exploit performance-accuracy trade-offs.
A high-throughput image-alignment pipeline was implemented using the Quilter and Stacker algorithms and its performance was
evaluated using three datasets whose size ranged from 550 GB to 38 TB. The full alignment pipeline achieved a throughput of 0.6--0.8
TB/hr and 1.4--1.5 TB/hr on an 18-core and 112-core shared-memory multicore, respectively. On a supercomputing cluster with 200
nodes and 1600 total cores, the pipeline achieved a throughput of 21.4 TB/hr.
DS-SHMEM: Staging-enabled PGAS Programming for Data-intensive Workflows
Daihou Wang (Rutgers University)*
PGAS-based programming models, represented by the OpenSHMEM standard, provide high performance and programmability through uniform access to the memory of both local and remote nodes. However, with the growth of data-intensive analysis, data sizes are outgrowing the memory of the processing nodes. In this paper, we propose to introduce an additional data staging area into the programming model, yielding the SHSPACES programming model. In SHSPACES, memory can be allocated and accessed asynchronously by PEs on data staging nodes, in addition to local and remote memory.
Self-Scaling Clusters and Reproducible Containers to Enable Scientific Computing
Peter Vaillancourt (Cornell University)*; John Coulter (Indiana University); Richard Knepper (Cornell University); Brandon Barker
(Cornell University)
Container technologies such as Docker have become a crucial component of many software industry practices especially those
pertaining to reproducibility and portability. The containerization philosophy has influenced the scientific computing community, which
has begun to adopt - and even develop - container technologies (such as Singularity). Leveraging containers for scientific software
often poses challenges distinct from those encountered in industry, and requires different methodologies. This is especially true for
HPC. With an increasing number of options for HPC in the cloud (including NSF-funded cloud projects), there is strong motivation to
seek solutions that provide flexibility to develop and deploy scientific software on a variety of computational infrastructures in a portable
and reproducible way. The Cyberinfrastructure Resource Integration team in the XSEDE project has developed a simple tool which
provides HPC infrastructure in the cloud that scales with user demand. We now present a solution which uses the Nix package
manager in an MPI-capable Docker container that is converted to Singularity. It provides consistent installations, dependencies, and
environments in each image that are reproducible and portable across scientific computing infrastructures. We demonstrate the utility
of these containers with cluster benchmark runs in a self-scaling virtual cluster using the Slurm scheduler deployed in the Jetstream
and Aristotle Red Cloud OpenStack clouds. We conclude that this technique is useful as a template for scientific software application containers to be used in the XSEDE compute environment, other Singularity HPC environments, and cloud computing environments.
mpiHDFS: High-Performance Computing over a Commodity Filesystem via Aggregation, Reordering, and Coordination of
Computation and I/O
Da Zhang (Virginia Tech); Jing Zhang (Virginia Tech); Kaixi Hou (Virginia Tech); Sarunya Pumma (Virginia Tech); Wu-chun Feng
(Virginia Tech)*
With the emerging trend of integrating high-performance computing (HPC) with big data (BIGDATA) processing, running MPI over HDFS offers a promising approach for delivering better scalability and fault tolerance to traditional HPC applications. However, it comes with challenges that discourage such an approach: (1) slow two-sided communication in MPI to support intermediate data processing, (2) a focus on enabling N-1 writes that is subject to the default (and naive) HDFS block-placement policy, and (3) a pipelined writing mode in HDFS that cannot fully utilize the underlying HPC hardware. Hence, without a holistic and systematic approach, the integration of HPC and BIGDATA falls short of delivering competitive performance. As such, we propose mpiHDFS, middleware that sits between MPI applications and HDFS, to aggregate and reorder intermediate data and to coordinate computation and I/O. Specifically, mpiHDFS provides highly-optimized merge and sort for intermediate data to enrich MPI functionality, enables the overlap of computation and I/O via one-sided communication, and provides a coordinator to improve performance by leveraging data locality and communication patterns. For disk I/O, mpiHDFS realizes a parallel write mechanism with a delayed write mode for fast data flush in HDFS. To demonstrate the efficacy of mpiHDFS, we run two MPI applications (pBWA and DIAMOND) over mpiHDFS. The experimental results show that on a 17-node cluster, mpiHDFS delivers up to 1.92× and 1.78× speedup over MPI I/O and HDFS pipelined write implementations, respectively.
Optimizing Data Access with Next-Generation Storage Engine, Persistent Memory and Smart NICs
Kenneth Cain (Intel)*; Venkata Krishnan (Intel); Johann Lombardi (Intel)
Intel teams developing the Distributed Asynchronous Object Storage (DAOS) software and conducting HPC architecture research
propose to present an introduction to DAOS and engage in a broader discussion with the HPEC community of end users, researchers,
system developers/integrators, and technology providers.
DAOS:
DAOS provides a storage service running over a set of fabric-interconnected nodes, collectively providing a single tier of storage
composed from two types of media installed in each node: non-volatile / persistent memory (PM) and NVMe solid state drives (SSDs).
Applications also interact with the storage service over a high-speed fabric. Multiple application interfaces have been developed for use
with DAOS storage including a DAOS key value API with python bindings, POSIX files/directories, MPI IO, HDF5, and Apache Spark.
DAOS object storage is versioned, with the ability to create immutable snapshots of data at a particular version, and to perform
aggregation, reclaiming storage associated with older versions of objects. Data splitting across servers provides opportunities for high
storage bandwidth. Object replication to multiple servers and erasure coding features provide data protection against some number of
faults that may occur during operation, with an online self-healing / rebuild feature triggered upon fault detection.
Media selection strategies (PM and SSD) in DAOS that allow it to achieve very high IOPS and storage throughput will be discussed.
Extreme levels of storage performance have been measured in recent executions of the IO-500 benchmarks integrated with DAOS.
How DAOS interacts with the high-performance fabric will be discussed, including its end to end OS bypass architecture, and its use of
the OpenFabrics Interface (OFI) libfabric API as a foundation for inter-node storage RPC and RDMA-based bulk data transfer.
Versioned objects may be used as a vehicle to save application state (e.g., checkpoint/restart), as well as offering the possibility of a
pipelined/asynchronous parallel producer/consumer communication workflow through storage.
Architecture Research: COPA and DAOS
Borne out of considerations for reducing footprint and power consumption for storage system infrastructure, investigations of the potential application of embedded architectures have been conducted. In particular, a proof-of-concept Smart NIC / accelerator FPGA framework, COPA (COnfigurable network Protocol Accelerator), has been developed.
COPA supports OFI, which has been extended to expose various storage function acceleration modes (implemented in hardware) to middleware and applications. The acceleration functions may be invoked in the context of communication operations, for example inline with data movement, or as a separate lookaside operation. COPA has been implemented on different variants of Stratix 10 FPGAs. Multiple COPA FPGAs can autonomously connect to a standard 100GigE switching network. The framework has been validated by microbenchmarks as well as proxy benchmarks that mimic the behavior of client/server flows, and achieves bandwidths close to 100 Gbps when acceleration is enabled.
Potential mappings of the DAOS software with the COPA communication and acceleration capabilities will also be explored.
Poster Session: 3-P (12:15-15:45 EDT)
Human Balance Models Optimized using a Large-scale, Parallel Architecture with Applications to Mild Traumatic Brain Injury
Gregory Ciccarelli (Massachusetts Institute of Technology Lincoln Laboratory); Michael Nolan (U. Washington); Hrishikesh Rao (MIT
Lincoln Laboratory); Tanya Talkar (MIT Lincoln Laboratory); Anne O'Brien (Spaulding Rehabilitation Hospital); Gloria Vergara-Diaz
(Spaulding Rehabilitation Hospital); Ross Zafonte (Spaulding Rehabilitation Hospital); Thomas Quatieri (Massachusetts Institute of
Technology Lincoln Laboratory); Ryan McKindles (MIT Lincoln Laboratory); Paolo Bonato (Spaulding Rehabilitation Hospital); Adam
Lammert*
Static and dynamic balance are frequently disrupted by brain injuries. The impairment can be complex and, for mild traumatic brain injury (mTBI), can be undetectable by standard clinical tests. Therefore, neurologically relevant modeling approaches are needed for detection and inference of mechanisms of injury. The current work presents models of static and dynamic balance that have a high degree of correspondence. Emphasizing structural similarity between the domains facilitates development of both. Furthermore, particular attention is paid to components of sensory feedback and sensory integration to ground mechanisms in neurobiology. Models are adapted to fit experimentally collected data from 10 healthy control volunteers and 11 mild traumatic brain injury volunteers. Through an analysis-by-synthesis approach whose implementation was made possible by a state-of-the-art high-performance computing system, we derived an interpretable, model-based feature set that could classify mTBI and controls in a static balance task with an ROC AUC of 0.72.
Hardware Acceleration of Nonlocal Means-Based Speckle Noise Removal Applied to SAR Imagery
Hector Li Sanchez (University of Pittsburgh)*; Alan George (NSF Center for High Performance Reconfigurable Computing)
Removal of speckle noise from synthetic aperture radar (SAR) remains a significant challenge for space computers. Speckle noise is
substantially harder to reduce than Gaussian noise due to its multiplicative nature. The probability patch-based (PPB) filter, based on
the nonlocal means filter, can reduce speckle noise while preserving fine details. However, its high computational complexity inhibits its
practical use in embedded applications. This issue is especially acute for conventional space platforms, where radiation-hardened processors have significantly lower performance and energy efficiency than their commercial-off-the-shelf counterparts. Combined with ever-increasing data-processing requirements and an emphasis on intelligent, autonomous systems, this creates a need to enhance the computing capabilities of space computers for current and future missions. Recently, hybrid System-on-Chip (SoC) devices
have been increasingly adopted for use in space applications. Namely, CPU+FPGA devices present several architectural opportunities
for efficient HW-SW partitioning and acceleration of applications. In this paper, a detailed description of a CPU+FPGA accelerator for
speckle noise removal implementing the PPB filter is presented. The proposed architecture leverages the control-flow and data-flow
strengths of the CPU and FPGA, respectively, to maximize performance. Studying the dataflow and computation properties of the algorithm allows for a highly parallelized, fully pipelined, and quantized design. When evaluated on the Xilinx Zynq-7045 SoC (Z-7045),
our proposed architecture shows a significant performance improvement (up to ~750×) over a software-only baseline while maintaining
modest FPGA resource utilization. The filtering quality is evaluated using both artificial and real SAR images. Quantitative analysis
shows that use of the hardware design only introduces a negligible loss in quality.
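For readers unfamiliar with the nonlocal-means family of filters on which PPB builds, the Python sketch below shows a simplified patch-based weighting scheme applied to log-intensity data. It is a generic illustration only: the Gaussian-style patch distance, window sizes, and filter parameter h are assumptions made for the sketch, not the probabilistic similarity measure of the actual PPB filter or the paper's hardware design.

    import numpy as np

    def nlm_despeckle(img, patch=3, search=7, h=0.1):
        """Simplified nonlocal-means filter on log-intensity SAR data (not the PPB similarity)."""
        log_img = np.log(img + 1e-6)            # multiplicative speckle -> roughly additive noise
        pr, sr = patch // 2, search // 2
        pad = sr + pr
        padded = np.pad(log_img, pad, mode='reflect')
        out = np.zeros_like(log_img)
        for i in range(log_img.shape[0]):
            for j in range(log_img.shape[1]):
                ci, cj = i + pad, j + pad
                ref = padded[ci - pr:ci + pr + 1, cj - pr:cj + pr + 1]
                num, den = 0.0, 0.0
                for di in range(-sr, sr + 1):
                    for dj in range(-sr, sr + 1):
                        cand = padded[ci + di - pr:ci + di + pr + 1,
                                      cj + dj - pr:cj + dj + pr + 1]
                        d2 = np.mean((ref - cand) ** 2)   # patch distance
                        w = np.exp(-d2 / (h * h))         # similarity weight
                        num += w * padded[ci + di, cj + dj]
                        den += w
                out[i, j] = num / den
        return np.exp(out)                      # back to the intensity domain

    # Example: filter a small synthetic speckled image
    clean = np.ones((32, 32))
    speckled = clean * np.random.gamma(4.0, 1.0 / 4.0, clean.shape)
    filtered = nlm_despeckle(speckled)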
Storage Area Networks in Embedded Processing
John Holland (Northrop Grumman)*; Jeremy Horner (Northrop Grumman); Jason Harnish (Northrop Grumman); Timothy Linden (Northrop Grumman); Steve Mattson (Northrop Grumman)
This paper explores the application of Storage Area Networks (SAN) to size, weight, and power (SWAP) constrained systems. The goal is to provide accessible, distributed, redundant storage with a high level of reliability and availability for multifunction sensors at the
edge of the mission space. This paper describes the advantage of applying SAN technology to these missions and describes the use
of high performance embedded processing to provide the processing and storage for decision-making processing and data sharing at
a SWAP appropriate for challenging form factors. Specifically, this paper addresses how recent device-level integration advancements
that bring Information Technology and Operational Technology onto the same silicon represent a paradigm shift which enables SAN in
embedded processing platforms.
Evaluating SEU Resilience of CNNs with Fault Injection
Evan Kain (COSMIAC)*; Alan George (NSF Center for High Performance Reconfigurable Computing); Tyler Lovelly (U.S. Air Force
Research Laboratory)
Convolutional neural networks (CNNs) are quickly growing as a solution for advanced image processing in many mission-critical high-
performance and embedded computing systems ranging from supercomputers and data centers to aircraft and spacecraft. However,
the systems running CNNs are increasingly susceptible to single-event upsets (SEUs), which are bit flips that result from charged-particle strikes. To better understand
how to mitigate the effects of SEUs on CNNs, the behavior of CNNs when exposed to SEUs must be better understood. Software fault-
injection tools allow us to emulate SEUs to analyze the effects of various CNN architectures and input data features on overall
resilience. Fault injection on three combinations of CNNs and datasets yielded insights into their behavior. When focusing on a
threshold of 1% error in classification accuracy, more complex CNNs tended to be less resilient to SEUs, and easier classification tasks
on well-clustered input data were more resilient to SEUs. Overall, the number of bits flipped to reach this threshold ranged from 20 to
3,790 bits. Results demonstrate that CNNs are highly resilient to SEUs, but the complexity of the CNN and difficulty of the classification
task will decrease that resilience.
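As a concrete illustration of the kind of fault injection described above, the following NumPy sketch flips a randomly chosen bit in a randomly chosen 32-bit floating-point weight, one common way to emulate an SEU in software. The tensor shape and injection procedure are illustrative assumptions, not the authors' tool.

    import numpy as np

    def inject_seu(weights, rng):
        """Flip one random bit in one random float32 weight (emulated SEU)."""
        flat = weights.ravel()                  # view into the original tensor
        idx = rng.integers(flat.size)           # which weight to corrupt
        bit = rng.integers(32)                  # which of the 32 bits to flip
        as_int = flat[idx:idx + 1].view(np.uint32)
        as_int ^= np.uint32(1) << np.uint32(bit)    # XOR flips the chosen bit in place
        return idx, bit

    rng = np.random.default_rng(0)
    weights = rng.standard_normal((64, 64)).astype(np.float32)   # stand-in CNN layer
    idx, bit = inject_seu(weights, rng)
    print(f"flipped bit {bit} of weight {idx}: now {weights.ravel()[idx]}")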
Packing Narrow-Width Operands to Improve Energy Efficiency of General-Purpose GPU Computing
Xin Wang (Virginia Commonwealth University); Wei Zhang (University of Louisville)*
In this paper, we study the use of OWAR, an Operand-Width-Aware Register packing mechanism for GPU energy saving. In order to
efficiently use the GPU register file (RF), OWAR employs a power gating method to shut down unused register sub-arrays for reducing
dynamic and leakage energy consumption of RF. As the number of register accesses is reduced due to the packing of the narrow width
operands, the dynamic energy dissipation is further decreased. Finally, with the help of RF usage optimized by register packing, OWAR
allows GPUs to support more TLP (Thread Level Parallelism) through assigning additional thread blocks on SMs (Streaming
Multiprocessors) for GPGPU (General-Purpose GPU) applications that suffer from a deficiency of register resources. The extra TLP opens opportunities for hiding more memory latency and thus reduces the overall execution time, which can lower the overall energy consumption. We evaluate OWAR using a set of representative GPU benchmarks. The experimental results show that, compared to the baseline without optimization, OWAR can reduce the GPGPU’s total energy by up to 29.6%, and by 9.5% on average. In addition, OWAR achieves performance improvements of up to 1.97X, and 1.18X on average.
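The narrow-width packing idea itself can be illustrated independently of the GPU register file: two 16-bit operands share one 32-bit register word. The plain-Python sketch below is purely illustrative of that bit-level packing; it does not model OWAR's sub-array power gating, scheduling, or TLP effects.

    def pack16(a, b):
        """Pack two narrow (16-bit) operands into one 32-bit register word."""
        assert 0 <= a < 1 << 16 and 0 <= b < 1 << 16
        return (a << 16) | b

    def unpack16(word):
        """Recover the two 16-bit operands from a packed 32-bit word."""
        return word >> 16, word & 0xFFFF

    word = pack16(0x1234, 0xBEEF)
    hi, lo = unpack16(word)
    assert (hi, lo) == (0x1234, 0xBEEF)     # one physical register now holds both operands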
3-2: Data Intensive Computing Session (12:30-13:45 EDT)
Exploiting GPU Direct Access to Non-Volatile Memory to Accelerate Big Data Processing
Mahsa Bayati (Northeastern University); Miriam Leeser (Northeastern University)*; Ningfang Mi (Northeastern University)
The amount of data being collected for analysis is growing at an exponential rate. Along with this growth comes increasing necessity
for computation and storage. Researchers are addressing these needs by building heterogeneous clusters with CPUs and
computational accelerators such as GPUs equipped with high I/O bandwidth storage devices. One of the main bottlenecks of such
heterogeneous systems is the data transfer bandwidth to GPUs when running I/O intensive applications. The traditional approach gets
data from storage to the host memory and then transfers it to the GPU, which can limit data throughput and processing and thus
degrade the end-to-end performance.
In this paper, we propose a new framework to address the above issue by exploiting Peer-to-Peer Direct Memory Access to allow
GPU direct access of the storage device and thus enhance the performance for parallel data processing applications in a
heterogeneous big-data platform. Our heterogeneous cluster is supplied with CPUs and GPUs as computing resources and Non-
Volatile Memory express (NVMe) drives as storage resources. We deploy an Apache Spark platform to execute representative data processing workloads over this heterogeneous cluster and then adopt Peer-to-Peer Direct Memory Access to connect GPUs to non-volatile storage directly to optimize GPU data access. Experimental results reveal that this heterogeneous Spark platform
transfer throughput and improving both data communication time and end-to-end performance by 20%.
Profiling and Optimization of CT Reconstruction on Nvidia Quadro GV100
Shekhar Dwivedi (Nvidia)*; Andreas Heumann (Nvidia)
Computed Tomography (CT) imaging is a widely used technique for medical and industrial applications. Iterative reconstruction algorithms are desired for improved reconstructed image quality and lower dose, but their computational requirements limit their practical use. The Reconstruction Toolkit (RTK) is a package of open-source, GPU-accelerated algorithms for CBCT (cone-beam computed tomography). GPU-based iterative algorithms give immense acceleration, but they may not use the GPU resources efficiently. Nvidia has released several profilers (Nsight Systems, Nsight Compute) to analyze the GPU implementation of an algorithm from a compute-utilization and memory-efficiency perspective. This paper profiles and analyzes the GPU implementation of the iterative FDK algorithm in RTK and optimizes it for computation and memory usage on a Quadro GV100 GPU with 32 GB of memory and over 5000 CUDA cores. The RTK-based GPU-accelerated iterative FDK, when applied to a 4-byte-per-pixel input projection dataset of size 1.1 GB (512x512x1024) for 20 iterations to reconstruct a 440 MB volume (512x512x441) at 4 bytes per pixel, resulted in a total runtime of ~11.2 seconds per iteration. The optimized RTK-based iterative FDK presented in this paper took ~1.3 seconds per iteration.
A Communication-Efficient Multi-Chip Design for Range-Limited Molecular Dynamics
Chunshu Wu (Boston University)*; Tong Geng (Boston University); Chen Yang (Boston University); Vipin Sachdeva (Silicon
Therapeutics); Woody Sherman (Silicon Therapeutics); Martin Herbordt (Boston University)
Molecular Dynamics simulation (MD) has long been considered a promising FPGA application, especially with clusters of tightly
coupled FPGAs where the large-scale, general-purpose, low-latency interconnects provide a communication capability not available
with any other COTS computing technology. Parallelization of one part of the MD computation, the 3D FFT, has been studied
previously; for likely FPGA cluster sizes, however, the range-limited computation (RL) is more challenging. The motivation here is that
the direct replication of the single-chip design suffers from inefficient inter-board bandwidth usage. In particular, although
communication in RL is local, likely bandwidth limitations will constrain performance unless great care is taken in design and analysis.
In the multi-chip scenario, inter-board bandwidth is the critical constraint and the main target of this work. We analyze it with respect to
three application restructurings: workload distribution, data forwarding pattern, and data locality. We describe how bandwidth can be
balanced by configuring workload distribution and data forwarding paths with respect to the number of on-board transceiver ports. We
also show that, by manipulating data locality, the multi-chip design is efficiently migrated from the single-chip design, and the total
bandwidth required can be configured to satisfy the bandwidth limit.
Bit-Error Aware Quantization for DCT-based Lossy Compression
Jialing Zhang (University of Massachusetts Lowell)*; Jiaxi Chen (University of Massachusetts Lowell); Aekyeung Moon (ETRI); Xiaoyan Zhuo (University of Massachusetts Lowell); Seung Woo Son (University of Massachusetts Lowell)
Scientific simulations run by high-performance computing (HPC) systems produce a large amount of data, which causes an extreme
I/O bottleneck and a huge storage burden. Applying compression techniques can mitigate such overheads through reducing the data
size. Unlike traditional lossless compression, error-controlled lossy compressors, such as SZ, ZFP, and DCTZ, designed for scientists who demand not only high compression ratios but also a guarantee of a certain degree of precision, are coming into prominence. While the rate-distortion efficiency of recent lossy compressors, especially the DCT-based one, is promising due to its high-compression encoding, the overall coding architecture is still conservative, necessitating a quantization scheme that strikes a balance between different encoding possibilities and varying rate-distortions. In this paper, we aim to improve the performance of the DCT-based compressor DCTZ by optimizing its quantization model and encoding mechanism. Specifically, we propose a bit-efficient
quantizer based on the DCTZ framework, develop a unique ordering mechanism based on the quantization table, and extend the
encoding index. We evaluate the performance of our optimized DCTZ in terms of rate-distortion using real-world HPC datasets. Our
experimental evaluations demonstrate that, on average, our proposed approach can improve the compression ratio of the original
DCTZ by 1.38x. Moreover, combined with the extended encoding mechanism, the optimized DCTZ shows a competitive performance
with state-of-the-art lossy compressors, SZ and ZFP.
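For context, the sketch below shows the generic DCT-based lossy compression loop that such a quantizer operates inside: block the data, apply an orthonormal DCT, quantize coefficients against an error bound, and invert. The block size, quantization step, and error bound are illustrative assumptions; this is not the DCTZ quantization table, ordering mechanism, or encoding index proposed in the paper.

    import numpy as np
    from scipy.fft import dct, idct

    def dct_lossy_roundtrip(x, block=64, err_bound=1e-3):
        """Blockwise DCT, uniform quantization of coefficients, then inverse DCT."""
        pad = (-len(x)) % block
        xb = np.pad(x, (0, pad)).reshape(-1, block)
        coeffs = dct(xb, type=2, norm='ortho', axis=1)     # orthonormal DCT-II per block
        q = 2.0 * err_bound                                 # uniform quantization step
        quantized = np.round(coeffs / q).astype(np.int32)   # integer indices to be encoded
        recon = idct(quantized * q, type=2, norm='ortho', axis=1)
        return recon.ravel()[:len(x)], quantized

    x = np.sin(np.linspace(0, 20, 1000)) + 0.01 * np.random.randn(1000)
    xr, qidx = dct_lossy_roundtrip(x)
    print("max abs reconstruction error:", np.max(np.abs(x - xr)))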
3-3: Case Studies & Benchmarking Session (14:15-15:30 EDT)
Accelerating MRI Reconstruction on TPUs
Tianjian Lu (Google)*; Thibault Marin (Harvard Medical School); Yue Zhuo (Harvard Medical School); Yi-fan Chen (Google); Chao Ma (Massachusetts General Hospital)
Advanced magnetic resonance (MR) image reconstructions, such as compressed sensing and subspace-based imaging, are large-scale, iterative optimization problems. Given the large number of reconstructions required in practical clinical usage, the computation time of these advanced reconstruction methods is often unacceptable. In this work, we propose using Google's Tensor Processing Units (TPUs) to accelerate MR image reconstruction. The TPU is an application-specific integrated circuit (ASIC) for machine-learning applications, which has recently been used to solve large-scale scientific computing problems. As a proof of concept, we implement the alternating direction method of multipliers (ADMM) in TensorFlow to reconstruct images on TPUs. The reconstruction is based on multi-channel, sparsely sampled, and radial-trajectory k-space data with sparsity constraints. The forward and inverse non-uniform Fourier transform operations are formulated in terms of matrix multiplications as in the discrete Fourier transform. The sparsifying transform and its adjoint operations are formulated as convolutions. The data decomposition is applied to the measured k-space data such that the aforementioned tensor operations are localized within individual TPU cores. The data decomposition and
the inter-core communication strategy are designed in accordance with the TPU interconnect network topology in order to minimize the
communication time. The accuracy and the high parallel efficiency of the proposed TPU-based image reconstruction method are
demonstrated through numerical examples.
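The abstract's formulation of the non-uniform Fourier transform as a matrix multiplication can be written down compactly. The NumPy sketch below builds the forward non-uniform DFT matrix for arbitrary (non-Cartesian) k-space sample locations and applies it and its adjoint; the grid size and radial sampling pattern are illustrative assumptions, and the sketch ignores coil sensitivities, regularization, and the ADMM iterations described in the paper.

    import numpy as np

    def nudft_matrix(kx, ky, grid):
        """Forward non-uniform DFT as an explicit matrix: A[m, n] = exp(-2j*pi * k_m . r_n)."""
        xs, ys = np.meshgrid(np.arange(grid) / grid, np.arange(grid) / grid, indexing='ij')
        r = np.stack([xs.ravel(), ys.ravel()])            # 2 x N image-domain coordinates
        k = np.stack([kx, ky])                            # 2 x M k-space sample locations
        return np.exp(-2j * np.pi * (k.T @ r))            # M x N encoding matrix

    grid, nspokes, nread = 32, 16, 64
    theta = np.linspace(0, np.pi, nspokes, endpoint=False)
    rad = np.linspace(-grid / 2, grid / 2, nread)
    kx = np.outer(np.cos(theta), rad).ravel()             # radial trajectory (illustrative)
    ky = np.outer(np.sin(theta), rad).ravel()

    A = nudft_matrix(kx, ky, grid)                        # forward op as a plain matmul
    img = np.zeros((grid, grid)); img[12:20, 12:20] = 1.0 # toy image
    kdata = A @ img.ravel()                               # simulated k-space measurements
    adjoint = (A.conj().T @ kdata).reshape(grid, grid)    # A^H d: gridding-like backprojection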
Processing of Crowdsourced Observations of Aircraft in a High Performance Computing Environment
Andrew Weinert (MIT Lincoln Laboratory)*; Ngaire Underhill (MIT Lincoln Laboratory); Bilal Gill (MIT Lincoln Laboratory); Ashley Wicks
(MIT Lincoln Laboratory)
As unmanned aircraft systems (UASs) continue to integrate into the U.S. National Airspace System (NAS), there is a need to quantify
the risk of airborne collisions between unmanned and manned aircraft to support regulation and standards development. Both
regulators and standards developing organizations have made extensive use of Monte Carlo collision risk analysis simulations using
probabilistic models of aircraft flight.
Northeast Cyberteam – Building an Environment for Sharing Best Practices and Solutions for Research Computing
Scott Valcourt (University of New Hampshire)*; John Goodhue (MGHPCC); Julie Ma (MGHPCC); Adrian Del Maestro (University of
Vermont); Sia Najafi (Worcester Polytechnic Institute); Bruce Segee (University of Maine); Ralph Zottola (University of Alabama)
The Northeast Cyberteam Program is a collaborative effort across Maine, New Hampshire, Vermont, and Massachusetts that seeks to
assist researchers at small and medium-sized institutions in the region with making use of cyberinfrastructure, while simultaneously
building the next generation of research computing facilitators. Recognizing that research computing facilitators are frequently in short
supply, the program also places intentional emphasis on capturing and disseminating best practices in an effort to enable opportunities
to leverage and build on existing solutions whenever practical. The program combines direct assistance to computationally intensive
research projects; experiential learning opportunities that pair experienced mentors with students interested in research computing
facilitation; sharing of resources and knowledge across large and small institutions; and tools that enable efficient oversight and
possible replication of these ideas in other regions.
Each project involves a researcher seeking to better utilize cyberinfrastructure in research, a student facilitator, and a mentor with
relevant domain expertise. These individuals may be at the same institution or at separate institutions. The student works with the
researcher and the mentor to become a bridge between the infrastructure and the research domain. Through this model, students
receive training and opportunities that otherwise would not be available, research projects get taken to a higher level, and the
effectiveness of the mentor is multiplied.
Providing tools to enable self-service learning is a key concept in our strategy to develop facilitators through experiential learning,
recognizing that one of the most fundamental skills of successful facilitators is their ability to quickly learn enough about new domains
and applications to be able to draw parallels with their existing knowledge and help to solve the problem at hand. The Cyberteam Portal is
used to access the self-service learning resources developed to provide just-in-time information delivery to participants as they embark
on projects in unfamiliar domains, and also serves as a receptacle for best practices, tools, and techniques developed during a project.
Tools include Ask.CI, an interactive site for questions and answers; a learning resources repository used to collect online training
modules vetted by cyberteam projects that provide starting points for subsequent projects or independent activities; and a Github
repository. The Northeast Cyberteam was created with funding from the National Science Foundation, but has developed strategies for
sustainable operations.
Benchmarking network fabrics for data distributed training of deep neural networks
Siddharth Samsi (MIT Lincoln Laboratory)*; Andrew Prout (MIT); Michael Jones (MIT Lincoln Laboratory); Andrew Kirby
(Massachusetts Institute of Technology Lincoln Laboratory); Vijay Gadepally (MIT Lincoln Laboratory - USA); Jeremy Kepner (MIT
Lincoln Laboratory); Albert Reuther (MIT Lincoln Laboratory)
Artificial Intelligence/Machine Learning applications require the training of complex models on large amounts of labelled data. The
large computational requirements for training deep models have necessitated the development of new methods for faster training. One
such approach is the data parallel approach, where the training data is distributed across multiple compute nodes. This approach is
simple to implement and supported by most of the commonly used machine learning frameworks. The data parallel approach
leverages MPI for communicating gradients across all nodes. In this paper, we examine the effects of using different physical
hardware interconnects and network-related software primitives for enabling data distributed deep learning. We compare the effect of
using GPUDirect and NCCL on Ethernet and OmniPath fabrics. Our results show that using Ethernet-based networking in shared HPC
systems does not have a significant effect on the training times for commonly used deep neural network architectures or traditional
HPC applications such as Computational Fluid Dynamics.
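The gradient exchange at the heart of the data-parallel approach described above reduces, in its simplest form, to an MPI all-reduce over each worker's local gradients. The mpi4py sketch below shows only that basic pattern; the gradient array and averaging step are illustrative, and real frameworks use fused buffers and the NCCL/GPUDirect paths discussed in the paper.

    # Run with, e.g.: mpirun -np 4 python allreduce_gradients.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    local_grad = np.random.rand(1_000_000).astype(np.float32)   # this worker's gradients
    global_grad = np.empty_like(local_grad)

    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)         # sum gradients across workers
    global_grad /= size                                         # average before the update step

    if rank == 0:
        print("averaged gradient norm:", np.linalg.norm(global_grad))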
A congestion control mechanism for SDN-based fat-tree networks
Haitham Ghalwash (University of Connecticut)*; Chun-Hsi Huang (Southern Illinois University)
Nowadays, data centers are experiencing growing application demands that mandate a software-oriented network architecture. A Software-Defined Network (SDN) is the new technology for overcoming the traditional network's limitations. QoS is one current limitation that needs to be well structured for a successful software-oriented network architecture. Several key factors affect QoS, including traffic shaping and congestion control. This paper proposes a congestion control mechanism in SDN-based networks to enhance overall QoS. The proposed mechanism monitors and detects congested parts of the network and reacts automatically to reduce the traffic load. The traffic load is redistributed to ensure better performance by re-routing a subset of flows. The re-routing decision is based on a passively measured QoS metric, port utilization. Experiments are conducted to prove the effectiveness of the proposed mechanism in an SDN-based fat-tree network. Results showed that TCP traffic recorded a noticeable improvement, namely 22.4% in average delay, 21.3% in throughput, 18.6% in maximum delay, and 15.3% in jitter. Moreover, the maximum monitored port utilization in the aggregation and core switches was also reduced by 22% on average.
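The control loop described above can be summarized as: periodically poll port utilization, flag ports above a threshold, and re-route a subset of the flows crossing them. The Python sketch below captures that logic only; get_port_utilization, select_flows_on_port, and install_alternate_path are hypothetical placeholders standing in for an SDN controller's APIs, and the threshold and polling interval are arbitrary assumptions.

    import time

    UTIL_THRESHOLD = 0.8        # fraction of link capacity treated as "congested" (assumed)
    POLL_INTERVAL_S = 2.0

    def congestion_control_loop(switch_ports, get_port_utilization,
                                select_flows_on_port, install_alternate_path):
        """Threshold-based re-routing loop; the three callables are controller-specific."""
        while True:
            for port in switch_ports:
                util = get_port_utilization(port)            # passively measured QoS metric
                if util > UTIL_THRESHOLD:
                    for flow in select_flows_on_port(port):  # a subset of flows to move
                        install_alternate_path(flow)         # re-route onto a less-loaded path
            time.sleep(POLL_INTERVAL_S)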
3-4: Case Studies & Benchmarking Session (15:45-17:00 EDT)
Performance Strategies for Parallel Bitonic Sort on a Migratory Thread Architecture
Kaushik Velusamy (University of Maryland, Baltimore County); Thomas Rolinger (Laboratory for Physical Sciences)*; Janice McMahon
(Emu Technologies)
Large-scale data analytics often represent vast amounts of sparse data as a graph. As a result, the underlying kernels in data analytics
can be reduced down to operations over graphs, such as searches and traversals. Graph algorithms are notoriously difficult to
implement for high performance due to the irregular nature of their memory access patterns, resulting in poor utilization of a traditional
cache memory hierarchy. As a result, new architectures have been proposed that specifically target irregular applications. One
example is the cache-less Emu migratory thread architecture developed by Lucata Technology. While it is important to evaluate and
understand irregular applications on a system such as Emu, it is equally important to explore applications which are not irregular
themselves, but are often used as key pre-processing steps in irregular applications. Sorting a list of values is one such pre-processing
step, as well as one of the fundamental operations in data analytics. In this paper, we extend our prior preliminary evaluation of parallel
bitonic sort on the Emu architecture. We explore different performance strategies for bitonic sort by leveraging the unique features of
Emu. In doing so, we implement three significant capabilities into bitonic sort: a smart data layout that periodically remaps data to avoid
remote accesses, efficient thread spawning strategies, and adaptive loop parallelization to achieve proper load balancing over time.
We present a performance evaluation that demonstrates speedups of as much as 14.26x by leveraging these capabilities.
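As a reference for readers less familiar with the underlying algorithm, the sketch below is a plain Python bitonic sort for power-of-two input sizes. It illustrates only the compare-exchange network itself, not the Emu-specific data layout, thread-spawning strategies, or adaptive loop parallelization evaluated in the paper.

    def bitonic_sort(a):
        """In-place bitonic sort of a list whose length is a power of two."""
        n = len(a)
        assert n & (n - 1) == 0, "bitonic sort requires a power-of-two length"
        k = 2
        while k <= n:                 # size of the bitonic sequences being merged
            j = k // 2
            while j >= 1:             # compare-exchange distance within each stage
                for i in range(n):
                    partner = i ^ j
                    if partner > i:
                        ascending = (i & k) == 0
                        if (a[i] > a[partner]) == ascending:
                            a[i], a[partner] = a[partner], a[i]
                j //= 2
            k *= 2

    import random
    data = [random.randint(0, 999) for _ in range(64)]
    bitonic_sort(data)
    assert data == sorted(data)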
Hash Table Scalability on Intel PIUMA
Balasubramanian Seshasayee (Intel Corporation)*; Joshua Fryman (Intel); Ibrahim Hur (Intel Corporation)
The Intel PIUMA (Programmable and Integrated Unified Memory Architecture) is a scalable, massively multithreaded architecture
designed to operate on unstructured data, with a global address space, fine-grain memory access, and various novel features for latency hiding during data movement. Hash tables are a commonly used data structure with unstructured data, hence it is imperative that the performance and scaling of hash table usage are optimized for this architecture. We study three
different hash table implementations on a PIUMA simulator to show that a dual-atomics based implementation, a unique feature in
PIUMA, performs competitively both at larger scales and under hash collisions. Our implementations are able to achieve strong scaling
up to 16,384 hardware threads.
Enhanced Parallel Simulation for ACAS X Development
Adam Gjersvik (MIT Lincoln Laboratory)*
ACAS X is the next generation airborne collision avoidance system intended to meet the demands of the rapidly evolving U.S. National
Airspace System (NAS). The collision avoidance safety and operational suitability of the system are optimized and continuously
evaluated by simulating billions of characteristic aircraft encounters in a fast-time Monte Carlo environment. There is therefore an
inherent computational cost associated with each ACAS X design iteration and parallelization of the simulations is necessary to keep
up with rapid design cycles. This work describes an effort to profile and enhance the parallel computing infrastructure deployed on the
computing resources offered by the Lincoln Laboratory Supercomputing Center. The approach to large-scale parallelization of our fast-
time airspace encounter simulation tool is presented along with corresponding parallel profile data collected on different kinds of
compute nodes. A simple stochastic model for distributed simulation is also presented to inform optimal work batching for improved
simulation efficiency. The paper concludes with a discussion on how this high-performance parallel simulation method enables the
rapid safety-critical design of ACAS X in a fast-paced iterative design process.
Architectural Analysis of Deep Learning on Edge Accelerators
Luke Kljucaric (NSF SHREC)*; Alex Johnson (NSF SHREC); Alan George (NSF SHREC)
As computer architectures continue to integrate application-specific hardware, it is critical to understand the relative performance of
devices for maximum app acceleration. The goal of benchmarking suites, such as MLPerf for analyzing machine-learning (ML)
hardware performance, is to standardize a fair comparison of different hardware architectures. However, there are many apps that are
not well represented by these standards that require different workloads, such as ML models and datasets, to achieve similar goals.
Additionally, many devices feature hardware optimized for data types other than 32-bit floating-point numbers, the standard
representation defined by MLPerf. Edge-computing devices often feature app-specific hardware to offload common operations found in
ML apps from the constrained CPU. This research analyzes multiple low-power compute architectures that feature ML-specific
hardware on a case study of handwritten Chinese character recognition. Specifically, AlexNet and a custom version of GoogLeNet are
benchmarked in terms of their streaming latency for optical character recognition. Because these models are custom and not the most widely used, many architectures are not specifically optimized for them. These models can therefore stress devices in different, yet insightful, ways from which generalizations about the performance of other models can be drawn. The NVIDIA Jetson AGX
Xavier (AGX), Intel Neural Compute Stick 2 (NCS2), and Google Edge TPU architectures are analyzed with respect to their
performance. The design of the AGX and TPU devices showcased the lowest streaming latency for AlexNet and GoogLeNet,
respectively. Additionally, the tightly-integrated NCS2 design showed the best generalizability in performance and efficiency across
neural networks.
3-S1: Graph Challenge Special (17:30-19:30 EDT)
Scaling Graph Clustering with Distributed Sketches
Benjamin Priest (Lawrence Livermore National Laboratory)*; Alec Dunton (University of Colorado Boulder); Geoffrey Sanders (LLNL)
The unsupervised learning of community structure, in particular the partitioning of vertices into clusters or communities, is a canonical and well-studied problem in exploratory graph analysis. However, as with most graph analyses, the introduction of immense scale presents challenges to traditional methods. Spectral clustering in distributed memory, for example, requires hundreds of expensive bulk-synchronous communication rounds to compute an embedding of vertices onto a few eigenvectors of a graph-associated matrix. Furthermore, the whole computation may need to be repeated if the underlying graph changes by even a low percentage of edge updates.
We present a method inspired by spectral clustering where we instead use matrix sketches derived from random dimension-reducing
projections. We show that our method produces embeddings that yield performant clustering results given a fully-dynamic stochastic
block model stream using both the fast Johnson-Lindenstrauss and CountSketch transforms. We also discuss the effects of stochastic
block model parameters upon the required dimensionality of the subsequent embeddings, and show how random projections could
significantly improve the performance of graph clustering in distributed memory.
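A single-node version of the idea (embed vertices with a random projection instead of eigenvectors, then cluster the embedding) fits in a few lines of NumPy. The sketch below uses a dense Gaussian, Johnson-Lindenstrauss-style projection of the normalized adjacency matrix and SciPy's k-means; it is a toy, in-memory stand-in for the paper's distributed, streaming, CountSketch-based method, and the stochastic block model parameters are arbitrary.

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def sketch_cluster(A, dim=16, k=2, seed=0):
        """Embed vertices via a Gaussian random projection of the normalized adjacency,
        then cluster the embedding with k-means."""
        rng = np.random.default_rng(seed)
        d = A.sum(axis=1)
        d[d == 0] = 1.0
        An = A / np.sqrt(np.outer(d, d))          # symmetric normalization D^-1/2 A D^-1/2
        R = rng.standard_normal((A.shape[0], dim)) / np.sqrt(dim)   # JL-style projection
        emb = An @ R                              # n x dim sketch of the spectral structure
        _, labels = kmeans2(emb, k, minit='++')
        return labels

    # Two-block stochastic block model, 60 vertices per block
    n, p_in, p_out = 120, 0.3, 0.02
    rng = np.random.default_rng(1)
    blocks = np.repeat([0, 1], n // 2)
    P = np.where(np.equal.outer(blocks, blocks), p_in, p_out)
    A = (rng.random((n, n)) < P).astype(float)
    A = np.triu(A, 1); A = A + A.T                # symmetric, no self-loops
    print(sketch_cluster(A, k=2))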
At-Scale Sparse Deep Neural Network Inference With Efficient GPU Implementation
Mert Hidayetoglu (University of Illinois at Urbana-Champaign)*; Carl Pearson (University of Illinois at Urbana-Champaign); Vikram
Sharma Mailthody (University of Illinois at Urbana-Champaign); Eiman Ebrahimi (NVIDIA); Jinjun Xiong (IBM Thomas J. Watson
Research Center); Rakesh Nagi (University of Illinois at Urbana-Champaign); Wen-Mei Hwu (University of Illinois at Urbana-
Champaign)
This paper presents GPU performance optimization and scaling results for inference models of the Sparse Deep Neural Network
Challenge 2020. Demands for network quality have increased rapidly, pushing the size and thus the memory requirements of many
neural networks beyond the capacity of available accelerators. Sparse deep neural networks (SpDNN) have shown promise for reining
in the memory footprint of large neural networks. However, there is room for improvement in implementing SpDNN operations on
GPUs. This work presents optimized sparse matrix multiplication kernels fused with the ReLU function. The optimized kernels reuse
input feature maps from shared memory and sparse weights from registers. For multi-GPU parallelism, our SpDNN implementation duplicates the weights and statically partitions the feature maps across GPUs. Results for the challenge benchmarks show that the
proposed kernel design and multi-GPU parallelization achieve up to 180 TeraEdges per second inference throughput. These results
are up to 4.3X faster for a single GPU and an order of magnitude faster at full scale than those of the champion of the 2019 Sparse
Deep Neural Network Graph Challenge for the same generation of NVIDIA V100 GPUs. Using the same implementation, we also
show single-GPU throughput on NVIDIA A100 is 2.37X faster than V100.
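The computation being accelerated here is, at its core, a chain of sparse matrix products fused with ReLU, as defined by the Sparse DNN Graph Challenge. The SciPy sketch below shows a simplified CPU reference of that computation; the random weights, features, and bias value are stand-ins for the challenge data (the challenge's activation cap is omitted), and the paper's contribution is performing this with fused, shared-memory- and register-optimized CUDA kernels across multiple GPUs.

    import numpy as np
    import scipy.sparse as sp

    def spdnn_inference(Y0, weights, bias=-0.3):
        """Simplified sparse DNN inference: Y <- ReLU(Y W + b), applied layer by layer."""
        Y = Y0
        for W in weights:
            Z = Y @ W                              # sparse matrix-matrix multiply (SpMM)
            Z.data += bias                         # bias added to stored nonzeros only
            Z.data = np.maximum(Z.data, 0.0)       # fused ReLU
            Z.eliminate_zeros()
            Y = Z
        return Y

    nfeat, nneurons, nlayers = 256, 1024, 4
    Y0 = sp.random(nfeat, nneurons, density=0.05, random_state=0, format='csr')
    weights = [sp.random(nneurons, nneurons, density=0.01, random_state=i, format='csr')
               for i in range(nlayers)]
    Y = spdnn_inference(Y0, weights)
    print("nonzeros in final activations:", Y.nnz)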
A Novel Inference Algorithm for Large Sparse Neural Network using Task Graph Parallelism
Dian-Lun Lin (University of Utah); Tsung-Wei Huang (University of Utah)*
The ever-increasing size of modern deep neural network (DNN) architectures has put increasing strain on the hardware needed to
implement them. Sparsified DNNs can greatly reduce memory costs and increase throughput over standard DNNs, if the loss of
accuracy can be adequately controlled. However, sparse DNNs present unique computational challenges. Efficient model or data
parallelism algorithms are extremely hard to design and implement. The recent MIT/IEEE/Amazon HPEC Graph Challenge has drawn
attention to high-performance inference methods for large sparse DNNs. In this paper, we introduce SNIG, an efficient inference
engine for large sparse DNNs. SNIG develops highly optimized inference kernels and leverages the power of CUDA Graphs to enable
efficient decomposition of model and data parallelisms. Our decomposition strategy is flexible and scalable to different partitions of
data volumes, model sizes, and GPU numbers. We have evaluated SNIG on the official benchmarks of HPEC Sparse DNN Challenge
and demonstrated its promising performance scalable from a single GPU to multiple GPUs. Compared to the champion of the 2019
HPEC Sparse DNN Challenge, SNIG can finish all inference workloads using only a single GPU. On the largest DNN, which has more than 4 billion parameters across 1920 layers of 65536 neurons each, SNIG is up to 2.3× faster than a state-of-the-art baseline on a machine with 4 GPUs.
TriC: Distributed-memory Triangle Counting by Exploiting the Graph Structure
Sayan Ghosh (Washington State University)*; Mahantesh Halappanavar (Pacific Northwest National Laboratory)
Graph analytics has emerged as an important tool in the analysis of large scale data from diverse application domains such as social
networks, cyber security and bioinformatics. Counting the number of triangles in a graph is a fundamental kernel with several
applications such as detecting the community structure of a graph or in identifying important vertices in a graph. The ubiquity of
massive datasets is driving the need to scale graph analytics on parallel systems. However, numerous challenges exist in efficiently
parallelizing graph algorithms, especially on distributed-memory systems. Irregular memory accesses and communication patterns, low
computation to communication ratios, and the need for frequent synchronization are some of the leading challenges.
In this paper, we present TriC, our distributed-memory implementation of triangle counting in graphs using the Message Passing
Interface (MPI), as a submission to the 2020 Graph Challenge competition. Using a set of synthetic and real-world inputs from the
challenge, we demonstrate a speedup of up to 90x relative to previous work on 32 processor-cores of a NERSC Cori node. We also
provide details from distributed runs with up to 8192 processes along with strong scaling results. The observations presented in this
work provide an understanding of the system-level bottlenecks at scale that specifically impact sparse-irregular workloads and will
therefore benefit other efforts to parallelize graph algorithms.
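The kernel being parallelized is easy to state sequentially: for every edge (u, v), count the common neighbors of u and v among higher-numbered vertices. The sketch below is that sequential baseline in Python using set intersections; the paper's contribution lies in distributing this computation with MPI and exploiting the graph structure to reduce communication.

    def count_triangles(edges):
        """Count triangles by intersecting forward (higher-numbered) adjacency sets."""
        fwd = {}                                  # u -> set of neighbors v with v > u
        for u, v in edges:
            if u == v:
                continue
            a, b = (u, v) if u < v else (v, u)
            fwd.setdefault(a, set()).add(b)
            fwd.setdefault(b, set())
        total = 0
        for u, nbrs in fwd.items():
            for v in nbrs:
                total += len(nbrs & fwd[v])       # common higher neighbors close a triangle
        return total

    edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 0), (1, 3)]   # K4 has 4 triangles
    print(count_triangles(edges))                              # -> 4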
Combinatorial Tiling for Sparse Neural Networks
Filip Pawłowski (ENS Lyon)*; Bora Uçar (CNRS and LIP (UMR5668)); Rob Bisseling (Utrecht University); Albert-Jan Yzelman (Huawei
Zürich Research Center)
Sparse deep neural networks (DNNs) emerged from the search for networks with less storage and lower computational complexity. Sparse DNN inference is the task of using such trained networks to classify a batch of input data. We propose an
efficient, hybrid model- and data-parallel DNN inference using hypergraph models and partitioners. We exploit tiling and weak
synchronization to increase cache reuse, hide load imbalance, and hide synchronization costs. Finally, a blocking approach allows
application of this new hybrid inference procedure for deep neural networks. We initially experiment using the hybrid tiled inference
approach only, using the first five layers of networks from the IEEE HPEC 2019 Graph Challenge, and attain up to 2x speedup versus
a data-parallel baseline.
Studying the Effects of Hashing of Sparse Deep Neural Networks on Data and Model Parallelisms
Mohammad Hasanzadeh Mofrad (University of Pittsburgh)*; Rami Melhem (University of Pittsburgh); Mohammad Hammoud (Carnegie
Mellon University); Muhammad Yousuf Ahmad (Carnegie Mellon University in Qatar)
Deep Neural Network (DNN) training and inference are two resource-intensive tasks that are usually scaled out using data or model parallelism, where data parallelism parallelizes over the input data and model parallelism parallelizes over the network. Also, dense matrix-matrix multiplication is the key primitive behind training/inference of dense DNNs. By contrast, sparse DNNs are less resource-intensive than their dense counterparts while offering comparable accuracy. Similarly, they can be parallelized using data or model parallelism with Sparse Matrix-Matrix Multiplication (SpMM) as the key primitive. To scale out, both data and model parallelisms initially use data parallelism to partition the input data among multiple machines. This initial partitioning of the input makes the performance of data and model parallelisms prone to load imbalance, as partitions may be imbalanced. In this paper, we take a deeper look into data and model parallelisms and closely study the mechanics of the SpMM used for each. Moreover, to remedy their load imbalance problem, we incorporate hashing as a simple yet powerful method to address load imbalance. Finally, we use the IEEE HPEC sparse DNN challenge dataset to evaluate the
performance of data and model parallelisms at scale. We scaled up to 32 machines (896 cores) and inferred a large sparse DNN with
4B parameters in 51 seconds. Results suggest that with hashing, data and model parallelisms achieve super-linear speedup due to
better load balance and cache utilization.
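The load-balancing idea is simply to assign rows to machines by a hash of their index rather than by contiguous blocks, so that dense and sparse regions of the input are spread across all partitions. The sketch below compares the nonzero imbalance of contiguous versus hashed row assignment on a synthetic, skewed workload; the row counts, partition count, and skew pattern are hypothetical, and this is only an illustration of the hashing idea, not the paper's implementation.

    import hashlib
    import numpy as np

    def owner(row, num_parts):
        """Partition by hashing the row index instead of assigning contiguous blocks."""
        h = int.from_bytes(hashlib.sha1(str(row).encode()).digest()[:8], 'big')
        return h % num_parts

    rng = np.random.default_rng(0)
    num_rows, num_parts = 100_000, 32
    # Skewed per-row nonzero counts: the first 10,000 rows are much denser than the rest.
    nnz_per_row = rng.integers(1, 100, num_rows) * (np.arange(num_rows) < 10_000) * 10 + 1

    def imbalance(assignment):
        loads = np.bincount(assignment, weights=nnz_per_row, minlength=num_parts)
        return loads.max() / loads.mean()         # 1.0 means perfectly balanced

    contiguous = np.arange(num_rows) * num_parts // num_rows          # block partitioning
    hashed = np.array([owner(r, num_parts) for r in range(num_rows)])
    print("imbalance contiguous:", imbalance(contiguous), " hashed:", imbalance(hashed))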
Incremental Streaming Graph Partitioning
L Durbeck (Virginia Tech)*; Peter Athanas (Virginia Tech)
Graph partitioning is an NP-hard problem whose efficient approximation has long been a subject of interest and progress. The I/O
bounds of contemporary computing environments favor incremental or streaming graph partitioning methods. Methods have sought a
balance between latency, simplicity, accuracy, and memory size. In this paper, we apply an incremental approach to streaming
partitioning that tracks changes with a lightweight proxy triggering repartition as error increases. We evaluate its performance on the
DARPA/MIT Graph Challenge streaming stochastic block partition dataset, and find that it can dramatically reduce the invocation of
partitioning, which can provide an order of magnitude speedup.
KTrussExplorer: Exploring the Design Space of K-truss Decomposition Optimizations on GPUs
Safaa Diab (American University of Beirut); Mhd Ghaith Olabi (American University of Beirut); Izzat El Hajj (American University of
Beirut)*
K-truss decomposition is an important method in graph analytics for finding cohesive subgraphs in a graph. Various works have
accelerated k-truss decomposition on GPUs and have proposed different optimizations while doing so. The combinations of these
optimizations form a large design space. However, most GPU implementations focus on a specific combination or set of combinations
in this space.
This paper surveys the optimizations applied to k-truss decomposition on GPUs, and presents KTrussExplorer, a framework for
exploring the design space formed by the combinations of these optimizations. Our evaluation shows that the best combination depends highly on the graph of choice, and analyzes the conditions that make each optimization attractive. Some of the best combinations
we find outperform previous Graph Challenge champions on many large graphs.
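For reference, k-truss decomposition for a fixed k can be written as a simple peeling loop: repeatedly delete any edge supported by fewer than k-2 triangles. The sketch below is that sequential baseline; the GPU optimizations surveyed in the paper accelerate variants of this loop (support counting, edge removal, and so on), and the example graph is purely illustrative.

    def k_truss(edges, k):
        """Return the edges of the k-truss: every remaining edge lies in >= k-2 triangles."""
        adj = {}
        for u, v in edges:
            if u != v:
                adj.setdefault(u, set()).add(v)
                adj.setdefault(v, set()).add(u)
        changed = True
        while changed:
            changed = False
            for u in list(adj):
                for v in list(adj[u]):
                    if u < v:
                        support = len(adj[u] & adj[v])     # triangles containing edge (u, v)
                        if support < k - 2:
                            adj[u].discard(v); adj[v].discard(u)
                            changed = True
        return sorted((u, v) for u in adj for v in adj[u] if u < v)

    # A 4-clique with a pendant edge: the 4-truss is exactly the 4-clique.
    edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
    print(k_truss(edges, 4))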
Analysis of floating-point round-off error in linear algebra routines for graph clustering
Lucia Yang (University of Colorado Boulder)*; Alyson Fox (LLNL)
We explore the various ways rounding errors can impact the power method for calculating the Fiedler vector for graph clustering. A rounding-error analysis reveals that the best eigenpair that is computable with a certain floating-point precision type has a worst-case error that scales with its unit round-off. Although rounding errors can accumulate in the power method at the worst-case bound, this
behavior is not reflected in some practical examples. Furthermore, our numerical experiments show that rounding errors from the
power method may satisfy the conditions necessary for the bounding of the mis-clustering rate and that the approximate eigenvectors
with errors close to half precision unit round-off can yield sufficient clustering results for partitioning stochastic block model graphs.
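For readers who want to experiment with the setting described above, the sketch below computes an approximate Fiedler vector by running the power method, in a chosen floating-point precision, on a shifted Laplacian with the constant eigenvector projected out. The shift, iteration count, and example graph are illustrative choices; the paper's rounding-error analysis and mis-clustering bounds are not reproduced here.

    import numpy as np

    def fiedler_power_method(A, dtype=np.float32, iters=2000):
        """Approximate Fiedler vector via power iteration on (c*I - L), deflating the
        all-ones eigenvector, in a chosen floating-point precision."""
        A = A.astype(dtype)
        n = A.shape[0]
        d = A.sum(axis=1)
        L = np.diag(d) - A                        # combinatorial graph Laplacian
        c = dtype(2) * d.max()                    # shift so that c*I - L is positive semidefinite
        B = c * np.eye(n, dtype=dtype) - L
        ones = np.ones(n, dtype=dtype) / np.sqrt(dtype(n))
        x = np.random.default_rng(0).standard_normal(n).astype(dtype)
        for _ in range(iters):
            x -= (ones @ x) * ones                # project out the trivial eigenvector
            x = B @ x
            x /= np.linalg.norm(x)
        return x                                  # signs give a 2-way partition

    # Two triangles joined by a single bridge edge: the sign pattern separates them.
    A = np.zeros((6, 6))
    for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
        A[u, v] = A[v, u] = 1.0
    v2 = fiedler_power_method(A)
    print(np.sign(v2))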
Wednesday, September 23