2021
IEEE High Performance Extreme Computing
Virtual Conference
20 - 24 September 2021

Tuesday, September 21
2-V: Policy Spotlight Session (10:30-11:00)
Session Chair(s): Albert Reuther
Invited Talk: Public Investments In Science for the Economic Security of the United States
Prof. Jonathan Gruber (MIT Sloan School)

Spectral Graph Partitioning Using Geodesic Distance-based Projection
Yasunori Futamura (University of Tsukuba)*; Ryota Wakaki (University of Tsukuba); Tetsuya Sakurai (University of Tsukuba)
The graph partitioning problem is one of the most fundamental combinatorial problems, with a wide variety of applications. Spectral graph partitioning has a solid theoretical foundation but is, in practice, less efficient than standard multilevel partitioners; in this paper, we propose an efficient approach for implementing it. The proposed method approximates the eigenvectors of a graph-Laplacian matrix using basis vectors derived by a geodesic distance-based approach instead of a standard linear algebraic one. This geodesic distance-based approach is the key to making our
method competitive in comparison to the multilevel partitioners. The primary building blocks of our method are the breadth-first search and the sparse
matrix-vector product, which are the main kernels intensively studied in the high-performance computing field. Based on a performance evaluation in the
shared-memory parallel setting, we demonstrate that our method is comparable to standard partitioners in terms of both quality and speed. We also show
that our method has the potential to facilitate reproducibility-aware parallel partitioning.
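For context on what such partitioners compute, the baseline the paper builds on (standard spectral bisection via the Fiedler vector) can be sketched with dense linear algebra. This is a minimal illustration, not the authors' geodesic-distance method, and uses a dense eigensolver that would not scale to real partitioning workloads:

```python
import numpy as np

def spectral_bisection(adj):
    """Partition a graph in two using the Fiedler vector of its Laplacian.

    adj: symmetric 0/1 adjacency matrix (NumPy array, zero diagonal).
    Returns a boolean mask: True for one part, False for the other.
    """
    degree = np.diag(adj.sum(axis=1))
    laplacian = degree - adj
    # Eigenvectors of the Laplacian, sorted by ascending eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    fiedler = eigvecs[:, 1]               # vector of the second-smallest eigenvalue
    return fiedler >= np.median(fiedler)  # split at the median for balance

# Two triangles joined by a single bridge edge (vertices 0-2 and 3-5):
# spectral bisection should cut the bridge.
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0
part = spectral_bisection(A)
```

The proposed method replaces the eigensolve above with breadth-first searches and sparse matrix-vector products, which is what makes it competitive with multilevel partitioners.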
Vertical, Temporal, and Horizontal Scaling of Hierarchical Hypersparse GraphBLAS Matrices
Jeremy Kepner (MIT Lincoln Laboratory)*; Timothy A Davis (Texas A&M University); Chansup Byun (MIT Lincoln Laboratory); William
Arcand (MIT); David Bestor (MIT); William Bergeron (MIT); Vijay Gadepally (MIT Lincoln Laboratory); Michael Houle (MIT); Matthew
Hubbell (MIT Lincoln Laboratory); Michael S Jones (MIT Lincoln Laboratory); Anna Klein (MIT); Lauren Milechin (MIT); Julie S Mullen
(MIT Lincoln Laboratory); Andrew Prout (MIT); Albert Reuther (MIT Lincoln Laboratory); Antonio Rosa (MIT); Siddharth Samsi (MIT
Lincoln Laboratory); Charles Yee (MIT Lincoln Laboratory); Peter Michaleas (MIT Lincoln Laboratory)
Hypersparse matrices are a powerful enabler for a variety of network, health, finance, and social applications. Hierarchical hypersparse GraphBLAS
matrices enable rapid streaming updates while preserving algebraic analytic power and convenience. In many contexts, the rate of these updates sets the
bounds on performance. This paper explores hierarchical hypersparse update performance on a variety of hardware with identical software configurations.
The high-level language bindings of the GraphBLAS readily enable performance experiments on diverse hardware simultaneously. The best single-process performance measured was 20,000,000 updates per second. The best single-node performance measured was 170,000,000 updates per second. The hardware used spans nearly a decade, allowing a direct comparison of hardware improvements for this computation over this time range: a 2x increase in single-core performance, a 3x increase in single-process performance, and a 5x increase in single-node performance. Running on nearly two
thousand MIT SuperCloud nodes simultaneously achieved a sustained update rate of over 200,000,000,000 updates per second. Hierarchical
hypersparse GraphBLAS allows the MIT SuperCloud to analyze extremely large streaming network data sets.
Delayed Asynchronous Iterative Graph Algorithms
Mark P Blanco (Carnegie Mellon University)*; Tze Meng Low (Carnegie Mellon University); Scott McMillan (CMU Software
Engineering Institute)
Iterative graph algorithms often compute intermediate values and update them as computation progresses. Updated output values are used as inputs for
computations in current or subsequent iterations; hence the number of iterations required for values to converge can potentially be reduced if the newest values are asynchronously made available to other updates computed in the same iteration. In a multi-threaded shared-memory system, the immediate
propagation of updated values can cause memory contention that may offset the benefit of propagating updates sooner. In some cases, the benefit of a
smaller number of iterations may be diminished by each iteration taking longer. Our key idea is to combine the low memory contention that synchronous
approaches have with the faster information sharing of asynchronous approaches. Our hybrid approach buffers updates from threads locally before
committing them to the global store to control how often threads may cause conflicts for others while still sharing data within one iteration and hence
speeding convergence. On a 112-thread CPU system, our hybrid approach attains a 4.5%-19.4% speedup over an asynchronous approach for PageRank and a 1.9%-17% speedup over asynchronous Bellman-Ford SSSP. Further, our hybrid approach attains 2.56x better performance than the
synchronous approach. Finally, we provide insights as to why delaying updates is not helpful on certain graphs where connectivity is clustered on the main
diagonal of the adjacency matrix.
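The buffering idea can be illustrated with a single-threaded sketch of Bellman-Ford-style SSSP: relaxations are staged in a local buffer (standing in for a thread's private store) and committed to the shared distance array in batches. The function name and batch size are illustrative, not from the paper:

```python
def sssp_delayed(edges, n, src, flush_every=4):
    """SSSP where relaxations are buffered locally and committed to the
    shared distance array in batches, mimicking the delayed-update idea.

    edges: list of (u, v, weight) directed edges; n: vertex count.
    """
    INF = float("inf")
    dist = [INF] * n          # the shared global store
    dist[src] = 0.0
    changed = True
    while changed:
        changed = False
        buffer = {}           # locally staged updates (per "thread")
        for u, v, w in edges:
            base = min(dist[u], buffer.get(u, INF))   # newest visible value
            if base + w < min(dist[v], buffer.get(v, INF)):
                buffer[v] = base + w
                if len(buffer) >= flush_every:        # commit a batch globally
                    for node, d in buffer.items():
                        if d < dist[node]:
                            dist[node] = d
                            changed = True
                    buffer.clear()
        for node, d in buffer.items():                # end-of-iteration flush
            if d < dist[node]:
                dist[node] = d
                changed = True
    return dist

edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 5.0), (2, 3, 1.0)]
dist = sssp_delayed(edges, n=4, src=0)
```

Tuning `flush_every` trades contention on the global store against how quickly fresh values reach other workers, which is the knob the hybrid approach exploits.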
Inverse-Deletion BFS - Revisiting Static Graph BFS Traversals with Dynamic Graph Operations
Oded Green (NVIDIA)
Breadth-First Search (BFS) traversals appear in a wide range of applications and domains. BFS traversals determine the distance between key vertices
and the remaining vertices in the network. This distance, often called the number of hops, is the length of the shortest path between a root and the
remaining vertices in the graph. Given its applicability across multiple domains, BFS has received significant attention from theoretical and applied
communities, including many algorithms for parallelizing BFS. BFS traversals are typically conducted on a graph snapshot (commonly referred to as a
static graph). In this paper, we show a novel algorithm for executing a BFS traversal. While our algorithm also uses a static graph for its input, we show
how to perform a BFS traversal using dynamic graph operations. Specifically, in each BFS level, we remove the subset of the edges that are unnecessary for the next phase. These edges do not affect finding the vertices in the next level, and removing them reduces random memory accesses. We show a top-down BFS
variation of our new algorithm. While our implementation does not outperform state-of-the-art implementations, its performance is competitive.
Furthermore, it shows a novel way to implement BFS and opens up many research opportunities.
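A toy version of the idea can be written as a top-down BFS that mutates its input: edges leading to already-visited vertices are deleted as they are encountered, so later levels scan fewer neighbors. This is a hedged sketch of the concept, not the paper's implementation:

```python
def bfs_inverse_deletion(adj, root):
    """Top-down BFS with dynamic edge deletion.

    adj: dict mapping vertex -> set of neighbors, modified in place.
    Returns a dict mapping each reachable vertex to its BFS level.
    """
    level = {root: 0}
    frontier = [root]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:
            for v in list(adj[u]):
                if v not in level:
                    level[v] = depth
                    next_frontier.append(v)
                else:
                    adj[u].discard(v)  # edge cannot help later levels: delete it
        frontier = next_frontier
    return level

adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
levels = bfs_inverse_deletion(adj, 0)
```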
Invited Talk: The New Internet: From Communication Medium to Transaction Medium
Prof. Sandy Pentland (MIT Media Lab)

2-1: Graph Analytics & Network Science 1 Session (11:00-12:15)
Session Co-Chairs: Sadas Shankar & TBD

2-2: Graph Analytics & Network Science 2 Session (12:30-13:45)
Session Co-Chairs: Sadas Shankar & Nikos Pitsianis

The GraphBLAS in Julia and Python: the PageRank and Triangle Centralities
Michel Pelletier (Graphegon)*; William R Kimmerer (Massachusetts Institute of technology); Timothy A Davis (Texas A&M University);
Timothy Mattson (Intel)
The GraphBLAS is a standard API for expressing graphs in the language of linear algebra. The goal is to provide high performance while exploiting the fundamental simplicity of graph algorithms in terms of a common set of “Basic Linear Algebra Subprograms”. A robust parallel implementation of the
GraphBLAS C specification is available as the SuiteSparse:GraphBLAS library. The simplicity of the GraphBLAS, so apparent “in the math”, is
diminished when expressed in terms of a low level language such as C. To see the full expressive power of the GraphBLAS, a high level interface is
needed so that the elegance of the mathematical underpinnings of the GraphBLAS is clearly apparent in the code. In this paper we introduce the Julia
interface to the SuiteSparse:GraphBLAS library and compare it to the Python interface [2]. We implement the PageRank and Triangle Centrality
algorithms with remarkably little code and show that no significant performance is sacrificed by moving from C to the more productive Python and
Julia interfaces.
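The linear-algebra formulation these interfaces express is compact even in plain NumPy: PageRank is a repeated matrix-vector product with a damping term. This is a generic textbook sketch (dense, no dangling-node handling beyond a guard), not the paper's GraphBLAS code:

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10):
    """PageRank by power iteration on the column form of the transition matrix.

    adj[i, j] = 1 if there is a link from page i to page j.
    """
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    out_deg[out_deg == 0] = 1               # guard against division by zero
    transition = adj / out_deg[:, None]     # row-stochastic link matrix
    rank = np.full(n, 1.0 / n)
    while True:
        new_rank = (1 - damping) / n + damping * (transition.T @ rank)
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank

# 3-page web: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1. Page 1 has the most in-links.
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 1, 0]], dtype=float)
rank = pagerank(A)
```

In a GraphBLAS setting the `transition.T @ rank` step becomes a sparse matrix-vector product over a plus-times semiring, which is where the performance comes from.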
HyPC-Map: A Hybrid Parallel Community Detection Algorithm Using Information-Theoretic Approach
Md Abdul Motaleb Faysal (University of New Orleans)*; Shaikh M. Arifuzzaman (University of New Orleans); Cy Chan (Berkeley Lab);
Maximilian Bremer (Berkeley Lab); Doru Thom Popovici (Lawrence Berkeley National Laboratory); John Shalf (Lawrence Berkeley
National Laboratory)
Community detection has become an important graph analysis kernel due to the tremendous growth of social networks and genomics discoveries. Even though a large number of algorithms exist in the literature, studies show that community detection based on an information-theoretic approach (known as Infomap) delivers better-quality solutions than others. Being inherently sequential, the Infomap algorithm does not scale well for large networks. In this
work, we develop a hybrid parallel approach for community detection in graphs using Information Theory. We perform extensive benchmarking and analyze
hardware parameters to identify and address performance bottlenecks. Additionally, we use cache-optimized data structures to improve cache locality. All of
these optimizations lead to an efficient and scalable community detection algorithm, HyPC-Map, which demonstrates a 25-fold speedup (much higher than the
state-of-the-art map-based techniques) without sacrificing the quality of the solution.
Extended Abstract: K-Truss Implementation in Arkouda
Joseph T Patchett (New Jersey Institute of Technology)*; ZHIHUI DU (New Jersey Institute of Technology); David Bader (New Jersey
Institute of Technology)
We present a K-Truss implementation in Arkouda, an open-source data science framework for graph analytics.
A GraphBLAS implementation of Triangle Centrality
Fuhuan Li (New Jersey Institute of Technology)*; David Bader (New Jersey Institute of Technology)
Identifying key members in large social network graphs is an important graph analytic task. Recently, a new centrality measure called triangle centrality finds members based on the triangle support of vertices in a graph. In this paper, we describe our rapid implementation of triangle centrality using GraphBLAS, an API specification for describing graph algorithms in the language of linear algebra. We use triangle centrality’s algebraic algorithm and easily implement it using the SuiteSparse:GraphBLAS library. A set of experiments on large, sparse graph datasets is conducted to verify the implementation.
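The triangle support that the measure is built on has a well-known algebraic form: masking the matrix square A @ A with A itself counts, for each edge, the triangles containing it. This sketch shows only that counting step in dense NumPy (the full triangle centrality formula and the sparse masked semiring product are beyond this illustration):

```python
import numpy as np

def triangle_support(adj):
    """Per-vertex triangle counts via T = (A @ A) * A (elementwise mask).

    T[v, w] counts triangles containing edge (v, w); each triangle at v is
    counted once per incident edge, so row sums are divided by 2.
    adj: symmetric 0/1 integer matrix with zero diagonal.
    """
    T = (adj @ adj) * adj
    return T.sum(axis=1) // 2

# One triangle (0, 1, 2) plus a pendant vertex 3 attached to 2.
A = np.zeros((4, 4), dtype=int)
for u, v in [(0, 1), (1, 2), (0, 2), (2, 3)]:
    A[u, v] = A[v, u] = 1
support = triangle_support(A)
```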

2-3: Graph Analytics & Network Science 3 Session (14:15-15:30)
Session Co-Chairs: Scott McMillan & Nikos Pitsianis

Enabling Exploratory Large Scale Graph Analytics through Arkouda
ZHIHUI DU (New Jersey Institute of Technology); Oliver A Alvarado Rodriguez (New Jersey Institute of Technology)*; David Bader (New
Jersey Institute of Technology)
Exploratory graph analytics helps maximize the informational value of a graph. However, increasing graph sizes make it impossible for existing popular exploratory data analysis tools to handle dozens-of-terabytes or even larger data sets in the memory of a common laptop or personal computer. Arkouda is a framework under early development that brings together the productivity of Python at the user side with the high performance of Chapel at the server side. In this paper, we present preliminary work on overcoming the memory limit and the high-performance-computing coding roadblock so that high-level Python users can perform large graph analysis. A simple and succinct graph data structure design and implementation at both the Python front end and the Chapel back end in the Arkouda framework are provided. A typical graph algorithm, Breadth-First Search (BFS), is used to show how we can use Chapel to develop high-performance parallel graph algorithms productively. Two Chapel-based parallel BFS algorithms, one high-level version and one corresponding low-level version, have been implemented in Arkouda to support analyzing large graphs. Multiple graph benchmarks are used to evaluate the performance of the provided graph algorithms. Experimental results show that we can optimize performance by tuning the selection of different Chapel high-level data structures and parallel constructs. Our code is open source and available from GitHub (https://github.com/Bader-Research/arkouda).
Digraph Clustering by the BlueRed Method
Tiancheng Liu (Duke University)*; Dimitris Floros (Aristotle University of Thessaloniki); Nikos P Pitsianis (Aristotle University of
Thessaloniki); Xiaobai Sun (Duke University)
We introduce a new method for vertex clustering or community detection on directed graphs (digraphs). The new method is an extension of the BlueRed
method introduced initially for undirected graphs. Complementary to supervised or semisupervised classification, unsupervised graph clustering is
indispensable to exploratory data analysis and knowledge discovery. Conventional graph clustering methods are fundamentally hindered in effectiveness and
efficiency by either the resolution limit or various problems with resolution-parameter selection. BlueRed is original in its analysis, modeling, and solution approach. Its clustering process is simple, fully autonomous, and unsupervised. Among other potential impacts, BlueRed breaks new ground for high-throughput, low-cost, and high-performance graph clustering computation, as it has removed the barrier of parameter tuning/selection. We report benchmark
studies with real-world graph data for evaluating the new method. The clustering results are in remarkable agreement with the ground truth labels. We also
present an important study on the U.S. patent citation graph CITE75_99. More than a quarter of 3.7 million patents have no electronic records of category
codes. With BlueRed, we are able to efficiently and economically give a semantic representation of the patents without category codes.
An Efficient Algorithm for the Construction of Dynamically Updating Trajectory Networks
Deniz Gurevin (University of Connecticut)*; Chris Michael (Naval Research Laboratory); Omer Khan (University of Connecticut)
Trajectory-based spatiotemporal networks (STNs) are useful in a wide range of applications, such as crowd behavior analysis. A significant portion of trajectory-network research focuses on optimizing the analysis of STNs to characterize, control, and predict network behavior. However, these mining algorithms
are typically carried out on a pre-constructed network structure that tracks all moving objects and their trajectories in real time. The construction of such a
trajectory network is itself a computationally expensive task and it is becoming a bigger burden with advancements in analysis algorithms. The traditional
approach is to construct static networks from the temporal snapshots of trajectory data that cannot handle spatiotemporally changing data. This paper proposes
an efficient algorithm that successfully generates and maintains an STN from raw trajectory data. The proposed method is based on a customized R-Tree-based constructor that keeps track of object trajectories and their interactions. It avoids redundant updates as trajectories evolve over time, resulting in a
significant reduction in STN construction time. Based on the experiments we conducted, our method reduces the number of updates by 71.25% as compared
to the static naive STN construction method.
Graph Embedding and Field Based Detection of Non-Local Webs in Large Scale Free Networks
Michael E Franusich (Carnegie Mellon University)*; Franz Franchetti (CMU)
Scalable graph algorithms are usually discrete and heuristics-based and their performance may strongly depend on the topology and size of the targeted
graph. In this paper we investigate an alternative approach where we translate graph problems into a continuous-space problem: we embed graphs into a higher-dimensional Euclidean space using an N-body approach. Then we define a synthetic field as a function of the Euclidean-space vertex locations and find
peaks in the field. These peaks are used to identify regions and thus vertices of interest, and are an emergent property. We test the approach to identify a low-
degree web that is connecting otherwise unrelated vertices in a power law graph. We implement the algorithm in CUDA and show good scalability to graphs
with 7 million nodes.

2-4: Data Intensive Computing Session (15:45-17:00)
Session Co-Chairs: Seung Woo Son & Tim Mattson

Reconfigurable Low-latency Memory System for Sparse Matricized Tensor Times Khatri-Rao Product on FPGA
Sasindu Wijeratne (University of Southern California)*; Rajgopal Kannan (USC); Viktor K Prasanna (Unversity of Southern California)
Tensor decomposition has become an essential tool in many applications in various domains, including machine learning. Sparse Matricized Tensor Times
Khatri-Rao Product (MTTKRP) is one of the most computationally expensive kernels in tensor computations. Despite having significant computational
parallelism, MTTKRP is a challenging kernel to optimize due to its irregular memory access characteristics. This paper focuses on a multi-faceted memory
system, which explores the spatial and temporal locality of the data structures of MTTKRP. Further, users can reconfigure our design depending on the behavior
of the compute units used in the FPGA accelerator. Our system efficiently accesses all the MTTKRP data structures while reducing the total memory access
time, using a distributed cache and Direct Memory Access (DMA) subsystem. Moreover, our work improves the memory access time by 3.5x compared with
commercial memory controller IPs. Also, our system shows 2x and 1.26x speedups compared with cache-only and DMA-only memory systems, respectively.
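For readers unfamiliar with the kernel, mode-1 MTTKRP computes M[i, r] = Σ_jk X[i, j, k] · B[j, r] · C[k, r], i.e., the mode-1 unfolding of X times the Khatri-Rao product of the factor matrices. A one-line dense reference (using a small random tensor as a stand-in for the sparse tensors the accelerator targets) can be written with einsum:

```python
import numpy as np

def mttkrp_mode1(X, B, C):
    """Mode-1 MTTKRP: M[i, r] = sum over j, k of X[i, j, k] * B[j, r] * C[k, r].

    X: third-order tensor (I x J x K); B: J x R; C: K x R factor matrices.
    """
    return np.einsum("ijk,jr,kr->ir", X, B, C)

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 3, 4))   # small dense stand-in for a sparse tensor
B = rng.standard_normal((3, 2))
C = rng.standard_normal((4, 2))
M = mttkrp_mode1(X, B, C)
```

The irregular memory accesses the paper targets arise when X is sparse: each nonzero X[i, j, k] gathers row j of B and row k of C, which is what the proposed cache/DMA memory system accelerates.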
Non-Volatile Memory Accelerated Geometric Multi-Scale Resolution Analysis
Andrew E Wood (Boston University)*; Moshik Hershcovitch (IBM Research); Daniel G Waddington (IBM Research); Sarel Cohen (Hasso
Plattner Institute); Meredith G Wolf (Williams College); Hongjun Suh (Seoul National University); Weiyu Zong (Boston University); Sang
Chin (Boston University)
Dimensionality reduction algorithms are standard tools in a researcher's toolbox. They are frequently used to augment downstream tasks such as machine learning and data science, and also serve as exploratory methods for understanding complex phenomena. For instance, dimensionality reduction is commonly used in biology as well as neuroscience to understand data collected from biological subjects. However, dimensionality reduction techniques are limited by the von Neumann architectures they execute on. Specifically, data-intensive algorithms such as dimensionality reduction techniques often require fast, high-capacity, persistent memory, which hardware has historically been unable to provide at the same time. In this paper, we
present a re-implementation of an existing dimensionality reduction technique called Geometric Multi-Scale Resolution Analysis (GMRA) which has been
accelerated via novel persistent memory technology called Memory Centric Active Storage (MCAS). Our implementation uses a specialized version of MCAS
called PyMM that provides native support for Python datatypes including NumPy arrays and PyTorch tensors. We compare our PyMM implementation against a
DRAM implementation, and show that when data fits in DRAM, PyMM offers competitive runtimes. When data does not fit in DRAM, our PyMM implementation
is still able to process the data.
Large Scale String Analytics In Arkouda
ZHIHUI DU (New Jersey Institute of Technology)*; Oliver A Alvarado Rodriguez (New Jersey Institute of Technology); David Bader (New
Jersey Institute of Technology)
Large-scale data sets from the web, social networks, and bioinformatics are widely available and can often be represented by strings, and suffix arrays are highly efficient data structures enabling string analysis. However, our personal devices and corresponding exploratory data analysis (EDA) tools cannot handle big data sets beyond the local memory. Arkouda is a framework under early development that brings together the productivity of Python at the user side with the high performance of Chapel at the server side. In this paper, an efficient suffix array data structure design and integration method are given first. A suffix array
algorithm library integration method instead of one single suffix algorithm is presented to enable runtime performance optimization in Arkouda since different
suffix array algorithms may have very different practical performances for strings in various applications. A parallel suffix array construction algorithm framework
is given to further exploit hierarchical parallelism on multiple locales in Chapel. A corresponding benchmark is developed to evaluate the feasibility of the
provided suffix array integration method and measure the end-to-end performance. Experimental results show that the proposed solution can provide data
scientists an easy and efficient method to build suffix arrays with high performance in Python. All our codes are open source and available from GitHub
(https://github.com/Bader-Research/arkouda/tree/string-suffix-array-functionality).
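For reference, a suffix array is simply the starting indices of a string's suffixes in sorted order. A naive sketch makes the data structure concrete; the library algorithms Arkouda integrates produce the same output in O(n) or O(n log n) rather than this O(n^2 log n) toy:

```python
def suffix_array(s):
    """Naive suffix array: sort all suffixes, return their start indices."""
    return sorted(range(len(s)), key=lambda i: s[i:])

# Suffixes of "banana" in sorted order:
# a (5), ana (3), anana (1), banana (0), na (4), nana (2)
sa = suffix_array("banana")
```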
Streaming Detection and Classification Performance of a POWER9 Edge Supercomputer
Wesley Brewer (HPCMP PET)*; Chris Geyer (Parsons Corporation); Dardo Kleiner (CIPS, Corp.); Connor Horne (Naval Research
Laboratory)
We present training and inference benchmarks on a POWER9 edge supercomputer called SCOUT using two production-level artificial intelligence systems: the
Video Processing Exploitation Framework (VPEF) and the Ordnance Threat Target Automated Recognition (OTTAR) system. For training benchmarks, we use
Horovod to train ResNet-50 on synthetic data for a baseline benchmark, and then train Faster R-CNN on an available WASABI dataset using SCOUT's six-way
V100 GPU nodes. For inference benchmarks, we use GStreamer to stream both synthetic and real Motion Imagery (MI) and then use a MobileNetV2-based Single-Shot Detector with real and synthetic MI at four different resolutions (720p, 1080p, 1280p, and 2160p) while measuring average inference time. We also test reduced-precision inference performance using TensorRT and distributed inference performance using the Inference Benchmark "iBench" framework. We compare our results with equivalent work on x86_64 systems and provide suggestions on tuning for optimal performance. We find that V100s
work well for training and offline inferencing in batches, while T4 GPUs outperform V100 GPUs only in very specific usage scenarios, such as streaming
detection at reduced precision.
Scaling of Evolutionary Search of Algorithm Space to Speed-Up Scientific Image Understanding Workflows
Nicholas J Grabill (University of Michigan); Kai Pinckard (Reed College); Dirk Colbry (Michigan State University)*
Scientific image understanding is an integral part of many research workflows. Studies in everything from self-driving cars to the makeup of cells rely on image
understanding through computer vision techniques. Unfortunately, almost every new scientific question requires a new algorithm to be developed by
researchers. Exploring the possible algorithms (and their parameters) to find one that fits a particular scientific problem requires time and a broad understanding
of the many options available. For this reason, many scientists resort to making measurements "by hand" instead of making the considerable investment
required to develop a tool that can automate their specialized workflow. This paper explores the scaling of a tool (SEE-Segment) that searches the "algorithm
space" of possible image segmentation algorithms and their parameters for solutions to unique scientific problems. The goal is to have the researchers do this
manual annotation of their images ("measure by hand") using a Graphical User Interface front end while the SEE-Segment program searches through the
algorithms and their parameters on the back end. In the best case, the SEE-Segment tool finds a good candidate algorithm that can be used on the remaining images in the researcher's dataset. In the worst case, the researchers annotate their images as they would without the tool. Unfortunately, the search space for
different segmentation algorithms and their parameters is quite large and non-linear. This research leverages the pleasantly parallel nature of Genetic Algorithms
and explores the use of large-scale computing to speed up the search both on a local HPC and on the cloud using Kubernetes.
2-S1: GraphBLAS BoF Special (17:30-19:30)
Organizer(s): Tim Mattson & Scott McMillan
Invited Talk: The GraphBLAS C++ Interface
Ben Brock (UC Berkeley)
Invited Talk: Binary Storage Formats
Erik Welsh (Anaconda)
Lightning Talks: Updates/news from the GraphBLAS implementors
2-S2: Remote Sensing for Disaster Relief Special (17:30-19:30)
Organizer(s): John Aldridge & Dan Dumanis & Andrew Weinert
Invited Talk: TBD
TBD
Invited Talk: TBD
Matthew Lengel
Invited Talk: TBD
Dr. Brian Vegetabile (RAND)
Invited Talk: Augmented Reality for Rapid Disaster Response
Capt Victor “Salsa” Lopez (USAF AI Accelerator)
Invited Talk: Multimodal Vision for Synthetic Aperture Radar
TSgt Armando Cabrera (USAF AI Accelerator)
Supercomputing Enabled Deployable Analytics for Disaster Response
Jeremy Kepner (MIT Lincoln Laboratory); Kaira M Samuel (MIT)*
First responders and other forward deployed essential workers can benefit from advanced analytics. Limited network access and software security requirements
prevent the usage of standard cloud based microservice analytic platforms that are typically used in industry. One solution is to precompute a wide range of
analytics as files that can be used with standard preinstalled software that does not require network access or additional software and can run on a wide range
of legacy hardware. In response to the COVID-19 pandemic, this approach was tested for providing geo-spatial census data to allow quick analysis of
demographic data for better responding to emergencies. These data were processed using the MIT SuperCloud to create several thousand Google Earth and
Microsoft Excel files representative of many advanced analytics. The fast mapping of census data using Google Earth and Microsoft Excel has the potential to
give emergency responders a powerful tool to improve emergency preparedness. Our approach displays relevant census data (total population, population
under 15, population over 65, median age) per census block, sorted by county, through a Microsoft Excel spreadsheet (xlsx file) and Google Earth map (kml file).
The spreadsheet interface includes features that allow users to convert between different longitude and latitude coordinate units. For the Google Earth files, a
variety of absolute and relative color maps of population density have been explored to provide an intuitive and meaningful interface. Using several hundred
cores on the MIT SuperCloud, new analytics can be generated in a few minutes.

Invited Talk: What Do We Want to Know About the Internet?
Dr. David Clark (MIT CSAIL)

Invited Talk: Bot-Match: Social Bot Detection with Semi-Supervised Recursive Nearest Neighbors Search
LTC Dr. David Beskow (Army Cyber)

2021 Abstract Book