2022
IEEE High Performance Extreme Computing
Virtual Conference
19 - 23 September 2022
Tuesday, September 20
2-V: Keynote Session (10:30-11:00)
Co-Chairs: Jeremy Kepner & Albert Reuther
Reflections on a Career in Computer Science
Prof. Barbara Liskov (MIT CSAIL)
2-1: Graph Analytics & Network Science 1 Session (11:00-12:15)
Co-Chairs: John Gilbert & Chris Long
Invited Talk: The NSF Computing and Information Science and Engineering Landscape: A Look Forward
Dr. Almadena Chtchelkanova (NSF)
GraphBLAS on the Edge: Anonymized High Performance Streaming of Network Traffic [Outstanding Paper Award]
Michael S Jones; Jeremy Kepner (MIT LLSC); Daniel Andersen (CAIDA); Aydın Buluç (LBNL); Chansup Byun (MIT LLSC); K Claffy
(CAIDA); Timothy Davis (Texas A&M); William Arcand (MIT LLSC); Jonathan Bernays (MIT Lincoln Laboratory); David Bestor; William
Bergeron; Vijay Gadepally; Micheal Houle; Matthew Hubbell; Hayden Jananthan; Anna Klein (MIT LLSC); Chad Meiners (MIT Lincoln
Laboratory); Lauren Milechin (MIT); Julie Mullen (MIT LLSC); Sandeep Pisharody (MIT Lincoln Laboratory); Andrew Prout; Albert
Reuther; Antonio Rosa; Siddharth Samsi (MIT LLSC); Jon Sreekanth (Accolade Technology); Doug Stetson (MIT Lincoln Laboratory);
Charles Yee; Peter Michaleas (MIT LLSC)
Long-range detection is a cornerstone of defense in many operating domains (land, sea, undersea, air, space, ...). In the cyber domain, long-range detection requires the analysis of significant network traffic from a variety of observatories and outposts.
Construction of anonymized hypersparse traffic matrices on edge network devices can be a key enabler by providing significant data
compression in a rapidly analyzable format that protects privacy. GraphBLAS is ideally suited for both constructing and analyzing
anonymized hypersparse traffic matrices. The performance of GraphBLAS on an Accolade Technologies edge network device is
demonstrated on a near worst-case traffic scenario using a continuous stream of CAIDA Telescope darknet packets. The performance
for varying numbers of traffic buffers, threads, and processor cores is explored. Anonymized hypersparse traffic matrices can be
constructed at a rate of over 50,000,000 packets per second, exceeding a typical 400 Gigabit network link. This performance
demonstrates that anonymized hypersparse traffic matrices are readily computable on edge network devices with minimal compute
resources and can be a viable data product for such devices.
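A minimal Python sketch of the core idea (not the paper's GraphBLAS/Accolade pipeline; the keyed-hash anonymization and 2^32 x 2^32 dimension below are illustrative assumptions): each packet is reduced to anonymized (source, destination) indices and accumulated into a hypersparse count matrix.

    # Illustrative sketch: anonymized hypersparse traffic matrix from packet headers.
    # The salt, digest size, and matrix dimension are assumptions, not the paper's values.
    import hashlib
    import numpy as np
    from scipy.sparse import coo_matrix

    N = 2**32                    # one row/column per possible anonymized address
    SALT = b"per-site-secret"    # keyed hashing so raw addresses are never stored

    def anonymize(ip):
        # Stable pseudonymous index in [0, 2^32) derived from a keyed 4-byte digest.
        return int.from_bytes(hashlib.blake2b(SALT + ip.encode(), digest_size=4).digest(), "big")

    def traffic_matrix(packets):
        # packets: iterable of (source IP, destination IP) strings for one time window
        rows, cols = [], []
        for src, dst in packets:
            rows.append(anonymize(src))
            cols.append(anonymize(dst))
        A = coo_matrix((np.ones(len(rows), dtype=np.uint32), (rows, cols)), shape=(N, N))
        A.sum_duplicates()       # repeated (src, dst) pairs become per-link packet counts
        return A                 # hypersparse: only observed links are stored

    A = traffic_matrix([("10.0.0.1", "10.0.0.9"), ("10.0.0.1", "10.0.0.9"), ("10.0.0.2", "10.0.0.9")])
    print(A.nnz, A.sum())        # 2 unique links carrying 3 packets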
Analyzing Multi-trillion Edge Graphs on Large GPU Clusters: A Case Study with PageRank [Outstanding Paper Award]
Seunghwa Kang; Joseph Nke; Brad Rees (NVIDIA)
We previously reported PageRank performance results on a cluster with 32 A100 GPUs. This paper extends the previous work to
2048 GPUs. The previous implementation performs well as long as the number of GPUs is small relative to the square of the average
vertex degree, but its scalability deteriorates as the number of GPUs further increases. We updated our previous implementation with
the following objectives: 1) enable analyzing a P times larger graph with P times more GPUs up to P = 2048, 2) achieve reasonably
good weak scaling, and 3) integrate the improvements into the open-source data science ecosystem (i.e., RAPIDS cuGraph). While we
evaluate the updates with PageRank in this paper, they improve the scalability of a broader set of algorithms in cuGraph.
To be more specific, we updated our 2D edge partitioning scheme; implemented the PDCSC (partially doubly compressed sparse
column) format, which is a hybrid data structure that combines CSC (compressed sparse column) and DCSC (doubly compressed
sparse column); adopted (key, value) pairs to store edge source vertex property values; and improved the reduction communication
strategy. The 32 GPU cluster has A100 GPUs (40 GB HBM per GPU) connected with NVLink. We ran the updated implementation on
the Selene supercomputer which uses InfiniBand for inter-node communication and NVLink for intra-node communication. Each
Selene node has eight A100 GPUs (80 GB HBM per GPU). Analyzing the web crawl graph (3.563 billion vertices and 128.7 billion
edges, 32 bit vertex ID, unweighted, average vertex degree: 36.12) took 0.187 seconds per PageRank iteration on the 32 GPU cluster.
Computing PageRank scores of a scale 38 R-mat graph (274.9 billion vertices and 4.398 trillion edges, 64 bit vertex ID, 32 bit edge
weight, average vertex degree: 16) took 1.54 seconds per PageRank iteration on the Selene supercomputer with 2048 GPUs. We conclude the paper by discussing potential network system enhancements to improve scaling.
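The per-iteration kernel being timed is an ordinary PageRank power iteration; a single-node Python/SciPy sketch of that kernel (not the multi-GPU cuGraph implementation, whose 2D partitioning and communication strategy are the paper's contribution; alpha and tol are illustrative):

    # Single-node sketch of the PageRank power iteration; the multi-GPU version
    # partitions A in 2D across GPUs, but the per-iteration math is the same.
    import numpy as np
    import scipy.sparse as sp

    def pagerank(A, alpha=0.85, tol=1e-6, max_iter=100):
        # A[i, j] = 1 if there is an edge i -> j (CSR). Returns the PageRank vector.
        n = A.shape[0]
        out_deg = np.asarray(A.sum(axis=1)).ravel()
        inv_deg = np.divide(1.0, out_deg, out=np.zeros(n), where=out_deg > 0)
        P = sp.diags(inv_deg) @ A                 # row-normalized transition matrix
        r = np.full(n, 1.0 / n)
        for _ in range(max_iter):
            dangling = alpha * r[out_deg == 0].sum() / n
            r_new = alpha * (P.T @ r) + (1.0 - alpha) / n + dangling
            if np.abs(r_new - r).sum() < tol:     # L1 convergence check
                return r_new
            r = r_new
        return r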
Achieving Speedups for Distributed Graph Biconnectivity
Ian Bogle; George M. Slota (RPI)
As data scales continue to increase, studying the porting and implementation of shared memory parallel algorithms for distributed
memory architectures becomes increasingly important. For this study, we consider the problem of biconnectivity, which identifies cut vertices and cut edges in a graph. As part of our study, we implemented and optimized the shared-memory biconnectivity
algorithm of Slota and Madduri within a distributed memory context. This algorithm is neither work nor time efficient. However, when
we compare to distributed implementations of theoretically efficient algorithms, we find that simple non-optimal algorithms can greatly
outperform time-efficient algorithms in practice when implemented for real
distributed-memory environments and real data. Overall, our distributed implementation for computing graph biconnectivity
demonstrates an average strong scaling speedup of 15× across 64 MPI ranks on a suite of irregular real-world inputs. We also note
an average of 11× and 7.3× speedup relative to the optimal serial algorithm and fastest shared-memory implementation for the
biconnectivity problem, respectively.
Generating Permutations Using Hash Tables
Oded Green; Corey Nolet; Joe Eaton (NVIDIA)
Given a set of N distinct values, the operation of shuffling those elements and creating a random order is used in a wide range of applications, including (but not limited to) statistical analysis, machine learning, games, and bootstrapping. The operation of
shuffling elements is equivalent to generating a random permutation and applying the permutation. For example, the random
permutation of an input allows splitting it into two or more subsets without bias. This operation is repeated in machine learning
applications when both a train and test data set are needed. In this paper we describe a new method for creating random
permutations that is scalable, efficient, and simple. We show that the operation of generating a random permutation shares traits with
building a hash table. Our method uses a fairly new hash table, called HashGraph, to generate the permutation. HashGraph's unique
data-structure ensures easy generation and retrieval of the permutation. HashGraph is one of the fastest known hash-tables for the
GPU and also outperforms many leading CPU hash-tables. We show the performance of our new permutation generation scheme
using both Python and CUDA versions of HashGraph. Our CUDA implementation is roughly 10% faster than our Python implementation. In contrast to the shuffle operation in NVIDIA's Thrust and CuPy frameworks, our new permutation generation algorithm is 2.6× and 1.73× faster, respectively, and up to 150× faster than NumPy.
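A small Python sketch of the underlying equivalence (not the HashGraph method itself): a uniform permutation can be generated by sorting random keys, a GPU-friendly alternative to a sequential Fisher-Yates shuffle, and then applied to shuffle or split data.

    # Sketch: shuffling == generating a random permutation and applying it.
    import numpy as np

    def random_permutation(n, rng=None):
        rng = rng or np.random.default_rng()
        keys = rng.random(n)          # one random key per element
        return np.argsort(keys)       # sorting the keys yields a uniform permutation

    data = np.arange(100)
    perm = random_permutation(len(data))
    shuffled = data[perm]             # applying the permutation == shuffling
    train, test = shuffled[:80], shuffled[80:]   # unbiased train/test split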
2-P: Poster Session 2 (12:15-14:15)
Chairs/Hosts: Siddharth Samsi & Yehia Arafa
ProtoX: A First Look
Het Mankad; Sanil Rao (Carnegie Mellon Univ.); Phillip Colella; Brian Van Straalen (Lawrence Berkeley National Laboratory); Franz
Franchetti (Carnegie Mellon Univ.)
We present a first look at ProtoX, a code generation framework for the stencil operations that occur in the numerical solution of partial
differential equations. ProtoX is derived from Proto, a C++-based domain-specific library that optimizes the algorithms used to compute the numerical solution of partial differential equations, and SPIRAL, a code generation system that focuses on generating
highly optimized target code. We demonstrate the construction of ProtoX by considering the 2D Poisson equation as a model problem
and using the Jacobi method to solve it. Some of the code generated for this problem specification is shown along with initial speedup
results.
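A plain NumPy sketch of the model problem (not ProtoX- or SPIRAL-generated code): Jacobi iterations for the 2D Poisson equation with a zero Dirichlet boundary.

    # Jacobi iteration for -laplacian(u) = f on an n x n interior grid with spacing h,
    # u = 0 on the boundary; written as the stencil update ProtoX targets.
    import numpy as np

    def jacobi_poisson_2d(f, h, num_iters=1000):
        n = f.shape[0]
        u = np.zeros((n + 2, n + 2))          # padded with the zero boundary
        for _ in range(num_iters):
            u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                    u[1:-1, :-2] + u[1:-1, 2:] + h * h * f)
        return u[1:-1, 1:-1]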
Magic Memory: A Programming Model For Big Data Analytics
Eric Tang; Franz Franchetti (Carnegie Mellon Univ.)
Big data analysis is a difficult task because the analysis is often memory intensive. Current solutions involve iteratively streaming data
chunks into the compute engine and recomputing the analytical algorithm. To simplify this process, this work proposes a new
programming model called Magic Memory. In this model, persistent invariants or functional dependencies can be established between
memory regions. These functional dependencies always hold when the memory regions are read. Recent
technological advancements enable Magic Memory at the hardware level, providing performance that is hard to achieve with a
software-only solution. Our ongoing work seeks to explore an implementation of Magic Memory on a CPU-FPGA system, where the
CPU runs the host code while the FPGA provides hardware acceleration. The CPU allocates memory on the FPGA and declares an
invariant for the FPGA to uphold on this region of memory. We demonstrate how an application such as PageRank can use Magic Memory to automatically recalculate its output as the input graph is modified.
Approximating Manifolds and Geodesics with Curved Surfaces
Peter Oostema; Franz Franchetti (Carnegie Mellon Univ.)
Approximating manifolds with spheres gives a representation that can effectively be used to find geodesics. This method takes a point
cloud of data and finds a
manifold representation and geodesic paths between any two points. This allows for graph embedding in non-Euclidean spaces.
Graph embedding gives a spatial representation of a graph, providing a new view of the data structure that is useful for applications in AI.
Network Automation in Lab Deployment Using Ansible and Python
V Andal Priyadharshini; Anumalasetty Yashwanth Nath (SRM Institute of Science and Technology)
Network automation has evolved into a solution that ensures efficiency in all areas. The age-old technique for configuring common software-defined networking protocols is inefficient, as it requires a box-by-box approach that must be repeated often and is prone to manual errors. Network automation assists network administrators in automating and verifying protocol configuration to ensure consistent configurations. This paper implements network automation using Python and Ansible to configure different protocols and
configurations in the container lab virtual environment. Ansible can help network administrators minimize human mistakes, reduce
time consumption, and enable device visibility across the network environment.
Optimizations to Increase JDBC Driver Performance in Spark
Deeptaanshu Kumar (Carnegie Mellon Univ.); Suxi Li (Univ. of Miami)
At “The Arena Group”, we are building an Enterprise Data Lake on AWS using the Databricks Lakehouse Platform. Since we are
ingesting data into our Data Lake using Spark clusters that connect to source databases with JDBC drivers, we have been doing
performance tests with different Spark JDBC driver parameters and configurations to see which set of parameters/configurations gives
us the best read/write performance in terms of execution time. As a result, we are sharing our performance test execution times in this
Extended Abstract, so that our tests can serve as a case study that other engineering teams can use to further optimize the JDBC
driver configurations in their own Spark clusters.
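An illustrative PySpark read using the standard Spark JDBC options such tests typically sweep (the connection details and option values below are placeholders, not the configurations benchmarked here):

    # Placeholder PySpark JDBC read; the partitioning and fetch options are the usual
    # knobs tuned for ingest performance. Values are illustrative only.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-ingest").getOrCreate()

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://source-db:5432/appdb")   # placeholder source
          .option("dbtable", "public.events")
          .option("user", "etl_user")
          .option("password", "<secret>")
          .option("partitionColumn", "event_id")   # parallel reads over this column range
          .option("lowerBound", "1")
          .option("upperBound", "100000000")
          .option("numPartitions", "32")
          .option("fetchsize", "10000")            # rows fetched per round trip
          .load())

    df.write.mode("append").parquet("s3://lake/bronze/events")       # placeholder sink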
Triangle Centrality in Arkouda
Joseph T Patchett; Zhihui Du; Fuhuan Li; David Bader (New Jersey Inst. of Tech.)
"There are a wide number of graph centrality metrics. Further, the performance of each can vary widely depending on the type of
implementation. We develop triangle centrality in Arkouda with several different triangle counting methods. Triangle Centrality is a
robust metric that captures the centrality of a vertex through both a vertex’s own connectedness and that of its neighbors. Arkouda is a
distributed system framework for data science at the scale of terabytes and beyond. These methods are compared against each other
and another shared memory implementation."
Image Recognition Using Machine Learning For Forbidden Items Detection In Airports
Alaa Atef; Abde-ljalil Naser; Mahmoud Mohamed; Mariam Safwat; Menna Tulla Ayman; Mohamed Mostafa; Salma Hesham (Ain
Shams University); Khaled Salah (Siemens)
One of the problems faced at airports and other secure locations is determining whether a traveler's luggage contains sharp weapons or guns that threaten others' lives. Here, we work on the idea of searching luggage by examining images taken in X-ray format to determine whether it contains any forbidden items. These forbidden items are: guns, knives, wrenches, pliers, and scissors. We use the concepts of machine learning and neural networks to deploy our model. The response time is 0.13 sec for the Mobile App and 0.54 sec for the Desktop App; the accuracy is 79% for the Mobile App and 91% for the Desktop App.
2-2: Graph Analytics & Network Science 2 Session (12:30-13:45)
Co-Chairs: John Gilbert & Chris Long
Hypersparse Network Flow Analysis of Packets with GraphBLAS
Tyler Trigg; Chad Meiners; Sandeep Pisharody (MIT Lincoln Laboratory); Hayden Jananthan; Michael Jones (MIT LLSC); Adam
Michaleas (MIT Lincoln Laboratory); Timothy Davis (Texas A&M Univ.); Erik Welch (NVIDIA); William Arcand; David Bestor; William
Bergeron; Chansup Byun; Vijay Gadepally; Micheal Houle; Matthew Hubbell; Anna Klein; Peter Michaleas (MIT LLSC); Lauren
Milechin (MIT); Julie Mullen; Andrew Prout; Albert Reuther; Antonio Rosa; Siddharth Samsi (MIT LLSC); Doug Stetson (MIT Lincoln
Laboratory); Charles Yee; Jeremy Kepner (MIT LLSC)
Internet analysis is a major challenge due to the volume and rate of network traffic. In lieu of analyzing traffic as raw packets, network
analysts often rely on compressed network flows (netflows) that contain the start time, stop time, source, destination, and number of
packets in each direction. However, many traffic analyses benefit from temporal aggregation of multiple simultaneous netflows, which
can be computationally challenging. To alleviate this concern, a novel netflow compression and resampling method has been
developed leveraging GraphBLAS hypersparse traffic matrices that preserve anonymization while enabling subrange analysis.
Standard multitemporal spatial analyses are then performed on each subrange
to generate detailed statistical aggregates of the source packets, source fan-out, unique links, destination fan-in, and destination
packets of each subrange, which can then be used for background modeling and anomaly detection. A simple file format based on
GraphBLAS sparse matrices is developed for storing these statistical aggregates. This method is scale tested on the MIT SuperCloud
using a 50 trillion packet netflow corpus from several hundred sites collected over several months. The resulting compression
achieved is significant (<0.1 bit per packet), enabling
extremely large netflow analyses to be stored and transported. The single node parallel performance is analyzed in terms of both
processors and threads showing that a single node can perform hundreds of simultaneous analyses at over a million packets/sec
(roughly equivalent to a 10 Gigabit link).
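Given a traffic matrix for one subrange, the aggregates listed above reduce to simple row and column reductions; a SciPy sketch (standing in for the paper's GraphBLAS hypersparse matrices and file format):

    # A[i, j] = packets from source i to destination j within one subrange.
    import numpy as np
    import scipy.sparse as sp

    def subrange_aggregates(A):
        A = sp.csr_matrix(A)
        links = A.astype(bool)                                          # unique (src, dst) links
        return {
            "source packets": np.asarray(A.sum(axis=1)).ravel(),        # packets sent per source
            "source fan-out": np.asarray(links.sum(axis=1)).ravel(),    # distinct destinations per source
            "unique links": links.nnz,
            "destination fan-in": np.asarray(links.sum(axis=0)).ravel(),
            "destination packets": np.asarray(A.sum(axis=0)).ravel(),
        }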
SHARP: Software Hint-Assisted Memory Access Prediction for Graph Analytics
Pengmiao Zhang (Univ. of Southern California); Rajgopal Kannan (US Army Research Lab-West); Xiangzhi Tong (Xi’an Jiaotong-
Liverpool Univ.); Anant V Nori (Intel Labs); Viktor K Prasanna (Univ. of Southern California)
Memory system performance is a major bottleneck in large-scale graph analytics. Data prefetching can hide memory latency; this
relies on accurate prediction of memory accesses. While recent machine learning approaches have performed well on memory
access prediction, they are restricted to building general models, ignoring the shifts in memory access patterns that follow changes of processing phase in software. We propose SHARP: a novel Software Hint-Assisted memoRy access Prediction approach for graph analytics under the Scatter-Gather paradigm on multi-core shared-memory platforms. We introduce software hints, inserted by the programmer, that explicitly indicate the processing phase of a graph processing program, i.e., Scatter or Gather. Assisted by
the software hints, we develop phase-specific prediction models that use attention-based neural networks, trained by memory traces
with rich context information. We use three widely-used graph algorithms and a variety of datasets for evaluation. With respect to F1-
score, SHARP outperforms the widely used Delta-LSTM model by 16.45%–18.93% for the Scatter phase and 9.50%–22.25% for the Gather phase, and outperforms the state-of-the-art TransFetch model by 3.66%–7.48% for the Scatter phase and 2.69%–7.59% for the Gather phase.
Fast Graph Algorithms for Superpixel Segmentation
Dimitris Floros (Aristotle Univ. of Thessaloniki); Tiancheng Liu (Duke Univ.); Nikos P Pitsianis (Aristotle Univ. of Thessaloniki, Duke
Univ.); Xiaobai Sun (Duke Univ.)
We introduce the novel graph-based algorithm SLAM (simultaneous local assortative mixing) for fast and high-quality superpixel
segmentation of any large color image. Superpixels are compact semantic image elements; superpixel segmentation is fundamental
to a broad range of vision tasks in existing and emerging applications, especially, to safety-critical and time-critical applications. SLAM
leverages a graph representation of the image, which encodes the pixel features and similarities, for its rich potential in implicit feature
transformation and extra means for feature differentiation and association at multiple resolution scales. We demonstrate, with our
experimental results on 500 benchmark images, that SLAM outperforms the state-of-the-art algorithms in superpixel quality, by multiple
measures, within the same time frame. The contributions are at least twofold: SLAM breaks down the long-standing speed barriers in
graph-based algorithms for superpixel segmentation; it lifts the fundamental limitations in the feature-point-based algorithms.
Explicit Ordering Refinement for Accelerating Irregular Graph Analysis
Michael Mandulak; Ruochen Hu; George M Slota (RPI)
Vertex reordering for efficient memory access in extreme-scale graph-based data analysis shows considerable improvement to the
cache efficiency and runtimes of widely used graph analysis algorithms. Despite this, modern efficient ordering methods are often
heuristic-based and do not directly optimize a given metric. Thus, this paper conducts an experimental study of explicit metric-based vertex ordering optimization. We introduce a universal, graph-partitioning-inspired approach to the vertex ordering problem, focused on CPU shared-memory parallelism, that explicitly refines low-degree vertices using the Linear Gap Arrangement and Log Gap Arrangement problems as comprehensive metrics for ordering improvement. This degree-based refinement method is evaluated on a number of initial orderings, with timing and cache efficiency results relative to three shared-memory graph analytic
algorithms: PageRank, Louvain and the Multistep algorithm. Applying refinement, we observe runtime improvements of up to 15x on
the ClueWeb09 graph and up to 4x improvements to cache efficiency on a variety of network types and initial orderings,
demonstrating the feasibility of an optimization approach to the vertex ordering problem at a large scale.
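One possible reading of the gap-based objectives, assuming they score an ordering by the per-edge position gaps |pi(u) - pi(v)| summed linearly and in log scale (the paper's exact definitions may differ), is sketched below.

    # Hedged sketch of Linear/Log Gap Arrangement-style ordering costs.
    import numpy as np

    def gap_costs(edges, order):
        # edges: (m, 2) int array; order[v] = position of vertex v in the ordering.
        pos = np.asarray(order)
        gaps = np.abs(pos[edges[:, 0]] - pos[edges[:, 1]])   # |pi(u) - pi(v)| per edge
        return gaps.sum(), np.log2(gaps).sum()               # linear and log gap costs

    edges = np.array([[0, 1], [1, 2], [0, 3], [2, 3]])
    print(gap_costs(edges, order=[0, 1, 2, 3]))
    print(gap_costs(edges, order=[0, 2, 1, 3]))              # a worse ordering has larger gaps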
Efficient Calculation of Triangle Centrality in Big Data Networks
Wali Mohammad Abdullah; David Awosoga; Shahadat Hossain (Univ. of Lethbridge)
The notion of “centrality” within graph analytics has led to the creation of well-known metrics such as Google’s PageRank [1], which is
an extension of eigenvector centrality [2]. Triangle centrality is a related metric [3] that utilizes the presence of triangles, which play an
important role in network analysis, to quantitatively determine the relative “importance” of a node in a network. Efficiently counting and
enumerating these triangles is a major backbone of understanding network characteristics, and linear-algebraic methods have
utilized the correspondence between sparse adjacency matrices and graphs to perform such calculations, with sparse matrix-matrix
multiplication as the main computational kernel. In this paper, we use an intersection representation of graph data implemented as a
sparse matrix, and engineer an algorithm to compute the triangle centrality of each vertex within a graph. The main computational task
of calculating these sparse matrix-vector products is carefully crafted by employing compressed vectors as accumulators. As with
other state-of-the-art algorithms [4], our method avoids redundant work by counting and enumerating each triangle exactly once. We
present results from extensive computational experiments on large-scale real-world and synthetic graph instances that demonstrate
good scalability of our method. We also present a shared memory parallel implementation of our algorithm.
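The linear-algebraic kernel referred to above can be sketched as follows: per-vertex triangle counts from sparse matrix products (the full triangle centrality formula of [3], which combines vertex and neighborhood counts, is not reproduced here).

    # A: symmetric 0/1 adjacency matrix (CSR) with no self-loops.
    import numpy as np
    import scipy.sparse as sp

    def triangle_counts(A):
        # (A @ A)[i, j] counts common neighbors of i and j; masking with A keeps
        # only pairs that are themselves edges, i.e. closed triangles.
        closed = (A @ A).multiply(A)
        per_vertex = np.asarray(closed.sum(axis=1)).ravel() // 2   # each triangle at v counted twice
        total = per_vertex.sum() // 3                              # each triangle has three corners
        return per_vertex, total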
2-3: Data Intensive Computing Session (14:15-15:30)
Co-Chairs: Xiaobai Sun & Nikos Pitsianis
Enabling Novel In-Memory Computation Algorithms to Address Next-Generation Throughput Constraints on SWaP-Limited
Platforms
Jessica M Ray; Chad Meiners (MIT Lincoln Laboratory)
The Department of Defense relies heavily on filtering and selection applications to help manage the overwhelming amount of data
constantly received at the tactical edge. Filtering and selection are both latency and throughput constrained, and systems at the
tactical edge must heavily optimize their SWaP (size, weight, and power) usage, which can reduce overall computation and memory
performance. In-memory computation (IMC) provides a promising solution to the latency and throughput issues, as it helps enable the
efficient processing of data as it is received, helping eliminate the memory bottleneck imposed by traditional Von Neumann
architectures.
In this paper, we discuss a specific type of IMC accelerator known as a Content Addressable Memory (CAM), which effectively
operates as a hardware-based associative array, allowing fast lookup and match operations. In particular, we consider ternary CAMs
(TCAMs) and their use within string matching, which is an important component of many filtering and selection applications. Despite
the benefits gained with TCAMs, designing applications that utilize them remains a difficult task. Straightforward questions, such as
“how large should my TCAM be?” and “what is the expected throughput?” are difficult to answer due to the many factors that go into
effectively mapping data into a TCAM. This work aims to help answer these types of questions with a new framework called Stardust-
Chicken. Stardust-Chicken supports generating and simulating TCAMs, and implements state-of-the-art algorithms and data
representations that can effectively map data into TCAMs. With Stardust-Chicken, users can explore the tradeoff space that comes
with TCAMs and better understand how to utilize them in their applications.
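A tiny software model of what a TCAM lookup provides (not the Stardust-Chicken framework): each entry is a (value, mask) pair whose masked-off bits are "don't care", and the lowest-index matching entry wins, as in a priority TCAM.

    # Illustrative 8-bit ternary match; the entries and key values are arbitrary.
    TCAM = [
        (0b10100000, 0b11110000, "rule A: match 1010 xxxx"),
        (0b00000001, 0b00000011, "rule B: match xxxx xx01"),
    ]

    def tcam_lookup(key):
        for value, mask, label in TCAM:
            if (key & mask) == (value & mask):   # compare only the cared-about bits
                return label
        return "miss"

    print(tcam_lookup(0b10101111))   # rule A
    print(tcam_lookup(0b01100101))   # rule B
    print(tcam_lookup(0b01100110))   # miss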
Towards Fast Crash-Consistent Cluster Checkpointing
Andrew E Wood (Boston Univ.); Moshik Hershcovitch (IBM Research); Ilias Ennmouri (IBM); Weiyu Zong; Saurav Chennuri (Boston
Univ.); Sarel Cohen (The Academic College of Tel Aviv-Yaffo); Swaminathan Sundararaman (IBM); Daniel G Waddington (IBM
Research); Peter Chin (Dartmouth Univ.)
Machine learning models are expensive to train: they require costly high-compute hardware and have long training times. Therefore, models are extra sensitive to program faults or unexpected system crashes, which can erase hours, if not days, of work. While there are plenty of strategies designed to mitigate the risk of unexpected system downtime, the most popular strategy in machine learning is called checkpointing: periodically saving the state of the model to persistent storage. Checkpointing is an effective strategy; however, it requires carefully balancing two factors: how often a checkpoint is made (the checkpointing schedule) and the cost of creating a checkpoint itself. In this paper, we leverage the Python Memory Manager (PyMM), which provides Python support for Persistent Memory and the emerging Persistent Memory technology Optane DC, to accelerate the checkpointing operation while
maintaining crash consistency. We first show that when checkpointing models, PyMM with persistent memory can save from minutes
to days of checkpointing runtime. We then further optimize the checkpointing operation with PyMM and demonstrate our approach
with the KMeans and Gaussian Mixture Model algorithms on two real-world datasets: MNIST and MusicNet. Through evaluation, we
show that these two algorithms achieve checkpointing speedups of between 10× and 75× for KMeans and over 3× for GMM
against the current state-of-the-art checkpointing approaches. We also verify that our solution recovers from crashes, while traditional
approaches cannot.
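For context, the baseline pattern being accelerated is periodic serialization of model state to storage; a minimal crash-consistent sketch of that baseline (PyMM's persistent-memory API is not reproduced here):

    # Write the checkpoint to a temporary file, fsync it, then atomically rename,
    # so a crash can never expose a half-written checkpoint.
    import os
    import pickle
    import tempfile

    def save_checkpoint(state, path="checkpoint.pkl"):
        dir_name = os.path.dirname(os.path.abspath(path))
        fd, tmp_path = tempfile.mkstemp(dir=dir_name)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())       # ensure the bytes reach stable storage
        os.replace(tmp_path, path)     # atomic publish: readers see old or new, never torn

    def load_checkpoint(path="checkpoint.pkl"):
        with open(path, "rb") as f:
            return pickle.load(f)

    # e.g. inside a KMeans loop, every k iterations:
    # save_checkpoint({"iteration": i, "centroids": centroids})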
Automatic Generation of Matrix-Vector Code Using SPIRAL for the Power10 ISA
James Nguyen; Sanil Rao (Carnegie Mellon Univ.); Jose Moreira (IBM); Franz Franchetti (Carnegie Mellon Univ.)
We present SPIRAL-based automatic program generation utilizing Power10's novel Matrix Multiply Assist (MMA) instructions. These MMA instructions accelerate matrix-vector multiplication. SPIRAL generates linear transform programs that take advantage of these instructions, making it more efficient for developers to update their linear transform libraries.
Towards Hardware Accelerated Garbage Collection with Near-Memory Processing
Samuel Thomas; Jiwon Choe (Brown Univ.); Ofir Gordon; Erez Petrank (Technion Inst.); Tali Moreshet (Boston Univ.); Maurice Herlihy
(Brown Univ.); Ruth Iris Bahar (Colorado School of Mines)
Garbage collection is widely available in popular programming languages, yet it may incur high performance overheads in
applications. Prior works have proposed specialized hardware acceleration implementations to offload garbage collection overheads
from the main processor, but these solutions have yet to be implemented in practice. In this paper, we propose using off-the-shelf
hardware to accelerate off-the-shelf garbage collection algorithms. Furthermore, our work is latency oriented as opposed to other
works that focus on bandwidth. We demonstrate that we can get a 2× performance improvement in some workloads and a 2.3×
reduction in LLC traffic by integrating generic Near-Memory Processing (NMP) into the built-in Java garbage collector. We will discuss
architectural implications of these results and consider directions for future work.
RaiderSTREAM: Adapting the STREAM Benchmark to Modern HPC Systems
Michael Beebe; Brody Williams; Stephen Devaney (Texas Tech Univ.); John Leidel (Tactical Computing Laboratories); Yong Chen
(Texas Tech Univ.); Steve Poole (Los Alamos National Lab)
Sustaining high memory bandwidth utilization is a common challenge in maximizing the performance of scientific applications: particularly for increasingly critical data-intensive workloads, the dominant factor in runtime is the speed at which data can be loaded from memory into the CPU and results written back to memory. The prevalence of irregular memory access patterns within these applications, exemplified by kernels such as those found in sparse matrix and graph applications, significantly degrades the achievable performance of a system's memory hierarchy. As such, it is highly desirable to be able to accurately measure
a given memory hierarchy's sustainable memory bandwidth when designing applications as well as future high-performance
computing (HPC) systems. STREAM is a de facto standard benchmark for measuring sustained memory bandwidth and has garnered
widespread adoption. In this work, we discuss current limitations of the STREAM benchmark in the context of high-performance and
scientific computing. We then introduce a new version of STREAM, called RaiderSTREAM, built on the OpenSHMEM and MPI
programming models in tandem with OpenMP, that includes additional kernels which better model irregular memory access patterns in order to address these shortcomings.
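A NumPy sketch of the classic STREAM Triad alongside a gather-style irregular variant of the kind RaiderSTREAM adds (array size, index pattern, and the Python timing are only a stand-in for the benchmark's C/OpenSHMEM/MPI/OpenMP kernels).

    import time
    import numpy as np

    n = 10_000_000
    a = np.zeros(n)
    b = np.random.rand(n)
    c = np.random.rand(n)
    idx = np.random.randint(0, n, size=n)      # irregular access pattern
    scalar = 3.0

    def gbps(bytes_moved, seconds):
        return bytes_moved / seconds / 1e9

    t = time.perf_counter()
    a[:] = b + scalar * c                      # classic Triad: a[i] = b[i] + s*c[i]
    triad = gbps(3 * n * 8, time.perf_counter() - t)

    t = time.perf_counter()
    a[:] = b[idx] + scalar * c[idx]            # gather Triad: a[i] = b[idx[i]] + s*c[idx[i]]
    gather = gbps(4 * n * 8, time.perf_counter() - t)   # approximate: includes index reads

    print(f"Triad: {triad:.1f} GB/s, gather Triad: {gather:.1f} GB/s")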
2-4: Scaling HPC Education Session (15:45-17:00)
Co-Chairs: Julie Mullen, Lauren Milechin & Hayden Jananthan
Invited Talk: TBD
Dr. Dhruva Chakravorty (Texas A&M Univ.)
Invited Talk: TBD
Dennis Milechin (Boston Univ.)
Invited Talk: TBD
Dr. Eric Coulter (Georgia Tech)
2-S1: GraphBLAS BoF Special (17:30-19:30)
Organizers: Tim Mattson & Scott McMillan