2022
IEEE High Performance Extreme Computing
Virtual Conference
19 - 23 September 2022
Tuesday, September 20
2-V: Keynote Session (10:30-11:00)
Co-Chairs: Jeremy Kepner & Albert Reuther
Reflections on a Career in Computer Science
Prof. Barbara Liskov (MIT CSAIL)
2-1: Graph Analytics & Network Science 1 Session (11:00-12:15)
Co-Chairs: John Gilbert & Chris Long
Invited Talk: The NSF Computing and Information Science and Engineering Landscape: A Look Forward
Dr. Almadena Chtchelkanova (NSF)
GraphBLAS on the Edge: Anonymized High Performance Streaming of Network Traffic [Outstanding Paper Award]
Michael S Jones; Jeremy Kepner (MIT LLSC); Daniel Andersen (CAIDA); Aydın Buluç (LBNL); Chansup Byun (MIT LLSC); K Claffy
(CAIDA); Timothy Davis (Texas A&M); William Arcand (MIT LLSC); Jonathan Bernays (MIT Lincoln Laboratory); David Bestor; William
Bergeron; Vijay Gadepally; Micheal Houle; Matthew Hubbell; Hayden Jananthan; Anna Klein (MIT LLSC); Chad Meiners (MIT Lincoln
Laboratory); Lauren Milechin (MIT); Julie Mullen (MIT LLSC); Sandeep Pisharody (MIT Lincoln Laboratory); Andrew Prout; Albert
Reuther; Antonio Rosa; Siddharth Samsi (MIT LLSC); Jon Sreekanth (Accolade Technology); Doug Stetson (MIT Lincoln Laboratory);
Charles Yee; Peter Michaleas (MIT LLSC)
Long-range detection is a cornerstone of defense in many operating domains (land, sea, undersea, air, space, ...). In the cyber domain, long-range detection requires the analysis of significant network traffic from a variety of observatories and outposts.
Construction of anonymized hypersparse traffic matrices on edge network devices can be a key enabler by providing significant data
compression in a rapidly analyzable format that protects privacy. GraphBLAS is ideally suited for both constructing and analyzing
anonymized hypersparse traffic matrices. The performance of GraphBLAS on an Accolade Technologies edge network device is
demonstrated on a near worst-case traffic scenario using a continuous stream of CAIDA Telescope darknet packets. The performance
for varying numbers of traffic buffers, threads, and processor cores is explored. Anonymized hypersparse traffic matrices can be
constructed at a rate of over 50,000,000 packets per second, exceeding a typical 400 Gigabit network link. This performance
demonstrates that anonymized hypersparse traffic matrices are readily computable on edge network devices with minimal compute
resources and can be a viable data product for such devices.
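A minimal Python sketch of the core idea (not the paper's GraphBLAS/Accolade pipeline; the keyed-hash anonymization and 2^32 x 2^32 dimension below are illustrative assumptions): each packet is reduced to anonymized (source, destination) indices and accumulated into a hypersparse count matrix.

    # Illustrative sketch: anonymized hypersparse traffic matrix from packet headers.
    # The salt, digest size, and matrix dimension are assumptions, not the paper's values.
    import hashlib
    import numpy as np
    from scipy.sparse import coo_matrix

    N = 2**32                    # one row/column per possible anonymized address
    SALT = b"per-site-secret"    # keyed hashing so raw addresses are never stored

    def anonymize(ip):
        # Stable pseudonymous index in [0, 2^32) derived from a keyed 4-byte digest.
        return int.from_bytes(hashlib.blake2b(SALT + ip.encode(), digest_size=4).digest(), "big")

    def traffic_matrix(packets):
        # packets: iterable of (source IP, destination IP) strings for one time window
        rows, cols = [], []
        for src, dst in packets:
            rows.append(anonymize(src))
            cols.append(anonymize(dst))
        A = coo_matrix((np.ones(len(rows), dtype=np.uint32), (rows, cols)), shape=(N, N))
        A.sum_duplicates()       # repeated (src, dst) pairs become per-link packet counts
        return A                 # hypersparse: only observed links are stored

    A = traffic_matrix([("10.0.0.1", "10.0.0.9"), ("10.0.0.1", "10.0.0.9"), ("10.0.0.2", "10.0.0.9")])
    print(A.nnz, A.sum())        # 2 unique links carrying 3 packets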
Analyzing Multi-trillion Edge Graphs on Large GPU Clusters: A Case Study with PageRank [Outstanding Paper Award]
Seunghwa Kang; Joseph Nke; Brad Rees (NVIDIA)
We previously reported PageRank performance results on a cluster with 32 A100 GPUs. This paper extends the previous work to
2048 GPUs. The previous implementation performs well as long as the number of GPUs is small relative to the square of the average
vertex degree, but its scalability deteriorates as the number of GPUs further increases. We updated our previous implementation with
the following objectives: 1) enable analyzing a P times larger graph with P times more GPUs up to P = 2048, 2) achieve reasonably
good weak scaling, and 3) integrate the improvements into the open-source data science ecosystem (i.e., RAPIDS cuGraph). While we
evaluate the updates with PageRank in this paper, they improve the scalability of a broader set of algorithms in cuGraph.
To be more specific, we updated our 2D edge partitioning scheme; implemented the PDCSC (partially doubly compressed sparse
column) format, which is a hybrid data structure that combines CSC (compressed sparse column) and DCSC (doubly compressed
sparse column); adopted (key, value) pairs to store edge source vertex property values; and improved the reduction communication
strategy. The 32 GPU cluster has A100 GPUs (40 GB HBM per GPU) connected with NVLink. We ran the updated implementation on
the Selene supercomputer which uses InfiniBand for inter-node communication and NVLink for intra-node communication. Each
Selene node has eight A100 GPUs (80 GB HBM per GPU). Analyzing the web crawl graph (3.563 billion vertices and 128.7 billion
edges, 32 bit vertex ID, unweighted, average vertex degree: 36.12) took 0.187 seconds per PageRank iteration on the 32 GPU cluster.
Computing PageRank scores of a scale 38 R-mat graph (274.9 billion vertices and 4.398 trillion edges, 64 bit vertex ID, 32 bit edge
weight, average vertex degree: 16) took 1.54 seconds per PageRank iteration on the Selene supercomputer with 2048 GPUs. We conclude the paper by discussing potential network system enhancements to improve scaling.
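The per-iteration kernel being timed is an ordinary PageRank power iteration; a single-node Python/SciPy sketch of that kernel (not the multi-GPU cuGraph implementation, whose 2D partitioning and communication strategy are the paper's contribution; alpha and tol are illustrative):

    # Single-node sketch of the PageRank power iteration; the multi-GPU version
    # partitions A in 2D across GPUs, but the per-iteration math is the same.
    import numpy as np
    import scipy.sparse as sp

    def pagerank(A, alpha=0.85, tol=1e-6, max_iter=100):
        # A[i, j] = 1 if there is an edge i -> j (CSR). Returns the PageRank vector.
        n = A.shape[0]
        out_deg = np.asarray(A.sum(axis=1)).ravel()
        inv_deg = np.divide(1.0, out_deg, out=np.zeros(n), where=out_deg > 0)
        P = sp.diags(inv_deg) @ A                 # row-normalized transition matrix
        r = np.full(n, 1.0 / n)
        for _ in range(max_iter):
            dangling = alpha * r[out_deg == 0].sum() / n
            r_new = alpha * (P.T @ r) + (1.0 - alpha) / n + dangling
            if np.abs(r_new - r).sum() < tol:     # L1 convergence check
                return r_new
            r = r_new
        return r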
Achieving Speedups for Distributed Graph Biconnectivity
Ian Bogle; George M. Slota (RPI)
As data scales continue to increase, studying the porting and implementation of shared memory parallel algorithms for distributed
memory architectures becomes increasingly important. For this study, we consider the problem of biconnectivity, which identifies cut vertices and cut edges in a graph. As part of our study, we implemented and optimized the shared-memory biconnectivity
algorithm of Slota and Madduri within a distributed memory context. This algorithm is neither work nor time efficient. However, when
we compare to distributed implementations of theoretically efficient algorithms, we find that simple non-optimal algorithms can greatly
outperform time-efficient algorithms in practice when implemented for real
distributed-memory environments and real data. Overall, our distributed implementation for computing graph biconnectivity
demonstrates an average strong scaling speedup of 15× across 64 MPI ranks on a suite of irregular real-world inputs. We also note
an average of 11× and 7.3× speedup relative to the optimal serial algorithm and fastest shared-memory implementation for the
biconnectivity problem, respectively.
Generating Permutations Using Hash Tables
Oded Green; Corey Nolet; Joe Eaton (NVIDIA)
Given a set of N distinct values, the operation of shuffling those elements and creating a random order is used in a wide range of applications, including (but not limited to) statistical analysis, machine learning, games, and bootstrapping. The operation of
shuffling elements is equivalent to generating a random permutation and applying the permutation. For example, the random
permutation of an input allows splitting it into two or more subsets without bias. This operation is repeated in machine learning
applications when both a train and test data set are needed. In this paper we describe a new method for creating random
permutations that is scalable, efficient, and simple. We show that the operation of generating a random permutation shares traits with
building a hash table. Our method uses a fairly new hash table, called HashGraph, to generate the permutation. HashGraph's unique
data-structure ensures easy generation and retrieval of the permutation. HashGraph is one of the fastest known hash-tables for the
GPU and also outperforms many leading CPU hash-tables. We show the performance of our new permutation generation scheme
using both Python and CUDA versions of HashGraph. Our CUDA implementation is roughly 10% faster than our Python implementation. In contrast to the shuffle operation in NVIDIA's Thrust and CuPy frameworks, our new permutation generation algorithm is 2.6× and 1.73× faster, respectively, and up to 150× faster than NumPy.
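A small Python sketch of the underlying equivalence (not the HashGraph method itself): a uniform permutation can be generated by sorting random keys, a GPU-friendly alternative to a sequential Fisher-Yates shuffle, and then applied to shuffle or split data.

    # Sketch: shuffling == generating a random permutation and applying it.
    import numpy as np

    def random_permutation(n, rng=None):
        rng = rng or np.random.default_rng()
        keys = rng.random(n)          # one random key per element
        return np.argsort(keys)       # sorting the keys yields a uniform permutation

    data = np.arange(100)
    perm = random_permutation(len(data))
    shuffled = data[perm]             # applying the permutation == shuffling
    train, test = shuffled[:80], shuffled[80:]   # unbiased train/test split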
2-P: Poster Session 2 (12:15-14:15)
Chairs/Hosts: Siddharth Samsi & Yehia Arafa
ProtoX: A First Look
Het Mankad; Sanil Rao (Carnegie Mellon Univ.); Phillip Colella; Brian Van Straalen (Lawrence Berkeley National Laboratory); Franz
Franchetti (Carnegie Mellon Univ.)
We present a first look at ProtoX, a code generation framework for the stencil operations that occur in the numerical solution of partial
differential equations. ProtoX is derived from Proto, a C++-based domain-specific library that optimizes the algorithms used to compute the numerical solution of partial differential equations, and SPIRAL, a code generation system that focuses on generating
highly optimized target code. We demonstrate the construction of ProtoX by considering the 2D Poisson equation as a model problem
and using the Jacobi method to solve it. Some of the code generated for this problem specification is shown along with initial speedup
results.
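A plain NumPy sketch of the model problem (not ProtoX- or SPIRAL-generated code): Jacobi iterations for the 2D Poisson equation with a zero Dirichlet boundary.

    # Jacobi iteration for -laplacian(u) = f on an n x n interior grid with spacing h,
    # u = 0 on the boundary; written as the stencil update ProtoX targets.
    import numpy as np

    def jacobi_poisson_2d(f, h, num_iters=1000):
        n = f.shape[0]
        u = np.zeros((n + 2, n + 2))          # padded with the zero boundary
        for _ in range(num_iters):
            u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                                    u[1:-1, :-2] + u[1:-1, 2:] + h * h * f)
        return u[1:-1, 1:-1]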
Magic Memory: A Programming Model For Big Data Analytics
Eric Tang; Franz Franchetti (Carnegie Mellon Univ.)
Big data analysis is a difficult task because the analysis is often memory intensive. Current solutions involve iteratively streaming data
chunks into the compute engine and recomputing the analytical algorithm. To simplify this process, this work proposes a new
programming model called Magic Memory. In this model, persistent invariants or functional dependencies can be established between
memory regions. These functional dependencies always hold when the memory regions are read. Recent
technological advancements enable Magic Memory at the hardware level, providing performance that is hard to achieve with a
software-only solution. Our ongoing work seeks to explore an implementation of Magic Memory on a CPU-FPGA system, where the
CPU runs the host code while the FPGA provides hardware acceleration. The CPU allocates memory on the FPGA and declares an
invariant for the FPGA to uphold on this region of memory. We demonstrate how an application such as PageRank can use Magic Memory to automatically recalculate its output as the input graph is modified.
Approximating Manifolds and Geodesics with Curved Surfaces
Peter Oostema; Franz Franchetti (Carnegie Mellon Univ.)
Approximating manifolds with spheres gives a representation that can effectively be used to find geodesics. This method takes a point
cloud of data and finds a
manifold representation and geodesic paths between any two points. This allows for graph embedding in non-Euclidean spaces.
Graph embedding gives a spatial representation of a graph, providing a new view of the data structure that is useful for applications in AI.
Network Automation in Lab Deployment Using Ansible and Python
V Andal Priyadharshini; Anumalasetty Yashwanth Nath (SRM Institute of Science and Technology)
Network automation has evolved into a solution that ensures efficiency in all areas. The age-old technique for configuring common software-defined networking protocols is inefficient, as it requires a box-by-box approach that must be repeated often and is prone to manual errors. Network automation assists network administrators in automating and verifying protocol configuration to ensure consistent configurations. This paper implements network automation using Python and Ansible to configure different protocols and
configurations in the container lab virtual environment. Ansible can help network administrators minimize human mistakes, reduce
time consumption, and enable device visibility across the network environment.
Optimizations to Increase JDBC Driver Performance in Spark
Deeptaanshu Kumar (Carnegie Mellon Univ.); Suxi Li (Univ. of Miami)
At “The Arena Group”, we are building an Enterprise Data Lake on AWS using the Databricks Lakehouse Platform. Since we are
ingesting data into our Data Lake using Spark clusters that connect to source databases with JDBC drivers, we have been doing
performance tests with different Spark JDBC driver parameters and configurations to see which set of parameters/configurations gives
us the best read/write performance in terms of execution time. As a result, we are sharing our performance test execution times in this
Extended Abstract, so that our tests can serve as a case study that other engineering teams can use to further optimize the JDBC
driver configurations in their own Spark clusters.
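An illustrative PySpark read using the standard Spark JDBC options such tests typically sweep (the connection details and option values below are placeholders, not the configurations benchmarked here):

    # Placeholder PySpark JDBC read; the partitioning and fetch options are the usual
    # knobs tuned for ingest performance. Values are illustrative only.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-ingest").getOrCreate()

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://source-db:5432/appdb")   # placeholder source
          .option("dbtable", "public.events")
          .option("user", "etl_user")
          .option("password", "<secret>")
          .option("partitionColumn", "event_id")   # parallel reads over this column range
          .option("lowerBound", "1")
          .option("upperBound", "100000000")
          .option("numPartitions", "32")
          .option("fetchsize", "10000")            # rows fetched per round trip
          .load())

    df.write.mode("append").parquet("s3://lake/bronze/events")       # placeholder sink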
Triangle Centrality in Arkouda
Joseph T Patchett; Zhihui Du; Fuhuan Li; David Bader (New Jersey Inst. of Tech.)
"There are a wide number of graph centrality metrics. Further, the performance of each can vary widely depending on the type of
implementation. We develop triangle centrality in Arkouda with several different triangle counting methods. Triangle Centrality is a
robust metric that captures the centrality of a vertex through both a vertex’s own connectedness and that of its neighbors. Arkouda is a
distributed system framework for data science at the scale of terabytes and beyond. These methods are compared against each other
and another shared memory implementation."
Image Recognition Using Machine Learning For Forbidden Items Detection In Airports
Alaa Atef; Abde-ljalil Naser; Mahmoud Mohamed; Mariam Safwat; Menna Tulla Ayman; Mohamed Mostafa; Salma Hesham (Ain
Shams University); Khaled Salah (Siemens)
One of the problems faced at airports and other secure locations is determining whether a traveler's luggage contains sharp weapons or guns that threaten others' lives. Here, we work on the idea of searching luggage by examining images taken in X-ray format to determine whether it contains any forbidden items. These forbidden items are: guns, knives, wrenches, pliers, and scissors. We use the concepts of machine learning and neural networks to deploy our model. The response time is 0.13 sec for the Mobile App and 0.54 sec for the Desktop App; the accuracy is 79% for the Mobile App and 91% for the Desktop App.
2-2: Graph Analytics & Network Science 2 Session (12:30-13:45)
Co-Chairs: John Gilbert & Chris Long
Hypersparse Network Flow Analysis of Packets with GraphBLAS
Tyler Trigg; Chad Meiners; Sandeep Pisharody (MIT Lincoln Laboratory); Hayden Jananthan; Michael Jones (MIT LLSC); Adam
Michaleas (MIT Lincoln Laboratory); Timothy Davis (Texas A&M Univ.); Erik Welch (NVIDIA); William Arcand; David Bestor; William
Bergeron; Chansup Byun; Vijay Gadepally; Micheal Houle; Matthew Hubbell; Anna Klein; Peter Michaleas (MIT LLSC); Lauren
Milechin (MIT); Julie Mullen; Andrew Prout; Albert Reuther; Antonio Rosa; Siddharth Samsi (MIT LLSC); Doug Stetson (MIT Lincoln
Laboratory); Charles Yee; Jeremy Kepner (MIT LLSC)
Internet analysis is a major challenge due to the volume and rate of network traffic. In lieu of analyzing traffic as raw packets, network
analysts often rely on compressed network flows (netflows) that contain the start time, stop time, source, destination, and number of
packets in each direction. However, many traffic analyses benefit from temporal aggregation of multiple simultaneous netflows, which
can be computationally challenging. To alleviate this concern, a novel netflow compression and resampling method has been
developed leveraging GraphBLAS hypersparse traffic matrices that preserve anonymization while enabling subrange analysis.
Standard multitemporal spatial analyses are then performed on each subrange
to generate detailed statistical aggregates of the source packets, source fan-out, unique links, destination fan-in, and destination
packets of each subrange, which can then be used for background modeling and anomaly detection. A simple file format based on
GraphBLAS sparse matrices is developed for storing these statistical aggregates. This method is scale tested on the MIT SuperCloud
using a 50 trillion packet netflow corpus from several hundred sites collected over several months. The resulting compression
achieved is significant (<0.1 bit per packet), enabling
extremely large netflow analyses to be stored and transported. The single node parallel performance is analyzed in terms of both
processors and threads showing that a single node can perform hundreds of simultaneous analyses at over a million packets/sec
(roughly equivalent to a 10 Gigabit link).
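Given a traffic matrix for one subrange, the aggregates listed above reduce to simple row and column reductions; a SciPy sketch (standing in for the paper's GraphBLAS hypersparse matrices and file format):

    # A[i, j] = packets from source i to destination j within one subrange.
    import numpy as np
    import scipy.sparse as sp

    def subrange_aggregates(A):
        A = sp.csr_matrix(A)
        links = A.astype(bool)                                          # unique (src, dst) links
        return {
            "source packets": np.asarray(A.sum(axis=1)).ravel(),        # packets sent per source
            "source fan-out": np.asarray(links.sum(axis=1)).ravel(),    # distinct destinations per source
            "unique links": links.nnz,
            "destination fan-in": np.asarray(links.sum(axis=0)).ravel(),
            "destination packets": np.asarray(A.sum(axis=0)).ravel(),
        }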
SHARP: Software Hint-Assisted Memory Access Prediction for Graph Analytics
Pengmiao Zhang (Univ. of Southern California); Rajgopal Kannan (US Army Research Lab-West); Xiangzhi Tong (Xi’an Jiaotong-
Liverpool Univ.); Anant V Nori (Intel Labs); Viktor K Prasanna (Univ. of Southern California)
Memory system performance is a major bottleneck in large-scale graph analytics. Data prefetching can hide memory latency; this
relies on accurate prediction of memory accesses. While recent machine learning approaches have performed well on memory
access prediction, they are restricted to building general models, ignoring the shifts in memory access patterns that follow changes of processing phase in software. We propose SHARP: a novel Software Hint-Assisted memoRy access Prediction approach for graph analytics under the Scatter-Gather paradigm on multi-core shared-memory platforms. We introduce software hints, inserted by the programmer, that explicitly indicate the processing phase of a graph processing program, i.e., Scatter or Gather. Assisted by
the software hints, we develop phase-specific prediction models that use attention-based neural networks, trained by memory traces
with rich context information. We use three widely-used graph algorithms and a variety of datasets for evaluation. With respect to F1-
score, SHARP outperforms the widely used Delta-LSTM model by 16.45%–18.93% for the Scatter phase and 9.50%–22.25% for the Gather phase, and outperforms the state-of-the-art TransFetch model by 3.66%–7.48% for the Scatter phase and 2.69%–7.59% for the Gather phase.
Fast Graph Algorithms for Superpixel Segmentation
Dimitris Floros (Aristotle Univ. of Thessaloniki); Tiancheng Liu (Duke Univ.); Nikos P Pitsianis (Aristotle Univ. of Thessaloniki, Duke
Univ.); Xiaobai Sun (Duke Univ.)
We introduce the novel graph-based algorithm SLAM (simultaneous local assortative mixing) for fast and high-quality superpixel
segmentation of any large color image. Superpixels are compact semantic image elements; superpixel segmentation is fundamental
to a broad range of vision tasks in existing and emerging applications, especially, to safety-critical and time-critical applications. SLAM
leverages a graph representation of the image, which encodes the pixel features and similarities, for its rich potential in implicit feature
transformation and extra means for feature differentiation and association at multiple resolution scales. We demonstrate, with our
experimental results on 500 benchmark images, that SLAM outperforms the state-of-the-art algorithms in superpixel quality, by multiple
measures, within the same time frame. The contributions are at least twofold: SLAM breaks down the long-standing speed barriers in
graph-based algorithms for superpixel segmentation; it lifts the fundamental limitations in the feature-point-based algorithms.
Explicit Ordering Refinement for Accelerating Irregular Graph Analysis
Michael Mandulak; Ruochen Hu; George M Slota (RPI)
Vertex reordering for efficient memory access in extreme-scale graph-based data analysis shows considerable improvement to the
cache efficiency and runtimes of widely used graph analysis algorithms. Despite this, modern efficient ordering methods are often
heuristic-based and do not directly optimize a given metric. Thus, this paper conducts an experimental study of explicit metric-based vertex ordering optimization. We introduce a universal, graph-partitioning-inspired approach to the vertex ordering problem, focused on CPU shared-memory parallelism, that explicitly refines low-degree vertices using the Linear Gap Arrangement and Log Gap Arrangement problems as comprehensive metrics for ordering improvement. This degree-based refinement method is evaluated on a number of initial orderings, with timing and cache efficiency results relative to three shared-memory graph analytic
algorithms: PageRank, Louvain and the Multistep algorithm. Applying refinement, we observe runtime improvements of up to 15x on
the ClueWeb09 graph and up to 4x improvements to cache efficiency on a variety of network types and initial orderings,
demonstrating the feasibility of an optimization approach to the vertex ordering problem at a large scale.
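One possible reading of the gap-based objectives, assuming they score an ordering by the per-edge position gaps |pi(u) - pi(v)| summed linearly and in log scale (the paper's exact definitions may differ), is sketched below.

    # Hedged sketch of Linear/Log Gap Arrangement-style ordering costs.
    import numpy as np

    def gap_costs(edges, order):
        # edges: (m, 2) int array; order[v] = position of vertex v in the ordering.
        pos = np.asarray(order)
        gaps = np.abs(pos[edges[:, 0]] - pos[edges[:, 1]])   # |pi(u) - pi(v)| per edge
        return gaps.sum(), np.log2(gaps).sum()               # linear and log gap costs

    edges = np.array([[0, 1], [1, 2], [0, 3], [2, 3]])
    print(gap_costs(edges, order=[0, 1, 2, 3]))
    print(gap_costs(edges, order=[0, 2, 1, 3]))              # a worse ordering has larger gaps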
Efficient Calculation of Triangle Centrality in Big Data Networks
Wali Mohammad Abdullah; David Awosoga; Shahadat Hossain (Univ. of Lethbridge)
The notion of “centrality” within graph analytics has led to the creation of well-known metrics such as Google’s PageRank [1], which is
an extension of eigenvector centrality [2]. Triangle centrality is a related metric [3] that utilizes the presence of triangles, which play an
important role in network analysis, to quantitatively determine the relative “importance” of a node in a network. Efficiently counting and
enumerating these triangles is a major backbone of understanding network characteristics, and linear-algebraic methods have
utilized the correspondence between sparse adjacency matrices and graphs to perform such calculations, with sparse matrix-matrix
multiplication as the main computational kernel. In this paper, we use an intersection representation of graph data implemented as a
sparse matrix, and engineer an algorithm to compute the triangle centrality of each vertex within a graph. The main computational task
of calculating these sparse matrix-vector products is carefully crafted by employing compressed vectors as accumulators. As with
other state-of-the-art algorithms [4], our method avoids redundant work by counting and enumerating each triangle exactly once. We
present results from extensive computational experiments on large-scale real-world and synthetic graph instances that demonstrate
good scalability of our method. We also present a shared memory parallel implementation of our algorithm.
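The linear-algebraic kernel referred to above can be sketched as follows: per-vertex triangle counts from sparse matrix products (the full triangle centrality formula of [3], which combines vertex and neighborhood counts, is not reproduced here).

    # A: symmetric 0/1 adjacency matrix (CSR) with no self-loops.
    import numpy as np
    import scipy.sparse as sp

    def triangle_counts(A):
        # (A @ A)[i, j] counts common neighbors of i and j; masking with A keeps
        # only pairs that are themselves edges, i.e. closed triangles.
        closed = (A @ A).multiply(A)
        per_vertex = np.asarray(closed.sum(axis=1)).ravel() // 2   # each triangle at v counted twice
        total = per_vertex.sum() // 3                              # each triangle has three corners
        return per_vertex, total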
2-3: Data Intensive Computing Session (14:15-15:30)
Co-Chairs: Xiaobai Sun & Nikos Pitsianis
Enabling Novel In-Memory Computation Algorithms to Address Next-Generation Throughput Constraints on SWaP-Limited
Platforms
Jessica M Ray; Chad Meiners (MIT Lincoln Laboratory)
The Department of Defense relies heavily on filtering and selection applications to help manage the overwhelming amount of data
constantly received at the tactical edge. Filtering and selection are both latency and throughput constrained, and systems at the
tactical edge must heavily optimize their SWaP (size, weight, and power) usage, which can reduce overall computation and memory
performance. In-memory computation (IMC) provides a promising solution to the latency and throughput issues, as it helps enable the
efficient processing of data as it is received, helping eliminate the memory bottleneck imposed by traditional Von Neumann
architectures.
In this paper, we discuss a specific type of IMC accelerator known as a Content Addressable Memory (CAM), which effectively
operates as a hardware-based associative array, allowing fast lookup and match operations. In particular, we consider ternary CAMs
(TCAMs) and their use within string matching, which is an important component of many filtering and selection applications. Despite
the benefits gained with TCAMs, designing applications that utilize them remains a difficult task. Straightforward questions, such as
“how large should my TCAM be?” and “what is the expected throughput?” are difficult to answer due to the many factors that go into
effectively mapping data into a TCAM. This work aims to help answer these types of questions with a new framework called Stardust-
Chicken. Stardust-Chicken supports generating and simulating TCAMs, and implements state-of-the-art algorithms and data
representations that can effectively map data into TCAMs. With Stardust-Chicken, users can explore the tradeoff space that comes
with TCAMs and better understand how to utilize them in their applications.
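A tiny software model of what a TCAM lookup provides (not the Stardust-Chicken framework): each entry is a (value, mask) pair whose masked-off bits are "don't care", and the lowest-index matching entry wins, as in a priority TCAM.

    # Illustrative 8-bit ternary match; the entries and key values are arbitrary.
    TCAM = [
        (0b10100000, 0b11110000, "rule A: match 1010 xxxx"),
        (0b00000001, 0b00000011, "rule B: match xxxx xx01"),
    ]

    def tcam_lookup(key):
        for value, mask, label in TCAM:
            if (key & mask) == (value & mask):   # compare only the cared-about bits
                return label
        return "miss"

    print(tcam_lookup(0b10101111))   # rule A
    print(tcam_lookup(0b01100101))   # rule B
    print(tcam_lookup(0b01100110))   # miss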
Towards Fast Crash-Consistent Cluster Checkpointing
Andrew E Wood (Boston Univ.); Moshik Hershcovitch (IBM Research); Ilias Ennmouri (IBM); Weiyu Zong; Saurav Chennuri (Boston
Univ.); Sarel Cohen (The Academic College of Tel Aviv-Yaffo); Swaminathan Sundararaman (IBM); Daniel G Waddington (IBM
Research); Peter Chin (Dartmouth Univ.)
Machine learning models are expensive to train: they require costly high-compute hardware and have long training times. Therefore, models are extra sensitive to program faults or unexpected system crashes, which can erase hours, if not days, of work. While there are plenty of strategies designed to mitigate the risk of unexpected system downtime, the most popular strategy in machine learning is called checkpointing: periodically saving the state of the model to persistent storage. Checkpointing is an effective strategy; however, it requires carefully balancing two factors: how often a checkpoint is made (the checkpointing schedule) and the cost of creating a checkpoint itself. In this paper, we leverage the Python Memory Manager (PyMM), which provides Python support for Persistent Memory and the emerging Persistent Memory technology Optane DC, to accelerate the checkpointing operation while
maintaining crash consistency. We first show that when checkpointing models, PyMM with persistent memory can save from minutes
to days of checkpointing runtime. We then further optimize the checkpointing operation with PyMM and demonstrate our approach
with the KMeans and Gaussian Mixture Model algorithms on two real-world datasets: MNIST and MusicNet. Through evaluation, we
show that these two algorithms achieve checkpointing speedups of between 10× and 75× for KMeans and over 3× for GMM
against the current state-of-the-art checkpointing approaches. We also verify that our solution recovers from crashes, while traditional
approaches cannot.
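For context, the baseline pattern being accelerated is periodic serialization of model state to storage; a minimal crash-consistent sketch of that baseline (PyMM's persistent-memory API is not reproduced here):

    # Write the checkpoint to a temporary file, fsync it, then atomically rename,
    # so a crash can never expose a half-written checkpoint.
    import os
    import pickle
    import tempfile

    def save_checkpoint(state, path="checkpoint.pkl"):
        dir_name = os.path.dirname(os.path.abspath(path))
        fd, tmp_path = tempfile.mkstemp(dir=dir_name)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())       # ensure the bytes reach stable storage
        os.replace(tmp_path, path)     # atomic publish: readers see old or new, never torn

    def load_checkpoint(path="checkpoint.pkl"):
        with open(path, "rb") as f:
            return pickle.load(f)

    # e.g. inside a KMeans loop, every k iterations:
    # save_checkpoint({"iteration": i, "centroids": centroids})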
Automatic Generation of Matrix-Vector Code Using SPIRAL for the Power10 ISA
James Nguyen; Sanil Rao (Carnegie Mellon Univ.); Jose Moreira (IBM); Franz Franchetti (Carnegie Mellon Univ.)
We present SPIRAL-based automatic program generation utilizing Power10's novel Matrix Multiply Assist (MMA) instructions. These MMA instructions accelerate matrix-vector multiplication. SPIRAL generates linear transform programs that take advantage of these instructions, making it more efficient for developers to update their linear transform libraries.
Towards Hardware Accelerated Garbage Collection with Near-Memory Processing
Samuel Thomas; Jiwon Choe (Brown Univ.); Ofir Gordon; Erez Petrank (Technion Inst.); Tali Moreshet (Boston Univ.); Maurice Herlihy
(Brown Univ.); Ruth Iris Bahar (Colorado School of Mines)
Garbage collection is widely available in popular programming languages, yet it may incur high performance overheads in
applications. Prior works have proposed specialized hardware acceleration implementations to offload garbage collection overheads
from the main processor, but these solutions have yet to be implemented in practice. In this paper, we propose using off-the-shelf
hardware to accelerate off-the-shelf garbage collection algorithms. Furthermore, our work is latency oriented as opposed to other
works that focus on bandwidth. We demonstrate that we can get a 2× performance improvement in some workloads and a 2.3×
reduction in LLC traffic by integrating generic Near-Memory Processing (NMP) into the built-in Java garbage collector. We will discuss
architectural implications of these results and consider directions for future work.
RaiderSTREAM: Adapting the STREAM Benchmark to Modern HPC Systems
Michael Beebe; Brody Williams; Stephen Devaney (Texas Tech Univ.); John Leidel (Tactical Computing Laboratories); Yong Chen
(Texas Tech Univ.); Steve Poole (Los Alamos National Lab)
Sustaining high memory bandwidth utilization is a common challenge in maximizing the performance of scientific applications: particularly for increasingly critical data-intensive workloads, the dominant factor in runtime is the speed at which data can be loaded from memory into the CPU and results written back to memory. The prevalence of irregular memory access patterns within these applications, exemplified by kernels such as those found in sparse matrix and graph applications, significantly degrades the achievable performance of a system's memory hierarchy. As such, it is highly desirable to be able to accurately measure
a given memory hierarchy's sustainable memory bandwidth when designing applications as well as future high-performance
computing (HPC) systems. STREAM is a de facto standard benchmark for measuring sustained memory bandwidth and has garnered
widespread adoption. In this work, we discuss current limitations of the STREAM benchmark in the context of high-performance and
scientific computing. We then introduce a new version of STREAM, called RaiderSTREAM, built on the OpenSHMEM and MPI
programming models in tandem with OpenMP, that includes additional kernels which better model irregular memory access patterns in order to address these shortcomings.
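A NumPy sketch of the classic STREAM Triad alongside a gather-style irregular variant of the kind RaiderSTREAM adds (array size, index pattern, and the Python timing are only a stand-in for the benchmark's C/OpenSHMEM/MPI/OpenMP kernels).

    import time
    import numpy as np

    n = 10_000_000
    a = np.zeros(n)
    b = np.random.rand(n)
    c = np.random.rand(n)
    idx = np.random.randint(0, n, size=n)      # irregular access pattern
    scalar = 3.0

    def gbps(bytes_moved, seconds):
        return bytes_moved / seconds / 1e9

    t = time.perf_counter()
    a[:] = b + scalar * c                      # classic Triad: a[i] = b[i] + s*c[i]
    triad = gbps(3 * n * 8, time.perf_counter() - t)

    t = time.perf_counter()
    a[:] = b[idx] + scalar * c[idx]            # gather Triad: a[i] = b[idx[i]] + s*c[idx[i]]
    gather = gbps(4 * n * 8, time.perf_counter() - t)   # approximate: includes index reads

    print(f"Triad: {triad:.1f} GB/s, gather Triad: {gather:.1f} GB/s")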
2-4: Scaling HPC Education Session (15:45-17:00)
Co-Chairs: Julie Mullen, Lauren Milechin & Hayden Jananthan
Invited Talk: TBD
Dr. Dhruva Chakravorty (Texas A&M Univ.)
Invited Talk: TBD
Dennis Milechin (Boston Univ.)
Invited Talk: TBD
Dr. Eric Coulter (Georgia Tech)
2-S1: GraphBLAS BoF Special (17:30-19:30)
Organizers: Tim Mattson & Scott McMillan