2018 IEEE High Performance
Extreme Computing Conference
(HPEC '18)
Twenty-second Annual HPEC Conference
25 - 27 September 2018
Westin Hotel, Waltham, MA USA
Designing Algorithms for the EMU Migrating-threads-based Architecture
Mehmet E Belviranli (ORNL)*; Seyong Lee (ORNL); Jeffrey Vetter (Oak Ridge National Laboratory)
The decades-old memory bottleneck for data-intensive applications is getting worse as processor core counts continue
to increase. Workloads with sparse memory-access characteristics achieve only a fraction of a system's total memory bandwidth.
The EMU architecture takes a radical approach to the problem by migrating computational threads to the location where the data
resides. The system provides access to a large PGAS-style memory spanning hundreds of nodes via a Cilk-based multi-threaded
execution scheme. The EMU architecture also brings brand-new challenges in application design and development: data distribution
and thread-creation strategies play a crucial role in achieving optimal performance on the EMU platform. In this work, we identify
several design considerations that must be addressed when developing applications for the new architecture, and we evaluate their
performance effects on the EMU-chick hardware. We also present a modified BFS algorithm for the EMU system and give
experimental results for its execution on the platform.
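The paper's modified BFS is tailored to EMU's migrating threads and is not reproduced in the abstract. As a point of reference, the textbook level-synchronous BFS that such designs typically start from can be sketched as follows (plain in-memory adjacency lists; the distributed data placement the paper studies is not modeled):

```python
from collections import deque

def bfs_levels(adj, source):
    """Level-synchronous BFS: hop distance from source for each reachable vertex.

    adj: dict mapping vertex -> list of neighbors (illustrative layout only;
    on EMU this data would be distributed so threads migrate to it).
    """
    dist = {source: 0}
    frontier = deque([source])
    while frontier:
        v = frontier.popleft()
        for w in adj[v]:
            if w not in dist:          # first visit fixes the BFS level
                dist[w] = dist[v] + 1
                frontier.append(w)
    return dist
```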
A Distributed Framework for Low-Latency OpenVX over the RDMA NoC of a Clustered Manycore
Julien Hascoët (INSA Rennes, IETR / Kalray)*; Benoît Dupont de Dinechin (Kalray); Karol Desnos (INSA Rennes, IETR); Jean-François Nezan (INSA Rennes, IETR)
OpenVX is a standard proposed by the Khronos Group for cross-platform acceleration of computer vision and deep learning
applications. OpenVX abstracts the complexity of the target processor architecture and automates the implementation of processing
pipelines through high-level optimizations. While highly efficient OpenVX implementations exist for shared-memory multi-core
processors, targeting OpenVX to clustered manycore processors appears challenging. Indeed, such processors comprise multiple
compute units, or clusters, each fitted with an on-chip local memory shared by several cores. This paper describes an efficient
implementation of OpenVX that targets clustered manycore processors. We propose a framework that includes computation-graph
analysis, kernel fusion techniques, RDMA-based tiling into local memories, optimization passes, and a distributed execution runtime.
This framework is implemented and evaluated on the 2nd-generation Kalray MPPA® clustered manycore processor. Experimental
results show that super-linear speed-ups are obtained for multi-cluster execution by leveraging the bandwidth of on-chip memories
and the capabilities of asynchronous RDMA engines.
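The kernel-fusion idea the framework applies to OpenVX graph nodes can be illustrated in miniature: composing adjacent per-element stages into one pass avoids materializing an intermediate buffer per stage. This is only a sketch of the general technique on plain lists (all names hypothetical), not the paper's implementation:

```python
def fuse_elementwise(stages):
    """Compose a pipeline of unary per-element kernels into a single pass.

    Fusion saves the memory traffic of one intermediate buffer per stage;
    the paper performs this on OpenVX graph nodes tiled into cluster-local
    memory, which is not modeled here.
    """
    def fused(x):
        for f in stages:
            x = f(x)
        return x
    return fused

def run_pipeline(data, stages):
    """Apply the fused kernel once per element instead of one sweep per stage."""
    fused = fuse_elementwise(stages)
    return [fused(x) for x in data]
```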
GoblinCore-64: A RISC-V Based Architecture for Data Intensive Computing
John D Leidel (Tactical Computing Laboratories)*; Xi Wang (Texas Tech University); Yong Chen (Texas Tech University)
Current microprocessor architectures rely upon multi-level data caches and low degrees of concurrency to serve a wide range of
applications. These architectures are well suited to efficiently executing applications whose memory access patterns exhibit
spatial and/or temporal locality. However, data-intensive applications often access memory in an irregular manner that prevents
optimal use of the memory hierarchy. In this work, we introduce GoblinCore-64 (GC64), a novel architecture that supports large-scale
data-intensive high-performance computing workloads using a unique memory hierarchy coupled to a latency-hiding
microarchitecture. The GC64 infrastructure is a hierarchical set of modules designed to support concurrency and latency hiding. The
memory hierarchy is constructed using an on-chip scratchpad and Hybrid Memory Cube 3D memories. The RISC-V-based
instruction set includes support for scatter/gather memory operations, task concurrency, and task management. We demonstrate
GC64 using standard benchmarks that include NAS, HPCG, BOTS, and the GAP Benchmark Suite. We find that GC64 accelerates
these workloads by up to 14X per core and improves bandwidth by 3.5X.
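The scatter/gather operations the GC64 instruction set supports have simple semantics: a gather reads memory at a list of arbitrary indices, and a scatter writes to arbitrary indices. A minimal sketch of those semantics (in software, not the ISA itself):

```python
def gather(memory, indices):
    """Gather: read memory at arbitrary, possibly irregular, indices."""
    return [memory[i] for i in indices]

def scatter(memory, indices, values):
    """Scatter: write values to arbitrary indices in place.

    Irregular index streams like these defeat cache-line locality, which
    is why hardware scatter/gather support matters for such workloads.
    """
    for i, v in zip(indices, values):
        memory[i] = v
```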
Performance Portability of a Fluidized Bed Solver
V M Krushnarao Kotteda (The University of Texas at El Paso)*; Vinod Kumar (The University of Texas at El Paso); William Spotz
(Sandia National Laboratories); Daniel Sunderland (Sandia National Laboratories)
Performance portability is a challenge for application developers, as the same source code must run and perform well on
various hybrid computing architectures. The linear iterative solvers implemented in most applications consume more than 70% of the
runtime. This paper presents the results of a linear solver in Trilinos for fluidized bed applications. The linear solver implemented in
our code is based on the Kokkos programming model in Trilinos, which uses a library approach to provide performance portability
across diverse devices with different memory models. For large-scale problems, numerical experiments on Xeon Phi and Kepler
GPU architectures show good performance compared with results on Haswell architectures.
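The abstract does not say which iterative solver is used; as a generic reference for the class of solver that dominates such runtimes, a textbook conjugate-gradient iteration for a symmetric positive-definite system can be sketched as follows (plain Python on dense lists, nothing like the Trilinos/Kokkos implementation the paper evaluates):

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Textbook CG for a symmetric positive-definite matrix A (dense lists).

    tol bounds the squared residual norm. Illustrative only: production
    solvers work on sparse matrices with preconditioning.
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)                       # residual b - A x, with x = 0
    p = list(r)                       # initial search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:              # converged
            break
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x
```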
Implementing the Jaccard Index on the Migratory Memory-Side Processing Emu Architecture
Geraud P Krawezik (Emu Technology)*
We present an implementation of the Jaccard Index for graphs on the Migratory Memory-Side Processing Emu architecture. This
index was designed to find similarities between different vertices in a graph, and is often used to identify communities. The Emu
architecture is a parallel system based on a partitioned global address space, with threads automatically migrating inside the
memory. We introduce the parallel programming model used to exploit it, detail our implementation of the algorithm, and analyze
simulated performance results as well as early hardware tests. We discuss its application to large-scale problems.
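For reference, the Jaccard index between two vertices is the standard set-similarity measure applied to their neighbor sets, |N(u) ∩ N(v)| / |N(u) ∪ N(v)|. A minimal sketch of that definition (the paper's contribution is its migrating-thread implementation, not this formula):

```python
def jaccard_index(adj, u, v):
    """Jaccard similarity of two vertices' neighborhoods:
    |N(u) & N(v)| / |N(u) | N(v)|.

    adj: dict mapping vertex -> iterable of neighbors.
    Returns 0.0 when both neighborhoods are empty.
    """
    nu, nv = set(adj[u]), set(adj[v])
    union = nu | nv
    if not union:
        return 0.0
    return len(nu & nv) / len(union)
```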
Wednesday, September 26, 2018
ManyCore
1:00-2:40 in Eden Vale C1/C2
Chair: Michel Kinsy / Boston University