2018 IEEE High Performance Extreme Computing Conference (HPEC '18)
Twenty-second Annual HPEC Conference
25 - 27 September 2018, Westin Hotel, Waltham, MA USA
Wednesday September 26, 2018
ManyCore, 1:00-2:40 in Eden Vale C1/C2. Chair: Michel Kinsy (Boston University)
Designing Algorithms for the EMU Migrating-threads-based Architecture
Mehmet E Belviranli (ORNL)*; Seyong Lee (ORNL); Jeffrey Vetter (Oak Ridge National Laboratory)
The decades-old memory bottleneck problem for data-intensive applications is getting worse as processor core counts continue to increase. Workloads with sparse memory access characteristics achieve only a fraction of a system's total memory bandwidth. The EMU architecture takes a radical approach to this issue by migrating computational threads to the location where the data resides. The system provides access to a large PGAS-style memory spanning hundreds of nodes via a Cilk-based multi-threaded execution scheme. The EMU architecture also brings brand-new challenges in application design and development: data distribution and thread creation strategies play a crucial role in achieving optimal performance on the EMU platform. In this work, we identify several design considerations that must be addressed when developing applications for the new architecture, and we evaluate their performance effects on the EMU Chick hardware. We also present a modified BFS algorithm for the EMU system and give experimental results for its execution on the platform.

A Distributed Framework for Low-Latency OpenVX over the RDMA NoC of a Clustered Manycore
Julien Hascoët (INSA Rennes, IETR / Kalray)*; Benoît Dupont de Dinechin (Kalray); Karol Desnos (INSA Rennes, IETR); Jean-François Nezan (INSA Rennes, IETR)
OpenVX is a standard proposed by the Khronos Group for cross-platform acceleration of computer vision and deep learning applications. OpenVX abstracts the complexity of the target processor architecture and automates the implementation of processing pipelines through high-level optimizations. While highly efficient OpenVX implementations exist for shared-memory multi-core processors, targeting OpenVX to clustered manycore processors appears challenging: such processors comprise multiple compute units, or clusters, each fitted with an on-chip local memory shared by several cores. This paper describes an efficient implementation of OpenVX that targets clustered manycore processors. We propose a framework that includes computation graph analysis, kernel fusion techniques, RDMA-based tiling into local memories, optimization passes, and a distributed execution runtime. The framework is implemented and evaluated on the 2nd-generation Kalray MPPA® clustered manycore processor. Experimental results show that super-linear speed-ups are obtained for multi-cluster execution by leveraging the bandwidth of on-chip memories and the capabilities of asynchronous RDMA engines.
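As background for the OpenVX framework described above, the sketch below shows what a minimal graph-mode pipeline looks like in standard OpenVX 1.x C. The image size and the two kernels are illustrative assumptions rather than details from the paper; in an implementation like the one described, graph verification is the natural point to apply kernel fusion and plan RDMA-based tiling into cluster-local memories.

```c
#include <VX/vx.h>

int main(void)
{
    vx_context ctx   = vxCreateContext();
    vx_graph   graph = vxCreateGraph(ctx);

    /* Virtual images let the runtime choose placement, e.g. cluster-local memory. */
    vx_image in  = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);
    vx_image tmp = vxCreateVirtualImage(graph, 640, 480, VX_DF_IMAGE_U8);
    vx_image out = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);

    /* Two-stage pipeline: Gaussian blur followed by a 3x3 median filter. */
    vxGaussian3x3Node(graph, in, tmp);
    vxMedian3x3Node(graph, tmp, out);

    /* Verification is where an implementation can fuse kernels and plan tiling;
     * processing then executes the whole pipeline. */
    if (vxVerifyGraph(graph) == VX_SUCCESS)
        vxProcessGraph(graph);

    vxReleaseImage(&in);
    vxReleaseImage(&tmp);
    vxReleaseImage(&out);
    vxReleaseGraph(&graph);
    vxReleaseContext(&ctx);
    return 0;
}
```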
GoblinCore-64: A RISC-V Based Architecture for Data Intensive Computing
John D Leidel (Tactical Computing Laboratories)*; Xi Wang (Texas Tech University); Yong Chen (Texas Tech University)
Current microprocessor architectures rely upon multi-level data caches and low degrees of concurrency to solve a wide range of applications. These architectures are well suited to efficiently executing applications whose memory access patterns exhibit spatial and/or temporal locality. However, data-intensive applications often access memory in an irregular manner that prevents optimal use of the memory hierarchy. In this work, we introduce GoblinCore-64 (GC64), a novel architecture that supports large-scale data-intensive high performance computing workloads using a unique memory hierarchy coupled to a latency-hiding microarchitecture. The GC64 infrastructure is a hierarchical set of modules designed to support concurrency and latency hiding. The memory hierarchy is constructed using an on-chip scratchpad and Hybrid Memory Cube 3D memories. The RISC-V based instruction set includes support for scatter/gather memory operations, task concurrency, and task management. We demonstrate GC64 using standard benchmarks that include NAS, HPCG, BOTS, and the GAP Benchmark Suite. We find that GC64 accelerates these workloads by up to 14X per core and improves bandwidth by 3.5X.

Performance Portability of a Fluidized Bed Solver
V M Krushnarao Kotteda (The University of Texas at El Paso)*; Vinod Kumar (The University of Texas at El Paso); William Spotz (Sandia National Laboratories); Daniel Sunderland (Sandia National Laboratories)
Performance portability is a challenge for application developers, as the source code needs to run and perform well on various hybrid computing architectures. The linear iterative solvers implemented in most applications consume more than 70% of the runtime. This paper presents the results of a linear solver in Trilinos for fluidized bed applications. The linear solver implemented in our code is based on the Kokkos programming model in Trilinos, which uses a library approach to provide performance portability across diverse devices with different memory models. For large-scale problems, numerical experiments on Xeon Phi and Kepler GPU architectures show good performance relative to the results on Haswell computing architectures.

Implementing the Jaccard Index on the Migratory Memory-Side Processing Emu Architecture
Geraud P Krawezik (Emu Technology)*
We present an implementation of the Jaccard Index for graphs on the Migratory Memory-Side Processing Emu architecture. This index was designed to find similarities between different vertices in a graph and is often used to identify communities. The Emu architecture is a parallel system based on a partitioned global address space, with threads automatically migrating inside the memory. We introduce the parallel programming model used to exploit it, detail our implementation of the algorithm, and analyze simulated performance results as well as early hardware tests. We discuss its application to large-scale problems.
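For reference, the Jaccard index of two vertices u and v is J(u, v) = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|, where N(.) denotes the neighbor set. A minimal serial sketch in C over sorted adjacency lists is given below; it illustrates the metric only and is not the migratory-thread Emu implementation described in the abstract.

```c
#include <stddef.h>

/* Jaccard index of two vertices from their sorted neighbor lists:
 * J(u, v) = |N(u) ∩ N(v)| / |N(u) ∪ N(v)|. */
double jaccard_index(const int *nu, size_t lu, const int *nv, size_t lv)
{
    size_t i = 0, j = 0, common = 0;

    /* Sorted-merge count of the intersection. */
    while (i < lu && j < lv) {
        if (nu[i] == nv[j])     { common++; i++; j++; }
        else if (nu[i] < nv[j]) { i++; }
        else                    { j++; }
    }

    /* |A ∪ B| = |A| + |B| - |A ∩ B|. */
    size_t either = lu + lv - common;
    return either ? (double)common / (double)either : 0.0;
}
```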