2019 IEEE High Performance Extreme Computing Conference (HPEC '19)
Twenty-third Annual HPEC Conference
24-26 September 2019, Westin Hotel, Waltham, MA USA
Wednesday September 25, 2019
Manycore, 1:00-2:40 in Eden Vale A3
Chair: David Cousins / TBD

Improving Scheduling for Irregular Applications with Logarithmic Radix Binning
James Fox (Georgia Tech), Alok Tripathy (Georgia Tech), and Oded Green (Georgia Tech, Nvidia)
Effective scheduling and load balancing of applications on massively multi-threaded systems remains challenging despite decades of research, especially for irregular and data-dependent problems where the execution control path is unknown until run-time. One of the most widely used load-balancing schemes for data-dependent problems is a parallel prefix sum (PPS) array over the expected amount of work per task, followed by a partitioning of tasks to threads. While sufficient for many systems, it is not ideal for massively multi-threaded systems with SIMD/SIMT execution, such as GPUs. More fine-grained load balancing is needed to effectively utilize SIMD/SIMT units. In this paper we introduce Logarithmic Radix Binning (LRB) as a more suitable alternative to parallel prefix summation for load balancing on such systems. We show that LRB has better scalability than PPS for high thread counts on Intel's Knights Landing processor and comparable scalability on NVIDIA Volta GPUs. On the application side, we show how LRB improves the performance of PageRank by up to 1.75X using the branch-avoiding model. We also show how to better load-balance segmented sort and improve performance on the GPU.

An Efficient and Composable Parallel Task Programming Library
Chun-Xun Lin, Tsung-Wei Huang, Guannan Guo, Martin D. F. Wong (UIUC)
Composability is key to improving programmers' productivity in writing fast, market-expanding applications such as parallel machine learning algorithms and big data analytics. These applications exhibit both regular and irregular compute patterns, and are often combined with other functions or libraries to compose a larger program. However, composable parallel processing has taken a back seat in many existing parallel programming libraries, making it difficult to achieve modularity in large-scale parallel programs. In this paper, we introduce a new parallel task programming library built on composable tasking graphs. Our library efficiently supports task parallelism and provides an intuitive task-graph construction and flexible execution API set that enables reusable and composable task dependency graphs. Developers can quickly compose a large parallel program from small and modular parallel building blocks, and easily deploy the program on a multicore machine. We have evaluated our library on real-world applications. Experimental results show that our library achieves performance comparable to Intel Threading Building Blocks with less coding effort.

MeXT: A Flow for Multiprocessor Exploration
Christophe Bobda (Univ. Florida), Harold Ishebabi (Univ. Potsdam), Philipp Mahr (Univ. Potsdam), Joel Mandebi Mbongue (Univ. Florida)
This paper presents an extended design approach for heterogeneous multiprocessor systems. The goal of this design-exploration approach is to ease the implementation of an adaptive multiprocessor system by creating components such as processing nodes and memories from an application. A program is profiled and analysed to gather information about task precedence, communication costs, and computational patterns for hardware accelerator generation. This information is then used to solve an optimization problem using Integer Linear Programming or Answer Set Programming with the goals of 1) creating a suitable multiprocessor hardware architecture and 2) mapping tasks onto the processors. A lightweight message-passing library for on-chip communication of parallel programs is provided. The resulting abstract architecture is further processed using the vendor tool-chain to generate the target platform's configuration. Two real-world case studies are used to show the feasibility of our design-space exploration approach.

Optimizing Xeon Phi for Interactive Data Analysis
Chansup Byun, Jeremy Kepner, William Arcand, David Bestor, William Bergeron, Matthew Hubbell, Vijay Gadepally, Michael Houle, Michael Jones, Anne Klein, Lauren Milechin, Peter Michaleas, Julie Mullen, Andrew Prout, Antonio Rosa, Siddharth Samsi, Charles Yee, Albert Reuther (MIT)
The Intel Xeon Phi manycore processor is designed to provide high-performance matrix computations of the type often performed in data analysis. Common data analysis environments include Matlab, GNU Octave, Julia, Python, and R. Achieving optimal performance of matrix operations within data analysis environments requires tuning the Xeon Phi OpenMP settings, process pinning, and memory modes. This paper describes matrix multiplication performance results for Matlab and GNU Octave over a variety of combinations of process counts, OpenMP thread counts, and Xeon Phi memory modes. These results indicate that using KMP_AFFINITY=granularity=fine, taskset pinning, and all2all cache memory mode allows both Matlab and GNU Octave to achieve 66% of the practical peak performance for process counts ranging from 1 to 64 and OpenMP threads ranging from 1 to 64. These settings have resulted in generally improved performance across a range of applications and have enabled our Xeon Phi system to deliver significant results in a number of real-world applications.

Automatic Parallelization to Asynchronous Task-Based Runtimes Through a Generic Runtime Layer
Charles Jin, Muthu Baskaran, Benoit Meister, Jonathan Springer (Reservoir Labs)
With the end of Moore's law, asynchronous task-based parallelism has seen growing support as a parallel programming paradigm, with the runtime system offering such advantages as dynamic load balancing, locality, and scalability. However, there has been a proliferation of such programming systems in recent years, each of which presents different performance tradeoffs and runtime semantics. Developing applications on top of these systems thus requires not only application expertise but also deep familiarity with the runtime, exacerbating the perennial problems of programmability and portability. This work makes three main contributions to this growing landscape. First, we extend a polyhedral optimizing compiler with techniques to extract task-based parallelism and data management for a broad class of asynchronous task-based runtimes. Second, we introduce a generic runtime layer for asynchronous task-based systems with representations of data and tasks that are sparse and tiled by default, which serves as an abstract target for the compiler backend. Finally, we implement this generic layer using OpenMP and Legion, demonstrating the flexibility and viability of the generic layer and delivering an end-to-end path for automatic parallelization to asynchronous task-based runtimes. Using a wide range of applications from deep learning to scientific kernels, we obtain geometric mean speedups of 23.0x (OpenMP) and 9.5x (Legion) using 64 threads.
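
The load-balancing ideas in the Logarithmic Radix Binning abstract can be illustrated with a short sketch. The Python fragment below is not the authors' implementation; under simplified assumptions it shows (a) the parallel-prefix-sum partitioning the abstract uses as a baseline and (b) binning tasks by the ceiling of the base-2 logarithm of their estimated work, which is the core of logarithmic radix binning. The function names and the toy work array are hypothetical.

import math

def pps_partition(work, num_threads):
    # Baseline: prefix-sum the per-task work estimates, then hand each
    # thread a contiguous, half-open range [start, end) of tasks whose
    # cumulative work is close to total/num_threads.
    prefix = [0]
    for w in work:
        prefix.append(prefix[-1] + w)
    total, ranges, start = prefix[-1], [], 0
    for t in range(1, num_threads + 1):
        target = total * t / num_threads
        end = start
        while end < len(work) and prefix[end + 1] <= target:
            end += 1
        ranges.append((start, end))
        start = end
    return ranges

def lrb_bins(work, num_bins=32):
    # LRB idea: key each task by ceil(log2(work)), so tasks of similar
    # size land in the same bin and can be scheduled with a strategy
    # that matches their granularity (e.g., per-bin SIMD/SIMT kernels).
    bins = [[] for _ in range(num_bins)]
    for task, w in enumerate(work):
        b = 0 if w <= 1 else min(num_bins - 1, math.ceil(math.log2(w)))
        bins[b].append(task)
    return bins

if __name__ == "__main__":
    work = [1, 64, 3, 3, 1024, 2, 7, 129, 5, 5]   # toy per-task work estimates
    print(pps_partition(work, num_threads=4))     # one huge task skews the ranges
    for b, tasks in enumerate(lrb_bins(work)):
        if tasks:
            print(f"bin {b} (work <= 2^{b}):", tasks)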
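
The composable-tasking abstract lends itself to a small illustration of what "composing a large parallel program from small and modular building blocks" can look like. The sketch below is a generic Python stand-in, not the authors' C++ library or its API; TaskGraph, add, compose, and run are hypothetical names.

from concurrent.futures import ThreadPoolExecutor

class TaskGraph:
    # Generic sketch of a composable task-dependency graph.
    def __init__(self):
        self.tasks = {}   # name -> callable
        self.deps = {}    # name -> set of names it must run after

    def add(self, name, fn, after=()):
        self.tasks[name] = fn
        self.deps[name] = set(after)
        return name

    def compose(self, prefix, sub, after=()):
        # Splice another TaskGraph in under a namespace so a large program
        # can be assembled from small, reusable graphs.
        for name in sub.tasks:
            deps = {f"{prefix}/{d}" for d in sub.deps[name]}
            self.add(f"{prefix}/{name}", sub.tasks[name],
                     after=deps if deps else set(after))

    def run(self, workers=4):
        pending = {n: set(d) for n, d in self.deps.items()}
        with ThreadPoolExecutor(max_workers=workers) as pool:
            while pending:
                ready = [n for n, d in pending.items() if not d]
                if not ready:
                    raise ValueError("dependency cycle")
                futures = [(n, pool.submit(self.tasks[n])) for n in ready]
                for n, fut in futures:
                    fut.result()
                    del pending[n]
                    for d in pending.values():
                        d.discard(n)

# Build a small, reusable "preprocess" graph, then compose it into a larger one.
pre = TaskGraph()
pre.add("load", lambda: print("load"))
pre.add("clean", lambda: print("clean"), after=["load"])

app = TaskGraph()
app.compose("pre", pre)
app.add("train", lambda: print("train"), after=["pre/clean"])
app.run()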
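
The MeXT abstract mentions solving the architecture-creation and task-mapping problem with Integer Linear Programming or Answer Set Programming. As a rough illustration of the task-mapping half only (the paper's actual formulation is not given in the abstract and may differ), a minimal makespan-oriented ILP can be written as

\min_{x,\,C_{\max}} \; C_{\max}
\quad \text{s.t.} \quad
\sum_{p} x_{t,p} = 1 \;\; \forall t,
\qquad
\sum_{t} e_{t,p}\, x_{t,p} \le C_{\max} \;\; \forall p,
\qquad
x_{t,p} \in \{0,1\},

where x_{t,p} = 1 assigns task t to processor p and e_{t,p} is its estimated execution time there; communication costs between precedence-related tasks mapped to different processors would add further linear terms of the same form.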
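
The Xeon Phi abstract names concrete settings (KMP_AFFINITY=granularity=fine, taskset pinning, and the cache memory mode). A minimal, hypothetical launcher in Python shows how the affinity, thread-count, and pinning pieces fit together for a multi-process run; the core ranges, process count, and the bench_matmul.m script are placeholders, and the MCDRAM cache memory mode is a boot-time platform setting not shown here.

import os
import subprocess

def launch(rank, threads_per_proc=4, script="bench_matmul.m"):
    # Fine-grained OpenMP affinity and a fixed thread count per process,
    # with taskset pinning each process to its own group of cores.
    env = dict(os.environ,
               KMP_AFFINITY="granularity=fine",
               OMP_NUM_THREADS=str(threads_per_proc))
    first = rank * threads_per_proc
    last = first + threads_per_proc - 1
    cmd = ["taskset", "-c", f"{first}-{last}",
           "octave", "--no-gui", script]
    return subprocess.Popen(cmd, env=env)

if __name__ == "__main__":
    procs = [launch(r) for r in range(16)]   # e.g., 16 processes x 4 threads = 64 cores
    for p in procs:
        p.wait()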
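
Finally, the generic-runtime-layer abstract describes a single abstract target that a compiler backend can emit code against, with interchangeable task-based runtimes underneath. The toy Python sketch below only illustrates that indirection with two stand-in backends; the paper's layer actually targets OpenMP and Legion and handles sparse, tiled data and tasks, none of which is modeled here.

from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor

class GenericRuntime(ABC):
    # Hypothetical "generic layer": one spawn/wait interface that generated
    # code targets, with interchangeable runtime backends underneath.
    @abstractmethod
    def spawn(self, fn, *args): ...
    @abstractmethod
    def wait_all(self): ...

class SerialRuntime(GenericRuntime):
    def __init__(self):
        self.results = []
    def spawn(self, fn, *args):
        self.results.append(fn(*args))        # run immediately, in order
    def wait_all(self):
        return self.results

class ThreadedRuntime(GenericRuntime):
    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.futures = []
    def spawn(self, fn, *args):
        self.futures.append(self.pool.submit(fn, *args))
    def wait_all(self):
        out = [f.result() for f in self.futures]
        self.pool.shutdown()
        return out

def tiled_sum(rt, tiles):
    # "Generated" code is written once against GenericRuntime and runs
    # unchanged on either backend.
    for t in tiles:
        rt.spawn(sum, t)
    return sum(rt.wait_all())

print(tiled_sum(SerialRuntime(), [range(0, 100), range(100, 200)]))
print(tiled_sum(ThreadedRuntime(), [range(0, 100), range(100, 200)]))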