Exploiting GPU with 3D Stacked Memory to Boost Performance for Data-Intensive Applications
Hao Wen (Virginia Commonwealth University); Wei Zhang (Virginia Commonwealth University)*
An increasing number of applications use GPUs for acceleration. Due to the massive number of memory accesses, traditional DRAM becomes a bandwidth bottleneck. 3D stacked memory has the potential to alleviate this bottleneck by using through-silicon vias (TSVs) to deliver a much wider on-chip bus than the traditional off-chip interface. In this paper, we evaluate the latency and bandwidth benefits of 3D stacked memory on GPUs. In addition, we take advantage of DRAM row buffer locality to merge memory requests and further improve performance.

An Access-Pattern-Aware On-Chip Vector Memory System with Automatic Loading for SIMD Architectures
Tong Geng (Boston University)*; Erkan Diken (Eindhoven University of Technology); Tianqi Wang (University of Science and Technology of China); Lech Jozwiak (Eindhoven University of Technology); Martin Herbordt (Boston University)
Single-Instruction-Multiple-Data (SIMD) architectures are widely used to accelerate applications involving Data-Level Parallelism (DLP); the on-chip memory system facilitates the communication between the Processing Elements (PEs) and the on-chip vector memory. The inefficiency of the on-chip memory system is often a computational bottleneck. In this paper, we describe the design and implementation of an efficient vector data memory system. The proposed memory system consists of two novel parts: an access-pattern-aware memory controller and an automatic loading mechanism. The memory controller reduces data reorganization overhead. The automatic loading mechanism loads data according to the access patterns, without load instructions; this eliminates the overhead of fetching and decoding them. The proposed design is implemented and synthesized with Cadence tools. Experimental results demonstrate that our design improves the performance of 8 application kernels by 44% and reduces energy consumption by 26%, on average.

Scalable RMA-based Communication Library Featuring Node-local NVMs
Ryo Matsumiya (Tokyo Institute of Technology / AIST-TokyoTech Real World Big-Data Computation Open Innovation Laboratory, National Institute of Advanced Industrial Science and Technology)*; Toshio Endo (Tokyo Institute of Technology / AIST-TokyoTech Real World Big-Data Computation Open Innovation Laboratory, National Institute of Advanced Industrial Science and Technology)
Remote Memory Access (RMA) is a useful communication interface for developing high-performance applications with complicated communication patterns. However, the data scale of such applications is still limited by the total available main memory capacity. To accommodate extreme-scale executions of those applications, we developed vGASNet, an RMA-based communication library that exploits the capacity of the non-volatile memory (NVM) on each node. With vGASNet, the NVM devices on the nodes compose a large shared address space. Under this model, the key to good application performance is reducing bandwidth bottlenecks. First, since NVM is much slower than DRAM, it is important to reduce the amount of NVM access; for this purpose, vGASNet treats the DRAM of each computation node as a cache of NVM. Second, another source of bottlenecks in RMA is access contention. To mitigate its effects, vGASNet adopts a cooperative cache mechanism, which maintains multiple cached copies of an object on several nodes. Our evaluation shows that these cache mechanisms improve the scalability of RMA.
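The first abstract above merges GPU memory requests that target the same DRAM row, so the row buffer is activated once and then reused. Below is a minimal C++ sketch of that idea for a memory controller model; the names (MemRequest, RequestQueue, kRowBytes) and the address-to-row mapping are illustrative assumptions, not the paper's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr uint64_t kRowBytes = 2048;  // assumed bytes per DRAM row

struct MemRequest {
    uint64_t addr;              // byte address of the access
    std::vector<int> warp_ids;  // GPU warps waiting on this request
};

class RequestQueue {
public:
    // Enqueue a request; if a pending request targets the same DRAM row,
    // merge them so the row is activated only once (a row-buffer hit).
    void Enqueue(uint64_t addr, int warp_id) {
        uint64_t row = addr / kRowBytes;  // row-buffer index
        auto it = pending_.find(row);
        if (it != pending_.end()) {
            it->second.warp_ids.push_back(warp_id);        // merged request
        } else {
            pending_.emplace(row, MemRequest{addr, {warp_id}});  // new activation
        }
    }

    std::size_t PendingRows() const { return pending_.size(); }

private:
    std::unordered_map<uint64_t, MemRequest> pending_;  // one entry per open row
};
```

Every warp merged into an existing entry is served from an already-open row, which is exactly the row buffer locality the abstract exploits.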
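The second abstract replaces per-element load instructions with a controller that fetches operands according to a declared access pattern. The C++ sketch below shows one plausible pattern descriptor and address generator; the base/stride/count fields are assumptions, since the paper's descriptor format is not given here.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical access-pattern descriptor programmed into the controller.
struct AccessPattern {
    uint64_t base;    // starting index in vector memory
    uint64_t stride;  // distance between consecutive elements (in words)
    uint64_t count;   // number of elements each PE consumes
};

// The controller walks the pattern and pushes operands into the PE's
// input buffer, so the PE pipeline never fetches or decodes a load.
std::vector<uint32_t> AutoLoad(const AccessPattern& p,
                               const std::vector<uint32_t>& vector_mem) {
    std::vector<uint32_t> pe_buffer;
    pe_buffer.reserve(p.count);
    for (uint64_t i = 0; i < p.count; ++i) {
        pe_buffer.push_back(vector_mem[p.base + i * p.stride]);
    }
    return pe_buffer;
}
```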
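The third abstract layers two caches over RMA: each node's DRAM caches its slow NVM, and cooperative caching lets peer nodes serve copies to spread contention. The sketch below shows one plausible read (get) path under those two ideas; every identifier is hypothetical, and vGASNet's real API is not reproduced here.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using BlockId = uint64_t;
using Block = std::vector<uint8_t>;

struct Node {
    std::unordered_map<BlockId, Block> dram_cache;  // DRAM as a cache of NVM
    std::unordered_map<BlockId, Block> nvm;         // node-local NVM store
};

Block RmaGet(Node& self, std::vector<Node*>& peers, Node& owner, BlockId id) {
    // 1. Local DRAM cache hit: fastest path, no network or NVM access.
    if (auto it = self.dram_cache.find(id); it != self.dram_cache.end())
        return it->second;

    // 2. Cooperative cache: a copy in a peer's DRAM avoids contention
    //    on the owner's slow NVM device.
    for (Node* peer : peers)
        if (auto it = peer->dram_cache.find(id); it != peer->dram_cache.end())
            return self.dram_cache[id] = it->second;

    // 3. Fall back to the owner's NVM, then cache the block locally so
    //    later accesses (from this node or its peers) skip the NVM.
    Block blk = owner.nvm.at(id);
    self.dram_cache[id] = blk;
    return blk;
}
```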
Thursday, September 27, 2018
ASIC & FPGA 2 1:00-2:40 in Eden Vale A3 Chair: Paul Monticiollo / MIT