2018 IEEE High Performance
Extreme Computing Conference
(HPEC ’18)
Twenty-second Annual HPEC Conference
25–27 September 2018
Westin Hotel, Waltham, MA USA
Exploiting GPU with 3D Stacked Memory to Boost Performance for Data-Intensive Applications
Hao Wen (Virginia Commonwealth University); Wei Zhang (Virginia Commonwealth University)*
An increasing number of applications use GPUs for acceleration. Due to the massive number of memory accesses, traditional off-chip DRAM becomes a bandwidth bottleneck. 3D stacked memory has the potential to alleviate this bottleneck by using through-silicon vias (TSVs) to deliver a much wider on-chip bus than the traditional off-chip interface. In this paper, we evaluate the latency and bandwidth benefits of 3D stacked memory on GPUs. In addition, we take advantage of DRAM row buffer locality to merge memory requests, further improving performance.
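As a rough illustration of row-buffer-aware request merging, here is a minimal C++ sketch; the Request format, row size, and function names are illustrative assumptions, not the paper's implementation:

```cpp
// Hypothetical sketch: group pending memory requests by DRAM row so that
// requests to the same row are served together from the open row buffer,
// instead of forcing repeated row activations.
#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

constexpr uint64_t ROW_SIZE = 2048;  // assumed DRAM row (page) size in bytes

struct Request {
    uint64_t addr;  // byte address of the memory request
};

// Merge requests that fall into the same DRAM row into one batch,
// so each row is activated once and all hits stream from the row buffer.
std::map<uint64_t, std::vector<Request>> mergeByRow(const std::vector<Request>& pending) {
    std::map<uint64_t, std::vector<Request>> batches;
    for (const Request& r : pending) {
        batches[r.addr / ROW_SIZE].push_back(r);  // row id = addr / row size
    }
    return batches;
}

int main() {
    std::vector<Request> pending = {{0}, {4096}, {64}, {4160}, {128}};
    for (const auto& [row, batch] : mergeByRow(pending)) {
        std::cout << "row " << row << ": " << batch.size() << " merged requests\n";
    }
}
```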
An Access-Pattern-Aware On-Chip Vector Memory System with Automatic Loading for SIMD Architectures
Tong Geng (Boston University)*; Erkan Diken (Eindhoven University of Technology); Tianqi Wang (University of Science and
Technology of China); Lech Jozwiak (Eindhoven University of Technology); Martin Herbordt (Boston University)
Single-Instruction-Multiple-Data (SIMD) architectures are widely used to accelerate applications involving Data-Level Parallelism (DLP); the on-chip memory system handles the communication between Processing Elements (PEs) and on-chip vector memory. Inefficiency in the on-chip memory system is often a performance bottleneck. In this paper, we describe the design and implementation of an efficient vector data memory system. The proposed memory system consists of two novel parts: an access-pattern-aware memory controller and an automatic loading mechanism. The memory controller reduces data reorganization overheads. The automatic loading mechanism loads data automatically according to the access patterns, without explicit load instructions, eliminating the overhead of fetching and decoding them. The proposed design is implemented and synthesized with Cadence tools. Experimental results demonstrate that our design improves the performance of eight application kernels by 44% and reduces energy consumption by 26%, on average.
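A minimal sketch of the automatic-loading idea, assuming a simple strided pattern descriptor; the PatternDescriptor fields and autoLoad name are hypothetical and not taken from the paper:

```cpp
// Hypothetical sketch: an access-pattern descriptor drives a bulk gather into
// an on-chip vector buffer, so PEs need no per-element load instructions.
#include <cstddef>
#include <iostream>
#include <vector>

// Describes a strided access pattern: start index, stride, element count.
struct PatternDescriptor {
    size_t base;
    size_t stride;
    size_t count;
};

// "Automatic loader": given the pattern, gathers elements from main memory
// into a contiguous on-chip buffer in one pass, already reorganized for PEs.
std::vector<float> autoLoad(const std::vector<float>& mainMem,
                            const PatternDescriptor& p) {
    std::vector<float> onChip;
    onChip.reserve(p.count);
    for (size_t i = 0; i < p.count; ++i)
        onChip.push_back(mainMem[p.base + i * p.stride]);
    return onChip;
}

int main() {
    std::vector<float> mem(64);
    for (size_t i = 0; i < mem.size(); ++i) mem[i] = float(i);
    // Column access of an 8x8 row-major matrix: base 3, stride 8, 8 elements.
    std::vector<float> vec = autoLoad(mem, {3, 8, 8});
    for (float v : vec) std::cout << v << ' ';
    std::cout << '\n';
}
```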
Scalable RMA-based Communication Library Featuring Node-local NVMs
Ryo Matsumiya (Tokyo Institute of Technology / AIST-TokyoTech Real World Big-Data Computation Open Innovation Laboratory, National Institute of Advanced Industrial Science and Technology)*; Toshio Endo (Tokyo Institute of Technology / AIST-TokyoTech Real World Big-Data Computation Open Innovation Laboratory, National Institute of Advanced Industrial Science and Technology)
Remote Memory Access (RMA) is a useful communication interface for developing high-performance applications with complicated communication patterns. However, the data scale of such applications is still limited by the total available main memory capacity. To accommodate extreme-scale executions of these applications, we developed vGASNet, an RMA-based communication library that exploits the capacity of non-volatile memory (NVM) on each node. With vGASNet, the NVM devices across nodes compose a large shared address space. Under this model, the key to good application performance is reducing bandwidth bottlenecks. First, since NVM is much slower than DRAM, it is important to reduce the number of NVM accesses; for this purpose, vGASNet treats the DRAM of each compute node as a cache of NVM. Second, another source of bottlenecks in RMA is access contention; to mitigate its effects, vGASNet adopts a cooperative caching mechanism that maintains multiple caches of an object across several nodes. Our evaluation shows that these caching mechanisms improve the scalability of RMA.
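A functional sketch of the DRAM-as-NVM-cache idea; the Node structure and map-based stores are illustrative assumptions, and vGASNet's actual cache (and its cooperative replication across nodes) is more involved:

```cpp
// Hypothetical sketch: per-node DRAM acting as a cache in front of slower NVM.
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>

// Functional model: NVM is the backing store; DRAM caches recently read blocks.
struct Node {
    std::unordered_map<uint64_t, std::string> nvm;        // slow backing store
    std::unordered_map<uint64_t, std::string> dramCache;  // fast local cache

    // Read through the DRAM cache; fall back to NVM on a miss and fill.
    const std::string& read(uint64_t block) {
        auto hit = dramCache.find(block);
        if (hit != dramCache.end()) return hit->second;   // DRAM hit: no NVM access
        const std::string& data = nvm.at(block);          // NVM access on a miss
        return dramCache.emplace(block, data).first->second;
    }
};

int main() {
    Node n;
    n.nvm[42] = "payload";
    std::cout << n.read(42) << '\n';  // miss: fills the DRAM cache from NVM
    std::cout << n.read(42) << '\n';  // hit: served from DRAM
}
```

In the cooperative variant described above, such cached copies would additionally be replicated on several nodes so that readers spread their accesses and contention on any single node is reduced.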
Thursday, September 27, 2018
ASIC & FPGA 2
1:00-2:40 in Eden Vale A3
Chair: Paul Monticciolo / MIT