2015 IEEE High Performance Extreme Computing Conference (HPEC ‘15) Nineteenth Annual HPEC Conference 15 - 17 September 2015 Westin Hotel, Waltham, MA USA
Advanced ASIC & FPGA Technologies
10:20-12:00 in Eden Vale A1-A2
Chair: David Cousins / BBN

[Best Student Paper Finalist] Hardware-Efficient Compressed Sensing Encoder Designs for ECG
Jiayi Sheng, Chen Yang, Martin C. Herbordt, Boston University
Implanted sensors, as might be used with wireless body sensor networks (WBSNs), must have minimal size and power consumption. In this work we examine digital compressed sensing encoders for WBSN-enabled ECG monitoring, an area that has received much recent attention. We make two major contributions. The first is using a random binary Toeplitz measurement matrix rather than a Bernoulli matrix. The second is reducing the number of accumulators, thereby trading off space for operating frequency. Compared with previous implementations, our new design consumes one to two orders of magnitude less area and power while still meeting timing constraints and achieving comparable recovery quality.
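As a rough illustration of the measurement step this abstract describes (a minimal sketch, not the authors' encoder; the dimensions and names below are hypothetical), a binary Toeplitz matrix lets the encoder compute y = Phi * x while storing only N + M - 1 random bits, since each row is a one-position shift of the previous row, rather than the M * N bits an i.i.d. Bernoulli matrix would require:

    #include <stdint.h>

    #define N 256  /* hypothetical ECG window length (input samples)   */
    #define M 64   /* hypothetical number of compressed measurements   */

    /*
     * Compressed-sensing measurement y = Phi * x with a binary Toeplitz
     * matrix: Phi[i][j] = seed[i - j + N - 1], so the whole M x N matrix
     * is defined by only N + M - 1 random bits.
     */
    void cs_encode_toeplitz(const int16_t x[N], int32_t y[M],
                            const uint8_t seed[N + M - 1])
    {
        for (int i = 0; i < M; i++) {
            int32_t acc = 0;                 /* one running sum per output  */
            for (int j = 0; j < N; j++) {
                if (seed[i - j + N - 1])     /* {0,1} entry: add or skip    */
                    acc += x[j];
            }
            y[i] = acc;
        }
    }

In a hardware encoder each of these running sums lives in an accumulator; using fewer parallel accumulators and time-multiplexing them across rows at a higher clock rate is the space-versus-operating-frequency tradeoff the abstract refers to.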
Coarse Grain Reconfigurable ASIC through Multiplexer Based Switches
Karen Gettings, Marc Burke, Jeremy Muldavin, Michael Vai, MIT Lincoln Laboratory
We present an ASIC architecture with coarse-grain reconfigurability that uses accelerators to improve performance over fine-grain reconfigurable architectures. A reconfigurable FFT ASIC was built as a proof of concept, and it successfully demonstrated valid switch operation for reconfiguration.

Performance and Productivity Evaluation of Hybrid-Threading HLS versus HDLs
Gongyu Wang, Herman Lam, Alan George, University of Florida; Glen Edwards, Convey Computer Corporation
FPGA-based reconfigurable computing is finding its way into a wide range of application areas in which high performance and low power consumption are paramount. However, FPGA application development using hardware-description languages (HDLs) faces productivity challenges that limit its wide adoption, including a steep learning curve and lengthy compilation. High-level synthesis (HLS) languages and tools aim to overcome these challenges by providing familiar high-level languages and tools for FPGA application development. In using HLS, however, an important consideration is the cost-benefit tradeoff between performance and productivity. Hybrid-Threading (HT) is a new open-source HLS toolset from Convey Computer Corp. that features a programming language based on C/C++ and a set of tools for efficient compilation, verification, and implementation. In this paper, we present a performance and productivity tradeoff study of HT HLS versus HDLs using three RC-amenable kernels, each chosen for its distinctive computational requirements. Our results show that for all three kernels, HT achieved over 80% of the performance of the corresponding optimized HDL-based designs in a fraction of the development time.

Aparapi-UCores: A High Level Programming Framework for Unconventional Cores
Oren Segal, Philip Colangelo, Nasibeh Nasiri, Zhuo Qian, Martin Margala, University of Massachusetts Lowell
Combining several types of devices and architectures is at the heart of heterogeneous computing's power-efficiency advantage, but the strength of heterogeneous systems is also their Achilles' heel: the diversity of the devices and ecosystems needed to maintain them presents major technological challenges. Some of the biggest challenges are in the realm of system programming. We believe that for heterogeneous computing to become a mainstream system design choice, high-level, standard system design flows need to be adopted in order to achieve transparency when dealing with diverse devices and architectures. In this paper we present an open-source, high-level framework and design flow that allows working with any type of device that supports OpenCL. In addition, we test our design flow and framework on an N-body simulation across multiple device types and show how such a high-level framework and heterogeneous system design can deliver a more power-efficient solution compared with a single general-purpose device and with a dual CPU+GPU approach.
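The Aparapi-UCores evaluation above uses an N-body simulation as its benchmark. As a minimal sketch of that workload (plain C for illustration only; the Aparapi framework itself expresses kernels in Java and translates them to OpenCL, and none of the names below come from the framework), the core computation is an all-pairs acceleration loop whose outer iterations are independent, which is what makes it a natural candidate for offload to any OpenCL device:

    #include <math.h>
    #include <stddef.h>

    /* Hypothetical particle record: position and mass. */
    typedef struct { float x, y, z, mass; } body_t;

    /*
     * One all-pairs N-body acceleration pass (O(n^2)). Each outer
     * iteration is independent, so it maps directly onto one OpenCL
     * work-item per body on a GPU, FPGA, or multicore CPU.
     */
    void nbody_accel(const body_t *b, float *ax, float *ay, float *az,
                     size_t n, float softening)
    {
        for (size_t i = 0; i < n; i++) {
            float axi = 0.0f, ayi = 0.0f, azi = 0.0f;
            for (size_t j = 0; j < n; j++) {
                float dx = b[j].x - b[i].x;
                float dy = b[j].y - b[i].y;
                float dz = b[j].z - b[i].z;
                float r2 = dx * dx + dy * dy + dz * dz + softening;
                float inv_r3 = 1.0f / (r2 * sqrtf(r2));
                axi += b[j].mass * dx * inv_r3;  /* G folded into mass units */
                ayi += b[j].mass * dy * inv_r3;
                azi += b[j].mass * dz * inv_r3;
            }
            ax[i] = axi; ay[i] = ayi; az[i] = azi;
        }
    }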
High Performance User Space Sockets on Low Power System on a Chip Platforms
Catherine H. Crawford, Piotr Padkowski, Tomasz Baranski, Angela Czubak, Łukasz Raszka, IBM Research
With the introduction of low-power System on a Chip (SoC) processor architectures in enterprise server configurations, there is a growing need for software that supports the scale-out, data-intensive cloud applications deployed in data centers today. In this paper, we describe the design and implementation of a low-latency, fully compliant, user-space TCP/IP socket stack on a low-power SoC architecture and demonstrate that this library can become the basis for “Big Data” applications that require both high throughput and low latency on a power-optimized system platform. We specifically target cloud applications built on runtimes that are seeing strong growth in programmer communities and enterprise deployment, and for which I/O bottlenecks outweigh compute requirements, e.g. memcached. On low-power embedded-class SoC servers, these I/O bottlenecks can be prohibitively expensive for the performance and scaling requirements of such applications, even when CPU efficiency and memory bandwidth are adequate. Our approach removes this bottleneck by leveraging the SoC's integrated Network Interface Cards (NICs) together with user-space communication, thereby shortening the path length to data and saving CPU cycles otherwise spent on context switching. Our experiments show sub-5 μs ping-pong latency for 8 B packets, as well as a substantial improvement on the memslap benchmark, not only compared to memcached running on the T4240 with the kernel stack (3.5 times better for 16 B SETs) but also compared to a standard x86_64 server with ConnectX 10 GbE adapters when power-normalized metrics are used (close to a factor of 2 improvement).
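The sub-5 μs figure above is a ping-pong measurement: one host sends a small message, the peer echoes it back, and half the average round-trip time is reported. The sketch below shows that generic pattern over ordinary kernel TCP sockets (the address, port, and iteration count are placeholders, not the authors' setup); since the paper describes a fully compliant user-space socket stack, the same client pattern would in principle run unchanged on top of it:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <time.h>
    #include <unistd.h>

    #define MSG_SIZE 8        /* 8 B payload, matching the latency experiment */
    #define ITERS    100000   /* hypothetical number of round trips           */

    int main(void)
    {
        /* Placeholder echo-server address; not the authors' test setup. */
        struct sockaddr_in peer = { .sin_family = AF_INET,
                                    .sin_port   = htons(5001) };
        inet_pton(AF_INET, "192.0.2.10", &peer.sin_addr);

        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one); /* avoid Nagle delays */
        if (connect(fd, (struct sockaddr *)&peer, sizeof peer) != 0) {
            perror("connect");
            return 1;
        }

        char buf[MSG_SIZE] = {0};
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++) {
            send(fd, buf, MSG_SIZE, 0);                  /* ping */
            ssize_t got = 0;
            while (got < MSG_SIZE) {                     /* wait for the full echo */
                ssize_t r = recv(fd, buf + got, MSG_SIZE - got, 0);
                if (r <= 0) { perror("recv"); return 1; }
                got += r;
            }
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double total_us = (t1.tv_sec - t0.tv_sec) * 1e6 +
                          (t1.tv_nsec - t0.tv_nsec) / 1e3;
        /* one-way latency = half the average round-trip time */
        printf("ping-pong latency: %.2f usec\n", total_us / ITERS / 2.0);
        close(fd);
        return 0;
    }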
Wednesday September 16