2018 IEEE High Performance Extreme Computing Conference (HPEC ’18), Twenty-second Annual HPEC Conference, 25-27 September 2018, Westin Hotel, Waltham, MA USA
Stripmap SAR Pulse Interleaved Scheduling John Terragnoli (Northeastern University); Miriam Leeser (Northeastern University)*; Paul Monticciolo (MIT Lincoln Laboratory) Stripmap SAR is a radar mode used to image terrain from an airborne platform by transmitting and receiving a series of electromagnetic pulses. Pulse interleaving attempts to execute two or more stripmap tasks simultaneously by transmitting pulses for additional tasks while waiting for the pulses from other tasks to return, where a task is simply an area on the ground to image. Prior research has done this by dividing the front end of the radar into separate sections and directing the energy into different beams, each pointing in a different direction or devoted to a separate area to image. This work instead focuses on utilizing a single beam for pulse interleaved scheduling, and identifies a method for creating a schedule using pulse interleaving when given a set of stripmap tasks. Scheduling is not done in real time, but is instead done before the schedule is executed. Interleaving is performed on multiple tasks overlapping in execution time in the following way. The pulse repetition frequencies (PRFs) of the tasks are altered within their allowable limits so that they match. Then, the transmitted pulses and return envelopes (the times when the pulses might return, based on the aircraft's distance to the target and the target's dimensions) are separated temporally by adding delays, where needed, to the tasks' transmitted pulses. Doing this removes the possibility of transmitting a pulse for one task while receiving the pulse from another, of needing to transmit for multiple tasks at the same time, or of receiving pulses from multiple tasks at the same time. This process allows multiple tasks to be scheduled to execute in the same block of time. To compare the results of the interleaved scheduler, several greedy algorithms in which no interleaving is permitted were also created.
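The two interleaving steps described above (match the tasks' PRFs, then delay transmit pulses so no transmit instant or return envelope collides) can be sketched as follows. This is an illustrative toy only: the task fields, the midpoint PRF choice, and the interval test are assumptions, not the paper's actual scheduler.

```python
# Sketch of pulse-interleaved scheduling feasibility (hypothetical fields/values).

def common_prf(tasks):
    """Intersect the tasks' allowable PRF ranges; return a shared PRF (Hz) or None."""
    lo = max(t["prf_min"] for t in tasks)
    hi = min(t["prf_max"] for t in tasks)
    return (lo + hi) / 2 if lo <= hi else None

def disjoint(a, b):
    """True when two (start, end) time windows within one PRI do not overlap."""
    return a[1] <= b[0] or b[1] <= a[0]

tasks = [{"prf_min": 1000.0, "prf_max": 1500.0},
         {"prf_min": 1200.0, "prf_max": 1800.0}]
prf = common_prf(tasks)                          # both tasks now share one PRI = 1/prf
tx_a, env_a = (0.0, 5e-5), (2.0e-4, 3.0e-4)      # task A: transmit pulse, return envelope
tx_b, env_b = (1.0e-4, 1.5e-4), (4.0e-4, 5.0e-4) # task B, after adding a delay
ok = all(disjoint(x, y) for x in (tx_a, env_a) for y in (tx_b, env_b))
print(prf, ok)  # 1350.0 True -> the two tasks can share the block of time
```

If `common_prf` returns None, the tasks' PRF limits cannot be reconciled and they cannot be interleaved in the same block.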
In every case, the interleaved scheduler significantly outperformed the greedy algorithms. Several scenarios were created to simulate different flight conditions as well as radars with varying duty-factor limits and output power capability, and the performance of the interleaved scheduler was consistent across all of them. This research demonstrates the benefits of pulse-level interleaving when scheduling multiple stripmap SAR tasks in a short period of time. Tangram: Colocating HPC Applications with Oversubscription Qingqing Xiong (Boston University); Emre Ates (Boston University)*; Martin Herbordt (Boston University); Ayse Coskun (Boston University) In a cluster shared by many users, jobs often wait in the queue for a significant amount of time. Much research has been done to reduce this time through scheduling, including aggressive backfilling strategies and sharing nodes among different jobs. Although most resources are shared to some extent in HPC clusters, it is somewhat surprising that a well-known technique used on commercial clouds, i.e., oversubscribing nodes so that CPU cores are shared among jobs, is rather rare; this is partially due to concerns about interference. This paper presents Tangram, a framework for colocating applications in HPC clusters. Tangram uses prior knowledge of applications, such as whether they are I/O- or CPU-intensive, to predict whether potential colocations improve overall performance. To predict with sufficient accuracy, Tangram uses a combination of performance counter measurements, knowledge of past colocation performance, and machine learning. We show that Tangram can choose colocations that reduce makespan by 19% on average and by 55% in the best case, while limiting the worst-case performance degradation caused by colocation from 1598% to 26%.
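The core colocation decision the Tangram abstract describes (use resource profiles to predict whether two jobs can share a node) can be illustrated with a toy rule. The rule below is a hand-written stand-in for Tangram's learned predictor, and the job features and threshold are assumptions:

```python
# Toy colocation predictor: pair jobs that stress different resources,
# keep jobs that stress the same resource apart (stand-in for a learned model).

def should_colocate(a, b, threshold=0.7):
    """Return True only when the two jobs are unlikely to interfere."""
    both_cpu = a["cpu_util"] > threshold and b["cpu_util"] > threshold
    both_io = a["io_rate"] > threshold and b["io_rate"] > threshold
    return not (both_cpu or both_io)

cpu_job = {"cpu_util": 0.95, "io_rate": 0.10}
io_job  = {"cpu_util": 0.20, "io_rate": 0.85}
print(should_colocate(cpu_job, io_job))   # True  -> complementary, oversubscribe
print(should_colocate(cpu_job, cpu_job))  # False -> would contend for cores
```

A real system would replace this rule with a model trained on performance counters and past colocation outcomes, as the abstract notes.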
Experiments on Data Center Participation in Demand Response Programs Yijia Zhang (Boston University)*; Ozan Tuncer (Boston University); Athanasios Tsiligkaridis (Boston University); Michael Caramanis (Boston University); Ioannis Paschalidis (Boston University); Ayse Coskun (Boston University) Regulation service reserves (RSR) are among the demand response programs in emerging power markets, and they are beneficial to the stability of the power grid. RSR requires participants to regulate their power consumption in near real time, following a signal broadcast every few seconds; in return, the participants' electricity cost is reduced as a reward. Data centers are good candidates to participate in RSR owing to their flexibility in managing power consumption and computational demand. In this paper, we demonstrate and evaluate data center participation in RSR through experiments on a 13-server cluster. We implement two regulation policies: one that primarily tracks the given power signal closely, and another that prioritizes quality of service (QoS) for the workloads running on the data center. With both policies achieving acceptable accuracy and good performance even on a small cluster, this paper points to a promising future in which full-scale data centers participate in demand response programs. Server-class devices for Space-Time Adaptive Processing Jonas Larsson (Mercury Systems, Inc.)*; Robert McGrail (Mercury Systems) Space-Time Adaptive Processing (STAP) is known to efficiently reduce the effects of clutter and jamming. Computing the adaptive weights in real time is an intensive process; the computational burden can be reduced by selecting the computations most suitable for the processing technology. Even when the most suitable approach is selected, the processing demand of a high-performance multi-channel STAP system can be substantial.
As such, previous generations of compute solutions were unable to meet size, weight, and power (SWaP) requirements, making STAP difficult to deploy. A third-order Doppler-factored STAP provides performance approaching that of a fully adaptive system, and it efficiently suppresses clutter and interference. The processing demand for this algorithm can be high, especially for a high channel count: the number of operations required to adaptively compute weights grows quadratically, so doubling either the channel count (L) or the processing order (Q) quadruples the operation rate. This study examines a system with 22 channels. This paper examines whether recent advances in technology, focusing on Intel Xeon D/E with AVX2 and Intel Xeon SP with AVX-512, can make it realistic for embedded applications to perform high-order STAP while still meeting size, weight, and power requirements. Dynamic Deployment of Communication Applications to Different Hardware Platforms using Ontological Representations Yanji Chen (Northeastern University); Mehmet Gungor (Northeastern University); Shweta Singh (Northeastern University); Alex Tazin (Northeastern University); Mieczyslaw Kokar (Northeastern University); Miriam Leeser (Northeastern University)* We consider the problem of mapping communications applications to heterogeneous hardware platforms dynamically based on available hardware. Our approach uses an ontology to specify different communications designs and their implementation in software and hardware. Making use of the ontology and rules, we automatically generate implementations of different communications applications for FPGA hardware, CPU software, or a hybrid design that mixes the two. Designs are specified in terms of “tasks” (or processing elements) and “conduits” (or connectors) for data transfer between tasks. Implementations use a library-based approach where tasks are pre-designed for different platforms.
Some tasks, such as controllers, are generated at run time and compiled for the appropriate target hardware. This style of specification-based deployment supports easy migration between target hardware platforms. We present a method that uses ontological definitions for automatic code generation, allowing dynamic deployment of the application to different target hardware. We demonstrate this approach on an application consisting of two tasks: (1) spectrum sensing and (2) transmission. An automatically generated state machine implements the control flow, i.e., it decides what frequency to transmit on depending on the available spectrum. Our results show that this approach generates high-quality designs with lower effort from the designer.
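A hand-written sketch of the kind of sense/transmit control flow the abstract's generated state machine implements might look like the following; the state names, busy-power threshold, and candidate frequency list are all assumptions for illustration, not the paper's generated code.

```python
# Toy sense-then-transmit controller: stay in SENSE until a candidate
# frequency measures below the busy threshold, then switch to TRANSMIT.

def run_controller(spectrum_samples, candidate_freqs, busy_db=-60.0):
    """Sense each candidate frequency; transmit on the first free one."""
    state = "SENSE"
    for freq in candidate_freqs:
        power = spectrum_samples.get(freq, float("-inf"))  # sensed power, dBm
        if power < busy_db:           # channel idle -> take the transition
            state = "TRANSMIT"
            return state, freq
    return state, None                # no free channel: keep sensing

samples = {2.412e9: -40.0, 2.437e9: -90.0}   # 2.412 GHz busy, 2.437 GHz idle
print(run_controller(samples, [2.412e9, 2.437e9]))  # ('TRANSMIT', 2437000000.0)
```

In the paper's flow this logic would be generated automatically from the ontological specification and compiled for the chosen target (FPGA, CPU, or hybrid) rather than written by hand.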
Parallel Real-Time Computing 10:20-12:00 in Eden Vale A3 Chair: David Cousins / BBN
Thursday, September 27, 2018