2015 IEEE High Performance
Extreme Computing Conference
(HPEC ’15)
Nineteenth Annual HPEC Conference
15 - 17 September 2015
Westin Hotel, Waltham, MA USA
Manycore Computing 2
3:00-4:40 in Eden Vale A1 - A2
Chair: David Cousins / BBN
Heterogeneous Work-stealing across CPU and DSP
Cores
Vivek Kumar, Alina Sbîrlea, Zoran Budimlic, Deepak
Majeti, Vivek Sarkar, Rice University
Due to increasing power constraints and ever-higher performance demands, many vendors have shifted their focus from designing high-performance compute nodes using powerful multi-core general-purpose CPUs to nodes containing a smaller number of general-purpose CPUs aided by a larger number of more power-efficient special-purpose processing units, such as GPUs, FPGAs, or DSPs. While offering a better power-to-performance ratio, such heterogeneous systems are unfortunately notoriously hard to program, forcing users to resort to low-level direct programming of the special-purpose processors and to manually manage data transfer and synchronization between the parts of the program running on general-purpose CPUs and special-purpose processors. In this paper, we present
HC-K2H, a programming model and runtime
system for the Texas Instruments Keystone II
Hawking platform, consisting of 4 ARM CPUs and
8 TI DSP processors. This System-on-a-Chip (SoC) offers a high rate of floating-point operations per second. We present the design and implementation of a hybrid
programming model and work-stealing runtime that
allows tasks to be created and executed on both the ARM and DSP cores, and enables the seamless execution and synchronization of tasks regardless of whether they run on the ARM or the DSP.
The design of our programming model and runtime
is based on the Habanero C programming system.
We evaluate our implementation using task-parallel
benchmarks on a Hawking board, and demonstrate
excellent scaling compared to sequential
implementations on a single ARM processor.
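The work-stealing idea behind such runtimes can be sketched in a few lines: each worker keeps its own double-ended task queue, pops tasks from its own end, and, when idle, steals from the opposite end of another worker's queue. The Python sketch below is a hypothetical illustration only (the actual HC-K2H runtime is built on Habanero C and the Keystone II hardware); the class and function names are invented.

```python
# Hypothetical work-stealing sketch (not the HC-K2H runtime): each worker
# owns a deque, pops tasks from its own tail, and steals from the head of
# another worker's deque when its own is empty.
import threading
from collections import deque

class WorkStealingPool:
    def __init__(self, num_workers=4):
        self.num_workers = num_workers
        self.deques = [deque() for _ in range(num_workers)]
        self.locks = [threading.Lock() for _ in range(num_workers)]
        self.results = []
        self.results_lock = threading.Lock()
        self.total_tasks = 0
        self.completed = 0

    def submit(self, worker_id, fn, *args):
        """Push a task onto one worker's deque (all tasks created up front)."""
        with self.locks[worker_id]:
            self.deques[worker_id].append((fn, args))
        self.total_tasks += 1

    def _next_task(self, wid):
        # Prefer local work: pop from our own tail (LIFO, cache-friendly).
        with self.locks[wid]:
            if self.deques[wid]:
                return self.deques[wid].pop()
        # Otherwise steal from another worker's head (FIFO, oldest task).
        for victim in range(self.num_workers):
            if victim == wid:
                continue
            with self.locks[victim]:
                if self.deques[victim]:
                    return self.deques[victim].popleft()
        return None

    def run(self):
        def worker(wid):
            while True:
                task = self._next_task(wid)
                if task is None:
                    with self.results_lock:
                        if self.completed == self.total_tasks:
                            return  # all work done, exit
                    continue
                fn, args = task
                value = fn(*args)
                with self.results_lock:
                    self.results.append(value)
                    self.completed += 1

        threads = [threading.Thread(target=worker, args=(w,))
                   for w in range(self.num_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return self.results

# All tasks start on worker 0; the other workers obtain work only by stealing.
pool = WorkStealingPool(num_workers=4)
for i in range(100):
    pool.submit(0, lambda x: x * x, i)
squares = pool.run()
print(sum(squares))  # same total as the sequential loop
```

Popping locally from the tail keeps recently created (cache-warm) tasks on their creating worker, while thieves take the oldest tasks from the head; this LIFO/FIFO split is the standard work-stealing design choice.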
Achieving Low Latency, Reduced Memory
Footprint and Low Power Consumption with Data
Streaming
Olivier Bockenbach, ContextVision, Murtaza Ali, Texas
Instruments, Ian Wainwright, High Performance
Consulting, Mark Nadeski, Texas Instruments
In addition to its patient-friendly properties, ultrasound imaging has become attractive because of its ability to provide images in real time. Low-latency implementations allow for fast scanning and a quick, precise diagnosis using medical imaging. This study presents a framework aimed at stream-based processing of images, with a twofold goal. The first goal is to keep the latency as low
as possible by processing the data as soon as
there are enough samples available. The second
goal is to reduce the required processing power
per image. To achieve these goals, the framework
allows several images to be processed
simultaneously, albeit in sequence, which makes it possible to exploit periods when the processor is not fully loaded. This study shows how the latency is kept to a strict minimum while the required processing power is reduced compared to a traditional image-based implementation. The application runs a temporal adaptive filter on a hardware platform based on a Digital Signal Processor (DSP).
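The sample-level streaming idea, namely starting to process data as soon as enough samples are available rather than buffering a complete image, can be sketched as below. This is a hypothetical Python illustration, not ContextVision's framework; the simple recursive temporal average merely stands in for the paper's temporal adaptive filter, and all names are invented.

```python
# Illustrative sketch: process ultrasound scanlines in arrival order and emit
# each filtered scanline immediately, so latency is one scanline, not one
# full frame. A recursive temporal average stands in for the real filter.

def stream_filter(scanline_stream, alpha=0.5):
    """Consume (frame_id, line_id, samples) tuples in arrival order and
    yield filtered scanlines right away, keeping only per-line state."""
    state = {}  # line_id -> previously filtered samples for that line
    for frame_id, line_id, samples in scanline_stream:
        prev = state.get(line_id)
        if prev is None:
            filtered = list(samples)  # first frame: nothing to blend with
        else:
            # Blend the new samples with the previous frame's filtered line.
            filtered = [alpha * s + (1 - alpha) * p
                        for s, p in zip(samples, prev)]
        state[line_id] = filtered
        yield frame_id, line_id, filtered

# Two 3-line "frames" arriving line by line.
stream = [(f, l, [float(f + l)] * 4) for f in range(2) for l in range(3)]
out = list(stream_filter(stream))
print(len(out))  # one filtered scanline per input scanline
```

Because only one scanline of state per line position is kept, several frames can be in flight at once while the memory footprint stays far below a whole-image buffer, which is the footprint and latency argument the abstract makes.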
Embedded Second-Order Cone Programming with
Radar Applications
Paul Mountcastle, Tom Henretty, Aale Naqvi, Richard
Lethin, Reservoir Labs
Second-order cone programming (SOCP) is
required for the solution of under-determined
systems of linear equations with complex
coefficients, subject to the minimization of a
convex objective function. This type of
computational problem appears in compressed
radar sensing, where the goal is to reconstruct a
sparse image in a projective space whose
dimension is higher than the number of complex
measurements. In order to enforce sparsity in the
final rectified radar image, the sum of moduli of a
complex vector, called the L1-norm, must be
minimized. This norm differs from what is ordinarily
encountered in compressed sensing for digital
photographic data and video, in that the convex
optimization that must be performed involves an
SOCP rather than a linear program. We illustrate
the role of this type of optimization in radar signal
processing by means of examples. The examples
point to a significant generalization that
encompasses and unifies a wide class of radar
signal processing algorithms that can be
implemented in software by means of SOCP
solvers. Finally, we show how modern SOCP
solvers are optimized for efficient solution of these
problems in the context of embedded signal
processing on small autonomous platforms.
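Concretely, the sparse reconstruction problem the abstract describes can be posed as an SOCP by introducing one epigraph variable per complex coefficient; this is the standard reformulation, written here for an under-determined system Ax = b:

```latex
\begin{aligned}
\min_{x \in \mathbb{C}^n,\; t \in \mathbb{R}^n} \quad & \sum_{i=1}^{n} t_i \\
\text{subject to} \quad & A x = b, \\
& \sqrt{\operatorname{Re}(x_i)^2 + \operatorname{Im}(x_i)^2} \le t_i,
\qquad i = 1, \dots, n.
\end{aligned}
```

At the optimum each t_i equals |x_i|, so the objective is exactly the L1-norm (the sum of moduli). Each modulus bound is a second-order (Lorentz) cone constraint on the triple (Re x_i, Im x_i, t_i); it is these cone constraints that make the complex-valued problem an SOCP rather than the linear program that suffices for real-valued compressed sensing of photographic data.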
Efficient Parallelization of Path Planning Workload
on Single-chip Shared-memory Multicores
Masab Ahmad, Omer Khan, University of Connecticut
Path planning problems arise in many applications where the objective is to find the shortest path from a given source to a destination. In this paper, we compare programming languages in the context of parallel workload analysis. We implement and characterize parallel versions of path planning algorithms, such as Dijkstra's algorithm, across the C/C++ and
Python languages. Programming language
comparisons are done for a single-socket real
machine setup over shared memory to analyze
fine-grained scalability and efficiency. Our results show that the right parallelization strategy for path planning yields scalability for C/C++ codes executing on a commercial multicore CPU. However, several shortcomings in Python's support for parallelism must be accounted for by HPC researchers.
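As a point of reference, sequential Dijkstra in Python takes only a few lines with the standard-library heapq; the function and graph below are invented for illustration and are not the paper's implementation. A thread-parallel version of this code would gain little in CPython, since the Global Interpreter Lock serializes interpreter execution, which is one concrete instance of the shortcomings noted above.

```python
# Sequential Dijkstra sketch using Python's standard-library binary heap.
import heapq

def dijkstra(adj, source):
    """adj: {node: [(neighbor, weight), ...]}.
    Returns shortest distances from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry for an already-improved node
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2), ("D", 6)],
    "C": [("D", 3)],
}
print(dijkstra(graph, "A"))  # shortest A->D path costs 6, via B and C
```

The shared priority queue and the fine-grained distance updates are exactly the structures that make shared-memory parallelization of this algorithm non-trivial in any language.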
Monte Carlo Simulations on Intel Xeon Phi: Offload and Native Mode
Bryar M. Shareef, Elise de Doncker, Western Michigan
University
In high performance computing, Monte Carlo
methods are widely used to solve problems in
various areas of computational physics, finance,
mathematics, electrical engineering and many
other fields. We present Monte Carlo methods for
the Intel Xeon Phi coprocessor, to compute
integrals for applications in high energy physics
and in stochastic geometry. The Intel Xeon Phi is
based on a Many Integrated Core (MIC)
architecture to gain extreme performance. We use
two modes, "offload" and "native", to implement the
simulations. In offload mode, the main program
resides on the host system and supporting
functions are executed on the MIC; in native mode,
the program is fully executed on the MIC card. We
compare the parallel performance of our
applications running on Intel Xeon Phi, in terms of
time and speedup, with a sequential execution on
the CPU. In addition, the applications are implemented in both single and double precision.
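A minimal host-side sketch of such an estimator, assuming NumPy and using an invented function name, runs the same Monte Carlo kernel in float32 and float64 to mirror the single/double precision comparison; the true value of the integral of x^2 over [0, 1] is 1/3.

```python
# Plain Monte Carlo estimate of the integral of x^2 over [0, 1], run in both
# single and double precision. This is a host-side analogue for illustration,
# not the paper's Xeon Phi offload or native kernels.
import numpy as np

def mc_integral(n, dtype, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.random(n, dtype=dtype)    # uniform samples in [0, 1)
    return (x * x).mean(dtype=dtype)  # sample mean approximates the integral

n = 1_000_000
est32 = mc_integral(n, np.float32)   # single precision
est64 = mc_integral(n, np.float64)   # double precision
print(abs(est64 - 1/3) < 1e-2)  # True: estimate is close to 1/3
```

With a million samples the standard error is on the order of 3e-4, so both precisions land well within 1e-2 of 1/3; differences between the two appear only at much tighter tolerances, which is the kind of precision trade-off the paper's single/double designs expose.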
Wednesday September 16