2015 IEEE High Performance
Extreme Computing Conference
(HPEC ‘15)
Nineteenth Annual HPEC Conference
15 - 17 September 2015
Westin Hotel, Waltham, MA USA
Manycore Computing 1
1:00-2:40 in Eden Vale A1 - A2
Chair: Patrick Dreher / MIT
[Best Paper Finalist]
Boosting Irregular Array Reductions through In-lined Block-ordering on Fast Processors
Jan Ciesko, Sergi Mateo, Xavier Teruel, Vicenc Beltran,
Xavier Martorell, Jesus Labarta, Barcelona
Supercomputing Center
Array-type reductions represent a frequently
occurring algorithmic pattern in many scientific
applications. A special case occurs if array
elements are accessed in an irregular, often
random manner, making their concurrent and
scalable execution difficult. In this work we
present a new approach that consists of
language and runtime support and targets
popular parallel programming models such as
OpenMP. Its runtime support implements
Privatization with In-lined, Block-ordered
Reductions (PIBOR), a new approach that trades
processor cycles to increase locality and
bandwidth efficiency for such algorithms. A
reference implementation in OmpSs, a task-
parallel programming model, shows promising
results on current multi-core systems.
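The privatization idea underlying approaches such as PIBOR can be illustrated with a minimal sketch (a generic privatized irregular reduction, not the authors' implementation): each thread accumulates into a private copy of the result array, and the private copies are merged afterward, so no two threads ever update the same shared element concurrently.

```python
import random
import threading

def irregular_reduction(indices, values, size, nthreads=4):
    """Privatized irregular array reduction: each thread owns a
    private copy of the result array; copies are merged at the end."""
    privates = [[0.0] * size for _ in range(nthreads)]

    def worker(tid):
        # Each thread processes a contiguous slice of the (index, value)
        # pairs but scatters into arbitrary positions of its private array.
        chunk = len(indices) // nthreads
        lo = tid * chunk
        hi = len(indices) if tid == nthreads - 1 else lo + chunk
        mine = privates[tid]
        for k in range(lo, hi):
            mine[indices[k]] += values[k]

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Merge step: sum the private copies element-wise.
    result = [0.0] * size
    for p in privates:
        for i, v in enumerate(p):
            result[i] += v
    return result

# Example: histogram-style reduction with random (irregular) indices.
random.seed(0)
idx = [random.randrange(8) for _ in range(1000)]
hist = irregular_reduction(idx, [1.0] * 1000, 8)
assert sum(hist) == 1000.0
```

The trade-off named in the abstract is visible here: the merge step costs extra cycles, but the scatter phase touches only thread-private memory, improving locality and avoiding atomic updates.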
[Best Paper Finalist]
MAGMA Embedded: Towards a Dense Linear
Algebra Library for Energy Efficient Extreme
Computing
Azzam Haidar, Stanimire Tomov, Piotr Luszczek, Jack
Dongarra, University of Tennessee Knoxville
Embedded computing, not only in large systems
like drones and hybrid vehicles, but also in small
portable devices like smart phones and watches,
is becoming more extreme to meet ever-increasing
demands for extended and improved
functionality. This, combined with the typical
constraints of low power consumption and small
size, makes the design of numerical libraries for
embedded systems challenging. In this paper,
we present the design and implementation of
embedded-system-aware algorithms that target
these challenges in the area of dense linear
algebra. We consider the fundamental problems
of solving linear systems of equations and least
squares problems, using the LU, QR, and
Cholesky factorizations, and illustrate our results,
both in terms of performance and energy
efficiency, on the Jetson TK1 development kit. We
developed performance optimizations for both
small and large problems. In contrast to the
corresponding LAPACK algorithms, the new
designs target the use of many-cores, readily
available now even in mobile devices like the
Jetson TK1, e.g., featuring 192 CUDA cores.
The implementations presented will form the core
of a MAGMA Embedded library, to be released
as part of the MAGMA libraries.
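As a point of reference for the factorizations discussed, the unblocked Cholesky factorization can be sketched as follows (this is the textbook algorithm, not MAGMA's optimized kernel; libraries like MAGMA restructure it into blocked, many-core-friendly variants):

```python
import math

def cholesky(A):
    """Unblocked Cholesky factorization of a symmetric positive-definite
    matrix A (list of lists). Returns lower-triangular L with A = L L^T.
    Textbook O(n^3/3) algorithm."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract squares of previously computed entries.
        s = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(s)
        # Column below the diagonal.
        for i in range(j + 1, n):
            L[i][j] = (A[i][j]
                       - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

# Example: factor a small SPD matrix.
L = cholesky([[4.0, 2.0], [2.0, 3.0]])
```

The column-by-column dependency visible in the loop is what blocked designs reorganize so that most of the work lands in large, parallel matrix-matrix updates suitable for many-core devices.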
[Best Paper Finalist]
Optimizing Space Time Adaptive Processing
Through Accelerating Memory-bounded
Operations
Tze Meng Low, Qi Guo, Franz Franchetti, Carnegie
Mellon University
Space-Time Adaptive Processing (STAP) is a
technique for processing signals from multiple
antenna elements over multiple time periods for
target detection. As STAP algorithms are typically
run on airborne platforms, they need to be both
high performance and energy-efficient. Due to the
high rate of processing required, many existing
algorithms focus on reducing the dimensionality
of the data, or exploiting structure in the
underlying mathematical formulation in order to
reduce the total number of floating-point
operations (FLOPs), and consequently, the time
for computation. While such algorithms target the
FLOPs-intensive operations within the STAP
algorithm, a significant portion of the compute
time for most STAP algorithms is actually spent in
low-FLOPs, memory-bounded operations. In this
paper, we address the computation of these
memory-bounded operations within the STAP
algorithm using a 3D stacked Logic-in-Memory
system. The imminent arrival of 3D stacked
memory makes high memory bandwidth
available, which opens up a new and orthogonal
dimension for optimizing STAP algorithms. We
show that a more than 11x improvement in time
and a 77x improvement in energy efficiency can be
expected when a 3D stack is used together with
memory-side accelerators to target the memory-
bounded operations within STAP.
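Whether an operation is memory-bound can be judged from its arithmetic intensity relative to the machine balance; the following back-of-the-envelope helper sketches this standard roofline reasoning with illustrative hardware numbers, not figures from the paper:

```python
def is_memory_bound(flops, bytes_moved, peak_gflops, peak_gbs):
    """Roofline-style check: an operation is memory-bound when its
    arithmetic intensity (FLOPs per byte) falls below the machine
    balance (peak FLOP rate / peak memory bandwidth)."""
    intensity = flops / bytes_moved   # FLOPs per byte of traffic
    balance = peak_gflops / peak_gbs  # FLOPs per byte at the ridge point
    return intensity < balance

# Example: a dot product of n doubles does 2n FLOPs and reads 16n bytes,
# giving intensity 0.125 -- memory-bound on hypothetical hardware with
# 1000 GFLOP/s and 100 GB/s (machine balance 10 FLOPs/byte).
n = 1_000_000
print(is_memory_bound(2 * n, 16 * n, peak_gflops=1000, peak_gbs=100))  # True
```

Raising memory bandwidth, as 3D stacked memory does, lowers the machine balance and so moves such low-FLOPs operations toward the compute-bound regime.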
[Best Student Paper Finalist]
A Near-Real-Time, Parallel and Distributed
Adaptive Object Detection and Re-training
Framework based on AdaBoost Algorithm
Munther Abualkibash, Ausif Mahmood, Saeid
Moslehpour, University of Bridgeport
Object detection (e.g., face detection) using
supervised learning often requires extensive
training, resulting in long execution times. If the
system requires retraining to accommodate a
missed detection, waiting several hours, or even
days in some cases, before the system is ready
may not be acceptable in practical
implementations. This paper presents a
generalized object detection framework such that
the system can efficiently adapt to misclassified
data and be retrained within a few minutes. The
methodology developed here is based on the
popular AdaBoost algorithm for object detection.
To reduce the learning time in object detection,
we develop a highly efficient, parallel, and
distributed AdaBoost algorithm that is able to
achieve a training execution time of only 1.4
seconds per feature on 25 workstations. Further,
we incorporate this parallel object detection
algorithm into an adaptive framework such that
a much smaller, optimized training subset is
used to yield high detection rates while further
reducing the retraining execution time. We
demonstrate the usefulness of our adaptive
framework on face and car detection.
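The boosting loop that such training parallelizes can be sketched as follows. This is a generic AdaBoost over one-dimensional threshold stumps, not the authors' distributed implementation; in their setting, the expensive weak-learner search is what is distributed across the 25 workstations.

```python
import math

def train_stump(xs, ys, weights):
    """Find the (threshold, polarity) stump minimizing weighted error.
    In a distributed AdaBoost, this per-feature search is the step
    that is parallelized across workers."""
    best = (float("inf"), 0.0, 1)  # (error, threshold, polarity)
    for thr in sorted(set(xs)):
        for pol in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (pol if x >= thr else -pol) != y)
            if err < best[0]:
                best = (err, thr, pol)
    return best

def adaboost(xs, ys, rounds=5):
    """AdaBoost on 1-D data with labels in {-1, +1}.
    Returns a list of weighted weak learners (alpha, threshold, polarity)."""
    n = len(xs)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, thr, pol = train_stump(xs, ys, weights)
        err = max(err, 1e-10)  # avoid division by zero on perfect stumps
        alpha = 0.5 * math.log((1.0 - err) / err)
        model.append((alpha, thr, pol))
        # Reweight: misclassified points gain weight, correct ones lose it.
        weights = [w * math.exp(-alpha * y * (pol if x >= thr else -pol))
                   for x, y, w in zip(xs, ys, weights)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return model

def predict(model, x):
    s = sum(alpha * (pol if x >= thr else -pol) for alpha, thr, pol in model)
    return 1 if s >= 0 else -1

# Toy example: points below 5 are labeled -1, at or above 5 are +1.
model = adaboost([1.0, 2.0, 3.0, 6.0, 7.0, 8.0], [-1, -1, -1, 1, 1, 1],
                 rounds=3)
```

The reweighting step is also what the adaptive retraining framework exploits: misclassified samples dominate the weight distribution, so a much smaller, targeted training subset suffices.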
Implementing Image Processing Algorithms for
the Epiphany Many-Core Coprocessor with
Threaded MPI
James Ross, U.S. Army Research Laboratory, David
Richie, Brown Deer Technology, Song Park, U.S. Army
Research Laboratory, Dale Shires, U.S. Army Research
Laboratory
The Adapteva Epiphany MIMD architecture is a
scalable 2D array of RISC cores with minimal
uncore functionality connected with a fast 2D mesh
Network-on-Chip (NoC). Each mesh node
contains a RISC CPU core, 32 KB of shared local
memory, a mesh network interface, and a dual-
channel DMA engine. The 16-core Epiphany III
coprocessor has been integrated into the
Parallella minicomputer platform where the RISC
array is supported by a dual-core ARM CPU and
asymmetric shared-memory access to off-chip
global memory. Peak single-precision
performance for the Epiphany III is 19.2 GFLOPS
with an energy efficiency of 32.3 GFLOPS per
watt. The raw performance of the Epiphany III is
relatively low compared to modern high-
performance CPUs and GPUs; however, the
Epiphany architecture provides greater energy
efficiency and is designed to be highly scalable.
The published road map specifies a scale-out of
the existing architecture to exceed 1,000 cores in
the near future. Within this context it is a
competitive processor technology comparable to
other emerging architectures. Processors based
on this architecture exhibit good energy efficiency
and scalability via the 2D mesh network, but
require a suitable programming model to fully
exploit the architecture. Key to performance with
the Epiphany architecture is data re-use,
requiring precise control of inter-core
communication since the architecture does not
provide a hardware cache at any level. The cores
can access off-chip mapped memory with a
significant performance penalty in both latency
and bandwidth relative to accessing neighboring
core memory. In previous work we have
demonstrated an efficient parallel programming
model for the Epiphany architecture based on the
Message Passing Interface (MPI) standard.
Using MPI exploits the similarities between the
Epiphany architecture and a conventional parallel
distributed cluster. Our approach enables MPI
code to execute on the RISC array processor with
little modification and achieve high performance.
For the Epiphany architecture, the MPI
programming model is a better choice than
APIs designed for SMP processors, such as
OpenMP and OpenCL, since the latter lack good
semantics for controlling inter-core data
movement, which is critical to achieving high
performance for anything but
trivially parallel applications on this processor.
Threaded MPI was developed to provide an
extremely lightweight implementation of MPI
appropriate for threads executing within the
restricted context of the Epiphany RISC cores.
Threaded MPI is distinguished from conventional
MPI implementations by two critical differences,
driven by the fact that the device must be accessed
as a coprocessor and each core executes
threads within a highly constrained set of
resources. As a result, the cores are not capable
of supporting a full process image or program in
the conventional sense, and therefore the
conventional MPI model of associating MPI
processes to concurrently executing programs is
not possible. Instead, coprocessor offload
semantics must be used to launch concurrent
threads that will then employ conventional MPI
semantics for inter-thread communication.
Threaded MPI has exhibited the highest
performance reported to date for non-trivially
parallel algorithms using a standard programming
model for the Epiphany architecture. We apply
the threaded MPI programming for image
processing kernels including a 2D Fast Fourier
Transform (FFT) with high-pass filter for edge
detection, local operators for Gaussian blur and a
Sobel filter, Canny edge detection, and Harris
corner detection operations. Conventional MPI
parallelization is employed in the
implementations, demonstrating the applicability
of this parallel programming model for the
Epiphany architecture. Benchmark performance
is analyzed for understanding the relative
performance of computation and communication.
The impact of the results on performance
projections is discussed for RISC arrays on the
current Epiphany roadmap scaled to thousands of
cores.
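One of the local operators mentioned above, the Sobel filter, can be sketched serially as follows (a generic reference version, not the Epiphany/MPI implementation; on the Epiphany, each core would process one tile of the image and exchange halo rows with its mesh neighbors via MPI messages):

```python
import math

# 3x3 Sobel kernels for horizontal and vertical gradients.
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel(image):
    """Gradient magnitude via the Sobel operator on a grayscale image
    (list of lists). Border pixels are left at zero for simplicity."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(GX[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(GY[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = math.hypot(gx, gy)
    return out

# Example: a vertical step edge produces a strong response at the edge.
edges = sobel([[0, 0, 10, 10]] * 4)
```

Because each output pixel needs only a 3x3 neighborhood, the communication per tile is limited to a one-pixel halo, which is the kind of precisely controlled inter-core data movement the abstract identifies as the key to performance on this architecture.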
Wednesday September 16