2018 IEEE High Performance
Extreme Computing Conference
(HPEC ‘18)
Twenty-second Annual HPEC Conference
25 - 27 September 2018
Westin Hotel, Waltham, MA USA
Computationally Efficient CP Tensor Decomposition Update Framework for Emerging Component Discovery in
Streaming Data
Pierre-David Letourneau (Reservoir Labs, Inc.)*; Muthu M Baskaran (Reservoir Labs); Thomas Henretty (Reservoir Labs);
Richard Lethin (Reservoir Labs); James Ezick (Reservoir Labs)
We present Streaming CP Update, an algorithmic framework for updating CP tensor decompositions that can identify
emerging components and produce decompositions of large, sparse tensors streaming along multiple modes at low
computational cost. We discuss a large-scale implementation of the proposed scheme integrated
within the ENSIGN tensor analysis package, and we evaluate and demonstrate the performance of the framework, in terms of
computational efficiency and ability to discover emerging components, on a real cyber dataset.
Interactive Supercomputing on 40,000 Cores for Machine Learning and Data Analysis
Albert Reuther (MIT Lincoln Laboratory)*; Chansup Byun (MIT Lincoln Laboratory); Jeremy Kepner (MIT Lincoln Laboratory);
Andrew Prout (MIT); Michael S Jones (MIT Lincoln Laboratory)
Interactive massively parallel computations are critical for machine learning and data analysis. These computations are a
staple of the MIT Lincoln Laboratory Supercomputing Center (LLSC) and have required the LLSC to develop unique interactive
supercomputing capabilities. Scaling interactive machine learning frameworks, such as TensorFlow, and data analysis
environments, such as MATLAB/Octave, to tens of thousands of cores presents many technical challenges – in particular,
rapidly dispatching many tasks through a scheduler, such as Slurm, and starting many instances of applications with
thousands of dependencies. Careful tuning of launches and prepositioning of applications overcome these challenges and
allow the launching of thousands of tasks in seconds on a 40,000-core supercomputer. Specifically, this work demonstrates
launching 32,000 TensorFlow processes in 4 seconds and launching 262,000 Octave processes in 40 seconds. These
capabilities allow researchers to rapidly explore novel machine learning architectures and data analysis algorithms.
Sparse Dual of the Density Peaks Algorithm for Cluster Analysis of High-dimensional Data
Dimitris Floros (Aristotle University of Thessaloniki)*; Tiancheng Liu (Duke University); Nikos P Pitsianis (Aristotle University of
Thessaloniki); Xiaobai Sun (Duke University)
The density peaks (DP) algorithm for cluster analysis, introduced by Rodriguez and Laio in 2014, has proven empirically
competitive or superior in multiple aspects to other contemporary clustering algorithms. Yet, it suffers from certain drawbacks
and limitations when used for clustering high-dimensional data. We introduce SD-DP, the sparse dual version of DP. While
following the DP principle and maintaining its appealing properties, we find and use a sparse descriptor of local density as a
robust representation. By analyzing and exploiting the consequential properties, we are able to use sparse graph-matrix
expressions and operations throughout the clustering process. As a result, SD-DP has provably linear-scaling computational
complexity under practical conditions. We show with experimental results on several real-world high-dimensional datasets
that SD-DP outperforms DP in robustness, accuracy, self-governance, and efficiency.
AC922 Data Movement for CORAL
Steven L Roberts (IBM)*
Recent publications have considered the challenge of data movement in and out of high-bandwidth memory in an attempt to
maximize GPU utilization and minimize overall application wall time. Previous work has identified challenges, simulated
software models, advocated optimizations, and suggested design considerations. This contribution characterizes the data
movement innovations of the AC922 nodes IBM delivered to Oak Ridge National Laboratory and Lawrence Livermore National
Laboratory as part of the 2014 Collaboration of Oak Ridge, Argonne, and Livermore (CORAL) joint procurement activity. With
200 PF of processing capability and access to 2.5 PB of memory, these systems motivate a careful look at data movement. The
AC922 POWER9 system with NVIDIA GV100 GPUs and Mellanox CAPI/EDR HCAs offers cache-line granularity, more than double the
bandwidth of PCIe Gen3, and low-latency interfaces. As such, the bandwidth and latency assumptions from previous
simulations should be revisited and compared against characterization results on production hardware. Our characterization
approach leverages existing performance methodologies, where applicable, to ease comparison and correlation. The results
show that it is possible to design efficient, logically coherent heterogeneous systems, and they refocus our attention on the
interconnect between processor elements.
Thursday, September 27, 2018
High Performance Data Analysis 1
10:20 - 12:00 in Eden Vale C1/C2
Chair: Vijay Gadepally / MIT