2018 IEEE High Performance Extreme Computing Conference (HPEC ‘18) Twenty-second Annual HPEC Conference 25 - 27 September 2018 Westin Hotel, Waltham, MA USA
Computationally Efficient CP Tensor Decomposition Update Framework for Emerging Component Discovery in Streaming Data
Pierre-David Letourneau (Reservoir Labs, Inc.)*; Muthu M Baskaran (Reservoir Labs); Thomas Henretty (Reservoir Labs); Richard Lethin (Reservoir Labs); James Ezick (Reservoir Labs)
We present Streaming CP Update, an algorithmic framework for updating CP tensor decompositions that can identify emerging components and produce decompositions of large, sparse tensors streaming along multiple modes at low computational cost. We discuss a large-scale implementation of the proposed scheme integrated within the ENSIGN tensor analysis package, and we evaluate and demonstrate the performance of the framework, in terms of computational efficiency and ability to discover emerging components, on a real cyber dataset.

Interactive Supercomputing on 40,000 Cores for Machine Learning and Data Analysis
Albert Reuther (MIT Lincoln Laboratory)*; Chansup Byun (MIT Lincoln Laboratory); Jeremy Kepner (MIT Lincoln Laboratory); Andrew Prout (MIT); Michael S Jones (MIT Lincoln Laboratory)
Interactive massively parallel computations are critical for machine learning and data analysis. These computations are a staple of the MIT Lincoln Laboratory Supercomputing Center (LLSC) and have required the LLSC to develop unique interactive supercomputing capabilities. Scaling interactive machine learning frameworks, such as TensorFlow, and data analysis environments, such as MATLAB/Octave, to tens of thousands of cores presents many technical challenges, in particular rapidly dispatching many tasks through a scheduler, such as Slurm, and starting many instances of applications with thousands of dependencies. Careful tuning of launches and prepositioning of applications overcome these challenges and allow the launching of thousands of tasks in seconds on a 40,000-core supercomputer.
Specifically, this work demonstrates launching 32,000 TensorFlow processes in 4 seconds and launching 262,000 Octave processes in 40 seconds. These capabilities allow researchers to rapidly explore novel machine learning architectures and data analysis algorithms.

Sparse Dual of the Density Peaks Algorithm for Cluster Analysis of High-dimensional Data
Dimitris Floros (Aristotle University of Thessaloniki)*; Tiancheng Liu (Duke University); Nikos P Pitsianis (Aristotle University of Thessaloniki); Xiaobai Sun (Duke University)
The density peaks (DP) algorithm for cluster analysis, introduced by Rodriguez and Laio in 2014, has proven empirically competitive with, or superior to, other contemporary clustering algorithms in multiple respects. Yet it suffers from certain drawbacks and limitations when used for clustering high-dimensional data. We introduce SD-DP, the sparse dual version of DP. While following the DP principle and maintaining its appealing properties, we find and use a sparse descriptor of local density as a robust representation. By analyzing and exploiting the consequential properties, we are able to use sparse graph-matrix expressions and operations throughout the clustering process. As a result, SD-DP has provably linear-scaling computational complexity under practical conditions. We show with experimental results on several real-world high-dimensional datasets that SD-DP outperforms DP in robustness, accuracy, self-governance, and efficiency.

AC922 Data Movement for CORAL
Steven L Roberts (IBM)*
Recent publications have considered the challenge of moving data in and out of high-bandwidth memory in an attempt to maximize GPU utilization and minimize overall application wall time. Previous work has identified challenges, simulated software models, advocated optimizations, and suggested design considerations.
This contribution characterizes the data movement innovations of the AC922 nodes IBM delivered to Oak Ridge National Laboratory and Lawrence Livermore National Laboratory as part of the 2014 Collaboration of Oak Ridge, Argonne, and Livermore (CORAL) joint procurement activity. With 200 PF of processing capability and access to 2.5 PB of memory, these systems motivate a careful look at data movement. The AC922 POWER9 system with NVIDIA GV100 GPUs and Mellanox CAPI/EDR HCAs has cache-line granularity, more than double the bandwidth of PCIe Gen3, and low-latency interfaces. As such, the bandwidth and latency assumptions from previous simulations should be revisited and compared against characterization results on production hardware. Our characterization approach leverages existing performance methodologies, as applicable, to ease comparison and correlation. The results show that it is possible to design efficient, logically coherent heterogeneous systems, and it refocuses our attention on the interconnect between processor elements.
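The density peaks principle that SD-DP builds on can be sketched briefly. The following is a minimal pure-Python illustration of the classic Rodriguez-Laio decision rule, not the SD-DP method itself; the Gaussian density estimate, the cutoff `d_c`, the fixed center count, and the toy data are all illustrative assumptions.

```python
# Minimal sketch of the density peaks (DP) clustering rule of Rodriguez
# and Laio (2014). The Gaussian kernel, cutoff d_c, and fixed number of
# centers are illustrative choices, not details from the SD-DP paper.
import math

def density_peaks(points, d_c=2.0, n_centers=2):
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Local density rho_i: Gaussian-kernel sum over all other points.
    rho = [sum(math.exp(-(dist[i][j] / d_c) ** 2)
               for j in range(n) if j != i) for i in range(n)]
    # delta_i: distance to the nearest point of strictly higher density
    # (for the global density peak, the largest distance from it).
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    # Cluster centers: points with anomalously large rho_i * delta_i.
    order = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)
    label = {c: k for k, c in enumerate(order[:n_centers])}
    # Assign remaining points, in order of decreasing density, to the
    # cluster of their nearest higher-density neighbor (density descent).
    for i in sorted(range(n), key=lambda i: rho[i], reverse=True):
        if i not in label:
            nn = min((j for j in range(n) if rho[j] > rho[i]),
                     key=lambda j: dist[i][j], default=order[0])
            label[i] = label.get(nn, 0)
    return [label[i] for i in range(n)]
```

The all-pairs distance and density computations above are quadratic in the number of points; SD-DP's contribution is to replace these dense steps with sparse graph-matrix operations, yielding the linear-scaling complexity claimed in the abstract.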
Thursday, September 27, 2018
High Performance Data Analysis 1 10:20 - 12:00 in Eden Vale C1/C2 Chair: Vijay Gadepally / MIT