2018 IEEE High Performance Extreme Computing Conference (HPEC '18), Twenty-second Annual HPEC Conference, 25-27 September 2018, Westin Hotel, Waltham, MA USA
Thursday, September 27, 2018
Machine Learning 2 1:00-2:40 in Eden Vale A1/2 Chair: Sadasivan Shankar / Harvard

Too Many Secants: A Hierarchical Approach to Secant-based Dimensionality Reduction on Large Data Sets
Henry Kvinge (Colorado State University)*; Elin R. Farnell (Colorado State University); Michael Kirby (Colorado State University); Chris Peterson (Colorado State University)
A fundamental question in many data analysis settings is discerning the "natural" dimension of a data set. That is, when a data set is drawn from a manifold (possibly with noise), a meaningful aspect of the data is the dimension of that manifold. Various approaches exist for estimating this dimension, such as the Secant-Avoidance Projection (SAP) method. Intuitively, the SAP algorithm seeks the projection that best preserves the lengths of all secants between points in a data set; by applying the algorithm to find the best projections onto vector spaces of various dimensions, one may infer the dimension of the manifold of origin. That is, one may learn the dimension at which it is possible to construct a diffeomorphic copy of the data in a lower-dimensional Euclidean space. Whitney's embedding theorem then relates this information to the natural dimension of the data. A drawback of the SAP algorithm is that a data set with $T$ points has $O(T^2)$ secants, making the computation and storage of all secants infeasible for very large data sets. In this paper, we propose a novel algorithm that generalizes SAP with an emphasis on addressing this issue: a hierarchical secant-based dimensionality-reduction method that can be employed on data sets for which explicitly calculating all secants is not feasible.
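To make the secant-preservation criterion concrete, the following is a minimal Python sketch (assuming NumPy, using a synthetic data set, and using a PCA projection as a stand-in for the secant-avoiding projection that SAP itself computes). For each candidate target dimension k it measures how badly the worst secant shrinks; the smallest k at which no secant collapses indicates a dimension in which a faithful copy of the data can live.

```python
# Illustrative sketch (not the authors' SAP implementation): check how well
# secant lengths survive projection to k dimensions, for a synthetic data set.
import numpy as np

rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, 200)
# A 1-D manifold (circle) mapped linearly into R^5.
X = np.stack([np.cos(t), np.sin(t)], axis=1) @ rng.standard_normal((2, 5))

# All pairwise secants (feasible only for small T; the paper targets the large-T case).
diffs = X[:, None, :] - X[None, :, :]
iu = np.triu_indices(len(X), k=1)
secants = diffs[iu]                               # shape (T*(T-1)/2, 5)
lengths = np.linalg.norm(secants, axis=1)

# Project onto the top-k principal directions and report worst-case shrinkage.
_, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
for k in range(1, 6):
    proj_lengths = np.linalg.norm(secants @ Vt[:k].T, axis=1)
    print(k, (proj_lengths / lengths).min())      # ratio near 1 once k is large enough
```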
Regression Based WCET Analysis For Sampling Based Motion Planning
Hao Wen (Virginia Commonwealth University); Wei Zhang (Virginia Commonwealth University)*
Motion planning is one of the most critical tasks in a self-driving vehicle system. Sampling-based motion planning has earned popularity due to its capability of providing quick and effective answers to planning queries. Since motion planning is a safety-critical piece of software, it is important to know the Worst-Case Execution Time (WCET) of this task in the system. Traditional static WCET analysis techniques do not consider the dynamic behavior of the interaction between the sampling algorithm and the environment. Measurement-based WCET estimation focuses on an individual task and therefore has no predictive capability when the start and goal positions change. We propose regression models to predict a safe upper bound on the WCET of the Rapidly-Exploring Random Tree (RRT) algorithm, a widely used sampling-based motion planning algorithm.
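As a rough illustration of the regression-based approach (the features, measurements, and padding rule below are assumptions made for the example, not the paper's model), one can fit a least-squares model from planning-query features to measured RRT execution times and pad the prediction so it behaves as an upper bound:

```python
# Minimal sketch, not the paper's model: regress measured RRT runtimes on query
# features and pad the prediction toward a safe upper bound.
import numpy as np

# Hypothetical measurements: each row is (start-goal distance, obstacle density);
# y holds the corresponding measured execution times in milliseconds.
X = np.array([[1.0, 0.1], [2.5, 0.3], [4.0, 0.2], [5.5, 0.5], [7.0, 0.4]])
y = np.array([12.0, 31.0, 40.0, 78.0, 85.0])

# Ordinary least squares with an intercept term.
A = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Pad by the largest underestimate seen in training so no observed run exceeds the bound.
residuals = y - A @ coef
margin = max(residuals.max(), 0.0)

def wcet_bound(distance: float, density: float) -> float:
    return float(np.array([distance, density, 1.0]) @ coef + margin)

print(wcet_bound(6.0, 0.45))  # predicted safe upper bound for an unseen query
```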
A Novel 1D-Convolution Accelerator for Low-Power Real-time CNN Processing on the Edge
Justin Sanchez (UNCC)*; Nasim Soltani (The University of North Carolina at Charlotte); Ramachandra Vikas Chamarthi (The University of North Carolina at Charlotte); Adarsh Sawant (The University of North Carolina at Charlotte); Hamed Tabkhi (The University of North Carolina at Charlotte)
With the rise of deep learning, the demand for real-time edge intelligence is greater than ever. Current algorithm and hardware realizations often focus on the cloud paradigm and assume that an entire frame's data is available in large batches. As a result, real-time AI inference at the edge has been a difficult goal due to tight latency requirements and the streaming nature of the data. There is an inherent need for novel architectures that can realize latency-aware, agile deep learning algorithms at the edge. This paper introduces a novel joint algorithm-architecture approach to enable real-time, low-power Convolutional Neural Network (CNN) processing on edge devices. The core of the proposed approach is the use of 1D convolution together with an architecture that can truly benefit from the algorithm optimization. On the algorithm side, we present novel training and inference based on 1D convolution. On the architecture side, we present a novel dataflow architecture capable of performing on-the-fly 1D convolution over the pixel stream. Our results on a Xilinx Zynq-7000 FPGA for SqueezeNet demonstrate only a 2% loss in accuracy while maintaining real-time processing of 60 frames per second with only 1.73 W of power consumption. The dynamic power consumption is 7.3X lower than that of a regular 2D-convolution CNN at the same frame rate, and the total power is 4.3X less than that of an Nvidia Jetson TX2 delivering only 30 frames per second.
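The streaming aspect of on-the-fly 1D convolution can be sketched in software (this is only an analogy to the idea, not the authors' FPGA dataflow architecture): a 1D kernel is applied as pixels arrive, so only a kernel-sized window is buffered rather than an entire frame.

```python
# Illustrative sketch of the streaming idea only: apply a 1D kernel on the fly
# over a pixel stream, keeping just a kernel-sized window in memory.
from collections import deque
from typing import Iterable, Iterator, Sequence

def stream_conv1d(pixels: Iterable[float], kernel: Sequence[float]) -> Iterator[float]:
    window = deque(maxlen=len(kernel))
    for px in pixels:
        window.append(px)
        if len(window) == len(kernel):
            # Dot product of the current window with the kernel (as in CNN layers).
            yield sum(w * k for w, k in zip(window, kernel))

# Example: a 3-tap edge-detecting kernel over one row of pixel intensities.
row = [10, 10, 10, 200, 200, 200, 10, 10]
print(list(stream_conv1d(row, [-1.0, 0.0, 1.0])))
```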
Energy-Efficient DNN Computing on GPUs Through Register File Management
Xin Wang (Virginia Commonwealth University); Wei Zhang (Virginia Commonwealth University)*
Deep Neural Networks (DNNs) are state-of-the-art approaches for drawing knowledge from huge amounts of data with remarkable accuracy. The size of real-world data sets is growing from gigabytes to terabytes and even petabytes, leading to high computational complexity for training DNNs, which can take days to weeks. Current DNNs, which involve a mass of matrix multiplications and similar operations, can be well parallelized and thus accelerated by GPUs. However, energy consumption is still a major concern for DNNs and can limit the scalability of performance gains. In this paper, instead of pruning the complexity of DNN models, we propose to exploit the specific micro-architecture of GPUs and the characteristics of DNN applications to improve energy efficiency. A huge register file (RF) is necessary for modern GPUs to hold the contexts of thousands of concurrent threads. Consequently, the GPU RF, which is constructed with high-leakage transistors, contributes significantly to the GPU's total energy consumption, and smart RF management strategies can help GPUs reduce energy consumption when scaling up hardware resources for higher performance. First, based on the observation that a large fraction of operands in DNNs are narrow-width, we propose a GPU register packing scheme to use the RF more efficiently. Second, we introduce a drowsy RF with a simple policy to decrease leakage energy consumption. Finally, we further improve RF energy efficiency by combining the drowsy RF and register packing techniques. We evaluate the effectiveness of our GPU RF management schemes in reducing energy using AlexNet, a state-of-the-art DNN model. The experimental results show that the combination of register packing and drowsy techniques achieves the largest reduction in total GPU energy consumption, up to 11% and 10.3% on average.
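The register packing idea can be illustrated with a small sketch (a conceptual software analogy, not the hardware mechanism evaluated in the paper): when two operands each fit in 16 bits, they can share one 32-bit register entry, halving the number of physical entries those operands occupy.

```python
# Conceptual sketch of register packing: two narrow-width (16-bit) operands
# share a single 32-bit register entry.
def is_narrow(value: int, bits: int = 16) -> bool:
    """True if the unsigned value fits in the given narrow width."""
    return 0 <= value < (1 << bits)

def pack(lo: int, hi: int, bits: int = 16) -> int:
    assert is_narrow(lo, bits) and is_narrow(hi, bits)
    return (hi << bits) | lo

def unpack(word: int, bits: int = 16) -> tuple[int, int]:
    mask = (1 << bits) - 1
    return word & mask, (word >> bits) & mask

# Example: two narrow operands occupy a single 32-bit slot.
word = pack(0x00FF, 0x0A0A)
print(hex(word), unpack(word))  # 0xa0a00ff (255, 2570)
```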