2019 IEEE High Performance Extreme Computing Conference (HPEC '19)
Twenty-third Annual HPEC Conference
24 - 26 September 2019
Westin Hotel, Waltham, MA USA
Wednesday September 25, 2019
HPC 10:20-12:00 in Eden Vale C3
Chair: Seung Woo Son / UMass Lowell

Scalable Solvers for Cone Complementarity Problems in Frictional Multibody Dynamics
Saibal De (Univ. Michigan), Eduardo Corona (NYIT), Paramsothy Jayakumar (US Army), Shravan Veerapaneni (Univ. Michigan)
We present an efficient, hybrid MPI/OpenMP framework for the cone complementarity formulation of large-scale rigid body dynamics problems with frictional contact. Data is partitioned among MPI processes using a Morton encoding to promote data locality and minimize communication. We parallelize state-of-the-art first- and second-order solvers for the resulting cone complementarity optimization problems. Our approach is highly scalable, enabling the solution of dense, large-scale multibody problems; a sedimentation simulation involving 256 million particles (~324 million contacts on average) was resolved using 512 cores in less than a half-hour per time-step.
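The Morton encoding mentioned above maps 3D positions to a 1D key along a Z-order space-filling curve, so sorting bodies by key keeps spatially nearby bodies (and hence most contact pairs) on the same process. As a minimal illustrative sketch in Python, not code from the paper: the classic bit-interleaving construction, where the grid resolution, the part1by2 helper, and the use of np.array_split for rank assignment are all assumptions.

```python
import numpy as np

def part1by2(v):
    # Spread the low 10 bits of v so two zero bits separate each bit
    # (standard magic-number bit interleaving for 3D Morton codes).
    v = v & 0x000003FF
    v = (v ^ (v << 16)) & 0xFF0000FF
    v = (v ^ (v << 8)) & 0x0300F00F
    v = (v ^ (v << 4)) & 0x030C30C3
    v = (v ^ (v << 2)) & 0x09249249
    return v

def morton3d(ix, iy, iz):
    # Interleave three 10-bit grid indices into one 30-bit Z-order key.
    return (part1by2(iz) << 2) | (part1by2(iy) << 1) | part1by2(ix)

def partition(positions, n_ranks, grid_bits=10):
    # Quantize positions onto a 2^grid_bits cubic grid, sort bodies by
    # Morton key, and hand each MPI rank a contiguous chunk of the order.
    lo, hi = positions.min(axis=0), positions.max(axis=0)
    cells = ((positions - lo) / (hi - lo + 1e-12) * (2**grid_bits - 1)).astype(np.uint32)
    keys = morton3d(cells[:, 0], cells[:, 1], cells[:, 2])
    order = np.argsort(keys)
    return np.array_split(order, n_ranks)  # index sets, one per rank
```

Contiguous chunks of the Z-order correspond to compact spatial regions, which is what keeps contact evaluation mostly rank-local and minimizes communication.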
Large Scale Parallelization Using File-Based Communications
Chansup Byun, Jeremy Kepner, William Arcand, David Bestor, Bill Bergeron, Vijay Gadepally, Michael Houle, Matthew Hubbell, Michael Jones, Anna Klein, Peter Michaleas, Julie Mullen, Andrew Prout, Antonio Rosa, Siddharth Samsi, Charles Yee, Albert Reuther (MIT-LL)
In this paper, we present a novel file-based communication architecture that uses the local file system for large-scale parallelization. This approach eliminates the file system overload and resource contention that arise when the central file system is used for large parallel jobs. It incurs additional overhead from inter-node message file transfers when the sending and receiving processes are not on the same node. Even with this overhead, however, the benefits for overall cluster operation are substantial, in addition to the performance gains in message communication for large-scale parallel jobs. For example, a 2048-process parallel job achieved about 34 times better MPI_Bcast() performance when using the local file system. Furthermore, since the security of message file transfers is handled entirely by the secure copy protocol (scp) and file system permissions, no security measures or ports are required beyond those typically required on an HPC system.
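To make the file-based scheme concrete, the sketch below shows point-to-point messaging through message files: the sender serializes a message on its local file system and pushes it to the receiver's node with scp, and the receiver polls its local directory for arrival. This is a generic illustration rather than the paper's implementation; the directory layout, the file-naming convention, and the rename-on-arrival trick are assumptions.

```python
import os, pickle, subprocess, time

MSG_DIR = "/tmp/msgs"  # hypothetical per-node local message directory

def send(obj, dest_host, src_rank, dst_rank, tag):
    # Serialize locally, then push to the receiver's local file system.
    os.makedirs(MSG_DIR, exist_ok=True)
    name = f"{src_rank}_to_{dst_rank}_tag{tag}.pkl"
    tmp = os.path.join(MSG_DIR, name + ".part")
    with open(tmp, "wb") as f:
        pickle.dump(obj, f)
    # scp provides the transfer security; copying under a temporary name
    # and renaming remotely ensures the receiver never sees a partial file.
    subprocess.run(["scp", "-q", tmp, f"{dest_host}:{MSG_DIR}/{name}.part"], check=True)
    subprocess.run(["ssh", dest_host, "mv", f"{MSG_DIR}/{name}.part", f"{MSG_DIR}/{name}"], check=True)
    os.remove(tmp)

def recv(src_rank, dst_rank, tag, poll=0.05):
    # Poll the local file system until the message file appears.
    path = os.path.join(MSG_DIR, f"{src_rank}_to_{dst_rank}_tag{tag}.pkl")
    while not os.path.exists(path):
        time.sleep(poll)
    with open(path, "rb") as f:
        obj = pickle.load(f)
    os.remove(path)
    return obj
```

Collectives such as MPI_Bcast() can then be layered on this primitive, for example as a tree of send/recv pairs, with same-node deliveries short-circuited to a local rename so that only inter-node messages pay the scp cost.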
Fast Large-Scale Algorithm for Electromagnetic Wave Propagation in 3D Media
Mitchell Harris, M. Harper Langston, Pierre-David Letourneau, James Ezick, Richard Lethin (Reservoir Labs)
We present a fast, large-scale algorithm for the simulation of electromagnetic waves (Maxwell's equations) in three-dimensional inhomogeneous media. The algorithm has a complexity of $O(N \log(N))$ and runs in parallel. Numerical simulations show the rapid treatment of problems with tens of millions of unknowns on a small shared-memory cluster ($\leq 16$ cores).

Towards Improving Rate-Distortion Performance of Transform-Based Lossy Compression for HPC Datasets
Jialing Zhang, Aekyeung Moon, Xiaoyan Zhuo, Seung Woo Son (UMass Lowell)
As the size and amount of data produced by high-performance computing (HPC) applications grow exponentially, effective data reduction techniques are becoming critical to mitigating the time and space burdens. Lossy compression techniques, which have been widely used in image and video compression, hold promise for fulfilling this data reduction need. However, they are seldom adopted for HPC datasets because of the difficulty of quantifying the amount of information loss and data reduction. In this paper, we explore a lossy compression strategy by revisiting the energy compaction properties of discrete transforms on HPC datasets. Specifically, we apply block-based transforms to HPC datasets, obtain the minimum number of coefficients containing the maximum energy (or information) compaction rate, and quantize the remaining non-dominant coefficients using a binning mechanism to minimize the information loss expressed in a distortion measure. We implement the proposed approach and evaluate it using six real-world HPC datasets. Our experimental results show that, on average, only 6.67 bits are required to preserve an optimal energy compaction rate on the evaluated datasets. Moreover, our knee detection algorithm improves distortion in terms of peak signal-to-noise ratio by 2.46 dB on average.
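Energy compaction, the key quantity in the abstract above, measures how few transform coefficients capture a target fraction of a block's energy. Below is a minimal sketch of that measurement using an orthonormal DCT-II built in NumPy; the block size (32), the target rate (99%), and the matrix-based transform are illustrative assumptions, not details from the paper.

```python
import numpy as np

def dct_matrix(B):
    # Orthonormal DCT-II basis: row k holds the k-th cosine basis vector.
    k = np.arange(B)[:, None]
    n = np.arange(B)[None, :]
    C = np.sqrt(2.0 / B) * np.cos(np.pi * (n + 0.5) * k / B)
    C[0] *= np.sqrt(0.5)
    return C

def coeffs_for_energy(block, tau=0.99):
    # Transform one block, then count the fewest coefficients whose
    # cumulative energy reaches the target compaction rate tau.
    c = dct_matrix(len(block)) @ block
    e = np.sort(c * c)[::-1]
    cum = np.cumsum(e) / max(e.sum(), 1e-30)
    return int(np.searchsorted(cum, tau) + 1)

# Smooth data compacts well: a few coefficients carry ~99% of the energy.
x = np.sin(np.linspace(0, 4 * np.pi, 256))
blocks = x.reshape(-1, 32)
print([coeffs_for_energy(b) for b in blocks])
```

On locally smooth fields, which HPC datasets often are, a handful of low-frequency coefficients dominate; the rest are the "non-dominant" coefficients that the paper quantizes with its binning mechanism.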
A Parallel Simulation Approach to ACAS X Development
Adam Gjersvik and Robert J. Moss (MIT-LL)
With a rapidly growing and evolving National Airspace System (NAS), ACAS X is intended to be the next-generation airborne collision avoidance system, able to meet demands its predecessor could not. The ACAS X algorithms are developed in the Julia programming language and are exercised in simulation environments tailored to test different characteristics of the system. These simulation environments have been massively parallelized on the Lincoln Laboratory Supercomputing Center cluster to expedite the design and performance optimization of the system. This work outlines our approach to parallelizing one of our simulation tools, presents the resulting simulation speedups, and discusses how parallelization will enhance system characterization and design. Parallelization has made our simulation environment 33 times faster, greatly accelerating the ACAS X development process.
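Encounter-by-encounter safety simulation of this kind is embarrassingly parallel: each simulated encounter is independent, so runs can be farmed out across cores or nodes and their statistics aggregated afterwards. A generic Python sketch of the pattern follows; the encounter model and metrics are placeholders, not the ACAS X simulation (which the abstract notes is written in Julia).

```python
import multiprocessing as mp
import random

def simulate_encounter(seed):
    # Placeholder encounter model: in a real system this would fly a
    # scripted aircraft encounter through the collision avoidance logic
    # and report outcome metrics (e.g., whether an alert was issued).
    rng = random.Random(seed)
    miss_distance = rng.uniform(0.0, 2.0)  # notional, in nautical miles
    alerted = rng.random() < 0.1
    return miss_distance, alerted

def run_batch(n_encounters, n_workers):
    # Independent trials: distribute seeds across a process pool and
    # aggregate; speedup is close to linear in the number of workers.
    with mp.Pool(n_workers) as pool:
        results = pool.map(simulate_encounter, range(n_encounters))
    near_mid_air = sum(1 for d, _ in results if d < 0.1)
    alerts = sum(1 for _, a in results if a)
    return near_mid_air, alerts

if __name__ == "__main__":
    print(run_batch(n_encounters=100_000, n_workers=8))
```

Because the trials share no state, the same pattern scales from a single node's process pool to a scheduler-managed array of cluster jobs, which is what makes the reported 33x speedup achievable.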