2018 IEEE High Performance Extreme Computing Conference (HPEC '18)
Twenty-second Annual HPEC Conference
25 - 27 September 2018
Westin Hotel, Waltham, MA USA

Posters
Thursday, September 27, 2018
PageRank Acceleration for Large Graphs with Scalable Hardware and Two-Step SpMV
Fazle Sadi (CMU)*
PageRank is an important vertex ranking algorithm that suffers from poor performance and efficiency due to its notoriously irregular memory access behavior. Furthermore, as graphs become bigger and sparser, PageRank applications are inhibited because most current solutions rely heavily on large amounts of fast random-access memory, which is not easily scalable. In this paper we present a 16 nm ASIC-based shared-memory platform for PageRank that fundamentally accelerates Sparse Matrix dense Vector multiplication (SpMV), the core kernel of PageRank. The accelerator is scalable, guarantees full DRAM streaming, and reduces off-chip communication. More importantly, it can handle very large graphs (~2 billion vertices) while using significantly less fast random-access memory than current solutions. Experimental results show that the proposed accelerator yields an order-of-magnitude improvement in both energy efficiency and performance over state-of-the-art shared-memory commercial off-the-shelf (COTS) solutions.
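The computation being accelerated is easy to state: each PageRank power iteration is one SpMV plus a teleport correction. Below is a minimal sketch of that baseline kernel only, not the paper's two-step hardware scheme; the column-stochastic layout, damping factor, and convergence settings are illustrative assumptions, and dangling vertices are ignored for brevity.

```python
# Minimal PageRank power iteration built on SpMV -- the kernel the accelerator
# targets (illustrative baseline, not the paper's two-step hardware scheme).
import numpy as np
from scipy.sparse import csr_matrix

def pagerank(A: csr_matrix, d: float = 0.85, tol: float = 1e-8, max_iter: int = 100):
    """A: column-stochastic matrix with A[i, j] = 1/outdeg(j) for each edge j -> i.
    Dangling vertices (zero out-degree) are ignored here for brevity."""
    n = A.shape[0]
    r = np.full(n, 1.0 / n)              # start from the uniform distribution
    teleport = (1.0 - d) / n
    for _ in range(max_iter):
        r_next = d * (A @ r) + teleport  # one SpMV dominates each iteration
        done = np.abs(r_next - r).sum() < tol
        r = r_next
        if done:
            break
    return r
```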
Scalable Community Detection Using Vite
Sayan Ghosh (Washington State University)*; Mahantesh Halappanavar (Pacific Northwest National Laboratory); Antonino Tumeo (Pacific Northwest National Laboratory); Ananth Kalyanaraman (Washington State University); Assefaw H Gebremedhin (Washington State University)
Graph clustering, popularly known as community detection, is a fundamental graph operation used in several applications relevant to network analysis and cybersecurity. Community detection partitions a network into an arbitrary number of "communities" such that each community represents a tightly-knit group of nodes with relatively sparse connections to the rest of the network. The need to compute clusterings on large-scale networks necessitates efficient parallel algorithms capable of exploiting modern parallel architectures. However, due to their irregular and inherently sequential nature, many current algorithms for community detection are challenging to parallelize. In response to the 2018 Graph Challenge, we present Vite, a distributed-memory parallel community detection method. Vite parallelizes the algorithmic template of a widely used serial method (Louvain) and adds several heuristics aimed at significantly improving parallelism and performance while preserving output quality. Using the static and streaming inputs from the 2018 Graph Challenge, we demonstrate superior performance and high-quality solutions based on four parallelization heuristics.

Triangle Counting and Truss Decomposition using FPGA
Sitao Huang (University of Illinois at Urbana-Champaign)*; Mohamed El-Hadedy (University of Illinois at Urbana-Champaign); Cong Hao (University of Illinois at Urbana-Champaign); Qin Li (University of Illinois at Urbana-Champaign); Vikram Sharma Mailthody (University of Illinois at Urbana-Champaign); Ketan Date (University of Illinois at Urbana-Champaign); Jinjun Xiong (IBM Thomas J. Watson Research Center); Deming Chen (University of Illinois at Urbana-Champaign); Rakesh Nagi (University of Illinois at Urbana-Champaign); Wen-Mei Hwu (University of Illinois at Urbana-Champaign)
Triangle counting and truss decomposition are two essential procedures in graph analysis. As graphs grow larger, designing highly efficient graph analysis systems with lower power demands becomes increasingly urgent. In this paper, we present triangle counting and truss decomposition on a Field-Programmable Gate Array (FPGA). We leverage the flexibility of FPGAs to achieve low-latency, high-efficiency implementations. Evaluation on the SNAP dataset shows that our triangle counting and truss decomposition implementations achieve 43.5x on average (up to 757.7x) and 6.4x on average (up to 64.9x) higher performance per Watt, respectively, over GPU solutions.

Fast Stochastic Block Partition for Streaming Graphs
Ahsen J Uppal (The George Washington University)*; H. Howie Huang (The George Washington University)
The graph partition problem continues to be challenging, particularly for streaming graph data. Although optimal graph partitioning is NP-hard, stochastic methods can provide approximate solutions in reasonable time. However, such methods are optimized for static, not dynamic, graph data. In this paper, we describe a new, efficient algorithm we have developed for stochastic block partitioning on time-varying, streaming graph data. Our algorithm is a refinement of the baseline algorithm of the IEEE HPEC Graph Challenge. The incremental algorithm efficiently updates its internal state as new pieces are streamed in and generates a complete partition at every time step. Compared to the naive baseline, which performs a complete partitioning from scratch at every time step, our algorithm offers overall speedups between 1.96x for N=500 and 3.56x for N=20k for a graph streamed over 10 parts, with similar accuracy. At the margin, the speedup in processing time for additional streaming pieces over the baseline is between 7.1x for N=500 and 25.1x for N=20k.

Estimating Edge-Local Triangle Count Heavy Hitters in Edge-Linear Time and Almost-Vertex-Linear Space
Benjamin W Priest (Dartmouth College)*; Roger Pearce (Lawrence Livermore National Laboratory); Geoffrey Sanders (LLNL)
We describe a methodology for estimating edge-local triangle counts using HyperLogLog cardinality approximation sketches. While the approach does not guarantee relative error bounds, we show that in practice it preserves triangle count heavy hitters, the edges incident upon the largest number of triangles. Furthermore, we provide empirical evidence that the sum of the edge-local estimates yields a reasonable estimate of the global triangle count for free. In this paper we describe a two-pass algorithm for estimating edge-local triangle count heavy hitters. The algorithm requires time linear in the number of edges and memory almost linear in the number of vertices, and it is easy to parallelize. We provide results on dozens of real-world and synthetic graphs.
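Since the triangle count local to an edge (u, v) is the number of common neighbors of u and v, HyperLogLog neighborhood sketches can estimate it by inclusion-exclusion over sketch unions. The sketch below illustrates that idea only; it assumes the third-party datasketch package, and the authors' exact two-pass construction may differ.

```python
# Sketch of edge-local triangle estimation by inclusion-exclusion over
# HyperLogLog neighborhood sketches (illustrative; the paper's two-pass
# construction may differ). Assumes the third-party `datasketch` package.
from datasketch import HyperLogLog

def neighborhood_sketches(edges, p=12):
    """Pass 1: one HLL sketch of N(v) per vertex -- near-vertex-linear memory."""
    sketches = {}
    for u, v in edges:
        for a, b in ((u, v), (v, u)):      # undirected: register both endpoints
            hll = sketches.setdefault(a, HyperLogLog(p=p))
            hll.update(str(b).encode())
    return sketches

def edge_triangles(edges, sketches):
    """Pass 2: |N(u) & N(v)| ~ |N(u)| + |N(v)| - |N(u) | N(v)| per edge."""
    est = {}
    for u, v in edges:
        union = HyperLogLog(p=sketches[u].p)  # merge into a copy, not in place
        union.merge(sketches[u])
        union.merge(sketches[v])
        est[(u, v)] = max(0.0, sketches[u].count() + sketches[v].count() - union.count())
    return est
```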
Triangle Counting with a Multi-Core Computer
Evan Donato (University of Massachusetts Boston); Ming Ouyang (University of Massachusetts Boston)*; Cristian Peguero-Isalguez (University of Massachusetts Boston)
The forward algorithm for counting the number of triangles in a sparse graph is fast in practice. This article studies how to prepare data and data structures for implementing the algorithm on a two-socket computer. Specifically, a data structure is designed to increase cache efficiency for the forward algorithm, and a tiebreaking scheme for the sorting of vertices is designed to increase sequential access to the data structure. The Graph Challenge provides a collection of data sets for researchers to compare the performance of their implementations. The performance metrics include the edge rate (the number of edges in the graph divided by the runtime) and the triangle rate (the number of triangles divided by the runtime). The highest edge rate achieved herein is 1.406 billion edges per second, and the highest triangle rate is 3.495 billion triangles per second.
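As a reference point for the forward algorithm the article tunes, a minimal sketch follows (Schank and Wagner's formulation); the cache-efficient data structure and the article's specific tiebreaking scheme are not reproduced here.

```python
# The forward algorithm for triangle counting -- a minimal reference sketch of
# the kernel the article optimizes for cache efficiency.
def forward_triangle_count(adj):
    """adj: dict mapping vertex -> set of neighbors (undirected, no self-loops)."""
    # Rank vertices by decreasing degree; ties broken here by vertex id (the
    # article designs a more careful tiebreak to improve sequential access).
    order = sorted(adj, key=lambda v: (-len(adj[v]), v))
    rank = {v: i for i, v in enumerate(order)}
    A = {v: set() for v in adj}   # A[v]: neighbors of v already processed
    triangles = 0
    for s in order:
        for t in adj[s]:
            if rank[s] < rank[t]:              # orient each edge once
                triangles += len(A[s] & A[t])  # common processed neighbors
                A[t].add(s)
    return triangles
```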
Parallel Counting of Triangles in Large Graphs: Pruning and Hierarchical Clustering Algorithms
Chun-Yen Kuo (City University of Hong Kong)*; Ching Nam Hang (City University of Hong Kong); Pei-Duo Yu (City University of Hong Kong); Chee Wei Tan (City University of Hong Kong, China)
As part of the 2018 Graph Challenge on subgraph isomorphism, we propose a novel joint hierarchical clustering and parallel counting technique, the PHC algorithm, which computes the exact number of triangles in large graphs. The PHC algorithm first prunes the graph, then performs hierarchical clustering based on geodesic distance, and finally counts subgraphs in parallel. This allows scalable software frameworks such as MapReduce/Hadoop to count, in parallel, the triangles inside each cluster as well as those straddling clusters. We characterize the performance of the PHC algorithm mathematically, and its evaluation on representative graphs, including random graphs, demonstrates its computational efficiency over other existing techniques.

Characterizing I/O optimization opportunities for array-centric applications on HDFS
Donghe Kang (The Ohio State University)*; Vedang Patel (The Ohio State University); Spyros Blanas (The Ohio State University); Yang Wang (The Ohio State University); Srinivasan Parthasarathy (The Ohio State University); Kalyan Khandrika (The Ohio State University)
There is an impedance mismatch between the increasing sophistication of array-centric analytics and the bytestream-based POSIX interface of parallel file systems. This mismatch is particularly acute in data-intensive scientific applications. This paper examines performance bottlenecks and describes optimizations to alleviate them in the context of computational astronomy pipelines and the Hadoop Distributed File System (HDFS). We find that fast data ingestion and intelligent object consolidation promise to accelerate I/O performance by two orders of magnitude.

A SoPC Based Fixed Point System for Spaceborne SAR Real-Time Imaging Processing
Bingyi Li (Beijing Key Laboratory of Embedded Real-Time Information Processing Technology, Beijing Institute of Technology); Changjin Li (Beijing Key Laboratory of Embedded Real-Time Information Processing Technology, Beijing Institute of Technology); Yizhuang Xie (Beijing Key Laboratory of Embedded Real-Time Information Processing Technology, Beijing Institute of Technology)*; Liang Chen (Beijing Key Laboratory of Embedded Real-Time Information Processing Technology, Beijing Institute of Technology); Hao Shi (Tsinghua University); Yi Deng (Virginia Polytechnic Institute and State University)
With the development of satellite payload technology and very large-scale integrated (VLSI) circuit technology, on-board real-time synthetic aperture radar (SAR) imaging systems have facilitated rapid response to disasters. The state-of-the-art System-on-Programmable-Chip (SoPC) technique, which combines an embedded processor with other functional modules, provides a potential solution that satisfies all constraints. However, as processing efficiency and imagery granularity requirements grow, implementing the entire SAR imaging process in floating-point arithmetic is unaffordable. Converting the data to fixed point is an effective solution, and the core issue is finite word length optimization, which trades off hardware resources against processing precision. In this paper, we analyze the finite word length computing error of a SAR imaging system based on the Chirp Scaling (CS) algorithm and propose a mathematical computing error model. We then derive an empirical formula for the system's output noise-to-signal ratio. Guided by the software simulation results, we implement and verify the proposed method on a Zynq+NetFPGA platform. The run-time results show that the proposed method achieves decent image quality as assessed by the Integrated Side Lobe Ratio (ISLR), Peak Side Lobe Ratio (PSLR), and Relative Mean Square Deviation (RMSD).

An Ensemble Classifier Based on Feature Selection Using Ant Colony Optimization
Jianjun Cao (NUDT, China); Guojun Lv (Army Engineering University, China); Yuling Shang (Army Engineering University, China); Nianfeng Weng (Army Engineering University, China); Chen Chang (Army Engineering University, China); Yi Liu (Army Engineering University, China)
An ensemble classifier based on feature selection is proposed to provide an effective methodology for classifying device states. Support Vector Machines (SVMs) are used as the base classifiers, and classification performance and a similarity measure are defined to construct serial and parallel classifier models coupled with feature selection. We put forward Ant Colony Optimization (ACO) algorithms for the three models and verify the performance of the proposed ensemble classifier on vibration signals, covering five working states, from an engine's cylinder head.

Efficient and Flexible 2-D Data Controller for SAR Imaging System
Tianyuan Sun (Beijing Key Laboratory of Embedded Real-Time Information Processing Technology, Beijing Institute of Technology); Bingyi Li (Beijing Key Laboratory of Embedded Real-Time Information Processing Technology, Beijing Institute of Technology); Xiaoning Liu (DFH Satellite Co., Ltd.); Yizhuang Xie (Beijing Key Laboratory of Embedded Real-Time Information Processing Technology, Beijing Institute of Technology)*; He Chen (Beijing Key Laboratory of Embedded Real-Time Information Processing Technology, Beijing Institute of Technology); Liang Chen (Beijing Key Laboratory of Embedded Real-Time Information Processing Technology, Beijing Institute of Technology)
Control of two-dimensional data is a critical part of synthetic aperture radar (SAR) imaging systems. This paper presents a field-programmable gate array (FPGA)-based Double Data Rate 3 Synchronous Dynamic Random Access Memory (DDR3 SDRAM) data controller that can efficiently access data runs of arbitrary length, along either one-dimensional direction, from any position in a two-dimensional array. To improve the efficiency of transposition, data addresses in DDR3 are obtained by a sub-matrix cross-mapping method. A complete SAR imaging system with the proposed controller is implemented and validated on a XC7VX690T FPGA platform. Our experimental results demonstrate that this design improves system parallelism by reducing the use of on-chip RAM resources, and it produces the complete imaging result for 16384×16384 raw SAR data in 5.83 s.
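To see why a sub-matrix mapping helps transposition, consider a generic tiled address scheme: storing the matrix in B×B tiles keeps both row-direction and column-direction walks inside contiguous blocks instead of striding across the whole array. The sketch below is a generic illustration under assumed parameters, not the paper's cross-mapping scheme.

```python
# Illustrative tiled (sub-matrix) address mapping for 2-D data in DRAM.
# This is a generic scheme, not the paper's cross-mapping method: tiling means
# a column walk changes tile only every B elements, so consecutive accesses
# mostly stay within one tile's contiguous B*B block of addresses.
B = 64          # assumed tile edge, sized so a tile row suits the DRAM row buffer
N = 16384       # matrix dimension (matching the paper's 16384 x 16384 test)

def tiled_address(r, c):
    """Linear address of element (r, c) under B x B tiling of an N x N matrix."""
    tile_row, tile_col = r // B, c // B        # which tile holds the element
    in_row, in_col = r % B, c % B              # offset inside that tile
    tile_id = tile_row * (N // B) + tile_col   # tiles laid out row-major
    return tile_id * B * B + in_row * B + in_col

# Example: a column walk touches only two tiles over 2*B consecutive accesses.
addrs = [tiled_address(r, 5) for r in range(2 * B)]
```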
Multi-Player Generative Adversarial Networks
Samarth Gupta (MIT)*
In unsupervised learning, Generative Adversarial Networks (GANs) have become one of the most popular techniques for implicitly learning the underlying distribution of a dataset. The problem is formulated as a two-player zero-sum game between a Generator network G and a Discriminator network D. The Generator aims to generate synthetic (fake) images from random noise, and the Discriminator aims to correctly classify images as synthetic or natural. In this study we explore the possibility of using generalized zero-sum games, or multi-player zero-sum games, for formulating and training GANs. Our results indicate that a multi-player setup can significantly reduce the severity of the mode collapse problem in GANs and generate a diverse set of synthetic images. We demonstrate this on several synthetic toy distributions and on real-world datasets such as MNIST and CIFAR10.
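For reference, the standard two-player zero-sum game that a multi-player formulation generalizes is the original GAN minimax objective (textbook form, not notation taken from this abstract):

$$
\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big] \;+\; \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
$$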
Evaluating Mode Collapse in Generative Adversarial Networks
Sayeri Lala (MIT)*; Maha Shady (MIT); Anastasiya Belyaeva (Massachusetts Institute of Technology); Molei Liu (Harvard)
Generative Adversarial Networks (GANs) suffer from mode collapse, in which the generator produces samples from only a few of the modes present in the true distribution. Several algorithms modifying the original GAN have recently been proposed to address this. However, there is little work comparing their performance systematically. This study compares the performance of several novel GAN methods, in particular AdaGAN [1], Unrolled GAN [2], VEEGAN [3], and Wasserstein GAN [4], along various metrics and on datasets of different distributions. Our results indicate that AdaGAN performs consistently well on nearly all datasets, while the other GAN methods exhibit inconsistent performance.

Metrics: Several metrics have been proposed to evaluate mode collapse in GANs. Based on insights into the strengths and weaknesses of various metrics [5], we compute the Wasserstein distance, Coverage, and a Birthday Paradox-based estimate of a distribution's support size. We also measure the number of captured modes and sample quality based on a sample's distance from its mode, as in [3]. For synthetic datasets, we compute these quantities directly since the modes are known a priori. For MNIST, these quantities are computed using a pretrained MNIST classifier, where the modes are assumed to be the digits 0-9.

Datasets, Architectures, and Training Parameters: Mode collapse was evaluated on synthetic and real datasets. We implemented different GAN architectures and training parameters for each dataset type to account for dataset complexity. To ensure a fair comparison, the study generally used the same architectures and training parameters across the different GAN methods.

Synthetic datasets: The synthetic datasets are a two-dimensional mixture of 8 Gaussians arranged in a ring, a two-dimensional mixture of 25 Gaussians arranged in a grid, and an 800-dimensional dataset embedding a mixture of 9 500-dimensional Gaussians [3]. Each dataset comprised 500,000 training and 100,000 test examples. On the synthetic datasets, the architecture and training parameters from [3] were used.

For the 2D ring and 2D grid, AdaGAN obtained high scores across the metrics. On the 2D ring, it had Coverage 0.88, 97% of samples were high quality, all modes were captured, and the support size was equal to the estimated support size of the training set. On the 2D grid, it had Coverage 1.0, 89% of samples were high quality, all modes were captured, and the support size was equal to the estimated support size of the training set.

Unrolled GAN, VEEGAN, and WGAN generally performed worse. For the 2D ring, Unrolled GAN had Coverage 0.91, 95% high-quality samples, all modes captured, and a smaller support size. VEEGAN had Coverage 1.0, 88% high-quality samples, all modes captured, and a larger support size. For the 2D grid, Unrolled GAN had Coverage 0.55, 88% high-quality samples, 14/25 modes captured, and a smaller support size. VEEGAN had Coverage 0.96, 66% high-quality samples, all modes captured, and a larger support size. WGAN consistently produced low-quality samples (fewer than 20% high quality) and distributions with larger support sizes. All methods achieved a Wasserstein distance of 0.0.

For the high-dimensional dataset, there was no clear winner. AdaGAN had a Wasserstein distance of 1.97, 39% high-quality samples, all modes captured, and a support larger than the true support size. VEEGAN had a Wasserstein distance of 1.15, 66% high-quality samples, 6/9 modes captured, and a smaller support size. Unrolled GAN and WGAN had higher Wasserstein distances, significantly fewer high-quality samples (under 15%), 0-1 modes captured, and significantly larger support sizes.

Real dataset: The real dataset used is MNIST. The GAN architecture and training parameters from [1] were used. AdaGAN had a Wasserstein distance of 7.48 × 10^-6, Coverage 0.4, 45% high-quality samples, and all modes captured. Unrolled GAN had a Wasserstein distance of 0.0007, Coverage 0.47, 53% high-quality samples, and all modes captured. VEEGAN and WGAN had low Coverage (below 0.15) and low percentages of high-quality samples (below 15%).

Conclusion: Our results show that AdaGAN performs consistently well on nearly all datasets and under several different metrics. This could be because AdaGAN learns a mixture of GANs, where each GAN is trained to improve the accuracy of the mixture model.

References:
1. Tolstikhin, Ilya O., et al. "AdaGAN: Boosting generative models." Advances in Neural Information Processing Systems. 2017.
2. Metz, Luke, et al. "Unrolled generative adversarial networks." arXiv preprint arXiv:1611.02163 (2016).
3. Srivastava, Akash, et al. "VEEGAN: Reducing mode collapse in GANs using implicit variational learning." Advances in Neural Information Processing Systems. 2017.
4. Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017).
5. Borji, Ali. "Pros and cons of GAN evaluation measures." arXiv preprint arXiv:1802.03446 (2018).
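As an illustration of the mode-capture bookkeeping used on the synthetic mixtures, the sketch below marks a sample as high quality when it falls within three standard deviations of its nearest mode and counts a mode as captured when enough high-quality samples land on it; the three-sigma rule follows [3], while the hit threshold here is an illustrative assumption.

```python
# Sketch of mode-capture metrics for a Gaussian mixture with known modes.
# "High quality" = within 3 standard deviations of the nearest mode (as in [3]);
# the min_hits threshold for a mode to count as captured is assumed.
import numpy as np

def mode_metrics(samples, modes, std, min_hits=20):
    """samples: (n, d) array of generated points; modes: (k, d) mixture means."""
    d = np.linalg.norm(samples[:, None, :] - modes[None, :, :], axis=2)  # (n, k)
    nearest = d.argmin(axis=1)                   # index of each sample's mode
    high_quality = d.min(axis=1) <= 3.0 * std    # within 3 sigma of some mode
    hits = np.bincount(nearest[high_quality], minlength=len(modes))
    return {
        "pct_high_quality": 100.0 * high_quality.mean(),
        "modes_captured": int((hits >= min_hits).sum()),
    }
```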
Alternative Algorithms to Remove Ringing Artifacts in Bio-image with Big Data Using Numerical and Computational Analysis
Andrew Kyung (Northern Valley Regional HS at Demarest); Rachel Chung (Harvard University Extension School); Elise Kang (New York University); Richard Kyung (CRG (Choice Research Group), NJ)*
In this research, numerical and computational simulations were carried out with several modified low-pass filters to reduce the ringing effect and improve the resolution of digital images, and an efficient function was ultimately proposed as a new low-pass filter. In the digital imaging process, when the passband of the low-pass filter over the frequency domain is narrow, the result is blurry images due to the insufficient amount of frequency data obtained from k-space. The main purpose of this research was to develop better algorithms that both enhance the quality of the final image and decrease the processing time needed to produce images. Non-Gaussian behavior was also studied in this analysis.

Enhanced Iterative Method Solving Computational Mechanics with a Large Number of Equations
Richard Kyung (CRG (Choice Research Group), NJ)*; Daniel Lee (CRG (Choice Research Group), NJ)
This paper examines the calculation of eigenvalues in large systems of equations arising in computational mechanics, such as physical and mechanical engineering problems. Eigenvalue problems are of the greatest importance in interactive dynamic problems in the computational sciences and mathematical fields. To assess the validity of the optimized process, computed results such as the accuracy and operation counts of the improved algorithm and of the Jacobi method are compared. The comparison shows that the values of the off-diagonal elements are not pertinent to obtaining rapid convergence with the improved algorithm, whereas the Jacobi method is usually useful only for diagonally dominant matrices. The improved algorithm therefore requires fewer computations than the original model in solving eigenvalue problems, yielding reduced computational cost and increased efficiency.
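For context, here is a minimal sketch of the classical Jacobi eigenvalue iteration that the improved algorithm is compared against (the improved algorithm itself is not reproduced here); each rotation annihilates the currently largest off-diagonal entry of a symmetric matrix.

```python
# Classical Jacobi eigenvalue iteration for a symmetric matrix -- reference
# sketch of the baseline method the paper compares against.
import numpy as np

def jacobi_eigenvalues(A, tol=1e-10, max_rotations=500):
    """A: symmetric matrix; returns approximate eigenvalues on the diagonal."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for _ in range(max_rotations):
        # Classical pivot rule: zero out the largest off-diagonal element.
        off = np.abs(A - np.diag(np.diag(A)))
        p, q = np.unravel_index(off.argmax(), off.shape)
        if off[p, q] < tol:
            break
        # Rotation angle satisfying tan(2*theta) = 2*A[p,q] / (A[q,q] - A[p,p]).
        theta = 0.5 * np.arctan2(2.0 * A[p, q], A[q, q] - A[p, p])
        c, s = np.cos(theta), np.sin(theta)
        J = np.eye(n)
        J[p, p] = J[q, q] = c
        J[p, q], J[q, p] = s, -s
        A = J.T @ A @ J   # similarity transform preserves the eigenvalues
    return np.diag(A)
```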