2018 IEEE High Performance
Extreme Computing Conference
(HPEC ‘18)
Twenty-second Annual HPEC Conference
25 - 27 September 2018
Westin Hotel, Waltham, MA USA
PageRank Acceleration for Large Graphs with Scalable Hardware and Two-Step SpMV
Fazle Sadi (CMU)*
PageRank is an important vertex ranking algorithm that suffers from poor performance and efficiency due to its notoriously irregular memory access behavior. Furthermore, as graphs become larger and sparser, PageRank applications are inhibited because most current solutions rely heavily on large random-access fast memory, which is not easily scalable. In this paper we present a 16nm ASIC-based shared-memory platform for PageRank implementation that fundamentally accelerates Sparse Matrix dense Vector multiplication (SpMV), the core kernel of PageRank. This accelerator is scalable, guarantees full DRAM streaming, and reduces off-chip communication. More importantly, it is capable of handling very large graphs (~2 billion vertices) despite using significantly less fast random-access memory than current solutions. Experimental results show that our proposed accelerator yields an order of magnitude improvement in both energy efficiency and performance over state-of-the-art shared-memory commercial off-the-shelf (COTS) solutions.
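The accelerator itself cannot be reconstructed from the abstract, but the SpMV kernel it targets can be sketched in a few lines of plain Python. The graph, damping factor, and iteration count below are illustrative assumptions, not parameters from the paper.

```python
# Minimal sketch of the SpMV kernel at the heart of PageRank power
# iteration (pure Python; toy graph and parameters are assumptions).

def pagerank(edges, n, d=0.85, iters=50):
    """Power iteration: r <- d * (A^T r / outdeg) + (1 - d) / n."""
    out_deg = [0] * n
    for u, v in edges:
        out_deg[u] += 1
    r = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1.0 - d) / n] * n
        for u, v in edges:          # the SpMV over the sparse adjacency
            nxt[v] += d * r[u] / out_deg[u]
        r = nxt
    return r

# Tiny 3-node cycle: by symmetry every vertex gets rank 1/3.
ranks = pagerank([(0, 1), (1, 2), (2, 0)], 3)
print([round(x, 3) for x in ranks])  # -> [0.333, 0.333, 0.333]
```

The edge loop is exactly the irregular-access pattern the accelerator streams from DRAM: each edge triggers one scattered read of r and one scattered write of nxt.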
Scalable Community Detection Using Vite
SAYAN GHOSH (WASHINGTON STATE UNIVERSITY)*; Mahantesh Halappanavar (Pacific Northwest National Laboratory);
Antonino Tumeo (Pacific Northwest National Laboratory); Ananth Kalyanaraman (Washington State University); Assefaw H
Gebremedhin (Washington State University)
Graph clustering, popularly known as community detection, is a fundamental graph operation used in several applications of relevance for network analysis and cybersecurity. Community detection is an operation of partitioning a network into an arbitrary number of “communities” such that each community represents a tightly-knit group of nodes with relatively sparser connections to the rest of the nodes in the network. The need to compute clustering on large-scale networks necessitates the development of efficient parallel algorithms capable of exploiting modern parallel architectures. However, due to their irregular and inherently sequential nature, many of the current algorithms for community detection are challenging to parallelize. In response to the 2018 Graph Challenge, we present Vite, a distributed-memory parallel community detection method. Vite parallelizes the algorithmic template provided by a widely-used serial method (Louvain). In addition, it applies several heuristics aimed at significantly improving parallelism and performance while preserving output quality. Using the inputs from the 2018 Graph Challenge (static and streaming), we demonstrate superior performance and high-quality solutions based on four parallelization heuristics.
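The objective that the Louvain template (and hence Vite) optimizes is modularity. As a minimal sketch, assuming an undirected edge list and a community assignment per vertex (this is not Vite's code), the score can be computed as:

```python
# Modularity Q = sum_c (e_c / m - (deg_c / 2m)^2), where e_c is the number
# of intra-community edges and deg_c the total degree of community c.
# Illustrative sketch only, not Vite's implementation.

def modularity(edges, comm):
    m = len(edges)
    deg, internal = {}, {}
    for u, v in edges:
        deg[comm[u]] = deg.get(comm[u], 0) + 1
        deg[comm[v]] = deg.get(comm[v], 0) + 1
        if comm[u] == comm[v]:
            internal[comm[u]] = internal.get(comm[u], 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2.0 * m)) ** 2
               for c, d in deg.items())

# Two triangles joined by a bridge, partitioned into the two triangles.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
comm = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(round(modularity(edges, comm), 4))  # -> 0.3571
```

Louvain greedily moves vertices between communities whenever the move increases Q; the parallelization heuristics in the paper concern how such moves are batched and reconciled across processes.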
Triangle Counting and Truss Decomposition using FPGA
Sitao Huang (University of Illinois at Urbana-Champaign)*; Mohamed El-Hadedy (University of Illinois at Urbana-Champaign); Cong
Hao (University of Illinois at Urbana-Champaign); Qin Li (University of Illinois at Urbana-Champaign); Vikram Sharma Mailthody
(University of Illinois at Urbana-Champaign); Ketan Date (University of Illinois at Urbana-Champaign); Jinjun Xiong (IBM Thomas J.
Watson Research Center); Deming Chen (University of Illinois at Urbana-Champaign); Rakesh Nagi (University of Illinois at Urbana-
Champaign); Wen-Mei Hwu (University of Illinois at Urbana-Champaign)
Triangle counting and truss decomposition are two essential procedures in graph analysis. As the scale of graphs grows larger,
designing highly efficient graph analysis systems with less power demand becomes more and more urgent. In this paper, we present
triangle counting and truss decomposition using a Field-Programmable Gate Array (FPGA). We leverage the flexibility of FPGAs and
achieve low-latency, high-efficiency implementations. Evaluation on the SNAP dataset shows that our triangle counting and truss decomposition implementations achieve 43.5x on average (up to 757.7x) and 6.4x on average (up to 64.9x) higher performance per Watt, respectively, over GPU solutions.
Fast Stochastic Block Partition for Streaming Graphs
Ahsen J Uppal (The George Washington University)*; H. Howie Huang (The George Washington University)
The graph partition problem continues to be challenging, particularly for streaming graph data. Although optimal graph partitioning is
NP-hard, stochastic methods can provide approximate solutions in reasonable time. However, such methods are optimized for static, not dynamic, graph data. In this paper, we describe a new, efficient algorithm for stochastic block partitioning on time-varying, streaming graph data. Our algorithm is a refinement of the baseline algorithm of the IEEE HPEC Graph Challenge. Our
incremental algorithm efficiently updates its previous internal state as new pieces are streamed in, and generates a complete
partition at every time step. Compared to the naive baseline which performs a complete partitioning from scratch at every time step,
our algorithm offers speedups between 1.96x for N=500 and 3.56x for N=20k overall, for a graph streamed over 10 parts, with
similar accuracy. At the margin, the speedup in processing time for additional streaming pieces over the baseline is between 7.1x for
N=500 to 25.1x for N=20k.
Estimating Edge-Local Triangle Count Heavy Hitters in Edge-Linear Time and Almost-Vertex-Linear Space
Benjamin W Priest (Dartmouth College)*; Roger Pearce (Lawrence Livermore National Laboratory); Geoffrey Sanders (LLNL)
We describe a methodology for estimating edge-local triangle counts using HyperLogLog cardinality approximation sketches. While the approach does not guarantee relative error bounds, we show that it preserves triangle count heavy hitters - the edges incident upon the largest number of triangles - well in practice. Furthermore, we provide empirical evidence that the sum of the edge-local estimates yields a reasonable estimate of the global triangle count for free. In this paper we describe a two-pass algorithm for estimating edge-local triangle count heavy hitters. The algorithm requires time linear in the number of edges and memory almost linear in the number of vertices, and is easy to parallelize. We provide results on dozens of real-world and synthetic graphs.
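The identity the sketching approach exploits can be shown with exact neighbor sets: the triangle count of edge (u, v) is |N(u) ∩ N(v)|, which can be recovered from set cardinalities by inclusion-exclusion - the form that matters for HyperLogLog, since HLL sketches support union (merge) but not intersection directly. A small exact-set sketch of that identity (the HLL approximation itself is omitted):

```python
# Edge-local triangle count of (u, v) = |N(u) & N(v)|, computed via
# |N(u)| + |N(v)| - |N(u) | N(v)|.  Exact sets stand in here for the
# HyperLogLog neighborhood sketches used in the paper.

from collections import defaultdict

def edge_triangles(edges):
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    # Inclusion-exclusion: only the union cardinality is needed beyond the
    # per-vertex cardinalities, and unions are what HLL sketches can merge.
    return {(u, v): len(nbrs[u]) + len(nbrs[v]) - len(nbrs[u] | nbrs[v])
            for u, v in edges}

# One triangle: every edge participates in exactly 1 triangle.
print(edge_triangles([(0, 1), (1, 2), (0, 2)]))
```

Replacing each `set` with an HLL sketch makes the memory per vertex (near-)constant, at the cost of the approximation error discussed in the abstract.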
Triangle Counting with A Multi-Core Computer
Evan Donato (University of Massachusetts Boston); Ming Ouyang (University of Massachusetts Boston)*; Cristian Peguero-Isalguez (University of Massachusetts Boston)
The forward algorithm for counting the number of triangles in a sparse graph is fast in practice. This article studies how to prepare
data and data structures for implementing the algorithm on a two-socket computer. Specifically, a data structure is designed to
increase cache efficiency for the forward algorithm, and a scheme for tiebreaking during the sorting of vertices is designed to
increase sequential access to the data structure. The Graph Challenge provides a collection of data sets for researchers to compare
the performance of their implementations. The performance metrics include the edge rate (the number of edges in the graph divided
by runtime) and the triangle rate (the number of triangles divided by runtime). Herein the highest edge rate achieved is 1.406 billion
edges per second, and the highest triangle rate is 3.495 billion triangles per second.
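The forward algorithm mentioned above (Schank and Wagner's) can be stated compactly. The sketch below uses plain Python sets rather than the cache-optimized data structure the paper designs, and breaks degree ties by vertex index, which is only one of the tiebreaking schemes the paper studies.

```python
# Forward algorithm for triangle counting: process vertices in
# degree-descending order; A[v] accumulates already-processed neighbors,
# and each intersection |A[s] & A[t]| counts triangles exactly once.

from collections import defaultdict

def forward_triangles(edges):
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    order = sorted(nbrs, key=lambda x: (-len(nbrs[x]), x))  # index tiebreak
    rank = {v: i for i, v in enumerate(order)}
    A = defaultdict(set)
    count = 0
    for s in order:
        for t in nbrs[s]:
            if rank[t] > rank[s]:
                count += len(A[s] & A[t])   # common processed neighbors
                A[t].add(s)
    return count

# Complete graph on 4 vertices has C(4,3) = 4 triangles.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(forward_triangles(k4))  # -> 4
```

The paper's contribution is in how the `A` arrays are laid out for cache efficiency and how the sort's tiebreak keeps accesses sequential; the counting logic itself is as above.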
Parallel Counting of Triangles in Large Graphs: Pruning and Hierarchical Clustering Algorithms
Chun-Yen Kuo (City University of Hong Kong)*; Ching Nam Hang (City University of Hong Kong); Pei-Duo Yu (City University of
Hong Kong); Chee Wei Tan (City University of Hong Kong, China)
As part of the 2018 Graph Challenge on subgraph isomorphism, we propose a novel joint hierarchical clustering and parallel
counting technique called the PHC algorithm that can compute the exact number of triangles in large graphs. The PHC algorithm
consists of pruning, followed by hierarchical clustering based on geodesic distance, and then subgraph counting in parallel. This allows scalable software frameworks such as MapReduce/Hadoop to count triangles inside each cluster, as well as those straddling clusters, in parallel. We characterize the performance of the PHC algorithm mathematically, and its performance evaluation using representative graphs, including random graphs, demonstrates its computational efficiency over other existing techniques.
Characterizing I/O optimization opportunities for array-centric applications on HDFS
Donghe Kang (The Ohio State University)*; Vedang Patel (The Ohio State University); Spyros Blanas (The Ohio State University);
Yang Wang (The Ohio State University); Srinivasan Parthasarathy (The Ohio State University); Kalyan Khandrika (The Ohio State University)
There is an impedance mismatch between the increasing sophistication of array-centric analytics and the bytestream-based POSIX
interface of parallel file systems. This mismatch is particularly acute in data-intensive scientific applications. This paper examines
performance bottlenecks and describes optimizations to alleviate them in the context of computational astronomy pipelines and the
Hadoop distributed file system (HDFS). We find that fast data ingestion and intelligent object consolidation promise to accelerate I/O
performance by two orders of magnitude.
A SoPC Based Fixed Point System for Spaceborne SAR Real-Time Imaging Processing
Bingyi Li (Beijing Key Laboratory of Embedded Real-Time Information Processing Technology, Beijing Institute of Technology);
Changjin Li ( Beijing Key Laboratory of Embedded Real-Time Information Processing Technology, Beijing Institute of Technology);
Yizhuang Xie (Beijing Key Laboratory of Embedded Real-Time Information Processing Technology, Beijing Institute of Technology)*;
Liang Chen (Beijing Key Laboratory of Embedded Real-Time Information Processing Technology, Beijing Institute of Technology);
Hao Shi (Tsinghua University); Yi Deng (Virginia Polytechnic Institute and State University)
With the development of satellite payload technology and very-large-scale integration (VLSI) circuit technology, on-board real-time synthetic aperture radar (SAR) imaging systems have facilitated rapid response to disasters. The state-of-the-art System-on-Programmable-Chip (SoPC) technique, combining an embedded processor with other function modules, provides a potential solution to satisfy all constraints. However, as processing efficiency and imagery granularity improve, implementing an entire SAR imaging process in floating-point arithmetic is unaffordable. Fixed-point conversion of the data is an effective solution, and the core issue is finite word length optimization under the trade-off between hardware resources and processing precision. In this paper, we analyze the finite word length computing error for a SAR imaging system using the Chirp Scaling (CS) algorithm, and propose a mathematical computing error model. Then, an empirical formula for the system's output noise-to-signal ratio is derived. Guided by the software simulation results, we implement and verify the proposed method on a Zynq+NetFPGA platform. The run-time results show that the proposed method achieves decent image quality as assessed by Integrated Side Lobe Ratio (ISLR), Peak Side Lobe Ratio (PSLR), and Relative Mean Square Deviation (RMSD).
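The word-length trade-off the paper optimizes can be illustrated numerically: quantizing a test signal at several fractional bit widths and measuring the resulting noise-to-signal ratio shows the precision cost of each bit removed. The signal and bit widths below are illustrative assumptions, not values from the paper's CS-algorithm error model.

```python
# Fixed-point quantization noise vs. fractional word length: each added
# bit lowers the noise-to-signal ratio (illustrative sketch only).

import math

def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

def noise_to_signal_db(signal, frac_bits):
    q = [quantize(s, frac_bits) for s in signal]
    noise = sum((a - b) ** 2 for a, b in zip(signal, q))
    power = sum(s ** 2 for s in signal)
    return 10 * math.log10(noise / power) if noise else float("-inf")

sig = [math.sin(2 * math.pi * k / 64) for k in range(64)]
for b in (8, 12, 16):
    print(b, "bits:", round(noise_to_signal_db(sig, b), 1), "dB")
```

A full SAR pipeline accumulates such errors through every FFT and multiply stage, which is why the paper derives an analytical end-to-end model rather than measuring a single quantization step.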
An Ensemble Classifier Based on Feature Selection Using Ant Colony Optimization
Jianjun Cao (NUDT - China); Guojun Lv (Army Engineering University - China); Yuling Shang (Army Engineering University - China); Nianfeng Weng (Army Engineering University - China); Chen Chang (Army Engineering University - China); Yi Liu (Army Engineering University - China)
An ensemble classifier based on feature selection is proposed to provide an effective methodology for classifying device states. Support Vector Machines (SVMs) were used as the base classifiers, and classification performance and a similarity measure were defined to construct models using serial and parallel classifiers associated with the feature selection. We put forward Ant Colony Optimization (ACO) algorithms with three models, and verified the performance of the proposed ensemble classifier on vibration signals with five working states from an engine's cylinder head.
Efficient and Flexible 2-D Data Controller for SAR Imaging System
Tianyuan Sun (Beijing Key Laboratory of Embedded Real-Time Information Processing Technology, Beijing Institute of Technology);
Bingyi Li (Beijing Key Laboratory of Embedded Real-Time Information Processing Technology, Beijing Institute of Technology);
Xiaoning Liu (DFH Satellite Co., Ltd.); Yizhuang Xie (Beijing Key Laboratory of Embedded Real-Time Information Processing
Technology, Beijing Institute of Technology)*; He Chen ( Beijing Key Laboratory of Embedded Real-Time Information Processing
Technology, Beijing Institute of Technology); Liang Chen (Beijing Key Laboratory of Embedded Real-Time Information Processing
Technology, Beijing Institute of Technology)
Control of two-dimensional data is a critical part of synthetic aperture radar (SAR) imaging systems. This paper presents a field-programmable gate array (FPGA)-based Double Data Rate 3 Synchronous Dynamic Random Access Memory (DDR3 SDRAM) data controller which can efficiently access data of arbitrary length along either one-dimensional direction, from any position, in two-dimensional data. To improve the efficiency of transposition, the address of data in DDR3 is obtained by a sub-matrix cross-mapping method. A complete SAR imaging system with the proposed controller is implemented and validated on a XC7VX690T FPGA platform. Our experimental results demonstrate that this design can improve system parallelism by reducing the use of RAM resources, and that only 5.83 s is required to obtain the complete imaging result for 16384×16384 SAR raw data.
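The benefit of a sub-matrix address mapping can be shown in miniature: tiling the 2-D array into blocks keeps a column-direction walk inside far fewer DRAM rows than a plain row-major layout. The sketch below uses a generic blocked layout and toy sizes; it is not the paper's exact cross-mapping scheme.

```python
# Blocked ("sub-matrix") address mapping: a column walk through an n x n
# matrix touches n/b blocks instead of n DRAM rows.  Toy sizes, for
# illustration only.

def blocked_addr(r, c, n, b):
    """Linear address of element (r, c) with the matrix tiled into b x b blocks."""
    blocks_per_row = n // b
    block = (r // b) * blocks_per_row + (c // b)
    return block * b * b + (r % b) * b + (c % b)

def dram_rows_touched(addrs, row_words):
    return len({a // row_words for a in addrs})

n, b, row_words = 16, 4, 16      # pretend one DRAM row holds 16 elements
col = [blocked_addr(r, 3, n, b) for r in range(n)]   # column-direction walk
naive = [r * n + 3 for r in range(n)]                # same walk, row-major
print(dram_rows_touched(col, row_words), "vs", dram_rows_touched(naive, row_words))
```

Fewer distinct DRAM rows per walk means fewer row activations, which is the mechanism by which such mappings speed up corner-turn (transpose) operations in SAR processing.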
Multi-Player Generative Adversarial Networks
Samarth Gupta (MIT)*
In unsupervised learning, Generative Adversarial Networks (GANs) have become one of the most popular techniques to implicitly
learn the underlying distribution of a particular dataset. The problem is formulated as a two-player zero-sum game between a
Generator network G and a Discriminator network D. The Generator network aims to generate synthetic (fake) images from random noise, and the Discriminator aims to correctly classify synthetic versus natural images. In this study we explore the possibility of using generalized zero-sum games, or multi-player zero-sum games, for formulating and training GANs. Our results indicate that with a multi-player setup, the severity of the mode collapse problem in GANs can be significantly reduced, yielding a more diverse set of synthetic images. We demonstrate this on several synthetic toy distributions and on real-world datasets such as MNIST and CIFAR10.
Evaluating Mode Collapse in Generative Adversarial Networks
Sayeri Lala (MIT)*; Maha Shady (MIT); Anastasiya Belyaeva (Massachusetts Institute of Technology); Molei Liu (Harvard)
Generative Adversarial Networks (GANs) suffer from mode collapse, in which the generator produces samples from only a few of the modes present in the true distribution. Several algorithms modifying the original GAN have recently been proposed to address this. However, there is little work comparing their performance systematically. This study compares the performance of several novel GAN methods, in particular AdaGAN [1], Unrolled GAN [2], VEEGAN [3], and Wasserstein GAN (WGAN) [4], along various metrics and on datasets of different distributions. Our results indicate that AdaGAN performs consistently well on nearly all datasets, while the other GAN methods exhibit inconsistent performance.
Metrics: Several metrics have been proposed to evaluate mode collapse in GANs. Based on insights into the strengths and weaknesses of various metrics [5], we compute the Wasserstein distance, Coverage, and a Birthday Paradox based estimate of a distribution's support size. We also measure the number of captured modes and sample quality based on a sample's distance from its mode, as in [3]. For synthetic datasets, we compute these quantities directly since the modes are known a priori. For MNIST, they are computed using a pretrained MNIST classifier, where the modes are assumed to be the digits 0-9.
Datasets, Architectures, and Training Parameters: Mode collapse was evaluated on synthetic and real datasets. We implemented different GAN architectures and training parameters for each dataset type to account for dataset complexity. To ensure a fair comparison, the study generally used the same architectures and training parameters across the different GAN methods.
Synthetic datasets: The synthetic datasets are a two-dimensional mixture of 8 Gaussians arranged in a ring, a two-dimensional mixture of 25 Gaussians arranged in a grid, and an 800-dimensional dataset embedding a mixture of 9 500-dimensional Gaussians [3]. Each dataset comprised 500,000 training and 100,000 test examples. On the synthetic datasets, the architecture and training parameters of [3] were used. For the 2D ring and 2D grid, AdaGAN obtained high scores across the metrics. On the 2D ring, it had Coverage 0.88, 97% of samples were high quality, all modes were captured, and the support size was equal to the estimated support size of the training set. On the 2D grid, it had Coverage 1.0, 89% of samples were high quality, all modes were captured, and the support size was equal to the estimated support size of the training set. Unrolled GAN, VEEGAN, and WGAN generally performed worse. For the 2D ring, Unrolled GAN had Coverage 0.91, 95% high quality samples, all modes captured, and a smaller support size. VEEGAN had Coverage 1.0, 88% high quality samples, all modes captured, and a larger support size. For the 2D grid, Unrolled GAN had Coverage 0.55, 88% high quality samples, 14/25 modes captured, and a smaller support size. VEEGAN had Coverage 0.96, 66% high quality samples, all modes captured, and a larger support size. WGAN consistently produced low quality samples (under 20% high quality) and distributions with a larger support size. All methods achieved a Wasserstein distance of 0.0. For the high-dimensional dataset, there was no clear winner. AdaGAN had a Wasserstein distance of 1.97, 39% high quality samples, all modes captured, and a support size larger than the true support size. VEEGAN had a Wasserstein distance of 1.15, 66% high quality samples, and 6/9 modes captured, with a smaller support size. Unrolled GAN and WGAN had higher Wasserstein distances, significantly fewer high quality samples (under 15%), and 0-1 modes captured, with significantly larger support sizes.
Real dataset: The real dataset used is MNIST. The GAN architecture and training parameters of [1] were used. AdaGAN had a Wasserstein distance of 7.48 * 10^-6, Coverage 0.4, 45% high quality samples, and all modes captured. Unrolled GAN had a Wasserstein distance of 0.0007, Coverage 0.47, 53% high quality samples, and all modes captured. VEEGAN and WGAN had low Coverage (below 0.15) and a low percentage of high quality samples (below 15%).
Conclusion: Our results show that AdaGAN performs consistently well on nearly all datasets and under several different metrics. This could be because AdaGAN learns a mixture of GANs, where each GAN is trained to improve the accuracy of the mixture model.
References:
[1] Tolstikhin, Ilya O., et al. "AdaGAN: Boosting generative models." Advances in Neural Information Processing Systems. 2017.
[2] Metz, Luke, et al. "Unrolled generative adversarial networks." arXiv preprint arXiv:1611.02163 (2016).
[3] Srivastava, Akash, et al. "VEEGAN: Reducing mode collapse in GANs using implicit variational learning." Advances in Neural Information Processing Systems. 2017.
[4] Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017).
[5] Borji, Ali. "Pros and Cons of GAN Evaluation Measures." arXiv preprint arXiv:1802.03446 (2018).
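The captured-modes and high-quality-sample metrics used above can be sketched for the 2D ring case: a sample counts as high quality if it lies within three standard deviations of its nearest mode. The ring radius, standard deviation, and threshold below follow the common 8-Gaussian ring setup in the literature, as assumptions rather than this study's exact configuration.

```python
# Captured modes and fraction of high-quality samples for the 2D ring of
# 8 Gaussians (illustrative metric sketch; parameters are assumptions).

import math

MODES = [(2 * math.cos(2 * math.pi * i / 8),
          2 * math.sin(2 * math.pi * i / 8)) for i in range(8)]
STD = 0.02

def ring_metrics(samples, thresh=3 * STD):
    hit, hq = set(), 0
    for x, y in samples:
        # nearest mode and distance to it
        d, m = min((math.hypot(x - mx, y - my), i)
                   for i, (mx, my) in enumerate(MODES))
        if d <= thresh:
            hq += 1
            hit.add(m)
    return len(hit), hq / len(samples)

# Samples sitting exactly on every mode: 8 modes captured, 100% high quality.
print(ring_metrics(MODES))  # -> (8, 1.0)
```

A mode-collapsed generator would place all its mass near one or two ring positions, which this metric reports directly as a small captured-mode count.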
Alternative Algorithms to Remove Ringing Artifacts in Bio-image with Big Data Using Numerical and Computational
Analysis
Andrew Kyung (Northern Valley Regional HS at Demarest); Rachel Chung (Harvard University – Extension School); Elise Kang
(New York University); Richard Kyung (CRG(Choice Research Group)-NJ)*
In this research, numerical and computational simulations were carried out with several modified low pass filters to reduce the ringing effect and improve the resolution of digital images, and an efficient function is finally proposed as a new low pass filter. In the digital imaging process, when the domain of the low pass filter over the frequency domain is narrow, it produces blurry images due to the insufficient amount of frequency data obtained from k-space. The main purpose of this research was to develop better algorithms that both enhance the quality of the final image and decrease the processing time taken to produce images. Non-Gaussian behavior was also studied in this analysis.
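The trade-off described above - a hard frequency cutoff causes Gibbs ringing, while a smoother filter suppresses it - can be reproduced on a 1-D step signal. The Hann-style taper below is a generic stand-in, not the specific filter function the authors propose.

```python
# Hard vs. tapered low-pass filtering of a step: ringing shows up as
# overshoot past 1 after inverse transform (O(n^2) DFT for simplicity).

import cmath
import math

def dft(x, inv=False):
    n = len(x)
    sgn = 1 if inv else -1
    out = [sum(x[k] * cmath.exp(sgn * 2j * cmath.pi * j * k / n)
               for k in range(n)) for j in range(n)]
    return [v / n for v in out] if inv else out

def lowpass(signal, cutoff, taper):
    n = len(signal)
    spec = dft(signal)
    spec = [s * taper(min(j, n - j), cutoff) for j, s in enumerate(spec)]
    return [v.real for v in dft(spec, inv=True)]

hard = lambda f, c: 1.0 if f <= c else 0.0                      # ideal cutoff
hann = lambda f, c: 0.5 * (1 + math.cos(math.pi * f / c)) if f <= c else 0.0

step = [0.0] * 32 + [1.0] * 32
overshoot = lambda y: max(y) - 1.0
print("hard:", round(overshoot(lowpass(step, 10, hard)), 3),
      "hann:", round(overshoot(lowpass(step, 10, hann)), 3))
```

The hard cutoff overshoots the step by roughly the classical Gibbs amount, while the tapered filter trades a slightly softer edge for near-elimination of the ringing, which is the same trade the paper tunes in 2-D k-space.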
Enhanced Iterative Method Solving Computational Mechanics with a Large Number of Equations
Richard Kyung (CRG(Choice Research Group)-NJ)*; Daniel Lee (CRG(Choice Research Group)-NJ)
This paper examines the calculation of eigenvalues for large systems of equations in computational mechanics, such as physical and mechanical engineering problems. Eigenvalue problems are of the greatest importance in interactive dynamic problems in the computational sciences and mathematical fields. To assess the validity of the optimized process, computed results such as accuracy and operation counts of the improved algorithm and the Jacobi method are compared. The comparison shows that the values of the off-diagonal elements are not pertinent to obtaining rapid convergence, whereas the Jacobi method is usually useful only for diagonally dominant matrices. Therefore, the improved algorithm requires fewer computations than the original model in solving eigenvalue problems, yielding reduced computational cost and increased efficiency.
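For reference, the classical Jacobi method that the paper uses as its baseline can be stated compactly: repeatedly zero the largest off-diagonal entry of a symmetric matrix with a plane rotation until the matrix is numerically diagonal. This is an illustrative pure-Python baseline, not the authors' improved algorithm.

```python
# Classical Jacobi eigenvalue iteration for a real symmetric matrix.

import math

def jacobi_eigenvalues(a, sweeps=100):
    a = [row[:] for row in a]
    n = len(a)
    for _ in range(sweeps):
        # pivot on the largest off-diagonal element
        p, q = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: abs(a[ij[0]][ij[1]]))
        if abs(a[p][q]) < 1e-12:
            break
        # rotation angle that annihilates a[p][q]: tan(2t) = 2a_pq/(a_qq-a_pp)
        theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
        c, s = math.cos(theta), math.sin(theta)
        for k in range(n):                      # A <- A J
            akp, akq = a[k][p], a[k][q]
            a[k][p] = c * akp - s * akq
            a[k][q] = s * akp + c * akq
        for k in range(n):                      # A <- J^T A
            apk, aqk = a[p][k], a[q][k]
            a[p][k] = c * apk - s * aqk
            a[q][k] = s * apk + c * aqk
    return sorted(a[i][i] for i in range(n))

print(jacobi_eigenvalues([[2.0, 1.0], [1.0, 2.0]]))  # approximately [1.0, 3.0]
```

Each rotation costs O(n) updates, and convergence depends on how quickly the off-diagonal mass shrinks - exactly the quantity whose role the paper's comparison calls into question for its improved algorithm.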
Posters
Thursday, September 27, 2018