2016 IEEE High Performance
Extreme Computing Conference
(HPEC ‘16)
Twentieth Annual HPEC Conference
13 - 15 September 2016
Westin Hotel, Waltham, MA USA
Big Data 2
3:00-5:00 in Eden Vale A3
Chair: Vijay Gadepally / MIT
Thursday, September 15
[BEST PAPER FINALIST]
Benchmarking SciDB Data Import on HPC Systems
Siddharth Samsi, Laura Brattain, William Arcand, David Bestor, Bill
Bergeron, Chansup Byun, Vijay Gadepally, Matthew Hubbell, Anna
Klein, Peter Michaleas, Lauren Milechin, Julie Mullen, Andrew Prout,
Antonio Rosa, Charles Yee, Jeremy Kepner and Albert Reuther, MIT
Lincoln Laboratory
SciDB is a scalable, computational database management system that
uses an array model for data storage. The array data model of SciDB
makes it ideally suited for storing and managing large amounts of
imaging data. SciDB is designed to support in-database analytics,
thus reducing the need to extract data for analysis. It is
designed to be massively parallel and can run on commodity hardware
in a high performance computing (HPC) environment. In this paper, we
present the performance of SciDB using simulated image data. The
Dynamic Distributed Dimensional Data Model (D4M) software is used to
implement the benchmark on a cluster running the MIT SuperCloud
software stack. A peak performance of 2.2M database inserts per
second was achieved on a single node of this system. We also show
that SciDB and the D4M toolbox provide more efficient ways to access
random sub-volumes of massive datasets compared to the traditional
approaches of reading volumetric data from individual files. This work
describes the D4M and SciDB tools we developed and presents the
initial performance results. This performance was achieved using
parallel inserts, in-database merging of arrays, and
supercomputing techniques such as distributed arrays and single-
program-multiple-data (SPMD) programming.
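The parallel-insert-and-merge pattern described in the abstract can be sketched as follows. This is a hypothetical illustration only, not the real SciDB or D4M API: each worker ingests a batch of (row, col, value) triples into its own array, and the per-worker arrays are then merged, mirroring SciDB's in-database array merge.

```python
# Hypothetical sketch of parallel inserts followed by an array merge;
# stand-in functions, not SciDB/D4M calls.
from concurrent.futures import ThreadPoolExecutor

def insert_batch(batch):
    """Simulate one worker ingesting a batch of (row, col, value) triples."""
    return {(r, c): v for r, c, v in batch}

def parallel_ingest(triples, workers=4):
    # Split the triples round-robin across workers (SPMD style).
    batches = [triples[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(workers) as pool:
        partials = list(pool.map(insert_batch, batches))
    # Merge the per-worker arrays into one, as SciDB merges per-instance data.
    merged = {}
    for part in partials:
        merged.update(part)
    return merged
```

In the real system the per-worker inserts go to SciDB instances and the merge happens inside the database rather than in client memory.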
[BEST STUDENT PAPER FINALIST] Cross-Engine Query Execution
in Federated Database Systems
Ankush M. Gupta*, Vijay Gadepally*+, Michael Stonebraker*
*MIT CSAIL +MIT Lincoln Laboratory
We have developed a reference implementation of the BigDAWG
system: a new architecture for future Big Data applications, guided by
the philosophy that “one size does not fit all”. Such applications not
only call for large-scale analytics, but also for real-time streaming
support, smaller analytics at interactive speeds, data visualization, and
cross-storage-system queries. The importance and effectiveness of
such a system has been demonstrated in a hospital application using
data from an intensive care unit (ICU). In this article, we describe the
implementation and evaluation of the cross-system Query Executor. In
particular, we focus on cross-engine shuffle joins within the BigDAWG
system, and evaluate various strategies of computing them when faced
with varying degrees of data skew.
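A shuffle join of the kind the abstract evaluates can be illustrated with a toy sketch (assumed structure, not BigDAWG's executor API): both inputs are hash-partitioned on the join key so matching tuples land in the same partition, then each partition is joined locally. A skewed key concentrates work in one partition, which is the effect the paper's strategies address.

```python
# Toy hash-partitioned shuffle join; each bucket stands in for one engine.
from collections import defaultdict

def shuffle_join(left, right, partitions=4):
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)
    # Shuffle phase: route each tuple to a partition by hash of its key.
    for key, val in left:
        left_parts[hash(key) % partitions].append((key, val))
    for key, val in right:
        right_parts[hash(key) % partitions].append((key, val))
    # Local join phase: build a hash index per partition and probe it.
    out = []
    for p in range(partitions):
        index = defaultdict(list)
        for key, val in right_parts[p]:
            index[key].append(val)
        for key, lval in left_parts[p]:
            for rval in index[key]:
                out.append((key, lval, rval))
    return out
```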
Integrating Real-Time and Batch Processing in a Polystore
John Meehan, Stan Zdonik, Shaobo Tian, Yulong Tian, Brown University;
Nesime Tatbul, Intel Labs & MIT; Adam Dziedzic, Aaron Elmore,
University of Chicago
This paper describes a stream processing engine called S-Store and its
role in the BigDAWG polystore. Fundamentally, S-Store acts as a
frontend processor that accepts input from multiple sources, cleans it
of errors, and translates that input into a form that can be efficiently
ingested into BigDAWG. S-Store also acts as an intelligent router that
sends input
tuples to the appropriate components of BigDAWG. All updates to S-
Store’s shared memory are done in a transactionally consistent (ACID)
way, thereby eliminating new errors caused by non-synchronized reads
and writes. The ability to migrate data from component to component of
BigDAWG is crucial. We have described a migrator from S-Store to
Postgres that we have implemented as a first proof of concept. We
report some interesting results using this migrator that impact the
evaluation of query plans.
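The clean-then-route role described above can be sketched as a small dispatcher. This is a toy illustration with invented names, not S-Store's API: each record is checked, malformed input is dropped (data cleaning), and the rest is sent to the BigDAWG component registered for its kind.

```python
# Hypothetical clean-and-route frontend; sinks stand in for BigDAWG engines.
def make_router(routes):
    """routes maps a record kind to a sink list (a stand-in for an engine)."""
    def route(record):
        kind, payload = record
        if payload is None:          # data cleaning: drop malformed input
            return "dropped"
        sink = routes.get(kind)
        if sink is None:
            return "unrouted"
        sink.append(payload)         # forward the tuple to its component
        return "routed"
    return route

relational, array = [], []
route = make_router({"tuple": relational, "cell": array})
```

In S-Store itself, the routing state lives in shared memory updated under ACID transactions, which this sketch does not model.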
Data Transformation and Migration in Polystores
Adam Dziedzic and Aaron J. Elmore, Department of Computer Science,
The University of Chicago; Michael Stonebraker, CSAIL, Massachusetts
Institute of Technology
Ever-increasing data sizes and new requirements in data processing
have fostered the development of many new database systems. The
result is that many data-intensive applications are underpinned by
different engines. To enable data mobility there is a need to transfer data
between systems easily and efficiently. We analyze the state-of-the-art
of data migration and outline research opportunities for a rapid data
transfer. Our experiments explore data migration between a diverse set
of databases, including PostgreSQL, SciDB, S-Store and Accumulo.
Each of the systems excels at specific application requirements, such as
transactional processing, numerical computation, streaming data, and
large-scale text processing. Providing an efficient data migration tool is
essential to take advantage of the superior processing offered by these
specialized databases. Our goal is to build a data migration framework
that takes advantage of recent advancements in hardware and software.
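The migration pattern discussed above can be sketched generically (assumed interfaces, not the authors' framework): rows are streamed out of a source engine and bulk-loaded into a target in fixed-size batches, which bounds memory use regardless of dataset size.

```python
# Minimal batched-migration sketch; load_batch stands in for a target
# engine's bulk loader (e.g. a COPY-style ingest).
def migrate(source_rows, load_batch, batch_size=2):
    batch, moved = [], 0
    for row in source_rows:
        batch.append(row)
        if len(batch) == batch_size:
            load_batch(batch)        # flush a full batch to the target
            moved += len(batch)
            batch = []
    if batch:                        # flush the final partial batch
        load_batch(batch)
        moved += len(batch)
    return moved

target = []
n = migrate(iter([(1, "a"), (2, "b"), (3, "c")]), target.extend)
```

A real migrator must additionally translate between data models, for example from PostgreSQL tuples to SciDB array cells, which this sketch omits.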
The BigDAWG Monitoring Framework
Peinan Chen* , Vijay Gadepally*+, Michael Stonebraker*
*MIT CSAIL +MIT Lincoln Laboratory
BigDAWG is a polystore database system designed to work with
heterogeneous data that may be stored in disparate database and
storage engines. A central component of the BigDAWG polystore system
is the ability to submit queries that may be executed in different data
engines. This paper presents a monitoring framework for the BigDAWG
federated database system which maintains performance information on
benchmark queries. As environmental conditions change, the monitoring
framework updates existing performance information to match current
conditions. Using this information, the monitoring system can determine
the optimal query execution plan for similar incoming queries. We also
describe a series of test queries that were run to assess whether the
system correctly determines the optimal plans for such queries.
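The monitoring idea can be sketched as follows. This is a hypothetical illustration of the approach, not the paper's implementation: observed runtimes of benchmark queries are recorded per engine, smoothed with an exponentially weighted average so stale measurements fade as conditions change, and the engine with the lowest expected runtime is chosen for similar incoming queries.

```python
# Hypothetical performance monitor for engine selection.
class Monitor:
    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.times = {}              # (query_class, engine) -> smoothed runtime

    def record(self, query_class, engine, runtime):
        # Exponentially weighted average: recent runs dominate.
        key = (query_class, engine)
        old = self.times.get(key)
        self.times[key] = runtime if old is None else (
            self.alpha * runtime + (1 - self.alpha) * old)

    def best_engine(self, query_class):
        # Pick the engine with the lowest expected runtime for this class.
        candidates = {e: t for (q, e), t in self.times.items()
                      if q == query_class}
        return min(candidates, key=candidates.get) if candidates else None
```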
[BEST PAPER FINALIST] Julia Implementation of Dynamic
Distributed Dimensional Data Model
Alexander Chen, Alan Edelman, Massachusetts Institute of Technology;
Jeremy Kepner, Vijay Gadepally, MIT Lincoln Laboratory; Dylan
Hutchison, University of Washington
Julia is a new language for writing data analysis programs that are easy
to implement and run at high performance. Similarly, the Dynamic
Distributed Dimensional Data Model (D4M) aims to clarify data analysis
operations while retaining strong performance. D4M accomplishes these
goals through a composable, unified data model on associative arrays.
In this work, we present an implementation of D4M in Julia and describe
how it enables and facilitates data analysis. Several experiments
showcase scalable performance in our new Julia version as compared
to the original MATLAB implementation.
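The associative-array model underlying D4M can be illustrated with a toy sketch (in Python for consistency with the other sketches here; D4M itself is implemented in MATLAB and Julia): sparse arrays indexed by arbitrary keys that compose under element-wise operations such as addition.

```python
# Toy associative array illustrating the D4M data model, not D4M's API.
class Assoc:
    def __init__(self, triples):
        # Store as a sparse map from (row, col) keys to values.
        self.data = {(r, c): v for r, c, v in triples}

    def __add__(self, other):
        # Element-wise sum: keys absent from one operand count as zero.
        out = dict(self.data)
        for key, v in other.data.items():
            out[key] = out.get(key, 0) + v
        return Assoc([(r, c, v) for (r, c), v in out.items()])

    def __getitem__(self, key):
        return self.data.get(key, 0)

a = Assoc([("r1", "c1", 1)])
b = Assoc([("r1", "c1", 2), ("r2", "c1", 3)])
s = a + b
```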