2016 IEEE High Performance Extreme Computing Conference (HPEC ‘16) Twentieth Annual HPEC Conference 13 - 15 September 2016 Westin Hotel, Waltham, MA USA
Big Data 2 3:00-5:00 in Eden Vale A3  Chair: Vijay Gadepally / MIT
Thursday, September 15
[BEST PAPER FINALIST] Benchmarking SciDB Data Import on HPC Systems
Siddharth Samsi, Laura Brattain, William Arcand, David Bestor, Bill Bergeron, Chansup Byun, Vijay Gadepally, Matthew Hubbell, Anna Klein, Peter Michaleas, Lauren Milechin, Julie Mullen, Andrew Prout, Antonio Rosa, Charles Yee, Jeremy Kepner and Albert Reuther, MIT Lincoln Laboratory
SciDB is a scalable, computational database management system that uses an array model for data storage. The array data model of SciDB makes it ideally suited for storing and managing large amounts of imaging data. SciDB is designed to support advanced in-database analytics, thus reducing the need to extract data for analysis. It is designed to be massively parallel and can run on commodity hardware in a high performance computing (HPC) environment. In this paper, we present the performance of SciDB using simulated image data. The Dynamic Distributed Dimensional Data Model (D4M) software is used to implement the benchmark on a cluster running the MIT SuperCloud software stack. A peak performance of 2.2M database inserts per second was achieved on a single node of this system. We also show that SciDB and the D4M toolbox provide more efficient ways to access random sub-volumes of massive datasets compared to the traditional approach of reading volumetric data from individual files. This work describes the D4M and SciDB tools we developed and presents the initial performance results. This performance was achieved by using parallel inserts, an in-database merging of arrays, and supercomputing techniques such as distributed arrays and single-program-multiple-data programming.

[BEST STUDENT PAPER FINALIST] Cross-Engine Query Execution in Federated Database Systems
Ankush M. Gupta*, Vijay Gadepally*+, Michael Stonebraker*; *MIT CSAIL, +MIT Lincoln Laboratory
We have developed a reference implementation of the BigDAWG system: a new architecture for future Big Data applications, guided by the philosophy that “one size does not fit all”. Such applications not only call for large-scale analytics, but also for real-time streaming support, smaller analytics at interactive speeds, data visualization, and cross-storage-system queries. The importance and effectiveness of such a system have been demonstrated in a hospital application using data from an intensive care unit (ICU). In this article, we describe the implementation and evaluation of the cross-system Query Executor. In particular, we focus on cross-engine shuffle joins within the BigDAWG system, and evaluate various strategies of computing them when faced with varying degrees of data skew.
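For orientation, the Python sketch below illustrates the general shuffle-join technique the preceding abstract refers to: both inputs are hash-partitioned on the join key so that matching rows land in the same partition, and each partition is then joined locally. This is a minimal illustration, not code from the paper; the toy relations, the key argument, and the partition count are all hypothetical.

    # Minimal sketch of a hash-partitioned shuffle join (illustrative only;
    # not the BigDAWG implementation). Rows are plain dicts.
    from collections import defaultdict

    def shuffle_join(left_rows, right_rows, key, num_partitions=4):
        # Shuffle phase: route each row to the partition chosen by hashing
        # its join key, so rows that can join end up in the same partition.
        left_parts = defaultdict(list)
        right_parts = defaultdict(list)
        for row in left_rows:
            left_parts[hash(row[key]) % num_partitions].append(row)
        for row in right_rows:
            right_parts[hash(row[key]) % num_partitions].append(row)
        # Join phase: within each partition, build a hash table on one side
        # and probe it with the other.
        for p in range(num_partitions):
            build = defaultdict(list)
            for row in left_parts[p]:
                build[row[key]].append(row)
            for row in right_parts[p]:
                for match in build.get(row[key], []):
                    yield {**match, **row}

    # Hypothetical usage with two toy relations:
    patients = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
    vitals = [{"id": 1, "hr": 72}, {"id": 2, "hr": 61}]
    print(list(shuffle_join(patients, vitals, key="id")))

The sketch also makes the skew problem visible: every row with the same key hashes to the same partition, so a heavily repeated key concentrates work on one participant, which is the situation whose mitigation strategies the paper evaluates.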
Integrating Real-Time and Batch Processing in a Polystore
John Meehan, Stan Zdonik, Shaobo Tian, Yulong Tian, Brown University; Nesime Tatbul, Intel Labs & MIT; Adam Dziedzic, Aaron Elmore, University of Chicago
This paper describes a stream processing engine called S-Store and its role in the BigDAWG polystore. Fundamentally, S-Store acts as a frontend processor that accepts input from multiple sources, massages it into a form from which errors have been eliminated (data cleaning), and translates that input into a form that can be efficiently ingested into BigDAWG. S-Store also acts as an intelligent router that sends input tuples to the appropriate components of BigDAWG. All updates to S-Store's shared memory are done in a transactionally consistent (ACID) way, thereby eliminating new errors caused by non-synchronized reads and writes. The ability to migrate data from component to component of BigDAWG is crucial. We have described a migrator from S-Store to Postgres that we have implemented as a first proof of concept. We report some interesting results using this migrator that impact the evaluation of query plans.

Data Transformation and Migration in Polystores
Adam Dziedzic and Aaron J. Elmore, Department of Computer Science, The University of Chicago; Michael Stonebraker, CSAIL, Massachusetts Institute of Technology
Ever-increasing data sizes and new requirements in data processing have fostered the development of many new database systems. The result is that many data-intensive applications are underpinned by different engines. To enable data mobility there is a need to transfer data between systems easily and efficiently. We analyze the state of the art of data migration and outline research opportunities for rapid data transfer. Our experiments explore data migration between a diverse set of databases, including PostgreSQL, SciDB, S-Store and Accumulo. Each of the systems excels at specific application requirements, such as transactional processing, numerical computation, streaming data, and large-scale text processing. Providing an efficient data migration tool is essential to take advantage of the superior processing these specialized databases offer. Our goal is to build such a data migration framework that will take advantage of recent advancements in hardware and software.

The BigDAWG Monitoring Framework
Peinan Chen*, Vijay Gadepally*+, Michael Stonebraker*; *MIT CSAIL, +MIT Lincoln Laboratory
BigDAWG is a polystore database system designed to work with heterogeneous data that may be stored in disparate database and storage engines. A central component of the BigDAWG polystore system is the ability to submit queries that may be executed in different data engines. This paper presents a monitoring framework for the BigDAWG federated database system which maintains performance information on benchmark queries. As environmental conditions change, the monitoring framework updates existing performance information to match current conditions. Using this information, the monitoring system can determine the optimal query execution plan for similar incoming queries. We also describe a series of test queries that were run to assess whether the system correctly determines the optimal plans for such queries.

[BEST PAPER FINALIST] Julia Implementation of Dynamic Distributed Dimensional Data Model
Alexander Chen, Alan Edelman, Massachusetts Institute of Technology; Jeremy Kepner, Vijay Gadepally, MIT Lincoln Laboratory; Dylan Hutchison, University of Washington
Julia is a new language for writing data analysis programs that are easy to implement and run at high performance. Similarly, the Dynamic Distributed Dimensional Data Model (D4M) aims to clarify data analysis operations while retaining strong performance. D4M accomplishes these goals through a composable, unified data model on associative arrays. In this work, we present an implementation of D4M in Julia and describe how it enables and facilitates data analysis. Several experiments showcase scalable performance in our new Julia version as compared to the original Matlab implementation.
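To give a concrete feel for the associative-array model that both the Matlab and Julia versions of D4M build on, the short Python sketch below shows the core idea: a sparse two-dimensional array indexed by string row and column keys, whose addition operates over the union of the keys. It is a minimal illustration under those assumptions; the Assoc class and its methods here are hypothetical and are not the D4M API.

    # Minimal associative-array sketch (illustrative only; not D4M itself).
    class Assoc:
        """Sparse 2-D array indexed by (row key, column key) pairs."""
        def __init__(self, triples=()):
            self.data = {}
            for r, c, v in triples:
                self.data[(r, c)] = v

        def __add__(self, other):
            # Element-wise sum over the union of the two key sets.
            out = Assoc()
            for k in set(self.data) | set(other.data):
                out.data[k] = self.data.get(k, 0) + other.data.get(k, 0)
            return out

        def __getitem__(self, rc):
            return self.data.get(rc, 0)

    # Hypothetical usage: merging per-document word counts.
    a = Assoc([("doc1", "alice", 1), ("doc1", "bob", 2)])
    b = Assoc([("doc1", "bob", 3), ("doc2", "alice", 1)])
    c = a + b
    print(c["doc1", "bob"])    # 5
    print(c["doc2", "alice"])  # 1

The abstract's "composable, unified data model" refers to building analyses out of whole-array operations of this kind rather than explicit key-by-key loops.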