2015 IEEE High Performance
Extreme Computing Conference
(HPEC ’15)
Nineteenth Annual HPEC Conference
15 - 17 September 2015
Westin Hotel, Waltham, MA USA
Bioinformatics & Big Data 2
1:00-2:40 in Eden Vale C1 - C2
Chair: Don Peck / General Electric
D4M: Bringing Associative Arrays to Database Engines
Vijay Gadepally, Jeremy Kepner, William Arcand, David Bestor, Bill Bergeron, Chansup Byun, Lauren Edwards, Matthew Hubbell, Peter
Michaleas, Julie Mullen, Andrew Prout, Antonio Rosa, Charles Yee, Albert Reuther, MIT Lincoln Laboratory
Collecting and analyzing large amounts of data is a growing problem within the scientific community. The widening gap
between data and users calls for innovative tools that address the challenges posed by big data volume, velocity and variety.
Numerous tools exist that allow users to store, query and index these massive quantities of data. Each storage or database
engine comes with the promise of dealing with complex data. Scientists and engineers who wish to use these systems often
quickly find that there is no single technology that offers a panacea to the complexity of information. When using multiple
technologies, however, there is significant trouble in designing the movement of information between storage and database
engines to support an end-to-end application along with a steep learning curve associated with learning the nuances of each
underlying technology. In this article, we present the Dynamic Distributed Dimensional Data Model (D4M) as a potential tool to
unify database and storage engine operations. Previous articles on D4M have showcased the ability of D4M to interact with the
popular NoSQL Accumulo database. More recently, however, D4M has come to operate on a variety of backend storage or database
engines while providing a federated look to the end user through the use of associative arrays. In order to showcase how new
databases may be supported by D4M, we describe the process of building the D4M-SciDB connector and present the performance
of this connection.
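The associative-array idea mentioned in this abstract can be illustrated with a small Python sketch. This is not the D4M API (D4M itself is a MATLAB/Octave library); the `Assoc` class, its methods, and the sample keys below are hypothetical names chosen only to show how string-keyed rows and columns can be combined and queried independently of the backend that stores them.

```python
from collections import defaultdict


class Assoc:
    """Toy associative array: maps (row_key, col_key) string pairs to numbers.

    Hypothetical illustration only; it does not mirror the D4M interface.
    """

    def __init__(self, triples=()):
        self.data = defaultdict(float)
        for row, col, value in triples:
            self.data[(row, col)] += value

    def __add__(self, other):
        # Element-wise sum over the union of keys in both arrays.
        out = Assoc()
        for src in (self, other):
            for key, value in src.data.items():
                out.data[key] += value
        return out

    def rows(self, prefix):
        # Sub-array containing only rows whose key starts with `prefix`.
        return Assoc(
            (r, c, v) for (r, c), v in self.data.items() if r.startswith(prefix)
        )

    def to_triples(self):
        # Sorted (row, col, value) triples, a backend-neutral exchange form.
        return sorted((r, c, v) for (r, c), v in self.data.items())


a = Assoc([("alice", "word|big", 1), ("bob", "word|data", 2)])
b = Assoc([("alice", "word|big", 3)])
combined = a + b
```

Because the array is just a set of keyed triples, the same object could in principle be filled from, or written to, a key-value store, a table, or an array database, which is the kind of uniform view the abstract describes.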
[Best Student Paper Finalist]
Improving Big Data Visual Analytics with Interactive Virtual Reality
Andrew Moran, MIT Dept. of Electrical Engineering and Computer Science, Vijay Gadepally, Matthew Hubbell, Jeremy Kepner, MIT Lincoln
Laboratory
For decades, the growth and volume of digital data collection has made it challenging to digest large volumes of information and
extract underlying structure. Coined 'Big Data', massive amounts of information have quite often been gathered inconsistently
(e.g., from many sources, in various forms, at different rates). These factors impede the practices of not only processing
data, but also analyzing and displaying it in an efficient manner to the user. Much work has been done in the data
mining and visual analytics communities to create effective ways to improve analysis and gain the knowledge needed
for better understanding. Our approach for improved big data visual analytics is two-fold, focusing on both visualization and
interaction. Given geo-tagged information, we are exploring the benefits of visualizing datasets in the original geospatial domain
by utilizing a virtual reality platform. After running proven analytics on the data, we intend to represent the information in a more
realistic 3D setting, where analysts can achieve an enhanced situational awareness and rely on familiar perceptions to draw
in-depth conclusions on the dataset. In addition, developing a human-computer interface that responds to natural user actions and
inputs creates a more intuitive environment. Tasks can be performed to manipulate the dataset and allow users to dive deeper
upon request, adhering to desired demands and intentions. Due to the volume and popularity of social media, we developed a
3D tool visualizing Twitter on MIT's campus for analysis. Utilizing emerging technologies of today to create a fully immersive
tool that promotes visualization and interaction can help ease the process of understanding and representing big data.
Biomedical Relation Extraction Using Stochastic Difference Equations
Carl Tony Fakhry, Kourosh Zarringhalam, Ping Chen, University of Massachusetts Boston
We propose an unsupervised method for extracting causal relations between biomedical entities using stochastic difference
equations (SDE). Our method attempts to generalize the propagation of relevance in medical sentences in order to extract
related biomedical terms and the semantic relation between them. We model the propagation of relevance in candidate medical
sentences through the use of stochastic difference equations. The equation of propagation of relevance helps in identifying the
most relevant medical terms in a causal sentence. It also increases the accuracy of information extraction, as it allows a
threshold to be set on the minimum relevance required for relation extraction.
High Performance Computing of Gene Regulatory Networks using a Message-Passing Model
Kimberly Glass, Brigham and Women’s Hospital and Harvard Medical School, John Quackenbush, Dana-Farber Cancer Institute and Harvard
School of Public Health, Jeremy Kepner, MIT Lincoln Laboratory
Gene regulatory network reconstruction is a fundamental problem in computational biology. We recently developed an algorithm,
called PANDA (Passing Attributes Between Networks for Data Assimilation), that integrates multiple sources of 'omics data and
estimates regulatory network models. This approach was initially implemented in the C++ programming language and has since
been applied to a number of biological systems. In our current research we are beginning to expand the algorithm to incorporate
larger and more diverse data sets, to reconstruct networks that contain increasing numbers of elements, and to build not only
single network models, but sets of networks. In order to accomplish these "Big Data" applications, it has become critical that we
increase the computational efficiency of the PANDA implementation. In this paper we show how to recast PANDA's similarity
equations as matrix operations. This allows us to implement a highly readable version of the algorithm using the
MATLAB/Octave programming language. We find that the resulting M-code is much shorter (103 compared to 1128 lines) and
more easily modifiable for potential future applications. The new implementation also runs significantly faster, with increasing
efficiency as the network models increase in size. Tests comparing the C-code and M-code versions of PANDA demonstrate that
this speed-up is on the order of 20-80 times for networks of similar dimensions to those we find in current biological
applications.
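The general idea of recasting pairwise similarity equations as matrix operations can be sketched as follows. This is a Python/NumPy illustration, not the authors' MATLAB/Octave code, and the Tanimoto-style similarity used here is an assumed stand-in for the similarity equations the abstract refers to. The function names are hypothetical.

```python
import numpy as np


def similarity_loops(A, B):
    """Pairwise similarity between rows of A and columns of B, one pair at a time.

    Assumed form (illustrative): dot / sqrt(|x|^2 + |y|^2 - |dot|).
    """
    n, m = A.shape[0], B.shape[1]
    out = np.empty((n, m))
    for i in range(n):
        for j in range(m):
            x, y = A[i, :], B[:, j]
            dot = float(np.dot(x, y))
            out[i, j] = dot / np.sqrt(np.dot(x, x) + np.dot(y, y) - abs(dot))
    return out


def similarity_matrix(A, B):
    """Same quantity computed with one matrix product and broadcasting."""
    dots = A @ B                                    # all pairwise dot products
    row_sq = np.sum(A * A, axis=1, keepdims=True)   # shape (n, 1)
    col_sq = np.sum(B * B, axis=0, keepdims=True)   # shape (1, m)
    return dots / np.sqrt(row_sq + col_sq - np.abs(dots))


rng = np.random.default_rng(0)
A = rng.normal(size=(5, 4))
B = rng.normal(size=(4, 6))
loop_result = similarity_loops(A, B)
matrix_result = similarity_matrix(A, B)
```

Both functions return the same values; the matrix form replaces the double loop with a single product plus broadcasting, which is typically where array languages gain their speed and brevity.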
Lustre, Hadoop, Accumulo
Jeremy Kepner, William Arcand, David Bestor, Bill Bergeron, Chansup Byun, Lauren Edwards, Vijay Gadepally, Matthew Hubbell, Peter
Michaleas, Julie Mullen, Andrew Prout, Antonio Rosa, Charles Yee, Albert Reuther, MIT Lincoln Laboratory
Data processing systems impose multiple views on data as it is processed by the system. These views include spreadsheets,
databases, matrices, and graphs. There is a wide variety of technologies that can be used to store and process data through
these different steps. The Lustre parallel file system, the Hadoop distributed file system, and the Accumulo database are all
designed to address the largest and the most challenging data storage problems. There have been many ad-hoc comparisons
of these technologies. This paper describes the foundational principles of each technology, provides simple models for
assessing their capabilities, and compares the various technologies on a hypothetical common cluster. These comparisons
indicate that Lustre provides 2x more storage capacity, is less likely to lose data during 3 simultaneous drive failures, and
provides higher bandwidth on general purpose workloads. Hadoop can provide 4x greater read bandwidth on special purpose
workloads. Accumulo provides 10^5x lower latency on random lookups than either Lustre or Hadoop, but Accumulo’s bulk
bandwidth is 10x less. Significant recent work has been done to enable mix-and-match solutions that allow Lustre, Hadoop, and
Accumulo to be combined in different ways.
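A simple capacity model of the kind this abstract alludes to might look like the sketch below. The parameter values are illustrative assumptions, not figures from the paper: 3-way replication (the HDFS default) on one side, and a parity layout with 8 data drives and 2 parity drives, assumed here as a typical configuration beneath a parallel file system, on the other. The function names are hypothetical, and the resulting ratio depends entirely on these choices.

```python
def usable_fraction_replication(copies):
    """Fraction of raw capacity that holds unique data under n-way replication."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 / copies


def usable_fraction_parity(data_drives, parity_drives):
    """Fraction of raw capacity that holds unique data in a parity group."""
    if data_drives < 1 or parity_drives < 0:
        raise ValueError("need at least one data drive and no negative parity")
    return data_drives / (data_drives + parity_drives)


# Assumed, illustrative configurations (not taken from the paper).
replicated = usable_fraction_replication(copies=3)
parity = usable_fraction_parity(data_drives=8, parity_drives=2)
capacity_ratio = parity / replicated
```

With these assumed values the parity layout exposes 80% of the raw capacity and replication about 33%, a ratio of roughly 2.4; other layouts give other ratios.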
Thursday, September 17