2015 IEEE High Performance Extreme Computing Conference (HPEC ‘15) Nineteenth Annual HPEC Conference 15 - 17 September 2015 Westin Hotel, Waltham, MA USA
Bioinformatics & Big Data 2
1:00-2:40 in Eden Vale C1 - C2
Chair: Don Peck / General Electric

D4M: Bringing Associative Arrays to Database Engines
Vijay Gadepally, Jeremy Kepner, William Arcand, David Bestor, Bill Bergeron, Chansup Byun, Lauren Edwards, Matthew Hubbell, Peter Michaleas, Julie Mullen, Andrew Prout, Antonio Rosa, Charles Yee, Albert Reuther, MIT Lincoln Laboratory
The ability to collect and analyze large amounts of data is a growing challenge within the scientific community. The growing gap between data and users calls for innovative tools that address the challenges posed by big data volume, velocity, and variety. Numerous tools exist that allow users to store, query, and index these massive quantities of data, and each storage or database engine comes with the promise of dealing with complex data. Scientists and engineers who wish to use these systems often quickly find that no single technology offers a panacea to the complexity of information. When using multiple technologies, however, there is significant difficulty in designing the movement of information between storage and database engines to support an end-to-end application, along with a steep learning curve associated with the nuances of each underlying technology. In this article, we present the Dynamic Distributed Dimensional Data Model (D4M) as a potential tool to unify database and storage engine operations. Previous articles on D4M have showcased its ability to interact with the popular NoSQL Accumulo database. Recently, however, D4M has been extended to operate on a variety of backend storage or database engines while providing a federated view to the end user through the use of associative arrays. To showcase how new databases may be supported by D4M, we describe the process of building the D4M-SciDB connector and present the performance of this connection.
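The associative-array abstraction that D4M layers over its backends can be pictured with a small sketch. The Python fragment below is purely illustrative and is not the D4M API (D4M itself is distributed as MATLAB/Octave and Julia code); it only shows the idea of indexing entries by string row and column keys, so that the same row-query and element-wise operations could in principle be mapped onto an in-memory array or a database table. All names and data here are hypothetical.

class AssocSketch:
    """A toy associative array: (row key, column key) -> value."""
    def __init__(self, triples=()):
        self.data = {(r, c): v for r, c, v in triples}

    def row(self, r):
        """All (column, value) pairs in row r -- analogous to querying one row."""
        return {c: v for (rr, c), v in self.data.items() if rr == r}

    def __add__(self, other):
        """Element-wise sum; entries missing from one array count as zero."""
        out = AssocSketch()
        keys = set(self.data) | set(other.data)
        out.data = {k: self.data.get(k, 0) + other.data.get(k, 0) for k in keys}
        return out

# Hypothetical example data: word counts per document.
A = AssocSketch([("doc1", "word|apple", 2), ("doc1", "word|pear", 1)])
B = AssocSketch([("doc1", "word|apple", 3), ("doc2", "word|plum", 5)])
print((A + B).row("doc1"))   # apple: 5, pear: 1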
[Best Student Paper Finalist] Improving Big Data Visual Analytics with Interactive Virtual Reality
Andrew Moran, MIT Dept. of Electrical Engineering and Computer Science; Vijay Gadepally, Matthew Hubbell, Jeremy Kepner, MIT Lincoln Laboratory
For decades, the growth in the volume of digital data collection has made it challenging to digest large volumes of information and extract underlying structure. Coined 'Big Data', massive amounts of information have quite often been gathered inconsistently (e.g., from many sources, in various forms, at different rates). These factors impede not only processing the data, but also analyzing and displaying it to the user in an efficient manner. Many efforts have been made in the data mining and visual analytics communities to create effective ways to further improve analysis and gain the knowledge desired for better understanding. Our approach to improving big data visual analytics is two-fold, focusing on both visualization and interaction. Given geo-tagged information, we are exploring the benefits of visualizing datasets in the original geospatial domain by utilizing a virtual reality platform. After running proven analytics on the data, we intend to represent the information in a more realistic 3D setting, where analysts can achieve enhanced situational awareness and rely on familiar perceptions to draw in-depth conclusions about the dataset. In addition, developing a human-computer interface that responds to natural user actions and inputs creates a more intuitive environment. Tasks can be performed to manipulate the dataset and allow users to dive deeper upon request, adhering to desired demands and intentions. Due to the volume and popularity of social media, we developed a 3D tool visualizing Twitter data on MIT's campus for analysis. Utilizing today's emerging technologies to create a fully immersive tool that promotes visualization and interaction can help ease the process of understanding and representing big data.

Biomedical Relation Extraction Using Stochastic Difference Equations
Carl Tony Fakhry, Kourosh Zarringhalam, Ping Chen, University of Massachusetts Boston
We propose an unsupervised method for extracting causal relations between biomedical entities using stochastic difference equations (SDEs). Our method attempts to generalize the propagation of relevance in medical sentences in order to extract related biomedical terms and the semantic relation between them. We model the propagation of relevance in candidate medical sentences through the use of stochastic difference equations. The equation of propagation of relevance helps identify the most relevant medical terms in a causal sentence. It also increases the accuracy of information extraction, as it allows a threshold to be set on the minimum relevance required for relation extraction.

High Performance Computing of Gene Regulatory Networks using a Message-Passing Model
Kimberly Glass, Brigham and Women’s Hospital and Harvard Medical School; John Quackenbush, Dana-Farber Cancer Institute and Harvard School of Public Health; Jeremy Kepner, MIT Lincoln Laboratory
Gene regulatory network reconstruction is a fundamental problem in computational biology. We recently developed an algorithm, called PANDA (Passing Attributes Between Networks for Data Assimilation), that integrates multiple sources of 'omics data and estimates regulatory network models. This approach was initially implemented in the C++ programming language and has since been applied to a number of biological systems. In our current research we are beginning to expand the algorithm to incorporate larger and more diverse data sets, to reconstruct networks that contain increasing numbers of elements, and to build not only single network models but sets of networks. To accomplish these "Big Data" applications, it has become critical that we increase the computational efficiency of the PANDA implementation. In this paper we show how to recast PANDA's similarity equations as matrix operations. This allows us to implement a highly readable version of the algorithm in the MATLAB/Octave programming language. We find that the resulting M-code is much shorter (103 lines compared to 1128) and more easily modifiable for potential future applications. The new implementation also runs significantly faster, with increasing efficiency as the network models increase in size. Tests comparing the C-code and M-code versions of PANDA demonstrate that this speed-up is on the order of 20-80 times for networks of dimensions similar to those we find in current biological applications.
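The central claim of the PANDA abstract is that recasting element-wise similarity updates as matrix operations yields much shorter and faster code. The fragment below is a generic Python/NumPy illustration of that style of rewrite, not PANDA's actual similarity equations: a pairwise dot-product similarity computed first with explicit loops and then as a single matrix product, which gives the same result far more compactly. The data sizes are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))   # hypothetical data: 300 nodes, 50 features each

# Loop form: fill the similarity matrix one entry at a time.
S_loop = np.empty((X.shape[0], X.shape[0]))
for i in range(X.shape[0]):
    for j in range(X.shape[0]):
        S_loop[i, j] = X[i] @ X[j]

# Matrix form: the same quantity expressed as a single matrix product.
S_mat = X @ X.T

assert np.allclose(S_loop, S_mat)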
Lustre, Hadoop, Accumulo
Jeremy Kepner, William Arcand, David Bestor, Bill Bergeron, Chansup Byun, Lauren Edwards, Vijay Gadepally, Matthew Hubbell, Peter Michaleas, Julie Mullen, Andrew Prout, Antonio Rosa, Charles Yee, Albert Reuther, MIT Lincoln Laboratory
Data processing systems impose multiple views on data as it is processed by the system. These views include spreadsheets, databases, matrices, and graphs. A wide variety of technologies can be used to store and process data through these different steps. The Lustre parallel file system, the Hadoop distributed file system, and the Accumulo database are all designed to address the largest and most challenging data storage problems. There have been many ad hoc comparisons of these technologies. This paper describes the foundational principles of each technology, provides simple models for assessing their capabilities, and compares the technologies on a hypothetical common cluster. These comparisons indicate that Lustre provides 2x more storage capacity, is less likely to lose data during three simultaneous drive failures, and provides higher bandwidth on general-purpose workloads. Hadoop can provide 4x greater read bandwidth on special-purpose workloads. Accumulo provides 10^5 lower latency on random lookups than either Lustre or Hadoop, but Accumulo's bulk bandwidth is 10x less. Significant recent work has been done to enable mix-and-match solutions that allow Lustre, Hadoop, and Accumulo to be combined in different ways.
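As an illustration of the kind of simple capacity model the abstract refers to, the back-of-envelope Python calculation below compares usable storage under two assumed configurations: triple replication for the Hadoop distributed file system and 8+2 RAID6 volumes under Lustre. The parameters are assumptions chosen for illustration, not figures taken from the paper, but they show how a roughly 2x usable-capacity advantage can arise.

# Back-of-envelope usable-capacity model (assumed parameters, not from the paper):
# HDFS is assumed to keep 3 full replicas of every block, while Lustre is
# assumed to sit on 8+2 RAID6 groups (8 data disks out of every 10).
raw_tb = 1000.0                       # hypothetical raw disk capacity

hdfs_usable = raw_tb / 3              # 3x replication -> one third usable
lustre_usable = raw_tb * 8 / 10       # 8+2 RAID6 -> 80% usable

print(f"HDFS usable:   {hdfs_usable:.0f} TB")
print(f"Lustre usable: {lustre_usable:.0f} TB")
print(f"ratio: {lustre_usable / hdfs_usable:.1f}x")   # ~2.4x, i.e. roughly 2x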
Thursday, September 17