Genomic Science Program
U.S. Department of Energy | Office of Science | Biological and Environmental Research Program

KBase R&D ARRA Projects

Pilot Projects

To explore possible Systems Biology Knowledgebase implementations and architectures, and to demonstrate and characterize the range of computational challenges facing Genomic Science projects, part of this project supported the development of pilot projects to address three goals:

  • Develop benchmarks for existing computational biology and bioinformatics programs on existing architectures,
  • Develop prototype computational biology and bioinformatics programs on new architectures, including cloud architectures,
  • Develop novel and integrative web platforms as possible solutions to bioinformatics problems in anticipation of and to inform a future Systems Biology Knowledgebase.

The goal of the pilot projects was to identify computational problems and solutions, in the context of the Knowledgebase, that would inform the design approaches presented in the final report.

Five pilot projects were selected and are listed below.

(1) An Experience Report: Porting the Existing MG-RAST Multi-User Web Application to the Cloud

Argonne National Laboratory, Folker Meyer, Principal Investigator
Jared Wilkening, Andreas Wilke, Elizabeth M. Glass, Narayan L. Desai, and Folker Meyer

This project investigated the requirements for distributing data across multiple platforms to optimize computational throughput. Researchers focused on the similarity analysis stage of the MG-RAST metagenome annotation server. This stage is implemented using the National Center for Biotechnology Information’s BLAST resource, and investigators determined it was a good candidate for distributed computing. The project also developed guidelines for determining how best to use cloud and ad hoc computational resources.

Download: Project summary

(2) Design Requirements and Prototypes of Workflows in the SBKB for Support of Engineering of Metabolic Pathways

Lawrence Berkeley National Laboratory, Adam Arkin, Principal Investigator
Dylan Chivian, John Bates, Paramvir Dehal, Marcin Joachimiak, Morgan Price, Vinay Satish Kumar, and Adam Arkin

This project designed and implemented workflows for metabolic reconstruction within MicrobesOnline, a web portal for comparative and functional genomic analyses. Investigators began developing interfaces for navigating metabolic networks and experimental functional omics data using the Google-like Application for Metabolic Maps (GLAMM). GLAMM suggests pathways that may offer routes for retrosynthesis (e.g., how to build a pathway that converts feedstock X into chemical Y in organism Z).

Download: Project summary

(3) Database Management Systems Technologies for Computational Biology & Bioinformatics Applications

Lawrence Berkeley National Laboratory, Victor Markowitz, Principal Investigator
Victor Markowitz

This project evaluated new database management system technologies that allow efficient analysis of very large datasets. Prototypes of a large database based on the DOE JGI’s Integrated Microbial Genomes (IMG) data management system were implemented using several of these technologies. Performance tests of IMG “all versus all” data were conducted in HBase on the DOE National Energy Research Scientific Computing Center’s Magellan Hadoop cluster and on a smaller departmental Hadoop cluster. Results show that distributed tabular storage has significant long-term potential for KBase but is not yet ready for large-scale production use. Investigators note that Hadoop and HBase are undergoing rapid development, and they anticipate that stability issues will be addressed within two years.

Download: Project summary

(4) Exploring Architecture Options for Workflows in a Federated, Cloud-based Systems Biology Knowledgebase

Pacific Northwest National Laboratory, Ian Gorton, Principal Investigator
Ian Gorton, Yan Liu, Jian Yin, Lee Ann McCue, Bill Cannon, and Gordon Anderson

This project investigated available mechanisms for storing and accessing biological data in a cloud computing environment and evaluated access to large archives of “omics” data using a cloud architecture that provides “Data as a Service.” Investigators established a use case scenario for identifying and curating published genome annotations and implemented this workflow using a federated cloud architecture as proposed for KBase.

Download: Project summary

(5) Final Evaluation Report for the Semantic Driven Knowledge Discovery and Integration in the Systems Biology Knowledgebase Project

Pacific Northwest National Laboratory, Kerstin Kleese van Dam, Principal Investigator
Kerstin Kleese van Dam, Cliff Joslyn, Lee Ann McCue, Bill Cannon, Carina Lansing, Zoe Guillen, Margaret Romine, Gordon Anderson, and Abigail Corrigan

This project gathered requirements to design test scenarios for semantic services such as data annotation, publication, search, access, and integration in KBase. Investigators developed a prototype test environment that included a collaborative, project-centric user environment and a prototype data services infrastructure to support the KBase user environment. The project demonstrated that semantic technologies are sufficiently mature to be used in a production environment to support research.

Download: Project summary