A map of gene expression correlation triangles with positive correlations (blue edges) between Kalanchoë genes (dark green nodes) and pineapple genes (yellow nodes) and negative correlations (red edges) between Kalanchoë or pineapple genes and Arabidopsis genes (light green nodes) [Dan Jacobson, Oak Ridge National Laboratory]
The scientific objectives of the Genomic Science program require a highly coordinated application of expertise that transcends traditional disciplinary boundaries. As such, one of the program's most challenging but critical goals is the creation of robust computational frameworks for data integration, analysis, and sharing that can accommodate the wide variety of heterogeneous data streams being generated across the Genomic Science community. These data streams include not only the various types of omics data (as well as meta-omics variations) discussed earlier in this report, but also data derived from non-omics analytical technologies for quantitative physiological analysis, physicochemical measurements of environmental factors, and a vast array of other experimental data types.
Data-specific needs for Genomic Science program research often revolve around tracking high-throughput experimental and contextual environmental data; developing tools for capturing and archiving large and complex datasets; and generating innovative approaches for analysis, distillation, and integration of systems biology data. Tracking the data requires a Laboratory Information Management System (LIMS) appropriate for specific project needs to monitor experimental cycles, track samples and workflows, and collect internally compatible data from varying instrument types. Data capture and storage present considerable challenges, and the volume and complexity of data generated by systems biology research often require new technologies and bioinformatics approaches that permit rapid data storage, retrieval, and transfer at very large scales. The generation of raw data only begins the cycle of scientific inquiry. Improved data-distillation strategies for filtering out noise, compressing noncritical information, and identifying biologically meaningful data subsets are critical for enabling subsequent cycles of analysis, integration, and modeling.
Modeling and simulation aim to build a more integrated understanding of the dynamic nature of biological systems and enable scientists to test their knowledge via computerized "virtual experiments." Creating models that predict biosystem response to untested conditions requires continued emphasis on quantitative details such as kinetic constants, enzyme activities, and dynamic metabolic measurements underlying functional biological processes. Continuing to build on well-developed model organisms such as Escherichia coli, Saccharomyces cerevisiae, and Arabidopsis thaliana is important, as is extending this area of research to develop predictive models of biological function in a broader class of organisms. Moving beyond the level of individual organisms, new mathematical and machine-learning methods are needed to address biological variables at the community scale and understand evolving interactions with external signals from the environment. Even as more powerful resources for high-performance computing become available, the amount of biological data produced by high-throughput experimental approaches is growing at a faster pace. Although these data continue to yield insights into biological systems and improve the quantitative understanding of them, incorporating detailed molecular, biochemical, physiological, and structural information into biological models and simulations remains a major challenge.
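To make the idea of a "virtual experiment" built on kinetic constants concrete, the following is a minimal sketch, not a method used by the program. It assumes a single enzyme-catalyzed reaction that follows Michaelis-Menten kinetics and integrates it with a simple forward Euler step. The function name `simulate_michaelis_menten` and all parameter values (`s0`, `vmax`, `km`) are hypothetical and chosen only for illustration.

```python
def simulate_michaelis_menten(s0, vmax, km, dt=0.01, steps=1000):
    """Integrate dS/dt = -vmax * S / (km + S) with forward Euler.

    Returns a list of (time, substrate, product) tuples.
    All parameters are illustrative assumptions, not measured values.
    """
    substrate, product = s0, 0.0
    trajectory = [(0.0, substrate, product)]
    for i in range(1, steps + 1):
        rate = vmax * substrate / (km + substrate)
        # Never convert more substrate than is currently available.
        delta = min(rate * dt, substrate)
        substrate -= delta
        product += delta
        trajectory.append((i * dt, substrate, product))
    return trajectory


if __name__ == "__main__":
    # "Virtual experiment": compare a baseline enzyme with a hypothetical
    # variant whose maximum rate (vmax) is doubled.
    baseline = simulate_michaelis_menten(s0=10.0, vmax=1.0, km=2.0)
    variant = simulate_michaelis_menten(s0=10.0, vmax=2.0, km=2.0)
    for label, run in (("baseline", baseline), ("variant", variant)):
        t, s, p = run[-1]
        print(f"{label}: t={t:.1f} substrate={s:.3f} product={p:.3f}")
```

Changing `vmax` or `km` and rerunning the simulation stands in for testing an untested condition. A predictive model of a real biosystem would couple many such reactions and would need measured constants rather than assumed ones.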
Given the data-intensive nature of Genomic Science research, all supported projects are required to generate data management and integration plans that emphasize an iterative approach to data analysis and lead to a predictive understanding of the biological system(s) under investigation. Developing these types of plans not only serves the objectives of the individual project, but also facilitates the collaborative sharing of resulting data across the broader community via mechanisms such as the DOE Systems Biology Knowledgebase. The long-term success of the Genomic Science program, and systems biology in general, depends on achieving high levels of data and information integration and sharing. BER has established an information and data sharing policy requiring public accessibility to all publishable information.