Accelerating Biological Discovery by Enabling Genomic Resources in Deep Phenotyping on Automated Platforms (DRAFT)

AI Pilot Project: Pacific Northwest National Laboratory

  • Principal Investigator: Jaydeep Bardhan (PNNL)
  • Co-Investigator: Kjiersten Fagnan (Lawrence Berkeley National Laboratory)
  • Scope/Objectives: Develop AI agent technologies and autonomous workflows to leverage the BER Data Lakehouse, integrating automated high-throughput experimental data
  • Potential Impact and Interface with the American Science Cloud (AmSC) and Transformational AI Models Consortium (ModCon): Develop AI models as Model Team and connect with AmSC through BER Data Lakehouse

Summary

The team will develop emerging artificial intelligence (AI) agent technologies and autonomous workflows that leverage the Biological and Environmental Research (BER) program’s Data Lakehouse (DLH) ecosystem to integrate automated high-content, deep-phenotyping experimental capabilities (mass-spectrometry proteomics and metabolomics, and high-throughput imaging), ultimately creating a unified framework that can interpret complex biological responses at scale on automated laboratory platforms. This pilot-scale effort focuses on the soil bacterium Pseudomonas putida as a model organism because of its metabolic versatility, bioeconomic potential, and relevance to the extraction and recovery of critical minerals and materials (CMMs). The availability of existing data makes P. putida an ideal use case for advancing AI agents and agentic workflows, which have become essential for deciphering and optimizing the complex multiscale interactions among the genome, molecular function, physiology, and macroscale phenome of biological systems. Using critical challenges in understanding and optimizing biological mechanisms for CMM extraction and production as a driver, the project aims to bridge the gap between extensive genomic data availability and persistent challenges in translating genetic information into functional understanding across varying environmental conditions. The project’s short-term goal is to integrate deep-phenotyping data into cohesive predictive models, leveraging the data resources to be included in the DLH to support autonomous multiomics analysis, metabolic model development, and experimental design, ultimately accelerating the optimization of CMM extraction processes. Success will yield significant advances toward autonomous scientific discovery by bridging the gap between high-dimensional biological data and actionable insights for biodesign, thereby enhancing supply chain security and resilience for CMMs.

The long-term vision is for BER to transform the U.S. bioeconomy by leveraging AI to integrate the capabilities of world-leading Office of Science user facilities across multiple programs [e.g., BER data resources and facilities, Basic Energy Sciences (BES) light sources, and Advanced Scientific Computing Research (ASCR) exascale computing resources]. Collectively, these resources represent a suite of capabilities well beyond the medium-term reach of any industry. The team’s long-term strategy is to leverage autonomous collaboration between AI agents to solve rate-limiting problems in design–build–test–learn (DBTL) cycles. Here, the project focuses on the strategic opportunity of automating the analysis of high-content, deep-phenotyping data, enabling comprehensive integration with predictive models and experimental design. The value of these modalities, especially when combined with lab automation, has been proven in segments of the life sciences industry but has not yet been realized in the emerging bioeconomy. Thus, this project aims to establish (1) how genomics-based research benefits from deep-phenotyping capabilities and multi-institutional autonomous scientific workflows for a bioeconomy use case, conducted on a unified data ecosystem, and (2) how agentic-AI capabilities can create network effects (advantages of scale) across the larger BER (and Office of Science) networks of experimental and data resources.

The project’s multidisciplinary, multi-institutional team includes computing and data leadership from both the Environmental Molecular Sciences Laboratory (EMSL; PI and Computing, Analytics, and Modeling lead Jaydeep Bardhan) and the Joint Genome Institute (JGI; Chief Informatics Officer Kjiersten Fagnan); AI experts with deep domain expertise in mass-spectrometry omics, image analytics, and systems biology (thrust leads Aivett Bilbao and Song Feng); domain experts; and senior system architects (thrust lead Ravi Ravichandran) with experience in enterprise-scale data transformations and the deployment of AI capabilities.