IDeA: Intelligent Design Assistant for Enzyme Discovery and Biosynthetic Pathway Optimization (DRAFT)

AI Pilot Project: Argonne National Laboratory

  • Principal Investigator: Arvind Ramanathan (ANL)
  • Co-Investigator: Josh Michener (Oak Ridge National Laboratory)
  • Scope/Objectives: Create a scalable, autonomous computational framework (agentic framework) that will serve as a foundational driver for DOE BER Data Lakehouse development.
  • Potential Impact and Interface with the American Science Cloud (AmSC) and Transformational AI Models Consortium (ModCon): Leverage AmSC in alignment with the Data Lakehouse.

Summary

The Intelligent Design Assistant (IDeA) project addresses critical bottlenecks in biosystems design by developing the first intelligent computational assistant (an AI co-scientist) that autonomously guides enzyme discovery and biosynthetic pathway optimization within the DOE BER ecosystem. Current biosystems design processes are severely constrained by time-intensive, expert-dependent approaches that require years to move from concept to scalable production, while the fragmentation of biological data across siloed databases prevents artificial intelligence from realizing its transformative potential. IDeA addresses these challenges by creating a scalable, autonomous computational framework (agentic framework) that serves a dual purpose: it is both a transformative scientific tool and a foundational driver for DOE BER Data Lakehouse development.

The pilot phase focuses on three integrated aims to establish IDeA’s foundational capabilities using amide synthetases for polyamide biosynthesis as a motivating use case.

  • Aim 1 develops an AI co-scientist that intelligently orchestrates computational biology toolkits through enhanced natural language interfaces, comprehensive tool integration (bioinformatics, cheminformatics, simulation), and step-by-step computational decision-making processes with built-in verification mechanisms.
  • Aim 2 implements reasoning frameworks through computational models integrating diverse data across gene → protein → function → environment scales and causal reasoning modules to enable interpretable, mechanistically informed discoveries.
  • Aim 3 establishes multi-objective optimization frameworks that simultaneously improve enzyme kinetic parameters, thermostability, and substrate selectivity while incorporating Bayesian uncertainty quantification to guide experimental validation.
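As an illustration of the kind of selection step Aim 3 describes, the following minimal Python sketch filters candidate variants to a Pareto front over three objectives and then picks the front member with the largest relative predictive uncertainty as the next candidate for experimental validation. It is a sketch under stated assumptions, not the project's method: the variant names, the numeric values, the `Variant` class, and the helper functions are all hypothetical, and the standard deviations simply stand in for whatever a Bayesian model would report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical enzyme variant with predicted objectives (higher is better)."""
    name: str
    mean: tuple[float, ...]  # e.g. (kinetic score, thermostability, selectivity)
    std: tuple[float, ...]   # assumed predictive standard deviations


def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(variants):
    """Keep the variants whose predicted means no other variant dominates."""
    return [v for v in variants
            if not any(dominates(o.mean, v.mean) for o in variants)]


def relative_uncertainty(v):
    """Sum of std/mean per objective; a crude, unit-free heuristic (assumes mean > 0)."""
    return sum(s / m for s, m in zip(v.std, v.mean))


def most_uncertain(variants):
    """Pick the variant whose predictions are least certain."""
    return max(variants, key=relative_uncertainty)


# Illustrative values only; these are not project data.
variants = [
    Variant("v1", (1.0, 60.0, 0.9), (0.1, 2.0, 0.05)),
    Variant("v2", (2.0, 55.0, 0.8), (0.5, 5.0, 0.10)),
    Variant("v3", (0.9, 58.0, 0.7), (0.1, 1.0, 0.02)),  # dominated by v1
]

front = pareto_front(variants)
next_to_test = most_uncertain(front)
print([v.name for v in front], next_to_test.name)
```

Ranking on predicted means and then choosing by uncertainty is only one possible design; an acquisition function that weighs expected improvement against uncertainty would be a common alternative.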

The two-year pilot will deliver quantifiable scientific advances: (1) discovery of 3 to 5 novel amide synthetase candidates, with >10x acceleration in enzyme characterization compared to manual approaches; (2) demonstration of >80% accuracy in enzyme function prediction/annotation; and (3) validation of 3 to 5 optimized enzyme variants showing measurable improvements in at least two target properties, leveraging support from other DOE, BER, and national laboratory-directed research and development (LDRD) programs.

The project will contribute foundational Data Lakehouse infrastructure modules, including semantic interoperability frameworks, automated literature integration pipelines, and community-accessible optimization algorithms designed specifically for the constraints of biological systems. The project also brings together an interdisciplinary team of researchers spanning computational biology, genome science, mathematical sciences, and artificial intelligence (AI)/machine learning (ML) across two national laboratories (Argonne National Laboratory and Oak Ridge National Laboratory) to address enzyme discovery, design, and optimization as a prototype application for AI-based advanced reasoning techniques.

IDeA thus represents the first truly autonomous agentic system capable of functioning as an AI co-scientist within the biological sciences, leveraging DOE’s exascale computing infrastructure to orchestrate complex workflows spanning discovery, annotation, characterization, and design using a federated Data Lakehouse architecture. The system transforms biosystems design from a trial-and-error process into a data-driven, hypothesis-informed one, reducing development timelines from years to months while dramatically improving success rates. By integrating supercomputing capabilities with federated biological data access, IDeA enables simultaneous optimization of millions of enzyme variants and thousands of pathway configurations, delivering data-driven design decisions that account for molecular interactions, environmental conditions, and industrial constraints beyond human expert capabilities.

This foundational agentic infrastructure serves as a generalizable platform for diverse DOE applications, representing a critical investment in the computational biology capabilities required to maintain American technological leadership in biotechnology.