LAMBDA: A Lakehouse-enabled AI-ready Multimodal Bioimaging Data Architecture (DRAFT)

AI Pilot Project: Lawrence Berkeley National Laboratory

  • Principal Investigator: Paul D. Adams (LBNL)
  • Co-Investigators: Dion Antonopoulos (Argonne National Laboratory), Qun Liu (Brookhaven National Laboratory), Jon Taylor (Oak Ridge National Laboratory), James Evans (Pacific Northwest National Laboratory), Aina Cohen (SLAC National Accelerator Laboratory)
  • Website: lambda-doe.org
  • Scope/Objectives: Initiate a connected data framework usable by all BER-supported photon, neutron, and imaging resources, enabling biological insights and biomolecular design
  • Potential Impact and Interface with the American Science Cloud (AmSC) and Transformational AI Models Consortium (ModCon): LAMBDA can build AI models as a Model Team for ModCon and can leverage AmSC infrastructure

Summary

The Department of Energy’s Biological and Environmental Research (BER) program supports world-class photon, neutron, and electron facilities that generate complementary structural biology data critical for understanding complex biomolecular systems. However, these datasets are currently fragmented across facilities—stored in disparate formats and silos—making it difficult to integrate multimodal data for advanced analyses. This lack of interoperability limits progress on grand-challenge science that requires combining crystallography, scattering, cryo-EM/ET, nuclear magnetic resonance, and imaging data into unified views of biological structure and function.

To address this, the team will build LAMBDA, a Lakehouse-enabled AI-ready Multimodal Bioimaging Data Architecture: a cross-facility, standardized data framework for all BER-supported structural biology and imaging resources. LAMBDA will establish common data structures, metadata standards, and APIs to enable seamless integration of multimodal datasets into the BER Data Lakehouse, while incorporating AI tools for automated data access, harmonization, and workflow execution. The framework will follow a modular architecture supporting raw data ingestion, structured metadata for integrative querying, and AI-ready processed datasets. Wherever possible, the team will leverage existing community standards (e.g., PDB, NeXus, mmCIF, EMPIAR, Simple Scattering) and DOE data infrastructure (e.g., the BER Data Lakehouse, KBase, NMDC, DataFed) to ensure interoperability and sustainability. The approach is organized into three activities:
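The modular tiers described above (raw data ingestion, structured metadata for querying, and AI-ready processed datasets) can be sketched as a minimal record model. This is an illustrative sketch only: every class, field, facility name, and URI scheme below is an invented placeholder, not part of the LAMBDA design.

```python
from dataclasses import dataclass, field
from enum import Enum


class Tier(Enum):
    """Illustrative tiers of the modular architecture."""
    RAW = "raw"              # as-collected facility data
    METADATA = "metadata"    # structured records for integrative querying
    PROCESSED = "processed"  # AI-ready, harmonized datasets


@dataclass
class DatasetRecord:
    """Hypothetical cross-facility record for one multimodal dataset."""
    facility: str            # originating BER facility (placeholder values)
    modality: str            # e.g., "cryo-EM", "SAXS", "crystallography"
    tier: Tier
    uri: str                 # where the data lives within the lakehouse
    metadata: dict = field(default_factory=dict)  # community-schema fields

    def is_ai_ready(self) -> bool:
        # Only processed-tier data with populated metadata counts as AI-ready
        return self.tier is Tier.PROCESSED and bool(self.metadata)


rec = DatasetRecord(
    facility="ALS", modality="SAXS", tier=Tier.PROCESSED,
    uri="lakehouse://example/saxs/001",
    metadata={"sample_id": "S-001", "schema": "NeXus"},
)
print(rec.is_ai_ready())  # True
```

A record model like this is one way the framework could expose a uniform, API-queryable view over heterogeneous facility data while the raw files stay in place.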

  1. Community Input – Convene workshops and working groups to assess current practices, identify challenges in developing AI-ready data and federated workflows, and select high-value BER science use cases to drive framework development.
  2. Standardization – Develop a shared, API-accessible data and metadata infrastructure aligned with community schemas and BER Lakehouse design patterns, enabling cross-facility discovery and AI integration.
  3. Workflows – Implement automated, facility-adaptable workflows for metadata capture, harmonization, quality control, and secure data deposition, ensuring that local practices are respected while enabling global interoperability.
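The workflow steps in activity 3 (metadata capture, harmonization, quality control, and deposition) could be chained as a simple pipeline. The sketch below assumes each facility exposes captured metadata as key–value pairs; the key mappings, required fields, and QC rules are invented placeholders.

```python
# Map facility-local metadata keys onto a shared schema (invented mapping).
LOCAL_TO_SHARED = {
    "det_dist_mm": "detector_distance_mm",
    "wavelen_A": "wavelength_angstrom",
}

# Fields a record must carry before deposition (placeholder requirements).
REQUIRED_FIELDS = {"detector_distance_mm", "wavelength_angstrom", "sample_id"}


def harmonize(local_meta: dict) -> dict:
    """Rename facility-local keys to the shared schema, keeping others as-is."""
    return {LOCAL_TO_SHARED.get(k, k): v for k, v in local_meta.items()}


def quality_control(meta: dict) -> list[str]:
    """Return a list of QC problems; an empty list means the record passes."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - meta.keys())]
    if meta.get("wavelength_angstrom", 1.0) <= 0:
        problems.append("wavelength must be positive")
    return problems


def deposit(meta: dict) -> str:
    """Stub for secure deposition; a real system would call a facility API."""
    return "deposited" if not quality_control(meta) else "rejected"


# Captured at the facility in local terms, then pushed through the pipeline.
captured = {"det_dist_mm": 250, "wavelen_A": 1.03, "sample_id": "S-001"}
shared = harmonize(captured)
print(deposit(shared))  # deposited
```

Keeping the facility-local mapping (`LOCAL_TO_SHARED`) as the only site-specific piece is one way to respect local practices while still depositing globally interoperable records.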

In addition, selected science use cases will demonstrate the framework’s value across BER research, potentially including critical mineral recovery, host–pathogen interactions, and biofuel production. By unifying structural biology data across photon, neutron, and electron sources, LAMBDA will transform how BER researchers discover, integrate, and analyze multimodal datasets. This will accelerate AI-driven insights into genotype–phenotype relationships, biomolecular design principles, and biological responses to environmental change, enabling predictive modeling and innovation in sustainable mineral recovery, resilient bioenergy crops, and advanced biofuel production.