AI Pilot Project: SLAC National Accelerator Laboratory
- Principal Investigator: Samuel M. Webb (SLAC)
- Co-Investigators: Alex Hagen (Pacific Northwest National Laboratory), Benjamin Cole (Lawrence Berkeley National Laboratory; LBNL), Kjiersten Fagnan (LBNL)
- Scope/Objectives: Build an AI/ML model from multiscale, multimodal imaging and sequencing data to predict heat stress tolerance in plants, and integrate the multimodal data into the BER Data Lakehouse
- Potential Impact and Interface with the American Science Cloud (AmSC) and Transformational AI Models Consortium (ModCon): Build multimodal AI models (Model Teams) and connect with AmSC through the BER Data Lakehouse
Summary
Elucidating how genomic information drives the emergence of complex plant phenotypes is a grand challenge in modern biology, foundational to advancing transformative bioenergy and biotechnology solutions central to DOE’s mission. This proposal brings together SLAC National Accelerator Laboratory (SLAC), the DOE Joint Genome Institute (JGI), and Pacific Northwest National Laboratory (PNNL) in a partnership whose complementary capabilities provide the holistic, multiscale understanding needed to link genotype to phenotype. SLAC offers world-leading expertise and infrastructure in state-of-the-art X-ray and cryo-EM bioimaging, enabling chemical and structural characterization from nanometer to centimeter scales. JGI provides comprehensive, high-throughput sequencing and transcriptomic profiling across cells, tissues, and whole organisms, delivering essential molecular context. PNNL provides data science expertise in the use of AI/ML to classify images across a variety of spatial scales. Together, these facilities generate large, complex datasets ideally suited for AI/ML integration to detect patterns and predict stress responses in plants, spanning scales from molecules to entire organisms.
While AI tools offer an unprecedented opportunity to bridge the gap between genotype and phenotype using multimodal data, several challenges must be addressed: (1) the lack of consistently co-registered multimodal imaging and sequencing data from the same specimens; (2) diverse data formats and metadata schemas, and the need to encode and link imaging, genomic, and transcriptomic information into AI-ready datasets; and (3) the inherently low throughput and high complexity of image processing and registration, which impede interpretation of large, heterogeneous datasets. Developing a framework to address these challenges is essential for integrating DOE imaging data into the DOE-BER Data Lakehouse and enabling AI workflows.
The team will integrate AI/ML with multiscale multimodal imaging and sequencing data to predict heat stress tolerance in Arabidopsis thaliana, a well-established model organism with a fully sequenced genome and numerous characterized transgenic variants. The goal is to develop flexible AI tools that analyze complex imaging data from SLAC together with transcriptomic data from JGI, and to prepare to extend this integration to a wide range of plant and microbial systems and scientific questions. The primary aims are to: (1) develop the framework for integrating imaging modalities and sequencing data to contribute to the Data Lakehouse; (2) train AI models to detect molecular- to tissue-scale signatures of heat stress in A. thaliana; and (3) build predictive models linking genetic modifications to phenotypic expression. The long-term vision is to enable scalable, transferable AI frameworks for predictive modeling of plant processes under diverse environmental conditions, laying the foundation for future DOE efforts in climate-resilient agriculture and biosystems design.