Leveraging Large Language Models to Synthesize and Develop New Questions
Authors:
Paramvir S. Dehal1* ([email protected]), Benjamin Allen3, Jason Baumohl1, Kathleen Beilsmith2, Joseph Bezouska2, David Dakota Blair4, Mikaela Cashman1, John-Marc Chandonia1, Dylan Chivian1, Zachary Crockett3, Ellen G. Dow1, Meghan Drake3, Janaka N. Edirisinghe2, José P. Faria2, Jason Fillman1, Andrew Freiburger2, Tianhao Gu2, Prachi Gupta1, A. J. Ireland1, Marcin P. Joachimiak1, Sean Jungbluth1, Roy Kamimura1, Keith Keller1, Dileep Kishore3, Dan Klos2, Filipe Lui2, David Lyon1, Cody O’Donnell1, Christopher Neely1, Erik Pearson1, Gavin Price1, Priya Ranjan3, William Riehl1, Boris Sadkhin2, Samuel Seaver2, Alan Seleman2, Gwyneth Terry1, Pamela Weisenhorn2, Ziming Yang4, Shinjae Yoo4, Sijie Xiang1, Qizhi Zhang2, Shane Canon1, Elisha Wood-Charlson1, Robert Cottingham3, Christopher S. Henry2, Adam P. Arkin1 (PI)
Institutions:
1Lawrence Berkeley National Laboratory; 2Argonne National Laboratory; 3Oak Ridge National Laboratory; 4Brookhaven National Laboratory
URLs:
Goals
The DOE Systems Biology Knowledgebase (KBase) is a knowledge creation and discovery environment designed for both biologists and bioinformaticians. KBase integrates a large variety of data and analysis tools, from DOE and other public services, into an easy-to-use platform that leverages scalable computing infrastructure to perform sophisticated systems biology analyses. KBase is a publicly available, developer-extensible platform that enables scientists to analyze their own data within the context of public data and share their findings across the system.
Abstract
In the rapidly evolving landscape of biological data analysis, KBase is uniquely positioned because it combines analytic tools, large-scale data compendia, user data, and publishing and sharing capabilities. To leverage these capabilities, the team is starting two new initiatives driven by custom-trained large language models (LLMs). The first is the use of LLM-powered artificial intelligence (AI) agents that assist users in performing analyses and interpreting the results. The second is the use of LLMs to assist users with search and discovery of data within KBase, helping them formulate and evaluate hypotheses. This presentation shows the progress made in both areas. For the first initiative, this includes the creation of an AI agent with a natural language interface that guides the user through the analysis of a microbial genome, from the reads through to a genome paper. This agent is capable of invoking all the necessary tools and assisting the user in interpreting their output. The presentation also discusses the development of an LLM infrastructure to enhance the search and discovery of data within KBase, which includes creating a natural language query interface, personalizing the search experience, and employing intelligent data retrieval and reasoning to answer complex scientific questions. By implementing these AI-enhanced capabilities, KBase aims to offer a more intuitive and effective platform for scientific exploration through a collaborative environment.