Knowledge Extraction from Literature

Authors:

Shinjae Yoo¹* ([email protected]), Carlos X. Soto¹, Gilchan Park¹, Christopher Neely², Vivek K. Mutalik², Paramvir S. Dehal² (PI)

Institutions:

¹Brookhaven National Laboratory; ²Lawrence Berkeley National Laboratory

Goals

This presentation details a proof-of-concept demonstration that applies state-of-the-art natural language processing (NLP) techniques to automatically extract biological entities and genetic tools from the literature for synthetic biology research. This work seeks to address important knowledge gaps in this field while providing a meaningful staging ground to expose new NLP tools to the KBase community and to gather feedback on their efficacy and use.

Abstract

Genetic tool engineering in non-model organisms remains a major challenge in the field of synthetic biology and is typically throttled by the large literature searches that invariably accompany such projects. Indeed, with decades of publications containing a vast corpus of non-standardized data formats and methods, synthesizing an adequate protocol to guide a study can be a daunting task. But as global issues like climate change, degradation of ecosystems, and increasing food scarcity continue to emerge and grow, the need to engineer these organisms and their relevant toolsets becomes ever greater. There is a pressing need for fast and comprehensive searches that help inform and guide laboratory research, as incomplete, cursory searches may increase the time it takes to complete a project or may even prevent its success.

Advances in the field of natural language processing have resulted in the development of powerful large language models (LLMs) that can help address such problems. Fine-tuning these models to identify genetic engineerability terms and to perform biological entity extraction can direct researchers towards useful and informative answers that are driven by existing literature. This team presents a workflow for automating this extraction and subsequently incorporating the data into model fine-tuning. Working with a large corpus of publications extracted from bioRxiv, the team uses text-mining techniques guided by human experts to extract biologically relevant entities: organism names and genetic tools, including plasmids, promoters, reporters, and other entities of interest. Researchers test various publicly available LLMs, including Falcon (Almazrouei et al. 2023), LLaMA-2 (Touvron et al. 2023), MPT-Chat (MosaicML 2023), and others, to identify the best-performing model and augment it using state-of-the-art techniques to mitigate model hallucination.
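As a rough illustration of the entity-extraction step described above, the following minimal sketch tags organism names and genetic tools in a sentence by matching against a small lexicon. It is not the team's method: the abstract describes expert-guided text mining and fine-tuned LLMs, whereas this sketch swaps in plain dictionary matching. The `LEXICON` contents, the `extract_entities` function, and the example sentence are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical vocabulary standing in for an expert-curated lexicon.
# A real system would use far larger term lists or a trained model.
LEXICON = {
    "organism": ["Pseudomonas putida", "Escherichia coli"],
    "plasmid": ["pBBR1MCS-2", "pUC19"],
    "promoter": ["Ptac", "PBAD"],
    "reporter": ["GFP", "mCherry", "lacZ"],
}


def extract_entities(text, lexicon=LEXICON):
    """Return {entity_type: sorted list of lexicon terms found in text}."""
    found = defaultdict(set)
    for entity_type, terms in lexicon.items():
        for term in terms:
            # Match the term only when it is not embedded in a longer
            # word or hyphenated identifier (case-sensitive on purpose).
            pattern = r"(?<![\w-])" + re.escape(term) + r"(?![\w-])"
            if re.search(pattern, text):
                found[entity_type].add(term)
    return {etype: sorted(terms) for etype, terms in found.items()}


sentence = (
    "We expressed GFP from the Ptac promoter on pBBR1MCS-2 "
    "in Pseudomonas putida."
)
entities = extract_entities(sentence)
```

Annotations produced this way could, in principle, serve as seed labels that human experts review before the data are used for fine-tuning, but that pipeline detail is an assumption and is not stated in the abstract.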

As a proof of principle, this group presents a web application that interfaces a chatbot driven by this model with a visualization tool. User queries are mapped onto the National Center for Biotechnology Information (NCBI) taxonomy tree (Schoch et al. 2020) at the genus level, highlighting the organisms of interest and displaying their nearest relatives. Datasets collected from various isolate reference databases, including BacDive (Reimer et al. 2022), as well as specific genetic tool databases like the Phage-Host Database (Albrycht et al. 2022), the Plasmid Database (Schmartz et al. 2022), and others, are searched for relevant matches to display to the user. Matches mined from the literature by the LLM are also provided. Integrating this tool into KBase infrastructure provides another bridge for DOE BER researchers to access this information, linking it not only with biologically relevant organisms for laboratory experiments, but also with KBase’s own ecosystem to allow subsequent analyses and publication of results.
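The genus-level lookup described above can be sketched as follows. This is a toy example under stated assumptions, not the application's implementation: `TOY_TAXONOMY` is a hypothetical in-memory table rather than real NCBI Taxonomy access, and "nearest relatives" is approximated here as other genera in the same family, which is an assumption the abstract does not confirm.

```python
# Hypothetical in-memory taxonomy: genus -> family and example species.
# A real tool would query the NCBI Taxonomy data instead.
TOY_TAXONOMY = {
    "Pseudomonas": {
        "family": "Pseudomonadaceae",
        "species": ["Pseudomonas putida", "Pseudomonas fluorescens"],
    },
    "Azotobacter": {
        "family": "Pseudomonadaceae",
        "species": ["Azotobacter vinelandii"],
    },
    "Escherichia": {
        "family": "Enterobacteriaceae",
        "species": ["Escherichia coli"],
    },
}


def genus_of(organism_name):
    """Take the first word of a binomial name as the genus."""
    return organism_name.split()[0].capitalize()


def highlight(query, taxonomy=TOY_TAXONOMY):
    """Map a query organism to its genus and list same-family genera.

    Returns None when the genus is not in the table.
    """
    genus = genus_of(query)
    if genus not in taxonomy:
        return None
    family = taxonomy[genus]["family"]
    # Assumption: relatives are the other genera sharing the family.
    relatives = sorted(
        other
        for other, record in taxonomy.items()
        if record["family"] == family and other != genus
    )
    return {"genus": genus, "family": family, "related_genera": relatives}


result = highlight("pseudomonas putida")
```

Resolving queries at the genus level keeps the lookup tolerant of strain-level naming differences, at the cost of grouping species that may differ considerably in their available genetic tools.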

Image

Path with arrows from literature scraping to machine learning models to a host datasheet.

Project Overview. The group proposes to use the best natural language processing and machine learning models to process and learn from literature data about growth characteristics, conditions, traits (e.g., antibiotic and stress tolerance, fitness traits), and available in silico models and genetic tools to engineer DOE BER-centric microorganisms. [Courtesy Lawrence Berkeley National Laboratory]

References

Albrycht, K., et al. 2022. “Daily Reports on Phage-Host Interactions,” Frontiers in Microbiology 13. DOI:10.3389/fmicb.2022.946070.

Almazrouei, E., et al. 2023. “The Falcon Series of Open Models,” arXiv. DOI:10.48550/arXiv.2311.16867.

MosaicML NLP Team. 2023. “Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs.” Databricks. www.mosaicml.com/blog/mpt-7b.

Reimer, L. C., et al. 2022. “BacDive in 2022: The Knowledge Base for Standardized Bacterial and Archaeal Data,” Nucleic Acids Research 50. DOI:10.1093/nar/gkab961.

Schmartz, G. P., et al. 2022. “PLSDB: Advancing a Comprehensive Database of Bacterial Plasmids,” Nucleic Acids Research 50. DOI:10.1093/nar/gkab1111.

Schoch, C. L., et al. 2020. “NCBI Taxonomy: A Comprehensive Update on Curation, Resources and Tools,” Database (Oxford). DOI:10.1093/database/baaa062.

Touvron, H., et al. 2023. “LLaMA 2: Open Foundation and Fine-Tuned Chat Models,” arXiv. DOI:10.48550/arXiv.2307.09288.

Funding Information

This material is based upon work supported by the DOE Office of Science, BER program under award numbers DE-AC02-05CH11231 (Lawrence Berkeley National Laboratory) and DE-SC0012704 (Brookhaven National Laboratory).