Leveraging Machine Learning for Enhanced Prediction of Microbial Carbon Utilization Phenotypes

Authors:

Priya Ranjan¹*, Dileep Kishore¹, Christopher Neely², William Riehl², Marcin P. Joachimiak²,
Paramvir S. Dehal², Janaka N. Edirisinghe³, José P. Faria³, Dale A. Pelletier¹, Mitchel J. Doktycz¹, Robert W. Cottingham¹ (PI), Christopher S. Henry³ (PI), Adam P. Arkin² (PI)

Institutions:

¹Oak Ridge National Laboratory; ²Lawrence Berkeley National Laboratory; ³Argonne National Laboratory

Goals

The DOE Systems Biology Knowledgebase (KBase) is a knowledge creation and discovery environment designed for biologists and bioinformaticians. KBase integrates many data and analysis tools from DOE and other public services into an easy-to-use platform that leverages scalable computing infrastructure to perform sophisticated systems biology analyses. KBase is a publicly available and developer-extensible platform that enables scientists to analyze their data within the context of public data and share their findings across the system. Here, the team describes a microbial carbon-utilization phenotype prediction pipeline developed as a collaboration between KBase, the Oak Ridge National Laboratory’s Plant-Microbe Interfaces (PMI) Science Focus Area (SFA), and the Ecosystems and Networks Integrated with Genes and Molecular Assemblies (ENIGMA) SFA.

Abstract

Deciphering the mechanisms of microbial nutrient utilization phenotypes from genomic data is necessary for understanding microbial niches in ecosystems. However, incomplete gene annotations and limited, inconsistent training data hamper the accuracy of current predictive models. This study tackles these obstacles by integrating diverse annotation sources, including metabolic annotations [e.g., Rapid Annotation using Subsystem Technology (RAST), KOfam], protein functional annotations (e.g., UniProt), and de novo protein clustering, to enrich feature representations. The group used a dataset of 626 diverse microbial genomes and their individual growth outcomes across 98 carbon sources to develop 98 phenotype-specific classifiers, one per carbon source.
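As a rough illustration of how annotation sources could be combined into one feature representation, the sketch below merges per-source annotations into a binary presence/absence matrix and builds one label vector per carbon source. It is a minimal sketch under stated assumptions, not the study's pipeline: the genome names, feature IDs, growth calls, and the helper names `build_feature_matrix` and `label_vectors` are all hypothetical.

```python
# Hypothetical toy input: genome -> annotation source -> set of feature IDs.
# All genome names and feature IDs are placeholders, not data from the study.
annotations = {
    "genome_A": {"rast": {"Beta-galactosidase"}, "kofam": {"K01190"}, "cluster": {"c17"}},
    "genome_B": {"rast": set(), "kofam": {"K01190"}, "cluster": {"c17", "c42"}},
    "genome_C": {"rast": {"Xylose isomerase"}, "kofam": set(), "cluster": set()},
}

# Hypothetical growth calls: carbon source -> genome -> 1 (growth) or 0 (no growth).
growth = {
    "lactose": {"genome_A": 1, "genome_B": 1, "genome_C": 0},
    "xylose": {"genome_A": 0, "genome_B": 0, "genome_C": 1},
}


def build_feature_matrix(annotations, sources):
    """Merge the chosen sources into one binary presence/absence matrix."""
    genomes = sorted(annotations)
    # Key each feature by (source, ID) so IDs from different sources never collide.
    columns = sorted(
        {(src, feat) for g in genomes for src in sources
         for feat in annotations[g].get(src, set())}
    )
    rows = [
        [int(feat in annotations[g].get(src, set())) for src, feat in columns]
        for g in genomes
    ]
    return genomes, columns, rows


def label_vectors(growth, genomes):
    """Build one label vector per carbon source, aligned with the matrix rows."""
    return {carbon: [calls[g] for g in genomes] for carbon, calls in growth.items()}


genomes, columns, X = build_feature_matrix(annotations, ["rast", "kofam", "cluster"])
y_by_carbon = label_vectors(growth, genomes)
```

Each entry of `y_by_carbon` would then serve as the training target for one phenotype-specific classifier, with `X` (or a single-source subset of it) as the shared input.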

This study employed a range of feature preprocessing and selection strategies alongside a standardized evaluation framework to facilitate the comparison of classifier performance and enable the effective integration of models that use different feature representations. Group members evaluated the accuracy and robustness of these classifiers across feature representations and observed that metabolic features, specifically RAST annotations, yield the highest average accuracy across phenotypes. However, classifiers for some phenotypes perform better with protein function annotations or de novo protein clusters, suggesting that these genomes may possess incomplete metabolic annotations in pathways relevant to the given phenotype. These findings highlight the potential limitations of current genome annotation methods and the need for continued research to enhance understanding of metabolic pathways and their associated phenotypes.
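One way to picture a standardized evaluation framework is to fix the cross-validation folds once and score every feature representation on those same folds. The sketch below does this with the standard library only. It is an assumption-laden illustration, not the framework used in the study: the stratified fold scheme, the balanced-accuracy metric, the 1-nearest-neighbor stand-in classifier, and the toy data are all choices made here for the example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign every sample index to one of k folds, balancing the classes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall, which is less sensitive to class imbalance."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def one_nn(X_train, y_train, X_test):
    """Stand-in classifier: 1-nearest neighbor under Hamming distance."""
    preds = []
    for x in X_test:
        dists = [sum(a != b for a, b in zip(x, xt)) for xt in X_train]
        preds.append(y_train[dists.index(min(dists))])
    return preds


def evaluate(fit_predict, X, y, folds):
    """Score one feature representation on the shared, pre-computed folds."""
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        preds = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        scores.append(balanced_accuracy([y[i] for i in test_idx], preds))
    return sum(scores) / len(scores)


# Toy data (hypothetical): the first feature tracks the phenotype, the second does not.
y = [0, 0, 1, 1] * 3
informative = [[label, i % 2] for i, label in enumerate(y)]
noise_only = [[i % 2] for i in range(len(y))]

folds = stratified_folds(y, k=3, seed=0)  # computed once, reused for every representation
score_informative = evaluate(one_nn, informative, y, folds)
score_noise = evaluate(one_nn, noise_only, y, folds)
```

Because both representations are scored on identical folds with the same metric, a difference between `score_informative` and `score_noise` reflects the features rather than the data split.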

Moreover, the group observed that while feature selection enhances classifier accuracy, methods that reduce feature dimensionality, such as non-negative matrix factorization, lower it. This loss in accuracy suggests that phenotype expression depends on small sets of specific enzymes or proteins, whose individual signal can be lost when features are combined into lower-dimensional components. Overall, merging feature sets notably boosts prediction accuracy for challenging phenotypes, underscoring this method’s effectiveness in addressing annotation inaccuracies. Furthermore, as a part of the PMI and the ENIGMA SFAs, group members are currently curating gold-standard phenotypic datasets under the same experimental protocols. These standardized datasets will form the foundation for developing more robust phenotype prediction classifiers that cover a broader array of carbon sources and a phylogenetically diverse set of microbes. Ultimately, the group aims to integrate these classifiers into the KBase platform, making them available as applications and as part of the relation engine-driven pipelines. With this integration, the group strives to enable users to predict microbial phenotypes from a wide array of genomes, including all imported Reference Sequence (RefSeq) genomes.
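To illustrate why feature selection can preserve the signal of a few specific enzymes, the sketch below ranks binary features by a simple univariate association score and keeps the top k by name. This is only one possible selection scheme, chosen here for brevity. The abstract does not state which selection method was used, and the sketch does not implement non-negative matrix factorization. The feature names, data, and the helpers `association_score` and `select_top_k` are hypothetical.

```python
def association_score(column, labels):
    """|P(growth | feature present) - P(growth | feature absent)| for a binary feature."""
    present = [y for x, y in zip(column, labels) if x == 1]
    absent = [y for x, y in zip(column, labels) if x == 0]
    if not present or not absent:
        return 0.0  # a constant feature cannot separate the classes
    return abs(sum(present) / len(present) - sum(absent) / len(absent))


def select_top_k(X, labels, names, k):
    """Return the names of the k features most associated with the phenotype."""
    columns = list(zip(*X))  # transpose rows (genomes) into feature columns
    scored = sorted(
        ((association_score(col, labels), name) for col, name in zip(columns, names)),
        key=lambda pair: (-pair[0], pair[1]),  # highest score first, ties by name
    )
    return [name for _, name in scored[:k]]


# Hypothetical merged feature set: one column per "source:feature" pair.
names = ["rast:Beta-galactosidase", "kofam:K00001", "cluster:c42"]
X = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 1, 1],
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]
labels = [1, 1, 0, 0, 1, 0]  # hypothetical growth calls on one carbon source

selected = select_top_k(X, labels, names, k=2)
```

Selection keeps the original, interpretable columns, so a single decisive enzyme remains an explicit feature. A factorization instead replaces the columns with weighted mixtures, which is one plausible reason a sparse signal could be diluted.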

This study aims to improve microbial phenotype prediction by utilizing multifaceted feature representations, advanced machine-learning techniques, and standardized datasets to create accurate classifiers for specific phenotypes. These classifiers have the potential to advance scientists’ understanding of microbial growth phenotypes and serve as essential resources for improving annotations of metabolic pathways and understanding of microbial ecology.

Funding Information

This work is supported as part of the Genomic Science program (GSP) of the DOE Office of Science, Biological and Environmental Research (BER) program. The DOE Systems Biology Knowledgebase (KBase) is funded by the DOE, Office of Science, BER program under DE-AC02-05CH11231, DE-AC02-06CH11357, DE-AC05-00OR22725, and DE-AC02-98CH10886.

This work is also supported by DOE, Office of Science, BER program, GSP as part of the Plant–Microbe Interfaces Science Focus Area at Oak Ridge National Laboratory (pmiweb.ornl.gov). Oak Ridge National Laboratory is managed by UT-Battelle, LLC, for DOE under contract DE-AC05-00OR22725.