Genomic Science Program
U.S. Department of Energy | Office of Science | Biological and Environmental Research Program

2024 Abstracts

Predicting Protein Function Using Structure and Sequence Similarity in KBase

Authors:

Christopher S. Henry1* ([email protected], PI), Nidhi Gupta1, Claudia Lerma-Ortiz1, Janaka N. Edirisinghe1, Nils Oberg2, John Gerlt2, KBase Team1,3,4,5, Robert Cottingham4, Adam P. Arkin3

Institutions:

1Argonne National Laboratory; 2University of Illinois Urbana-Champaign; 3Lawrence Berkeley National Laboratory; 4Oak Ridge National Laboratory; 5Brookhaven National Laboratory

Goals

Protein families of unknown function are a significant challenge facing the DOE BER research community. While many tools in KBase and elsewhere today permit the discovery of entirely new protein families, very few tools exist to study the function of these families. The Enzyme Function Initiative (EFI; enzymefunction.org) offers tools to address this critical problem. This project aims to integrate the EFI toolset into KBase fully, with complete ties to DOE BER sequencing sources, including all sequence data in KBase and the DOE Joint Genome Institute Integrated Microbial Genomes database. Further, the team will ensure the interoperability of these tools with other functional genomics tools in KBase, particularly tools to integrate structural data from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB).

Abstract

One of the most significant challenges currently inhibiting understanding of complex biological systems from genomic and multiomic data is the staggering number of proteins with unknown functions. Tools are needed to integrate multiple sources of evidence to decode the functions of uncharacterized protein families and understand the limits of annotation propagation. The EFI toolkit supports protein function discovery through Sequence Similarity Networks (SSNs) (Zallot et al. 2019; Oberg et al. 2023). Here, researchers will demonstrate how the EFI toolkit (now partially in KBase) is combined with other tools, particularly tools for integrating structural insights from RCSB-PDB, to study the propagation of function through the members of a close protein family.
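To make the SSN idea concrete, the following is a minimal sketch of the general technique, not the EFI implementation: proteins are nodes, an edge joins any pair whose pairwise alignment score meets a chosen threshold, and clusters are read off as connected components. The protein identifiers, scores, and function names (build_ssn, ssn_clusters) are hypothetical and chosen only for illustration; the scoring scheme and thresholds used by the EFI tools are not reproduced here.

```python
def build_ssn(pairwise_scores, threshold):
    """Build an undirected network from pairwise alignment scores.

    pairwise_scores maps (protein_a, protein_b) -> score (hypothetical values).
    An edge is kept only when the score is at least `threshold`.
    """
    graph = {}
    for (a, b), score in pairwise_scores.items():
        # Every protein becomes a node, even if it ends up with no edges.
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def ssn_clusters(graph):
    """Return connected components as sorted lists, ordered by first member."""
    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(sorted(component))
    return components


# Hypothetical scores for five hypothetical proteins.
scores = {
    ("P1", "P2"): 120,
    ("P2", "P3"): 95,
    ("P1", "P3"): 40,
    ("P4", "P5"): 110,
    ("P3", "P4"): 10,
}

loose = ssn_clusters(build_ssn(scores, threshold=90))
strict = ssn_clusters(build_ssn(scores, threshold=115))
print(loose)   # clusters at the permissive threshold
print(strict)  # clusters at the stringent threshold
```

Raising the threshold splits the network into smaller clusters, which is the kind of behavior one would examine when asking how far an annotation can safely be propagated within a family.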

The team’s first demonstration of the protein function discovery pipeline in KBase focuses on the aconitase superfamily. In the quest to enable automated rapid reconstruction of high-quality fungal metabolic models, researchers detected essential functions that are often misannotated in fungal genomes. The team focused on three examples: aconitases AcnN and AcnD and 2-methylisocitrate lyase. First, researchers constructed a protein family around correctly annotated instances of each function; then, for each family, researchers constructed phylogenetic protein trees and SSNs (generated by the EFI tools). Using these constructs, the team correctly annotated the problematic protein families across diverse fungal genomes, improving annotation consistency and the corresponding metabolic models. Furthermore, group members investigated the structure of the aconitase proteins using KBase-RCSB tools, observing that yeast aconitase D’s structure is similar to that of bacterial proteins. In contrast, the bacterial aconitase A has only 6% identity with its fungal mitochondrial equivalent (AcnN).
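For readers unfamiliar with how a figure such as percent identity is obtained, the sketch below computes it from two already-aligned sequences. It assumes one common definition (identical residues divided by aligned columns in which neither sequence has a gap); other definitions use a different denominator, and the abstract does not state which one underlies the reported value. The function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither aligned sequence has a gap.

    Assumes both inputs come from the same pairwise alignment, so they
    must have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for residue_a, residue_b in zip(aligned_a, aligned_b):
        if residue_a == gap or residue_b == gap:
            continue  # skip columns containing a gap in either sequence
        compared += 1
        if residue_a == residue_b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy alignment: five gap-free columns, four of them identical.
identity = percent_identity("MKT-AYL", "MRTQA-L")
print(identity)
```

Very low sequence identity between proteins that share a function is one reason structural comparison is a useful complement to sequence-based networks.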

The second demonstration explores enzymes involved in microbial degradation of pyridine, specifically the recently discovered Group C monooxygenases pdbA and pyrA and an alternative route involving vanillate O-demethylase oxygenase VanA. Protein families were constructed in KBase around these enzymes; protein members were then organized into a tree and compiled into an SSN. Researchers used these approaches to study the evolutionary patterns of variation within these protein families and to explore the potential phylogenetic breadth of the pyridine degradation activity discovered or proposed for these genes. The team also applied AlphaFold to produce structures for representative genes; these structures were compared with related structures in PDB and used for docking simulations in KBase. Together, these tools reveal insights into accurately propagating these relatively new annotations to new genomes, improving the representation of the new pyridine degradation pathway in metabolic models produced by KBase.

This group is working to validate the new monooxygenase annotations in the pyridine degradation pathway using the self-driven laboratory system at Argonne National Laboratory and the strain Acinetobacter sp. ADP1. Using these capabilities, researchers can track the capacity of ADP1 to grow on pyridine with complementation and knockout of the candidate proteins. The team is also exploring the use of this platform as an automated testbed for validating new protein function discoveries emerging from KBase.

References

Oberg, N., et al. 2023. “EFI-EST, EFI-GNT, and EFI-CGFP: Enzyme Function Initiative (EFI) Web Resource for Genomic Enzymology Tools,” Journal of Molecular Biology 435(14). DOI:10.1016/j.jmb.2023.168018.

Zallot, R., et al. 2019. “The EFI Web Resource for Genomic Enzymology Tools: Leveraging Protein, Genome, and Metagenome Databases to Discover Novel Enzymes and Metabolic Pathways,” Biochemistry 58(41), 4169–82. DOI:10.1021/acs.biochem.9b00735.

Funding Information

This work is supported as part of the BER Genomic Science program. The DOE Systems Biology Knowledgebase (KBase) is funded by the DOE Office of Science BER program under DE-AC02-05CH11231, DE-AC02-06CH11357, DE-AC05-00OR22725, and DE-AC02-98CH10886. Plans for Integration of Enzyme Function Initiative (EFI) Tools into the KBase Platform is funded by the DOE Office of Science BER program under PRJ1010515. Self-driven laboratory work was funded by Laboratory Directed Research and Development at Argonne National Laboratory.