Machine Learning-Based Protein Annotation Tool Predicts Protein Function

Snekmer allows scientists to use rapid prototyping to better understand the function of proteins in microbes.

Cartoon of Snekmer highlighting input proteins, peptide kmers, family assignment, and biological interpretation.

Snekmer is an application for building and searching protein family models and novel sequence clusters.
[Image courtesy of Jason McDermott, Pacific Northwest National Laboratory]

The Science

Microbes drive key processes of life on Earth. They affect global elemental cycles—the movement of carbon, nitrogen, and other elements. They also promote plant growth and affect the development of diseases. These roles are essential in every ecosystem. Research constantly expands the database of microbial DNA sequences but does not provide all the biological information about proteins. To engineer microbes for sustainable bioenergy and other bioproducts, scientists need a fuller understanding of the function of proteins and other molecules. Scientists infer the function of a protein by comparing it with reference databases of already characterized proteins. However, these comparisons are difficult and not scalable to massive databases. To address this challenge, scientists have applied machine learning to models that predict protein function. The result is the program Snekmer, which allows scientists to quickly model families of proteins.

The Impact

Studying biological protein molecules in microbes will help scientists pursue new applications for engineered microbes. Snekmer is easy to deploy in high-performance computing environments. In addition, it is incorporated into the DOE KBase framework as a new application that will allow users to annotate their genome and metagenome sequences. This will help scientists to better model the effects of engineering microbes. That includes these microbes’ effect on the climate and their benefits for crop health and bioproduction. Snekmer will also help scientists study the evolution of microbes and patterns in microbiomes.


The inability of current methods to predict function for 30-50% of bacterial protein sequences is a significant barrier to better understanding of complex systems such as soil microbiomes. Most protocols rely on pair-wise alignments, which are becoming computationally intractable and more challenging to interpret as databases expand. For alignment-based models of protein families, the sensitivity and accuracy depend on initial training sets, which risk obsolescence as additional sequence diversity is discovered. Many bacterial proteins have either no functional assignment or are only assigned a general function based solely on taxonomic understanding.

To address this need, researchers at Pacific Northwest National Laboratory, Baylor University, and Oregon Health & Science University developed Snekmer, a software tool leveraging redundancy of amino acid residue properties to reduce sequence space and using short protein sequence (kmer) features for machine learning to generate protein family models. Snekmer users can recode protein sequences into reduced alphabet kmer vectors and perform construction of supervised classification models trained on input protein families, or protein functional classification based on Snekmer models.

Principal Investigator

Jason McDermott
Pacific Northwest National Laboratory
[email protected]

Related Links

BER Program Manager

Ramana Madupu

U.S. Department of Energy, Biological and Environmental Research (SC-33)
Biological Systems Science Division
[email protected]


This research was supported by the Department of Energy Office of Science, Biological and Environmental Research program, and is a contribution to the “Persistence Control of Engineered Functions in Complex Soil Microbiomes” Scientific Focus Area. Additional support was provided by the National Science Foundation and the Defense Threat Reduction Agency


Chang, C.H., et al., Snekmer: a scalable pipeline for protein sequence fingerprinting based on amino acid recodingBioinformatics Advances 3, 1 (2023). [DOI: 10.1093/bioadv/vbad005]