Engineering Overlapping Genes in Bacteria
Authors:
Sean P. Leonard1, Jennifer L. Chlebek1, Jose Manuel Martí1, Hunter Nisonoff2, Chloe Hsu2, Charlotte Rochereau3, Christina Kang-Yun1, Harris Wang3, Jennifer Listgarten2, Jonathan Allen1, Dante P. Ricci1, Dan Park1, and Yongqin Jiao1* ([email protected])
Institutions:
1Lawrence Livermore National Laboratory; 2University of California–Berkeley; and 3Columbia University
Goals
A primary goal of the Science Focus Area (SFA) is to establish genetic sequence entanglement, in which two genes are encoded within the same DNA sequence through use of alternative reading frames, as a generalizable biocontainment strategy to protect engineered functions against mutational inactivation and to mitigate the horizontal transfer of invasive genes. Achieving sequence entanglement remains a significant challenge because the shared sequence constrains both genes and necessitates large-scale redesign of the entangled proteins. Through design-build-test-learn (DBTL) iterations using the Constraining Adaptive Mutations using Engineered Overlapping Sequences, eXtended (CAMEOX) algorithm (Blazejewski, Ho, and Wang 2019), high-throughput functional assays, and state-of-the-art machine learning and protein structure prediction algorithms, researchers aim to improve the accuracy of entanglement designs and expand the application to a broad range of microbes.
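As a minimal illustration of what encoding two genes in alternative reading frames means, the sketch below translates one DNA string in frame 0 and in the +1 frame using the standard genetic code. The sequence `toy_dna` and the function name `translate` are hypothetical and chosen only for illustration; they are not taken from CAMEOX or from any design in this work.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str, frame: int = 0) -> str:
    """Translate `dna` starting at offset `frame` (0, 1, or 2).

    Trailing bases that do not fill a codon are ignored; '*' marks a stop.
    """
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


# Hypothetical toy sequence (not a real design): the same nucleotides
# yield two different peptides depending on the reading frame.
toy_dna = "ATGGCTAAAGAAGACAATATTGAAATG"
outer_peptide = translate(toy_dna, frame=0)  # frame 0
inner_peptide = translate(toy_dna, frame=1)  # +1 frame
print(outer_peptide, inner_peptide)
```

Because every nucleotide contributes to a codon in each frame, a single substitution generally changes both peptides, which is the source of the design constraint described above.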
Abstract
To generate an initial data set for model training and development, researchers initiated a DBTL campaign on an entanglement pair composed of infA and aroB. infA encodes translation initiation factor 1 (72 AA), which is essential for growth, and aroB encodes 3-dehydroquinate synthase (362 AA), which is required for aromatic amino acid biosynthesis. CAMEOX was used to generate 130,000 entanglement designs, of which 2,000 were selected for experimental testing. The functionality of infA and aroB was assayed separately through selection. The results revealed that ~10-30% of aroB variants and ~25% of infA variants were highly enriched in the surviving population, indicative of protein function. Researchers found that the gene fitness scoring metric generated by CAMEOX, the pseudolikelihood score, correlates well with the experimental enrichment scores, supporting the pseudolikelihood score as a reliable indicator of protein fitness. Combining the results from both assays for infA and aroB, researchers identified 14 variants with high enrichment values for both genes; these are being tested for functionality within the entangled context.
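The abstract does not state how enrichment scores were computed or which correlation statistic was used. The sketch below shows one common convention, assumed here for illustration only: a log2 ratio of post-selection to pre-selection read frequencies with a pseudocount, compared against a model score by Spearman rank correlation. All function names and numbers are hypothetical.

```python
import math


def log2_enrichment(pre_count, post_count, pre_total, post_total, pseudocount=0.5):
    """Log2 ratio of post- to pre-selection frequency (assumed convention)."""
    pre_freq = (pre_count + pseudocount) / (pre_total + pseudocount)
    post_freq = (post_count + pseudocount) / (post_total + pseudocount)
    return math.log2(post_freq / pre_freq)


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)


# Hypothetical read counts (pre, post) and model scores for four variants.
counts = [(100, 20), (100, 90), (100, 300), (100, 800)]
model_scores = [-120.0, -90.0, -60.0, -30.0]
enrichment = [log2_enrichment(pre, post, 10_000, 10_000) for pre, post in counts]
print(spearman(model_scores, enrichment))
```

A rank-based statistic is used in this sketch because enrichment values and likelihood-style scores live on different scales, so only their ordering is compared.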
Using the experimentally measured fitness data for infA and aroB, the team trained random forest (RF) classifiers based on amino acid composition and used the models to predict the fitness of CAMEOX-designed entanglement solutions. Researchers found that classifiers built on simple features, such as the frequency of certain amino acids, were able to predict variant fitness with high accuracy, even when only a small number of measured variants was used in the training set. Using these RF models, researchers screened the complete set of 130,000 infA/aroB CAMEOX designs and identified 29 with potential functionality for both genes. Experimental testing of these variants is underway.
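The feature step behind a composition-based classifier can be sketched as follows. This covers only the conversion of a protein sequence into a 20-dimensional amino acid frequency vector; the exact features, labels, and model settings used by the team are not given in the abstract, so the function name and example sequence here are assumptions.

```python
from collections import Counter

# The 20 canonical amino acids, in a fixed order so that every
# sequence maps to a feature vector with the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(protein: str) -> list[float]:
    """Return the fraction of each canonical amino acid in `protein`.

    Characters outside the 20 canonical residues (for example '*' or 'X')
    are ignored when computing the fractions.
    """
    counts = Counter(protein.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no canonical amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical variant sequence, used only to show the output shape.
features = composition("MAKEDNIEMQGT")
print(dict(zip(AMINO_ACIDS, features)))
```

Vectors like these, paired with functional or nonfunctional labels derived from the enrichment data, could then be passed to any random forest implementation, for example scikit-learn's `RandomForestClassifier`.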
In addition to RF models, researchers have leveraged AlphaFold to rank CAMEOX variants according to how foldable their structures are. Relying on AFRank (Roney and Ovchinnikov 2022), the team predicted the structures of CAMEOX variants and used the predicted confidence metrics as a proxy for variant fitness. This approach complements sequence-based screening methods for selecting the best variants in silico for experimental testing. Beyond improving variant ranking, researchers have also modified the algorithm to expand the diversity of proposed CAMEOX solutions by developing gradient-based Markov chain Monte Carlo (MCMC) methods for designing the entangled nucleotide sequences. This new optimization protocol generally improves on the previous greedy optimization algorithms and enables the generation of more diverse sequences with better fitness scores.
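To make the contrast between sampling and greedy optimization concrete, the sketch below runs a plain Metropolis sampler over single-nucleotide substitutions. It is not the gradient-based MCMC method described above, and the toy score (residue matches to two target peptides in frame 0 and the +1 frame) stands in for the real fitness models; the targets, parameters, and function names are all hypothetical.

```python
import math
import random
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame):
    return "".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(frame, len(dna) - 2, 3)
    )


def score(dna, target_a, target_b):
    """Toy fitness: matches to target_a in frame 0 plus target_b in frame +1."""
    matches_a = sum(x == y for x, y in zip(translate(dna, 0), target_a))
    matches_b = sum(x == y for x, y in zip(translate(dna, 1), target_b))
    return matches_a + matches_b


def metropolis(start, target_a, target_b, steps=2000, temperature=0.5, seed=0):
    """Metropolis sampling with a symmetric single-base substitution proposal."""
    rng = random.Random(seed)
    current, current_score = start, score(start, target_a, target_b)
    best, best_score = current, current_score
    for _ in range(steps):
        pos = rng.randrange(len(current))
        new_base = rng.choice([b for b in BASES if b != current[pos]])
        proposal = current[:pos] + new_base + current[pos + 1:]
        delta = score(proposal, target_a, target_b) - current_score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = proposal, current_score + delta
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score


# Hypothetical targets: 16 nt gives five codons in each of the two frames.
start = "A" * 16
target_a, target_b = "MKLVE", "WLKKT"
best, best_score = metropolis(start, target_a, target_b)
print(best, best_score, translate(best, 0), translate(best, 1))
```

Unlike a greedy search, the sampler occasionally accepts a worse sequence, which lets it escape local optima and visit a more diverse set of candidates; a gradient-based variant would additionally use model gradients to bias which substitutions are proposed.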
In addition to engineering entanglement for specific gene pairs, the team seeks to comprehensively assess the entanglement feasibility of a wide array of gene pairs to identify features of DNA and protein sequences that make genes more amenable to co-encoding. Leveraging the improved speed and automation of CAMEOX, the team undertook a campaign to computationally generate entanglement designs for nearly all conditionally essential genes in Escherichia coli (94) by entangling them with one another. An additional 24 genes of interest (positive controls with naturally entangled phiX174 genes, reporter genes that allow for quantitative phenotypic characterization, antibiotic resistance cassettes) were also included. This in silico campaign yielded >8 million pairwise entanglement solutions in both +1 and +2 reading frames. Using custom evolutionary models of the parental proteins, the team developed a scoring rubric for CAMEOX designs that makes it possible to quantify and compare the entangle-ability of individual genes as well as the compatibility of gene pairs. By generating and testing specific designs, the team found that CAMEOX can successfully design functional protein variants with low sequence conservation (<50% identity) to their natural orthologs. Furthermore, researchers have identified specific features of proteins that correlate with better entanglement outcomes, such as enrichment for amino acids with a higher degree of codon degeneracy, a property also observed in naturally occurring entangled genes.
References
Blazejewski, T., H.-I. Ho, and H. H. Wang. 2019. “Synthetic Sequence Entanglement Augments Stability and Containment of Genetic Information in Cells.” Science 365, 595–598.
Roney, J. P., and S. Ovchinnikov. 2022. “State-of-the-Art Estimation of Protein Model Accuracy using AlphaFold.” Physical Review Letters 129(23), 238101.
Funding Information
This work is supported by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research, Lawrence Livermore National Laboratory Secure Biosystems Design SFA “From Sequence to Cell to Population: Secure and Robust Biosystems Design for Environmental Microorganisms.” Work at LLNL is performed under the auspices of the U.S. Department of Energy at Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.