Towards in silico enzyme design

What is enzyme design?

Enzyme catalysis is a key process in a wide range of industries, from pharmaceuticals to food science [1-6]. In an evolutionary context, nature has optimized enzymes in living organisms to adapt to specific niches over millions, if not billions, of years. Despite this constant evolution, natural enzymes might not meet our modern-day needs. For example, an enzyme’s efficiency may be too low for a given industrial application. Or worse, there may be no natural enzyme that performs the chemical reaction we want.
To circumvent the problem, we “engineer” enzymes using various approaches, which fall into three main categories: rational enzyme design, directed evolution and de novo protein design. Rational enzyme design requires expert knowledge to identify which specific parts of the enzyme to optimize. Directed evolution, on the other hand, is “unsupervised”. Random mutations are introduced into the enzyme, and the variants that perform best on the desired traits are selected. This mutation-selection process is repeated in the lab until a satisfactory enzyme is produced. The third approach, de novo design, skips the lab work for mutation and selection. In silico protein design suites, such as Rosetta [7-8], generate a pool of hypothetical protein structures from sequences. The best mutant(s) are then selected from the pool and verified experimentally.
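To make the mutation-selection loop of directed evolution concrete, here is a minimal toy sketch in Python. Everything in it is an assumption for illustration: the sequences are made up, and `toy_fitness` (the fraction of sites matching an arbitrary target) merely stands in for a real lab assay. It is not how any particular lab or software package implements directed evolution.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def mutate(seq, rng):
    """Return a copy of seq with one randomly chosen site changed."""
    pos = rng.randrange(len(seq))
    new_residue = rng.choice([aa for aa in AMINO_ACIDS if aa != seq[pos]])
    return seq[:pos] + new_residue + seq[pos + 1:]


def toy_fitness(seq, target):
    """Hypothetical stand-in for a lab assay: fraction of sites matching target."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def directed_evolution(parent, target, rounds=50, library_size=100, seed=0):
    rng = random.Random(seed)
    best = parent
    for _ in range(rounds):
        # Mutation: build a library of single-site variants of the current best.
        library = [mutate(best, rng) for _ in range(library_size)]
        # Selection: keep the top scorer; the current best stays in the running.
        best = max(library + [best], key=lambda s: toy_fitness(s, target))
    return best


parent = "MKTAYIAKQR"  # made-up starting sequence
target = "MKVLYIAEQW"  # made-up sequence, used only to define the toy score
evolved = directed_evolution(parent, target)
print(toy_fitness(parent, target), "->", toy_fitness(evolved, target))
```

In the lab, each call to the scoring function corresponds to expressing and assaying a variant, which is where most of the time and cost goes.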

So what is the problem?

Despite all these breakthroughs, enzyme design is still a time-consuming, labor-intensive and expensive process. Even with directed evolution, iterative expression, screening and cell culturing in the wet lab remain manual and tedious [9]. With de novo design, the output is usually an enzyme structure, while in reality we care most about enzyme properties such as thermal stability and catalytic efficiency. This creates a mismatch between needs and capabilities.
What’s more, the peptide (protein) sequence space is far too large to explore arbitrarily. Every site on a protein can hold any of 20 amino acid types. If we assume a protein has 200 sites, there would be

20^200 ≈ 1.6 × 10^260 (a 16 followed by 259 zeros)

combinations to try either experimentally or computationally.
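The size of this space is easy to check with Python's exact integer arithmetic. The screening rate below is a purely hypothetical number, chosen only to give a sense of scale.

```python
n_amino_acids = 20
n_sites = 200

# Each site can independently hold any of the 20 amino acids.
n_sequences = n_amino_acids ** n_sites
digits = len(str(n_sequences))
print(digits)  # → 261, i.e. roughly 1.6 x 10^260 sequences

# Hypothetical assumption: one billion variants evaluated per second.
rate_per_second = 10 ** 9
seconds_per_year = 365 * 24 * 3600
years_needed = n_sequences // (rate_per_second * seconds_per_year)
print(len(str(years_needed)), "digits in the number of years needed")
```

Even at that rate, exhaustive enumeration is out of the question, so any practical method has to search the space selectively.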

How are we doing now?

These difficulties have encouraged the development of high-throughput experimentation and machine learning (ML) as the new wave of enzyme design. High-throughput experiments have to be customized for the specific chemical reaction of interest, and the methods are still under development [10-11].

Existing enzyme property databases such as ProTherm [12] have thousands of data points. However, experimental conditions vary across the dataset, which introduces noise. Also, an experimental reading often reflects a mixture of several enzyme properties at once, making it even harder to analyze. For that, we need consistent and cleaner databases. Construction of such databases specifically for enzymes has seen recent progress [13-14].
Meanwhile, although there are many ML applications in protein design, references specifically for enzyme design are few [15-19]. Ideally, ML in enzyme design should predict which mutations will improve specific enzyme trait(s), based on existing databases. ML methods ranging from linear models to neural networks have been employed [9,20,21]. Yet the choice of model and its performance change drastically from application to application.
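Whatever the model, a sequence has to be turned into numbers before it can be fed to it. One widely used option is one-hot encoding, sketched below in Python. This is an illustrative assumption, not the encoding used by any specific work cited here, and the two short sequences are made up.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Encode a sequence as a flat 0/1 vector with 20 entries per site."""
    vec = [0] * (len(seq) * len(AMINO_ACIDS))
    for pos, aa in enumerate(seq):
        vec[pos * len(AMINO_ACIDS) + AA_INDEX[aa]] = 1
    return vec


wild_type = "MKTAY"  # made-up sequence
mutant = "MKVAY"     # same sequence with T -> V at the third site

x_wt, x_mut = one_hot(wild_type), one_hot(mutant)

# A single-site mutation flips exactly two entries of the vector.
changed = [i for i, (a, b) in enumerate(zip(x_wt, x_mut)) if a != b]
print(len(x_wt), changed)  # → 100 [56, 57]
```

With this encoding, a linear model simply learns one weight per (site, amino acid) pair, while a neural network can also capture interactions between sites.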
In this post, I have outlined the background and difficulties of enzyme engineering, and how they lead to the need for database construction and ML applications. In my next post, I will go deeper into ML in enzyme design and summarize its latest developments.

[1] Sun, Huihua, et al. “Biocatalysis for the Synthesis of Pharmaceuticals and Pharmaceutical Intermediates.” Bioorganic and Medicinal Chemistry, vol. 26, no. 7, Elsevier Ltd, 2018, pp. 1275–84, doi:10.1016/j.bmc.2017.06.043.
[2] Davies, Gideon, and Bernard Henrissat. “Structures and Mechanisms of Glycosyl Hydrolases.” Structure, vol. 3, 1995, pp. 853–59, doi:10.1016/S0969-2126(01)00220-9.
[3] Lalonde, Jim. “Highly Engineered Biocatalysts for Efficient Small Molecule Pharmaceutical Synthesis.” Current Opinion in Biotechnology, vol. 42, Elsevier Ltd, 2016, pp. 152–58, doi:10.1016/j.copbio.2016.04.023.
[4] Savile, Christopher K., et al. “Biocatalytic Asymmetric Synthesis of Chiral Amines from Ketones Applied to Sitagliptin Manufacture.” Science, vol. 329, no. 5989, 2010, pp. 305–10, doi:10.1126/science.1188934.
[5] Wolf, Clancey, et al. “Engineering of Kuma030: A Gliadin Peptidase That Rapidly Degrades Immunogenic Gliadin Peptides in Gastric Conditions.” Journal of the American Chemical Society, vol. 137, no. 40, 2015, pp. 13106–13, doi:10.1021/jacs.5b08325.
[6] Siegel, Justin B., et al. “Computational Design of an Enzyme Catalyst for a Stereoselective Bimolecular Diels-Alder Reaction.” Science, vol. 329, no. 5989, 2010, pp. 309–14.
[7] Leaver-Fay, Andrew, et al. “The Rosetta All-Atom Energy Function for Macromolecular Modeling and Design.” Journal of Chemical Theory and Computation, vol. 13, no. 6, 2017, pp. 3031–48, doi:10.1021/acs.jctc.7b00125.
[8] Chaudhury, Sidhartha, et al. “PyRosetta: A Script-Based Interface for Implementing Molecular Modeling Algorithms Using Rosetta.” Bioinformatics, vol. 26, no. 5, 2010, pp. 689–91, doi:10.1093/bioinformatics/btq007.
[9] Yang, Kevin K., et al. “Machine Learning in Protein Engineering.” arXiv, 2018.
[10] Romero, Philip A., et al. “Dissecting Enzyme Function with Microfluidic-Based Deep Mutational Scanning.” Proceedings of the National Academy of Sciences, vol. 112, no. 23, 2015, pp. 7159–64, doi:10.1073/pnas.1422285112.
[11] Fowler, Douglas M., and Stanley Fields. “Deep Mutational Scanning: A New Style of Protein Science.” Nature Methods, vol. 11, no. 8, 2014, pp. 801–07, doi:10.1038/nmeth.3027.
[12] Gromiha, M. Michael, et al. “ProTherm, Thermodynamic Database for Proteins and Mutants: Developments in Version 3.0.” Nucleic Acids Research, vol. 30, no. 1, 2002, pp. 301–02, doi:10.1093/nar/30.1.301.
[13] Carlin, Dylan Alexander, et al. “Thermal Stability & Kinetic Constants for 129 Variants of a Family 1 Glycoside Hydrolase Reveal That Enzyme Activity & Stability Can Be Separately Designed.” PLoS ONE, vol. 12, no. 5, 2017, pp. 1–13, doi:10.1371/journal.pone.0176255.
[14] Carlin, Dylan Alexander, et al. “Kinetic Characterization of 100 Glycoside Hydrolase Mutants Enables the Discovery of Structural Features Correlated with Kinetic Constants.” PLoS ONE, vol. 11, no. 1, 2016, pp. 1–14, doi:10.1371/journal.pone.0147596.
[15] Yang, Yang, et al. “PON-tstab: Protein Variant Stability Predictor. Importance of Training Data Quality.” International Journal of Molecular Sciences, vol. 19, no. 4, 2018, doi:10.3390/ijms19041009.
[16] Tian, Jian, et al. “Predicting Changes in Protein Thermostability Brought about by Single- or Multi-Site Mutations.” BMC Bioinformatics, vol. 11, no. 370, 2010.
[17] Yang, Kevin K., et al. “Machine Learning in Protein Engineering.” arXiv, 2018.
[18] Wu, Zachary, et al. “Machine-Learning-Assisted Directed Protein Evolution with Combinatorial Libraries.” Proceedings of the National Academy of Sciences, vol. 116, no. 18, 2019, pp. 8852–58, doi:10.1073/pnas.1901979116.
[19] Bonk, Brian M., et al. “Machine Learning Identifies Chemical Characteristics That Promote Enzyme Catalysis.” Journal of the American Chemical Society, vol. 141, American Chemical Society, 2019, pp. 4108–18, doi:10.1021/jacs.8b13879.
[20] Gainza, P., et al. “Deciphering Interaction Fingerprints from Protein Molecular Surfaces.” bioRxiv, 2019, pp. 1–44.
[21] Alley, Ethan C., et al. “Unified Rational Protein Engineering with Sequence-Only Deep Representation Learning.” bioRxiv, 2019, doi:10.1101/589333.