
Machine Learning Meets Directed Evolution: The New Era of Enzyme Engineering

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The Question

Frances Arnold's Nobel Prize-winning directed evolution mimics natural selection in the laboratory: introduce random mutations, screen for improved function, repeat. But the protein fitness landscape is astronomically large: a 300-amino-acid protein has 20³⁰⁰ possible sequences. Experimental screening, even with high-throughput methods, can explore only a tiny fraction. Machine learning (ML) promises to navigate this landscape computationally, predicting which mutations are likely to improve function before any experiment is performed. Can ML-guided evolution achieve results impossible through random mutagenesis alone?
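To put the size of that landscape in perspective, here is a small back-of-the-envelope calculation. The screening capacity of 10 million variants is an assumed round number chosen for illustration, not a figure from any of the cited papers.

```python
import math

ALPHABET_SIZE = 20     # canonical amino acids
PROTEIN_LENGTH = 300   # residues, as in the example above

# Work in log10 so the number stays printable.
log10_space = PROTEIN_LENGTH * math.log10(ALPHABET_SIZE)

# Assumed screening capacity for one campaign (illustrative only).
SCREENED_VARIANTS = 10**7
log10_fraction = math.log10(SCREENED_VARIANTS) - log10_space

# Even the immediate neighbourhood of a single parent sequence grows quickly.
n_single = PROTEIN_LENGTH * (ALPHABET_SIZE - 1)
n_double = math.comb(PROTEIN_LENGTH, 2) * (ALPHABET_SIZE - 1) ** 2

print(f"Full sequence space: about 10^{log10_space:.0f}")
print(f"Fraction covered by the screen: about 10^{log10_fraction:.0f}")
print(f"Single mutants: {n_single:,}")
print(f"Double mutants: {n_double:,}")
```

Under these assumptions the full space is on the order of 10^390 sequences, and the double mutants of one parent alone (about 16 million) already exceed the assumed screening budget.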

Landscape

Yang, Li & Arnold (2024), writing in ACS Central Science, reviewed the opportunities and challenges of ML-assisted enzyme engineering. Their perspective, coming from Arnold's own laboratory (the birthplace of directed evolution), carries particular authority. They identified two broad areas where ML adds value: (1) starting point discovery, through functional annotation or generation of novel protein sequences; and (2) navigating protein fitness landscapes, by learning mappings between sequences and fitness values to guide library design and exploration of distant sequence space.
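As a rough illustration of what "learning a mapping between sequences and fitness values" can mean in its simplest form, the sketch below fits an additive one-hot model to a handful of measurements and uses it to rank candidates. The sequences and fitness numbers are made up, and the model is a deliberately minimal stand-in, not a method taken from the review.

```python
from collections import defaultdict

# Hypothetical screening data: variant sequence -> measured fitness (made-up numbers).
measured = {
    "AAA": 1.0,  # wild type
    "GAA": 1.5,
    "AVA": 1.2,
    "AAL": 0.6,
    "GVA": 1.7,
}

def features(seq):
    """One-hot features: a bias term plus one (position, residue) key per site."""
    return ["bias"] + [(i, aa) for i, aa in enumerate(seq)]

def predict(weights, seq):
    """Additive prediction: sum of the weights of the active features."""
    return sum(weights[f] for f in features(seq))

def fit(data, lr=0.1, epochs=2000):
    """Fit the additive model with per-sample gradient steps on squared error."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for seq, y in data.items():
            err = predict(weights, seq) - y
            for f in features(seq):
                weights[f] -= lr * err
    return weights

weights = fit(measured)

# Rank candidates, including combinations that were never measured.
candidates = ["GVL", "GAL", "AVL", "GVA"]
ranked = sorted(candidates, key=lambda s: predict(weights, s), reverse=True)
for seq in ranked:
    print(seq, round(predict(weights, seq), 2))
```

An additive model like this cannot represent interactions between sites, which is one reason practical pipelines tend to use richer models and more data.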

Ding et al. (2024) introduced MODIFY, an ML algorithm that co-optimises fitness and diversity in combinatorial library design. The key insight: maximising fitness alone leads to narrow libraries clustered around known solutions, while maximising diversity alone wastes screening capacity on non-functional variants. MODIFY balances both objectives, producing libraries where a higher fraction of variants are both functional and novel.
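The trade-off can be illustrated with a generic greedy selection that adds a diversity bonus to predicted fitness. This is not the MODIFY algorithm itself; the scoring rule, the `diversity_weight` parameter, and the predicted fitness values are all assumptions made for the sake of a small example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def select_library(scores, size, diversity_weight):
    """Greedily pick variants by predicted fitness plus a diversity bonus.

    The bonus is the Hamming distance to the closest variant already chosen,
    so near-duplicates of earlier picks are penalised.
    """
    remaining = dict(scores)
    chosen = []
    while remaining and len(chosen) < size:
        def utility(seq):
            if not chosen:
                return remaining[seq]
            bonus = min(hamming(seq, c) for c in chosen)
            return remaining[seq] + diversity_weight * bonus
        best = max(remaining, key=utility)
        chosen.append(best)
        del remaining[best]
    return chosen

# Hypothetical predicted fitness for six four-residue variants.
predicted = {
    "GVAA": 1.9, "GVAS": 1.8, "GVTA": 1.7,
    "ALLW": 1.2, "MKLW": 1.0, "AAAA": 0.2,
}

fitness_only = select_library(predicted, size=3, diversity_weight=0.0)
balanced = select_library(predicted, size=3, diversity_weight=0.3)
print("Fitness only:", fitness_only)
print("Balanced:    ", balanced)
```

With the weight set to zero the library collapses onto three close relatives of the top variant; with a modest weight, one of those slots goes to a more distant sequence that still has reasonable predicted fitness.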

Thomas et al. (2025) demonstrated the full ML-guided engineering cycle: they used TeleProt, a framework blending evolutionary and experimental data, to engineer highly active nuclease enzymes. Their pipeline achieved activity improvements that would have required orders of magnitude more experimental screening via traditional directed evolution.

Tran & Hy (2024) explored protein language models (PLMs), large language models trained on protein sequences, as guides for directed evolution. PLMs learn evolutionary patterns from millions of natural sequences, reportedly enabling prediction of mutation effects with minimal or no experimental data for the specific enzyme of interest.
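One common way to turn a sequence model into a mutation-effect predictor is to compare the probability it assigns to the mutant residue with the probability it assigns to the wild-type residue at the same site. The sketch below shows that log-odds score with a toy stand-in for the model: per-position frequencies from six made-up homologs. A real PLM would supply context-dependent probabilities instead, and this is not the specific procedure of Tran & Hy.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical aligned homologs, a tiny stand-in for the natural sequences a PLM learns from.
homologs = ["MKVLA", "MKVLG", "MRVLA", "MKILA", "MKVLA", "MKVFA"]

def site_probabilities(seqs, pseudocount=1.0):
    """Per-position residue probabilities with additive smoothing."""
    probs = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        probs.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return probs

def mutation_score(probs, wild_type, position, new_aa):
    """Log-odds of the mutant residue versus the wild-type residue at one site.

    Positions are zero-based. Higher scores mean the toy model finds the
    substitution more plausible.
    """
    p_mut = probs[position][new_aa]
    p_wt = probs[position][wild_type[position]]
    return math.log(p_mut) - math.log(p_wt)

probs = site_probabilities(homologs)
wild_type = "MKVLA"

k2r = mutation_score(probs, wild_type, 1, "R")  # R is seen once at this site
k2w = mutation_score(probs, wild_type, 1, "W")  # W is never seen at this site
print(f"K2R score: {k2r:.2f}")
print(f"K2W score: {k2w:.2f}")
```

No fitness measurements enter the score, which is what makes this style of prediction usable before any experiment on the target enzyme has been run.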

Key Claims & Evidence

| Claim | Evidence | Verdict |
| --- | --- | --- |
| ML reduces experimental screening by orders of magnitude | TeleProt achieves superior nucleases with fewer rounds of screening (Thomas et al. 2025) | Supported; demonstrated across multiple enzyme targets |
| Co-optimising fitness and diversity improves library design | MODIFY algorithm outperforms fitness-only or random library design (Ding et al. 2024) | Supported; validated experimentally |
| Protein language models guide directed evolution | PLMs trained on natural sequences identify mutation hotspots for optimisation (Tran & Hy 2024) | Promising; accuracy varies by enzyme family |
| ML is complementary to, not a replacement for, experimental evolution | ML narrows the search space; experimental validation remains essential (Yang et al. 2024) | Confirmed; current consensus in the field |

Open Questions

  • Epistasis: Mutation effects are often non-additive; two individually beneficial mutations may be deleterious when combined. Can ML models capture these epistatic interactions from limited training data?
  • Novel functions: ML excels at optimising known functions but struggles with predicting entirely new catalytic activities. Can generative models design enzymes for reactions not found in nature?
  • Data requirements: How much experimental data is needed to train a useful ML model for a specific enzyme? Can transfer learning from related enzymes reduce this requirement?
  • Reproducibility: ML predictions depend heavily on training data curation and model architecture. Can standardised benchmarks and open-source tools improve reproducibility across laboratories?
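The epistasis question in the list above can be made concrete with the usual additive definition: the double mutant's measured fitness minus what the two single-mutant effects would predict if they simply added up. The fitness values below are made up, and the classification labels follow common usage in the fitness-landscape literature rather than any of the cited papers.

```python
def pairwise_epistasis(f_wt, f_a, f_b, f_ab):
    """Deviation of the double mutant from the additive expectation."""
    expected = f_a + f_b - f_wt
    return f_ab - expected

def classify(f_wt, f_a, f_b, f_ab, tol=1e-9):
    """Label the interaction between mutations A and B.

    Sign epistasis means a mutation's effect changes sign depending on
    whether the other mutation is present.
    """
    if abs(pairwise_epistasis(f_wt, f_a, f_b, f_ab)) < tol:
        return "none"
    a_flips = (f_a - f_wt) * (f_ab - f_b) < 0   # A alone vs A on the B background
    b_flips = (f_b - f_wt) * (f_ab - f_a) < 0   # B alone vs B on the A background
    if a_flips and b_flips:
        return "reciprocal sign"
    if a_flips or b_flips:
        return "sign"
    return "magnitude"

# Hypothetical fitness values: both single mutants help, the double mutant hurts.
eps = pairwise_epistasis(f_wt=1.0, f_a=1.4, f_b=1.3, f_ab=0.8)
label = classify(f_wt=1.0, f_a=1.4, f_b=1.3, f_ab=0.8)
print(f"epistasis = {eps:.2f} ({label})")
```

A purely additive model predicts 1.7 for this double mutant, so the measured 0.8 is a large negative deviation of the kind that limited training data makes hard to learn.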
Referenced Papers

  • [1] Yang, J., Li, F.-Z. & Arnold, F.H. (2024). Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering. ACS Central Science, 10(2), 226-241. DOI: 10.1021/acscentsci.3c01275
  • [2] Ding, K., Chin, M., Zhao, Y., Huang, W., Mai, B.K., Wang, H., et al. (2024). Machine learning-guided co-optimization of fitness and diversity facilitates combinatorial library design in enzyme engineering. Nature Communications, 15(1), 6038. DOI: 10.1038/s41467-024-50698-y
  • [3] Thomas, N., Belanger, D., Xu, C., Lee, H., Hirano, K., Iwai, K., et al. (2025). Engineering highly active nuclease enzymes with machine learning and high-throughput screening. Cell Systems, 16(3), 101236. DOI: 10.1016/j.cels.2025.101236
  • [4] Tran, T.V.T. & Hy, T.S. (2024). Protein Design by Directed Evolution Guided by Large Language Models. IEEE Transactions on Evolutionary Computation, 29(2), 418-428. DOI: 10.1109/TEVC.2024.3439690
  • [5] Grigorakis, K., Ferousi, C. & Topakas, E. (2025). Protein Engineering for Industrial Biocatalysis: Principles, Approaches, and Lessons from Engineered PETases. Catalysts, 15(2), 147. DOI: 10.3390/catal15020147
