Trend Analysis · Chemistry & Materials
Machine Learning Meets Directed Evolution: The New Era of Enzyme Engineering
By Sean K.S. Shin · 2026-03-17
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.
The Question
Frances Arnold's Nobel Prize-winning directed evolution mimics natural selection in the laboratory: introduce random mutations, screen for improved function, repeat. But the protein fitness landscape is astronomically large: a 300-amino-acid protein has 20³⁰⁰ possible sequences. Experimental screening, even with high-throughput methods, can explore only a tiny fraction. Machine learning (ML) promises to navigate this landscape computationally, predicting which mutations are likely to improve function before any experiment is performed. Can ML-guided evolution achieve results impossible through random mutagenesis alone?
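The scale argument is easy to verify with a back-of-the-envelope calculation. A minimal Python sketch follows; the screening-throughput numbers are illustrative assumptions, not figures from the cited papers:

```python
import math

# Size of sequence space for a 300-residue protein built from the
# 20 canonical amino acids: 20**300. Work in log space, since the
# number itself is far too large to represent usefully.
n_residues = 300
n_amino_acids = 20
log10_sequences = n_residues * math.log10(n_amino_acids)
print(f"log10(20^300) = {log10_sequences:.1f}")  # about 390.3

# Even a heroic screen of 10^9 variants per day for 10^9 days
# (an assumed, wildly optimistic throughput) covers only a
# vanishing fraction of the landscape.
log10_screened = 9 + 9
log10_fraction = log10_screened - log10_sequences
print(f"fraction explored: 10^{log10_fraction:.0f}")
```

This is why "a tiny fraction" undersells it: the explored fraction is not small, it is effectively zero.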
Landscape
Yang, Li & Arnold (2024), writing in ACS Central Science, reviewed the opportunities and challenges of ML-assisted enzyme engineering. Their perspective from Arnold's own laboratory, the birthplace of directed evolution, carries particular authority. They identified two broad areas where ML adds value: (1) starting-point discovery, through functional annotation or generation of novel protein sequences; and (2) navigation of protein fitness landscapes, by learning mappings between sequences and fitness values to guide library design and exploration of distant sequence space.
Ding et al. (2024) introduced MODIFY, an ML algorithm that co-optimises fitness and diversity in combinatorial library design. The key insight: maximising fitness alone leads to narrow libraries clustered around known solutions, while maximising diversity alone wastes screening capacity on non-functional variants. MODIFY balances both objectives, producing libraries in which a higher fraction of variants are both functional and novel.
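MODIFY's actual formulation is not reproduced here; as a hedged illustration of the fitness-diversity trade-off it targets, the toy sketch below greedily builds a library by scoring each candidate as a weighted sum of a made-up fitness value and its minimum Hamming distance to the variants already picked. All sequences, scores, and the `weight` parameter are hypothetical:

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def select_library(candidates, fitness, size, weight=0.5):
    """Greedily pick `size` variants, trading off predicted fitness
    against diversity (min Hamming distance to the picks so far)."""
    picks = [max(candidates, key=fitness)]          # seed with the best
    pool = [c for c in candidates if c != picks[0]]
    while len(picks) < size and pool:
        def score(c):
            diversity = min(hamming(c, p) for p in picks)
            return (1 - weight) * fitness(c) + weight * diversity
        best = max(pool, key=score)
        picks.append(best)
        pool.remove(best)
    return picks

def toy_fitness(seq):
    """Made-up stand-in for an ML fitness prediction."""
    return sum(ch in "ADK" for ch in seq)

# Toy 3-site combinatorial library, two allowed residues per site.
candidates = ["".join(s) for s in product("AV", "DE", "KR")]
library = select_library(candidates, toy_fitness, size=4)
print(library)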
Thomas et al. (2025) demonstrated the full ML-guided engineering cycle: they used TeleProt, a framework blending evolutionary and experimental data, to engineer highly active nuclease enzymes. Their pipeline achieved activity improvements that would have required orders of magnitude more experimental screening via traditional directed evolution.
Tran & Hy (2024) explored protein language models (PLMs), large language models trained on protein sequences, as guides for directed evolution. PLMs learn evolutionary patterns from millions of natural sequences, reportedly enabling prediction of mutation effects with minimal or no experimental data for the specific enzyme of interest.
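A common zero-shot scheme in this family scores a candidate mutation by the log-likelihood ratio of the mutant versus the wild-type residue under the model. The sketch below shows only the arithmetic; the per-position probability table is a hypothetical stand-in for what a real PLM would emit from the masked sequence context:

```python
import math

# Hypothetical per-position amino-acid probabilities, standing in
# for the output of a protein language model.
toy_probs = {
    0: {"A": 0.6, "G": 0.3, "V": 0.1},
    1: {"D": 0.5, "E": 0.45, "K": 0.05},
}

def mutation_score(pos, wt, mut, probs):
    """Zero-shot score: log-likelihood ratio of mutant vs wild type.
    Higher (closer to zero or positive) => more plausible mutation."""
    return math.log(probs[pos][mut]) - math.log(probs[pos][wt])

print(mutation_score(1, "D", "E", toy_probs))  # log(0.9)  ~ -0.105, near-neutral
print(mutation_score(1, "D", "K", toy_probs))  # log(0.1)  ~ -2.303, disfavoured
```

Ranking positions and substitutions by such scores is one way a PLM can propose mutation hotspots before any wet-lab data exist for the target enzyme.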
Key Claims & Evidence
| Claim | Evidence | Verdict |
|---|---|---|
| ML reduces experimental screening by orders of magnitude | TeleProt achieves superior nucleases with fewer rounds of screening (Thomas et al. 2025) | Supported; demonstrated across multiple enzyme targets |
| Co-optimising fitness and diversity improves library design | MODIFY algorithm outperforms fitness-only or random library design (Ding et al. 2024) | Supported; validated experimentally |
| Protein language models guide directed evolution | PLMs trained on natural sequences identify mutation hotspots for optimisation (Tran & Hy 2024) | Promising; accuracy varies by enzyme family |
| ML is complementary to, not a replacement for, experimental evolution | ML narrows the search space; experimental validation remains essential (Yang et al. 2024) | Confirmed; current consensus in the field |
Open Questions
Epistasis: Mutation effects are often non-additive: two individually beneficial mutations may be deleterious when combined. Can ML models capture these epistatic interactions from limited training data?
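A toy numerical example (hypothetical fitness values) shows why additivity fails under sign epistasis: an additive model predicts the double mutant from the two singles and gets even the direction of the effect wrong.

```python
# Hypothetical measured fitness values (arbitrary units) showing
# sign epistasis: each single mutation helps, the pair hurts.
fitness = {
    "WT": 1.0,
    "A":  1.4,   # mutation A alone: +0.4
    "B":  1.3,   # mutation B alone: +0.3
    "AB": 0.6,   # both together: worse than wild type
}

# An additive model sums the single-mutant effects onto wild type.
additive_prediction = (fitness["WT"]
                       + (fitness["A"] - fitness["WT"])
                       + (fitness["B"] - fitness["WT"]))
epistasis = fitness["AB"] - additive_prediction
print(f"additive model predicts {additive_prediction:.1f}, "
      f"measured {fitness['AB']:.1f}, epistasis {epistasis:.1f}")
```

Any ML model trained only on single-mutant data would make the same additive error here, which is why capturing higher-order interactions from limited data remains an open question.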
Novel functions: ML excels at optimising known functions but struggles with predicting entirely new catalytic activities. Can generative models design enzymes for reactions not found in nature?
Data requirements: How much experimental data is needed to train a useful ML model for a specific enzyme? Can transfer learning from related enzymes reduce this requirement?
Reproducibility: ML predictions depend heavily on training data curation and model architecture. Can standardised benchmarks and open-source tools improve reproducibility across laboratories?
Referenced Papers
- [1] Yang, J., Li, F.-Z. & Arnold, F.H. (2024). Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering. ACS Central Science, 10(2), 226-241. DOI: 10.1021/acscentsci.3c01275
- [2] Ding, K. et al. (2024). Machine learning-guided co-optimization of fitness and diversity facilitates combinatorial library design in enzyme engineering. Nature Communications, 15, 6038. DOI: 10.1038/s41467-024-50698-y
- [3] Thomas, N. et al. (2025). Engineering highly active nuclease enzymes with machine learning and high-throughput screening. Cell Systems, 16(3), 101236. DOI: 10.1016/j.cels.2025.101236
- [4] Tran, T.V.T. & Hy, T.S. (2024). Protein Design by Directed Evolution Guided by Large Language Models. IEEE Transactions on Evolutionary Computation. DOI: 10.1109/TEVC.2024.3439690
- [5] Grigorakis, K. et al. (2025). Protein Engineering for Industrial Biocatalysis: Principles, Approaches, and Lessons from Engineered PETases. Catalysts, 15(2), 147. DOI: 10.3390/catal15020147