Trend Analysis · Chemistry & Materials
Machine Learning Meets Directed Evolution: The New Era of Enzyme Engineering
By Sean K.S. Shin · 2026-03-17
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.
The Question
Frances Arnold's Nobel Prize-winning directed evolution mimics natural selection in the laboratory: introduce random mutations, screen for improved function, repeat. But the protein fitness landscape is astronomically large: a 300-amino-acid protein has 20³⁰⁰ possible sequences. Experimental screening, even with high-throughput methods, can explore only a tiny fraction. Machine learning (ML) promises to navigate this landscape computationally, predicting which mutations are likely to improve function before any experiment is performed. Can ML-guided evolution achieve results impossible through random mutagenesis alone?
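The scale argument is easy to verify with a back-of-the-envelope calculation. A minimal Python sketch follows; the screening-throughput numbers are illustrative assumptions, not figures from the cited papers:

```python
import math

# Size of sequence space for a 300-residue protein built from the
# 20 canonical amino acids: 20**300. Work in log space, since the
# number itself is far too large to represent usefully.
n_residues = 300
n_amino_acids = 20
log10_sequences = n_residues * math.log10(n_amino_acids)
print(f"log10(20^300) = {log10_sequences:.1f}")  # about 390.3

# Even a heroic screen of 10^9 variants per day for 10^9 days
# (an assumed, wildly optimistic throughput) covers only a
# vanishing fraction of the landscape.
log10_screened = 9 + 9
log10_fraction = log10_screened - log10_sequences
print(f"fraction explored: 10^{log10_fraction:.0f}")
```

This is why "a tiny fraction" undersells it: the explored fraction is not small, it is effectively zero.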
Landscape
Yang, Li & Arnold (2024), writing in ACS Central Science, reviewed the opportunities and challenges of ML-assisted enzyme engineering. Their perspective from Arnold's own laboratory, the birthplace of directed evolution, carries particular authority. They identified two broad areas where ML adds value: (1) starting-point discovery, through functional annotation or generation of novel protein sequences; and (2) navigation of protein fitness landscapes, by learning mappings between sequences and fitness values to guide library design and exploration of distant sequence space.
Ding et al. (2024) introduced MODIFY, an ML algorithm that co-optimises fitness and diversity in combinatorial library design. The key insight: maximising fitness alone leads to narrow libraries clustered around known solutions, while maximising diversity alone wastes screening capacity on non-functional variants. MODIFY balances both objectives, producing libraries in which a higher fraction of variants are both functional and novel.
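MODIFY's actual formulation is not reproduced here; as a hedged illustration of the fitness-diversity trade-off it targets, the toy sketch below greedily builds a library by scoring each candidate as a weighted sum of a made-up fitness value and its minimum Hamming distance to the variants already picked. All sequences, scores, and the `weight` parameter are hypothetical:

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def select_library(candidates, fitness, size, weight=0.5):
    """Greedily pick `size` variants, trading off predicted fitness
    against diversity (min Hamming distance to the picks so far)."""
    picks = [max(candidates, key=fitness)]          # seed with the best
    pool = [c for c in candidates if c != picks[0]]
    while len(picks) < size and pool:
        def score(c):
            diversity = min(hamming(c, p) for p in picks)
            return (1 - weight) * fitness(c) + weight * diversity
        best = max(pool, key=score)
        picks.append(best)
        pool.remove(best)
    return picks

def toy_fitness(seq):
    """Made-up stand-in for an ML fitness prediction."""
    return sum(ch in "ADK" for ch in seq)

# Toy 3-site combinatorial library, two allowed residues per site.
candidates = ["".join(s) for s in product("AV", "DE", "KR")]
library = select_library(candidates, toy_fitness, size=4)
print(library)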
Thomas et al. (2025) demonstrated the full ML-guided engineering cycle: they used TeleProt, a framework blending evolutionary and experimental data, to engineer highly active nuclease enzymes. Their pipeline achieved activity improvements that would have required orders of magnitude more experimental screening via traditional directed evolution.
Tran & Hy (2024) explored protein language models (PLMs), large language models trained on protein sequences, as guides for directed evolution. PLMs learn evolutionary patterns from millions of natural sequences, reportedly enabling prediction of mutation effects with minimal or no experimental data for the specific enzyme of interest.
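A common zero-shot scheme in this family scores a candidate mutation by the log-likelihood ratio of the mutant versus the wild-type residue under the model. The sketch below shows only the arithmetic; the per-position probability table is a hypothetical stand-in for what a real PLM would emit from the masked sequence context:

```python
import math

# Hypothetical per-position amino-acid probabilities, standing in
# for the output of a protein language model.
toy_probs = {
    0: {"A": 0.6, "G": 0.3, "V": 0.1},
    1: {"D": 0.5, "E": 0.45, "K": 0.05},
}

def mutation_score(pos, wt, mut, probs):
    """Zero-shot score: log-likelihood ratio of mutant vs wild type.
    Higher (closer to zero or positive) => more plausible mutation."""
    return math.log(probs[pos][mut]) - math.log(probs[pos][wt])

print(mutation_score(1, "D", "E", toy_probs))  # log(0.9)  ~ -0.105, near-neutral
print(mutation_score(1, "D", "K", toy_probs))  # log(0.1)  ~ -2.303, disfavoured
```

Ranking positions and substitutions by such scores is one way a PLM can propose mutation hotspots before any wet-lab data exist for the target enzyme.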
Key Claims & Evidence
| Claim | Evidence | Verdict |
|---|---|---|
| ML reduces experimental screening by orders of magnitude | TeleProt achieves superior nucleases with fewer rounds of screening (Thomas et al. 2025) | Supported; demonstrated across multiple enzyme targets |
| Co-optimising fitness and diversity improves library design | MODIFY algorithm outperforms fitness-only or random library design (Ding et al. 2024) | Supported; validated experimentally |
| Protein language models guide directed evolution | PLMs trained on natural sequences identify mutation hotspots for optimisation (Tran & Hy 2024) | Promising; accuracy varies by enzyme family |
| ML is complementary to, not a replacement for, experimental evolution | ML narrows the search space; experimental validation remains essential (Yang et al. 2024) | Confirmed; current consensus in the field |
Open Questions
Epistasis: Mutation effects are often non-additive: two individually beneficial mutations may be deleterious when combined. Can ML models capture these epistatic interactions from limited training data?
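A toy numerical example (hypothetical fitness values) shows why additivity fails under sign epistasis: an additive model predicts the double mutant from the two singles and gets even the direction of the effect wrong.

```python
# Hypothetical measured fitness values (arbitrary units) showing
# sign epistasis: each single mutation helps, the pair hurts.
fitness = {
    "WT": 1.0,
    "A":  1.4,   # mutation A alone: +0.4
    "B":  1.3,   # mutation B alone: +0.3
    "AB": 0.6,   # both together: worse than wild type
}

# An additive model sums the single-mutant effects onto wild type.
additive_prediction = (fitness["WT"]
                       + (fitness["A"] - fitness["WT"])
                       + (fitness["B"] - fitness["WT"]))
epistasis = fitness["AB"] - additive_prediction
print(f"additive model predicts {additive_prediction:.1f}, "
      f"measured {fitness['AB']:.1f}, epistasis {epistasis:.1f}")
```

Any ML model trained only on single-mutant data would make the same additive error here, which is why capturing higher-order interactions from limited data remains an open question.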
Novel functions: ML excels at optimising known functions but struggles with predicting entirely new catalytic activities. Can generative models design enzymes for reactions not found in nature?
Data requirements: How much experimental data is needed to train a useful ML model for a specific enzyme? Can transfer learning from related enzymes reduce this requirement?
Reproducibility: ML predictions depend heavily on training data curation and model architecture. Can standardised benchmarks and open-source tools improve reproducibility across laboratories?
Referenced Papers
- [1] Yang, J., Li, F.-Z. & Arnold, F.H. (2024). Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering. ACS Central Science, 10(2), 226-241. DOI: 10.1021/acscentsci.3c01275
- [2] Ding, K. et al. (2024). Machine learning-guided co-optimization of fitness and diversity facilitates combinatorial library design in enzyme engineering. Nature Communications, 15, 6038. DOI: 10.1038/s41467-024-50698-y
- [3] Thomas, N. et al. (2025). Engineering highly active nuclease enzymes with machine learning and high-throughput screening. Cell Systems, 16(3), 101236. DOI: 10.1016/j.cels.2025.101236
- [4] Tran, T.V.T. & Hy, T.S. (2024). Protein Design by Directed Evolution Guided by Large Language Models. IEEE Transactions on Evolutionary Computation. DOI: 10.1109/TEVC.2024.3439690
- [5] Grigorakis, K. et al. (2025). Protein Engineering for Industrial Biocatalysis: Principles, Approaches, and Lessons from Engineered PETases. Catalysts, 15(2), 147. DOI: 10.3390/catal15020147