AI Scientist v2: When Machine-Written Papers Pass Human Peer Review

AI Scientist-v2 automates the full scientific workflow—hypothesis formation, experimentation, data analysis, and paper writing—using agentic tree search. The resulting papers, fully AI-generated, achieve an average reviewer score of 6.33 in human peer review, meeting the acceptance threshold for workshop venues. The question is no longer whether AI can write papers, but what this means for scientific practice.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The automation of scientific research has proceeded in stages. First, AI tools automated data analysis—running statistical tests, fitting models, generating plots. Then, AI systems began assisting with literature review—searching databases, summarizing papers, identifying gaps. More recently, large language models have been used to draft manuscripts, producing text that is grammatically correct and stylistically appropriate if not always scientifically rigorous.

AI Scientist-v2 (2025) attempts something more ambitious: automating the entire scientific workflow from hypothesis formation through experimentation, data analysis, and paper writing, using agentic tree search to explore the space of possible research directions. The headline result, as reported in the abstract: fully AI-generated papers achieve an average human reviewer score of 6.33, meeting the threshold for workshop-level acceptance.

This threshold is worth examining carefully, not because a score of 6.33 represents excellent science (it does not), but because it marks the point at which AI-generated research clears the same bar as marginal human-generated research in a blind review setting.

The Research Landscape

Automated scientific discovery has a longer history than the current LLM era might suggest. Systems like BACON (Langley et al., 1987) could rediscover simple physical laws from data. More recently, systems like AlphaFold have made transformative contributions to specific scientific problems such as protein structure prediction.

What distinguishes AI Scientist-v2 from these predecessors is generality and integration. AlphaFold solves one type of problem with extraordinary capability. BACON discovers laws from prepared datasets. AI Scientist-v2 attempts to replicate the general workflow of a human researcher across multiple stages: identifying a question worth investigating, designing experiments to address it, running those experiments, analyzing the results, and communicating the findings in a written paper.

The first version of AI Scientist (2024) demonstrated the feasibility of this pipeline but produced papers of limited quality. AI Scientist-v2 introduces agentic tree search as the core mechanism for improving quality: rather than generating a single linear research trajectory, the system explores multiple research directions in a tree structure, evaluating and pruning branches based on intermediate results.

Agentic Tree Search for Research

The tree search mechanism is the primary technical contribution. According to the abstract, the system uses this search to explore the space of possible research directions. At each node in the tree, the agent faces a decision—which hypothesis to pursue, which experimental design to use, how to interpret ambiguous results—and the tree structure allows the system to explore multiple options before committing.

This is a meaningful improvement over linear generation. A human researcher does not pursue the first idea that comes to mind; they consider alternatives, evaluate feasibility, and select the most promising direction. The tree search mechanism provides an analogous capability: the system generates multiple candidate hypotheses, evaluates their feasibility through preliminary experiments, and selects the most promising branch for deeper investigation.
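The generate, evaluate, and select loop described above can be sketched as a small best-first tree search. This is a toy illustration only, not the authors' implementation: the names `tree_search`, `toy_expand`, and `toy_evaluate` are hypothetical, research ideas are stood in for by strings, and the scoring function replaces what would in practice be a preliminary experiment.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass(order=True)
class Node:
    neg_score: float                      # heapq is a min-heap, so store the negated score
    idea: str = field(compare=False)
    depth: int = field(compare=False, default=0)


def tree_search(
    root_idea: str,
    expand: Callable[[str], List[str]],   # proposes child directions (assumed interface)
    evaluate: Callable[[str], float],     # stands in for a preliminary experiment
    max_depth: int = 3,
    beam: int = 2,
    budget: int = 20,
) -> Tuple[str, float]:
    """Expand the most promising node first; keep only the top `beam` children."""
    frontier = [Node(-evaluate(root_idea), root_idea, 0)]
    best = frontier[0]
    visited = 0
    while frontier and visited < budget:
        node = heapq.heappop(frontier)
        visited += 1
        if node.neg_score < best.neg_score:
            best = node
        if node.depth >= max_depth:
            continue                      # leaf: do not expand further
        # Prune: rank the candidate children and drop all but the best `beam`.
        children = sorted(expand(node.idea), key=evaluate, reverse=True)[:beam]
        for child in children:
            heapq.heappush(frontier, Node(-evaluate(child), child, node.depth + 1))
    return best.idea, -best.neg_score


def toy_expand(idea: str) -> List[str]:
    # Each direction can be refined in three hypothetical ways.
    return [idea + step for step in ("a", "b", "c")]


def toy_evaluate(idea: str) -> float:
    # Pretend "a" steps help, "c" steps hurt, and "b" steps are neutral.
    return idea.count("a") - idea.count("c")


best_idea, best_score = tree_search("", toy_expand, toy_evaluate)
print(best_idea, best_score)
```

The `beam` and `budget` parameters capture the trade-off the text points to: a wider beam explores more alternatives before committing, while the budget caps how many branches are evaluated in total.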

The "agentic" qualifier indicates that the system uses tool-calling capabilities—executing code, querying databases, running experiments—rather than generating research purely through text completion. This grounds the system's claims in actual computational results rather than plausible-sounding but unverified assertions.

The 6.33 Score: What It Means

The average reviewer score of 6.33, as stated in the abstract, requires careful interpretation. In typical machine learning conference review scales:

1–3: Clear reject—fundamental flaws in methodology, significance, or correctness.
4–5: Below threshold—some merit but significant weaknesses.
6: Marginally above threshold—acceptable with reservations.
7–8: Good paper—solid contribution with minor issues.
9–10: Excellent—significant contribution to the field.

A score of 6.33 places AI Scientist-v2's output at the marginal acceptance level for workshop papers—venues with acceptance rates typically between 40% and 60%. This is not the same as acceptance at top conferences (ICML, NeurIPS, ICLR main conference), which typically require scores of 6.5–7.0 or higher and have acceptance rates of 20–30%.
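To make the arithmetic concrete, the sketch below averages a set of individual reviewer scores and maps the result onto the bands listed above. The three scores are hypothetical values chosen only because they average to 6.33; the source does not state the individual scores. The rule of truncating a fractional average to its integer band is also an assumption, made because the scale as listed has no entry between 6 and 7.

```python
from statistics import mean

# Bands as listed in the text: (lowest score, highest score, label).
BANDS = [
    (1, 3, "clear reject"),
    (4, 5, "below threshold"),
    (6, 6, "marginally above threshold"),
    (7, 8, "good paper"),
    (9, 10, "excellent"),
]


def band_for(score: float) -> str:
    """Map a (possibly fractional) score to its band by truncating to an integer."""
    whole = int(score)  # assumption: 6.33 is read as falling in band 6
    for low, high, label in BANDS:
        if low <= whole <= high:
            return label
    raise ValueError(f"score {score} is outside the 1-10 scale")


# Hypothetical individual reviewer scores, not taken from the paper.
hypothetical_scores = [6, 7, 6]
avg = round(mean(hypothetical_scores), 2)
print(avg, band_for(avg))
```

Under these assumptions the average lands in the "marginally above threshold" band, which matches the reading given in the text: above the workshop bar, but short of the 7–8 range.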

The achievement is nonetheless significant. It means that in a blind review, human reviewers found the AI-generated papers to be of comparable quality to papers written by human researchers at the workshop level. The papers are not merely grammatically correct; they contain hypotheses, experiments, results, and analyses that pass the scrutiny of expert reviewers.

Passing peer review and doing good science are not identical: peer review is an imperfect filter, and this result exposes that imperfection as well.

Critical Analysis: Claims and Evidence

Claim	Source	Verdict
AI Scientist-v2 achieves workshop-level automated scientific discovery	Abstract	✅ Supported by reported reviewer scores
Agentic tree search is the mechanism for quality improvement	Abstract	✅ Described as core architectural choice
Fully AI-generated papers pass human peer review	Abstract, reported score of 6.33	✅ Supported — score meets workshop acceptance threshold
The system automates hypothesis formation, experimentation, analysis, and writing	Abstract	✅ Reported as implemented pipeline
This represents a qualitative advance over AI Scientist v1	Contextual comparison	⚠️ Plausible given v1's limitations, but direct comparison details matter
AI-generated research is equivalent to human research at workshop level	Interpretation	⚠️ Passes the same review filter; equivalence in scientific contribution is a stronger claim

A critical consideration: the domains in which AI Scientist-v2 operates are likely constrained to areas where experiments can be run computationally (machine learning, numerical simulations) rather than domains requiring physical experiments, human subjects, or long-term observation. The generality claim should be understood within these bounds.

Open Questions

Novelty vs. competence. Passing peer review demonstrates competence, meaning the ability to execute a research workflow correctly. But does AI Scientist-v2 produce genuinely novel insights, or does it recombine existing ideas in technically competent but intellectually incremental ways? Workshop-level papers are typically not expected to be highly novel, so the review score alone cannot answer this.

Research taste. Perhaps the most critical capability a human researcher possesses is taste: the ability to identify which questions are worth asking, which results are surprising, and which directions will be productive. Can tree search approximate taste, or does it produce competent answers to uninteresting questions?

Reproducibility and verification. Can other researchers reproduce the experiments described in AI-generated papers? Are the experimental setups sufficiently detailed and the code sufficiently clean for external verification?

Scientific ecosystem effects. If AI can generate workshop-level papers at minimal cost, what happens to the workshop ecosystem? Does the volume of submissions increase to the point where human review becomes infeasible? Does the signal-to-noise ratio of the scientific literature change?

What This Means for Your Research

For researchers, AI Scientist-v2 is not an immediate replacement for human scientific inquiry—6.33 is not 8.0, and workshop acceptance is not main-conference acceptance. But it may be a useful tool for preliminary exploration: generating initial hypotheses, running screening experiments, identifying promising directions before human researchers invest significant effort.

For the scientific community, the system raises governance questions that will need addressing. If AI-generated papers are submitted to venues without disclosure, reviewers and readers cannot appropriately calibrate their trust. Transparency about AI involvement in research production is not just an ethical concern—it is a prerequisite for the scientific community to adapt its quality-control mechanisms.

The progression from v1 to v2 suggests that agentic architectures with search can produce qualitative improvements in complex tasks, though a direct comparison between the two versions would be needed to confirm it.

References (1)

[1] Yamada, Y., et al. (2025). The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search. arXiv:2504.08066.