
AI Scientist v2: When Machine-Written Papers Pass Human Peer Review

AI Scientist-v2 automates the full scientific workflow—hypothesis formation, experimentation, data analysis, and paper writing—using agentic tree search. The resulting papers, fully AI-generated, achieve an average reviewer score of 6.33 in human peer review, meeting the acceptance threshold for workshop venues. The question is no longer whether AI can write papers, but what this means for scientific practice.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The automation of scientific research has proceeded in stages. First, AI tools automated data analysis—running statistical tests, fitting models, generating plots. Then, AI systems began assisting with literature review—searching databases, summarizing papers, identifying gaps. More recently, large language models have been used to draft manuscripts, producing text that is grammatically correct and stylistically appropriate if not always scientifically rigorous.

AI Scientist-v2 (2025) attempts something more ambitious: automating the entire scientific workflow from hypothesis formation through experimentation, data analysis, and paper writing, using agentic tree search to explore the space of possible research directions. The headline result, as reported in the abstract: fully AI-generated papers achieve an average human reviewer score of 6.33, meeting the threshold for workshop-level acceptance.

This is a qualitative threshold worth examining carefully. Not because a score of 6.33 represents excellent science—it does not—but because it represents the point at which AI-generated research becomes indistinguishable from marginal human-generated research in a blind review setting.

The Research Landscape

Automated scientific discovery has a longer history than the current LLM era might suggest. Systems like BACON (Langley et al., 1987) could rediscover simple physical laws from data. More recently, systems like AlphaFold have made transformative contributions to specific scientific problems (protein structure prediction) through AI methods.

What distinguishes AI Scientist-v2 from these predecessors is generality and integration. AlphaFold solves one type of problem with extraordinary capability. BACON discovers laws from prepared datasets. AI Scientist-v2 attempts to replicate the general workflow of a human researcher across multiple stages: identifying a question worth investigating, designing experiments to address it, running those experiments, analyzing the results, and communicating the findings in a written paper.

The first version of AI Scientist (2024) demonstrated the feasibility of this pipeline but produced papers of limited quality. AI Scientist-v2 introduces agentic tree search as the core mechanism for improving quality: rather than generating a single linear research trajectory, the system explores multiple research directions in a tree structure, evaluating and pruning branches based on intermediate results.

Agentic Tree Search for Research

The tree search mechanism is the primary technical contribution. According to the abstract, the system uses this search to explore the space of possible research directions. At each node in the tree, the agent faces a decision—which hypothesis to pursue, which experimental design to use, how to interpret ambiguous results—and the tree structure allows the system to explore multiple options before committing.

This is a meaningful improvement over linear generation. A human researcher does not pursue the first idea that comes to mind; they consider alternatives, evaluate feasibility, and select the most promising direction. The tree search mechanism provides an analogous capability: the system generates multiple candidate hypotheses, evaluates their feasibility through preliminary experiments, and selects the most promising branch for deeper investigation.
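The explore, evaluate, prune loop described above can be sketched as a small best-first search. This is an illustration only, not the system's actual implementation: the function `tree_search`, the `expand` and `evaluate` callbacks, the `budget` and `prune_below` parameters, and the toy scores are all assumptions introduced here.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass(order=True)
class Node:
    neg_score: float                    # heapq is a min-heap, so store the negated score
    plan: tuple = field(compare=False)  # sequence of research decisions taken so far


def tree_search(
    root_plan: tuple,
    expand: Callable[[tuple], List[tuple]],
    evaluate: Callable[[tuple], float],
    budget: int,
    prune_below: float,
) -> Tuple[tuple, float]:
    """Best-first search: expand the most promising plan, drop weak branches."""
    best = Node(-evaluate(root_plan), root_plan)
    frontier = [best]
    expansions = 0
    while frontier and expansions < budget:
        node = heapq.heappop(frontier)  # highest-scoring plan so far
        expansions += 1
        for child_plan in expand(node.plan):
            score = evaluate(child_plan)
            if score < prune_below:     # prune branches with weak intermediate results
                continue
            child = Node(-score, child_plan)
            heapq.heappush(frontier, child)
            if score > -best.neg_score:
                best = child
    return best.plan, -best.neg_score


# Hypothetical two-stage decision space: pick a hypothesis, then a design.
STAGES = [("hypothesis_A", "hypothesis_B"), ("design_small", "design_large")]
# Made-up stand-in for "preliminary experiment" results.
TOY_SCORE = {
    "hypothesis_A": 0.2,
    "hypothesis_B": 0.6,
    "design_small": 0.1,
    "design_large": 0.3,
}


def expand(plan: tuple) -> List[tuple]:
    if len(plan) >= len(STAGES):
        return []                       # leaf: no more decisions to make
    return [plan + (choice,) for choice in STAGES[len(plan)]]


def evaluate(plan: tuple) -> float:
    return sum(TOY_SCORE[choice] for choice in plan)


best_plan, best_score = tree_search((), expand, evaluate, budget=10, prune_below=0.3)
print(best_plan, round(best_score, 2))
```

In this toy run the `hypothesis_A` branch is pruned after its first evaluation, so the remaining budget goes to refining `hypothesis_B`. A linear pipeline would have committed to whichever hypothesis it generated first.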

The "agentic" qualifier indicates that the system uses tool-calling capabilities—executing code, querying databases, running experiments—rather than generating research purely through text completion. This grounds the system's claims in actual computational results rather than plausible-sounding but unverified assertions.
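A minimal sketch of that grounding idea follows, assuming a hypothetical tool registry. The names `TOOLS`, `run_experiment`, `execute_tool_calls`, and `best_learning_rate` are invented for illustration, and the "experiment" is a toy function rather than real training. The point is only that the reported conclusion is computed from recorded tool results, not from generated text.

```python
from typing import Callable, Dict, List


def run_experiment(learning_rate: float) -> dict:
    """Stand-in for a real training run: returns a deterministic toy loss."""
    return {"learning_rate": learning_rate, "loss": (learning_rate - 0.1) ** 2}


# Hypothetical registry mapping tool names to callables.
TOOLS: Dict[str, Callable[..., dict]] = {"run_experiment": run_experiment}


def execute_tool_calls(calls: List[dict]) -> List[dict]:
    """Run each requested tool and record what actually came back."""
    observations = []
    for call in calls:
        tool = TOOLS.get(call["name"])
        if tool is None:
            # Unknown tools produce an error record instead of a made-up result.
            observations.append({"call": call, "error": "unknown tool"})
            continue
        observations.append({"call": call, "result": tool(**call["arguments"])})
    return observations


def best_learning_rate(observations: List[dict]) -> float:
    """Derive the claim from recorded results only."""
    results = [obs["result"] for obs in observations if "result" in obs]
    return min(results, key=lambda r: r["loss"])["learning_rate"]


# In a real agent these calls would be proposed by the language model.
calls = [
    {"name": "run_experiment", "arguments": {"learning_rate": 0.01}},
    {"name": "run_experiment", "arguments": {"learning_rate": 0.1}},
    {"name": "run_experiment", "arguments": {"learning_rate": 1.0}},
    {"name": "query_database", "arguments": {}},  # not registered in this sketch
]
observations = execute_tool_calls(calls)
print(best_learning_rate(observations))
```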

The 6.33 Score: What It Means

The average reviewer score of 6.33, as stated in the abstract, requires careful interpretation. In typical machine learning conference review scales:

  • 1–3: Clear reject—fundamental flaws in methodology, significance, or correctness.
  • 4–5: Below threshold—some merit but significant weaknesses.
  • 6: Marginally above threshold—acceptable with reservations.
  • 7–8: Good paper—solid contribution with minor issues.
  • 9–10: Excellent—significant contribution to the field.

A score of 6.33 places AI Scientist-v2's output at the marginal acceptance level for workshop papers, venues with acceptance rates typically between 40% and 60%. This is not the same as acceptance at top conferences (ICML, NeurIPS, ICLR main conference), which typically require scores of 6.5–7.0 or higher and have acceptance rates of 20–30%.
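To make the arithmetic concrete, the short sketch below maps an average score onto the bands listed above. The `BANDS` table mirrors that list; the helper `band_for` and the three individual scores are assumptions for illustration. The per-reviewer scores are not given in this post, so `[6, 7, 6]` is just one hypothetical combination of integer scores that averages to 6.33.

```python
import math
from statistics import mean

# (lowest score, highest score, label), following the scale listed above.
BANDS = [
    (1, 3, "clear reject"),
    (4, 5, "below threshold"),
    (6, 6, "marginally above threshold"),
    (7, 8, "good paper"),
    (9, 10, "excellent"),
]


def band_for(score: float) -> str:
    """Round half up to the nearest integer score, then look up its band."""
    nearest = math.floor(score + 0.5)
    for low, high, label in BANDS:
        if low <= nearest <= high:
            return label
    raise ValueError(f"score {score} is outside the 1-10 scale")


# Hypothetical reviewer scores, chosen only because they average to 6.33.
hypothetical_scores = [6, 7, 6]
average = round(mean(hypothetical_scores), 2)
print(average, band_for(average))
```

An average of 6.33 rounds to 6, which is why it lands in the "marginally above threshold" band rather than the "good paper" band.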

The achievement is nonetheless significant. It means that in a blind review, human reviewers found the AI-generated papers to be of comparable quality to papers written by human researchers at the workshop level. The papers are not merely grammatically correct; they contain hypotheses, experiments, results, and analyses that pass the scrutiny of expert reviewers.

Passing peer review and doing good science are not identical, however. Peer review is an imperfect filter, and this result exposes that imperfection as much as it demonstrates the system's capability.

Critical Analysis: Claims and Evidence

| Claim | Source | Verdict |
| --- | --- | --- |
| AI Scientist-v2 achieves workshop-level automated scientific discovery | Abstract | ✅ Supported by reported reviewer scores |
| Agentic tree search is the mechanism for quality improvement | Abstract | ✅ Described as core architectural choice |
| Fully AI-generated papers pass human peer review | Abstract, reported score of 6.33 | ✅ Supported: score meets workshop acceptance threshold |
| The system automates hypothesis formation, experimentation, analysis, and writing | Abstract | ✅ Reported as implemented pipeline |
| This represents a qualitative advance over AI Scientist v1 | Contextual comparison | ⚠️ Plausible given v1's limitations, but direct comparison details matter |
| AI-generated research is equivalent to human research at workshop level | Interpretation | ⚠️ Passes the same review filter; equivalence in scientific contribution is a stronger claim |

A critical consideration: the domains in which AI Scientist-v2 operates are likely constrained to areas where experiments can be run computationally (machine learning, numerical simulations) rather than domains requiring physical experiments, human subjects, or long-term observation. The generality claim should be understood within these bounds.

Open Questions

  • Novelty vs. competence. Passing peer review demonstrates competence—the ability to execute a research workflow correctly. But does AI Scientist-v2 produce genuinely novel insights, or does it recombine existing ideas in technically competent but intellectually incremental ways? Workshop-level papers, by definition, are not expected to be highly novel.
  • Research taste. Perhaps the most critical capability a human researcher possesses is taste: the ability to identify which questions are worth asking, which results are surprising, and which directions will be productive. Can tree search approximate taste, or does it produce competent answers to uninteresting questions?
  • Reproducibility and verification. Can other researchers reproduce the experiments described in AI-generated papers? Are the experimental setups sufficiently detailed and the code sufficiently clean for external verification?
  • Scientific ecosystem effects. If AI can generate workshop-level papers at minimal cost, what happens to the workshop ecosystem? Does the volume of submissions increase to the point where human review becomes infeasible? Does the signal-to-noise ratio of the scientific literature change?

What This Means for Your Research

For researchers, AI Scientist-v2 is not an immediate replacement for human scientific inquiry: 6.33 is not 8.0, and workshop acceptance is not main-conference acceptance. But it may be a useful tool for preliminary exploration: generating initial hypotheses, running screening experiments, identifying promising directions before human researchers invest significant effort.

For the scientific community, the system raises governance questions that will need addressing. If AI-generated papers are submitted to venues without disclosure, reviewers and readers cannot appropriately calibrate their trust. Transparency about AI involvement in research production is not just an ethical concern; it is a prerequisite for the scientific community to adapt its quality-control mechanisms.

The progression from v1 to v2 demonstrates that agentic architectures with search can produce qualitative improvements in complex tasks.

References

[1] (2025). AI Scientist v2: Automated Scientific Discovery via Agentic Tree Search. arXiv:2504.08066.
