
When AI Writes the Paper: How Autonomous Research Agents Passed Peer Review

The AI Scientist-v2 produced the first fully AI-generated paper to pass peer review at an ICLR workshop, while Kosmos autonomously reproduced human discoveries across multiple scientific disciplines.

By OrdoResearch
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

In April 2025, an AI system submitted three research papers to a peer-reviewed ICLR workshop. One was accepted. No human wrote a single sentence. This was not a stunt or a thought experiment — it was the output of The AI Scientist-v2, a system that autonomously generates hypotheses, writes code, runs experiments, produces figures, and drafts manuscripts. The era of autonomous scientific discovery by AI agents has arrived, and the implications extend far beyond machine learning.

From Templates to True Autonomy

The original AI Scientist (Lu et al., 2024) demonstrated feasibility but relied heavily on human-authored code templates for each research topic. This dependency limited its scope and contradicted the goal of autonomous discovery. Yamada, Lange, Lu et al. (2025) address this directly with The AI Scientist-v2, which eliminates template dependency through two architectural innovations.

The system introduces an Experiment Progress Manager that coordinates four stages of research — preliminary investigation, hyperparameter tuning, research agenda execution, and ablation studies — mirroring the structure of human scientific practice. More significantly, it replaces the predecessor's linear experimentation pipeline with agentic tree search, allowing the system to explore branching hypotheses rather than following a single path to completion.
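
To make the contrast with a linear pipeline concrete, here is a minimal Python sketch of what tree search over experiment branches might look like. This is an illustration under assumptions: `ExperimentNode`, `propose_variants`, and `run_experiment` are hypothetical stand-ins, not interfaces from the paper.

```python
from dataclasses import dataclass, field

# Sketch of agentic tree search over experiment branches. ExperimentNode,
# propose_variants, and run_experiment are hypothetical stand-ins, not
# interfaces from the paper.

@dataclass
class ExperimentNode:
    config: dict                                  # hypothesis/hyperparameters for this branch
    score: float = 0.0                            # validation metric after running
    children: list = field(default_factory=list)  # expanded follow-up experiments

def propose_variants(node: ExperimentNode, k: int) -> list[dict]:
    """An LLM agent would propose k follow-up experiment configs (stubbed)."""
    return [{**node.config, "variant": i} for i in range(k)]

def run_experiment(config: dict) -> float:
    """Generated code for this config would execute here (stubbed)."""
    return 0.0

def tree_search(root: ExperimentNode, rounds: int, branch: int) -> ExperimentNode:
    """Repeatedly expand the most promising unexpanded node, exploring
    branching hypotheses instead of one linear pipeline."""
    frontier = [root]
    for _ in range(rounds):
        leaf = max((n for n in frontier if not n.children), key=lambda n: n.score)
        for cfg in propose_variants(leaf, branch):
            child = ExperimentNode(config=cfg, score=run_experiment(cfg))
            leaf.children.append(child)
            frontier.append(child)
    return max(frontier, key=lambda n: n.score)  # best branch found
```

The payoff of the branching structure is that a dead-end hypothesis simply stops being expanded, rather than being carried through to completion as a linear pipeline would.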

A Vision-Language Model feedback loop provides iterative refinement of figures and visualizations, addressing a common weakness of automated systems. The result is a pipeline that operates across multiple machine learning domains without requiring domain-specific scaffolding.
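
The refinement loop itself is conceptually simple. The sketch below shows one plausible shape for it; `render_figure`, `vlm_critique`, and `revise_plot_code` are stubs standing in for real rendering and model calls, and none of these names come from the paper.

```python
# Sketch of an iterative VLM feedback loop for figures. All three helpers
# are stubs standing in for real rendering and model calls; the names are
# illustrative, not from the paper.

def render_figure(plot_code: str) -> str:
    """Execute plotting code and return the saved image path (stub)."""
    return "figure.png"

def vlm_critique(image_path: str) -> dict:
    """A vision-language model would review legibility, labels, layout (stub)."""
    return {"acceptable": True, "issues": []}

def revise_plot_code(plot_code: str, issues: list) -> str:
    """A code-writing agent would apply the flagged fixes (stub)."""
    return plot_code

def refine_figure(plot_code: str, max_rounds: int = 3) -> str:
    """Render, critique, and revise until the VLM accepts the figure."""
    for _ in range(max_rounds):
        feedback = vlm_critique(render_figure(plot_code))
        if feedback["acceptable"]:
            break
        plot_code = revise_plot_code(plot_code, feedback["issues"])
    return plot_code
```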

The ICLR workshop submission provides a concrete benchmark. Of three fully autonomous manuscripts, one achieved an average reviewer score of 6.33, placing it in the top 45% of submissions and meeting the acceptance threshold. Notably, the accepted paper reported negative results — finding that compositional regularization does not yield significant improvements — which the reviewers valued for its clarity and honesty.

Scaling Scientific Output with Kosmos

While The AI Scientist-v2 focuses on the end-to-end manuscript pipeline within machine learning, Kosmos (Mitchener, Yiu, Chang et al., 2025) takes a different approach: automating data-driven discovery across diverse scientific disciplines including metabolomics, materials science, neuroscience, and statistical genetics.

Kosmos uses a structured world model to coordinate large numbers of parallel agents — data analysis agents and literature search agents — sharing information through a continuously updated knowledge representation. In a single run, the system executes an average of 42,000 lines of code across 166 data analysis agent rollouts and reads 1,500 full-length scientific papers across 36 literature review rollouts. Every claim in a Kosmos report is linked to either a data analysis notebook or a cited paper, ensuring full traceability.
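
As a rough illustration of that traceability constraint, the sketch below implements a shared claim store that rejects statements without a source. The `Claim` and `WorldModel` names, their fields, and the example entries are assumptions made for illustration, not Kosmos's actual schema.

```python
from dataclasses import dataclass, field

# Sketch of a shared world model that enforces per-claim provenance, in the
# spirit of Kosmos's traceability requirement. Class names, fields, and the
# example entries are assumptions, not the system's actual schema.

@dataclass
class Claim:
    text: str
    source_type: str  # "notebook" (data analysis) or "paper" (literature)
    source_ref: str   # notebook path or paper citation

@dataclass
class WorldModel:
    claims: list = field(default_factory=list)

    def add_claim(self, text: str, source_type: str, source_ref: str) -> None:
        """Reject any claim that cannot be traced to a notebook or a paper."""
        if source_type not in {"notebook", "paper"} or not source_ref:
            raise ValueError("every claim must cite a notebook or a paper")
        self.claims.append(Claim(text, source_type, source_ref))

# Parallel agents would read from and write to the same store.
# Both entries below are hypothetical examples:
wm = WorldModel()
wm.add_claim("Metabolite X is elevated in cohort A",
             source_type="notebook", source_ref="analysis/run_042.ipynb")
wm.add_claim("Pathway Y regulates metabolite X",
             source_type="paper", source_ref="Doe et al., 2023")
```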

Independent evaluation by expert scientists found that 79.4% of statements in Kosmos reports were accurate, with data analysis claims reaching 85.5% reproducibility and literature review claims reaching 82.1% validation. Synthesis statements — where the system interprets and connects findings — achieved 57.9% accuracy, highlighting the gap that remains in higher-order scientific reasoning.

The system reported seven discoveries across its test runs: three independently reproduced findings from preprinted or unpublished manuscripts that Kosmos had not accessed, and four were novel contributions to the scientific literature. Collaborating academic groups estimated that a 20-cycle Kosmos run would have taken them 6.14 months of research time to complete.

What These Systems Cannot Do

The enthusiasm around AI scientists must be tempered by honest assessment of their limitations. The AI Scientist-v2's accepted paper, while technically sound, received criticism for insufficient justification of its methodological choices — precisely the kind of deep conceptual reasoning that distinguishes competent research from important research.

Kosmos's accuracy drops significantly for synthesis and interpretation statements (57.9% versus 85.5% for data analysis), suggesting that while AI agents excel at executing well-defined analytical procedures, they struggle with the integrative reasoning that produces genuine insight.

Neither system currently handles experimental design in the physical sciences, field research, or any domain requiring embodied interaction with the world. Both operate within the confines of computational experiments on existing datasets.

Open Questions

The emergence of functional AI scientists raises questions that the field has only begun to address. How should peer review adapt when reviewers cannot distinguish AI-authored from human-authored submissions? What does authorship mean when the entire research pipeline is automated? And how do we prevent these systems from amplifying existing biases in the scientific literature they consume?

Perhaps the most consequential question is economic: if an AI system can produce workshop-quality research at negligible marginal cost, what happens to the incentive structures — tenure, grants, publication records — that currently organize scientific labor?

Looking Forward

The trajectory from v1 to v2 of The AI Scientist, and the parallel development of systems like Kosmos, suggest rapid progress. The key question is no longer whether AI can do science, but how to ensure that automated science is reliable, novel, and ethically governed. The research community's response to this challenge will shape the future of scientific discovery.


References

Yamada, Y., Lange, R. T., Lu, C., Hu, S., Lu, C., Foerster, J., Clune, J., & Ha, D. (2025). The AI Scientist-v2: Workshop-level automated scientific discovery via agentic tree search. arXiv preprint [arXiv:2504.08066](https://arxiv.org/abs/2504.08066).

Mitchener, L., Yiu, A., Chang, B., Bourdenx, M., et al. (2025). Kosmos: An AI scientist for autonomous discovery. arXiv preprint [arXiv:2511.02824](https://arxiv.org/abs/2511.02824).

