SWE-Bench Pro: Why AI Coding Agents Struggle with Real Enterprise Code

AI coding agents solve 43.6% of public benchmark tasks, but how do they fare on real enterprise codebases? SWE-Bench Pro reveals that performance drops steeply when agents face long-horizon, multi-file engineering tasks drawn from commercial repositories, exposing a significant gap between benchmark scores and practical capability.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

A software engineer at a mid-size company opens a ticket: refactor the authentication module to support multi-tenant SSO, update the test suite, ensure backward compatibility with three legacy endpoints, and document the changes. The task touches twelve files across four directories and requires understanding an undocumented internal API. It takes the engineer two days. Could an AI coding agent do it?

The honest answer, according to Deng et al. (2025), is: probably not. While AI coding agents have made genuine progress on structured benchmark tasks, their performance degrades substantially when confronted with the kind of work that fills actual engineering backlogs: long-horizon tasks that span multiple files, require contextual understanding of large codebases, and demand the architectural judgment that comes from familiarity with a system's history and constraints.

The Research Landscape

SWE-Bench Pro is a benchmark containing nearly 1,900 problems drawn from 41 repositories, organized into three tiers: public repositories (where contamination through training data is possible), held-out repositories (not publicly available during training), and commercial enterprise codebases (proprietary code that no model has seen). All tasks are human-verified with sufficient context provided, and the benchmark specifically targets long-horizon tasks: problems that may require hours to days for a professional software engineer to complete, often involving patches across multiple files and substantial code modifications.

The benchmark design addresses a known weakness of existing evaluations like the original SWE-Bench: most prior benchmarks test relatively contained, single-file modifications where the solution can be inferred from local context. Real software engineering rarely works this way. A bug fix in production code often requires understanding how a component interacts with distant parts of the system, respecting invariants that are implicit rather than documented, and making changes that are consistent with the codebase's architectural patterns.
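One way to make the single-file versus multi-file distinction concrete is to count the paths a patch modifies. The sketch below is a minimal illustration, assuming patches arrive as git-style unified diffs; the helper names and the example patch are hypothetical and are not taken from the paper or the benchmark's tooling.

```python
import re

# Matches the header git emits for each file in a patch:
# "diff --git a/<path> b/<path>"
_DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


def files_touched(patch: str) -> set[str]:
    """Return the set of file paths modified by a git-style unified diff."""
    touched = set()
    for line in patch.splitlines():
        match = _DIFF_HEADER.match(line)
        if match:
            touched.add(match.group(2))  # use the post-change path
    return touched


def is_multi_file(patch: str) -> bool:
    """A patch counts as multi-file if it modifies more than one path."""
    return len(files_touched(patch)) > 1


# Hypothetical two-file patch, for illustration only.
example_patch = """\
diff --git a/auth/sso.py b/auth/sso.py
--- a/auth/sso.py
+++ b/auth/sso.py
@@ -1,2 +1,3 @@
+import tenants
 def login(): ...
diff --git a/tests/test_sso.py b/tests/test_sso.py
--- a/tests/test_sso.py
+++ b/tests/test_sso.py
@@ -1 +1,2 @@
+def test_multi_tenant(): ...
"""

print(sorted(files_touched(example_patch)))
print(is_multi_file(example_patch))
```

A file count is only a rough proxy: it says nothing about how tightly the edited files depend on each other, which is the harder part of the problem described above.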

The Performance Gap

The central finding is sobering for those who track AI coding benchmarks. On public repositories (the kind of code that appears in training data and existing benchmarks), current agents achieve resolution rates around 43.6%. This is a genuine capability: roughly two in five of these structured engineering tasks can be addressed by current systems.

But performance drops significantly on commercial enterprise codebases. The authors report that agents struggle with the characteristics that define enterprise software: large interconnected codebases where changes propagate across module boundaries, domain-specific conventions that differ from open-source norms, and implicit requirements that are not captured in issue descriptions.
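The tier comparison comes down to a per-tier resolution rate: resolved tasks divided by attempted tasks within each split. The following is a minimal sketch of that arithmetic, assuming a task counts as resolved when the agent's patch passes the task's checks. The `TaskResult` type, the tier labels, and the outcomes are placeholders invented for illustration, not data from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    task_id: str
    tier: str       # assumed labels: "public", "held_out", "commercial"
    resolved: bool  # assumed criterion: the agent's patch passed the task's checks


def resolution_rate_by_tier(results):
    """Map each tier to the fraction of its tasks that were resolved."""
    totals = defaultdict(int)
    solved = defaultdict(int)
    for r in results:
        totals[r.tier] += 1
        solved[r.tier] += int(r.resolved)
    return {tier: solved[tier] / totals[tier] for tier in totals}


# Placeholder outcomes, NOT results from the paper.
toy_results = [
    TaskResult("t1", "public", True),
    TaskResult("t2", "public", False),
    TaskResult("t3", "commercial", False),
    TaskResult("t4", "commercial", False),
]

rates = resolution_rate_by_tier(toy_results)
gap = rates["public"] - rates["commercial"]
print(rates, gap)
```

Reporting the rate per tier, rather than one pooled number, is what lets the gap between public and commercial code show up at all; a pooled average would hide it.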

Failure Pattern Analysis

Deng et al. analyze the failure modes of AI agents on the benchmark and identify several recurring patterns:

  • Context navigation failures: Agents fail to locate the relevant code in large repositories, spending their budget exploring wrong directories or fixating on superficially similar but incorrect files.
  • Multi-file coordination failures: Even when agents correctly identify the change needed in one file, they fail to propagate related changes to dependent files, the kind of cross-cutting modification that human engineers handle through system-level understanding.
  • Specification inference failures: Enterprise tasks often have implicit requirements ("don't break the billing integration" or "maintain backward compatibility with API v2") that are not stated in the task description but would be obvious to a developer familiar with the system.
  • Long-horizon planning failures: Tasks requiring a sequence of coordinated changes (first refactor this, then update that, then add tests) expose the limited planning capability of current agents, which tend to attempt changes in isolation rather than as part of a coherent plan.
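The four patterns above can be treated as a labeling scheme for failed runs. The sketch below shows one hypothetical way to tally such labels into a distribution; the enum, the function, and the sample labels are assumptions made for illustration and do not reproduce the authors' annotation process or their findings.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    CONTEXT_NAVIGATION = "context navigation"
    MULTI_FILE_COORDINATION = "multi-file coordination"
    SPECIFICATION_INFERENCE = "specification inference"
    LONG_HORIZON_PLANNING = "long-horizon planning"


def failure_distribution(labels):
    """Return each failure mode's share of all labeled failures."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    # Counter returns 0 for modes that never occurred, so every mode is reported.
    return {mode: counts[mode] / total for mode in FailureMode}


# Hypothetical labels for four failed runs, for illustration only.
toy_labels = [
    FailureMode.CONTEXT_NAVIGATION,
    FailureMode.CONTEXT_NAVIGATION,
    FailureMode.MULTI_FILE_COORDINATION,
    FailureMode.LONG_HORIZON_PLANNING,
]

shares = failure_distribution(toy_labels)
for mode, share in shares.items():
    print(f"{mode.value}: {share:.2f}")
```

In practice a single failed run may show more than one pattern, so a one-label-per-run tally like this is a simplification.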

Critical Analysis: Claims and Evidence

| Claim | Evidence | Verdict |
| --- | --- | --- |
| Agents achieve ~43.6% on public benchmark tasks | Evaluated across multiple leading agents on public repository split | ✅ Supported |
| Performance drops significantly on enterprise codebases | Comparison across public, held-out, and commercial tiers | ✅ Supported |
| Long-horizon multi-file tasks are the primary difficulty | Failure pattern analysis across ~1,900 tasks | ✅ Supported |
| The benchmark is contamination-resistant | Commercial codebase split is proprietary and unseen | ✅ Supported by design |
| Current agents lack architectural reasoning capability | Inferred from failure patterns, not directly measured | ⚠️ Plausible but indirectly evidenced |

The benchmark design is strong: human-verified tasks, tiered difficulty, and contamination resistance through proprietary code are methodological improvements over prior work. The primary limitation is that the specific performance numbers are bound to the particular agents evaluated and will likely shift as new systems emerge. The qualitative finding, that enterprise code is substantially harder than open-source benchmark code, is more durable.

Open Questions

  • Repository-specific adaptation: Would agents that can be "onboarded" to a specific codebase (given access to documentation, architecture diagrams, and commit history) close the gap? The current evaluation assumes agents encounter each repository cold.
  • The role of implicit knowledge: How much of the enterprise performance gap comes from missing information (undocumented conventions, tribal knowledge) versus genuine reasoning limitations? If agents were given perfect documentation, how much would performance improve?
  • Compositional planning: Can current reasoning approaches (chain-of-thought, tree search) scale to the multi-step planning required for long-horizon engineering tasks, or is a fundamentally different planning architecture needed?
  • Evaluation stability: As AI agents improve, will the benchmark maintain its discriminative power, or will it need continuous updates with harder tasks, a treadmill problem familiar from other AI benchmarks?
  • Human-AI collaboration: The binary framing (agent succeeds or fails) may miss the most practical use case: agents that handle routine aspects of a task while flagging architectural decisions for human review. How would a collaborative evaluation change the results?

What This Means for Your Research

SWE-Bench Pro provides a useful corrective to the narrative that AI coding agents are approaching human-level software engineering. They are not, at least not for the kind of work that defines professional engineering practice. The benchmark quantifies what many practitioners have observed informally: AI agents are helpful for contained, well-specified tasks but struggle with the contextual reasoning, multi-file coordination, and implicit specification inference that characterize real engineering work.

For researchers building coding agents, the benchmark identifies specific capability gaps worth targeting. For organizations evaluating whether to deploy coding agents, it provides a more realistic baseline than public benchmark scores suggest.

References

[1] Deng, X., Da, J., Pan, E., He, Y.Y., Ide, C., Garg, K., ... & Kenstler, B. (2025). SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? arXiv:2509.16941.
