SWE-Bench Pro: Why AI Coding Agents Struggle with Real Enterprise Code
AI coding agents solve 43.6% of public benchmark tasks, but how do they fare on real enterprise codebases? SWE-Bench Pro reveals that performance drops steeply when agents face long-horizon, multi-file engineering tasks drawn from commercial repositories, exposing a significant gap between benchmark scores and practical capability.
By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.
A software engineer at a mid-size company opens a ticket: refactor the authentication module to support multi-tenant SSO, update the test suite, ensure backward compatibility with three legacy endpoints, and document the changes. The task touches twelve files across four directories and requires understanding an undocumented internal API. It takes the engineer two days. Could an AI coding agent do it?
The honest answer, according to Deng et al. (2025), is: probably not. While AI coding agents have made genuine progress on structured benchmark tasks, their performance degrades substantially when confronted with the kind of work that fills actual engineering backlogs: long-horizon tasks spanning multiple files, requiring contextual understanding of large codebases, and demanding the kind of architectural judgment that comes from familiarity with a system's history and constraints.
The Research Landscape
SWE-Bench Pro is a benchmark containing nearly 1,900 problems drawn from 41 repositories, organized into three tiers: public repositories (where contamination through training data is possible), held-out repositories (not publicly available during training), and commercial enterprise codebases (proprietary code that no model has seen). All tasks are human-verified and come with sufficient context, and the benchmark specifically targets long-horizon tasks: problems that may require hours to days for a professional software engineer to complete, often involving patches across multiple files and substantial code modifications.
The benchmark design addresses a known weakness of existing evaluations like the original SWE-Bench: most prior benchmarks test relatively contained, single-file modifications where the solution can be inferred from local context. Real software engineering rarely works this way. A bug fix in production code often requires understanding how a component interacts with distant parts of the system, respecting invariants that are implicit rather than documented, and making changes that are consistent with the codebase's architectural patterns.
The Performance Gap
The central finding is sobering for those who track AI coding benchmarks. On public repositories (the kind of code that appears in training data and existing benchmarks), current agents achieve resolution rates around 43.6%. This is a genuine capability: more than two in five of these structured engineering tasks can be addressed by current systems.
But performance drops significantly on commercial enterprise codebases. The authors report that agents struggle with the characteristics that define enterprise software: large interconnected codebases where changes propagate across module boundaries, domain-specific conventions that differ from open-source norms, and implicit requirements that are not captured in issue descriptions.
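To make the tiered comparison concrete, here is a minimal sketch of how a per-tier resolution rate could be tallied. The `TaskResult` record, the tier identifiers, and the toy outcomes are all hypothetical and chosen only for illustration; they are not the benchmark's actual schema or results.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    # The three tiers described above; the identifiers are illustrative.
    PUBLIC = "public"
    HELD_OUT = "held_out"
    COMMERCIAL = "commercial"


@dataclass(frozen=True)
class TaskResult:
    task_id: str    # hypothetical identifier
    tier: Tier
    resolved: bool  # True if the agent's patch was judged to solve the task


def resolve_rate_by_tier(results):
    """Return {tier: fraction of tasks resolved} for tiers that have results."""
    totals = defaultdict(int)
    solved = defaultdict(int)
    for r in results:
        totals[r.tier] += 1
        solved[r.tier] += int(r.resolved)
    return {tier: solved[tier] / totals[tier] for tier in totals}


# Toy outcomes, invented purely to exercise the function.
toy_results = [
    TaskResult("pub-1", Tier.PUBLIC, True),
    TaskResult("pub-2", Tier.PUBLIC, False),
    TaskResult("com-1", Tier.COMMERCIAL, False),
    TaskResult("com-2", Tier.COMMERCIAL, False),
]

for tier, rate in resolve_rate_by_tier(toy_results).items():
    print(f"{tier.value}: {rate:.0%}")
```

Reporting each tier separately, rather than one pooled number, is what lets a gap between public and commercial code show up at all.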
Failure Pattern Analysis
Deng et al. analyze the failure modes of AI agents on the benchmark and identify several recurring patterns:
- Context navigation failures: Agents fail to locate the relevant code in large repositories, spending their budget exploring wrong directories or fixating on superficially similar but incorrect files.
- Multi-file coordination failures: Even when agents correctly identify the change needed in one file, they fail to propagate related changes to dependent files, the kind of cross-cutting modification that human engineers handle through system-level understanding.
- Specification inference failures: Enterprise tasks often have implicit requirements ("don't break the billing integration" or "maintain backward compatibility with API v2") that are not stated in the task description but would be obvious to a developer familiar with the system.
- Long-horizon planning failures: Tasks requiring a sequence of coordinated changes (first refactor this, then update that, then add tests) expose the limited planning capability of current agents, which tend to attempt changes in isolation rather than as part of a coherent plan.
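One rough way to flag the multi-file coordination pattern is to compare the files an agent's patch touches with the files changed in a reference patch. The sketch below is a hypothetical diagnostic, not the analysis method used by Deng et al.; the function name and the example paths are assumptions for illustration.

```python
def file_coverage(reference_files, agent_files):
    """Compare files touched by an agent patch against a reference patch.

    Returns (recall, missed): recall is the share of reference files the
    agent also edited, and missed is the sorted list of files it skipped.
    """
    reference = set(reference_files)
    if not reference:
        raise ValueError("reference patch must touch at least one file")
    missed = sorted(reference - set(agent_files))
    recall = 1 - len(missed) / len(reference)
    return recall, missed


# Hypothetical paths: the agent fixed the core module but skipped dependents.
reference_patch = [
    "auth/sso.py",
    "auth/session.py",
    "api/v2/login.py",
    "tests/test_sso.py",
]
agent_patch = ["auth/sso.py", "tests/test_sso.py"]

recall, missed = file_coverage(reference_patch, agent_patch)
print(f"recall={recall:.2f}, missed={missed}")
```

Low recall only suggests a coordination problem: a valid fix can touch different files than the reference, so a check like this is a triage signal rather than a verdict.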
Critical Analysis: Claims and Evidence
| Claim | Evidence | Verdict |
|---|---|---|
| Agents achieve ~43.6% on public benchmark tasks | Evaluated across multiple leading agents on public repository split | ✅ Supported |
| Performance drops significantly on enterprise codebases | Comparison across public, held-out, and commercial tiers | ✅ Supported |
| Long-horizon multi-file tasks are the primary difficulty | Failure pattern analysis across ~1,900 tasks | ✅ Supported |
| The benchmark is contamination-resistant | Commercial codebase split is proprietary and unseen | ✅ Supported by design |
| Current agents lack architectural reasoning capability | Inferred from failure patterns, not directly measured | ⚠️ Plausible but indirectly evidenced |
The benchmark design is strong: human-verified tasks, tiered difficulty, and contamination resistance through proprietary code are methodological improvements over prior work. The primary limitation is that the specific performance numbers are bound to the particular agents evaluated and will likely shift as new systems emerge. The qualitative finding, that enterprise code is substantially harder than open-source benchmark code, is more durable.
Open Questions
- Repository-specific adaptation: Would agents that can be "onboarded" to a specific codebase (given access to documentation, architecture diagrams, and commit history) close the gap? The current evaluation assumes agents encounter each repository cold.
- The role of implicit knowledge: How much of the enterprise performance gap comes from missing information (undocumented conventions, tribal knowledge) versus genuine reasoning limitations? If agents were given perfect documentation, how much would performance improve?
- Compositional planning: Can current reasoning approaches (chain-of-thought, tree search) scale to the multi-step planning required for long-horizon engineering tasks, or is a fundamentally different planning architecture needed?
- Evaluation stability: As AI agents improve, will the benchmark maintain its discriminative power, or will it need continuous updates with harder tasks, a treadmill problem familiar from other AI benchmarks?
- Human-AI collaboration: The binary framing (agent succeeds or fails) may miss the most practical use case: agents that handle routine aspects of a task while flagging architectural decisions for human review. How would a collaborative evaluation change the results?
What This Means for Your Research
SWE-Bench Pro provides a useful corrective to the narrative that AI coding agents are approaching human-level software engineering. They are not, at least not for the kind of work that defines professional engineering practice. The benchmark quantifies what many practitioners have observed informally: AI agents are helpful for contained, well-specified tasks but struggle with the contextual reasoning, multi-file coordination, and implicit specification inference that characterize real engineering work.
For researchers building coding agents, the benchmark identifies specific capability gaps worth targeting. For organizations evaluating whether to deploy coding agents, it provides a more realistic baseline than public benchmark scores suggest.
Explore related work through ORAA ResearchBrain.
References (1)
[1] Deng, X., Da, J., Pan, E., He, Y.Y., Ide, C., Garg, K., ... & Kenstler, B. (2025). SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? arXiv:2509.16941.