SWE-Bench Pro: Why AI Coding Agents Struggle with Real Enterprise Code

AI coding agents resolve about 43.6% of tasks drawn from public repositories, but how do they fare on real enterprise codebases? SWE-Bench Pro reports that performance drops steeply when agents face long-horizon, multi-file engineering tasks from commercial repositories, exposing a gap between benchmark scores and practical capability.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

A software engineer at a mid-size company opens a ticket: refactor the authentication module to support multi-tenant SSO, update the test suite, ensure backward compatibility with three legacy endpoints, and document the changes. The task touches twelve files across four directories and requires understanding an undocumented internal API. It takes the engineer two days. Could an AI coding agent do it?

The honest answer, according to Deng et al. (2025), is: probably not. While AI coding agents have made genuine progress on structured benchmark tasks, their performance degrades substantially when confronted with the kind of work that fills actual engineering backlogs—long-horizon tasks spanning multiple files, requiring contextual understanding of large codebases, and demanding the kind of architectural judgment that comes from familiarity with a system's history and constraints.

The Research Landscape

SWE-Bench Pro is a benchmark containing nearly 1,900 problems drawn from 41 repositories, organized into three tiers: public repositories (where contamination through training data is possible), held-out repositories (not publicly available during training), and commercial enterprise codebases (proprietary code that no model has seen). All tasks are human-verified with sufficient context provided, and the benchmark specifically targets long-horizon tasks—problems that may require hours to days for a professional software engineer to complete, often involving patches across multiple files and substantial code modifications.

The benchmark design addresses a known weakness of existing evaluations like the original SWE-Bench: most prior benchmarks test relatively contained, single-file modifications where the solution can be inferred from local context. Real software engineering rarely works this way. A bug fix in production code often requires understanding how a component interacts with distant parts of the system, respecting invariants that are implicit rather than documented, and making changes that are consistent with the codebase's architectural patterns.

The Performance Gap

The central finding is sobering for those who track AI coding benchmarks. On public repositories, the kind of code that appears in training data and existing benchmarks, current agents achieve resolution rates around 43.6%. This is a genuine capability: current systems resolve a little over two in five tasks on that split.

But performance drops significantly on commercial enterprise codebases. The authors report that agents struggle with the characteristics that define enterprise software: large interconnected codebases where changes propagate across module boundaries, domain-specific conventions that differ from open-source norms, and implicit requirements that are not captured in issue descriptions.
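To make the comparison concrete, here is a minimal sketch of how a per-split resolution rate and the resulting gap could be computed. The `TaskResult` record, the split names, and the demo outcomes are all assumptions for illustration. They are not the paper's evaluation harness, and the demo data are made-up placeholders rather than reported results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical record of one agent attempt (not the paper's schema)."""
    task_id: str
    split: str      # assumed labels: "public", "held_out", "commercial"
    resolved: bool  # True if the agent's patch passed the task's tests


def resolve_rate_by_split(results):
    """Return {split: fraction of tasks resolved} for a list of TaskResult."""
    totals = defaultdict(int)
    solved = defaultdict(int)
    for r in results:
        totals[r.split] += 1
        solved[r.split] += int(r.resolved)
    return {split: solved[split] / totals[split] for split in totals}


# Made-up placeholder outcomes, NOT figures from the paper.
demo = [
    TaskResult("t1", "public", True),
    TaskResult("t2", "public", False),
    TaskResult("t3", "commercial", False),
    TaskResult("t4", "commercial", False),
    TaskResult("t5", "commercial", True),
    TaskResult("t6", "commercial", False),
]

rates = resolve_rate_by_split(demo)
gap = rates["public"] - rates["commercial"]
print(rates, gap)
```

Reporting the rate per split, rather than one pooled number, is what lets a public-versus-commercial gap show up at all. A pooled score would be dominated by whichever split has the most tasks.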

Failure Pattern Analysis

Deng et al. analyze the failure modes of AI agents on the benchmark and identify several recurring patterns:

Context navigation failures: Agents fail to locate the relevant code in large repositories, spending their budget exploring wrong directories or fixating on superficially similar but incorrect files.
Multi-file coordination failures: Even when agents correctly identify the change needed in one file, they fail to propagate related changes to dependent files—the kind of cross-cutting modification that human engineers handle through system-level understanding.
Specification inference failures: Enterprise tasks often have implicit requirements ("don't break the billing integration" or "maintain backward compatibility with API v2") that are not stated in the task description but would be obvious to a developer familiar with the system. Agents tend to miss such constraints.
Long-horizon planning failures: Tasks requiring a sequence of coordinated changes—first refactor this, then update that, then add tests—expose the limited planning capability of current agents, which tend to attempt changes in isolation rather than as part of a coherent plan.
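The multi-file coordination pattern above can be illustrated with a small check. This is a sketch under stated assumptions, not a method from the paper: given a hypothetical map from each file to the files that depend on it, it lists dependents that a patch left untouched. The file paths and the `dependents` map are invented for the example.

```python
def untouched_dependents(changed_files, dependents):
    """List files that depend on a changed file but were not changed themselves.

    `dependents` maps a file path to the paths that import or call into it.
    Both arguments are hypothetical inputs; building a real dependency map
    is outside the scope of this sketch.
    """
    changed = set(changed_files)
    missing = set()
    for path in changed:
        for dep in dependents.get(path, ()):
            if dep not in changed:
                missing.add(dep)
    return sorted(missing)


# Hypothetical repository layout, for illustration only.
dependents = {
    "auth/session.py": ["api/login.py", "billing/invoice.py"],
    "api/login.py": ["tests/test_login.py"],
}
patch = ["auth/session.py", "api/login.py"]

flagged = untouched_dependents(patch, dependents)
print(flagged)
```

Not every flagged file needs a change, so the output is best read as a review prompt: these are the places where a change in one file may have gone unpropagated.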

Critical Analysis: Claims and Evidence

Claim	Evidence	Verdict
Agents achieve ~43.6% on public benchmark tasks	Evaluated across multiple leading agents on public repository split	✅ Supported
Performance drops significantly on enterprise codebases	Comparison across public, held-out, and commercial tiers	✅ Supported
Long-horizon multi-file tasks are the primary difficulty	Failure pattern analysis across ~1,900 tasks	✅ Supported
The benchmark is contamination-resistant	Commercial codebase split is proprietary and unseen	✅ Supported by design
Current agents lack architectural reasoning capability	Inferred from failure patterns, not directly measured	⚠️ Plausible but indirectly evidenced

The benchmark design is strong: human-verified tasks, tiered difficulty, and contamination resistance through proprietary code are methodological improvements over prior work. The primary limitation is that the specific performance numbers are bound to the particular agents evaluated and will likely shift as new systems emerge. The qualitative finding—that enterprise code is substantially harder than open-source benchmark code—is more durable.

Open Questions

Repository-specific adaptation: Would agents that can be "onboarded" to a specific codebase—given access to documentation, architecture diagrams, and commit history—close the gap? The current evaluation assumes agents encounter each repository cold.

The role of implicit knowledge: How much of the enterprise performance gap comes from missing information (undocumented conventions, tribal knowledge) versus genuine reasoning limitations? If agents were given perfect documentation, how much would performance improve?

Compositional planning: Can current reasoning approaches (chain-of-thought, tree search) scale to the multi-step planning required for long-horizon engineering tasks, or is a fundamentally different planning architecture needed?

Evaluation stability: As AI agents improve, will the benchmark maintain its discriminative power, or will it need continuous updates with harder tasks—a treadmill problem familiar from other AI benchmarks?

Human-AI collaboration: The binary framing (agent succeeds or fails) may miss the most practical use case: agents that handle routine aspects of a task while flagging architectural decisions for human review. How would a collaborative evaluation change the results?
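One way to picture the collaborative framing raised above is a three-way outcome instead of pass/fail. The sketch below is a hypothetical scoring scheme, not one proposed in the paper: the `Outcome` categories and the demo run are assumptions, chosen only to show how an "escalated to a human" category changes what gets reported.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    """Hypothetical outcome categories for a collaborative evaluation."""
    RESOLVED = "resolved"    # agent's patch passed without human help
    ESCALATED = "escalated"  # agent flagged a decision for human review
    FAILED = "failed"        # agent submitted a patch that did not pass


def outcome_shares(outcomes):
    """Return the fraction of tasks in each Outcome category."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("no outcomes to score")
    counts = Counter(outcomes)
    return {o: counts[o] / total for o in Outcome}


# Made-up placeholder run, NOT data from the paper.
demo_run = [
    Outcome.RESOLVED,
    Outcome.ESCALATED,
    Outcome.ESCALATED,
    Outcome.FAILED,
]

shares = outcome_shares(demo_run)
print(shares)
```

Under a binary metric this placeholder run scores 25%. The three-way view additionally shows that half the tasks ended in an explicit hand-off rather than a silent failure, which is the distinction the open question is about.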

What This Means for Your Research

SWE-Bench Pro provides a useful corrective to the narrative that AI coding agents are approaching human-level software engineering. They are not—at least not for the kind of work that defines professional engineering practice. The benchmark quantifies what many practitioners have observed informally: AI agents are helpful for contained, well-specified tasks but struggle with the contextual reasoning, multi-file coordination, and implicit specification inference that characterize real engineering work.

For researchers building coding agents, the benchmark identifies specific capability gaps worth targeting. For organizations evaluating whether to deploy coding agents, it provides a more realistic baseline than public benchmark scores suggest.

References

[1] Deng, X., Da, J., Pan, E., He, Y.Y., Ide, C., Garg, K., ... & Kenstler, B. (2025). SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? arXiv:2509.16941.
