
Proving Code Correct: Where Formal Verification Meets AI-Generated Software

AI can write code faster than humans, but can it prove that code is correct? PatchPilot combines AI patching agents with formal verification, while FVAPPS benchmarks the emerging capability of AI to generate both code and correctness proofs.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Software bugs impose enormous costs on the global economy: estimates range widely but consistently reach into the hundreds of billions of dollars annually (CISQ, 2022). Safety-critical systems (medical devices, autonomous vehicles, aviation control, nuclear plant monitors) demand correctness guarantees that testing alone cannot provide. Testing shows the presence of bugs, not their absence. Formal verification, which mathematically proves that code satisfies its specification, provides the absence guarantee, but at a cost in effort and expertise that has historically limited its application to only the most critical systems.

The convergence of AI code generation with formal verification creates an intriguing possibility: AI systems that not only write code but prove that the code is correct. This is not a distant aspiration. PatchPilot (Li et al.) already integrates formal verification into an AI software engineering agent, and FVAPPS (Dougherty & Mehta) provides the benchmark that measures progress toward this goal.

PatchPilot: Verified Patching at Scale

PatchPilot is a multi-agent system designed for software patching (fixing bugs in existing codebases) with early integration of formal verification. The system operates as a pipeline:

  • Bug localization: Analyze the bug report and codebase to identify the likely location of the defect
  • Patch generation: Generate candidate fixes using LLM-based code generation
  • Testing: Validate patches against existing test suites
  • Formal verification (early-stage): For critical code paths, attempt to prove that the patch satisfies formal correctness properties

The formal verification component is admittedly early-stage: it works for relatively simple correctness properties on well-typed codebases. But the architectural integration is the important contribution: by making verification a standard step in the patching pipeline, PatchPilot establishes a workflow that will accommodate increasingly powerful verification tools as they mature.
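
To make the pipeline concrete, here is a minimal Python sketch of a localize-generate-test-verify loop. This is not PatchPilot's actual implementation; every function name below is a hypothetical placeholder for the corresponding stage, and the verification step is allowed to return no result, reflecting how limited current tooling is.

```python
# Minimal sketch of a localize -> generate -> test -> verify patching loop.
# NOT PatchPilot's code: all names below are hypothetical placeholders.

from dataclasses import dataclass


@dataclass
class Patch:
    file: str
    diff: str


def localize(bug_report: str, repo: str) -> list[str]:
    """Return file paths likely to contain the defect (e.g., retrieval + LLM ranking)."""
    raise NotImplementedError


def generate_patches(bug_report: str, files: list[str]) -> list[Patch]:
    """Ask an LLM for candidate diffs against the localized files."""
    raise NotImplementedError


def run_tests(repo: str, patch: Patch) -> bool:
    """Apply the patch in a sandbox and run the existing test suite."""
    raise NotImplementedError


def verify(repo: str, patch: Patch) -> bool | None:
    """Try to prove critical properties of the patched code.

    Returns True/False when a verifier reaches a verdict, or None when the
    property is out of scope for current tools (the common case today).
    """
    raise NotImplementedError


def fix_bug(bug_report: str, repo: str) -> Patch | None:
    files = localize(bug_report, repo)
    for patch in generate_patches(bug_report, files):
        if not run_tests(repo, patch):
            continue                 # reject patches that break existing tests
        if verify(repo, patch) is False:
            continue                 # reject patches that provably violate a property
        return patch                 # tested, and formally verified where possible
    return None
```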

The cost efficiency is notable. PatchPilot achieves competitive results on the SWE-bench benchmark (the standard evaluation for AI software engineering agents) while using fewer LLM API calls than comparable systems, a practical consideration given that API costs accumulate rapidly for complex patching tasks.

FVAPPS: The Correctness Benchmark

Dougherty & Mehta's FVAPPS benchmark provides programming problems where the task is not just to write code but to prove it correct. Each problem comes with a specification in Lean 4, and the AI system must produce both a program and a formal proof that the program satisfies the specification.

This is a substantially harder challenge than standard code generation benchmarks (HumanEval, MBPP), which evaluate only whether code produces correct output on test cases. FVAPPS requires the model to reason about all possible inputs, proving universal correctness rather than testing specific cases. Current LLM performance on FVAPPS is modest, establishing a challenging benchmark that will drive progress for years.
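
To illustrate the shape of such a task, here is a toy Lean 4 example (not an actual FVAPPS problem): the theorem statement plays the role of the specification, and the model must supply both the definition and the proof. Real benchmark specifications are richer and the proofs substantially harder.

```lean
-- Toy illustration of the "program + proof" task shape (not an FVAPPS problem).

-- Candidate implementation: maximum of two natural numbers.
def myMax (a b : Nat) : Nat :=
  if a ≤ b then b else a

-- Specification: the result is at least as large as the first argument.
-- The generating model must also produce the proof script below.
theorem myMax_ge_left (a b : Nat) : a ≤ myMax a b := by
  unfold myMax
  split
  · assumption           -- case a ≤ b : goal is a ≤ b
  · exact Nat.le_refl a  -- case ¬ a ≤ b : goal is a ≤ a
```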

Automotive Safety: Where the Stakes Are Highest

Pan et al. focus on a domain where rigorous verification is not optional: automotive software. Modern vehicles contain enormous amounts of software, and software failures can cause crashes, injuries, and deaths. Standards such as ISO 26262 mandate structured development processes, including model-based methods such as AUTOSAR, SysML, and model-based design, for the highest safety integrity levels.

Their approach combines generative AI (for rapid code production) with model-based methods (for structured verification and design consistency) in a complementary workflow:

  • AI generates code from natural language specifications, producing candidates quickly
  • Model-based methods validate the generated code against architectural models and safety properties, catching structural errors that testing might miss
  • AI refines code based on model-checking feedback, creating a closed loop between generation and verification

The synergy is bidirectional: AI makes model-based development more accessible by automating specification and boilerplate generation, while model-based constraints make AI-generated code more architecturally coherent and standards-compliant.
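
As a rough sketch of what such a closed loop could look like in code (all names here are illustrative placeholders, not an actual toolchain API described by Pan et al.):

```python
# Schematic generate -> model-check -> refine loop; names are illustrative only.

def generate_code(spec_text: str, feedback: str | None = None) -> str:
    """LLM call: draft candidate code from a natural-language specification,
    optionally conditioned on model-checker feedback from a previous round."""
    raise NotImplementedError


def check_against_models(code: str, architecture_model: str) -> tuple[bool, str]:
    """Model-based validation: check the candidate against architectural models
    and safety properties; return (ok, diagnostic)."""
    raise NotImplementedError


def closed_loop(spec_text: str, architecture_model: str, max_rounds: int = 5) -> str | None:
    feedback = None
    for _ in range(max_rounds):
        code = generate_code(spec_text, feedback)
        ok, diagnostic = check_against_models(code, architecture_model)
        if ok:
            return code           # candidate is consistent with the models
        feedback = diagnostic     # counterexamples flow back into generation
    return None                   # no compliant candidate: escalate to an engineer
```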

Claims and Evidence

| Claim | Evidence | Verdict |
| --- | --- | --- |
| AI agents can integrate formal verification into software patching | PatchPilot demonstrates pipeline integration | ✅ Demonstrated (early stage) |
| Current LLMs can generate both code and correctness proofs | FVAPPS shows limited but non-zero capability | ⚠️ Emerging capability |
| GenAI + model-based methods synergy improves automotive software safety | Pan et al. describe workflow; limited deployment evidence | ⚠️ Architecturally sound |
| Formal verification scales to large AI-generated codebases | Current tools handle small programs; scaling remains challenging | ⚠️ Significant gap |
| Testing is sufficient for safety-critical software | Formal methods community consensus: testing alone is insufficient | ❌ Not sufficient |

Open Questions

  • Specification completeness: Formal verification proves code correct with respect to a specification. But who writes the specification, and how do we verify that the specification captures the actual requirements? The specification gap is often larger than the implementation gap.
  • Verification scalability: Current formal verification tools handle programs of hundreds to thousands of lines. Real software systems contain millions of lines. How do we scale verification to production codebases?
  • Partial verification: If full verification is infeasible, can partial verificationโ€”proving critical properties while leaving non-critical behavior unverifiedโ€”provide meaningful safety improvements at lower cost?
  • Verification of neural networks: AI-generated code may call neural network components (ML models, classifiers). Can we formally verify properties of code that includes non-deterministic neural components?
  • Developer adoption: Formal verification requires expertise that most software developers do not have. Can AI-mediated verification lower the barrier to adoption, or does it merely shift the expertise requirement from verification to AI tool configuration?

What This Means for Your Research

For software engineering researchers, the PatchPilot + FVAPPS combination establishes both a practical tool and a benchmark for the emerging field of AI-assisted formal verification. The benchmark is challenging enough to drive progress for years, a valuable resource for anyone working on verified code generation.

For safety-critical system developers, the automotive application (Pan et al.) provides a concrete integration pattern: use AI for rapid prototyping and model-based methods for structured verification, creating a workflow that is both fast (AI generation) and architecturally disciplined (model compliance).

For the AI code generation community, FVAPPS represents a qualitative step beyond current benchmarks. Generating code that works on test cases is a necessary but insufficient bar. Generating code that is provably correct is the standard that safety-critical applications demand, and it is the standard toward which the field should be moving.

References (3)

[1] Li, H., Tang, Y., Wang, S., et al. (2025). PatchPilot: A Cost-Efficient Software Engineering Agent with Early Attempts on Formal Verification. Semantic Scholar.
[2] Dougherty, Q., & Mehta, R. (2025). Proving the Coding Interview: A Benchmark for Formally Verified Code Generation. IEEE LLM4Code.
[3] Pan, F., Song, Y., Wen, L., et al. (2025). Automating Automotive Software Development: A Synergy of Generative AI and Model-Based Methods. arXiv:2505.02500.
