Trend Analysis | Computer Systems | Mixed Methods

DARPA TRACTOR and the C-to-Rust Translation Challenge: Can We Automate Memory Safety?

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The U.S. Department of Defense has a problem measured in billions of lines of code. Critical infrastructure, from weapons systems to communication networks, runs on C and C++, languages that provide performance and hardware control at the cost of memory safety. Memory-safety bugs such as buffer overflows, use-after-free errors, and null pointer dereferences account for roughly 70% of the serious security vulnerabilities in large C and C++ codebases, according to estimates published by Microsoft and Google. DARPA's Translating All C to Rust (TRACTOR) program, announced in 2024, represents the most ambitious attempt to address this problem: automatically translating legacy C codebases into memory-safe Rust at scale. But can automated translation deliver safe, idiomatic Rust? The research literature suggests the answer is: partially, with significant caveats.

The Research Landscape

The Scale of the Problem

Hong and Ryu (2025) frame the core challenge precisely. Legacy C codebases have accumulated decades of implicit assumptions about memory layout, pointer arithmetic, and undefined behavior. Rust's ownership model, in which every value has a single owner, borrowing is checked at compile time, and every reference has a lifetime, is fundamentally at odds with C's permissive memory model. A mechanical translation that preserves C semantics produces Rust code wrapped in unsafe blocks, which forfeits Rust's safety guarantees while adding Rust's syntactic overhead. The goal, therefore, is not just translation but transformation: converting C memory patterns into idiomatic Rust ownership patterns.
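To make the contrast concrete, the sketch below writes the same summation routine twice: once as a C-shaped translation that keeps the raw pointer and length, and once around a borrowed slice. The function names are hypothetical and do not come from Hong and Ryu; this is only an illustration of the kind of transformation described above.

```rust
/// C-shaped translation, mirroring `int sum(const int *p, size_t n)`.
/// The caller must guarantee that `p` points to `n` readable `i32` values;
/// the compiler cannot check that promise.
pub unsafe fn sum_raw(p: *const i32, n: usize) -> i32 {
    let mut total: i32 = 0;
    for i in 0..n {
        // SAFETY: relies entirely on the caller's promise about `p` and `n`.
        total = total.wrapping_add(unsafe { *p.add(i) });
    }
    total
}

/// Idiomatic version: the slice carries its own length and a borrow that
/// the compiler checks, so no `unsafe` is needed.
pub fn sum_slice(values: &[i32]) -> i32 {
    values.iter().fold(0i32, |acc, v| acc.wrapping_add(*v))
}
```

Both functions compute the same result on valid input. The difference is that only the second one lets the compiler rule out out-of-bounds reads.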

LLM-Assisted Translation: Promise and Limitations

The most active research direction combines large language models with static analysis to automate C-to-Rust translation. The results are instructive about both the capabilities and limitations of LLM-based code transformation.

Shetty et al. (2024) present Syzygy, a dual code-test translation approach that uses LLMs and dynamic analysis to translate C to safe Rust. Their key insight is that translating code and translating tests simultaneously provides a verification mechanism: if the translated Rust code passes the translated tests, confidence in semantic preservation increases. Syzygy achieves safe Rust output for a majority of the functions in their benchmark, but the remainder requires unsafe blocks or manual intervention—a ratio that illustrates the current state of the art.
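The verification idea can be sketched roughly as follows. Assume that dynamic analysis of the original C program recorded input/output pairs for a function; a translated test then replays those pairs against the Rust candidate. The `Observation` type, the `clamp_add` function, and the `replay` helper are all hypothetical and are not taken from the Syzygy paper.

```rust
/// One input/output pair, assumed to have been recorded while running the
/// original C function under instrumentation (hypothetical data shape).
pub struct Observation {
    pub a: u8,
    pub b: u8,
    pub expected: u8,
}

/// Candidate translation of a C helper that adds two bytes and saturates
/// at 255 instead of wrapping.
pub fn clamp_add(a: u8, b: u8) -> u8 {
    a.saturating_add(b)
}

/// Translated test: returns the indices of the observations the candidate
/// fails to reproduce. An empty result raises confidence in the
/// translation but does not prove equivalence.
pub fn replay(observations: &[Observation], candidate: fn(u8, u8) -> u8) -> Vec<usize> {
    observations
        .iter()
        .enumerate()
        .filter(|(_, o)| candidate(o.a, o.b) != o.expected)
        .map(|(i, _)| i)
        .collect()
}
```

A candidate that wraps on overflow instead of saturating would be flagged by any recorded observation whose sum exceeds 255, which is the kind of semantic drift this check is meant to surface.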

Cai et al. (2025) introduce RustMap, a project-scale C-to-Rust migration tool that combines program analysis with LLM-based translation. RustMap addresses a limitation of function-level translators: real C projects have complex inter-procedural dependencies, global state, and build system configurations that function-level translation ignores. Their approach first analyzes the project's dependency graph, then translates functions in topological order so that each translated function can reference previously translated dependencies. The method handles projects up to tens of thousands of lines but struggles with deeply intertwined global state.
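The ordering step can be illustrated with a small sketch. Assuming the call graph is available as a map from each function to its callees, a depth-first post-order yields a sequence in which every callee is translated before its callers. The `translation_order` and `visit` names are hypothetical; RustMap's actual analysis is more involved.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Depth-first visit that appends `name` only after all of its callees.
/// Returns `false` if a cycle (for example mutual recursion) is found.
fn visit<'a>(
    name: &'a str,
    calls: &BTreeMap<&'a str, Vec<&'a str>>,
    done: &mut BTreeSet<&'a str>,
    in_progress: &mut BTreeSet<&'a str>,
    order: &mut Vec<String>,
) -> bool {
    if done.contains(name) {
        return true;
    }
    if !in_progress.insert(name) {
        return false; // already on the current path: cycle detected
    }
    if let Some(callees) = calls.get(name) {
        for &callee in callees {
            if !visit(callee, calls, done, in_progress, order) {
                return false;
            }
        }
    }
    in_progress.remove(name);
    done.insert(name);
    order.push(name.to_string());
    true
}

/// Returns function names ordered so that dependencies come first, or
/// `None` when the call graph contains a cycle that needs separate handling.
pub fn translation_order<'a>(calls: &BTreeMap<&'a str, Vec<&'a str>>) -> Option<Vec<String>> {
    let mut done = BTreeSet::new();
    let mut in_progress = BTreeSet::new();
    let mut order = Vec::new();
    for &name in calls.keys() {
        if !visit(name, calls, &mut done, &mut in_progress, &mut order) {
            return None;
        }
    }
    Some(order)
}
```

A `BTreeMap` is used here only to keep the output deterministic. Cycles are reported rather than resolved, since how to translate mutually recursive functions is a design decision this sketch does not make.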

Khatry et al. (2025) contribute CRUST-Bench, a benchmark for evaluating C-to-safe-Rust transpilation. This is significant infrastructure work: without standardized benchmarks, it is impossible to compare different translation approaches rigorously. CRUST-Bench includes 100 C programs with test suites, and their evaluation of frontier LLMs (GPT-4, Claude) shows that even the strongest models achieve only modest success rates on producing safe Rust that passes all tests—substantially lower than the function-level results reported by Syzygy, suggesting that benchmark design significantly affects reported performance.

Shiraishi et al. (2024) present SmartC2Rust, an iterative feedback-driven approach. Rather than translating in a single pass, SmartC2Rust generates an initial translation, compiles it, feeds compiler errors back to the LLM, and iterates. This "compile-and-fix" loop meaningfully improves success rates over single-pass translation, suggesting that the LLM handles Rust's type system better when given concrete error feedback.
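The control flow of such a loop is simple enough to sketch. In the version below, `translate` stands in for an LLM call and `compile` for an invocation of the Rust compiler; both are hypothetical hooks passed in as closures, and the `Outcome` type is likewise an assumption made for illustration, not SmartC2Rust's interface.

```rust
/// Result of the loop: the accepted source, or the last diagnostics once
/// the attempt budget is exhausted.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Compiled { source: String, attempts: usize },
    GaveUp { last_errors: String },
}

/// Generic compile-and-fix loop. `translate` receives the C source plus the
/// previous compiler diagnostics (empty on the first attempt). `compile`
/// returns `Err(diagnostics)` when the candidate does not build.
pub fn compile_and_fix<T, C>(
    c_source: &str,
    max_attempts: usize,
    mut translate: T,
    mut compile: C,
) -> Outcome
where
    T: FnMut(&str, &str) -> String,
    C: FnMut(&str) -> Result<(), String>,
{
    let mut feedback = String::new();
    for attempt in 1..=max_attempts {
        let candidate = translate(c_source, &feedback);
        match compile(&candidate) {
            Ok(()) => {
                return Outcome::Compiled { source: candidate, attempts: attempt };
            }
            // Keep the diagnostics so the next attempt can react to them.
            Err(diagnostics) => feedback = diagnostics,
        }
    }
    Outcome::GaveUp { last_errors: feedback }
}
```

Note that a successful compile only shows the candidate is well-typed Rust. Whether it behaves like the original C still has to be checked separately.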

Luo et al. (2025) propose integrating rule-based static analysis with LLM-based semantic understanding. Pure rule-based approaches have limited coverage (they handle common patterns but miss complex cases), while pure LLM approaches lack reliability (they sometimes produce syntactically correct but semantically wrong code). Their hybrid approach achieves higher coverage than either method alone.
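A toy version of the hybrid idea, restricted to type names, might look like the sketch below: a small rule table handles the common cases deterministically, and anything it does not cover is handed to a fallback that stands in for the LLM. The rule set, the `translate_type` function, and the `Source` tag are hypothetical and much simpler than what Luo et al. describe; mapping `int` to `i32` also assumes a platform with a 32-bit `int`.

```rust
/// Records where a translation came from, so that fallback output can be
/// routed to extra review.
#[derive(Debug, PartialEq)]
pub enum Source {
    Rule,
    Fallback,
}

/// Rule-based part: a deliberately incomplete table of C scalar types and
/// Rust counterparts (hypothetical rules; assumes a 32-bit `int`).
fn rule_for(c_type: &str) -> Option<&'static str> {
    match c_type {
        "int" => Some("i32"),
        "unsigned char" => Some("u8"),
        "size_t" => Some("usize"),
        "double" => Some("f64"),
        _ => None,
    }
}

/// Tries the rule table first and calls `fallback` (a stand-in for an LLM
/// query) only when no rule matches.
pub fn translate_type<F>(c_type: &str, fallback: F) -> (String, Source)
where
    F: FnOnce(&str) -> String,
{
    match rule_for(c_type.trim()) {
        Some(rust) => (rust.to_string(), Source::Rule),
        None => (fallback(c_type), Source::Fallback),
    }
}
```

Tagging each result with its origin is one possible design choice: it keeps the reliable rule output separate from the less predictable fallback output.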

The Safety Verification Problem

Translation is only half the challenge. The other half is verifying that the translated code preserves the semantics of the original while actually achieving memory safety.

Sirlanci et al. (2025) address this with C2RUST-BENCH, a minimized dataset designed specifically for evaluating semantic equivalence between C originals and Rust translations. Their benchmark highlights a subtle problem: some C programs rely on undefined behavior that happens to produce consistent results on specific platforms. Translating such programs to Rust, which has defined behavior for the same operations, can change program semantics in ways that are difficult to detect through testing alone.
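Signed integer overflow is a familiar instance of this problem: `a + b` on `int` is undefined in C when it overflows, while Rust forces the translator to pick a defined behavior. The sketch below lists three such choices using standard-library methods; the wrapper function names are hypothetical, and which choice matches the original program's intent cannot be read off the C source alone.

```rust
/// Two's-complement wraparound, which is what many C builds happen to do.
pub fn add_wrapping(a: i32, b: i32) -> i32 {
    a.wrapping_add(b)
}

/// Surfaces overflow to the caller as `None` instead of hiding it.
pub fn add_checked(a: i32, b: i32) -> Option<i32> {
    a.checked_add(b)
}

/// Clamps the result to the representable range.
pub fn add_saturating(a: i32, b: i32) -> i32 {
    a.saturating_add(b)
}
```

All three agree on inputs that do not overflow, so a test suite that never exercises the overflow case cannot distinguish between them.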

Critical Analysis: Claims and Evidence

Claim	Evidence	Verdict
LLMs can translate individual C functions to safe Rust	Shetty et al. Syzygy — high success rate	Partially supported — function-level translation feasible but not complete
Project-scale translation is achievable	Cai et al. RustMap	Partially supported — works for moderate-sized projects, struggles with complex global state
Frontier LLMs achieve 15-25% on rigorous benchmarks	Khatry et al. CRUST-Bench	Supported — and the gap between easy benchmarks and rigorous ones is large
Iterative compilation feedback improves translation	Shiraishi et al. SmartC2Rust	Supported — 15-20% improvement over single-pass
Automated translation can fully replace manual migration	No current evidence	Not supported — all approaches require human review for safety-critical code

Open Questions and Future Directions

The unsafe residual. Even the best automated tools produce a significant share of functions requiring unsafe Rust or manual intervention. Can this residual be reduced to acceptable levels for safety-critical systems, or will automated translation always require human oversight?

Undefined behavior preservation. C programs that rely on undefined behavior present a fundamental translation challenge. Should the translated Rust preserve the observed behavior (platform-specific) or reject such patterns (losing functionality)?

DARPA TRACTOR at scale. The academic results translate programs of thousands to tens of thousands of lines. DARPA's target—defense infrastructure codebases—involves millions of lines with decades of accumulated complexity. The scaling gap between research benchmarks and deployment targets remains vast.

Incremental adoption. Rather than wholesale translation, a practical path may involve translating security-critical components to Rust while maintaining C for performance-critical code. Rust's FFI (Foreign Function Interface) supports this, but the boundary between safe Rust and unsafe C becomes a new attack surface.
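The boundary concern can be illustrated with a small sketch of a Rust function that uses the C calling convention. Everything arriving from the C side is a raw pointer, so the function has to validate what it can and trust the caller for the rest. The `byte_sum` name and its status-code convention are hypothetical. A real export would also need a stable symbol name via the `no_mangle` attribute, which is omitted here.

```rust
/// C-ABI entry point. Returns 0 on success and -1 if either pointer is null.
///
/// # Safety
/// `data` must point to `len` readable bytes and `out` to a writable `u32`.
/// Rust cannot verify either promise; only the null checks are enforced.
pub unsafe extern "C" fn byte_sum(data: *const u8, len: usize, out: *mut u32) -> i32 {
    if data.is_null() || out.is_null() {
        return -1;
    }
    // SAFETY: non-null was checked above; validity and length are the
    // caller's responsibility under the contract documented above.
    let bytes = unsafe { std::slice::from_raw_parts(data, len) };
    let total = byte_sum_safe(bytes);
    // SAFETY: `out` is non-null and assumed writable per the contract.
    unsafe { *out = total };
    0
}

/// Safe core: past the boundary, the logic works on a checked slice.
pub fn byte_sum_safe(bytes: &[u8]) -> u32 {
    bytes.iter().fold(0u32, |acc, b| acc.wrapping_add(u32::from(*b)))
}
```

Keeping the `unsafe` part as a thin wrapper around a safe core limits how much code depends on promises the compiler cannot check, though a wrong `len` from the C side would still be undefined behavior.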

Verification guarantees. Testing can reveal specific bugs but cannot prove their absence, let alone semantic equivalence. Formal verification of translated code remains computationally expensive and requires specification effort that, for small programs, may exceed the cost of manual translation.

What This Means for Systems Engineers

The TRACTOR vision of automated, correct, safe translation of legacy C to Rust remains aspirational. Current tools can accelerate the process, particularly for well-structured code with good test coverage, but they cannot replace human judgment for safety-critical translations. The practical recommendation is to treat automated translation as a starting point that substantially reduces manual effort rather than as a complete solution.

References (7)

[1] Hong, J. & Ryu, S. (2025). Automatically Translating C to Rust. ACM TOPLAS.

[2] Shetty, M., Jain, N., & Godbole, A. (2024). Syzygy: Dual Code-Test C to (safe) Rust Translation using LLMs and Dynamic Analysis. arXiv preprint.

[3] Cai, X., Liu, J., & Huang, X. (2025). RustMap: Towards Project-Scale C-to-Rust Migration via Program Analysis and LLM. arXiv preprint.

[4] Khatry, A., Zhang, R., & Pan, J. (2025). CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation. arXiv preprint.

[5] Shiraishi, M., Cao, Y., & Shinagawa, T. (2024). SmartC2Rust: Iterative, Feedback-Driven C-to-Rust Translation via Large Language Models. ACM CCS.

[6] Luo, F., Ji, K., & Gao, C. (2025). Integrating Rules and Semantics for LLM-Based C-to-Rust Translation. IEEE ICSME.

[7] Sirlanci, M., Yagemann, C., & Lin, Z. (2025). C2RUST-BENCH: A Minimized, Representative Dataset for C-to-Rust Transpilation Evaluation. arXiv preprint.
