Trend Analysis · Engineering

End-to-End Autonomous Driving with RL: Can World Models Close the Sim-to-Real Gap?

Imitation learning taught autonomous vehicles to drive like their trainers—including the trainers' mistakes. Reinforcement learning promises to teach them to drive better. Three new frameworks use world models to close the sim-to-real gap, with RAD's 3D Gaussian splatting approach accumulating 50 citations in months.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

End-to-end autonomous driving—mapping raw sensor inputs directly to control commands without hand-engineered intermediate representations—has been dominated by imitation learning (IL): train a neural network to mimic expert human drivers. This paradigm has a well-known failure mode. IL systems learn the distribution of expert behavior, not the objective of safe driving. When the vehicle encounters a situation outside the training distribution—a construction zone, an aggressive merging vehicle, a child running into the street—the model has no mechanism to reason about consequences. It can only replay the nearest memorized behavior, which may be catastrophically wrong.

Reinforcement learning (RL) offers a different philosophy: instead of imitating demonstrations, learn a policy that maximizes a reward signal encoding safe, efficient driving. In principle, RL can discover driving strategies that exceed human performance by exploring and optimizing in simulation. In practice, RL for autonomous driving has been stymied by two problems: the expense of training in the real world (where crashes have real consequences) and the fidelity gap between simulation and reality (where a policy that drives perfectly in CARLA may fail on an actual road).
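
Read side by side, the two paradigms optimize different things. The block below writes them in their standard textbook form; it is a generic formulation, not the specific loss used by any of the papers discussed here, and the symbols (policy $\pi_\theta$, expert dataset $\mathcal{D}_{\text{expert}}$, reward $r$, discount $\gamma$) are notation assumed for illustration.

```latex
\begin{aligned}
\text{IL (behavior cloning):}\quad
  & \min_{\theta}\; \mathbb{E}_{(s,\,a^{*}) \sim \mathcal{D}_{\text{expert}}}
    \big[ -\log \pi_{\theta}(a^{*} \mid s) \big] \\
\text{RL:}\quad
  & \max_{\theta}\; \mathbb{E}_{\tau \sim \pi_{\theta}}
    \Big[ \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \Big]
\end{aligned}
```

The IL expectation is taken over states the expert visited, so nothing in the objective covers states the expert never reached. The RL expectation is taken over trajectories $\tau$ generated by the policy itself, which is why it needs an environment to roll out in and why simulator fidelity matters so much.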

A wave of recent work is attacking both problems through world models—learned simulators that generate realistic driving scenarios from neural representations of the physical world.

RAD: 3D Gaussian Splatting Meets Reinforcement Learning

Gao et al. (2025) introduce RAD (Reinforcement learning for Autonomous Driving), a framework that trains end-to-end driving policies using 3D Gaussian Splatting (3DGS) as the simulation backbone. With 50 citations in its first months, a notable reception for an autonomous driving paper, RAD represents a new approach to the sim-to-real gap.

The core insight: traditional driving simulators such as CARLA render synthetic scenes that differ systematically from real-world sensor data in texture, lighting, object appearance, and sensor noise. 3DGS, by contrast, reconstructs photorealistic 3D scenes from real driving logs, so RL training takes place in environments that are nearly indistinguishable from the real world because they are the real world: reconstructed, navigable, and augmentable.

RAD's architecture has three stages:

Scene reconstruction: Real driving logs are converted to 3DGS representations that enable novel-view rendering at any camera angle.

Closed-loop RL training: The driving policy (a vision transformer) interacts with the 3DGS environment, receiving rewards for safe lane-keeping, collision avoidance, and traffic rule compliance.

Real-world deployment: The trained policy is intended to transfer directly to the physical vehicle because the training and deployment visual distributions are matched.
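
The closed-loop stage can be pictured with a toy example. The sketch below is not RAD's implementation: the reward weights, the `StepInfo` fields, and the one-dimensional `toy_env_step` are all hypothetical stand-ins, and a real system would render camera images from the 3DGS scene instead of tracking a single lateral offset. It only illustrates how the three reward terms named above combine during a rollout.

```python
from dataclasses import dataclass


@dataclass
class StepInfo:
    lateral_offset_m: float  # signed distance from the lane centre
    collided: bool
    rule_violation: bool     # e.g. ran a red light


def driving_reward(info, w_lane=1.0, w_collision=10.0, w_rule=5.0,
                   lane_half_width_m=1.75):
    """Scalar reward: lane-keeping, collision and rule terms (weights assumed)."""
    lane_term = -w_lane * min(abs(info.lateral_offset_m) / lane_half_width_m, 1.0)
    collision_term = -w_collision if info.collided else 0.0
    rule_term = -w_rule if info.rule_violation else 0.0
    return lane_term + collision_term + rule_term


def toy_env_step(offset, action):
    """Hypothetical 1-D environment: the action shifts the car sideways."""
    new_offset = offset + action
    info = StepInfo(new_offset, collided=abs(new_offset) > 3.0, rule_violation=False)
    return new_offset, info, info.collided  # episode ends on collision


def rollout(env_step, policy, initial_obs, max_steps=100):
    """Closed loop: the policy's own actions determine the next observation."""
    obs, total = initial_obs, 0.0
    for _ in range(max_steps):
        obs, info, done = env_step(obs, policy(obs))
        total += driving_reward(info)
        if done:
            break
    return total


corrective_return = rollout(toy_env_step, lambda o: -0.5 * o, initial_obs=1.0)
drifting_return = rollout(toy_env_step, lambda o: 0.5, initial_obs=1.0)
```

Here the corrective policy steers back toward the lane centre and collects a small penalty, while the drifting policy leaves the road and ends the episode with the collision penalty. The point of the closed loop is that the drifting policy sees the consequences of its own drift, which an open-loop imitation dataset would never show it.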

The results on the authors' custom 3DGS-based evaluation benchmark show meaningful improvements over IL baselines: a 3x lower collision rate compared to imitation learning approaches. Note that nuScenes data serves as input for scene reconstruction, not as the evaluation benchmark itself. These numbers are significant but the approach is still in its early stages—evaluated on a custom benchmark rather than standardized public benchmarks.

World Model Alignment: Raw2Drive

Yang et al. (2025) address a subtler problem: even when world models are photorealistic, their dynamics may diverge from reality. A world model that renders beautiful images but simulates incorrect physics (e.g., wrong friction coefficients, unrealistic vehicle dynamics) will produce RL policies that exploit model artifacts rather than learning generalizable driving skills.

Raw2Drive tackles this with an "aligned world model" that is jointly trained on visual reconstruction and dynamics prediction, so that the model's internal physics match real-world vehicle behavior. Published on arXiv, the framework achieves state-of-the-art results on CARLA v2, a notably harder benchmark than CARLA v1, with more complex scenarios and stricter evaluation criteria.

Key technical contributions:

Dynamics alignment loss: A regularization term that penalizes discrepancies between the world model's predicted state transitions and recorded real-world state transitions.
Latent-space planning: Rather than rendering full images at each planning step, Raw2Drive plans in the world model's latent space—a compressed representation that is computationally efficient but retains task-relevant information.
RL fine-tuning: After pre-training via imitation learning, the policy is fine-tuned with RL in the aligned world model, specifically targeting scenarios where IL fails (e.g., near-collisions, complex intersections).
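
The first two contributions can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not Raw2Drive's code: the mean-squared-error form of the alignment term, the `align_weight` parameter, and the exhaustive candidate search in `plan_in_latent` are all choices made here for clarity.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def aligned_world_model_loss(recon_pred, recon_target,
                             next_state_pred, next_state_logged,
                             align_weight=0.5):
    """Visual reconstruction loss plus a dynamics alignment regularizer.

    The alignment term penalizes the gap between the model's predicted
    state transition and the transition recorded in the driving log.
    """
    recon = mse(recon_pred, recon_target)
    align = mse(next_state_pred, next_state_logged)
    return recon + align_weight * align


def plan_in_latent(z0, candidates, transition, reward):
    """Score candidate action sequences by rolling out in latent space only.

    No image is decoded: `transition` maps (latent, action) to the next
    latent, and `reward` scores a latent directly.
    """
    def score(actions):
        z, total = z0, 0.0
        for a in actions:
            z = transition(z, a)
            total += reward(z)
        return total

    return max(candidates, key=score)


# A model with perfect images but wrong dynamics still pays a penalty.
loss_wrong_physics = aligned_world_model_loss(
    [1.0, 2.0], [1.0, 2.0], [0.0, 0.0], [1.0, 1.0])

# Toy latent: a single number that should reach the target value 2.
best_plan = plan_in_latent(
    z0=0.0,
    candidates=[(1.0, 1.0), (0.0, 0.0), (2.0, 2.0)],
    transition=lambda z, a: z + a,
    reward=lambda z: -abs(z - 2.0),
)
```

In the first example the reconstruction error is zero, yet the loss is nonzero because the predicted transition disagrees with the logged one. That is the behavior the alignment term is meant to produce.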

Impartial World Models: AD-R1

Yan et al. (2025) identify a bias in world model-based RL that previous work overlooked: world models trained on expert driving logs learn to predict what happens when the car is driven well. They are poor at predicting what happens when the car makes a mistake—precisely the regime where RL training is most informative.

AD-R1 addresses this with an "impartial world model" trained on both expert and non-expert driving data, including near-crash scenarios and recovery maneuvers. This work, early but gaining attention, argues that the diversity of the world model's training data is at least as important as its visual fidelity.

The framework uses a curriculum-based RL approach:

Phase 1: Train on easy scenarios (straight roads, light traffic) where IL pre-training provides a good initialization.

Phase 2: Progressively introduce harder scenarios (dense traffic, adverse weather, construction zones) where the impartial world model generates realistic failure modes.

Phase 3: Adversarial scenario generation, where the world model actively creates challenging situations to stress-test the driving policy.

This curriculum mimics how human drivers learn: easy roads first, highway merging later, ice and snow last. The key empirical finding is that Phase 3 (adversarial training) substantially reduces safety violations in long-tail scenarios compared to policies trained only on Phases 1 and 2—the paper demonstrates this on a Risk Foreseeing Benchmark but does not report a single headline percentage in the abstract.
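
The three phases can be sketched as a simple scheduler. How AD-R1 decides when to advance is not stated in this summary, so the success-rate threshold `promote_at` and the phase names below are assumptions made purely for illustration.

```python
PHASES = ("easy", "hard", "adversarial")


def next_phase(phase_index, recent_success_rate, promote_at=0.9):
    """Advance one phase once the policy is reliable on the current one."""
    if recent_success_rate >= promote_at and phase_index < len(PHASES) - 1:
        return phase_index + 1
    return phase_index


def run_curriculum(success_rates, promote_at=0.9):
    """Return the phase used at each training round.

    `success_rates[i]` is the (hypothetical) success rate measured at the
    end of round i, which decides the phase for round i + 1.
    """
    index, history = 0, []
    for rate in success_rates:
        history.append(PHASES[index])
        index = next_phase(index, rate, promote_at)
    return history


schedule = run_curriculum([0.5, 0.95, 0.8, 0.92, 0.99])
```

With these example success rates the policy stays on easy scenarios for two rounds, spends two rounds on hard scenarios, and reaches adversarial scenario generation in the fifth round.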

Adaptive Reasoning: AdaThinkDrive

Luo et al. (2025) take a different angle on the RL-for-driving problem: rather than improving the simulator, improve the reasoning of the driving agent. AdaThinkDrive integrates chain-of-thought (CoT) reasoning from vision-language models with RL fine-tuning, enabling the driving agent to articulate its decision-making process before executing actions.

This work addresses a practical concern: in simple driving scenarios (straight road, no traffic), chain-of-thought reasoning adds latency without benefit. AdaThinkDrive uses RL to learn when to think, engaging CoT reasoning only in complex situations that warrant deliberation and bypassing it for routine driving.

The adaptive reasoning mechanism achieves a useful balance: in complex scenarios, the CoT module explains the agent's decision (e.g., "pedestrian detected at crosswalk, reducing speed, checking oncoming lane before deviation"), improving both performance and interpretability. In simple scenarios, the agent acts reflexively, maintaining the low latency required for real-time control.
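
A minimal sketch of the idea follows. It is not AdaThinkDrive's architecture: in the paper the decision to think is learned by the model, whereas here a hand-written `complexity_fn` and a fixed `threshold` stand in for it, and `cot_cost` is an assumed per-step penalty showing how a reward can make thinking worthwhile only when it improves the outcome.

```python
def act(scene, fast_policy, slow_policy, complexity_fn, threshold=0.5):
    """Use chain-of-thought only when the scene looks complex enough.

    Returns (action, reasoning); reasoning is None on the fast path.
    """
    if complexity_fn(scene) >= threshold:
        reasoning, action = slow_policy(scene)
        return action, reasoning
    return fast_policy(scene), None


def adaptive_think_reward(task_score, used_cot, cot_cost=0.1):
    """Task reward minus a latency cost charged whenever CoT was used."""
    return task_score - (cot_cost if used_cot else 0.0)


# Hypothetical stand-ins for the learned components.
def complexity_fn(scene):
    return 1.0 if scene["pedestrians"] > 0 else 0.0


def fast_policy(scene):
    return "keep_lane"


def slow_policy(scene):
    return "pedestrian detected at crosswalk, reducing speed", "brake"


simple = act({"pedestrians": 0}, fast_policy, slow_policy, complexity_fn)
complex_case = act({"pedestrians": 1}, fast_policy, slow_policy, complexity_fn)
```

Under this reward, thinking on an empty straight road lowers the return because the task score is unchanged while the cost is still charged. Thinking at a crosswalk raises the return whenever the improvement in task score exceeds `cot_cost`.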

Critical Analysis: Claims and Evidence

Claim	Evidence	Verdict
3DGS-based RL improves over IL for driving	3x lower collision rate on custom 3DGS benchmark (Gao et al.)	✅ Supported
Aligned world models reduce sim-to-real gap	CARLA v2 state-of-the-art (Yang et al.)	✅ Supported (simulation benchmark)
Impartial world models improve long-tail handling	Substantial reduction in safety violations on Risk Foreseeing Benchmark (Yan et al.)	✅ Supported (simulation)
Adaptive CoT improves complex-scenario reasoning	Interpretable decisions with maintained latency (Luo et al.)	✅ Supported
RL-trained policies are ready for public roads	No real-world deployment studies published	❌ Not supported (currently)

The Safety Verification Gap

The elephant in the room for all RL-based autonomous driving work is safety verification. Traditional safety engineering relies on formal methods, fault trees, and failure mode analysis—tools that require explicit, interpretable system specifications. An RL-trained neural network policy provides none of these. It is a black box that maps sensor inputs to steering angles through millions of learned parameters.

How do you certify that such a system is safe enough for public roads? The regulatory answer is unclear. ISO 21448 (Safety of the Intended Functionality) provides a framework for handling insufficiencies in AI-based driving functions, but it was designed for modular perception-planning-control architectures, not end-to-end learned policies. Adapting safety standards to RL-trained systems is an open regulatory and engineering challenge.

Open Questions and Future Directions

Can world models generalize to unseen cities? Current results are benchmarked on specific datasets (nuScenes: Boston, Singapore; CARLA: synthetic). Generalization to novel urban layouts, driving conventions (left-hand vs. right-hand traffic), and road conditions is untested.

What reward function is "safe enough"? RL performance is only as good as its reward function. Encoding the full complexity of safe driving—including edge cases like emergency vehicles, road debris, and unusual pedestrian behavior—into a scalar reward signal is a formidable design challenge.

How do we handle the liability question? If an RL-trained vehicle causes an accident due to a scenario not covered by its training reward, who is liable? The manufacturer? The RL algorithm designer? The training data provider?

Can RL and IL be combined optimally? Several of these works use IL for pre-training and RL for fine-tuning. Is there a principled framework for determining which driving behaviors should be learned from demonstration versus discovered through optimization?

What is the compute cost of 3DGS-based training? RAD requires reconstructing 3D scenes from driving logs—a process that is computationally expensive. Can the approach scale to the millions of driving hours needed for a production system?

Implications for the Autonomous Driving Industry

The shift from imitation learning to reinforcement learning in autonomous driving mirrors a broader trend in AI: moving from systems that replicate human behavior to systems that optimize for objectives. This shift promises better handling of edge cases, more robust safety properties, and ultimately superior driving performance.

But it also introduces new risks. An RL policy that optimizes for a poorly specified reward function may discover strategies that are technically "optimal" but practically dangerous—cutting corners too aggressively, braking too late to maximize throughput, or exploiting gaps in traffic that a human driver would consider too narrow. Reward engineering for autonomous driving may prove to be as difficult as the driving problem itself.
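
A toy example makes the risk concrete. Both reward functions and the numbers below are hypothetical; the 1.5-second reaction time and the penalty size are arbitrary values chosen for illustration, not a recommended safety margin.

```python
def throughput_only(speed_mps, gap_m):
    """Mis-specified reward: faster is always better."""
    return speed_mps


def with_safety_margin(speed_mps, gap_m, reaction_s=1.5, penalty=100.0):
    """Same reward, but penalize following closer than a reaction-time gap."""
    required_gap_m = speed_mps * reaction_s
    return speed_mps - (penalty if gap_m < required_gap_m else 0.0)


# Candidate behaviors as (speed in m/s, gap to lead vehicle in m).
candidates = [(10.0, 20.0), (20.0, 20.0)]

best_naive = max(candidates, key=lambda c: throughput_only(*c))
best_safe = max(candidates, key=lambda c: with_safety_margin(*c))
```

The throughput-only reward prefers 20 m/s with a 20 m gap, which is "optimal" under its own definition but leaves less than the assumed reaction distance. Adding the margin term flips the choice to the slower option. An optimizer finds whatever the reward actually says, not what its designer meant.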

The field is making rapid progress. The early reception of RAD, the CARLA v2 results from Raw2Drive, and the adaptive reasoning of AdaThinkDrive all represent tangible advances. What remains is the hardest part: translating simulation benchmarks into real-world safety, a challenge that no amount of photorealistic rendering can fully address.

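Disclaimer: This post is an overview of research trends for informational purposes. Before citing it in academic work, verify specific findings, statistics, and claims against the original papers.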

References (4)

[1] Gao, H., Chen, S., Jiang, B. et al. (2025). RAD: Training an end-to-end driving policy via large-scale 3DGS-based reinforcement learning. arXiv:2502.13144.

[2] Yang, Z., Jia, X., Li, Q. et al. (2025). Raw2Drive: Reinforcement learning with aligned world models for end-to-end autonomous driving (in CARLA v2). arXiv:2505.16394.

[3] Yan, T., Tang, T., Gui, X. et al. (2025). AD-R1: Closed-loop reinforcement learning for end-to-end autonomous driving with impartial world models. arXiv:2511.20325.

[4] Luo, Y., Li, F., Xu, S. et al. (2025). AdaThinkDrive: Adaptive thinking via reinforcement learning for autonomous driving. arXiv:2509.13769.
