Embodied World Models: Teaching Robots to Simulate Before They Act

Before acting in the physical world, an effective robot should be able to imagine the consequences. World models — internal simulators that predict how actions reshape future states — are becoming the central architecture for embodied AI. A comprehensive survey and a Meta/HKUST research agenda map the state of the art and the open problems.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

A chess engine does not move a piece to see what happens. It simulates the consequences internally, evaluates thousands of possible futures, and then selects the move with the best expected outcome. This capacity to predict the effects of actions before executing them is what separates planning from trial-and-error. For language models operating in text, trial-and-error is cheap: a bad sentence can be regenerated. For robots operating in the physical world, trial-and-error breaks things. This asymmetry is why world models, internal simulators that capture environment dynamics, have become a central research topic in embodied AI.
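The "simulate, evaluate, select" loop described above can be sketched in a few lines. This is a minimal illustration, not a method from either paper: the one-dimensional `model_step` dynamics, the three-valued action set, and the random-shooting strategy in `plan` are all assumptions chosen to keep the example small.

```python
import random

def model_step(position, action):
    """Hypothetical internal model: the action shifts the position directly."""
    return position + action

def imagined_cost(start, actions, goal):
    """Roll an action sequence through the model and score the final distance to the goal."""
    position = start
    for action in actions:
        position = model_step(position, action)
    return abs(goal - position)

def plan(start, goal, horizon=5, candidates=200, seed=0):
    """Random-shooting planner: sample action sequences, keep the best imagined one."""
    rng = random.Random(seed)
    best_actions, best_cost = None, float("inf")
    for _ in range(candidates):
        actions = [rng.choice([-1.0, 0.0, 1.0]) for _ in range(horizon)]
        cost = imagined_cost(start, actions, goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost

# Nothing is executed in the "real" world until a sequence has been chosen.
actions, cost = plan(start=0.0, goal=3.0)
print(actions, cost)
```

Every candidate is evaluated purely inside the model, so a bad candidate costs compute rather than hardware. The quality of the chosen plan is only as good as `model_step`, which is the dependency the rest of this post is concerned with.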

The Research Landscape

A Unified Framework for World Models

Li, He, Zhang, Wu, Li, and Liu (2025) present a comprehensive survey of world models for embodied AI. The paper proposes a three-axis taxonomy that organizes the field:

Functionality axis: Decision-Coupled vs. General-Purpose. Decision-coupled world models are trained jointly with a policy: the model learns to predict futures that are useful for making decisions, even if those predictions are not perceptually accurate. General-purpose world models aim to produce realistic predictions of future states regardless of the downstream task. The trade-off is task efficiency versus transferability: decision-coupled models are more efficient for a specific task but tend not to transfer, while general-purpose models transfer but may spend representational capacity on details irrelevant to any particular decision.

Temporal modeling axis: Sequential Simulation and Inference vs. Global Difference Prediction. Sequential models generate future states one step at a time, autoregressively. This is flexible but accumulates errors over long horizons — each prediction error compounds into the next. Global difference prediction models instead estimate the change between the current state and a future state in one shot, avoiding error accumulation but struggling with complex multi-step dynamics.

Spatial representation axis: Global Latent Vector, Token Feature Sequence, Spatial Latent Grid, and Decomposed Rendering Representation. Each trades off computational cost, spatial fidelity, and compositional generalization differently. Latent vectors are compact but lose spatial structure. Decomposed representations separate objects from backgrounds, enabling compositional reasoning but requiring object detection as a prerequisite.
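The error accumulation noted under the temporal modeling axis can be made concrete with a toy system. The sketch below is an assumption-laden illustration rather than anything taken from the survey: `true_step` and `learned_step` are made-up scalar dynamics that differ only by a small coefficient error. It compares the error when the model is fed the true state at every step with the error when it is fed its own previous prediction. It illustrates only the sequential side of the axis.

```python
def true_step(x):
    """Ground-truth toy dynamics (an assumption for illustration)."""
    return 0.9 * x + 1.0

def learned_step(x):
    """Hypothetical learned model with a slightly wrong coefficient."""
    return 0.92 * x + 1.0

def rollout_errors(x0, horizon):
    """Return per-step errors for one-step prediction and for open-loop rollout."""
    one_step, open_loop = [], []
    x_true, x_model = x0, x0
    for _ in range(horizon):
        next_true = true_step(x_true)
        # One-step: the model is fed the true current state.
        one_step.append(abs(learned_step(x_true) - next_true))
        # Open-loop: the model is fed its own previous prediction.
        x_model = learned_step(x_model)
        open_loop.append(abs(x_model - next_true))
        x_true = next_true
    return one_step, open_loop

one_step, open_loop = rollout_errors(x0=0.0, horizon=50)
print(f"one-step error at step 50:  {one_step[-1]:.3f}")
print(f"open-loop error at step 50: {open_loop[-1]:.3f}")
```

In this toy setting the one-step error never exceeds 0.2, while the open-loop error is above 2 by step 50. A model that looks accurate one step at a time can still drift badly when it consumes its own outputs.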

The survey covers robotics, autonomous driving, and general video prediction, identifying a consistent gap across all domains: pixel-level prediction quality does not predict task-level performance. A model can produce visually realistic future frames while failing to predict task-relevant dynamics.
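A small constructed example shows how pixel error and task outcome can disagree. Everything here is hypothetical: the 16-pixel one-dimensional "frames", the goal region starting at index 8, and the two hand-built predictions are assumptions made for illustration, not data from the survey.

```python
def mse(frame_a, frame_b):
    """Mean squared error between two equally sized 1-D frames."""
    return sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)

def object_in_goal(frame, goal_start=8):
    """Task-level check: is the brightest pixel (the object) inside the goal region?"""
    position = max(range(len(frame)), key=lambda i: frame[i])
    return position >= goal_start

def frame_with_object(position, background=0.0, size=16):
    """Build a 1-D frame with a uniform background and one bright object pixel."""
    frame = [background] * size
    frame[position] = 1.0
    return frame

truth = frame_with_object(8)                        # object just inside the goal
sharp_wrong = frame_with_object(7)                  # crisp, but object one pixel short
hazy_right = frame_with_object(8, background=0.4)   # washed out, but object placed correctly

for name, pred in [("sharp_wrong", sharp_wrong), ("hazy_right", hazy_right)]:
    task_match = object_in_goal(pred) == object_in_goal(truth)
    print(name, round(mse(pred, truth), 3), task_match)
```

The crisp prediction has the lower pixel error (0.125 against 0.15) but gets the task-relevant question wrong, while the hazier prediction answers it correctly. Ranking models by pixel error alone would pick the wrong one here.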

From Language Models to World Models

Fung, Bachrach, Celikyilmaz, Chaudhuri, and collaborators (2025) frame the move from language models to world models as a critical transition for embodied AI. They argue that world models are central to reasoning and planning in embodied AI agents, allowing agents to understand and predict their environment as well as user intentions and social contexts.

The paper proposes that world modeling encompasses three integrated capabilities:

Multimodal perception: The agent must integrate visual, tactile, auditory, and proprioceptive inputs into a unified representation. This is harder than multimodal language modeling because the modalities have different temporal resolutions (for example, vision at around 30 Hz and touch at around 1000 Hz) and different spatial frames (camera coordinates vs. robot joint angles).

Planning through reasoning for action and control: The world model must support forward prediction, counterfactual reasoning ("what would have happened if I had pushed harder?"), and goal-conditioned planning ("what sequence of actions reaches the desired state?"). Each requires progressively more sophisticated causal modeling.

Memory: Embodied agents operate in persistent environments. The world model must maintain a belief state updated incrementally as new observations arrive, rather than reprocessing the entire history at each step.
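The memory requirement above, a belief state that is updated incrementally rather than recomputed from the full history, has a classic minimal instance in the scalar Kalman filter. The sketch below uses that filter as a stand-in. The paper is not claimed to use it, and the dynamics (`position += action`), the noise variances, and the observation stream are assumptions for illustration.

```python
def kalman_predict(mean, var, action, process_var):
    """Propagate the belief through assumed dynamics: position += action."""
    return mean + action, var + process_var

def kalman_update(mean, var, observation, obs_var):
    """Fold one noisy observation into a Gaussian belief (scalar Kalman update)."""
    gain = var / (var + obs_var)
    new_mean = mean + gain * (observation - mean)
    new_var = (1.0 - gain) * var
    return new_mean, new_var

# The belief is just (mean, variance); no observation history is stored.
belief = (0.0, 1.0)
stream = [(1.0, 1.1), (1.0, 1.9), (1.0, 3.2)]  # (action, noisy position reading)
for action, reading in stream:
    belief = kalman_predict(*belief, action, process_var=0.01)
    belief = kalman_update(*belief, reading, obs_var=0.25)
    print(belief)
```

Each step touches only the current belief and the newest observation, so the cost per update stays constant however long the agent has been running. Learned world models typically replace the two numbers with a latent vector, but the predict-then-correct structure is the same idea.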

Beyond the physical world, the paper proposes learning mental world models of users — predicting what the human partner intends and needs to enable better human-agent collaboration.

Critical Analysis: Claims and Evidence

Claim	Evidence	Verdict
World models are central to embodied AI planning	Both papers converge on this position; consistent with the broader robotics literature	✅ Supported
Error accumulation in sequential models limits long-horizon prediction	Li et al.'s survey of temporal modeling approaches	✅ Supported — well-documented limitation
Pixel prediction quality does not predict task performance	Li et al.'s cross-domain metric analysis	✅ Supported
The LLM-to-world-model transition is the key paradigm for robotics	Fung et al.'s position paper	⚠️ Plausible framing; alternative paradigms (e.g., end-to-end RL) remain competitive
Mental models of users improve human-robot collaboration	Proposed by Fung et al.	⚠️ Proposed but not yet empirically validated

The Sim-to-Real Gap Persists

Both papers acknowledge but do not solve the fundamental challenge: world models are simulators, and every simulator is an approximation. The gap between a learned world model's predictions and actual physics, closely related to the sim-to-real transfer problem, remains the primary obstacle to deploying world-model-based agents in unstructured environments. A robot that plans effectively in its internal model but fails when the real world deviates from that model cannot be relied on.

Li et al. note that current evaluation metrics assess pixel fidelity or state-level accuracy but not physical consistency — a model can produce visually realistic predictions that are physically impossible.
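One way to picture a physical-consistency check is to test a conserved quantity along a predicted trajectory. The sketch below is a hedged illustration, not a metric proposed by Li et al. It assumes the model exposes state-level predictions as `(height, velocity)` pairs for a point mass in frictionless free flight, and the names `energy`, `max_energy_drift`, and `free_fall` are hypothetical helpers.

```python
G = 9.81  # gravitational acceleration in m/s^2

def energy(height, velocity):
    """Mechanical energy per unit mass of a point mass in free flight."""
    return G * height + 0.5 * velocity ** 2

def max_energy_drift(trajectory):
    """Largest relative deviation from the initial energy along a trajectory."""
    e0 = energy(*trajectory[0])
    return max(abs(energy(h, v) - e0) / abs(e0) for h, v in trajectory)

def free_fall(h0, steps, dt):
    """Exact (height, velocity) states for a ball dropped from rest at height h0."""
    return [(h0 - 0.5 * G * (k * dt) ** 2, -G * k * dt) for k in range(steps)]

plausible = free_fall(h0=2.0, steps=10, dt=0.05)
# Hypothetical bad prediction: same heights, but the ball never speeds up.
implausible = [(h, 0.0) for h, _ in plausible]

print(f"plausible drift:   {max_energy_drift(plausible):.2e}")
print(f"implausible drift: {max_energy_drift(implausible):.2e}")
```

Both trajectories show the ball at identical heights, so a position-only comparison cannot tell them apart, yet the second one loses more than 40% of its energy with nowhere for it to go. Real scenes involve friction, contact, and unobserved state, so a check like this only covers a narrow slice of what a general metric would need.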

Open Questions and Future Directions

Evaluation metrics for physical consistency: How should we measure whether a world model's predictions are physically plausible, beyond pixel-level similarity? Metrics that check energy conservation, momentum conservation, and collision plausibility (for example, objects not passing through each other) in predicted futures do not yet exist at scale.

Computational cost for real-time control: For online planning, a world model is useful only if its predictions are ready within the control loop's time budget, which in practice means simulating faster than real time. The trade-off between model complexity and inference speed is a binding constraint for robotics applications where control loops run at hundreds of hertz.

Data scarcity for real-world manipulation: Autonomous driving benefits from massive real-world datasets (millions of hours of driving footage). Robotic manipulation lacks comparable datasets. Can world models be effectively pre-trained on video data and fine-tuned on small robotic datasets?

Compositional generalization: Can a world model trained on "pushing blocks" generalize to "pushing cups"? The spatial representation axis matters here — decomposed representations should theoretically enable compositional transfer, but empirical evidence is limited.

Integration with foundation models: Can large pretrained vision-language models serve as world models, or do world models require fundamentally different training objectives that prioritize physical dynamics over semantic content?
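The real-time constraint raised in the computational-cost question can be turned into simple arithmetic. The function below is a back-of-the-envelope sketch under stated assumptions: planning happens once per control period, every imagined rollout takes `horizon` model steps, each (possibly batched) model step has a fixed latency, and nothing else competes for the time budget. The parameter names and example numbers are hypothetical.

```python
def max_candidates(control_hz, horizon, step_latency_us, batch_size=1):
    """How many imagined rollouts fit in one control period.

    Assumes rollouts of `horizon` model steps, evaluated `batch_size` at a time,
    with a fixed latency in microseconds per (batched) model step.
    """
    period_us = 1_000_000 // control_hz
    rollout_us = horizon * step_latency_us
    return (period_us // rollout_us) * batch_size

# 200 Hz control loop, 10-step lookahead.
print(max_candidates(control_hz=200, horizon=10, step_latency_us=100))                 # 5
print(max_candidates(control_hz=200, horizon=10, step_latency_us=100, batch_size=64))  # 320
print(max_candidates(control_hz=200, horizon=10, step_latency_us=2000))                # 0
```

At 200 Hz the budget is 5 ms per cycle. A model step of 0.1 ms allows a handful of sequential rollouts or a few hundred batched ones, while a 2 ms step cannot complete even one 10-step rollout in time. This is why large video-prediction models are often paired with a slower planning loop above a fast low-level controller.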

What This Means for Your Research

If you work in robotics, the three-axis taxonomy from Li et al. is a practical tool for positioning your own work. Making the functionality, temporal, and spatial axes explicit clarifies which limitations are inherent to your architectural choices and which can be addressed.

If you work on foundation models, the LLM-to-world-model transition is a direction with significant room for further work. Language models predict the next token; world models predict the next state. The architectural similarities are suggestive, but training objectives and evaluation criteria differ enough to require dedicated investigation.

References

[1] Li, X., He, X., Zhang, L., Wu, M., Li, X., & Liu, Y. (2025). A Comprehensive Survey on World Models for Embodied AI. arXiv:2510.16732.

[2] Fung, P., Bachrach, Y., Celikyilmaz, A., Chaudhuri, K. et al. (2025). Embodied AI Agents: Modeling the World. arXiv:2506.22355.