
Embodied World Models: Teaching Robots to Simulate Before They Act

Before acting in the physical world, an effective robot should be able to imagine the consequences. World models — internal simulators that predict how actions reshape future states — are becoming the central architecture for embodied AI. A comprehensive survey and a Meta/HKUST research agenda map the state of the art and the open problems.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

A chess engine does not move a piece to see what happens. It simulates the consequences internally, evaluates thousands of possible futures, and then selects the move with the best expected outcome. This capacity — predicting the effects of actions before executing them — is what separates planning from trial-and-error. For language models operating in text, trial-and-error is cheap: a bad sentence can be regenerated. For robots operating in the physical world, trial-and-error breaks things. This asymmetry is why world models — internal simulators that capture environment dynamics — have become the central research question in embodied AI.

The Research Landscape

A Unified Framework for World Models

Li, He, Zhang, Wu, Li, and Liu (2025) present what is, to date, the most systematic survey of world models for embodied AI. The paper proposes a three-axis taxonomy that organizes the field:

Functionality axis: Decision-Coupled vs. General-Purpose. Decision-coupled world models are trained jointly with a policy — the model learns to predict futures that are useful for making decisions, even if those predictions are not perceptually accurate. General-purpose world models aim to produce realistic predictions of future states regardless of the downstream task. The trade-off is precision versus flexibility: decision-coupled models are more efficient for a specific task but do not transfer; general-purpose models transfer but may waste representational capacity on details irrelevant to any particular decision.

Temporal modeling axis: Sequential Simulation and Inference vs. Global Difference Prediction. Sequential models generate future states one step at a time, autoregressively. This is flexible but accumulates errors over long horizons — each prediction error compounds into the next. Global difference prediction models instead estimate the change between the current state and a future state in one shot, avoiding error accumulation but struggling with complex multi-step dynamics.
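The compounding effect can be made concrete with a toy example. The sketch below is not taken from the survey: it assumes a hypothetical scalar system whose true state never changes and a hypothetical learned one-step model that overshoots by 2% per step. The one-shot predictor is simply assumed to carry a fixed 2% error on the final state, which is an idealization of the global-difference approach.

```python
TRUE_GAIN = 1.00   # assumed true dynamics: x_{t+1} = 1.00 * x_t
MODEL_GAIN = 1.02  # assumed learned one-step model, 2% off per step


def rollout(x0, gain, horizon):
    """Apply a one-step model repeatedly, feeding each prediction back in."""
    x = x0
    for _ in range(horizon):
        x = gain * x
    return x


def sequential_error(x0, horizon):
    """Absolute error of the autoregressive rollout at the final step."""
    truth = rollout(x0, TRUE_GAIN, horizon)
    pred = rollout(x0, MODEL_GAIN, horizon)
    return abs(pred - truth)


def one_shot_error(x0, horizon, relative_error=0.02):
    """Error of a direct predictor that is assumed to miss the final
    state by a fixed relative amount, independent of the horizon."""
    truth = rollout(x0, TRUE_GAIN, horizon)
    return abs(truth * (1.0 + relative_error) - truth)


if __name__ == "__main__":
    for horizon in (1, 10, 50):
        print(horizon, sequential_error(1.0, horizon), one_shot_error(1.0, horizon))
```

Under these assumptions the sequential error grows geometrically with the horizon while the one-shot error stays flat. Real one-shot predictors do not get a fixed error for free; as the paragraph above notes, they tend to struggle when the intermediate dynamics are complex.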

Spatial representation axis: Global Latent Vector, Token Feature Sequence, Spatial Latent Grid, and Decomposed Rendering Representation. Each trades off computational cost, spatial fidelity, and compositional generalization differently. Latent vectors are compact but lose spatial structure. Decomposed representations separate objects from backgrounds, enabling compositional reasoning but requiring object detection as a prerequisite.

The survey covers robotics, autonomous driving, and general video prediction, identifying a consistent gap across all domains: pixel-level prediction quality does not predict task-level performance. A model can produce visually realistic future frames while failing to predict task-relevant dynamics.
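A toy illustration of this gap, using made-up data rather than anything from the survey: a one-dimensional "frame" contains a flat background and one bright object pixel. One hypothetical prediction reproduces the background perfectly but puts the object in the wrong place; another gets the background brightness wrong but places the object correctly. The function names and the brightest-pixel object locator are assumptions chosen for simplicity.

```python
def pixel_mse(pred, truth):
    """Mean squared error over all pixels."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


def object_position(frame):
    """Index of the brightest pixel, a stand-in for the object's location."""
    return max(range(len(frame)), key=lambda i: frame[i])


def make_frame(width, background, obj_index):
    """Flat background with a single bright object pixel."""
    frame = [background] * width
    frame[obj_index] = 1.0
    return frame


truth = make_frame(20, 0.5, 12)
sharp_but_wrong = make_frame(20, 0.5, 5)   # perfect background, object misplaced
dull_but_right = make_frame(20, 0.7, 12)   # biased background, object correct

if __name__ == "__main__":
    for name, pred in (("sharp_but_wrong", sharp_but_wrong),
                       ("dull_but_right", dull_but_right)):
        position_error = abs(object_position(pred) - object_position(truth))
        print(name, pixel_mse(pred, truth), position_error)
```

In this constructed case the prediction with the lower pixel error is the one that misplaces the object, which is the quantity a manipulation policy would actually need.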

From Language Models to World Models

Fung, Bachrach, Celikyilmaz, Chaudhuri, and collaborators (2025) frame the move from language models to world models as a critical step for embodied AI. They argue that world models are central to the reasoning and planning of embodied agents: they let an agent understand and predict its environment, and also model user intentions and social contexts.

The paper proposes that world modeling encompasses three integrated capabilities:

Multimodal perception: The agent must integrate visual, tactile, auditory, and proprioceptive inputs into a unified representation. This is harder than multimodal language modeling because the modalities have different temporal resolutions (for example, vision at around 30 Hz versus touch at around 1000 Hz) and different spatial frames (camera coordinates vs. robot joint angles).
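One simple way to reconcile the different sampling rates is to pool the fast signal into the time window of each slow sample. The sketch below is an illustrative assumption, not a method from the paper: it averages touch readings into one value per camera frame, using the 1000 Hz and 30 Hz figures from the text. The function name and the choice of mean pooling are hypothetical.

```python
def align_to_frames(touch_samples, touch_hz, frame_hz, n_frames):
    """Average high-rate touch samples into one value per camera frame.

    Sample i is taken at time i / touch_hz. Frame k covers the window
    [k / frame_hz, (k + 1) / frame_hz). Integer ceiling division keeps
    the window boundaries exact and avoids floating-point drift.
    """
    def first_index_at_or_after(frame_index):
        # ceil(frame_index * touch_hz / frame_hz) using integers only
        return -(-(frame_index * touch_hz) // frame_hz)

    aligned = []
    for k in range(n_frames):
        start = first_index_at_or_after(k)
        end = first_index_at_or_after(k + 1)
        window = touch_samples[start:end]
        aligned.append(sum(window) / len(window) if window else None)
    return aligned


if __name__ == "__main__":
    touch = list(range(100))  # 100 ms of synthetic touch readings
    print(align_to_frames(touch, touch_hz=1000, frame_hz=30, n_frames=3))
```

Mean pooling discards high-frequency contact events such as slip onsets, so a real system might keep the full window or extract richer features per frame. The point here is only the bookkeeping needed to put both streams on a common clock.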

Planning through reasoning for action and control: The world model must support forward prediction, counterfactual reasoning ("what would have happened if I had pushed harder?"), and goal-conditioned planning ("what sequence of actions reaches the desired state?"). Each requires progressively more sophisticated causal modeling.
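The three uses can be sketched against a single toy model. Everything below is a hypothetical stand-in: a hand-written point-mass model plays the role of a learned world model, the control period `DT` is assumed, and the planner is a brute-force search over a small discrete set of forces rather than any method proposed by the authors.

```python
from itertools import product

DT = 0.1  # assumed control period in seconds


def model_step(state, force):
    """One step of a unit-mass point model: force changes velocity, velocity changes position."""
    pos, vel = state
    vel = vel + force * DT
    pos = pos + vel * DT
    return (pos, vel)


def predict(state, actions):
    """Forward prediction: roll the model through a sequence of actions."""
    for force in actions:
        state = model_step(state, force)
    return state


def plan(state, goal_pos, horizon, candidate_forces=(-1.0, 0.0, 1.0)):
    """Goal-conditioned planning by exhaustive search over action sequences."""
    best = min(
        product(candidate_forces, repeat=horizon),
        key=lambda seq: abs(predict(state, seq)[0] - goal_pos),
    )
    return list(best)


if __name__ == "__main__":
    start = (0.0, 0.0)
    executed = [0.5] * 5
    print("forward:", predict(start, executed))
    # Counterfactual: same start state, different actions ("pushed harder").
    print("counterfactual:", predict(start, [1.0] * 5))
    print("plan:", plan(start, goal_pos=0.15, horizon=5))
```

The counterfactual here is just a replay from the same initial state with different actions, which works because the toy model is deterministic and fully observed. With a learned, partially observed model, counterfactual queries also require inferring what the hidden state was, which is part of why each capability demands more causal structure than the last.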

Memory: Embodied agents operate in persistent environments. The world model must maintain a belief state updated incrementally as new observations arrive, rather than reprocessing the entire history at each step.
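A minimal sketch of an incrementally updated belief state, assuming a single scalar quantity (say, the position of a static object) observed with Gaussian noise. This is the standard one-dimensional Kalman measurement update, used here purely as an illustration; the paper does not prescribe this mechanism, and the function name is hypothetical.

```python
def update_belief(belief, observation, obs_var):
    """Fold one noisy observation into a Gaussian belief (mean, variance).

    Only the previous belief is needed, not the observation history.
    """
    mean, var = belief
    gain = var / (var + obs_var)          # how much to trust the new observation
    new_mean = mean + gain * (observation - mean)
    new_var = (1.0 - gain) * var          # uncertainty shrinks after each update
    return (new_mean, new_var)


if __name__ == "__main__":
    belief = (0.0, 1.0)  # assumed prior: mean 0, variance 1
    for obs in (2.0, 2.0, 2.0):
        belief = update_belief(belief, obs, obs_var=1.0)
        print(belief)
```

Each update costs constant time and memory regardless of how long the agent has been running, which is the property the paragraph above asks for. Learned world models typically replace the explicit mean and variance with a recurrent latent state, but the update pattern is the same.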

Beyond the physical world, the paper proposes learning mental world models of users — predicting what the human partner intends and needs to enable better human-agent collaboration.

Critical Analysis: Claims and Evidence

Claim | Evidence | Verdict
World models are central to embodied AI planning | Both papers converge on this position; consistent with the broader robotics literature | ✅ Supported
Error accumulation in sequential models limits long-horizon prediction | Li et al.'s survey of temporal modeling approaches | ✅ Supported (well-documented limitation)
Pixel prediction quality does not predict task performance | Li et al.'s cross-domain metric analysis | ✅ Supported
The LLM-to-world-model transition is the key paradigm for robotics | Fung et al.'s position paper | ⚠️ Plausible framing; alternative paradigms (e.g., end-to-end RL) remain competitive
Mental models of users improve human-robot collaboration | Proposed by Fung et al. | ⚠️ Proposed but not yet empirically validated

The Sim-to-Real Gap Persists

Both papers acknowledge but do not solve the fundamental challenge: world models are simulators, and simulators are wrong. The gap between a learned world model's predictions and actual physics — the sim-to-real transfer problem — remains the primary obstacle to deploying world-model-based agents in unstructured environments. A robot that can plan effectively in its internal model but fails when the real world deviates from that model is not useful.

Li et al. note that current evaluation metrics assess pixel fidelity or state-level accuracy but not physical consistency — a model can produce visually realistic predictions that are physically impossible.

Open Questions and Future Directions

  • Evaluation metrics for physical consistency: How should we measure whether a world model's predictions are physically plausible, beyond pixel-level similarity? Metrics that check energy conservation, momentum conservation, and collision plausibility in predicted futures do not yet exist at scale.
  • Computational cost for real-time control: World models are useful only if they can generate predictions faster than real time. The trade-off between model complexity and inference speed is a binding constraint for robotics applications where control loops run at hundreds of hertz.
  • Data scarcity for real-world manipulation: Autonomous driving benefits from massive real-world datasets (millions of hours of driving footage). Robotic manipulation lacks comparable datasets. Can world models be effectively pre-trained on video data and fine-tuned on small robotic datasets?
  • Compositional generalization: Can a world model trained on "pushing blocks" generalize to "pushing cups"? The spatial representation axis matters here — decomposed representations should theoretically enable compositional transfer, but empirical evidence is limited.
  • Integration with foundation models: Can large pretrained vision-language models serve as world models, or do world models require fundamentally different training objectives that prioritize physical dynamics over semantic content?
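To make the first open question above more tangible, here is a minimal sketch of what an energy-based consistency check could look like. It is an assumption-laden toy, not an existing benchmark metric: it presumes the predicted states are already available as (height, velocity) pairs for a single body in a closed, frictionless setting, which is exactly the information that pixel-space predictions do not directly provide.

```python
G = 9.81  # gravitational acceleration in m/s^2


def mechanical_energy(height, velocity, mass=1.0):
    """Potential plus kinetic energy of a point mass."""
    return mass * G * height + 0.5 * mass * velocity ** 2


def energy_drift(trajectory, mass=1.0):
    """Largest relative deviation from the initial energy along a
    predicted trajectory of (height, velocity) states."""
    h0, v0 = trajectory[0]
    e0 = mechanical_energy(h0, v0, mass)
    return max(
        abs(mechanical_energy(h, v, mass) - e0) / abs(e0)
        for h, v in trajectory
    )


if __name__ == "__main__":
    # Analytic free fall from 10 m: energy is conserved.
    plausible = [(10.0 - 0.5 * G * t * t, -G * t) for t in (0.0, 0.5, 1.0)]
    # An object that rises 2 m while staying at rest: energy appears from nowhere.
    implausible = [(10.0, 0.0), (12.0, 0.0)]
    print(energy_drift(plausible), energy_drift(implausible))
```

A score near zero is necessary but not sufficient for physical plausibility, and the check does not apply when external forces or friction are present. Scaling this idea up requires recovering object states and masses from predicted video, which is part of why such metrics remain an open problem.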
What This Means for Your Research

If you work in robotics, the three-axis taxonomy from Li et al. is a practical tool for positioning your own work. Making the functionality, temporal, and spatial axes explicit clarifies which limitations are inherent to your architectural choices and which can be addressed.

If you work on foundation models, the LLM-to-world-model transition is a direction with significant room. Language models predict the next token; world models predict the next state. The architectural similarities are suggestive, but training objectives and evaluation criteria differ enough to require dedicated investigation.

References

[1] Li, X., He, X., Zhang, L., Wu, M., Li, X., & Liu, Y. (2025). A Comprehensive Survey on World Models for Embodied AI. arXiv:2510.16732.
[2] Fung, P., Bachrach, Y., Celikyilmaz, A., Chaudhuri, K. et al. (2025). Embodied AI Agents: Modeling the World. arXiv:2506.22355.
