When a human mathematician encounters a difficult problem, something happens before pencil touches paper: an assessment. How hard is this? What approach is likely to work? Do I have the relevant knowledge? This pre-computation — the monitoring of one's own cognitive resources before committing to a strategy — is metacognition, and its absence in current AI systems is a measurable source of failure. Language models plunge into generating answers without first evaluating whether their approach is appropriate, whether their knowledge is sufficient, or whether their output is reliable. Three recent lines of work suggest this is changing.
Flavell's Framework, Implemented
Oh (2025), presenting at the COLM 2025 Workshop on LLM Explainability, takes the most direct approach: implementing John Flavell's 1979 cognitive monitoring model — a foundational theory from developmental psychology — as a three-phase system for LLM reasoning. The framework operationalizes the Monitor-Generate-Verify cycle that Flavell theorized as the structure of human metacognition.
The system works in three distinct phases. In the monitoring phase, the model assesses the task before attempting a solution: what type of problem is this, what strategies might apply, what is the expected difficulty? In the generation phase, the model produces a solution informed by this assessment, allocating computational effort proportional to diagnosed difficulty. In the verification phase, the model evaluates its own output against the metacognitive criteria established during monitoring.
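The three phases can be sketched as a simple control loop. This is an illustrative reconstruction, not Oh's implementation: the `Assessment` fields, the heuristics, and the stub generator are all assumptions made for the sake of a runnable example.

```python
from dataclasses import dataclass

@dataclass
class Assessment:
    """Metacognitive state fixed during monitoring (illustrative fields)."""
    problem_type: str
    strategy: str
    difficulty: int  # 1 (easy) .. 5 (hard)

def monitor(task: str) -> Assessment:
    # Phase 1: assess the task before solving it -- classify the problem,
    # choose a strategy, and estimate difficulty (stubbed heuristic here).
    difficulty = 3 if ("each" in task or "total" in task) else 1
    return Assessment("arithmetic word problem", "step-by-step decomposition", difficulty)

def generate(task: str, plan: Assessment) -> str:
    # Phase 2: produce a solution informed by the assessment, allocating
    # effort (here, number of reasoning steps) proportional to difficulty.
    steps = plan.difficulty * 2
    return f"[{plan.strategy}, {steps} steps] answer for: {task}"

def verify(solution: str, plan: Assessment) -> bool:
    # Phase 3: evaluate the output against the criteria established
    # during monitoring (stubbed check for illustration).
    return plan.strategy in solution

def solve(task: str, max_refinements: int = 3) -> str:
    plan = monitor(task)                  # monitoring precedes generation
    solution = generate(task, plan)
    for _ in range(max_refinements):      # refine only if verification fails
        if verify(solution, plan):
            break
        solution = generate(task, plan)
    return solution
```

The point of the structure is visible in `solve`: because monitoring runs first, the refinement loop rarely needs to fire, which is the mechanism behind the reduced iteration counts reported above.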
The critical insight is that current approaches split this cycle in half. Methods like Plan-and-Solve excel at strategic planning (the monitoring phase) but lack mechanisms to verify whether selected strategies succeed. Methods like Self-Refine excel at iterative output improvement (the verification phase) but begin generation without any upfront task assessment. Oh's unified implementation shows the value of closing the loop: on GSM8K arithmetic reasoning, the full Monitor-Generate-Verify system achieves 75.42% accuracy versus 68.44% for Self-Refine and 67.07% for Self-Verification, while requiring fewer refinement attempts (1.3 versus 2.0 iterations). Upfront monitoring produces better initial solutions, reducing the need for costly iterative correction.
Metacognition in Open Worlds
Zhou, Liu, Li et al. (2025), in the Findings of ACL 2025, tackle a more ambitious setting: open-world planning in environments like Minecraft, where the space of possible actions is vast, goals are long-horizon, and the environment changes dynamically. Their system, Metagent-P, integrates the world knowledge of large language models with the symbolic reasoning capabilities of cognitive architectures and the self-reflection that characterizes metacognition.
The architecture constructs a planning-verification-execution-reflection framework. The planning module generates action sequences. Verification checks these plans for logical consistency before execution — a form of prospective metacognition that catches errors in reasoning before they become errors in action. Execution carries out verified plans. Reflection evaluates outcomes against expectations and stores experiences in a multimodal memory system for future reference.
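The four modules compose into an agent loop like the following. This is a toy sketch under stated assumptions: the class and function names, the wood-before-table consistency check, and the always-succeeding environment are illustrative inventions, not Metagent-P's API.

```python
class Memory:
    """Toy stand-in for the multimodal experience memory."""
    def __init__(self):
        self.experiences = []

    def store(self, goal, plan, succeeded):
        # Reflection: record the outcome of an executed plan.
        self.experiences.append((goal, plan, succeeded))

    def recall(self, goal):
        # Return past plans that succeeded for the same goal.
        return [p for g, p, ok in self.experiences if ok and g == goal]

def make_plan(goal, memory):
    cached = memory.recall(goal)
    if cached:
        return cached[0]  # self-evolution: reuse a plan that already worked
    return ["gather_wood", "craft_table", goal]  # toy action sequence

def verify_plan(steps):
    # Prospective metacognition: catch a logical inconsistency (crafting
    # before gathering prerequisites) BEFORE it becomes an error in action.
    return steps.index("gather_wood") < steps.index("craft_table")

def execute(steps):
    # Stubbed environment: assume verified plans succeed.
    return True

def run_agent(goal, memory, max_replans=3):
    for _ in range(max_replans):
        plan = make_plan(goal, memory)
        if not verify_plan(plan):        # replan instead of acting on a bad plan
            continue
        succeeded = execute(plan)
        memory.store(goal, plan, succeeded)
        if succeeded:
            return plan
    return None
```

The replanning counter lives outside execution: a plan that fails verification never touches the environment, which is exactly the property the 34% replanning reduction measures.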
The results are striking: Metagent-P reduces average replanning counts by 34% compared to current methods and exceeds the average human success rate by 18.96% in long-horizon Minecraft tasks. The reduction in replanning is the key metacognitive indicator — it means the system's initial plans are better because they have been subjected to self-evaluation before commitment, not merely refined through trial-and-error after failure. The system also demonstrates self-evolution, improving its planning capability through accumulated experience — a form of metacognitive learning that mirrors how human experts develop better intuitions over time.
Grounded Metacognitive Reasoning
Elenjical et al. (2026) introduce Think², a framework for what they call grounded metacognitive reasoning in LLMs. While the details of this forthcoming work are still emerging, the conceptual contribution is significant: the argument that LLM metacognition must be grounded in the model's actual capabilities rather than simulated through prompting tricks.
The distinction matters because many current "self-reflection" methods in LLMs are performative rather than genuine. When a language model is prompted to "check your reasoning," it generates text that looks like self-evaluation but does not actually access diagnostic information about its own processing. It is producing the linguistic form of metacognition without the computational substance — like a student who writes "I verified my answer" without actually rechecking the calculation. Grounded metacognition, by contrast, requires that the model's self-assessment draw on real signals about its internal state: confidence calibration, detection of knowledge boundaries, awareness of reasoning patterns that historically produce errors.
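One of the "real signals" named above, confidence calibration, can be made concrete with token probabilities. The sketch below is a generic illustration of the idea, not Think²'s method; the probability values are toy inputs that would, in practice, come from the model's output logits, and the 0.5 deferral threshold is an arbitrary assumption.

```python
import math

def sequence_confidence(token_probs):
    # Mean log-probability of the generated tokens: a diagnostic signal
    # from the model's actual processing, not from generated text about it.
    return sum(math.log(p) for p in token_probs) / len(token_probs)

def answer_or_defer(answer, token_probs, threshold=math.log(0.5)):
    # Grounded self-assessment: defer to human judgment when the model's
    # average per-token confidence falls below ~50%.
    if sequence_confidence(token_probs) < threshold:
        return "DEFER: confidence below threshold"
    return answer

confident = answer_or_defer("42", [0.90, 0.95, 0.88])  # high-probability tokens
uncertain = answer_or_defer("42", [0.30, 0.40, 0.35])  # low-probability tokens
```

The contrast with performative reflection is the input: `answer_or_defer` reads the model's own probability assignments, whereas a prompted "check your reasoning" reads only the text the model chooses to emit about itself.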
From Simulation to Substance
The convergence of these approaches — Flavell's cognitive monitoring implemented as system architecture, neuro-symbolic metacognition for open-world agents, and grounded rather than performative self-evaluation — suggests a maturation of the field beyond prompting-based reflection. The early wave of "self-correcting" LLMs relied on generating more text about the text already generated. The emerging approach embeds metacognitive mechanisms into the computational architecture itself: monitoring before generating, verifying before executing, and learning from the gap between predicted and actual outcomes.
The practical stakes are high. Every AI system deployed in a consequential domain — medical diagnosis, legal reasoning, financial analysis, scientific research — faces situations where it should recognize its own limitations and defer to human judgment. A system without metacognition cannot know what it does not know. It will generate confident outputs on questions outside its competence, and no amount of scaling will fix this, because the problem is architectural rather than statistical. Teaching machines to monitor their own reasoning is not a refinement of current capabilities — it is the development of a capability that current systems fundamentally lack.