Field Map · AI & Machine Learning · Systematic Review

LLMOrbit: Mapping Six Years of Language Model Evolution from Scaling Walls to Agentic Systems

Where did we come from, and where are we going? LLMOrbit maps the full landscape of large language models from 2019 to 2025 as a circular taxonomy, revealing that the field has hit scaling walls and is pivoting toward agentic architectures as the next growth vector.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Where does the field of large language models stand as of 2025? The pace of development has been so rapid that even active researchers struggle to maintain a coherent map of the landscape. New models, architectures, and training techniques appear weekly, each claiming improvement over predecessors whose names are barely familiar. The result is a field that is simultaneously advancing quickly and losing its collective sense of direction.

Patro & Agneeswaran's LLMOrbit addresses this disorientation with a circular taxonomy: a structured map of the LLM landscape from the introduction of GPT-2 in 2019 through the agentic systems of 2025. The circular structure is deliberate: rather than implying a linear progression from worse to better, it captures the cyclic and branching nature of LLM development, where ideas recur in new forms and seemingly abandoned approaches resurface with modern twists.

The Scaling Era (2019–2023)

The first phase of LLM development was defined by a simple hypothesis: bigger models trained on more data produce better results. This hypothesis, formalized in the scaling laws of Kaplan et al. (2020) and refined by Hoffmann et al. (2022, the "Chinchilla" paper), drove a parameter arms race from GPT-2's 1.5 billion parameters (2019) to GPT-4's rumored trillions.

The scaling era produced genuine and substantial improvements. Capabilities that were impossible at smaller scales (few-shot learning, complex instruction following, extended coherent generation) emerged reliably as models grew. The scaling laws provided a remarkably accurate predictive framework: given a compute budget, you could estimate the optimal model size and training data quantity.
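To make that predictive framework concrete, here is a minimal sketch of compute-optimal allocation under the widely quoted Chinchilla-style heuristics: training cost C ≈ 6·N·D FLOPs, with roughly 20 training tokens per parameter at the optimum. The constants are rough approximations drawn from Hoffmann et al., not values reported in LLMOrbit.

```python
# Minimal sketch: compute-optimal model/data split under Chinchilla-style
# heuristics. Assumes C ~= 6*N*D training FLOPs and D ~= 20*N tokens at the
# optimum; both constants are rough approximations, not exact fitted values.

def compute_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that roughly exhaust a FLOP budget."""
    # With D = r*N and C = 6*N*D = 6*r*N^2:  N = sqrt(C / (6*r)), D = r*N.
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

params, tokens = compute_optimal(5.76e23)  # roughly Chinchilla's budget
print(f"~{params / 1e9:.0f}B parameters, ~{tokens / 1e12:.1f}T tokens")
```

Run on Chinchilla's approximate budget, this recovers the familiar ~70B-parameter, ~1.4T-token recommendation, which is what made the framework feel predictive rather than merely descriptive.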

But the scaling era also encountered scaling walls, diminishing returns that made continued scaling increasingly expensive relative to the improvement obtained:

  • Data walls: High-quality training data is finite. Models exhausted the supply of carefully curated web text and increasingly relied on synthetic or lower-quality data, with corresponding quality degradation.
  • Compute walls: Training the largest models requires clusters of thousands of GPUs running for months, an investment measured in hundreds of millions of dollars that only a handful of organizations can afford.
  • Capability walls: Certain abilities (reliable mathematical reasoning, consistent factual accuracy, long-horizon planning) improved slowly with scale, suggesting that more parameters alone cannot unlock them.

The Reasoning Turn (2024–2025)

The response to scaling walls was not to abandon scale but to redirect investment toward how models learn rather than how much they learn. The reasoning turn, catalyzed by DeepSeek R1 and reinforced by subsequent work, demonstrated that training methods, particularly reinforcement learning applied to reasoning processes, could unlock capabilities that pure scaling had not.

LLMOrbit identifies several key developments in this phase:

  • Chain-of-thought training: Models trained to show their reasoning step by step, enabling verification and improvement of the reasoning process itself
  • Process reward models: Rewarding intermediate reasoning steps rather than only final answers, providing denser learning signals
  • Test-time compute scaling: Allocating more computation at inference time for harder problems, trading latency for accuracy in a principled way (a minimal sketch follows this list)
  • Specialized reasoning models: Domain-specific models (legal, medical, mathematical) that reason within professional frameworks
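
As a concrete illustration of test-time compute scaling, the sketch below implements best-of-N sampling against a scorer: draw several candidate solutions and keep the one the scorer ranks highest. The `generate` and `score` callables are hypothetical stand-ins for a model call and a reward or verifier model; this is one simple instance of the idea, not a method from the paper.

```python
# Hedged sketch of test-time compute scaling via best-of-N sampling: spend
# more inference compute on hard problems by drawing several candidates and
# keeping the one a scorer (e.g., a reward/verifier model) ranks highest.

import random
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate answers and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))

# Toy usage with stubs; a real system would call an LLM and a reward model.
toy_generate = lambda p: f"answer-{random.randint(0, 100)}"
toy_score = lambda p, a: float(a.split("-")[1])
print(best_of_n("What is 17 * 24?", toy_generate, toy_score, n=4))
```

Raising n buys accuracy with latency and cost, which is exactly the trade the bullet above describes.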

The Multimodal Expansion

Parallel to the reasoning turn, the multimodal expansion integrated vision, audio, and structured data with language understanding. LLMOrbit maps the progression from CLIP-style contrastive alignment (connecting images and text in a shared embedding space) through instruction-tuned multimodal models (LLaVA, GPT-4V) to domain-specific multimodal experts (medical VLMs, remote sensing VLMs).
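As a rough illustration of the contrastive-alignment starting point, the sketch below computes the standard symmetric InfoNCE loss over a batch of paired image and text embeddings (the two encoders are omitted). The temperature value is illustrative, and none of this code comes from the LLMOrbit paper.

```python
# Minimal sketch of CLIP-style contrastive alignment, assuming paired
# image/text embeddings have already been produced by two encoders.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor,
                          txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Pull matched image/text pairs together, push mismatched pairs apart."""
    img = F.normalize(img_emb, dim=-1)        # unit-length image vectors
    txt = F.normalize(txt_emb, dim=-1)        # unit-length text vectors
    logits = img @ txt.t() / temperature      # pairwise cosine similarities
    targets = torch.arange(len(img))          # i-th image matches i-th text
    # Symmetric cross-entropy: image->text over rows, text->image over columns
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```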

The taxonomy reveals that multimodality is not a single capability but a spectrum:

  • Perception: Understanding the content of non-text inputs (what does this image show?)
  • Grounding: Connecting language references to specific regions of non-text inputs (where in this image is the cat?)
  • Reasoning: Drawing conclusions that require integrating information across modalities (does this X-ray show evidence consistent with the patient's reported symptoms?)
  • Generation: Producing non-text outputs guided by language (generate an image of a sunset over mountains)

Current models achieve perception and basic grounding reliably; cross-modal reasoning and controlled generation remain active research frontiers.

The Agentic Pivot

The most recent phase, and the one LLMOrbit identifies as the current trajectory, is the pivot from models as passive responders to models as autonomous agents. This shift redefines the LLM from a text-in-text-out function to a cognitive controller that plans, uses tools, maintains memory, interacts with environments, and coordinates with other agents.

LLMOrbit's taxonomy of agentic capabilities includes:

  • Tool use: Calling external APIs, executing code, querying databases (the control loop this builds on is sketched after this list)
  • Planning: Decomposing complex goals into executable sub-steps
  • Memory: Maintaining information across interactions, building persistent knowledge
  • Self-reflection: Evaluating own outputs and identifying errors
  • Multi-agent coordination: Collaborating with other AI agents toward shared goals
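
To make the tool-use loop concrete, here is a hypothetical sketch of the control flow an agent runtime implements: the model proposes either a final answer or a tool call, the runtime executes the tool, and the observation is appended to the context for the next step. `call_model`, the message format, and the single demo tool are assumptions for illustration, not interfaces from LLMOrbit.

```python
# Hypothetical sketch of an agentic tool-use loop: the model emits either a
# final answer or a tool request, the runtime executes the tool, and the
# result is fed back into the conversation.

TOOLS = {
    # Demo tool only; eval() on model output would be unsafe in production.
    "calculator": lambda expr: str(eval(expr)),
}

def call_model(messages: list[dict]) -> dict:
    """Stand-in for an LLM call; a real system would query a model API."""
    return {"action": "final", "content": "stub answer"}

def agent_loop(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if reply["action"] == "final":        # model chose to answer directly
            return reply["content"]
        result = TOOLS[reply["action"]](reply["content"])   # run requested tool
        messages.append({"role": "tool", "content": result})  # observe result
    return "step limit reached"

print(agent_loop("What is 17 * 24?"))
```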

The agentic pivot represents a qualitative shift in what LLMs are. A language model is a statistical tool. An agent is an autonomous system with goals, plans, and the ability to act on the world. The safety, alignment, and governance implications of this shift are substantial, and, as LLMOrbit notes, governance frameworks have not kept pace with capability development.

The Map, Not the Territory

LLMOrbit is explicitly a taxonomy: a map of the landscape, not a prediction of where it will go next. The authors are careful to note that circular taxonomies reveal patterns but do not determine trajectories. The field may continue on its current agentic path, or it may encounter new walls that redirect development in unexpected directions.

What the taxonomy does provide is orientation. For researchers entering the field, it answers the question "What should I know?" For practitioners evaluating which technologies to adopt, it answers "Where does this fit in the broader landscape?" For policymakers attempting to regulate AI development, it answers "What kinds of systems exist and what can they do?"

Claims and Evidence

| Claim | Evidence | Verdict |
| --- | --- | --- |
| Scaling laws accurately predicted early LLM improvement | Kaplan et al. and Hoffmann et al. validated on multiple model families | ✅ Well-established |
| Scaling has hit diminishing returns for certain capabilities | Data, compute, and capability walls documented across multiple efforts | ✅ Supported |
| RL-based reasoning training outperforms pure scaling for reasoning | DeepSeek R1, Hou et al. demonstrate reasoning gains from RL | ✅ Supported |
| The agentic pivot is the dominant current research direction | Publication volume, industry investment, and benchmark development all shifted toward agents | ✅ Observed |
| A single taxonomy can capture the full LLM landscape | Inherent simplification; important nuances are necessarily lost | ⚠️ Useful simplification |

Open Questions

  • Post-Transformer architectures: LLMOrbit is implicitly Transformer-centric. Will alternative architectures (state space models, linear attention, hybrid designs) create a parallel taxonomy branch?
  • Convergence or divergence?: Are LLMs converging toward a single dominant architecture, or is the field diverging into specialized branches (reasoning models, multimodal models, agent models) that share less and less common ground?
  • The next wall: What will be the scaling wall for agentic AI? Memory management? Multi-agent coordination failures? Safety and alignment limitations? Identifying the next constraint before hitting it would enable proactive research investment.
  • Evaluation evolution: As LLMs evolve from text generators to autonomous agents, evaluation must evolve correspondingly. What benchmarks will define the next generation of LLM capability assessment?
  • The consolidation question: Will the LLM landscape consolidate around a few dominant model families (as happened with search engines and social networks), or will it remain fragmented with many viable approaches?
What This Means for Your Research

For any researcher working with or on LLMs, LLMOrbit provides essential context. Understanding where the field has been, and why it has moved in the directions it has, is a prerequisite for identifying where it is going and where the most impactful research opportunities lie.

The key strategic insight from the taxonomy: the era of winning through scale alone is closing. The open frontiers are reasoning quality, domain specialization, multimodal integration, and agentic capability. Researchers who invest in these directions are better positioned than those who continue to pursue raw scaling.

For the broader AI community, LLMOrbit serves as a reminder that rapid progress can obscure fundamental questions. We have built systems of remarkable capability, but the questions of what these systems are, how they should be governed, and what role they should play in human society remain as open as they were when GPT-2 was released six years ago.

References

[1] Patro, B. & Agneeswaran, V. (2026). LLMOrbit: A Circular Taxonomy of Large Language Models – From Scaling Walls to Agentic AI Systems. arXiv:2601.14053.
