Field Map · AI & Machine Learning · Systematic Review

LLMOrbit: Mapping Six Years of Language Model Evolution from Scaling Walls to Agentic Systems

Where did we come from, and where are we going? LLMOrbit maps the full landscape of large language models from 2019 to 2025 as a circular taxonomy, revealing that the field has hit scaling walls and is pivoting toward agentic architectures as the next growth vector.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

Where does the field of large language models stand as of 2025? The pace of development has been so rapid that even active researchers struggle to maintain a coherent map of the landscape. New models, architectures, and training techniques appear weekly, each claiming improvement over predecessors whose names are barely familiar. The result is a field that is simultaneously advancing quickly and losing its collective sense of direction.

Patro & Agneeswaran's LLMOrbit addresses this disorientation with a circular taxonomy: a structured map of the LLM landscape from the introduction of GPT-2 in 2019 through the agentic systems of 2025. The circular structure is deliberate: rather than implying a linear progression from worse to better, it captures the cyclic and branching nature of LLM development, where ideas recur in new forms and seemingly abandoned approaches resurface with modern twists.

The Scaling Era (2019–2023)

The first phase of LLM development was defined by a simple hypothesis: bigger models trained on more data produce better results. This hypothesis, formalized in the scaling laws of Kaplan et al. (2020) and refined by Hoffmann et al. (2022, the "Chinchilla" paper), drove a parameter arms race from GPT-2's 1.5 billion parameters (2019) to GPT-4's rumored trillions.

The scaling era produced genuine and substantial improvements. Capabilities that were impossible at smaller scales (few-shot learning, complex instruction following, extended coherent generation) emerged reliably as models grew. The scaling laws provided a remarkably accurate predictive framework: given a compute budget, you could estimate the optimal model size and training data quantity.
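To make that predictive framework concrete, here is a minimal sketch of compute-optimal allocation under the widely quoted Chinchilla-style heuristics: training cost C ≈ 6·N·D FLOPs, with roughly 20 training tokens per parameter at the optimum. The constants are rough approximations drawn from Hoffmann et al., not values reported in LLMOrbit.

```python
# Minimal sketch: compute-optimal model/data split under Chinchilla-style
# heuristics. Assumes C ~= 6*N*D training FLOPs and D ~= 20*N tokens at the
# optimum; both constants are rough approximations, not exact fitted values.

def compute_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that roughly exhaust a FLOP budget."""
    # With D = r*N and C = 6*N*D = 6*r*N^2:  N = sqrt(C / (6*r)), D = r*N.
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

params, tokens = compute_optimal(5.76e23)  # roughly Chinchilla's budget
print(f"~{params / 1e9:.0f}B parameters, ~{tokens / 1e12:.1f}T tokens")
```

Run on Chinchilla's approximate budget, this recovers the familiar ~70B-parameter, ~1.4T-token recommendation, which is what made the framework feel predictive rather than merely descriptive.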

But the scaling era also encountered scaling walls, diminishing returns that made continued scaling increasingly expensive relative to the improvement obtained:

  • Data walls: High-quality training data is finite. Models exhausted the supply of carefully curated web text and increasingly relied on synthetic or lower-quality data, with corresponding quality degradation.
  • Compute walls: Training the largest models requires clusters of thousands of GPUs running for months, an investment measured in hundreds of millions of dollars that only a handful of organizations can afford.
  • Capability walls: Certain abilities (reliable mathematical reasoning, consistent factual accuracy, long-horizon planning) improved slowly with scale, suggesting that more parameters alone cannot unlock them.

The Reasoning Turn (2024–2025)

The response to scaling walls was not to abandon scale but to redirect investment toward how models learn rather than how much they learn. The reasoning turn, catalyzed by DeepSeek R1 and reinforced by subsequent work, demonstrated that training methods, particularly reinforcement learning applied to reasoning processes, could unlock capabilities that pure scaling had not.

LLMOrbit identifies several key developments in this phase:

  • Chain-of-thought training: Models trained to show their reasoning step by step, enabling verification and improvement of the reasoning process itself
  • Process reward models: Rewarding intermediate reasoning steps rather than only final answers, providing denser learning signals
  • Test-time compute scaling: Allocating more computation at inference time for harder problems, trading latency for accuracy in a principled way (a minimal sketch follows this list)
  • Specialized reasoning models: Domain-specific models (legal, medical, mathematical) that reason within professional frameworks
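
As a concrete illustration of test-time compute scaling, the sketch below implements best-of-N sampling against a scorer: draw several candidate solutions and keep the one the scorer ranks highest. The `generate` and `score` callables are hypothetical stand-ins for a model call and a reward or verifier model; this is one simple instance of the idea, not a method from the paper.

```python
# Hedged sketch of test-time compute scaling via best-of-N sampling: spend
# more inference compute on hard problems by drawing several candidates and
# keeping the one a scorer (e.g., a reward/verifier model) ranks highest.

import random
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate answers and return the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))

# Toy usage with stubs; a real system would call an LLM and a reward model.
toy_generate = lambda p: f"answer-{random.randint(0, 100)}"
toy_score = lambda p, a: float(a.split("-")[1])
print(best_of_n("What is 17 * 24?", toy_generate, toy_score, n=4))
```

Raising n buys accuracy with latency and cost, which is exactly the trade the bullet above describes.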

The Multimodal Expansion

Parallel to the reasoning turn, the multimodal expansion integrated vision, audio, and structured data with language understanding. LLMOrbit maps the progression from CLIP-style contrastive alignment (connecting images and text in a shared embedding space) through instruction-tuned multimodal models (LLaVA, GPT-4V) to domain-specific multimodal experts (medical VLMs, remote sensing VLMs).
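As a rough illustration of the contrastive-alignment starting point, the sketch below computes the standard symmetric InfoNCE loss over a batch of paired image and text embeddings (the two encoders are omitted). The temperature value is illustrative, and none of this code comes from the LLMOrbit paper.

```python
# Minimal sketch of CLIP-style contrastive alignment, assuming paired
# image/text embeddings have already been produced by two encoders.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor,
                          txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Pull matched image/text pairs together, push mismatched pairs apart."""
    img = F.normalize(img_emb, dim=-1)        # unit-length image vectors
    txt = F.normalize(txt_emb, dim=-1)        # unit-length text vectors
    logits = img @ txt.t() / temperature      # pairwise cosine similarities
    targets = torch.arange(len(img))          # i-th image matches i-th text
    # Symmetric cross-entropy: image->text over rows, text->image over columns
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```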

The taxonomy reveals that multimodality is not a single capability but a spectrum:

  • Perception: Understanding the content of non-text inputs (what does this image show?)
  • Grounding: Connecting language references to specific regions of non-text inputs (where in this image is the cat?)
  • Reasoning: Drawing conclusions that require integrating information across modalities (does this X-ray show evidence consistent with the patient's reported symptoms?)
  • Generation: Producing non-text outputs guided by language (generate an image of a sunset over mountains)

Current models achieve perception and basic grounding reliably; cross-modal reasoning and controlled generation remain active research frontiers.

The Agentic Pivot

The most recent phase, and the one LLMOrbit identifies as the current trajectory, is the pivot from models as passive responders to models as autonomous agents. This shift redefines the LLM from a text-in-text-out function to a cognitive controller that plans, uses tools, maintains memory, interacts with environments, and coordinates with other agents.

LLMOrbit's taxonomy of agentic capabilities includes:

  • Tool use: Calling external APIs, executing code, querying databases (the control loop this builds on is sketched after this list)
  • Planning: Decomposing complex goals into executable sub-steps
  • Memory: Maintaining information across interactions, building persistent knowledge
  • Self-reflection: Evaluating own outputs and identifying errors
  • Multi-agent coordination: Collaborating with other AI agents toward shared goals
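
To make the tool-use loop concrete, here is a hypothetical sketch of the control flow an agent runtime implements: the model proposes either a final answer or a tool call, the runtime executes the tool, and the observation is appended to the context for the next step. `call_model`, the message format, and the single demo tool are assumptions for illustration, not interfaces from LLMOrbit.

```python
# Hypothetical sketch of an agentic tool-use loop: the model emits either a
# final answer or a tool request, the runtime executes the tool, and the
# result is fed back into the conversation.

TOOLS = {
    # Demo tool only; eval() on model output would be unsafe in production.
    "calculator": lambda expr: str(eval(expr)),
}

def call_model(messages: list[dict]) -> dict:
    """Stand-in for an LLM call; a real system would query a model API."""
    return {"action": "final", "content": "stub answer"}

def agent_loop(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages)
        if reply["action"] == "final":        # model chose to answer directly
            return reply["content"]
        result = TOOLS[reply["action"]](reply["content"])   # run requested tool
        messages.append({"role": "tool", "content": result})  # observe result
    return "step limit reached"

print(agent_loop("What is 17 * 24?"))
```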

The agentic pivot represents a qualitative shift in what LLMs are. A language model is a statistical tool. An agent is an autonomous system with goals, plans, and the ability to act on the world. The safety, alignment, and governance implications of this shift are substantial, and, as LLMOrbit notes, governance frameworks have not kept pace with capability development.

The Map, Not the Territory

LLMOrbit is explicitly a taxonomy: a map of the landscape, not a prediction of where it will go next. The authors are careful to note that circular taxonomies reveal patterns but do not determine trajectories. The field may continue on its current agentic path, or it may encounter new walls that redirect development in unexpected directions.

What the taxonomy does provide is orientation. For researchers entering the field, it answers the question "What should I know?" For practitioners evaluating which technologies to adopt, it answers "Where does this fit in the broader landscape?" For policymakers attempting to regulate AI development, it answers "What kinds of systems exist and what can they do?"

Claims and Evidence

| Claim | Evidence | Verdict |
| --- | --- | --- |
| Scaling laws accurately predicted early LLM improvement | Kaplan et al. and Hoffmann et al. validated on multiple model families | ✅ Well-established |
| Scaling has hit diminishing returns for certain capabilities | Data, compute, and capability walls documented across multiple efforts | ✅ Supported |
| RL-based reasoning training outperforms pure scaling for reasoning | DeepSeek R1, Hou et al. demonstrate reasoning gains from RL | ✅ Supported |
| The agentic pivot is the dominant current research direction | Publication volume, industry investment, and benchmark development all shifted toward agents | ✅ Observed |
| A single taxonomy can capture the full LLM landscape | Inherent simplification; important nuances are necessarily lost | ⚠️ Useful simplification |

Open Questions

  • Post-Transformer architectures: LLMOrbit is implicitly Transformer-centric. Will alternative architectures (state space models, linear attention, hybrid designs) create a parallel taxonomy branch?
  • Convergence or divergence?: Are LLMs converging toward a single dominant architecture, or is the field diverging into specialized branches (reasoning models, multimodal models, agent models) that share less and less common ground?
  • The next wall: What will be the scaling wall for agentic AI? Memory management? Multi-agent coordination failures? Safety and alignment limitations? Identifying the next constraint before hitting it would enable proactive research investment.
  • Evaluation evolution: As LLMs evolve from text generators to autonomous agents, evaluation must evolve correspondingly. What benchmarks will define the next generation of LLM capability assessment?
  • The consolidation question: Will the LLM landscape consolidate around a few dominant model families (as happened with search engines and social networks), or will it remain fragmented with many viable approaches?
What This Means for Your Research

For any researcher working with or on LLMs, LLMOrbit provides essential context. Understanding where the field has been, and why it has moved in the directions it has, is a prerequisite for identifying where it is going and where the most impactful research opportunities lie.

The key strategic insight from the taxonomy: the era of winning through scale alone is closing. The open frontiers are reasoning quality, domain specialization, multimodal integration, and agentic capability. Researchers who invest in these directions are better positioned than those who continue to pursue raw scaling.

For the broader AI community, LLMOrbit serves as a reminder that rapid progress can obscure fundamental questions. We have built systems of remarkable capability, but the questions of what these systems are, how they should be governed, and what role they should play in human society remain as open as they were when GPT-2 was released six years ago.

References

[1] Patro, B. & Agneeswaran, V. (2026). LLMOrbit: A Circular Taxonomy of Large Language Models – From Scaling Walls to Agentic AI Systems. arXiv:2601.14053.
