Context Rot: Why Million-Token LLMs Still Lose Information in the Middle

Models advertise million-token context windows, but can they actually use all that context? Tavakoli et al. (2025) benchmark long-term memory in LLMs and find non-uniform degradation—performance drops as conversations expand, with information in the middle of long contexts systematically neglected.

By Sean K.S. Shin

This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The context window race has produced impressive numbers: 128K tokens, 200K, 1M, 10M. Model providers compete on this specification as though a larger context window were straightforwardly better—more context means more information available, which should mean better answers. The assumption seems obvious.

It is also wrong, or at least seriously incomplete. Tavakoli et al. (2025) demonstrate that having a large context window and effectively using that context window are different things. Their benchmark—BEAM—reveals that models with million-token context windows exhibit systematic performance degradation as conversations expand, with information positioned in the middle of long contexts particularly vulnerable to being ignored or forgotten.

The Research Landscape

The "lost in the middle" phenomenon has been documented in prior work: when relevant information is placed in the middle of a long context (rather than at the beginning or end), models are less likely to attend to it correctly. This effect was initially observed with shorter contexts, and there was hope that models specifically trained for long contexts would overcome it.

Tavakoli et al. test this hope systematically. Their contributions are twofold: a benchmark for evaluating long-context memory, and a memory-augmentation system designed to address the failures the benchmark reveals.

BEAM: A Long-Context Memory Benchmark

BEAM (the benchmark) takes a different approach from most long-context evaluations. Rather than testing whether a model can find a needle in a haystack—a single piece of information buried in irrelevant text—BEAM generates extended conversations (scaling up to 10 million tokens) with thematic variety and accompanying questions that test diverse memory capabilities.

The benchmark contains 100 conversations with 2,000 validated questions, designed to test whether models can track facts, relationships, and evolving information across conversation lengths that approach and exceed their advertised context windows. This design choice matters: real-world use of long context involves sustained, thematically complex interaction, not artificial retrieval from random noise.

What the Benchmark Reveals

The findings are concerning for anyone relying on long context windows as a substitute for structured memory:

Performance degrades as conversations expand. Models with 1M-token context windows show measurable performance decline as conversations grow longer, even when they remain within the nominal context limit. The degradation is not catastrophic—models do not suddenly fail—but it is consistent and cumulative.

Degradation is non-uniform. Information positioned at the beginning and end of conversations is retained more reliably than information in the middle. This "lost in the middle" effect, previously observed at shorter context lengths, persists in models designed for long contexts. The attention mechanism, it appears, has systematic positional biases that scaling the context window does not eliminate.

Different memory capabilities degrade at different rates. Factual recall ("what was X's name?") is relatively robust. Relational reasoning ("how does X's decision in conversation turn 50 affect Y's situation in turn 200?") degrades more quickly. Temporal tracking ("what changed between the first and second discussion of topic Z?") is particularly vulnerable.
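The positional and per-capability breakdowns described above can be approximated with a small grouping routine. This is a minimal sketch, not BEAM's evaluation code: the record fields (`position`, `capability`, `correct`), the helper names, and the choice of bin count are all assumptions made for illustration.

```python
from collections import defaultdict


def position_bin(relative_position, num_bins):
    """Map a position in [0, 1] to a bin index; 1.0 falls in the last bin."""
    return min(int(relative_position * num_bins), num_bins - 1)


def accuracy_by(records, key):
    """Return accuracy per group, where key(record) names the group.

    Each record is a dict with a boolean "correct" field (hypothetical schema).
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for record in records:
        group = key(record)
        totals[group][0] += int(record["correct"])
        totals[group][1] += 1
    return {group: c / t for group, (c, t) in totals.items()}


def middle_gap(bin_accuracy, num_bins):
    """Mean accuracy of the two edge bins minus mean accuracy of interior bins.

    A positive value means evidence near the start or end of the context was
    answered more reliably than evidence in the middle. Returns None when
    either the edges or the interior have no data.
    """
    edges = [bin_accuracy[b] for b in (0, num_bins - 1) if b in bin_accuracy]
    middle = [bin_accuracy[b] for b in range(1, num_bins - 1) if b in bin_accuracy]
    if not edges or not middle:
        return None
    return sum(edges) / len(edges) - sum(middle) / len(middle)
```

Calling `accuracy_by(records, lambda r: position_bin(r["position"], 5))` gives the positional view, and `accuracy_by(records, lambda r: r["capability"])` gives the per-capability view. Running the latter separately on short and long conversations would show which capabilities lose the most accuracy as length grows.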

LIGHT: A Cognition-Inspired Memory System

To address these limitations, the authors propose LIGHT, a system that augments LLMs with three complementary memory components inspired by human cognition:

Long-term episodic memory: Stores summarized records of past conversation segments, retrievable by semantic similarity
Short-term working memory: Maintains the most recent and most relevant information in an active buffer
Scratchpad: Accumulates salient facts and evolving state information, updated as the conversation progresses

This three-component architecture mirrors (loosely) the distinction in cognitive psychology between episodic memory, working memory, and note-taking strategies. The idea is that rather than relying solely on the attention mechanism to manage all context, the system offloads different types of memory to appropriate structures.
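To make the division of labor concrete, here is a toy sketch of a three-part memory. It is not the authors' LIGHT implementation: the class name, the method names, the token-overlap similarity (a stand-in for semantic similarity), and the truncation-based "summary" (a stand-in for a summarization model) are all assumptions chosen to keep the example self-contained.

```python
from collections import deque


def _tokens(text):
    """Lowercased whitespace tokens; a crude stand-in for an embedding."""
    return set(text.lower().split())


def _jaccard(a, b):
    """Token-set overlap in [0, 1]; used here instead of semantic similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class ThreePartMemory:
    """Toy episodic + working + scratchpad memory (hypothetical design)."""

    def __init__(self, working_size=4, segment_size=4):
        self.working = deque(maxlen=working_size)  # most recent turns only
        self.episodes = []       # summaries of completed conversation segments
        self.scratchpad = {}     # salient facts; the latest value wins
        self._pending = []       # turns waiting to be summarized
        self.segment_size = segment_size

    def add_turn(self, text, facts=None):
        """Record a turn; `facts` is an optional dict from a hypothetical extractor."""
        self.working.append(text)
        self._pending.append(text)
        if facts:
            self.scratchpad.update(facts)  # overwrite stale state
        if len(self._pending) >= self.segment_size:
            self.episodes.append(self._summarize(self._pending))
            self._pending = []

    @staticmethod
    def _summarize(turns):
        # Placeholder: a real system would call a summarization model here.
        return " | ".join(turn[:80] for turn in turns)

    def build_context(self, question, top_k=2):
        """Collect what would be placed in the prompt for this question."""
        query = _tokens(question)
        ranked = sorted(
            self.episodes,
            key=lambda episode: _jaccard(query, _tokens(episode)),
            reverse=True,
        )
        return {
            "episodic": ranked[:top_k],
            "working": list(self.working),
            "scratchpad": dict(self.scratchpad),
        }
```

The point of the sketch is the routing. Older material is compressed and retrieved on demand, recent turns stay verbatim, and evolving state lives in a structure that is overwritten rather than appended to, so the prompt stays small regardless of conversation length.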

LIGHT achieves consistent improvements across different models, with performance gains ranging from 3.5% to 12.69% depending on the base model and task type. The ablation studies indicate that each memory component contributes: removing any one of the three components degrades performance, which suggests that they address different aspects of the memory problem.

Critical Analysis: Claims and Evidence

Claim	Evidence	Verdict
Models with 1M-token context struggle as conversations expand	BEAM benchmark evaluation across multiple models	✅ Supported
"Lost in the middle" persists in long-context models	Positional analysis of retrieval accuracy across context positions	✅ Supported
Different memory types degrade at different rates	Task-specific analysis (factual, relational, temporal)	✅ Supported
LIGHT improves performance by 3.5-12.69%	Comparison with and without LIGHT across models	✅ Supported
Each LIGHT component is independently valuable	Ablation removing each component individually	✅ Supported

The benchmark design is a strength: generating thematically diverse conversations with validated questions is more representative of real use than needle-in-a-haystack tests. A limitation is that the conversations are synthetic (generated by the framework rather than collected from real users), and it is unclear whether the patterns of information distribution in synthetic conversations match those in natural interaction.

Open Questions

Architectural solutions: Is the "lost in the middle" effect an inherent property of self-attention, or can architectural modifications (different positional encodings, hierarchical attention, memory-augmented transformers) eliminate it? LIGHT works around the problem rather than solving it at the architecture level.

Scaling behavior: Does the degradation get worse as context windows grow from 1M to 10M tokens, or does it plateau? Understanding the scaling curve would inform whether even larger context windows are worth pursuing.

Training-time solutions: Could models be trained specifically to attend uniformly across positions? A training objective or curriculum that explicitly penalizes positional bias might address the root cause rather than the symptom.

Practical implications for RAG vs. long context: If long-context models systematically lose information in the middle, this strengthens the case for retrieval-augmented generation (RAG) as a complementary approach—retrieve the relevant pieces rather than hoping the model attends to them in a long context.

Memory system overhead: LIGHT adds computational overhead for memory management. What is the cost-benefit tradeoff compared to simply using a model with a larger context window? At what conversation length does the memory system's benefit exceed its cost?

What This Means for Your Research

The practical takeaway is clear: do not assume that a model with a million-token context window will effectively use all million tokens. If your application involves long conversations, multi-document analysis, or any scenario where information is distributed across a large context, you should expect degradation—particularly for information that falls in the middle of the context.

LIGHT demonstrates that external memory systems can partially compensate for this limitation, and the cognitive-science-inspired design (separating episodic memory, working memory, and active note-taking) provides a principled framework for building such systems.

For the broader field, these findings suggest that the context window race may be producing diminishing returns. To take a hypothetical case, a 10M-token window that cannot reliably use its middle 8M tokens would be less useful in practice than a 200K-token window paired with an effective memory system.

Explore related work through ORAA ResearchBrain.

References (1)

[1] Tavakoli, M., Salemi, A., Ye, C., Abdalla, M., Zamani, H., & Mitchell, J.R. (2025). Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs. arXiv:2510.27246.
