
Context Rot: Why Million-Token LLMs Still Lose Information in the Middle

Models advertise million-token context windows, but can they actually use all that context? Tavakoli et al. (2025) benchmark long-term memory in LLMs and find non-uniform degradation: performance drops as conversations expand, with information in the middle of long contexts systematically neglected.

By Sean K.S. Shin
This blog summarizes research trends based on published paper abstracts. Specific numbers or findings may contain inaccuracies. For scholarly rigor, always consult the original papers cited in each post.

The context window race has produced impressive numbers: 128K tokens, 200K, 1M, 10M. Model providers compete on this specification as though a larger context window were straightforwardly better: more context means more information available, which should mean better answers. The assumption seems obvious.

It is also wrong, or at least seriously incomplete. Tavakoli et al. (2025) demonstrate that having a large context window and effectively using that context window are different things. Their benchmark, BEAM, reveals that models with million-token context windows exhibit systematic performance degradation as conversations expand, with information positioned in the middle of long contexts particularly vulnerable to being ignored or forgotten.

The Research Landscape

The "lost in the middle" phenomenon has been documented in prior work: when relevant information is placed in the middle of a long context (rather than at the beginning or end), models are less likely to attend to it correctly. This effect was initially observed with shorter contexts, and there was hope that models specifically trained for long contexts would overcome it.

Tavakoli et al. test this hope systematically. Their contributions are twofold: a benchmark for evaluating long-context memory, and a memory-augmentation system designed to address the failures the benchmark reveals.

BEAM: A Long-Context Memory Benchmark

BEAM (the benchmark) takes a different approach from most long-context evaluations. Rather than testing whether a model can find a needle in a haystack (a single piece of information buried in irrelevant text), BEAM generates extended conversations (scaling up to 10 million tokens) with thematic variety and accompanying questions that test diverse memory capabilities.

The benchmark contains 100 conversations with 2,000 validated questions, designed to test whether models can track facts, relationships, and evolving information across conversation lengths that approach and exceed their advertised context windows. This design choice matters: real-world use of long context involves sustained, thematically complex interaction, not artificial retrieval from random noise.

What the Benchmark Reveals

The findings are concerning for anyone relying on long context windows as a substitute for structured memory:

Performance degrades as conversations expand. Models with 1M-token context windows show measurable performance decline as conversations grow longer, even when they remain within the nominal context limit. The degradation is not catastrophic (models do not suddenly fail), but it is consistent and cumulative.

Degradation is non-uniform. Information positioned at the beginning and end of conversations is retained more reliably than information in the middle. This "lost in the middle" effect, previously observed at shorter context lengths, persists in models designed for long contexts. The attention mechanism, it appears, has systematic positional biases that scaling the context window does not eliminate.

Different memory capabilities degrade at different rates. Factual recall ("what was X's name?") is relatively robust. Relational reasoning ("how does X's decision in conversation turn 50 affect Y's situation in turn 200?") degrades more quickly. Temporal tracking ("what changed between the first and second discussion of topic Z?") is particularly vulnerable.
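
The positional and per-capability breakdowns described above amount to grouping graded answers in two ways. The following is a minimal sketch of that analysis, not the paper's evaluation code: the `Result` record, the three-way `position_bucket` split, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Result:
    """One graded question (hypothetical record, not BEAM's actual schema)."""
    category: str        # e.g. "factual", "relational", "temporal"
    evidence_pos: float  # relative position of the supporting turn, 0.0 to 1.0
    correct: bool


def position_bucket(pos: float) -> str:
    """Map a relative position to a coarse region of the conversation."""
    if pos < 1 / 3:
        return "start"
    if pos < 2 / 3:
        return "middle"
    return "end"


def accuracy_by(results, key):
    """Return {group: fraction correct} for any grouping function `key`."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in results:
        group = key(r)
        totals[group] += 1
        hits[group] += int(r.correct)
    return {g: hits[g] / totals[g] for g in totals}


# Toy data invented for illustration only; these are not benchmark results.
results = [
    Result("factual", 0.05, True),
    Result("factual", 0.50, True),
    Result("relational", 0.45, False),
    Result("temporal", 0.55, False),
    Result("temporal", 0.95, True),
    Result("relational", 0.10, True),
]

by_position = accuracy_by(results, lambda r: position_bucket(r.evidence_pos))
by_category = accuracy_by(results, lambda r: r.category)
print(by_position)
print(by_category)
```

A "lost in the middle" pattern would show up as a lower value for the `middle` bucket than for `start` and `end`; comparing `by_category` values would show which capabilities degrade fastest.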

LIGHT: A Cognition-Inspired Memory System

To address these limitations, the authors propose LIGHT, a system that augments LLMs with three complementary memory components inspired by human cognition:

  • Long-term episodic memory: Stores summarized records of past conversation segments, retrievable by semantic similarity
  • Short-term working memory: Maintains the most recent and most relevant information in an active buffer
  • Scratchpad: Accumulates salient facts and evolving state information, updated as the conversation progresses

This three-component architecture mirrors (loosely) the distinction in cognitive psychology between episodic memory, working memory, and note-taking strategies. The idea is that rather than relying solely on the attention mechanism to manage all context, the system offloads different types of memory to appropriate structures.
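
To make the division of labor concrete, here is a toy sketch of a three-store memory. It is not the authors' LIGHT implementation: the class name `ThreePartMemory`, the word-overlap `overlap` function (a crude stand-in for semantic similarity), and the truncation used in place of real summarization are all assumptions chosen to keep the example self-contained.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


def overlap(a: str, b: str) -> float:
    """Jaccard word overlap; a crude stand-in for semantic similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


@dataclass
class ThreePartMemory:
    working_size: int = 4                           # turns kept verbatim
    episodic: list = field(default_factory=list)    # records of older turns
    working: deque = field(default_factory=deque)   # most recent turns
    scratchpad: dict = field(default_factory=dict)  # salient facts, latest value wins

    def add_turn(self, text: str, facts: Optional[dict] = None) -> None:
        """Record a turn; move the oldest working turn into episodic memory."""
        self.working.append(text)
        if len(self.working) > self.working_size:
            evicted = self.working.popleft()
            # Placeholder "summary": truncate. A real system would summarize.
            self.episodic.append(evicted[:80])
        if facts:
            self.scratchpad.update(facts)

    def build_context(self, query: str, k: int = 2) -> str:
        """Assemble a bounded prompt from all three stores."""
        recalled = sorted(
            self.episodic, key=lambda s: overlap(s, query), reverse=True
        )[:k]
        notes = [f"{name}: {value}" for name, value in self.scratchpad.items()]
        return "\n".join(
            ["[scratchpad]", *notes, "[episodic]", *recalled, "[recent]", *self.working]
        )


memory = ThreePartMemory(working_size=2)
memory.add_turn("Alice says the launch date is March 3", {"launch_date": "March 3"})
memory.add_turn("Bob asks about the catering budget")
memory.add_turn("Alice moves the launch date to April 9", {"launch_date": "April 9"})
memory.add_turn("Bob confirms the venue")
context = memory.build_context("when is the launch date")
print(context)
```

The point of the sketch is that the prompt stays bounded regardless of conversation length: the scratchpad carries the current state of evolving facts, the episodic store is searched rather than read in full, and only the last few turns are included verbatim.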

LIGHT achieves consistent improvements across different models, with performance gains ranging from 3.5% to 12.69% depending on the base model and task type. The ablation studies validate that each memory component contributes: removing any one of the three components degrades performance, confirming that they address different aspects of the memory problem.
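
The ablation logic itself is simple: score the full system, then re-score with one component switched off at a time. The sketch below assumes a hypothetical `evaluate` callable that takes the set of enabled components and returns a score; `toy_evaluate` uses invented integer scores purely so the example runs, and says nothing about the paper's actual numbers.

```python
from typing import Callable, Dict, FrozenSet

COMPONENTS = frozenset({"episodic", "working", "scratchpad"})


def run_ablation(evaluate: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Return, per component, the score drop when only that component is removed."""
    full = evaluate(COMPONENTS)
    return {c: full - evaluate(COMPONENTS - {c}) for c in sorted(COMPONENTS)}


def toy_evaluate(enabled: FrozenSet[str]) -> int:
    """Stand-in scorer with invented numbers (points out of 100)."""
    toy_contribution = {"episodic": 4, "working": 3, "scratchpad": 2}
    return 50 + sum(toy_contribution[c] for c in enabled)


drops = run_ablation(toy_evaluate)
print(drops)
```

A positive drop for every component is what "each component contributes" means operationally. In a real run, `evaluate` would grade model answers on benchmark questions instead of summing fixed values.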

Critical Analysis: Claims and Evidence

| Claim | Evidence | Verdict |
| --- | --- | --- |
| Models with 1M-token context struggle as conversations expand | BEAM benchmark evaluation across multiple models | ✅ Supported |
| "Lost in the middle" persists in long-context models | Positional analysis of retrieval accuracy across context positions | ✅ Supported |
| Different memory types degrade at different rates | Task-specific analysis (factual, relational, temporal) | ✅ Supported |
| LIGHT improves performance by 3.5-12.69% | Comparison with and without LIGHT across models | ✅ Supported |
| Each LIGHT component is independently valuable | Ablation removing each component individually | ✅ Supported |

The benchmark design is a strength: generating thematically diverse conversations with validated questions is more representative of real use than needle-in-a-haystack tests. A limitation is that the conversations are synthetic (generated by the framework rather than collected from real users), and it is unclear whether the patterns of information distribution in synthetic conversations match those in natural interaction.

Open Questions

  • Architectural solutions: Is the "lost in the middle" effect an inherent property of self-attention, or can architectural modifications (different positional encodings, hierarchical attention, memory-augmented transformers) eliminate it? LIGHT works around the problem rather than solving it at the architecture level.
  • Scaling behavior: Does the degradation get worse as context windows grow from 1M to 10M tokens, or does it plateau? Understanding the scaling curve would inform whether even larger context windows are worth pursuing.
  • Training-time solutions: Could models be trained specifically to attend uniformly across positions? Curriculum training that explicitly penalizes positional bias might address the root cause rather than the symptom.
  • Practical implications for RAG vs. long context: If long-context models systematically lose information in the middle, this strengthens the case for retrieval-augmented generation (RAG) as a complementary approach: retrieve the relevant pieces rather than hoping the model attends to them in a long context.
  • Memory system overhead: LIGHT adds computational overhead for memory management. What is the cost-benefit tradeoff compared to simply using a model with a larger context window? At what conversation length does the memory system's benefit exceed its cost?
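
The overhead question can at least be framed with a back-of-the-envelope token count. The sketch below is a toy cost model, not a measurement. It assumes the full-history approach re-reads the entire conversation on every turn (no prompt caching) and that the memory approach pays a fixed prompt budget plus a fixed upkeep overhead per turn. It also ignores accuracy entirely. All parameter names and values are hypothetical.

```python
def first_cheaper_turn(tokens_per_turn, memory_budget, overhead, max_turns=1_000_000):
    """Return the first turn at which cumulative prompt tokens for the
    full-history approach exceed those of the bounded-memory approach,
    or None if that never happens within max_turns."""
    full_total = 0
    memory_total = 0
    for turn in range(1, max_turns + 1):
        full_total += turn * tokens_per_turn      # re-read the entire history
        memory_total += memory_budget + overhead  # bounded prompt plus upkeep
        if full_total > memory_total:
            return turn
    return None


# Invented numbers: 500-token turns, an 8,000-token prompt budget,
# and 2,000 tokens per turn spent on summarization and retrieval.
print(first_cheaper_turn(500, 8_000, 2_000))  # 40
```

Because the memory approach is charged its full budget from the first turn, this model overstates its early cost; a real comparison would also need to weigh latency and the accuracy difference between the two approaches.
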

What This Means for Your Research

The practical takeaway is clear: do not assume that a model with a million-token context window will effectively use all million tokens. If your application involves long conversations, multi-document analysis, or any scenario where information is distributed across a large context, you should expect degradation, particularly for information that falls in the middle of the context.

LIGHT demonstrates that external memory systems can partially compensate for this limitation, and the cognitive-science-inspired design (separating episodic memory, working memory, and active note-taking) provides a principled framework for building such systems.

For the broader field, these findings suggest that the context window race may be producing diminishing returns. A 10M-token window that cannot reliably use its middle 8M tokens is less useful than a 200K-token window paired with an effective memory system.


References

[1] Tavakoli, M., Salemi, A., Ye, C., Abdalla, M., Zamani, H., & Mitchell, J.R. (2025). Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs. arXiv:2510.27246.
