Context Window Management
Session 5.7 · ~5 min read
More Context Is Not Always Better
Every AI model has a context window: the maximum amount of text it can process in a single request. Claude handles 200,000 tokens, GPT-4 handles 128,000, and Gemini handles up to 2 million. These numbers suggest you can throw everything at the model and let it sort things out. Research shows this is a bad idea.
Stanford and UC Berkeley researchers documented the "lost in the middle" problem in 2023: models attend well to the beginning and end of context but poorly to the middle. Accuracy dropped by more than 30% when relevant information was placed in middle positions. A 2025 study by Chroma tested 18 frontier models and found that every single one gets worse as input length increases, a phenomenon now called "context rot."
Context windows have an effective limit that is far below the advertised limit. A model that accepts 200,000 tokens does not perform equally well across all 200,000 tokens. Performance degrades as context grows. The skill is not filling the window. The skill is filling it with exactly what matters.
What Goes In, What Stays Out
Every token in your context window competes for the model's attention. Irrelevant context does not just waste space; it actively degrades performance. Chroma's research found that semantically similar but irrelevant content misleads the model, producing worse results than having no context at all.
| Include | Exclude |
|---|---|
| System prompt (voice, constraints, rules) | General background information the model already knows |
| Specific facts the model needs for this task | Tangentially related research |
| Few-shot examples (2-3 maximum) | Every example you have ever collected |
| The exact research sources for this piece | Your entire research library |
| Structural template for the output | Templates for other content types |
| Relevant previous chapter (for sequential content) | All previous chapters |
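The include/exclude split above can be enforced mechanically when assembling a prompt. The sketch below is illustrative, not part of this lesson: the component names, the rough ~4-characters-per-token estimate, and the budget value are all assumptions.

```python
# Sketch of a context assembler that includes only whitelisted components.
# Component names and the token budget are illustrative assumptions.

def estimate_tokens(text: str) -> int:
    """Rough token estimate (~4 characters per token for English text)."""
    return max(1, len(text) // 4)

def assemble_context(components: dict[str, str],
                     include: list[str], budget: int) -> str:
    """Concatenate only the whitelisted components, stopping at the budget."""
    parts, used = [], 0
    for name in include:
        text = components.get(name, "")
        cost = estimate_tokens(text)
        if used + cost > budget:
            break  # leave the rest out rather than overfill the window
        parts.append(text)
        used += cost
    return "\n\n".join(parts)

components = {
    "system_prompt": "You are a technical editor. Keep the house voice.",
    "task_facts": "Product X launched in 2024 with feature Y.",
    "research_library": "...thousands of tangentially related notes...",
}
# The research library exists, but it is never whitelisted for this task.
context = assemble_context(components,
                           ["system_prompt", "task_facts"], budget=2000)
```

The point of the whitelist is that exclusion is the default: anything not named in `include` never reaches the model, no matter how much of it you have on hand.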
Context Strategy for Content Production
A practical context strategy has three tiers. Each tier adds context only if the previous tier did not produce sufficient quality.
```mermaid
flowchart TD
    A["Tier 1: Minimal<br/>System prompt + task + template<br/>(~2,000 tokens)"] --> B{"Output quality<br/>sufficient?"}
    B -->|Yes| C["Use this output"]
    B -->|No| D["Tier 2: Enriched<br/>+ research brief + 2 examples<br/>(~5,000-10,000 tokens)"]
    D --> E{"Output quality<br/>sufficient?"}
    E -->|Yes| F["Use this output"]
    E -->|No| G["Tier 3: Maximum<br/>+ full sources + extended examples<br/>(~20,000-50,000 tokens)"]
    G --> H["Use this output<br/>(review carefully)"]
    style A fill:#222221,stroke:#6b8f71,color:#ede9e3
    style D fill:#222221,stroke:#c8a882,color:#ede9e3
    style G fill:#222221,stroke:#c47a5a,color:#ede9e3
```
Start with Tier 1. If the output is missing specifics that only your research sources contain, move to Tier 2. Only go to Tier 3 when the content genuinely requires extensive source material, such as a deeply technical article or a chapter that must reference multiple previous chapters.
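The escalation can be sketched as a loop over tiers. Here `generate` and `quality_ok` are hypothetical stand-ins for your model call and your review step; the tier component lists mirror the diagram above.

```python
# Sketch of the three-tier escalation strategy. `generate` and `quality_ok`
# are hypothetical stand-ins for the model call and the quality review.

TIERS = [
    ("minimal", ["system_prompt", "task", "template"]),        # ~2k tokens
    ("enriched", ["system_prompt", "task", "template",
                  "research_brief", "examples"]),              # ~5-10k tokens
    ("maximum", ["system_prompt", "task", "template",
                 "full_sources", "extended_examples"]),        # ~20-50k tokens
]

def run_tiered(components: dict[str, str], generate, quality_ok):
    """Try each tier in order; stop at the first output that passes review."""
    for name, keys in TIERS:
        prompt = "\n\n".join(components[k] for k in keys if k in components)
        output = generate(prompt)
        if quality_ok(output):
            return name, output
    # Nothing passed: return the Tier 3 output for careful manual review.
    return "maximum", output
```

The design choice worth noting: each tier rebuilds the prompt from scratch rather than appending to the previous one, so a Tier 2 run never inherits stale Tier 1 framing.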
Context Placement Matters
Given the "lost in the middle" problem, where you place information within the context window affects how well the model uses it. Critical information should appear at the beginning (system prompt, most important constraints) or the end (the specific task, the most relevant source). Supporting information goes in the middle, where it will receive less attention but still contribute to the overall output.
For production prompts, this means structuring your input deliberately:
- Beginning: System prompt, voice rules, critical constraints
- Middle: Research sources, background information, examples
- End: The specific task, output format, final reminders of key rules
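The three-position layout above can be expressed as a small assembly function. This is a minimal sketch with illustrative section contents; the function and its arguments are assumptions, not a prescribed API.

```python
# Sketch of placement-aware prompt assembly: critical material at the start
# and end, supporting material in the middle. Contents are illustrative.

def build_prompt(system: str, constraints: str, sources: list[str],
                 task: str, reminders: str) -> str:
    beginning = [system, constraints]   # attended to strongly
    middle = sources                    # attended to weakly
    end = [task, reminders]             # attended to strongly
    return "\n\n".join(beginning + middle + end)

prompt = build_prompt(
    system="You write in the house voice.",
    constraints="Never exceed 800 words.",
    sources=["Source A...", "Source B..."],
    task="Write the product review.",
    reminders="Reminder: never exceed 800 words.",
)
```

Note that the word-limit rule appears twice, once in `constraints` at the beginning and once in `reminders` at the end, which is exactly the duplication the placement guidance calls for.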
Repeating your most important instructions at both the beginning and end of the context is not redundant. It is strategic. The model weights the beginning and end more heavily, so placing critical rules in both positions increases compliance.
Measuring Context Efficiency
Track the ratio of context tokens to output quality. If doubling your context from 5,000 to 10,000 tokens produces a noticeable quality improvement, the additional context was worth it. If doubling it again to 20,000 tokens produces no visible improvement, you have found the point of diminishing returns for that content type.
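One way to track this is to log (context tokens, quality score) pairs across runs and compute the marginal gain per extra thousand tokens. The scores below are illustrative; in practice they would come from your own rubric or review pass.

```python
# Sketch of locating diminishing returns from logged runs of the same task.
# Quality scores (0-10) are illustrative assumptions.

def marginal_gain(results: list[tuple[int, float]]) -> list[tuple[int, float]]:
    """Quality gained per extra 1,000 context tokens between successive runs."""
    gains = []
    for (t0, q0), (t1, q1) in zip(results, results[1:]):
        gains.append((t1, (q1 - q0) / ((t1 - t0) / 1000)))
    return gains

# (context tokens, quality score) from three runs of the same task
runs = [(5_000, 6.0), (10_000, 8.0), (20_000, 8.1)]
for tokens, gain in marginal_gain(runs):
    print(f"{tokens:>6} tokens: +{gain:.2f} quality per 1k extra tokens")
```

In this made-up series, the jump from 5,000 to 10,000 tokens pays for itself, while the jump to 20,000 barely moves the score: that second step is the point of diminishing returns.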
Further Reading
- The 'Lost in the Middle' Problem, DEV Community
- Context Rot: Why LLMs Degrade as Context Grows, Morph
- Context Window Management for LLM Apps, Redis
Assignment
Take a real production task that requires substantial context (e.g., writing a review based on research). Create three versions of the prompt: one with minimal context (just the task instructions), one with moderate context (instructions + research summary), and one with maximum context (instructions + full research sources + multiple examples). Compare output quality across all three. Find the point of diminishing returns. Document it.