Context Window Management
Session 5.7 · ~5 min read
More Context Is Not Always Better
Every AI model has a context window: the maximum amount of text it can process in a single request. Claude handles 200,000 tokens, GPT-4 handles 128,000, and Gemini handles up to 2 million. These numbers suggest you can throw everything at the model and let it sort things out. Research shows this is a bad idea.
Stanford and UC Berkeley researchers documented the "lost in the middle" problem in 2023: models attend well to the beginning and end of context but poorly to the middle. Accuracy dropped by more than 30% when relevant information was placed in middle positions. A 2025 study by Chroma tested 18 frontier models and found that every single one gets worse as input length increases, a phenomenon now called "context rot."
Context windows have an effective limit that is far below the advertised limit. A model that accepts 200,000 tokens does not perform equally well across all 200,000 tokens. Performance degrades as context grows. The skill is not filling the window. The skill is filling it with exactly what matters.
What Goes In, What Stays Out
Every token in your context window competes for the model's attention. Irrelevant context does not just waste space; it actively degrades performance. Chroma's research found that semantically similar but irrelevant content misleads the model, producing worse results than having no context at all.
| Include | Exclude |
|---|---|
| System prompt (voice, constraints, rules) | General background information the model already knows |
| Specific facts the model needs for this task | Tangentially related research |
| Few-shot examples (2-3 maximum) | Every example you have ever collected |
| The exact research sources for this piece | Your entire research library |
| Structural template for the output | Templates for other content types |
| Relevant previous chapter (for sequential content) | All previous chapters |
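The include/exclude split above can be enforced mechanically when assembling a prompt. The sketch below is illustrative, not part of this lesson: the component names, the rough ~4-characters-per-token estimate, and the budget value are all assumptions.

```python
# Sketch of a context assembler that includes only whitelisted components.
# Component names and the token budget are illustrative assumptions.

def estimate_tokens(text: str) -> int:
    """Rough token estimate (~4 characters per token for English text)."""
    return max(1, len(text) // 4)

def assemble_context(components: dict[str, str],
                     include: list[str], budget: int) -> str:
    """Concatenate only the whitelisted components, stopping at the budget."""
    parts, used = [], 0
    for name in include:
        text = components.get(name, "")
        cost = estimate_tokens(text)
        if used + cost > budget:
            break  # leave the rest out rather than overfill the window
        parts.append(text)
        used += cost
    return "\n\n".join(parts)

components = {
    "system_prompt": "You are a technical editor. Keep the house voice.",
    "task_facts": "Product X launched in 2024 with feature Y.",
    "research_library": "...thousands of tangentially related notes...",
}
# The research library exists, but it is never whitelisted for this task.
context = assemble_context(components,
                           ["system_prompt", "task_facts"], budget=2000)
```

The point of the whitelist is that exclusion is the default: anything not named in `include` never reaches the model, no matter how much of it you have on hand.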
Context Strategy for Content Production
A practical context strategy has three tiers. Each tier adds context only if the previous tier did not produce sufficient quality.
```mermaid
flowchart TD
    A["Tier 1: Minimal<br/>System prompt + task + template<br/>(~2,000 tokens)"] --> B{"Output quality<br/>sufficient?"}
    B -->|Yes| C["Use this output"]
    B -->|No| D["Tier 2: Enriched<br/>+ research brief + 2 examples<br/>(~5,000-10,000 tokens)"]
    D --> E{"Output quality<br/>sufficient?"}
    E -->|Yes| F["Use this output"]
    E -->|No| G["Tier 3: Maximum<br/>+ full sources + extended examples<br/>(~20,000-50,000 tokens)"]
    G --> H["Use this output<br/>(review carefully)"]
    style A fill:#222221,stroke:#6b8f71,color:#ede9e3
    style D fill:#222221,stroke:#c8a882,color:#ede9e3
    style G fill:#222221,stroke:#c47a5a,color:#ede9e3
```
Start with Tier 1. If the output is missing specifics that only your research sources contain, move to Tier 2. Only go to Tier 3 when the content genuinely requires extensive source material, such as a deeply technical article or a chapter that must reference multiple previous chapters.
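The escalation can be sketched as a loop over tiers. Here `generate` and `quality_ok` are hypothetical stand-ins for your model call and your review step; the tier component lists mirror the diagram above.

```python
# Sketch of the three-tier escalation strategy. `generate` and `quality_ok`
# are hypothetical stand-ins for the model call and the quality review.

TIERS = [
    ("minimal", ["system_prompt", "task", "template"]),        # ~2k tokens
    ("enriched", ["system_prompt", "task", "template",
                  "research_brief", "examples"]),              # ~5-10k tokens
    ("maximum", ["system_prompt", "task", "template",
                 "full_sources", "extended_examples"]),        # ~20-50k tokens
]

def run_tiered(components: dict[str, str], generate, quality_ok):
    """Try each tier in order; stop at the first output that passes review."""
    for name, keys in TIERS:
        prompt = "\n\n".join(components[k] for k in keys if k in components)
        output = generate(prompt)
        if quality_ok(output):
            return name, output
    # Nothing passed: return the Tier 3 output for careful manual review.
    return "maximum", output
```

The design choice worth noting: each tier rebuilds the prompt from scratch rather than appending to the previous one, so a Tier 2 run never inherits stale Tier 1 framing.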
Context Placement Matters
Given the "lost in the middle" problem, where you place information within the context window affects how well the model uses it. Critical information should appear at the beginning (system prompt, most important constraints) or the end (the specific task, the most relevant source). Supporting information goes in the middle, where it will receive less attention but still contribute to the overall output.
For production prompts, this means structuring your input deliberately:
- Beginning: System prompt, voice rules, critical constraints
- Middle: Research sources, background information, examples
- End: The specific task, output format, final reminders of key rules
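The three-position layout above can be expressed as a small assembly function. This is a minimal sketch with illustrative section contents; the function and its arguments are assumptions, not a prescribed API.

```python
# Sketch of placement-aware prompt assembly: critical material at the start
# and end, supporting material in the middle. Contents are illustrative.

def build_prompt(system: str, constraints: str, sources: list[str],
                 task: str, reminders: str) -> str:
    beginning = [system, constraints]   # attended to strongly
    middle = sources                    # attended to weakly
    end = [task, reminders]             # attended to strongly
    return "\n\n".join(beginning + middle + end)

prompt = build_prompt(
    system="You write in the house voice.",
    constraints="Never exceed 800 words.",
    sources=["Source A...", "Source B..."],
    task="Write the product review.",
    reminders="Reminder: never exceed 800 words.",
)
```

Note that the word-limit rule appears twice, once in `constraints` at the beginning and once in `reminders` at the end, which is exactly the duplication the placement guidance calls for.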
Repeating your most important instructions at both the beginning and end of the context is not redundant. It is strategic. The model weights the beginning and end more heavily, so placing critical rules in both positions increases compliance.
Measuring Context Efficiency
Track the ratio of context tokens to output quality. If doubling your context from 5,000 to 10,000 tokens produces a noticeable quality improvement, the additional context was worth it. If doubling it again to 20,000 tokens produces no visible improvement, you have found the point of diminishing returns for that content type.
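One way to track this is to log (context tokens, quality score) pairs across runs and compute the marginal gain per extra thousand tokens. The scores below are illustrative; in practice they would come from your own rubric or review pass.

```python
# Sketch of locating diminishing returns from logged runs of the same task.
# Quality scores (0-10) are illustrative assumptions.

def marginal_gain(results: list[tuple[int, float]]) -> list[tuple[int, float]]:
    """Quality gained per extra 1,000 context tokens between successive runs."""
    gains = []
    for (t0, q0), (t1, q1) in zip(results, results[1:]):
        gains.append((t1, (q1 - q0) / ((t1 - t0) / 1000)))
    return gains

# (context tokens, quality score) from three runs of the same task
runs = [(5_000, 6.0), (10_000, 8.0), (20_000, 8.1)]
for tokens, gain in marginal_gain(runs):
    print(f"{tokens:>6} tokens: +{gain:.2f} quality per 1k extra tokens")
```

In this made-up series, the jump from 5,000 to 10,000 tokens pays for itself, while the jump to 20,000 barely moves the score: that second step is the point of diminishing returns.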
Further Reading
- The 'Lost in the Middle' Problem, DEV Community
- Context Rot: Why LLMs Degrade as Context Grows, Morph
- Context Window Management for LLM Apps, Redis
Assignment
Take a real production task that requires substantial context (e.g., writing a review based on research). Create three versions of the prompt: one with minimal context (just the task instructions), one with moderate context (instructions + research summary), and one with maximum context (instructions + full research sources + multiple examples). Compare output quality across all three. Find the point of diminishing returns. Document it.