Context window

The context window is the amount of text a language model can see and process in a single call. It determines how many instructions and documents, and how much chat history, you can include before the model simply forgets the oldest parts.

What is a context window?

A language model's context window is the maximum amount of text it can process in one call. Everything you send in (system instruction, user question, retrieved documents, chat history) plus everything the model generates has to fit inside that window. The unit is typically not words but tokens, the building blocks the model works with.
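
For illustration, a minimal budget check, with hypothetical round numbers rather than any provider's real limits:

    # Input and output together must fit inside the window.
    CONTEXT_WINDOW = 200_000   # hypothetical window size

    prompt_tokens = 150_000    # system prompt + question + documents + history
    max_output_tokens = 8_000  # room reserved for the model's answer

    if prompt_tokens + max_output_tokens > CONTEXT_WINDOW:
        raise ValueError("Request exceeds the context window: "
                         "trim the prompt or reserve less output.")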

Think of the context window as a model's short-term memory. A human can hold around seven things in mind at once. A modern language model can hold hundreds of thousands, but that is still finite. Anything that falls out of the window is simply invisible to the model.

Context windows have grown quickly. GPT-3 launched in 2020 with a window of roughly 2,000 tokens. Today Claude, GPT-4.x, and Gemini ship with windows of 200,000 up to 2 million tokens. That sounds practically unlimited, but in practice there are plenty of caveats.

Why the context window matters

It bounds what you can pass in. An 80-page contract does not fit in a 4,000-token window. A full codebase does not fit in 32,000 tokens. For large documents you either need a big window or a smart way to send only the relevant pieces.

It drives what you pay. Most APIs bill per token, for input and output alike. A 100,000-token prompt on every call quickly becomes expensive. Caching and deliberate context pruning are not luxuries in production systems; they are requirements.
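
A back-of-the-envelope calculation makes the point; the rate below is a hypothetical placeholder, so substitute your provider's actual pricing:

    PRICE_PER_MILLION_INPUT = 3.00  # hypothetical: $3 per million input tokens

    prompt_tokens = 100_000         # sent on every call
    calls_per_day = 10_000

    daily_cost = prompt_tokens * calls_per_day * PRICE_PER_MILLION_INPUT / 1_000_000
    print(f"${daily_cost:,.0f} per day on input tokens alone")  # $3,000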

It affects quality. Where information sits in the window determines how reliably the model uses it: this is the so-called lost-in-the-middle effect, where models recall the beginning and end of a long prompt better than the middle.

Tokens, not words

A token is roughly three-quarters of a word in English; in languages such as Dutch, Chinese, or Arabic the same text typically breaks into more tokens. What counts as one token depends on the model's tokeniser. A few rules of thumb for English:

  • 100 tokens is roughly 75 words or five short sentences.

  • 1,000 tokens is about one A4 page of text.

  • 100,000 tokens is a short novel of around 300 paperback pages.

  • 1 million tokens covers roughly five books or a medium-sized codebase.

Names, numbers, URLs, and abbreviations often break into more tokens than you expect. Count them with the model's own tokeniser before sizing your prompts.
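
For OpenAI models the open-source tiktoken library does this; Anthropic and Google expose token counters of their own. A minimal sketch:

    import tiktoken

    # cl100k_base is the encoding used by GPT-4-era OpenAI models.
    enc = tiktoken.get_encoding("cl100k_base")

    text = "Visit https://example.com/api/v2 before 2026-04-23."
    print(len(enc.encode(text)))  # more tokens than a word count suggests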

How to work with a limited context window

  1. Retrieval-Augmented Generation
    Instead of sending every document, RAG fetches only the relevant passages and passes those in. A knowledge base of gigabytes fits in a few thousand tokens of context this way.

  2. Summarising on the fly
    In long conversations, periodically have the model summarise what was discussed and drop the oldest messages. The less noise, the better the answer.

  3. Chunking by task
    Split large documents and have the model do a sub-task per chunk, then merge the outputs. This works well for summarisation, extraction, and comparison; see the sketch after this list.

  4. Prompt caching
    APIs that offer prompt caching (Anthropic, OpenAI) let you reuse a large system prompt or document across calls at a steep discount: cached input tokens are billed at a fraction of the normal rate, which can cut input costs by up to 10x.
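
As an example of strategy 3, here is a minimal chunking sketch. The token budget is an arbitrary choice, and summarise_chunk is a hypothetical stand-in for whatever model call you would use:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def chunk_by_tokens(text: str, budget: int = 2_000) -> list[str]:
        """Split text into pieces that each fit a token budget."""
        ids = enc.encode(text)
        return [enc.decode(ids[i:i + budget]) for i in range(0, len(ids), budget)]

    def summarise_chunk(chunk: str) -> str:
        # Hypothetical stand-in: replace with a real model call.
        return chunk[:100]

    def summarise_document(document: str) -> str:
        # One sub-task per chunk, then merge the partial outputs.
        partials = [summarise_chunk(c) for c in chunk_by_tokens(document)]
        return "\n".join(partials)

Note that slicing on raw token boundaries can split a word midway; in practice you would round chunk edges to sentence or paragraph breaks.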

Bigger is not always better

A million-token window sounds impressive, but it does not solve every problem.

Quality degrades with length. Research shows that models struggle to recall facts from the middle of very long contexts, and performance on needle-in-a-haystack tests measurably drops as contexts grow from 32,000 toward 128,000 tokens.

Cost scales linearly. Every token you send in gets billed, relevant or not. A well-designed RAG pipeline with 5,000 tokens of context almost always beats a brute-force prompt of 500,000 tokens.

Latency scales too. Large prompts take more compute and therefore more time to process. For interactive use cases, a 30-second response is often unusable.

In practice you combine a generous context window with a thoughtful pipeline: smart retrieval, smart summarisation, smart caching. That gets you the best of both worlds.

Last Updated: April 23, 2026