Context window

The context window is the amount of text a language model can see and process in a single call. It determines how many instructions and documents, and how much chat history, you can include before the model simply forgets the oldest parts.

What is a context window?

A language model's context window is the maximum amount of text it can process in one call. Everything you send in (system instruction, user question, retrieved documents, chat history) plus everything the model generates has to fit inside that window. The unit is typically not words but tokens, the building blocks the model works with.
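
For illustration, a minimal budget check, with hypothetical round numbers rather than any provider's real limits:

    # Input and output together must fit inside the window.
    CONTEXT_WINDOW = 200_000   # hypothetical window size

    prompt_tokens = 150_000    # system prompt + question + documents + history
    max_output_tokens = 8_000  # room reserved for the model's answer

    if prompt_tokens + max_output_tokens > CONTEXT_WINDOW:
        raise ValueError("Request exceeds the context window: "
                         "trim the prompt or reserve less output.")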

Think of the context window as a model's short-term memory. A human can hold around seven things in mind at once. A modern language model can hold hundreds of thousands, but that is still finite. Anything that falls out of the window is simply invisible to the model.

Context windows have grown quickly. GPT-3 launched in 2020 with a window of roughly 2,000 tokens. Today Claude, GPT-4.x, and Gemini ship with windows of 200,000 up to 2 million tokens. That sounds practically unlimited, but in practice there are plenty of caveats.

Why the context window matters

It bounds what you can pass in. An 80-page contract does not fit in a 4,000-token window. A full codebase does not fit in 32,000 tokens. For large documents you either need a big window or a smart way to send only the relevant pieces.

It drives what you pay. Most APIs bill per token, for input and output alike. A 100,000-token prompt on every call quickly becomes expensive. Caching and deliberate context pruning are not luxuries in production systems; they are requirements.
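
A back-of-the-envelope calculation makes the point; the rate below is a hypothetical placeholder, so substitute your provider's actual pricing:

    PRICE_PER_MILLION_INPUT = 3.00  # hypothetical: $3 per million input tokens

    prompt_tokens = 100_000         # sent on every call
    calls_per_day = 10_000

    daily_cost = prompt_tokens * calls_per_day * PRICE_PER_MILLION_INPUT / 1_000_000
    print(f"${daily_cost:,.0f} per day on input tokens alone")  # $3,000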

It affects quality. Where information sits in the window determines how reliably the model uses it: this is the so-called lost-in-the-middle effect, where models recall the beginning and end of a long prompt better than the middle.

Tokens, not words

A token is roughly three-quarters of a word in English; in languages such as Dutch, Chinese, or Arabic the same text typically breaks into more tokens. What counts as one token depends on the model's tokeniser. A few rules of thumb for English:

  • 100 tokens is roughly 75 words or five short sentences.

  • 1,000 tokens is about one A4 page of text.

  • 100,000 tokens is a short novel of around 300 paperback pages.

  • 1 million tokens covers roughly five books or a medium-sized codebase.

Names, numbers, URLs, and abbreviations often break into more tokens than you expect. Count them with the model's own tokeniser before sizing your prompts.
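
For OpenAI models the open-source tiktoken library does this; Anthropic and Google expose token counters of their own. A minimal sketch:

    import tiktoken

    # cl100k_base is the encoding used by GPT-4-era OpenAI models.
    enc = tiktoken.get_encoding("cl100k_base")

    text = "Visit https://example.com/api/v2 before 2026-04-23."
    print(len(enc.encode(text)))  # more tokens than a word count suggests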

How to work with a limited context window

  1. Retrieval-Augmented Generation
    Instead of sending every document, RAG fetches only the relevant passages and passes those in. A knowledge base of gigabytes fits in a few thousand tokens of context this way.

  2. Summarising on the fly
    In long conversations, periodically have the model summarise what was discussed and drop the oldest messages. The less noise, the better the answer.

  3. Chunking by task
    Split large documents and have the model do a sub-task per chunk, then merge the outputs. This works well for summarisation, extraction, and comparison; see the sketch after this list.

  4. Prompt caching
    APIs that offer prompt caching (Anthropic, OpenAI) let you reuse a large system prompt or document across calls at a steep discount: cached input tokens are billed at a fraction of the normal rate, which can cut input costs by up to 10x.
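
As an example of strategy 3, here is a minimal chunking sketch. The token budget is an arbitrary choice, and summarise_chunk is a hypothetical stand-in for whatever model call you would use:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def chunk_by_tokens(text: str, budget: int = 2_000) -> list[str]:
        """Split text into pieces that each fit a token budget."""
        ids = enc.encode(text)
        return [enc.decode(ids[i:i + budget]) for i in range(0, len(ids), budget)]

    def summarise_chunk(chunk: str) -> str:
        # Hypothetical stand-in: replace with a real model call.
        return chunk[:100]

    def summarise_document(document: str) -> str:
        # One sub-task per chunk, then merge the partial outputs.
        partials = [summarise_chunk(c) for c in chunk_by_tokens(document)]
        return "\n".join(partials)

Note that slicing on raw token boundaries can split a word midway; in practice you would round chunk edges to sentence or paragraph breaks.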

Bigger is not always better

A million-token window sounds impressive, but it does not solve every problem.

Quality degrades with length. Research shows that models struggle to recall facts from the middle of very long contexts, and performance on needle-in-a-haystack tests measurably drops as contexts grow from 32,000 toward 128,000 tokens.

Cost scales linearly. Every token you send in gets billed, relevant or not. A well-designed RAG pipeline with 5,000 tokens of context almost always beats a brute-force prompt of 500,000 tokens.

Latency scales too. Large prompts take more compute and therefore more time to process. For interactive use cases, a 30-second response is often unusable.

In practice you combine a generous context window with a thoughtful pipeline: smart retrieval, smart summarisation, smart caching. That gets you the best of both worlds.

Last Updated: April 23, 2026