Memory Is a Lie — Part 1 / April 2026

Your Agent's Memory Is a Lie

Every multi-agent framework on GitHub treats memory like it's a solved problem. None of them are right. The context window is structurally incapable of being memory — and the math says so.

ZH · nukez.xyz · github.com/nukez-xyz · Data current: March 2026

Every multi-agent framework on GitHub — and I looked at all 8,000 of them — treats memory like it's a solved problem. LangGraph stores state in a graph. CrewAI passes task outputs between roles. AutoGen keeps a chat transcript. They all do some version of “put tokens in, get tokens out, call it memory.”

It's not memory. It's a context window. And the context window is structurally incapable of being memory.

I spent the last week producing a full technical study on context window mechanics — how they work at the hardware level, what fills them in production, who controls them, and why they fail as a foundation for autonomous agents. This post is the result. Everything below is drawn from that research.

What a Context Window Actually Is

A context window is the total number of tokens a large language model can process in a single forward pass. It is the model's working memory — the complete set of information it can “see” at any given moment. Everything the model reads, reasons about, and generates must fit within this window. Once the window is full, the model has no awareness of anything outside it.

The term gets used loosely to mean “how much input I can send,” but technically it encompasses everything involved in a single inference call: the system prompt, injected context (retrieved documents, tool schemas, conversation history), the user's current message, and the model's own generated output. Input and output share the same finite budget. A model with a 200,000-token context window that receives 180,000 tokens of input can only generate approximately 20,000 tokens of output before hitting the ceiling.

The window is a zero-sum resource.
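The shared budget comes down to a single subtraction. A minimal sketch, assuming a hypothetical helper name (`remaining_output_budget`) that does not correspond to any provider's API:

```python
def remaining_output_budget(window_tokens: int, input_tokens: int) -> int:
    """Tokens left for generation once the input has been counted."""
    if input_tokens > window_tokens:
        raise ValueError("input alone exceeds the context window")
    return window_tokens - input_tokens


# The example from the text: 200,000-token window, 180,000 tokens of input.
print(remaining_output_budget(200_000, 180_000))  # 20000
```

In practice a provider's output cap may be lower than this remainder, so the smaller of the two applies.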

Why Windows Are Finite: O(n²) and the Three Ceilings

Context windows exist because of how transformers compute attention. In the self-attention mechanism, every token in the sequence computes a relevance score against every other token. This is the Query-Key-Value architecture: each token produces a Query vector (what am I looking for?), a Key vector (what do I contain?), and a Value vector (what information do I carry). Attention scores are computed as the dot product of Query and Key vectors, scaled and softmax-normalized, then used to weight the Value vectors.
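The Query-Key-Value computation described above can be sketched in plain Python for a single attention head. This is an illustrative toy, not a production kernel: it omits causal masking, multiple heads, and learned projections, and the input vectors are arbitrary values chosen for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for one head, on plain lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # One relevance score per (query token, key token) pair.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the Value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Three toy tokens with 2-dimensional vectors (values chosen arbitrarily).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
result = attention(Q, K, V)
print(len(result), len(result[0]))  # 3 2
```

The nested loop over queries and keys is where the n×n cost comes from: every token scores itself against every other token.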

The cost of this operation scales quadratically with sequence length: O(n²). Doubling the context window from 100K to 200K tokens doesn't double the attention compute. It quadruples it. This isn't a software limitation or an engineering oversight. It's a mathematical property of the attention mechanism itself.

Three specific resources create the hard ceiling:

Attention Matrix Computation

The n×n attention score matrix must be computed for every layer, every head, in every forward pass. For a 200K-token sequence with 96 attention heads across 80 layers, this is an enormous amount of matrix arithmetic.
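To put a number on that, here is a rough count of query-key scores per forward pass. It is a back-of-the-envelope sketch: it ignores causal masking and kernel-level optimizations, and the function name is made up for illustration.

```python
def attention_score_entries(seq_len: int, heads: int = 1, layers: int = 1) -> int:
    """Number of query-key scores in one full forward pass."""
    return seq_len * seq_len * heads * layers


# Doubling the sequence length quadruples the number of scores.
base = attention_score_entries(100_000)
doubled = attention_score_entries(200_000)
print(doubled // base)  # 4

# The configuration quoted in the text: 200K tokens, 96 heads, 80 layers.
total = attention_score_entries(200_000, heads=96, layers=80)
print(f"{total:.3e}")  # 3.072e+14
```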

KV Cache Memory

During autoregressive generation (producing tokens one at a time), the model caches the Key and Value vectors from all previous tokens to avoid recomputing them. This cache grows linearly with sequence length and consumes GPU VRAM. For long sequences, the KV cache alone can reach tens of gigabytes. A KV cache representing a 128K-token context window for a single user on Llama 3 70B consumes approximately 40 GB of memory at 16-bit precision, and that figure scales linearly with concurrent users.
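The 40 GB figure can be reproduced from the model's shape. The sketch below assumes the published Llama 3 70B configuration (80 layers, 8 key-value heads under grouped-query attention, head dimension 128) and 16-bit cache values. Those parameters are not stated in this post, so treat them as assumptions.

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache Keys and Values for one sequence."""
    # Factor of 2: one Key vector and one Value vector per token, layer, and KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len


# Assumed Llama 3 70B shape; 128K taken as 128 * 1024 tokens.
size = kv_cache_bytes(seq_len=128 * 1024, layers=80, kv_heads=8, head_dim=128)
print(size / 1024**3)  # 40.0

# Linear scaling with concurrent users: ten users need ten caches.
print(10 * size / 1024**3)  # 400.0
```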

GPU Memory Bandwidth

Even if compute is sufficient, the speed at which data can move between GPU memory and compute units creates a throughput bottleneck that worsens with sequence length.

These aren't theoretical constraints. They're physical ones. Measured in watts, bytes, and dollars.

The Models Lie About Their Windows

Here's the landscape as of March 2026:

Model	Advertised	Effective*	Max Output
Claude Opus 4.6	200K (1M beta)	~130–180K	32K
Claude Sonnet 4.5	200K (1M beta)	~130–180K	16K
GPT-5.2	256K (thinking)	~100–150K	128K
GPT-5.2 Standard	128K	~80–100K	16K
Gemini 3 Pro	1M	~300–600K	64K
Grok 4	2M (SuperGrok)	Unknown	Unknown
Llama 4 Scout	10M	Unknown	Varies

*Effective context = range where retrieval accuracy stays above 80% on needle-in-haystack benchmarks.

The advertised context window and the effective context window are not the same thing. The advertised number is the architectural maximum — the model will accept this many tokens without returning an error. The effective number is where the model still produces reliable, accurate outputs.

Independent testing consistently shows that most models begin degrading at 60–70% of their advertised maximum. A model claiming 200K tokens typically becomes unreliable around 130K, with performance drops that are sudden rather than gradual. Smaller model variants sometimes outperform larger ones on long-context tasks, because the smaller models generate more focused responses that preserve context budget for recall rather than verbose elaboration.
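As a planning aid, the 60–70% rule of thumb can be turned into a token range. This is only a sketch of the heuristic quoted above, with a hypothetical function name. Real degradation points vary by model and task.

```python
def degradation_onset(advertised_tokens: int, low_pct: int = 60, high_pct: int = 70):
    """Token range where degradation typically begins, per the 60-70% rule of thumb."""
    return advertised_tokens * low_pct // 100, advertised_tokens * high_pct // 100


print(degradation_onset(200_000))  # (120000, 140000)
```

The 130K figure cited for a 200K model sits at the midpoint of that range.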

Then there's the lost-in-the-middle problem. Liu et al. (2023) demonstrated that LLMs exhibit a characteristic U-shaped recall pattern across their context windows. Models retrieve information most reliably from the beginning and end of the context, with significant accuracy degradation for content positioned in the middle. Empirical testing in 2025 shows 85–95% accuracy for information at the beginning and end, dropping to 76–82% for middle-positioned content. The larger the context window, the worse this gets.

The Seven Layers That Fill Your Window Before You Type

In a production LLM application, the user's actual message is often a small fraction of the total context budget. Seven layers consume the window simultaneously:

1. System prompt: behavioral instructions, safety constraints, capability descriptions. In consumer products like Claude.ai, this runs to 3,000–5,000 tokens that users never see and cannot control.
2. Tool/function definitions: schemas describing every tool the model can call. Each tool's name, description, parameter types, and usage instructions are serialized into context. A single MCP server with 20 tools can consume 14,000+ tokens of definitions alone.
3. Memory and personalization: persistent user preferences and memory system data injected at conversation start. Typically hundreds of tokens, but it grows with usage.
4. Retrieved context (RAG): documents, code files, and search results injected silently to give the model knowledge beyond its training data. The user often doesn't know this is happening.
5. Conversation history: the full sequence of prior turns. Each message accumulates tokens that persist for the remainder of the conversation.
6. Current user message: the actual query.
7. Model output: the generated response, which consumes context space as it's produced, reducing the budget available for subsequent turns.

Here's what this looks like concretely for a Claude.ai Pro user, before they type a single word:

Component	~Tokens	% of 200K Window
System prompt (behavioral instructions)	3,000–5,000	1.5–2.5%
Built-in tool definitions (search, code, etc.)	10,000–15,000	5–7.5%
MCP connector schemas (if enabled)	2,000–80,000+	1–40%
Memory / user preferences	300–1,000	0.2–0.5%
Reserved for output generation	16,000–32,000	8–16%
Overhead Subtotal	31,000–133,000	15.5–66.5%

One developer documented 82,000 tokens — 41% of a 200K context window — consumed by MCP tool definitions alone across 13 connected servers. Before any conversation. The richer your agent's toolset, the less room it has to think. That's the hidden tax of connectivity.
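The subtotal and percentages can be checked with a few lines of arithmetic. The ranges below are copied from the table. The MCP upper bound is open-ended ("80,000+"), so the high total is a floor rather than a cap.

```python
WINDOW = 200_000

# (low, high) token estimates copied from the table above.
overhead = {
    "system_prompt": (3_000, 5_000),
    "builtin_tools": (10_000, 15_000),
    "mcp_schemas": (2_000, 80_000),
    "memory": (300, 1_000),
    "reserved_output": (16_000, 32_000),
}

low = sum(lo for lo, _ in overhead.values())
high = sum(hi for _, hi in overhead.values())
print(low, high)  # 31300 133000
print(round(100 * high / WINDOW, 1))  # 66.5

# The 13-server case: 82,000 tokens of tool definitions.
print(round(100 * 82_000 / WINDOW))  # 41
```

The low end sums to 31,300, which the table rounds to 31,000.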

Seven Actors, No Coordination

The context window is not a static resource controlled by a single party. It's a shared, contested space modified by at least seven distinct actors, each with different incentives, visibility, and mechanisms of action. None of them coordinate with each other.

Actor 1: The Model Provider

(Anthropic, OpenAI, Google.) The most powerful actor. They set the architectural ceiling. They inject system prompts users never see: 3,000–5,000 tokens or more of behavioral constraints. They inject built-in tool definitions even if the user never invokes them. They implement compaction when conversations approach the limit, summarizing earlier turns invisibly and introducing lossy compression. They cap output tokens (Claude at 32K, Gemini at 64K). Their incentive structure is conflicted: larger windows are a competitive advantage, but longer contexts cost more to serve.

Actor 2: The Platform

(Claude.ai, ChatGPT, Cursor, Claude Code.) Sits between the provider and the user. Decides how conversation history is maintained — full raw history vs. summarized after N turns vs. dropped. Decides how file uploads are injected — raw text dump vs. chunked retrieval. A 50-page PDF uploaded as raw text can consume 30,000+ tokens in a single injection. Controls which MCP connectors are active, directly adding or removing thousands of tokens of tool definitions with each toggle.

Actor 3: MCP Servers and Tool Providers

This is the actor most relevant to Nukez. MCP servers inject context at two moments: at connection time (tool definitions — a fixed cost per conversation) and at call time (tool results — a per-invocation cost that enters conversation history permanently until compaction). A well-designed server with 6 focused tools costs 2,000–3,000 tokens. A bloated server with 20 verbose tools costs 14,000+. Anthropic's Tool Search feature can defer loading and reduce upfront cost by up to 85%, but it's new and not universally deployed.
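The figures in this section imply a per-tool cost and a deferred-loading saving that are easy to work out. A small sketch with hypothetical helper names. The 85% figure is the "up to" reduction quoted above, so the result is a best case.

```python
def per_tool_tokens(total_tokens: int, tool_count: int) -> int:
    """Average definition cost per tool, rounded down."""
    return total_tokens // tool_count


def deferred_upfront_cost(definition_tokens: int, reduction_pct: int = 85) -> int:
    """Upfront cost left after deferring definitions, at the quoted best-case reduction."""
    return definition_tokens * (100 - reduction_pct) // 100


print(per_tool_tokens(14_000, 20))  # 700
print(per_tool_tokens(2_000, 6), per_tool_tokens(3_000, 6))  # 333 500
print(deferred_upfront_cost(14_000))  # 2100
```

On these numbers a verbose tool costs roughly 1.4 to 2 times as much context as a focused one, before any call is made.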

Actor 4: RAG Systems

Sit outside the model, inject content directly into the context window. Often invisible to the user. The user asks a question; the system quietly retrieves and injects thousands of tokens of supplementary context. Without careful relevance filtering, RAG becomes a major source of context bloat — injecting redundant or marginally relevant content that wastes budget without improving response quality. Documents placed in the middle of long contexts are particularly vulnerable to the lost-in-the-middle effect.
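One way to picture "careful relevance filtering" is a score threshold combined with a token budget. The sketch below is a generic greedy selector, not any particular RAG library's API. The chunk fields, scores, and thresholds are invented for illustration.

```python
def select_chunks(chunks, min_score, token_budget):
    """Keep the highest-scoring chunks that clear min_score and fit the budget."""
    kept, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if chunk["score"] < min_score:
            break  # everything after this is less relevant still
        if used + chunk["tokens"] > token_budget:
            continue  # too large for the remaining budget; try smaller ones
        kept.append(chunk)
        used += chunk["tokens"]
    return kept, used


candidates = [
    {"id": "a", "score": 0.91, "tokens": 1_200},
    {"id": "b", "score": 0.42, "tokens": 2_500},
    {"id": "c", "score": 0.78, "tokens": 1_800},
    {"id": "d", "score": 0.80, "tokens": 900},
]
kept, used = select_chunks(candidates, min_score=0.75, token_budget=3_000)
print([c["id"] for c in kept], used)  # ['a', 'd'] 2100
```

Chunk "b" is dropped for low relevance and "c" for size, so the injection costs 2,100 tokens instead of 6,400.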

Actor 5: The User

The most variable actor. Single-sentence queries cost roughly 10 tokens. Massive file uploads cost 100,000+. Multi-hour conversations accumulate tens of thousands. Tool invocations that trigger web searches or code execution add 3,000–5,000 tokens per round-trip. The user's key disadvantage is limited visibility: most consumer interfaces provide no indication of current context utilization. The user has no way of knowing they're at 75% capacity and that their next file upload will trigger compaction and information loss.

Actor 6: The Conversation Itself

The conversation is an actor because of how accumulation works. Each new API call in a multi-turn conversation includes the full history of prior turns. The window doesn't reset between messages; it grows monotonically until compaction intervenes. This creates a characteristic lifecycle: plenty of headroom in early turns, gradual accumulation in mid-conversation, a compaction trigger in late conversation (lossy summarization of earlier turns), then post-compaction operation on a mix of precise recent context and compressed older context. For MCP-heavy workflows, this lifecycle is accelerated: a single maz_perspective call might return 3,000–5,000 tokens of synthesized analysis. Three such calls plus normal conversation could push past the compaction threshold within 10–15 turns.
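The lifecycle is easy to simulate. Every number in the sketch below is an assumption chosen for illustration: a 200K window, 100K of fixed overhead, 3,000 tokens per ordinary turn, three 5,000-token tool results, and compaction at 80% utilization. Real platforms do not publish a single threshold.

```python
def turns_until_compaction(window, overhead, tokens_per_turn, tool_calls, threshold_pct=80):
    """Count turns until cumulative context reaches the compaction threshold.

    tool_calls maps a turn number to the tokens a tool result adds on that turn.
    """
    used = overhead
    limit = window * threshold_pct // 100
    turn = 0
    while used < limit:
        turn += 1
        # History only grows: each turn adds its own tokens plus any tool result.
        used += tokens_per_turn + tool_calls.get(turn, 0)
    return turn, used


# Hypothetical workload: three 5,000-token tool results on turns 2, 5, and 8.
turn, used = turns_until_compaction(
    window=200_000,
    overhead=100_000,
    tokens_per_turn=3_000,
    tool_calls={2: 5_000, 5: 5_000, 8: 5_000},
)
print(turn, used)  # 15 160000
```

Without the three tool results, the same conversation would last 20 turns before reaching the threshold.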

Actor 7: Automated Agent Loops

When LLMs are used in agentic loops — repeatedly calling tools, evaluating results, deciding next steps — the context window fills at an accelerated rate. A coding agent performing a single read-edit-test cycle generates 12 tool-related entries in conversation history, consuming 10,000–30,000 tokens from one task. Agentic systems face a unique tension: they need long context windows for coherence across multi-step tasks, but they consume context budget faster than any other usage pattern. This is why Claude Code implements aggressive compaction, tool output limits (warnings at 10K tokens, hard limits at 25K by default), and Programmatic Tool Calling — batching multiple tool operations into a single code block to keep intermediate results out of the context window entirely.
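A warn-then-truncate guard of the kind described can be sketched as follows. This is not Claude Code's implementation. The two thresholds are the figures quoted above, while the 4-characters-per-token estimate and the function names are assumptions.

```python
WARN_TOKENS = 10_000  # warning threshold quoted in the text
HARD_TOKENS = 25_000  # default hard limit quoted in the text


def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token. An assumption, not a tokenizer."""
    return (len(text) + 3) // 4


def guard_tool_output(text: str):
    """Return (status, text) where status is 'ok', 'warn', or 'truncated'."""
    tokens = estimate_tokens(text)
    if tokens > HARD_TOKENS:
        # Cut the output down to the hard limit before it enters history.
        return "truncated", text[: HARD_TOKENS * 4]
    if tokens > WARN_TOKENS:
        return "warn", text
    return "ok", text


print(guard_tool_output("x" * 100)[0])      # ok
print(guard_tool_output("x" * 60_000)[0])   # warn
print(guard_tool_output("x" * 200_000)[0])  # truncated
```

A real system would use the model's own tokenizer and would likely summarize rather than cut mid-stream.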

The MCP server designer doesn't know how long the conversation history will be. The RAG pipeline doesn't know how many tools are loaded. The user doesn't know any of this is happening. It's a classic resource contention problem, and most of the time, nobody is managing it.

What This Means for Agents

If your agent stores its knowledge, its state, its operational history inside a context window, here's what you're actually accepting:

Ephemeral

The knowledge exists for this session and nowhere else. When the conversation ends, everything is gone.

Lossy

Compaction will summarize away details the agent might need later. You can't control what survives. The model's summary of a 50-turn conversation is necessarily lossy compared to the raw transcript, and the agent doesn't know it happened.

Unverifiable

There's no receipt, no proof, no way to confirm that what the agent “remembers” is what was actually stored. The model could hallucinate a memory and neither it nor you would know the difference.

Non-portable

It can't be transferred to another agent, another session, or another model. It's trapped in one inference context on one provider's infrastructure.

This is not memory. This is a notepad that catches fire when you close the app.

All 8,000 multi-agent repos on GitHub build on this foundation. They orchestrate which agents talk to which. They route tasks and synthesize responses. Some are genuinely sophisticated. But underneath, every one relies on context windows for state, and context windows are structurally incapable of providing what autonomous agents actually need.

What Agents Actually Need

Agents need four things from storage that context windows cannot provide:

Durability

Data persists beyond any single session, model, or provider.

Verifiability

The agent — or anyone — can prove that a piece of data is exactly what was stored, when it was stored, and that it hasn't been tampered with. Not a model's assertion. Cryptographic proof.

Portability

Stored data isn't trapped in one vendor's ecosystem. Any agent, on any model, through any interface.

Receipt Binding

Every storage operation produces a receipt that can be independently verified on a public blockchain.

Nukez

This is what we built.

An agent connects via SDK or MCP. It provisions a storage locker using a Solana wallet; the payment happens on-chain, no human required. It stores data and gets back a receipt: a Merkle root anchored to the Solana blockchain via a Switchboard oracle, with both a PullFeed attestation and an SPL Memo in the same transaction. Dual-layer on-chain proof.

When the agent (or any agent, or any human) wants to verify that data, they check the receipt against the chain. The Merkle root either matches or it doesn't. There's no trust involved. There's math.
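The "matches or it doesn't" check can be illustrated with a generic Merkle tree. This is a sketch of the general technique, not Nukez's implementation. The post does not describe the actual tree layout, hash function, or chunking, and fetching the anchored root from Solana is left out. SHA-256 and the duplicate-last-node rule are assumptions.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(chunks):
    """Root hash of a binary Merkle tree over the given byte chunks."""
    if not chunks:
        raise ValueError("need at least one chunk")
    level = [sha256(c) for c in chunks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def verify(chunks, anchored_root_hex: str) -> bool:
    """Recompute the root locally and compare it with the anchored value."""
    return merkle_root(chunks).hex() == anchored_root_hex


data = [b"agent note 1", b"agent note 2", b"agent note 3"]
receipt_root = merkle_root(data).hex()  # stand-in for the on-chain value
print(verify(data, receipt_root))  # True
print(verify([b"tampered"] + data[1:], receipt_root))  # False
```

Changing a single byte in any chunk changes the root, which is why one short hash can vouch for an arbitrary amount of stored data.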

We tested this against 37 AI models across 9 providers. 700+ fully autonomous runs. Real Solana transactions. Real cloud storage. Real cryptographic receipts. 99.2% success rate. No human in the loop.

The study's design principles map directly to how Nukez operates:

Principle 1: Offload Computation, Not Results

When an agent calls Nukez through MCP from inside Claude.ai, the storage operation — provisioning, uploading, attesting, verifying — all happens server-side on our Cloud Run infrastructure. The host conversation's context window only pays for the compact result that comes back. The heavy lifting is invisible. This is the optimal pattern for context efficiency.

Principle 2: Minimize Tool Definition Footprint

The Nukez MCP server exposes ~6 focused tools with concise schemas. We're actively working on partitioning tools by lifecycle stage — if you don't have a locker yet, you only see wallet and purchase tools. If you have a locker, you only see file ops. This reduces the tool definition surface area by removing tools irrelevant to the current state. Every tool you don't load is context budget you keep.
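Lifecycle partitioning could look something like the sketch below. The tool names are hypothetical, since the post does not list the real ones. The point is only that the visible set depends on the agent's state.

```python
# Hypothetical tool names, grouped by lifecycle stage.
TOOLS_BY_STAGE = {
    "no_locker": ["wallet_status", "purchase_locker"],
    "has_locker": ["store_file", "list_files", "get_file", "verify_receipt"],
}


def visible_tools(has_locker: bool):
    """Expose only the tools relevant to the agent's current lifecycle stage."""
    return TOOLS_BY_STAGE["has_locker" if has_locker else "no_locker"]


print(visible_tools(False))  # ['wallet_status', 'purchase_locker']
print(len(visible_tools(True)))  # 4
```

An agent without a locker loads two definitions instead of six. By the earlier arithmetic, that is the difference between paying for every schema up front and paying only for the ones it can use.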

Principle 3: Control Response Size

Nukez tool responses return receipt IDs, verification status, and compact metadata — not verbose narratives. A store operation returns a receipt, not a story about storing.

Principle 4: Design for Compaction Resilience

Nukez receipts are designed to survive summarization. A receipt ID and a verification status are atomic facts that compress cleanly. They don't degrade when the conversation is compacted because there's nothing to lose — the proof is on-chain, not in the context window.

The study's conclusion is the thesis of this entire project: the context window is ephemeral working memory. Expecting it to also be the filing cabinet, the audit trail, and the notary public is a category error.

The context window race is shifting — the period from 2023 to 2025 was dominated by raw token count expansion, from 4K to 8K to 128K to 1M to 2M to 10M. That race is approaching diminishing returns. The frontier is moving toward context quality and toward recognizing that some things simply don't belong in the context window at all. Agent memory is one of them.

Context windows are where agents think. Nukez is where agents remember.