The Infrastructure Behind the Lie
Part 1 covered why context windows can't be memory. Part 2 goes underneath: what they actually cost at the GPU level, what those costs force the infrastructure to do, and where the math says memory has to live.
Part 1 covered the mechanics of context windows — the O(n²) attention bottleneck, the seven layers of context consumption, the seven actors competing for the same finite budget, and why context windows are structurally incapable of serving as agent memory. Part 2 goes underneath: what context windows actually cost to operate at the GPU level, what those costs force the infrastructure providers to do, and what all of it means for Nukez.
Every Token Has a Physical Cost
Part 1 treated the context window as an abstract budget denominated in tokens. Part 2 treats it as what it actually is: a GPU memory allocation, a compute workload, and a line item on someone's infrastructure bill.
The self-attention mechanism described in Part 1, where every token computes relevance scores against every other token, isn't just theoretically expensive. It has a concrete hardware footprint. The KV cache alone (the stored Key and Value vectors from all previous tokens, kept so the model doesn't have to recompute attention from scratch) consumes approximately 40 GB of GPU high-bandwidth memory for a 128K-token context window, for a single user running Llama 3 70B. That's a single user, a single conversation. Scale to 1,000 concurrent users and you're looking at 40 terabytes of KV cache memory demand. For one model.
Before PagedAttention, LLM inference systems wasted 60–80% of allocated KV cache memory through fragmentation and over-allocation. vLLM's PagedAttention technique reduced that waste to under 4%, enabling 2–4x throughput improvements: the equivalent of doubling your GPU fleet without buying a single card. But even optimized, the KV cache remains the dominant memory consumer in production inference. For a 70B model that uses full multi-head attention, rather than the grouped-query attention behind the Llama 3 figure above, the cache for an 8K-context request is approximately 20 GB. At a batch size of 32, that's 640 GB, more than the model weights themselves.
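The KV cache figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the Llama 3 70B shape (80 layers, 8 KV heads, head dimension 128, FP16) is taken from the published model config, and the 64-KV-head variant is a hypothetical 70B-class model without grouped-query attention, which is an assumption about where the 20 GB per 8K request figure comes from.

```python
GIB = 2**30
TIB = 2**40

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one sequence: one Key and one Value vector
    per token, per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Llama 3 70B (assumed shape): 80 layers, 8 KV heads via GQA, head_dim 128, FP16.
llama3_128k = kv_cache_bytes(128 * 1024, layers=80, kv_heads=8, head_dim=128)

# Hypothetical 70B-class model with full multi-head attention: 64 KV heads.
mha_8k = kv_cache_bytes(8 * 1024, layers=80, kv_heads=64, head_dim=128)

print(f"one 128K conversation (GQA): {llama3_128k / GIB:.0f} GiB")
print(f"1,000 such conversations:    {1000 * llama3_128k / TIB:.1f} TiB")
print(f"one 8K request (MHA):        {mha_8k / GIB:.0f} GiB")
print(f"batch of 32 at 8K (MHA):     {32 * mha_8k / GIB:.0f} GiB")
```

The same 8K request on the grouped-query model would need one eighth of the multi-head figure, which is why the attention variant matters as much as the context length when sizing a fleet.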
The implications cascade:
Context Length Costs Scale Quadratically
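Part 1 described the O(n²) attention math. In compute terms: the attention work for a 128K-context request is 256x that of an 8K-context request on the same model. Not 16x. 256x, because 16² = 256. The total bill grows by less than that, since the feed-forward layers and the KV cache scale linearly with length, but the attention term is the quadratic penalty translated into infrastructure spend.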
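To make the quadratic term concrete, here is a small sketch that compares attention work at several context lengths against an 8K baseline. It models only the attention-score computation, under the assumption that its cost grows with the square of sequence length; it is not a full serving-cost model.

```python
def attention_cost_ratio(long_ctx, short_ctx):
    """Ratio of attention-score work between two context lengths,
    assuming cost grows with the square of sequence length."""
    return (long_ctx / short_ctx) ** 2

BASELINE = 8_000
for long_ctx in (16_000, 32_000, 64_000, 128_000):
    ratio = attention_cost_ratio(long_ctx, BASELINE)
    print(f"{long_ctx:>7} vs {BASELINE} tokens: {ratio:.0f}x attention work")
```

A 16x longer context gives a 256x ratio for this term alone. Components that scale linearly with length dilute that in the total, so the real multiplier for a whole request sits somewhere between the linear and the quadratic figure.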
Output Tokens Cost 3–5x More Than Input Tokens
This is reflected directly in API pricing. OpenAI, Anthropic, and Google all price output tokens dramatically higher than input tokens because output generation requires sequential processing (one token at a time, each depending on all previous tokens), while input processing can be parallelized. Claude Sonnet 4.5 charges $3 per million input tokens and $15 per million output tokens, a 5x ratio. GPT-5.2 charges $1.75 input and $14 output, an 8x ratio that exceeds even that range. This asymmetry means that any system generating long outputs is disproportionately expensive.
Providers Are Burning Cash
Anthropic burns through approximately $2.7 million daily serving Claude, with infrastructure costs consuming 85% of revenue. Google's Gemini infrastructure costs are estimated at $5 billion per year. OpenAI's cost per generated token is approximately $0.00012. The per-token figure sounds small, but it compounds into existential-scale infrastructure bills at millions of concurrent users.
The KV Cache Is the Bottleneck — And It's Getting Worse
The KV cache deserves special attention because it's the hardware manifestation of the context window problem described in Part 1.
Every conversation turn, every tool call result, every RAG injection that enters the context window — as described by Part 1's seven layers and seven actors — has a direct physical counterpart in GPU memory. The system prompt's 3,000–5,000 tokens? That's KV cache entries that persist for the entire conversation. The 14,000 tokens of MCP tool definitions? KV cache entries. The 30,000 tokens from a PDF upload? KV cache entries. The tool result from a web search? More KV cache entries that never go away until compaction.
Part 1's observation about the conversation lifecycle — early headroom, mid-conversation accumulation, late compaction — is literally the lifecycle of GPU memory pressure. Early turns have low KV cache occupancy. Mid-conversation, the cache grows with every turn. Late conversation, the system either runs out of memory (request fails) or triggers compaction (lossy summarization to reduce the token count and free cache space).
The industry's response to this bottleneck has been a cascade of increasingly sophisticated optimizations:
PagedAttention (vLLM, 2023)
Treats KV cache like virtual memory pages instead of contiguous blocks. Eliminated the 60–80% memory fragmentation waste, enabling 2–4x more concurrent users on the same hardware.
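The paging idea can be illustrated with a toy allocator. This is a sketch in the spirit of PagedAttention, not vLLM's implementation: the class name, the method names, and the 16-token block size are all assumptions made for illustration. Each sequence gets a block table that maps its logical positions to whatever physical blocks happen to be free, so memory is claimed one block at a time instead of being reserved up front for the maximum length.

```python
class PagedKVAllocator:
    """Toy block allocator (hypothetical, not vLLM's code)."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        # Claim a new block only when the sequence has no room left.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("out of KV cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self):
        """Allocated token slots that hold no token yet."""
        allocated = sum(len(t) for t in self.block_tables.values())
        return allocated * self.block_size - sum(self.lengths.values())


alloc = PagedKVAllocator(num_blocks=64, block_size=16)
for _ in range(100):
    alloc.append_token("a")
for _ in range(20):
    alloc.append_token("b")
print(alloc.wasted_slots(), "unused slots for 120 stored tokens")
```

Waste is bounded by less than one block per sequence. A contiguous scheme that reserved, say, 2,048 slots per sequence in advance would hold 4,096 slots for the same 120 tokens.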
KV Cache Quantization (2024–2025)
Reduces the precision of cached Key and Value vectors from FP16 to FP8 or even FP4. NVIDIA's NVFP4 format achieves less than 1% accuracy loss compared to FP16 baselines on modern benchmarks (including long-context reasoning over 64K-token sequences) while cutting cache memory by roughly 4x. This means the same GPU can serve roughly 4x the context length or 4x the concurrent users.
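The effect of precision on cache size is simple to estimate. The sketch below assumes a Llama 3 70B-shaped model (80 layers, 8 KV heads, head dimension 128) and counts only the raw element widths. Block-scaled formats also store per-block scale factors, which this ignores, so real savings land somewhat below the nominal ratio.

```python
BYTES_PER_VALUE = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

def kv_cache_gib(tokens, precision, layers=80, kv_heads=8, head_dim=128):
    """Nominal KV cache size in GiB at a given element precision.
    Ignores scale-factor metadata used by block-scaled formats."""
    per_token = 2 * layers * kv_heads * head_dim * BYTES_PER_VALUE[precision]
    return per_token * tokens / 2**30

for precision in BYTES_PER_VALUE:
    size = kv_cache_gib(128 * 1024, precision)
    print(f"{precision}: {size:.0f} GiB for one 128K-token sequence")
```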
KV Cache Offloading (2024–2026)
Moves inactive KV blocks from expensive GPU memory to cheaper CPU DRAM or even NVMe SSDs. Reading cached attention back from CPU memory adds 10–50ms of latency per retrieval, but that's vastly cheaper than recomputing the cache from scratch. NVIDIA's Grace Hopper architecture, with its unified CPU-GPU memory connected via NVLink-C2C, enables this offloading at near-GPU bandwidth. LMCache, an open-source KV cache management layer for vLLM, achieves 3–10x latency reductions by reusing cached attention across requests.
Prefix Caching and Hash-Based Deduplication (2025–2026)
Recognizes that many requests share identical prefixes: the same system prompt, the same tool definitions, the same RAG context. Instead of recomputing KV vectors for shared prefixes, the system hashes each token block and checks a global cache. If a match exists, the precomputed KV tensors are injected directly. At enterprise scale, this eliminates enormous amounts of redundant computation. One analysis found that with 5,000 agent calls per day sharing the same 11,000-token prefix (system prompt + tool definitions + policy context), the system was recomputing 55 million tokens of shared context daily. Prefix caching eliminates nearly all of it: the prefix is computed once and reused for as long as the cache entry survives.
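The block-hashing scheme described above can be sketched as follows. This is an illustrative toy, not any serving framework's API: `PrefixCache`, `block_hashes`, and the 16-token block size are hypothetical names and values. Each block's hash is chained to its parent's hash, so a block only matches when everything before it matches too.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per cached block (illustrative)

def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Chain-hash each full block so its key depends on the whole prefix."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes

class PrefixCache:
    def __init__(self):
        self.store = {}  # block hash -> stand-in for precomputed KV tensors

    def process(self, token_ids):
        """Return (tokens reused from cache, tokens that must be computed)."""
        hashes = block_hashes(token_ids)
        reused_blocks = 0
        for h in hashes:
            if h not in self.store:
                break
            reused_blocks += 1
        for h in hashes[reused_blocks:]:
            self.store[h] = "kv-block"  # a real system would store tensors
        reused = reused_blocks * BLOCK_SIZE
        return reused, len(token_ids) - reused

cache = PrefixCache()
shared_prefix = list(range(11_008))  # roughly 11K shared tokens (688 blocks)
first = cache.process(shared_prefix + [7] * 40)
second = cache.process(shared_prefix + [9] * 40)
print("first call:", first, "second call:", second)

# The redundant work from the example above, without caching:
print(5_000 * 11_000, "prefix tokens recomputed per day")
```

The second call reuses the whole shared prefix and computes only its own 40-token tail. The chaining is also why a single changed token early in the prompt invalidates every block after it.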
Every one of these optimizations is working around the same fundamental problem: the context window is too expensive to scale naively, and every byte of context that enters the window has a real cost in GPU memory, compute cycles, and electricity.
The Pricing Tells the Story
API pricing isn't arbitrary. It's a direct expression of infrastructure economics — and it reveals exactly how providers think about context.
The most telling data point: inference costs have fallen steeply since 2022, by some estimates approximately 10x annually. GPT-4-equivalent performance that cost $20 per million tokens in late 2022 costs $0.40 per million tokens today. That's a 50x reduction in three years, or roughly 3.7x per year for that capability tier. The a16z analysis called this “LLMflation”: inference cost deflation faster than the PC compute revolution or the dotcom bandwidth boom.
But this deflation has not been uniform. It's been driven almost entirely by optimizations at the model level (smaller, more efficient architectures, MoE routing, quantization) and the serving level (PagedAttention, prefix caching, speculative decoding). The fundamental cost structure hasn't changed: attention is still O(n²), KV cache still scales linearly with context length, and output generation is still sequential.
What this means for agents:
Long Contexts Are Disproportionately More Expensive
As providers race to offer 1M and 2M token windows, the per-request cost for a maxed-out context is orders of magnitude higher than a short request. Gemini 3 Pro's 1M-token window is technically available, but actually filling it costs dramatically more than a 32K request. Providers manage this through tiered pricing — Anthropic charges 2x input and 1.5x output for requests exceeding 200K tokens on the 1M beta.
The Output Asymmetry Punishes Agent Patterns
Agents generate more output than humans. An agentic loop that reads files, reasons about them, generates code, and produces test results can easily generate 10,000+ output tokens per iteration. At GPT-5.2's $14/million output token rate, a 20-iteration agent loop generating 200K output tokens costs $2.80 in output alone, per task. Scale to thousands of agent tasks per day and the numbers become significant.
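The per-task figure follows directly from the quoted rate. A quick sketch, where the daily task volume of 5,000 is a hypothetical number chosen only to show the scaling:

```python
def output_cost_usd(output_tokens, usd_per_million):
    """Cost of generated tokens at a flat per-million-token price."""
    return output_tokens / 1_000_000 * usd_per_million

ITERATIONS = 20
TOKENS_PER_ITERATION = 10_000
RATE = 14.0            # USD per million output tokens, as quoted above
TASKS_PER_DAY = 5_000  # hypothetical volume

per_task = output_cost_usd(ITERATIONS * TOKENS_PER_ITERATION, RATE)
per_day = per_task * TASKS_PER_DAY
print(f"per task: ${per_task:.2f}, per day: ${per_day:,.0f}")
```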
Prompt Caching Is the Hidden Subsidy
Anthropic's prompt caching gives a 90% discount on repeated input content (cache reads cost 0.1x base input price). This means agents with stable system prompts and tool definitions pay dramatically less per request after the first cache fill. But caching only works for exact prefix matches. Any variation — different conversation history, different user context — breaks the cache. Agents that maintain consistent, stable prefixes benefit enormously. Agents with highly variable contexts don't.
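A rough sketch of what a stable prefix is worth. The $3 base price and 0.1x cache-read multiplier come from the text; the 1.25x cache-write premium, the 11,000-token prefix, the 1,000-token variable tail, and the assumption that the cache stays warm across all calls are illustrative assumptions, not quoted figures.

```python
def input_cost_usd(prefix_tokens, variable_tokens, calls,
                   base_per_million=3.0, cache_read_mult=0.1,
                   cache_write_mult=1.25, cached=True):
    """Input-token cost for `calls` requests sharing one exact prefix.
    Assumes the cache never expires between calls (an idealization)."""
    if not cached:
        billable = calls * (prefix_tokens + variable_tokens)
    else:
        first_fill = prefix_tokens * cache_write_mult
        later_reads = (calls - 1) * prefix_tokens * cache_read_mult
        billable = first_fill + later_reads + calls * variable_tokens
    return billable * base_per_million / 1_000_000

uncached = input_cost_usd(11_000, 1_000, calls=5_000, cached=False)
cached = input_cost_usd(11_000, 1_000, calls=5_000, cached=True)
print(f"uncached: ${uncached:.2f}, cached: ${cached:.2f}")
```

Under these assumptions the variable tail ends up costing about as much as the cached prefix, even though it is a fraction of the size. That is the incentive to keep system prompts and tool definitions byte-for-byte stable.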
What the Infrastructure Forces
The economics described above create structural pressures that shape how every actor in Part 1's framework behaves:
Although providers market 1M-token windows, their infrastructure is optimized for the median request, which is far shorter than the maximum. Short requests are cheap to serve, cache-friendly, and parallelizable. Long requests are expensive, cache-hostile, and memory-intensive. This is why output token caps exist (Claude's 32K, Gemini's 64K). It's why compaction exists: the provider would rather summarize your context than serve a 200K-token request at full fidelity. The 1M-token window is the exception, not the norm.
Claude Code triggers compaction at 80% window utilization. ChatGPT summarizes older turns silently. Every platform has a version of this. It's not a feature — it's cost management disguised as a feature. Compaction reduces KV cache pressure, lowers per-request compute cost, and extends conversation length at the price of information loss. The user experiences continuity; the provider experiences lower infrastructure bills.
Tool Search (deferred loading) isn't just a developer convenience — it's Anthropic's response to the fact that tool definitions are one of the largest fixed-cost components of context. As documented in Part 1, 82,000 tokens of tool definitions across 13 servers means 82,000 tokens of KV cache that must be computed and stored for every single request in every conversation where those tools are active. Tool Search reduces this to only the tools actually needed, cutting the KV cache overhead proportionally. The 85% reduction in tool context reported by Anthropic isn't just about saving token budget for the user — it's about serving more concurrent conversations per GPU.
Part 1's Actor 7 (automated agent loops) fills context fastest — 12 tool entries from one task, 10,000–30,000 tokens. In infrastructure terms, agents generate the highest KV cache pressure, the most output tokens (expensive), and the longest conversations. Anthropic's Programmatic Tool Calling — letting the model batch tool operations into a single code block — is explicitly designed to keep intermediate results out of the context window and therefore out of the KV cache. The 37% reduction in token usage they report on complex research tasks is 37% less GPU memory per request.
What This Means for Nukez
Every point above converges on the same conclusion: context windows are expensive, contested, ephemeral, and getting more so. The infrastructure providers are fighting a rearguard action against the quadratic cost of attention, using every optimization available to make long contexts viable. But even optimized, the context window remains the wrong place to store anything you care about keeping.
Nukez sits outside this entire cost structure. When an agent uses Nukez through MCP to store data and receive a cryptographic receipt, the storage happens on Google Cloud Storage. The attestation happens on the Solana blockchain via a Switchboard oracle. The verification uses merkle trees. None of these operations touch the KV cache. None of them scale with O(n²). None of them are subject to compaction, lossy summarization, or the lost-in-the-middle effect.
The only context window cost Nukez imposes is:
- Tool definitions at connection time — approximately 2,000–3,000 tokens for ~6 focused tools. This is fixed and small. With lifecycle-stage partitioning (the open TODO from Part 1), it gets smaller: an agent without a locker sees only wallet and purchase tools, reducing the definition footprint further.
- Tool results at call time — compact responses returning receipt IDs, verification status, and minimal metadata. A store operation returns a receipt. A verify operation returns a boolean and a merkle root. These are information-dense, compaction-resilient payloads that survive summarization because they're atomic facts, not narratives.
Everything else — the actual file storage, the encryption, the merkle tree construction, the on-chain attestation, the PullFeed + SPL Memo transaction — happens server-side on Cloud Run. The host conversation's KV cache never knows it happened.
This is the infrastructure-level expression of Part 1's Principle 1: offload computation, not results.
The KV cache holds the agent's thought process. Nukez holds the agent's receipts.
The cache is ephemeral, expensive, lossy, and provider-controlled. The receipt is permanent, cheap, lossless, and independently verifiable.
The Reconciliation
Here's where the two layers — Part 1's context mechanics and Part 2's infrastructure economics — converge into a single argument:
The context window is a poor abstraction for agent memory, and the infrastructure makes it worse, not better.
The context window is constrained by O(n²) attention math (Part 1) and expensive per-token GPU economics (Part 2). It's managed through compaction that introduces information loss (Part 1) and is motivated by infrastructure cost reduction (Part 2). It's consumed by seven competing actors with no coordination (Part 1), each adding to KV cache pressure that providers are fighting to reduce (Part 2). And the more capable agents become — more tools, more agentic loops, more multi-step reasoning — the faster they fill the window (Part 1) and the more they cost to serve (Part 2).
Every optimization the industry invests in — FlashAttention, PagedAttention, KV quantization, cache offloading, prefix caching, Tool Search, Programmatic Tool Calling, compaction — is an optimization of the wrong layer for the wrong problem. They're making ephemeral working memory more efficient. They're not making it durable. They're not making it verifiable. They're not making it portable.
Agents that need to remember things (what data they stored, what operations they performed, what commitments they made) need a layer that provides:
- Storage that doesn't scale with attention complexity or KV cache size.
- Memory that doesn't get compacted, summarized, or evicted when the provider decides the context is too expensive to maintain.
- State that exists before the conversation starts and persists after it ends.
- Proof that the agent's memory is what it claims to be, verifiable by anyone, without trusting any provider.
That's Nukez. The context window holds the current task. Nukez holds the receipts. They're complementary layers, not competing ones — and the infrastructure economics guarantee they always will be.
The context window race is approaching diminishing returns. Going from 200K to 1M to 10M makes the notepad bigger. It doesn't turn the notepad into a filing cabinet — and the infrastructure costs of maintaining a 10M-token notepad ensure that even the biggest windows will be aggressively optimized, compressed, and evicted.
The real question for agent infrastructure isn't “how do we make context windows bigger?” It's “what shouldn't be in the context window at all?”
Agent memory is the answer. And the infrastructure economics prove it.
