Research Note — Frontier Models / May 2026

How the Frontier Is Quietly Building Toward Nukez

A pattern hiding in plain sight across the last eighteen months of model releases.

ZHnukez.xyzgithub.com/nukez-xyz55 models · 825 runs · 8 providers

The 237-Kilobyte Clue

Last week, while benchmarking 55 cloud models against the Nukez agentic-storage workflow, we caught something odd in the Gemini 3.x lineup. Three of the four Gemini 3.x models emit a small thought_signature field on every tool call — 54 to 208 bytes of opaque binary data that the agent has to echo back to the model on every subsequent turn. Reasonable metadata. Negligible overhead.

The fourth model, gemini-3.1-pro-preview, emits a thought_signature of 237,992 bytes — roughly 1,000× larger than its siblings — for the exact same kind of tool call. Across an 8-turn agent loop, with the signature replayed on every turn, that is ~1.9 MB of opaque binary state the model carries through its context window and that the user pays to round-trip on every call.

What is it? Per Google's documentation, thought_signature is a cryptographic encoding of the model's internal reasoning state. It exists to preserve the model's “thinking” across tool boundaries without leaking the raw chain-of-thought into the conversation history. It is required. It is opaque. It is growing.

Read that sentence twice and ask yourself: have we seen this pattern before?

We have. Anthropic shipped prompt caching with separately-priced cache_creation and cache_read token buckets. OpenAI shipped the Responses API with server-side conversation state, then ChatGPT memory, then cached_tokens accounting. xAI added thinking_tokens as a separate billing category for reasoning models. Each release was framed as a discrete feature: cost optimization, reasoning persistence, multi-turn coherence. Each was useful. Each was vendor-specific. Each was opaque to varying degrees.

But step back, and a pattern resolves. Every frontier lab is shipping pieces of the same underlying problem — durable, verifiable, portable agentic state — and every one of them is shipping it as a proprietary, vendor-locked mechanism that mostly cannot be audited, ported, or substituted.

This is the story of how the frontier is converging — slowly, indirectly, and without saying it out loud — toward exactly what Nukez was designed to be.

What's Actually Being Built

To see the pattern, you have to inventory what has been shipped.

Google · thought_signature (Gemini 3.x, 2026)

A binary blob the model emits with every tool call, encoding its internal reasoning state. The agent framework must echo it back verbatim on every subsequent turn or the next call fails with a 400. Size varies wildly — small for flash variants, massive for pro variants. Format is undocumented. Contents are unverifiable by anyone except Google. Round-trips through the model's context window on every turn, so cumulative overhead compounds across an agent loop.

Stated purpose: preserve reasoning continuity across tool boundaries without exposing the chain-of-thought.

Actual effect: locks the conversation to Google's runtime; introduces a 4-orders-of-magnitude variance in protocol overhead that is invisible to the developer.

Anthropic · cache_creation_input_tokens / cache_read_input_tokens (2024–present)

Prompt caching, with three distinct token accounting buckets: fresh input_tokens, cache_creation_input_tokens (the first time a prefix is sent, priced at 1.25× normal input), and cache_read_input_tokens (subsequent reads, priced at 0.1× normal input). Cache TTLs of 5 minutes by default, with optional 1-hour extensions at higher write cost.

Stated purpose: dramatically reduce the cost of stateful, long-running agentic workloads.

Actual effect: this is the most transparent of the bunch — Anthropic exposes the bucket counts in usage, so developers can see exactly what is cached and what is not. But the contents of the cache, what gets cached versus evicted, and how the breakpoint heuristics work remain Anthropic-internal. And the cache itself does not survive across organizations, regions, or providers.

OpenAI · Responses API state · cached_tokens · ChatGPT memory (2024–present)

A bundle. The Responses API persists conversation state server-side and references it by ID, rather than requiring the client to send the full history on every turn. prompt_tokens_details.cached_tokens exposes how much of the input was served from OpenAI's automatic cache. ChatGPT memory features quietly persist user state across sessions.

Stated purpose: reduce latency, lower cost on repeated context, enable longer conversations without ballooning client payloads.

Actual effect: server-side state lives in OpenAI's account boundary. Not portable. Not auditable beyond the metadata fields they choose to expose. The “I will handle remembering for you” pattern.

xAI · thinking_tokens (Grok 4 reasoning variants, 2025)

A separately-billed token bucket for the model's internal reasoning steps in reasoning models. Distinct from completion tokens. The reasoning itself is mostly hidden from the developer — you pay for it but do not see most of it.

Stated purpose: honest pricing for the additional compute reasoning models perform.

Actual effect: another opaque cost dimension introduced as a default of using the newest models. The developer cannot see what was reasoned about, only that it cost extra.

Inception · reasoning_tokens (Mercury, 2025)

Same pattern as xAI — reasoning is reported as an additional token bucket, not a subset of completion. Mercury's reasoning is explicitly additive: you pay for input + output + reasoning, where reasoning is the model's internal deliberation.

Stated purpose: accurate accounting for diffusion-style reasoning models.

Actual effect: same as the others. Opaque internal state, separately billed, format undisclosed.

The Pattern, in One Frame

Pull the lens back. What is each of these features doing, at the level of system design?

Vendor	Mechanism	What it persists	Verifiable by	Portable	Audit surface
Google	`thought_signature`	Model reasoning state	Google only	No	None
Anthropic	Prompt caching	Conversation prefix	Anthropic only	No	Bucket counts
OpenAI	Responses API state	Full conversation	OpenAI only	No	State IDs
OpenAI	ChatGPT memory	Cross-session context	OpenAI only	No	None
xAI	`thinking_tokens`	Internal reasoning	xAI only	No	Token counts
Inception	`reasoning_tokens`	Internal reasoning	Inception only	No	Token counts

The verb is the same across every row: persist agentic state across calls and turns. The differences are cosmetic — what gets persisted, how it is billed, what telemetry leaks out. But the underlying engineering problem each is solving is identical: agentic systems are stateful, and that state has to live somewhere.

The labs have all decided where it lives: inside their walls.

Why They're All Doing This Now

It is worth steelmanning their choice before critiquing it.

Agentic workflows are different from chat workflows in one structural way: a conversation can be reconstructed from the wire transcript, but an agent's reasoning often cannot. The agent makes a tool call, then needs to act on the result, then makes another. To stay coherent over 8, 20, 80 turns, it has to “remember” not just what it said but why — its intermediate hypotheses, its discarded paths, its planning context.

There are exactly three places that “why” can live:

1 · Exposed in the conversation history

The agent talks to itself in the open. Every reasoning step becomes a text turn. This bloats the context window, leaks proprietary chain-of-thought, and forces the agent to re-derive the same context on every turn. It is how older “agent loops” worked, and it is expensive and inelegant.

2 · Hidden inside the model provider

The lab tracks state on their side, identifying the conversation by an ID or a signature, and serves it back to the model when needed. This is what every modern lab is converging on. It is cheap, it is clean, it keeps the chain-of-thought private — and it locks every conversation to a single vendor's infrastructure forever.

3 · On a neutral cryptographic substrate

State is committed to a verifiable shared layer that any provider, any agent, any auditor can read. Provenance is portable. The chain-of-thought stays private (it is never written down), but the commitments about what happened are public, verifiable, and survive the death of any one vendor.

Path 1 is too expensive. Path 2 is what every lab is shipping. Path 3 is what Nukez was designed for.

The Cost of Path 2

The path the labs chose is reasonable for them. It is much harder to defend if you are a developer building on top of these systems.

When state lives inside the provider:

You cannot verify what is in it

The 237 KB blob your agent is round-tripping might be reasoning context, or it might be vendor telemetry, or it might be padding designed to keep your context window full and your bills higher. You have no way to inspect, and the format is undocumented.

You cannot port it

If you want to migrate your agent from Gemini to Claude mid-conversation, you cannot — Claude cannot read Gemini's thought_signature any more than you can. Your conversation is hostage to whichever vendor you started with.

You cannot audit it

When something goes wrong — when the agent makes a bad decision, when costs spike unexpectedly, when a tool call returns the wrong result — you cannot reconstruct what the model was reasoning about. The state that drove the decision is locked away from you.

You cannot substitute for it

If your provider has an outage, you cannot keep the conversation running against a backup. If they raise prices on the cached state, you cannot shop around. If they decide to deprecate the model, every conversation tied to it dies with it.

You pay for it regardless

The protocol is mandatory. Gemini 3.x will not accept a tool-call round-trip without thought_signature. The Responses API charges for state retrieval. You cannot opt out.

These are not theoretical concerns. We measured them. Across our 825-run benchmark of 55 models against an identical agentic storage workflow, we observed:

3 of 4 Gemini 3.x models working only after we built a parallel test harness using Google's native SDK, because the OpenAI-compatible shim does not plumb thought_signature properly. Vendor lock-in materialized as harness lock-in.
gemini-3.1-pro-preview emitting thought_signature payloads ~1,000× larger than its siblings. Same vendor, same model family, undocumented variance in protocol overhead.
Anthropic's tok/op numbers running ~70% higher than OpenAI's for the same workload — not because Claude does more work, but because Anthropic's cache accounting includes cached input in the total while OpenAI's does not. Cross-vendor comparison requires accounting expertise the protocol does not surface.

Across the same benchmark, the cryptographic alternative — a Nukez receipt for a full request-pay-confirm-provision-upload-download-verify operation — measures ~250 bytes. Less than the size of a single thought_signature segment from the smallest Gemini 3.x model. Less than 0.15% of pro-preview's single-call signature.

The Cryptographic Alternative

Nukez did not invent the idea that agentic systems need durable, verifiable state. We agree with the labs on the diagnosis. We disagree on the prescription.

A Nukez storage operation produces a receipt that is:

Cryptographically signed on Solana mainnet, by an ed25519 keypair the user controls.
Verifiable by anyone with an RPC connection — not just the issuer.
Portable across providers — the receipt is meaningful regardless of which LLM, framework, or runtime called it.
Documented — the format is open, the spec is published, the contents are inspectable.
Compact — a full operation's receipt set is on the order of hundreds of bytes, not hundreds of kilobytes.
Persistent — the chain does not go down when a vendor does.

The chain-of-thought of the agent that produced the receipt remains private — Nukez never asks for it, never sees it, never stores it. But the commitments the agent made — what it paid, what it stored, what it provisioned, what it retrieved — are publicly verifiable. This is the inversion of the proprietary pattern: keep the thinking private, make the actions public.

That is the design we shipped two years ago. It is the design every lab is incrementally, indirectly, rediscovering.

The Convergence Is One-Way

Here is the part of this story worth sitting with.

When you watch the rollout of these features over time, the direction is unambiguous. It is not that some labs are moving toward state-rich agentic protocols and others are staying lightweight. All of them are moving the same direction. Each release is more state-heavy than the last. Each new feature persists more, hides more, costs more, and locks in more.

The reason is simple: agentic systems are about to eat enterprise software, and stateful agents perform dramatically better than stateless ones. The labs see this. They are not adding stateful features for fun — they are racing to capture the protocol layer of the agent economy before it standardizes elsewhere.

The question for the next 18 months is not whether agentic state becomes the dominant abstraction. It is whether that abstraction lives:

Inside five walled gardens

Each requiring developers to choose a vendor and stick with them for the lifetime of every conversation.

On a neutral substrate

Any vendor, any developer, any auditor can participate without bilateral trust.

The labs are betting on the first. They have every commercial reason to. State that lives inside their boundary is state they can charge for, control, sunset, and deprecate at will. It is the most defensible kind of moat in software.

Nukez is the bet that the second future is better — not just for developers, but for the agentic economy as a whole. Open protocols won at TCP/IP. They won at HTTP. They won at SMTP. They have consistently outperformed proprietary alternatives whenever the underlying problem was about coordination across many actors who do not fully trust each other.

Agentic state is exactly that kind of problem. We have many vendors, many agents, many users, many auditors. None of them fully trust any of the others. The optimal substrate for that environment is one that does not require trust in any single party.

We did not predict that the labs would all converge toward stateful agentic protocols at the same time. But once they did, they made our thesis legible in a way nothing else could have.

Every new thought_signature, every cached token bucket, every Responses API state ID is, in effect, a piece of evidence that the problem we identified is real — and a partial admission that it cannot be solved well from inside a single vendor's walls.

The convergence is happening. The only question is whether the destination is five proprietary copies of Nukez, or Nukez itself.

What This Means for Builders

If you are building agentic systems on top of these vendor protocols today, three things are worth knowing:

1 · The state your conversations accumulate is not your asset

It is the vendor's. When you decide later that you want to switch providers, the data goes with them, not with you.

2 · Cost predictability is collapsing

Between cache accounting, signature size variance, thinking tokens, and reasoning-token additivity, the per-op cost of a long agentic loop has become genuinely hard to forecast across providers. Build in tolerances or accept the variance.

3 · Auditability is going backwards

As more state moves inside vendors, your ability to reconstruct what happened in a failed agent run is shrinking, not growing. Build your own audit layer if you want one — the providers will not give it to you.

We built Nukez because we believe agentic state should be a public good. Five years from now, when every interesting software interaction is mediated by agents, the question of “what actually happened, and who can prove it” will be the most important plumbing question in computing. We would rather answer it once, openly, in the open, than have to ask five different vendors and trust their answers.

The labs are building toward us. We are going to keep building, in the open, with receipts.

This post is based on data from the canonical Nukez agentic-storage benchmark, run May 11–13, 2026, against 55 cloud models across 8 frontier-lab providers. The full dataset, including all per-model trace files and the 825-run log, is available at nukez.xyz/proof/benchmark.