Does it feel like the big American AI providers are quietly turning the pricing screws? That's because they are. OpenAI is now pushing users across a ladder that runs from Free to Plus to Pro to Business to Enterprise, with Pro positioned at $200 per month and Business sold on a per-seat basis. Anthropic has also expanded its plan structure around heavier usage, including Claude Max tiers and team-oriented plans that explicitly target people running AI workflows all day. Oh sure, they backtrack every once in a while, only to find new ways to get more out of your credit cards later. For AI builders, the message is getting harder to miss: if the models are becoming your operating system, the providers want a bigger rent check.
That doesn't mean developers are powerless. It means the old mindset of “just send the prompt and see what happens” is getting expensive fast. Yeah, I've switched to DeepSeek V4 for the moment too, but is this the long-term solution? The other Chinese providers are not cheap either for those who would rather not host these trillion-parameter LLMs themselves. The teams that will win are the ones that treat token usage as product design, systems design, and workflow design all at once. That is why token optimization has become such a useful topic for engineers building agents, coding tools, copilots, and internal AI systems. It's part of the job now.
Token waste is often hidden in the plumbing: the model tier you choose, the context you carry, the tools you load, the cache you ignore, and the bloated output you accept. This article presents five practical solutions. Each one tackles a different kind of token waste, and each one is easy to explain visually with FlowZap Code.
The new AI tax
The squeeze is happening in two ways at once. First, headline tiers are getting more segmented and more expensive for serious users, especially developers who rely on long context windows, agentic tooling, or all-day usage. Second, the more capable the workflow becomes, the easier it is to accidentally inflate token usage with repeated instructions, giant tool schemas, oversized context, and verbose model outputs.
So the real problem is not only “which provider should a team choose?” The real problem is: how can a team keep building ambitious AI workflows without getting milked on every request? That is where these five solutions come in.
Solution #1: Stop sending every task to the fancy model
The first and most obvious leak is model choice. Too many products treat the strongest model like the default answer to everything, even when the task is simple extraction, light classification, format cleanup, or routine Q&A over narrow context. That is the AI equivalent of sending a limousine to pick up a sandwich.
A routing layer fixes that. Instead of assuming one model for every job, the system inspects the task, estimates its difficulty, and routes it to the cheapest model that is still good enough. That one design choice can turn “premium model everywhere” into a tiered system: small jobs go cheap, medium jobs go mid-tier, and only hard tasks touch the expensive stuff.
Fortunately, the open-source ecosystem is mature enough to handle this out of the box. RouteLLM is directly built around model routing and evaluation. LiteLLM is a gateway-style option that many teams already use, making the idea feel less like a theoretical exercise and more like standard production plumbing.
client { # Application
n1: circle label:"User request"
n10: circle label:"Deliver response"
n1.handle(bottom) -> gateway.n2.handle(top) [label="Query"]
}
gateway { # LLM Gateway
n2: rectangle label:"Extract intent & length"
n3: diamond label:"Requires heavy reasoning?"
n4: rectangle label:"Select Flash / Haiku"
n5: rectangle label:"Select Pro / Opus"
n8: rectangle label:"Normalize response"
n9: rectangle label:"Log token metrics"
n2.handle(right) -> n3.handle(left)
n3.handle(bottom) -> n4.handle(top) [label="No"]
n3.handle(right) -> n5.handle(left) [label="Yes"]
n4.handle(bottom) -> providers.n6.handle(top) [label="Call budget API"]
n5.handle(bottom) -> providers.n7.handle(top) [label="Call premium API"]
n8.handle(right) -> n9.handle(left)
n9.handle(top) -> client.n10.handle(bottom) [label="Final result"]
}
providers { # Model APIs
n6: rectangle label:"Process fast"
n7: rectangle label:"Process deep reasoning"
n6.handle(top) -> gateway.n8.handle(bottom) [label="Return answer"]
n7.handle(top) -> gateway.n8.handle(bottom) [label="Return answer"]
}
Everyone understands the pain of paying premium prices for work that did not need premium effort. Routing turns token optimization from a fuzzy backend concept into a straightforward rule: stop overpaying for easy work.
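If you want to see how small that rule is in practice, here is a minimal sketch using LiteLLM's completion() as the single call surface. The difficulty heuristic and model names are placeholders to be tuned for your own workload; a production setup would lean on RouteLLM-style learned routing rather than keyword checks.

```python
# Minimal routing sketch: cheap model by default, premium only when the task looks hard.
# The heuristic and model names are illustrative placeholders; assumes provider API keys
# (e.g. OPENAI_API_KEY) are already set in the environment.
from litellm import completion  # one call signature across providers

CHEAP_MODEL = "gpt-4o-mini"   # placeholder for your low-cost tier
PREMIUM_MODEL = "gpt-4o"      # placeholder for your expensive tier

def looks_hard(task: str) -> bool:
    """Crude difficulty check: long inputs or explicit reasoning cues go premium."""
    reasoning_cues = ("design", "refactor", "prove", "trade-off", "architecture")
    return len(task) > 2_000 or any(cue in task.lower() for cue in reasoning_cues)

def route(task: str) -> str:
    model = PREMIUM_MODEL if looks_hard(task) else CHEAP_MODEL
    response = completion(model=model, messages=[{"role": "user", "content": task}])
    return response.choices[0].message.content

# A simple extraction job never touches the premium tier.
print(route("Extract the invoice number from: 'Invoice #4821, due 2026-03-01'"))
```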
Solution #2: Put your context on a diet before it hits the model
One of the sneakiest costs in agent systems is context obesity. The agent starts small, then accumulates chat history, tool logs, documentation, retrieved chunks, error messages, and random debris from earlier turns until every request looks like it is dragging a suitcase through the airport. Then teams wonder why the bill looks unwell.
Context compaction is the cure. The point is not to rewrite everything into a summary. The point is to remove redundancy, stale material, and low-value clutter while preserving the information that still matters for the current step. Anthropic’s context engineering guidance reinforces that good agent performance depends on selecting and shaping context intentionally, not stuffing the maximum amount of text into a large window.
As a builder, you do not need “more context.” You need the right context, just like a traveler needs a carry-on, not a moving truck. Visualizing this as an architectural step makes the solution obvious:
agent { # AI Orchestrator
n1: circle label:"Task triggered"
n4: rectangle label:"Assemble raw context"
n9: rectangle label:"Build final prompt"
n12: circle label:"Parse result"
n1.handle(bottom) -> datastore.n2.handle(top) [label="Fetch history & docs"]
n4.handle(bottom) -> compactor.n5.handle(top) [label="100k tokens"]
n9.handle(bottom) -> llm.n10.handle(top) [label="Send 15k tokens"]
}
datastore { # Vector DB & Logs
n2: rectangle label:"Retrieve chunks"
n3: rectangle label:"Package records"
n2.handle(right) -> n3.handle(left)
n3.handle(top) -> agent.n4.handle(bottom) [label="Return matches"]
}
compactor { # Context Gateway
n5: rectangle label:"Parse AST & blocks"
n6: rectangle label:"Drop stale / duplicate info"
n7: rectangle label:"Rank by query relevance"
n8: rectangle label:"Pack remaining context"
n5.handle(right) -> n6.handle(left)
n6.handle(right) -> n7.handle(left)
n7.handle(right) -> n8.handle(left)
n8.handle(top) -> agent.n9.handle(bottom) [label="Compacted 15k tokens"]
}
llm { # Model API
n10: rectangle label:"Process lean prompt"
n11: rectangle label:"Generate answer"
n10.handle(right) -> n11.handle(left)
n11.handle(top) -> agent.n12.handle(bottom) [label="Deliver response"]
}
On the tooling side, Compresr’s Context Gateway positions compaction as infrastructure rather than a one-off hack. LangChain’s context engineering materials take this further, showing that the job is not just compression but smart selection, organization, and isolation of context.
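To make the compactor box concrete, here is a hedged sketch of a compaction pass in plain Python. The token estimate and relevance score are deliberately naive stand-ins for a real tokenizer and reranker, but the shape of the step is the one that matters: deduplicate, rank, pack to a budget.

```python
# Hedged compaction sketch: drop duplicates, rank by relevance, pack within a budget.
# approx_tokens and relevance are crude placeholders for a tokenizer and a reranker.
def compact(chunks: list[str], query: str, budget_tokens: int = 15_000) -> str:
    def approx_tokens(text: str) -> int:
        return max(1, len(text) // 4)  # rough rule of thumb, not a real tokenizer

    def relevance(chunk: str) -> int:
        query_terms = set(query.lower().split())
        return sum(1 for word in set(chunk.lower().split()) if word in query_terms)

    seen, deduped = set(), []
    for chunk in chunks:                       # 1. drop exact duplicates and stale repeats
        key = chunk.strip().lower()
        if key and key not in seen:
            seen.add(key)
            deduped.append(chunk)

    ranked = sorted(deduped, key=relevance, reverse=True)   # 2. most relevant first

    packed, used = [], 0
    for chunk in ranked:                       # 3. pack until the budget is spent
        cost = approx_tokens(chunk)
        if used + cost > budget_tokens:
            break
        packed.append(chunk)
        used += cost
    return "\n\n".join(packed)
```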
Solution #3: Make the provider remember what never changes
This one is almost rude in its simplicity. If the first half of the prompt never changes, why are you paying full freight for it every single time? Prompt caching exists precisely because stable prefixes are common in real systems: instructions, policies, tool lists, static docs, schemas, and guardrails tend to repeat across calls.
The trick is prompt structure. Teams sabotage caching when they inject volatile user data too early in the prompt or rebuild long instruction blocks on every turn. A better design keeps the reusable prefix stable and front-loaded, then appends the small dynamic portion later. That lets the provider reuse the cached work and charge only for the delta.
The mechanics are beautiful in their simplicity: a "first time versus next 200 times" dynamic. The first call does the heavy lift. Later calls cruise through the fast lane.
client { # Application
n1: circle label:"Start Run 1"
n2: rectangle label:"Assemble static system prompt"
n3: rectangle label:"Append volatile user input"
n7: rectangle label:"Receive Answer 1"
n8: circle label:"Start Run 2"
n9: rectangle label:"Reuse static system prompt"
n10: rectangle label:"Append new volatile user input"
n13: rectangle label:"Receive Answer 2"
n14: circle label:"Session done"
n1.handle(right) -> n2.handle(left)
n2.handle(right) -> n3.handle(left)
n3.handle(bottom) -> provider.n4.handle(top) [label="Send 51k tokens"]
n7.handle(right) -> n8.handle(left)
n8.handle(right) -> n9.handle(left)
n9.handle(right) -> n10.handle(left)
n10.handle(bottom) -> provider.n11.handle(top) [label="Send 51k tokens"]
n13.handle(right) -> n14.handle(left)
}
provider { # Model Provider
n4: rectangle label:"Cache Miss: compute full attention"
n5: rectangle label:"Write KV cache for 50k prefix"
n6: rectangle label:"Compute 1k delta & generate"
n11: rectangle label:"Cache Hit: read 50k from KV cache"
n12: rectangle label:"Compute 1k delta & generate"
n4.handle(right) -> n5.handle(left)
n5.handle(right) -> n6.handle(left)
n6.handle(top) -> client.n7.handle(bottom) [label="Cost: $1.00"]
n11.handle(right) -> n12.handle(left)
n12.handle(top) -> client.n13.handle(bottom) [label="Cost: $0.10"]
}
If you are annoyed at rising plan prices, remember you do not necessarily need a new tool or a new provider to reduce spend here. Often you just need to structure your requests like an adult. Prompt caching is an optimization that feels almost embarrassingly obvious once you map it out.
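For the record, the "adult" version of the request is short. The sketch below uses Anthropic's cache_control blocks to mark the stable prefix; the model name and instructions are placeholders, and you should check your provider's caching rules (minimum prefix size, TTL, pricing of cache writes) before trusting the exact numbers in the diagram.

```python
# Hedged sketch of a cache-friendly request with the Anthropic SDK: the stable prefix
# is marked with cache_control so later calls can reuse it; only the user turn changes.
# Assumes ANTHROPIC_API_KEY is set; model name and instructions are placeholders.
import anthropic

client = anthropic.Anthropic()

STATIC_INSTRUCTIONS = (
    "You are the support copilot. Follow the policy handbook below.\n"
    "..."  # imagine ~50k tokens of policies, schemas, and guardrails here
)

def ask(user_input: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder; pick a cache-capable model
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": STATIC_INSTRUCTIONS,              # stable prefix, front-loaded
                "cache_control": {"type": "ephemeral"},   # ask the provider to cache it
            }
        ],
        messages=[{"role": "user", "content": user_input}],  # small volatile delta
    )
    return response.content[0].text

print(ask("Customer asks: can I get a refund after 45 days?"))
print(ask("Customer asks: how do I change my billing address?"))  # should hit the cache
```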
Solution #4: Stop loading the whole tool universe into every request
Beyond the basics lies a real insider pain point. Once agents started using MCP servers and bigger tool catalogs, teams discovered that the model was not choking on user input alone. It was choking on gigantic tool definitions, parameter schemas, descriptions, and examples that got reloaded over and over again. That is the hidden token tax.
This matters because it is not just a little inefficiency. The MCP ecosystem is actively discussing formal ways to reduce schema bloat, including deduplication and retrieval-based tool selection, because the problem is large enough to distort whole workflows. Speakeasy’s dynamic toolset approach pushes in the same direction: expose only the tools that matter for the current task instead of dumping the whole kitchen sink into the prompt.
The architectural contrast here is stark. On one side, your system blasts every schema into every request and you wonder why the bill balloons. On the other, it checks what is actually needed, fetches only relevant tool definitions, and keeps the prompt lean.
agent { # AI Agent
n1: circle label:"New objective"
n2: rectangle label:"Analyze required capabilities"
n5: rectangle label:"Construct prompt with 2 schemas"
n8: circle label:"Execute step"
n1.handle(right) -> n2.handle(left)
n2.handle(bottom) -> registry.n3.handle(top) [label="Request tools by intent"]
n5.handle(bottom) -> llm.n6.handle(top) [label="Send lean request"]
}
registry { # MCP Tool Registry
n3: diamond label:"Match intent to tools"
n4: rectangle label:"Extract specific schemas (not all 50)"
n3.handle(right) -> n4.handle(left)
n4.handle(top) -> agent.n5.handle(bottom) [label="Return 2 schemas"]
}
llm { # Model API
n6: rectangle label:"Process 500 tool tokens (not 10k)"
n7: rectangle label:"Select tool to call"
n6.handle(right) -> n7.handle(left)
n7.handle(top) -> agent.n8.handle(bottom) [label="Return tool call"]
}
Some of the biggest costs aren't caused by “using too much AI” in the abstract. They are caused by poorly structured agent systems that providers are perfectly happy to bill you for.
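Here is a hedged sketch of the lean side of that contrast: match the objective against a registry and attach only the winning schemas. The registry contents and keyword matching are placeholders for whatever your MCP registry or retrieval-based selector actually does; the point is that two schemas travel with the prompt instead of fifty.

```python
# Hedged sketch of retrieval-style tool selection: ship only the schemas relevant to
# the current objective instead of the whole catalog. Registry entries are placeholders.
TOOL_REGISTRY = {
    "search_issues": {
        "keywords": {"bug", "issue", "ticket", "crash"},
        "schema": {"name": "search_issues", "parameters": {"query": "string"}},
    },
    "create_invoice": {
        "keywords": {"invoice", "billing", "charge"},
        "schema": {"name": "create_invoice", "parameters": {"amount": "number"}},
    },
    "deploy_service": {
        "keywords": {"deploy", "release", "rollback"},
        "schema": {"name": "deploy_service", "parameters": {"env": "string"}},
    },
}

def select_tools(objective: str, limit: int = 2) -> list[dict]:
    words = set(objective.lower().split())
    scored = [
        (len(entry["keywords"] & words), entry["schema"])
        for entry in TOOL_REGISTRY.values()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)   # best keyword overlap first
    return [schema for score, schema in scored[:limit] if score > 0]

# Only the matching schemas travel with the prompt; the rest of the catalog stays home.
print(select_tools("Triage the crash reported in ticket #512"))
```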
Solution #5: Put a leash on output before output puts a leash on you
Input tokens get all the attention, but output tokens are where a lot of products quietly bleed money. A model writes a long answer when JSON would do. It rewrites a whole file when a patch would do. It spills verbose reasoning into places where the application only needed a result. Then the next agent step has to ingest that bloated output, so today’s output waste becomes tomorrow’s input waste.
Output budgeting is the fix. Set max tokens. Use stop sequences. Ask for structured output. Prefer diffs and patches over full rewrites when the application can apply them. Tighten the schema so the model is less likely to wander off into essay mode. This is less glamorous than talking about giant context windows, but it is often where day-to-day savings actually appear.
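As a hedged sketch, those knobs look like this on an OpenAI-style Chat Completions call; the parameter names follow OpenAI's Python SDK, the model and schema are placeholders, and other providers expose equivalents under slightly different names.

```python
# Hedged sketch of output budgeting: cap the length, force JSON, fail fast on bad shape.
# Assumes OPENAI_API_KEY is set; model name and schema are placeholders.
import json
from openai import OpenAI

client = OpenAI()

def extract_status(report: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # placeholder model
        max_tokens=500,                           # hard ceiling on output spend
        response_format={"type": "json_object"},  # no essays, just the object
        messages=[
            {
                "role": "system",
                "content": 'Reply with JSON only: {"status": str, "blockers": [str]}',
            },
            {"role": "user", "content": report},
        ],
    )
    text = response.choices[0].message.content
    try:
        return json.loads(text)                   # parses correctly -> proceed
    except json.JSONDecodeError:
        # tighten the system prompt and retry upstream
        raise ValueError(f"Model wandered off schema: {text[:200]}")

print(extract_status("Deploy is blocked on flaky integration tests and a missing secret."))
```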
As a builder, you are sold more context, more agent loops, more autonomy, more magic. Great. But if your system responds with a novella every time it sneezes, you are paying for that performance art. Output budgeting is where a team decides it wants useful answers, not literary flourishes.
app { # Orchestrator
n1: circle label:"Task initiated"
n2: rectangle label:"Apply output schema (JSON, Diff)"
n3: rectangle label:"Set max_tokens=500"
n6: diamond label:"Parses correctly?"
n7: rectangle label:"Tighten system prompt"
n8: circle label:"Proceed to next step"
n1.handle(right) -> n2.handle(left)
n2.handle(right) -> n3.handle(left)
n3.handle(bottom) -> provider.n4.handle(top) [label="Strict constraints"]
n6.handle(right) -> n8.handle(left) [label="Yes"]
n6.handle(bottom) -> n7.handle(top) [label="No"]
n7.handle(left) -> n2.handle(bottom) [label="Retry with penalties"]
}
provider { # LLM API
n4: rectangle label:"Process request"
n5: rectangle label:"Generate formatted output"
n4.handle(right) -> n5.handle(left)
n5.handle(top) -> app.n6.handle(bottom) [label="Returns exact shape"]
}
The beauty of this solution is that it is immediately actionable. You do not need to wait for a future model release or a better pricing tier. You can decide today that outputs should be shaped to the application, not left to chance.
The GitHub toolbox that actually belongs in this story
It's easy to get lost in an endless list of GitHub repos. Instead, here is a curated toolkit tied directly to these five solutions. These projects prove that each optimization is real enough to have production-grade tooling behind it.
| Solution | Repo or product | Why it matters |
| :-- | :-- | :-- |
| Solution #1: routing | RouteLLM | Purpose-built for model routing and evaluation. |
| Solution #1: routing | LiteLLM | Strong production gateway architecture for routing, fallbacks, and multi-provider setups. |
| Solution #2: context compaction | Compresr Context Gateway | Treats context compaction as a dedicated infrastructure layer. |
| Solution #2: context compaction | LangChain context engineering | Provides a framework for context selection, ordering, and structuring. |
| Solution #3: prompt caching | Native provider features | Immediately available if you structure prompts with a static prefix first. |
| Solution #4: schema reduction | MCP SEP-1576 | Confirms schema bloat is a live ecosystem issue and proposes deduplication standards. |
| Solution #4: schema reduction | Speakeasy dynamic toolsets | A practical implementation of loading only the tools required for the task. |
| Solution #5: output budgeting | Structured-output features | Built into most APIs; requires application discipline more than new dependencies. |
| Related optimization | GPTCache | Reliable for semantic caching in repeat-heavy apps. |
| Related optimization | vCache | Built for semantic caching when cache reliability and error bounds matter. |
| Related optimization | LLMLingua | The standard for prompt compression when reducing KV-cache size. |
Semantic caching and prompt compression still matter. But the bigger problem for most AI builders today is architectural waste across routing, context, tools, caching, and output control.
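For the repeat-heavy case, here is a hedged sketch of the semantic-caching idea that GPTCache and vCache productionize: before paying for a model call, check whether a close-enough question has already been answered. The stdlib similarity check is a stand-in for the embedding-based matching those libraries actually use.

```python
# Hedged semantic-caching sketch: reuse an earlier answer when a new query is close
# enough to one already paid for. difflib is a crude stand-in for embedding similarity.
from difflib import SequenceMatcher

_cache: list[tuple[str, str]] = []  # (query, answer) pairs

def cached_answer(query: str, threshold: float = 0.9) -> str | None:
    for past_query, answer in _cache:
        if SequenceMatcher(None, query.lower(), past_query.lower()).ratio() >= threshold:
            return answer          # cache hit: no tokens spent
    return None

def answer(query: str, call_model) -> str:
    hit = cached_answer(query)
    if hit is not None:
        return hit
    result = call_model(query)     # only pay for genuinely new questions
    _cache.append((query, result))
    return result

# Usage with any model call, e.g. the route() sketch from Solution #1:
# answer("What is our refund policy?", call_model=route)
```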
The Bottom Line
The big providers are going to keep pushing users toward higher tiers, larger commitments, and more expensive ways of working. That part is unlikely to reverse. But you do not have to accept every cost as inevitable. You can route better, compact context, exploit caching, load fewer tools, and budget outputs like professionals.
If the model vendors want to squeeze more money out of AI builders, it's time to become much harder to squeeze.
Inspirations
- OpenAI Business Pricing: https://openai.com/business/chatgpt-pricing/
- ChatGPT Rate Card: https://help.openai.com/en/articles/11481834-chatgpt-rate-card
- ChatGPT Enterprise Guide: https://www.hungyichen.com/en/insights/chatgpt-enterprise-guide
- Claude 2026 Pricing Guide: https://www.heyuan110.com/posts/ai/2026-04-03-claude-pricing-complete-guide/
- Finout Claude Pricing Breakdown: https://www.finout.io/blog/claude-pricing-in-2026-for-individuals-organizations-and-developers
- AI Coding Tools Pricing: https://ijonis.com/en/ai-coding-tools-pricing
- Anthropic Context Engineering: https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- Redis Token Optimization: https://redis.io/blog/llm-token-optimization-speed-up-apps/
- RouteLLM GitHub: https://github.com/lm-sys/routellm
- RouteLLM Cost Reduction Writeup: https://gaodalie.substack.com/p/routellm-how-i-route-to-the-best
- LiteLLM GitHub: https://github.com/BerriAI/litellm
- LiteLLM Routing Docs: https://github.com/BerriAI/litellm/blob/main/docs/my-website/docs/routing.md
- Claude Automatic Context Compaction: https://platform.claude.com/cookbook/tool-use-automatic-context-compaction
- Compresr Context Gateway: https://compresr.ai/gateway
- Compresr Architecture Writeup: https://lilting.ch/en/articles/compresr-context-gateway-agent-proxy
- LangChain Context Engineering Repo: https://github.com/langchain-ai/context_engineering
- SylphAI Prompt Caching Guide: https://sylphai.substack.com/p/the-complete-guide-to-prompt-caching
- Prompt Caching Infrastructure Guide: https://introl.com/blog/prompt-caching-infrastructure-llm-cost-latency-reduction-guide-2025
- ProjectDiscovery Cost Cut Writeup: https://projectdiscovery.io/blog/how-we-cut-llm-cost-with-prompt-caching
- MCP SEP-1576 (Schema Bloat Proposal): https://github.com/modelcontextprotocol/modelcontextprotocol/issues/1576
- Speakeasy Dynamic Toolsets: https://www.speakeasy.com/blog/how-we-reduced-token-usage-by-100x-dynamic-toolsets-v2
- Chrome DevTools MCP Token Issue: https://github.com/ChromeDevTools/chrome-devtools-mcp/issues/340
- Token-Budget-Aware Reasoning: https://gist.github.com/thehunmonkgroup/aceb859a819729711bdb815cc946a34c
- Simbian Structured Outputs: https://simbian.ai/blog/using-structured-outputs-to-chain-llm-pipelines
- Google ADK Output Limits Issue: https://github.com/google/adk-python/issues/701
- GPTCache GitHub: https://github.com/zilliztech/gptcache
- GPTCache Overview: https://zilliz.com/what-is-gptcache
- vCache GitHub: https://github.com/vcache-project/vCache
- LLMLingua GitHub: https://github.com/microsoft/llmlingua
- LLMLingua Integration: https://microsoft.github.io/promptflow/integrations/tools/llmlingua-prompt-compression-tool.html
