Tags: token usage, AI builders, context compression, prompt caching, MCP, output control

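Most US AI providers are quietly tightening the pricing screws. OpenAI now pushes users along a more finely sliced ladder of Free, Plus, Pro, Business, and Enterprise, with Pro priced at $200 per month and Business billed per seat. Anthropic has likewise expanded its plan structure around heavier usage, including Claude Max tiers and team plans aimed at people who run AI workflows all day. Yes, they occasionally take a step back, but they soon find a new way to take a little more from your credit card. For AI builders the signal is clear: if the model is becoming your operating system, the vendor wants to charge higher "rent".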

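That does not leave developers powerless. It means the old habit of "send the prompt and see what comes back" is getting expensive fast. Sure, I have temporarily switched to DeepSeek V4 myself, but is that really a long-term answer? For anyone who still does not want to self-host these trillion-parameter LLMs, the other Chinese providers are not cheap either. The teams that win in the end will be the ones that treat token usage as a product design, system design, and workflow design problem all at once. That is exactly why token optimization has become such an important topic for engineers building agents, coding tools, copilots, and internal AI systems. It is already part of the job.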

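This article offers five practical fixes. Each one targets a different kind of token waste, and each lends itself to being visualized in FlowZap Code. Token waste usually hides in the plumbing: the model tier you pick, the context you keep, the tools you load, the caching you ignore, and the verbose output you accept by default.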

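The new AI tax

The pressure is coming from two directions at once. First, flagship plans are becoming more finely segmented and more expensive, especially for developers who rely on long context windows, agentic tools, or all-day usage. Second, as workflows grow more capable, it gets easier to inflate token usage without noticing: repeated instructions, huge tool schemas, excess context, and verbose model output all push costs up quietly.

So the real question is not just "Which provider should the team pick?" The more fundamental question is how a team can keep building ambitious AI workflows without taking a heavy cut on every request. That is what the five fixes below are for.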

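Fix 1: Stop sending every task to the most expensive model

The most obvious cost leak is model selection. Too many products treat the strongest model as the default answer to every task, even when the task is lightweight extraction, simple classification, format cleanup, or routine Q&A over a narrow context. That is like sending a limousine to deliver a sandwich.

A routing layer solves this. Instead of handing all work to the same model by default, the system first identifies the task, estimates its difficulty, and then routes it to a model that is good enough but cheaper. The design turns "premium model everywhere" into a tiered system: small tasks go to the cheap model, medium tasks go to the mid-tier, and only genuinely hard tasks touch the expensive one.

The good news is that the open-source ecosystem is mature enough to start right away. RouteLLM is built specifically around model routing and evaluation. LiteLLM is the gateway-style option many teams already use, which makes the idea feel like standard production plumbing rather than a theoretical exercise.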

client { # Application
n1: circle label:"User request"
n10: circle label:"Deliver response"
n1.handle(bottom) -> gateway.n2.handle(top) [label="Query"]
}

gateway { # LLM Gateway
n2: rectangle label:"Extract intent & length"
n3: diamond label:"Requires heavy reasoning?"
n4: rectangle label:"Select Flash / Haiku"
n5: rectangle label:"Select Pro / Opus"
n8: rectangle label:"Normalize response"
n9: rectangle label:"Log token metrics"
n2.handle(right) -> n3.handle(left)
n3.handle(bottom) -> n4.handle(top) [label="No"]
n3.handle(right) -> n5.handle(left) [label="Yes"]
n4.handle(bottom) -> providers.n6.handle(top) [label="Call budget API"]
n5.handle(bottom) -> providers.n7.handle(top) [label="Call premium API"]
n8.handle(right) -> n9.handle(left)
n9.handle(top) -> client.n10.handle(bottom) [label="Final result"]
}

providers { # Model APIs
n6: rectangle label:"Process fast"
n7: rectangle label:"Process deep reasoning"
n6.handle(top) -> gateway.n8.handle(bottom) [label="Return answer"]
n7.handle(top) -> gateway.n8.handle(bottom) [label="Return answer"]
}

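Everyone understands how painful it is to pay a premium price for work that never needed premium handling. Routing turns token optimization into one simple rule: stop paying heavyweight prices for lightweight jobs.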
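
To make the gateway decision in the diagram concrete, here is a minimal Python sketch of a rule-based router. It is not how RouteLLM or LiteLLM work internally. The model names, the keyword list, and the length threshold are all hypothetical placeholders, and a production router would more likely use a trained classifier than keyword matching.

```python
from dataclasses import dataclass

# Hypothetical tier names; swap in whatever models your gateway exposes.
BUDGET_MODEL = "budget-model"
PREMIUM_MODEL = "premium-model"

# Assumed heuristic: words that hint at multi-step reasoning.
REASONING_HINTS = ("prove", "debug", "refactor", "architecture", "step by step")


@dataclass
class RouteDecision:
    model: str
    reason: str


def route(prompt: str, long_prompt_chars: int = 4000) -> RouteDecision:
    """Pick a model tier from two cheap signals: prompt length and keywords."""
    text = prompt.lower()
    if len(prompt) > long_prompt_chars:
        return RouteDecision(PREMIUM_MODEL, "long prompt")
    for hint in REASONING_HINTS:
        if hint in text:
            return RouteDecision(PREMIUM_MODEL, f"keyword: {hint}")
    # Everything else defaults to the cheap tier.
    return RouteDecision(BUDGET_MODEL, "default to cheap tier")
```

The useful property is the default: a request has to earn the premium tier, instead of every request starting there. Logging the `reason` field alongside token metrics makes it easier to see later which rule is sending traffic to the expensive model.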

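Fix 2: Put the context on a diet before handing it to the model

One of the sneakiest costs in agent systems is context obesity. An agent starts out small, then keeps piling up chat history, tool logs, documents, retrieved snippets, error messages, and assorted debris from earlier turns, until every request is like dragging a suitcase through an airport. Then the team wonders why the bill looks off.

Context compression is the antidote. The point is not to rewrite everything into a summary. It is to strip out duplicated, stale, and low-value material while keeping the information the current step still needs. Anthropic's context engineering guide makes the same point: good agent performance depends on deliberately selecting and shaping context, not on cramming as much text as possible into the window.

As a builder you do not need more context. You need the right context, the way a traveler needs a carry-on, not a moving truck. Drawn as architecture steps, the solution becomes obvious: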

agent { # AI Orchestrator
n1: circle label:"Task triggered"
n4: rectangle label:"Assemble raw context"
n9: rectangle label:"Build final prompt"
n12: circle label:"Parse result"
n1.handle(bottom) -> datastore.n2.handle(top) [label="Fetch history & docs"]
n4.handle(bottom) -> compactor.n5.handle(top) [label="100k tokens"]
n9.handle(bottom) -> llm.n10.handle(top) [label="Send 15k tokens"]
}

datastore { # Vector DB & Logs
n2: rectangle label:"Retrieve chunks"
n3: rectangle label:"Package records"
n2.handle(right) -> n3.handle(left)
n3.handle(top) -> agent.n4.handle(bottom) [label="Return matches"]
}

compactor { # Context Gateway
n5: rectangle label:"Parse AST & blocks"
n6: rectangle label:"Drop stale / duplicate info"
n7: rectangle label:"Rank by query relevance"
n8: rectangle label:"Pack remaining context"
n5.handle(right) -> n6.handle(left)
n6.handle(right) -> n7.handle(left)
n7.handle(right) -> n8.handle(left)
n8.handle(top) -> agent.n9.handle(bottom) [label="Compacted 15k tokens"]
}

llm { # Model API
n10: rectangle label:"Process lean prompt"
n11: rectangle label:"Generate answer"
n10.handle(right) -> n11.handle(left)
n11.handle(top) -> agent.n12.handle(bottom) [label="Deliver response"]
}

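On the tooling side, Compresr's Context Gateway treats compression as an infrastructure layer rather than a one-off patch. LangChain's context engineering material goes a step further: the problem is not only compression but intelligently selecting, organizing, and isolating context.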
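
The compactor lane in the diagram (drop stale or duplicate items, rank by relevance, pack what remains) can be sketched in a few lines of Python. This is an illustration only, not the Compresr or LangChain implementation. The `Chunk` type, the four-characters-per-token estimate, and the word-overlap relevance score are all assumptions made for the example; a real system would use a proper tokenizer and an embedding-based ranker.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    turn: int  # turn at which the chunk entered the context


def estimate_tokens(text: str) -> int:
    # Rough assumption: about 4 characters per token.
    return max(1, len(text) // 4)


def relevance(chunk: Chunk, query: str) -> int:
    # Toy score: how many query words appear in the chunk text.
    words = set(query.lower().split())
    return sum(1 for w in words if w in chunk.text.lower())


def compact(chunks, query, current_turn, max_age=10, budget_tokens=15000):
    """Drop duplicates and stale chunks, rank by relevance, pack under a budget."""
    seen = set()
    fresh = []
    for c in chunks:
        key = c.text.strip()
        if key in seen:
            continue  # exact duplicate
        if current_turn - c.turn > max_age:
            continue  # too old to matter for the current step
        seen.add(key)
        fresh.append(c)
    # Most relevant first; ties keep their original order.
    ranked = sorted(fresh, key=lambda c: relevance(c, query), reverse=True)
    packed, used = [], 0
    for c in ranked:
        cost = estimate_tokens(c.text)
        if used + cost > budget_tokens:
            continue  # does not fit, try the next (smaller) chunk
        packed.append(c)
        used += cost
    return packed
```

Packing in relevance order means that when the budget runs out, it is the least relevant material that gets left behind, which is the whole point of the carry-on analogy.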

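Fix 3: Let the provider remember what never changes

This fix is almost embarrassingly simple. If the first half of the prompt never changes, why pay full price for it every time? Prompt caching exists because stable prefixes are common in real systems: instructions, policies, tool lists, static documents, schemas, and guardrails tend to repeat across many calls.

The trick is prompt structure. Many teams wreck the cache by mixing highly variable user data into the front of the prompt too early, or by rebuilding a large block of instructions on every turn. A better design keeps the reusable prefix stable and up front, then appends a small amount of dynamic content. The provider can then reuse its cached work, bill the cached prefix at a reduced rate, and charge full price only for the delta.

The mechanism is elegant: the first call is completely different from the next 200. The first call does the heavy lifting, and the following calls take the fast lane.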

client { # Application
n1: circle label:"Start Run 1"
n2: rectangle label:"Assemble static system prompt"
n3: rectangle label:"Append volatile user input"
n7: rectangle label:"Receive Answer 1"
n8: circle label:"Start Run 2"
n9: rectangle label:"Reuse static system prompt"
n10: rectangle label:"Append new volatile user input"
n13: rectangle label:"Receive Answer 2"
n14: circle label:"Session done"
n1.handle(right) -> n2.handle(left)
n2.handle(right) -> n3.handle(left)
n3.handle(bottom) -> provider.n4.handle(top) [label="Send 51k tokens"]
n7.handle(right) -> n8.handle(left)
n8.handle(right) -> n9.handle(left)
n9.handle(right) -> n10.handle(left)
n10.handle(bottom) -> provider.n11.handle(top) [label="Send 51k tokens"]
n13.handle(right) -> n14.handle(left)
}

provider { # Model Provider
n4: rectangle label:"Cache Miss: compute full attention"
n5: rectangle label:"Write KV cache for 50k prefix"
n6: rectangle label:"Compute 1k delta & generate"
n11: rectangle label:"Cache Hit: read 50k from KV cache"
n12: rectangle label:"Compute 1k delta & generate"
n4.handle(right) -> n5.handle(left)
n5.handle(right) -> n6.handle(left)
n6.handle(top) -> client.n7.handle(bottom) [label="Cost: $1.00"]
n11.handle(right) -> n12.handle(left)
n12.handle(top) -> client.n13.handle(bottom) [label="Cost: $0.10"]
}

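If rising plan prices annoy you, remember that you do not necessarily need to switch tools or providers to cut costs. Often you only need to structure your requests like a grown-up. Once you draw prompt caching out, it becomes an almost obvious optimization.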
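
The arithmetic behind the two runs in the diagram can be sketched in Python. The helper names, the price per million tokens, and the 90% discount on cached tokens are assumptions chosen for illustration; actual rates and cache rules vary by provider, and some providers also charge a premium for cache writes, which this sketch ignores.

```python
def build_prompt(static_prefix: str, user_input: str) -> str:
    # Stable content first, volatile content last, so the prefix stays
    # byte-identical from one call to the next.
    return static_prefix + "\n\n" + user_input


def input_cost(prefix_tokens, delta_tokens, price_per_mtok,
               cached_discount=0.9, cache_hit=False):
    """Estimated input cost in dollars for one call.

    price_per_mtok is a hypothetical price per million input tokens.
    cached_discount is the assumed fraction taken off cached prefix tokens.
    """
    if cache_hit:
        prefix_rate = price_per_mtok * (1 - cached_discount)
    else:
        prefix_rate = price_per_mtok
    total = prefix_tokens * prefix_rate + delta_tokens * price_per_mtok
    return total / 1_000_000
```

With a hypothetical price of $20 per million input tokens, a 50k-token prefix plus a 1k-token delta comes to about $1.02 on a cache miss and about $0.12 on a hit, which is the same order of difference the diagram shows. The saving only materializes if `build_prompt` really does keep the prefix unchanged: a timestamp or user name injected near the top is enough to turn every call back into a miss.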

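Fix 4: Stop loading the entire tool universe on every request

Beyond the basics, there is a genuine insider pain point. As agents start using MCP servers and larger tool catalogs, teams are finding that the model is bogged down not only by user input but also by huge tool definitions, parameter schemas, descriptions, and examples that get reloaded over and over. This is the hidden token tax.

It matters because it is more than a minor inefficiency. The MCP ecosystem is actively discussing formal ways to reduce schema bloat, including deduplication and retrieval-based tool selection, because the problem is large enough to distort entire workflows. Speakeasy's dynamic toolset approach pushes in the same direction: expose only the tools the current task needs instead of stuffing the whole kitchen drawer into the prompt.

The architectural contrast is stark. On one side, the system crams every schema into every request and then acts surprised when the bill explodes. On the other, it first works out what is needed, pulls only the relevant tool definitions, and keeps the prompt lean.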

agent { # AI Agent
n1: circle label:"New objective"
n2: rectangle label:"Analyze required capabilities"
n5: rectangle label:"Construct prompt with 2 schemas"
n8: circle label:"Execute step"
n1.handle(right) -> n2.handle(left)
n2.handle(bottom) -> registry.n3.handle(top) [label="Request tools by intent"]
n5.handle(bottom) -> llm.n6.handle(top) [label="Send lean request"]
}

registry { # MCP Tool Registry
n3: diamond label:"Match intent to tools"
n4: rectangle label:"Extract specific schemas (not all 50)"
n3.handle(right) -> n4.handle(left)
n4.handle(top) -> agent.n5.handle(bottom) [label="Return 2 schemas"]
}

llm { # Model API
n6: rectangle label:"Process 500 tool tokens (not 10k)"
n7: rectangle label:"Select tool to call"
n6.handle(right) -> n7.handle(left)
n7.handle(top) -> agent.n8.handle(bottom) [label="Return tool call"]
}

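Many of the biggest costs do not come from the abstract problem of "using too much AI". They come from poorly structured systems, which vendors are happy to bill at full rate.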
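
The registry step in the diagram (match intent to tools, return only a couple of schemas) might look like the following Python sketch. It is not the MCP SDK or Speakeasy's implementation. The registry contents, the tool names, and the word-overlap matching are hypothetical; a retrieval-based selector would typically use embeddings instead.

```python
import json

# Hypothetical registry: tool name -> description and JSON schema.
TOOL_REGISTRY = {
    "search_issues": {
        "description": "search open issues in the bug tracker",
        "schema": {"type": "object", "properties": {"query": {"type": "string"}}},
    },
    "create_invoice": {
        "description": "create a customer invoice in billing",
        "schema": {"type": "object", "properties": {"amount": {"type": "number"}}},
    },
    "send_email": {
        "description": "send an email to a customer",
        "schema": {"type": "object", "properties": {"to": {"type": "string"}}},
    },
}


def select_tools(intent: str, registry: dict, top_k: int = 2) -> dict:
    """Return at most top_k tools whose descriptions share words with the intent."""
    intent_words = set(intent.lower().split())
    scored = []
    for name, tool in registry.items():
        overlap = len(intent_words & set(tool["description"].lower().split()))
        if overlap > 0:
            scored.append((overlap, name))
    # Highest overlap first; names break ties so the result is deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return {name: registry[name] for _, name in scored[:top_k]}


def schema_chars(tools: dict) -> int:
    # Size of the serialized definitions, a crude proxy for tool tokens.
    return len(json.dumps(tools))
```

Comparing `schema_chars(select_tools(intent, TOOL_REGISTRY))` with `schema_chars(TOOL_REGISTRY)` gives a quick, if rough, measure of how much of each request was tool definitions the model never needed.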

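Fix 5: Rein in the output before it reins you in

Input tokens get the attention, but for many products the quiet bleeding happens in output tokens. The model could have returned just JSON but wrote a long essay. Only a patch was needed, but it rewrote the whole file. The application wanted a single result, but the model dumped out its lengthy reasoning as well. Then the next agent step has to ingest that bloated output, so today's output waste becomes tomorrow's input waste.

The answer is an output budget. Set max_tokens. Use stop sequences. Require structured output. Where the application can apply them directly, prefer diffs and patches over whole-file rewrites. Tighten the schema so the model is less likely to slide into prose mode. This is less flashy than a huge context window, but it is often where the day-to-day savings are most visible.

As a builder, you will be sold more context, more agent loops, more autonomy, more magic. Fine. But if your system answers every sneeze with an essay, what you are paying for is performance art. An output budget is the moment a team decides it wants useful answers, not literary decoration.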

app { # Orchestrator
n1: circle label:"Task initiated"
n2: rectangle label:"Apply output schema (JSON, Diff)"
n3: rectangle label:"Set max_tokens=500"
n6: diamond label:"Parses correctly?"
n7: rectangle label:"Tighten system prompt"
n8: circle label:"Proceed to next step"
n1.handle(right) -> n2.handle(left)
n2.handle(right) -> n3.handle(left)
n3.handle(bottom) -> provider.n4.handle(top) [label="Strict constraints"]
n6.handle(right) -> n8.handle(left) [label="Yes"]
n6.handle(bottom) -> n7.handle(top) [label="No"]
n7.handle(left) -> n2.handle(bottom) [label="Retry with penalties"]
}

provider { # LLM API
n4: rectangle label:"Process request"
n5: rectangle label:"Generate formatted output"
n4.handle(right) -> n5.handle(left)
n5.handle(top) -> app.n6.handle(bottom) [label="Returns exact shape"]
}

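The best thing about this fix is that you can do it immediately. You do not need to wait for a future model release or a better pricing tier. You can decide today that output must be designed for the application rather than left to run wild.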
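
As a rough illustration of the orchestrator loop in the diagram (apply a schema, cap the output, check that it parses, tighten and retry), here is a provider-agnostic Python sketch. `call_model` is a placeholder for whatever client function your stack provides, and the two required keys are a hypothetical schema. Where a provider offers native structured output, that is usually preferable to prompt-level instructions like these.

```python
import json

REQUIRED_KEYS = {"status", "summary"}  # hypothetical output schema


def parse_strict(raw: str):
    """Return the parsed object if it is JSON with exactly the expected keys, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        return None
    return data


def run_with_budget(call_model, task: str, max_tokens: int = 500, max_attempts: int = 2):
    """call_model is a caller-supplied function: (prompt, max_tokens) -> str."""
    prompt = task + "\nReply with JSON only, keys: status, summary."
    for _ in range(max_attempts):
        parsed = parse_strict(call_model(prompt, max_tokens))
        if parsed is not None:
            return parsed
        # Tighten the instruction before the next attempt.
        prompt += "\nNo prose. No markdown. JSON object only."
    raise ValueError("model never returned the expected shape")
```

Capping `max_attempts` matters as much as capping `max_tokens`: a retry loop without a limit can spend more than the verbose output it was meant to prevent.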

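The GitHub toolbox that actually belongs in this article

It is easy to get lost in endless lists of GitHub repositories. So here is a curated toolbox that maps directly onto the five fixes. It shows that each optimization is more than talk: production-grade tools already back it.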

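Fix	Repository or product	Why it matters
Fix 1: Routing	RouteLLM	Purpose-built for model routing and evaluation.
Fix 1: Routing	LiteLLM	A solid production-grade gateway architecture for routing, fallback, and multi-provider setups.
Fix 2: Context compression	Compresr Context Gateway	Treats context compression as a dedicated infrastructure layer.
Fix 2: Context compression	LangChain context engineering	Provides a framework for selecting, ranking, and structuring context.
Fix 3: Prompt caching	Provider-native features	Usable immediately once the prompt is structured with the static prefix first.
Fix 4: Schema diet	MCP SEP-1576	Confirms that schema bloat is a real problem in the ecosystem and proposes a deduplication standard.
Fix 4: Schema diet	Speakeasy dynamic toolsets	A very practical implementation that loads only the tools the current task needs.
Fix 5: Output budget	Structured-output features	Available in most APIs, but they depend on discipline at the application layer.
Related optimization	GPTCache	Semantic caching for high-repetition scenarios.
Related optimization	vCache	Semantic caching for cases where cache reliability and error bounds matter more.
Related optimization	LLMLingua	One of the standard prompt-compression tools for when the KV cache needs to shrink.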

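Semantic caching and prompt compression still matter. But for most AI builders today, the bigger problem is architectural waste in routing, context, tools, caching, and output control.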

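Conclusion

The big providers will keep pushing users toward higher tiers, larger commitments, and more expensive ways of working. That trend is unlikely to reverse. But you do not have to treat every cost as inevitable. You can route better, compress context, take advantage of caching, load fewer tools, and manage output budgets like a professional.

If model vendors want to squeeze more money out of AI builders, now is the time to become harder to squeeze.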

Sources of inspiration

OpenAI Business Pricing: https://openai.com/business/chatgpt-pricing/
ChatGPT Rate Card: https://help.openai.com/en/articles/11481834-chatgpt-rate-card
ChatGPT Enterprise Guide: https://www.hungyichen.com/en/insights/chatgpt-enterprise-guide
Claude 2026 Pricing Guide: https://www.heyuan110.com/posts/ai/2026-04-03-claude-pricing-complete-guide/
Finout Claude Pricing Breakdown: https://www.finout.io/blog/claude-pricing-in-2026-for-individuals-organizations-and-developers
AI Coding Tools Pricing: https://ijonis.com/en/ai-coding-tools-pricing
Anthropic Context Engineering: https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
Redis Token Optimization: https://redis.io/blog/llm-token-optimization-speed-up-apps/
RouteLLM GitHub: https://github.com/lm-sys/routellm
RouteLLM Cost Reduction Writeup: https://gaodalie.substack.com/p/routellm-how-i-route-to-the-best
LiteLLM GitHub: https://github.com/BerriAI/litellm
LiteLLM Routing Docs: https://github.com/BerriAI/litellm/blob/main/docs/my-website/docs/routing.md
Claude Automatic Context Compaction: https://platform.claude.com/cookbook/tool-use-automatic-context-compaction
Compresr Context Gateway: https://compresr.ai/gateway
Compresr Architecture Writeup: https://lilting.ch/en/articles/compresr-context-gateway-agent-proxy
LangChain Context Engineering Repo: https://github.com/langchain-ai/context_engineering
SylphAI Prompt Caching Guide: https://sylphai.substack.com/p/the-complete-guide-to-prompt-caching
Prompt Caching Infrastructure Guide: https://introl.com/blog/prompt-caching-infrastructure-llm-cost-latency-reduction-guide-2025
ProjectDiscovery Cost Cut Writeup: https://projectdiscovery.io/blog/how-we-cut-llm-cost-with-prompt-caching
MCP SEP-1576 (Schema Bloat Proposal): https://github.com/modelcontextprotocol/modelcontextprotocol/issues/1576
Speakeasy Dynamic Toolsets: https://www.speakeasy.com/blog/how-we-reduced-token-usage-by-100x-dynamic-toolsets-v2
Chrome DevTools MCP Token Issue: https://github.com/ChromeDevTools/chrome-devtools-mcp/issues/340
Token-Budget-Aware Reasoning: https://gist.github.com/thehunmonkgroup/aceb859a819729711bdb815cc946a34c
Simbian Structured Outputs: https://simbian.ai/blog/using-structured-outputs-to-chain-llm-pipelines
Google ADK Output Limits Issue: https://github.com/google/adk-python/issues/701
GPTCache GitHub: https://github.com/zilliztech/gptcache
GPTCache Overview: https://zilliz.com/what-is-gptcache
vCache GitHub: https://github.com/vcache-project/vCache
LLMLingua GitHub: https://github.com/microsoft/llmlingua
LLMLingua Integration: https://microsoft.github.io/promptflow/integrations/tools/llmlingua-prompt-compression-tool.html

How to optimize your token usage: 5 fixes for AI builders squeezed by costs