How to Save Millions on LLM Tokens

Tokenmaxxing is funny until the bill shows up.

The trend is real. Engineers inside large tech companies are competing to burn more AI tokens. At Meta, an internal leaderboard ranks employees by token usage, and they compete for the title of "Token Legend."

Token usage, however, is a strange status symbol. Tokens themselves are not work; they are the bill you pay to get work done, and that bill comes out of gross margin.

Right now, free credit tiers insulate many AI startups from this reality. Credits accelerate building, but they hide the true cost of the system. When the runway ends, token spend becomes a hard product constraint. In voice AI and other high-volume systems, it dictates whether your business actually works.

The teams that survive are the ones that learn to treat prompts as first-class software, and tokens as a cost center they actively manage. By applying standard engineering discipline to this layer, I've cut millions in annualized LLM spend without changing application behavior.

The sequence below serves as a mini playbook of lessons I've learned, with practical decision rules on how to get the most out of every token.

Cache me if you can

The lowest-risk first move is application-level response caching.

Find the calls in your system where the same input should produce the same output: a classification step, an extraction step, any path where the mapping is deterministic. Paying an LLM to regenerate those responses every time is lighting money on fire.

Caching has been a basic infrastructure discipline for decades. I often see it skipped in LLM systems because the model feels magical, but a model call is often just a very expensive function call. When it is, treat it like one.
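A minimal sketch of what that can look like, assuming a hypothetical `call_llm(model, prompt, **params)` function that returns a string. The key covers everything that can change the output (model, prompt, parameters), and the in-memory dict stands in for whatever shared store you would actually use.

```python
import hashlib
import json
from typing import Callable, Dict


def cache_key(model: str, prompt: str, params: dict) -> str:
    """Build a stable key from everything that can change the output."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,  # same params in a different order give the same key
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedLLM:
    """Wraps a hypothetical call_llm(model, prompt, **params) -> str."""

    def __init__(self, call_llm: Callable[..., str]):
        self._call_llm = call_llm
        self._store: Dict[str, str] = {}  # assumption: swap for a shared cache in production
        self.hits = 0
        self.misses = 0

    def complete(self, model: str, prompt: str, **params) -> str:
        key = cache_key(model, prompt, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._call_llm(model, prompt, **params)
        self._store[key] = result
        return result
```

Only wrap calls where a repeated answer is acceptable. Changing the model or any parameter produces a new key, so an upgrade does not silently serve stale responses.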

You can't fix what you can't see

Caching gets you the easy wins. Everything after depends on understanding how your system actually behaves in production.

That means logging the inputs your prompts receive and the outputs they produce, and sampling from those logs to build benchmarks that reflect real traffic. The implementation varies by stack, but the principle does not. You cannot optimize what you have not measured.
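As one possible shape for this, here is a sketch that appends each call to a JSONL file and later draws a benchmark sample from it. The record fields, the file format, and the choice to stratify by output value are all assumptions for illustration. Stratifying by output only makes sense when outputs are categorical, such as classification labels.

```python
import json
import random
from collections import defaultdict
from typing import Dict, List


def log_call(path: str, prompt_id: str, inputs: dict, output: str) -> None:
    """Append one production call to a JSONL log."""
    record = {"prompt_id": prompt_id, "inputs": inputs, "output": output}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def load_log(path: str) -> List[dict]:
    """Read every logged call back into memory."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def sample_benchmark(records: List[dict], per_class: int, seed: int = 0) -> List[dict]:
    """Sample up to per_class records for each distinct output value."""
    rng = random.Random(seed)  # fixed seed so the benchmark is reproducible
    by_class: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        by_class[record["output"]].append(record)
    sample: List[dict] = []
    for label in sorted(by_class):
        group = by_class[label]
        sample.extend(rng.sample(group, min(per_class, len(group))))
    return sample
```

Sampling per class keeps rare outputs in the benchmark. A uniform sample of real traffic would mostly contain the majority class.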

The Ceiling Test

Measurement gets you the data, but it doesn't yet tell you when a change is safe to ship. That's the harder problem.

The reason it's hard is that LLMs are stochastic. If you measure a candidate replacement against a single reference run of production, you are measuring two things at once: how the candidate differs from production, and how much production differs from itself.

Measure production against itself first. The rate of agreement is your noise ceiling: an estimate of how reproducible the current system is. Production does not agree with itself above that rate, so you should not expect a faithful replacement to either.
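A minimal sketch of that measurement, assuming outputs can be compared by exact match (labels or normalized extractions). Free-text outputs would need a different agreement function. Each run is a list of production outputs over the same benchmark inputs, in the same order.

```python
from itertools import combinations
from typing import Sequence


def agreement_rate(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of benchmark items on which two runs give the same output."""
    if len(a) != len(b) or not a:
        raise ValueError("runs must be non-empty and the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def noise_ceiling(runs: Sequence[Sequence[str]]) -> float:
    """Mean pairwise agreement across repeated production runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two production runs")
    return sum(agreement_rate(a, b) for a, b in pairs) / len(pairs)
```

With only two runs this reduces to a single agreement rate. More runs give a steadier estimate, at the cost of more production calls.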

This is test-retest reliability applied pragmatically to a stochastic system. The intuition is related to classical reliability theory, including Spearman's work on attenuation from measurement error: noisy measurements limit what you can infer from comparisons. LLM outputs are not psychometric instruments, but the practical lesson carries over: first estimate the system's own variability before judging whether a replacement is meaningfully different.

This gives you the ceiling test: a practical optimization criterion for behavior-preserving cost reduction. First, measure production against itself across repeated runs to estimate the noise ceiling overall and for each important output class or semantic slice. A candidate replacement passes if it matches production at the noise ceiling, within statistical uncertainty, both overall and on those slices. Below the ceiling, it is changing behavior. At the ceiling, it is statistically indistinguishable from production on the behavior the benchmark measures. Above the ceiling, you should check whether it is genuinely less noisy or merely overfitting to the reference runs.
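One way to turn this into a ship/no-ship rule is sketched below. It makes several assumptions that are mine, not part of the method as stated: exact-match agreement, two production runs (`prod_a` as reference, `prod_b` as the retest), a percentile bootstrap over benchmark items, and a small tolerance `margin` below which a shortfall is treated as noise. Slice labels are hypothetical names you assign per item.

```python
import random
from typing import Dict, List, Sequence, Tuple


def bootstrap_ci(values: Sequence[float], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def ceiling_test(prod_a: Sequence[str], prod_b: Sequence[str],
                 candidate: Sequence[str], slices: Sequence[str],
                 margin: float = 0.02) -> Dict[str, dict]:
    """Compare candidate-vs-production agreement with production-vs-production."""
    if not (len(prod_a) == len(prod_b) == len(candidate) == len(slices)) or not prod_a:
        raise ValueError("all inputs must be non-empty and the same length")
    # Per item: +1 if only the candidate matches the reference run,
    # -1 if only the production retest does, 0 if both or neither match.
    diffs = [int(c == a) - int(b == a) for a, b, c in zip(prod_a, prod_b, candidate)]
    groups: Dict[str, List[int]] = {"overall": list(range(len(diffs)))}
    for i, name in enumerate(slices):
        groups.setdefault(name, []).append(i)
    results = {}
    for name, idx in groups.items():
        vals = [diffs[i] for i in idx]
        lo, hi = bootstrap_ci(vals)
        results[name] = {
            "diff": sum(vals) / len(vals),  # candidate agreement minus noise ceiling
            "ci": (lo, hi),
            "passes": lo >= -margin,        # not credibly below the ceiling
            "above_ceiling": lo > 0,        # worth checking for overfitting
        }
    return results
```

A candidate ships only if `passes` is true for `overall` and for every slice. Small slices produce wide intervals, so a failure there may mean the benchmark needs more examples rather than that the candidate is worse.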

The ceiling test establishes behavioral indistinguishability from production on the benchmark, not absolute correctness. If production is systematically wrong, a replacement that matches it inherits that wrongness. For cutting costs on a system stakeholders already consider acceptable, this is the right target.

Everything that follows is gated on the ceiling test. Without it, every change comes down to vibes. With it, every change has a number attached and a clear rule for whether to ship.

Prompts as programs, not poetry: DSPy + GEPA

Production prompts accumulate bloat. Instructions get added over time, examples become stale, and defensive language piles up after incidents. Holding the model fixed and shrinking the prompt can cut length by up to 40% while passing the ceiling test, which at production scale is a sizable reduction in input token costs.

This is where prompting stops being a craft and becomes a search problem. Inside most production teams, prompt engineering is still a black box: people write prompts based on intuition, ship them when the outputs look reasonable, and patch them when something breaks. No measurement, no comparison, no systematic exploration. A.k.a. vibes.

DSPy and GEPA are excellent frameworks that help remove the vibes.

DSPy treats LLM systems as programs rather than prompt strings. You declare a task with typed inputs and outputs, supply a metric and a handful of examples, and let the framework's optimizers search over prompt formulations and few-shot selections to maximize the metric. Prompts become artifacts produced by a compiler, not artifacts written by hand. The compiled program is portable across models, so the same pipeline can be re-optimized for a cheaper model later without rewriting any logic. (DSPy also defaults to a key-value output format rather than JSON, which cuts output tokens by roughly half and avoids the reasoning degradation that JSON-mode imposes on many tasks.)

GEPA is a reflective prompt optimizer that runs a genetic-Pareto search. Instead of sampling random variations and keeping what scores well, it examines execution traces, reasons about why specific examples failed, and proposes targeted edits. Candidates are evaluated across multiple objectives and maintained on a Pareto frontier rather than collapsed to a single score, so the optimizer keeps prompts that win on different tradeoffs instead of converging on one local optimum. The published results show it matching or beating reinforcement learning approaches with orders of magnitude fewer rollouts. The edits it proposes are legible enough that you can read the diff and understand why the new prompt is better. GEPA can run inside DSPy as an optimizer (what DSPy used to call a teleprompter) or stand on its own.
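To make the Pareto idea concrete, here is a generic non-dominated filter. It is an illustration of the concept only, not GEPA's internal implementation. The candidate names and the two objectives (benchmark agreement and a brevity score) are hypothetical, and higher is assumed to be better on every objective.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(candidates: Dict[str, Sequence[float]]) -> List[str]:
    """Names of candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other_name, other_scores in candidates.items()
            if other_name != name
        )
    ]


# Hypothetical prompt variants scored as (agreement, brevity).
variants = {
    "long_and_accurate": (0.95, 0.20),
    "short_and_decent": (0.88, 0.80),
    "long_and_worse": (0.80, 0.10),
}
frontier = pareto_frontier(variants)  # keeps the first two, drops the dominated one
```

Collapsing both scores into a single weighted sum would force a choice between the first two variants up front. Keeping the frontier defers that choice until you can run each survivor through the ceiling test.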

Prompts that come out of this process are shorter and more legible than what they replace, because the optimizer is incentivized to remove anything that does not contribute to the metric.

You don't need as much intelligence as you think

After the prompt is tight, try a cheaper model. Run the ceiling test, re-optimize the prompt for the cheaper model, and run it again. Frontier and small-model pricing differ by one to two orders of magnitude, so a successful downgrade can take the per-call cost down by as much as 98%. We often overestimate how much intelligence the task actually needs.
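The arithmetic behind that figure is easy to check. The sketch below uses made-up prices and token counts, chosen so the two models sit 50x apart. They are not real vendor prices.

```python
def call_cost(input_tokens: int, output_tokens: int,
              usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one call in USD, given prices per million tokens."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


def percent_saved(baseline_cost: float, new_cost: float) -> float:
    """Percentage reduction relative to the baseline cost."""
    if baseline_cost <= 0:
        raise ValueError("baseline cost must be positive")
    return 100 * (1 - new_cost / baseline_cost)


# Hypothetical prices in USD per million tokens (illustrative only).
FRONTIER = {"usd_per_m_input": 10.0, "usd_per_m_output": 30.0}
SMALL = {"usd_per_m_input": 0.20, "usd_per_m_output": 0.60}

baseline = call_cost(2000, 200, **FRONTIER)        # original prompt, frontier model
shorter_prompt = call_cost(1200, 200, **FRONTIER)  # prompt cut by 40%, same model
downgraded = call_cost(2000, 200, **SMALL)         # original prompt, small model

print(round(percent_saved(baseline, shorter_prompt), 1))
print(round(percent_saved(baseline, downgraded), 1))
```

Under these assumptions a 50x price gap works out to a 98% reduction per call. A 40% shorter prompt saves only about 31% of the call cost, because output tokens are unchanged. The two savings compound if you do both.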

When the cheaper model cannot recover the benchmark even after re-optimization, look at the task itself. Many prompts ask the model to do several things at once. A small model that struggles with the whole thing in one shot can often handle each piece on its own.

This is where DSPy earns its keep a second time. Because the system is already expressed as a program with typed inputs and outputs, breaking a single module into a pipeline of smaller modules is a refactor, not a rewrite. Each module gets its own optimized prompt and can run on the cheapest model that passes the ceiling test for that subtask. Classification, normalization, and routing decisions move to small models. Reasoning-heavy steps stay on the larger model, but with less context to chew through because the upstream modules have already done the parsing and retrieval.
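The routing rule in that paragraph is simple enough to write down. This sketch assumes you have already run the ceiling test for each subtask on each model and recorded a pass or fail. The subtask and model names are hypothetical.

```python
from typing import Dict, List, Optional


def cheapest_passing_model(models_by_price: List[str],
                           passed: Dict[str, bool]) -> Optional[str]:
    """First model in cheapest-first order that passed the ceiling test."""
    for model in models_by_price:
        if passed.get(model, False):  # untested models count as failing
            return model
    return None


def assign_models(models_by_price: List[str],
                  results: Dict[str, Dict[str, bool]]) -> Dict[str, Optional[str]]:
    """Map each subtask to the cheapest model that passed for that subtask."""
    return {
        subtask: cheapest_passing_model(models_by_price, passed)
        for subtask, passed in results.items()
    }


# Hypothetical ceiling-test outcomes per subtask and model.
models = ["small", "medium", "frontier"]  # ordered cheapest first
outcomes = {
    "classify": {"small": True, "medium": True, "frontier": True},
    "normalize": {"small": False, "medium": True, "frontier": True},
    "reason": {"small": False, "medium": False, "frontier": True},
}
routing = assign_models(models, outcomes)
```

A `None` result means no tested model passed for that subtask, which is the signal to split the subtask further or revisit its benchmark.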

What tokenmaxxing gets backwards

Tokenmaxxing rewards consumption. The actual goal is more useful work per dollar.

Some businesses can subsidize their way to PMF. Uber and DoorDash did it: more users meant denser routing and shorter wait times, and the economics improved with scale. WeWork tried it and went bankrupt because scale never fixed the underlying math.

LLM spend is closer to WeWork. Subsidizing token costs to grow likely isn't buying you a flywheel; it's buying you time to raise prices, cut costs, or run out of money.

Build the infrastructure early. It costs almost nothing while you're small, and it's there the moment you need it. Not fewer tokens for their own sake, but a system where every token earns its place.

That's real tokenmaxxing: getting the most out of every token.