Deep Dive 01
The Physics and Economics of Tokens
To understand what a token costs, start with what it has to become. Text becomes tokens. Tokens become tensors. Tensors move through memory and silicon. Silicon draws power and produces heat. Datacenters cool the heat, schedule the work, and turn all of that physical activity into a price.
The word “token” makes AI feel abstract, almost weightless. But every token has a path: language gets broken into pieces, moves through memory and silicon, waits on scheduling, and eventually shows up as a few more words on screen.
Chapter I
The strange little unit
Start with the awkward fact: a token is not quite a word. Sometimes it is a full word. Sometimes it is half a word. Sometimes it is punctuation, whitespace, or a fragment that looks meaningless until it is placed back inside a sentence. The tokenizer is where human language is cut into pieces a model can handle.
That awkwardness is the point. The token sits at the seam between language and machinery. On one side, it is a bit of text. On the other side, it is compute, memory movement, heat, latency, electricity, and eventually a number on an API bill.
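A toy example makes the cutting visible. This is a minimal sketch, not how any production tokenizer works: it uses greedy longest-match against a tiny hypothetical vocabulary (`TOY_VOCAB`), where real tokenizers learn vocabularies of tens of thousands of pieces from data. The names `TOY_VOCAB` and `tokenize` are illustrative only.

```python
# Toy greedy longest-match tokenizer. TOY_VOCAB is hypothetical and tiny;
# it exists only to show that pieces need not line up with words.
TOY_VOCAB = {"token", "ization", "un", "believ", "able", " ", "!", "s"}


def tokenize(text: str, vocab: set[str] = TOY_VOCAB) -> list[str]:
    """Split text into the longest vocabulary pieces, left to right."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at i, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # Nothing matched: emit the character as its own piece.
            pieces.append(text[i])
            i += 1
    return pieces


print(tokenize("unbelievable tokenization!"))
# ['un', 'believ', 'able', ' ', 'token', 'ization', '!']
```

Two words and one punctuation mark become seven pieces. Joining the pieces gives back the original text, which is the one property this sketch shares with the real thing.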
“Cost per token” is a useful meter, but it is a thin story. The same token count can hide very different work: reading a long prompt, writing a short answer, reusing a cached prefix, or spending extra hidden computation before the visible answer appears.
A better starting question is concrete: what has to happen in the world for one more token to appear on screen?
Seen that way, one token is several things at once:
- The piece of text the tokenizer hands to the model.
- A request for processors and memory to do another slice of work.
- A priced trace of infrastructure turning context into prediction.
- A small footprint of energy, bandwidth, and time inside a machine.
Chapter II
The machine behind a message
“The cloud” is a tidy phrase for a messy place. A prompt does not go to an abstract intelligence floating somewhere. It moves through routing software, lands on scarce accelerator capacity, gets packed with other requests when possible, turns into tensors, travels through memory, and passes through layer after layer of the model.
The chain is long: silicon, servers, datacenter, model architecture, serving system, request shape, product experience. Each layer changes what the token costs before it reaches the user. Some layers affect raw energy. Some affect latency. Some affect how much expensive hardware sits idle waiting for demand.
The conversation can still feel magical. It is just useful to remember what the magic is sitting on: memory bandwidth, chip power, networking, queueing, batching, cooling, and hardware that costs money even when it is waiting.
- Silicon: accelerators define the raw ceiling for computation, memory, and power.
- Datacenter: racks, cooling, networking, and power delivery make the chip usable at scale.
- Serving: schedulers, caches, batching, and replicas decide how requests share capacity.
- Request: prompt length, context, tools, and output length decide how much work one interaction creates.
- Experience: latency, quality, and answer shape decide whether the work felt worthwhile.
Chapter III
Training before serving
Before a model can answer a question, it has to become the kind of system that can predict language at all. Training is the expensive, messy, experimental stage where the model adjusts its parameters over and over again until patterns in data become capability.
The economics feel different from everyday usage. Training is an upfront bet. GPU-hours, networking, storage, orchestration, evaluation, data work, and research time are spent before the final capability is known. A failed run is not a failed API call. It can be a very expensive way to learn what did not work.
Inference comes later. It is the repeated act of spending that capability on real prompts. Training creates the machine. Inference is the meter running every time the machine is used.
- Training: capability is created before anyone knows exactly how useful it will be.
- Inference: capability is spent again and again through prompts, answers, cache hits, and retries.
A breakthrough matters economically only if it can be served reliably and repeatedly.
Chapter IV
Why tokens are not equal
Once training and inference are separated, the billing categories stop looking arbitrary. Input tokens, output tokens, cached tokens, and reasoning tokens point to different work inside the serving system.
Input tokens ask the model to read. Output tokens ask it to write. Cached tokens reuse something the system has already processed. Reasoning tokens represent hidden work done before the visible response is shown.
The mechanical split is prefill versus decode. During prefill, the model reads the prompt and builds the state needed to answer. During decode, it generates new tokens, usually one after another. Long prompts load the prefill side, where the whole prompt can be processed together. Long answers load the decode side, where each new token waits on the one before it. Pricing reflects that difference.
- Input tokens: prompt, instructions, retrieved context, and conversation history.
- Output tokens: the visible answer. Often latency-sensitive because someone is waiting.
- Cached tokens: repeated prefixes that can be reused instead of recomputed.
- Reasoning tokens: intermediate computation spent before the final answer appears.
The token bill is a shadow cast by the serving system.
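The prefill and decode split can be sketched by counting work instead of measuring it. The function below is an illustration under stated simplifications, none of which come from a specific serving system: prefill handles the whole prompt in one pass, decode emits one token per pass, and each decode pass attends over every position already in the context. The name `serving_work` is hypothetical.

```python
def serving_work(prompt_tokens: int, output_tokens: int) -> dict[str, int]:
    """Count model passes and context reads for one request.

    Simplified on purpose: one prefill pass for the whole prompt, one
    decode pass per output token, and every decode pass reads all
    positions that are already in the context.
    """
    prefill_passes = 1 if prompt_tokens > 0 else 0
    decode_passes = output_tokens
    # Decode pass k (0-based) reads the prompt plus the k tokens generated so far.
    context_reads = sum(prompt_tokens + k for k in range(output_tokens))
    return {
        "prefill_passes": prefill_passes,
        "decode_passes": decode_passes,
        "context_reads": context_reads,
    }


# Same total token count, very different shape of work.
print(serving_work(prompt_tokens=4000, output_tokens=50))
print(serving_work(prompt_tokens=50, output_tokens=4000))
```

Both requests touch 4,050 tokens. The long-prompt request needs 50 sequential decode passes; the long-answer request needs 4,000, and each of those passes rereads a growing context. That asymmetry is one reason input and output tokens are billed differently.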
Chapter V
Why prices have shape
A price sheet looks like a menu at first. Read it with the serving system in mind and a few patterns start to show up: fresh input, cached input, output, model size, and whether the work needs to happen while someone is waiting.
Individual prices move around. The categories stick around. Fresh input has to be read. Cached input avoids recomputing a repeated prefix. Output is generated under latency pressure. Larger models occupy scarcer capacity. Batch jobs give the scheduler more room to pack work into the empty spaces.
A map of pressure, not a price sheet.
This is why price sheets tend to reward repetition, shorter outputs, smaller models, and work that can wait. The economics are quietly describing the machinery.
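The shape of a price sheet can be written as a small function. This is a sketch with made-up numbers: the rates in `Rates` are hypothetical, not quotes from any provider, and the flat `batch_discount` is an assumption about how work that can wait might be priced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical prices in dollars per million tokens (not real quotes)."""
    fresh_input: float = 2.00
    cached_input: float = 0.50
    output: float = 8.00
    batch_discount: float = 0.50  # fraction taken off when the work can wait


def price(fresh_in: int, cached_in: int, out: int,
          rates: Rates = Rates(), batch: bool = False) -> float:
    """Price one request in dollars from its token counts."""
    dollars = (
        fresh_in * rates.fresh_input
        + cached_in * rates.cached_input
        + out * rates.output
    ) / 1_000_000
    if batch:
        dollars *= 1 - rates.batch_discount
    return dollars


# Three requests with the same 10,000-token total.
print(f"mostly fresh input : ${price(9000, 0, 1000):.4f}")
print(f"mostly cached input: ${price(1000, 8000, 1000):.4f}")
print(f"mostly output      : ${price(1000, 0, 9000):.4f}")
print(f"fresh input, batch : ${price(9000, 0, 1000, batch=True):.4f}")
```

The token total never changes, but the bill does. Whatever the real rates are on a given day, the ordering is the pattern to look for: repetition and patience are cheaper, writing is dearer than reading.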
Chapter VI
The physics bill
Text is clean. Datacenters are not. Electrons move. Memory is read. Heat is produced. Cooling systems respond. A modern AI accelerator can draw a few hundred watts on its own, and the building around it adds more load for cooling, power delivery, and everything else needed to keep the rack alive.
A rough calculation makes the abstraction tangible. A 350 W GPU, roughly an H100-class PCIe accelerator, is 0.35 kW of IT load. Add 15% facility overhead (a PUE of 1.15) and it becomes 0.4025 kW. Over a day, that is 9.66 kWh. At $0.10 per kWh, facility electricity for that one GPU is about $0.97 per day. At $0.20 per kWh, about $1.93 per day.
That can look small next to cloud GPU rental. It should. The rental price is not just electricity. It includes scarce hardware, depreciation, datacenter capacity, networking, operations, software, reliability, and the cost of keeping enough capacity ready for fast answers. Physics sets the floor. Utilization, markets, and reliability shape the price above it.
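The electricity arithmetic above fits in a few lines, which makes it easy to rerun with other inputs. The function name `daily_electricity_cost` is illustrative; the numbers passed in are the ones from the text.

```python
def daily_electricity_cost(gpu_watts: float, overhead_fraction: float,
                           dollars_per_kwh: float, hours: float = 24.0) -> float:
    """Facility electricity cost for one GPU running flat out for `hours`."""
    it_kw = gpu_watts / 1000                       # chip draw in kilowatts
    facility_kw = it_kw * (1 + overhead_fraction)  # add cooling, power delivery
    return facility_kw * hours * dollars_per_kwh


for rate in (0.10, 0.20):
    cost = daily_electricity_cost(350, 0.15, rate)
    print(f"${rate:.2f}/kWh -> ${cost:.2f} per day")
# $0.10/kWh -> $0.97 per day
# $0.20/kWh -> $1.93 per day
```

The sketch assumes the GPU draws its full 350 W all day, so it is an upper bound on electricity for that chip, and still only the floor of the price.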
How the floor comes down
The physical inputs underneath a token are power price, chip cost, throughput, datacenter overhead, utilization, and serving waste. Starting from a baseline mix of chips, power, facilities, and utilization, the biggest dial is utilization.
Power is only part of the physics bill. Time matters too. A GPU that is waiting, poorly packed, or reserved for a latency spike still has an economic presence. Idle time can be expensive even when no token is being generated.
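The cost of idle time can be sketched by spreading a fixed hourly cost over the tokens actually produced. Everything numeric here is an assumption for illustration: the $2.50 hourly cost and the 1,000 tokens per second peak throughput are hypothetical, and `cost_per_million_tokens` is not a standard formula from any provider.

```python
def cost_per_million_tokens(hourly_cost: float, peak_tokens_per_second: float,
                            utilization: float) -> float:
    """Spread a fixed hourly hardware cost over the tokens actually served.

    `utilization` is the fraction of peak throughput doing useful work,
    so it must be greater than 0 and at most 1.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_second * 3600 * utilization
    return hourly_cost / tokens_per_hour * 1_000_000


# Hypothetical hardware: $2.50 per hour, 1,000 tokens/s at full load.
for utilization in (1.0, 0.5, 0.25):
    cost = cost_per_million_tokens(2.50, 1000, utilization)
    print(f"{utilization:>4.0%} utilized -> ${cost:.2f} per million tokens")
```

The hardware bill is the same in every row. Halving utilization doubles the cost carried by each token, without a single extra watt being spent on useful work.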
Chapter VII
A small physical-economic model
The hidden pieces are easier to see once they are written down. For one interaction, the cost is roughly:
Cost of an AI interaction =
physical work to read the input
+ physical work to generate the output
+ hidden reasoning work
+ retrieval and tool work
+ retries and safety checks
+ idle capacity and latency overhead
Each line behaves differently. Reading a prompt is not the same as generating an answer. A tool call is not the same as a cached prefix. A retry is not the same as the first request. A tight latency promise can force extra capacity to sit ready even when average demand looks tame.
The same sum, written in billing terms:
Cost per request =
fresh input tokens
+ cached input tokens
+ output tokens
+ reasoning tokens
+ retrieval, tools, retries, safety, logging
+ idle capacity and latency overhead
Those terms fall into four groups of cost drivers:
- Input tokens, retrieved context, cache misses.
- Output length, latency pressure, verbosity.
- Tools, retrieval, safety checks, logging, retries.
- Idle capacity, queueing loss, reserved latency buffers.
Move the work around
Set the dollars aside and look only at the shape of the work: longer answers, cache misses, retries, batchable jobs, and idle capacity. From the baseline setup, the most sensitive dial is output discipline.
This is not a perfect accounting system. It is a way to ask better questions. When a request gets expensive, what grew? The prompt? The retrieved context? The answer length? The hidden reasoning? The number of retries after a bad first answer?
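The question "what grew?" can be asked in code. The sketch below assumes each component of a request has already been converted into one common cost unit; the component names mirror the sum above, while the numbers and the function names `total_cost` and `what_grew` are hypothetical.

```python
COMPONENTS = (
    "fresh_input", "cached_input", "output",
    "reasoning", "tools_and_retries", "idle_overhead",
)


def total_cost(request: dict[str, float]) -> float:
    """Add up every component; a missing component counts as zero."""
    return sum(request.get(name, 0.0) for name in COMPONENTS)


def what_grew(baseline: dict[str, float],
              current: dict[str, float]) -> list[tuple[str, float]]:
    """List each component's change from baseline, largest growth first."""
    deltas = [
        (name, current.get(name, 0.0) - baseline.get(name, 0.0))
        for name in COMPONENTS
    ]
    return sorted(deltas, key=lambda pair: pair[1], reverse=True)


# Hypothetical cost units, not dollars.
baseline = {"fresh_input": 4.0, "cached_input": 1.0, "output": 6.0,
            "reasoning": 2.0, "tools_and_retries": 1.0, "idle_overhead": 2.0}
expensive = dict(baseline, output=7.0, tools_and_retries=9.0)

print(total_cost(baseline), "->", total_cost(expensive))
print(what_grew(baseline, expensive)[0])
# ('tools_and_retries', 8.0)
```

In this made-up case the answer got slightly longer, but the real growth came from tools and retries. The total alone would not have said so.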
Chapter VIII
Where work disappears
Once tokens are understood as work, waste becomes easier to see. Not because the product breaks. Usually it does not. The answer still arrives. The page still looks fine. The waste hides in the extra work the machine did along the way.
A long system prompt. A huge conversation history. Retrieval that brings back six chunks when two would do. A verbose answer style. A tool call that was not needed. A model that is too powerful for the task. A user who has to ask again because the first answer missed the point. None of these feels dramatic in isolation. Together they become the economics of the product.
- Prompt bloat: the model keeps rereading instructions, history, or context that no longer matters.
- Retrieval bloat: the system sends a pile of context instead of a few sharp pieces.
- Verbose defaults: the answer keeps going after the useful part is over.
- Unnecessary reasoning: the model thinks hard about work that should be simple.
- Unnecessary tools: the system calls out to tools when stored state or deterministic code would do.
- Retry loops: the first answer misses, so the real task quietly becomes two or three requests.
- Idle capacity: hardware waits for spikes, latency promises, or demand that arrives in the wrong shape.
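Prompt bloat from conversation history is one of the easier leaks to close in ordinary software. A minimal sketch, assuming the newest messages matter most: keep the most recent messages that fit a token budget. The default `count_tokens` here just counts whitespace-separated words, which is a stand-in and not a real tokenizer count; `trim_history` is a hypothetical helper.

```python
from typing import Callable


def trim_history(messages: list[str], budget: int,
                 count_tokens: Callable[[str], int] = lambda m: len(m.split())
                 ) -> list[str]:
    """Keep the newest messages whose combined token count fits the budget."""
    kept = []
    used = 0
    # Walk from newest to oldest and stop at the first message that does not fit.
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a b c", "d e", "f g h i", "j"]
print(trim_history(history, budget=5))
# ['f g h i', 'j']
```

Dropping old turns outright is the bluntest option. Summarizing them or storing state outside the prompt are alternatives, but even this version stops the model from rereading everything on every turn.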
Chapter IX
What better economics means
“Better token economics” sounds like it should mean cheaper tokens. Sometimes it does. More often, it means doing less unnecessary work. The same model reads less. A smaller model handles the easy part. The product remembers instead of re-sending history. The first answer is good enough that the user does not need to ask again.
It also means putting work in the right lane. Some work has to happen while a person is waiting. Some can happen in the background. Some can be cached. Some belongs in ordinary software. Some should be deleted.
- Use smaller models when the question is simple. Not every task deserves frontier-scale cognition.
- Cache repeated structure. Stable instructions, schemas, and repeated workflows should not be recomputed forever.
- Make outputs intentional. A short correct answer can be better than a long impressive one.
- Separate real-time from background work. Latency is a product feature, and features have costs.
- Let software do software-shaped work. Validators, state machines, retrieval filters, and memory stores can remove a surprising amount of inference.
- Measure the whole task, not just one call. A cheap call that causes retries may be more expensive than a better first answer.
The pattern is simple enough to remember: do not spend expensive cognition where structure would do. Save the heavy machinery for ambiguity, judgment, language, and synthesis.
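The point about measuring the whole task, not just one call, can be made concrete with a little probability. This sketch assumes attempts are independent and repeat until one succeeds, so the expected number of attempts is 1 divided by the success probability. The per-call costs and success rates are hypothetical.

```python
def expected_task_cost(cost_per_attempt: float, success_probability: float) -> float:
    """Expected cost of a task when attempts repeat until one succeeds.

    Assumes independent attempts with a fixed success probability, so the
    expected number of attempts is 1 / success_probability.
    """
    if not 0 < success_probability <= 1:
        raise ValueError("success_probability must be in (0, 1]")
    return cost_per_attempt / success_probability


# Hypothetical: a cheap call that often misses vs. a pricier call that rarely does.
cheap = expected_task_cost(cost_per_attempt=0.002, success_probability=0.40)
better = expected_task_cost(cost_per_attempt=0.004, success_probability=0.95)
print(f"cheap call : ${cheap:.4f} per finished task")
print(f"better call: ${better:.4f} per finished task")
```

With these numbers the call that costs twice as much per attempt ends up cheaper per finished task. The sketch also leaves out the user's time spent asking again, which usually tilts the comparison further.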
Chapter X
Questions worth carrying forward
The topic keeps opening outward. Tokens sit at the crossing point between computer architecture, distributed systems, energy, product design, and markets. A few questions seem especially worth carrying forward.
- How much of AI cost improvement will come from better chips versus better serving systems?
- How much intelligence should live in the model, and how much should live in surrounding software?
- When do longer context windows create value, and when do they just hide retrieval mistakes?
- How should products expose speed, quality, reasoning depth, and cost to users without making the experience feel mechanical?
- What happens when tokens become cheap enough that the bottleneck shifts from compute to attention, trust, or distribution?
Understanding the machinery does not make AI less interesting. It makes the interesting parts easier to see: where language becomes computation, where computation becomes heat, where heat becomes infrastructure, and where infrastructure becomes a product people use without thinking about any of it.