I burned through more Cursor credits than I expected in my first month of using it seriously. Not because I was using it wrong, but because I did not understand what I was actually paying for. Once I did, the same sessions started costing a fraction of what they used to. Nothing changed except how I structured the work before the agent started.
What is the actual currency of an AI agent?
Tokens. Every interaction with an AI model is priced in tokens. A token is roughly 0.75 words in English. When you send a message, those are input tokens. When the model replies, those are output tokens. Everything the agent reads to form its context (your files, your rules, prior conversation) counts as input. Everything it generates (code, plans, explanations) counts as output.
Input and output are not priced the same, and the gap between them is where most of the cost lives.
Why does output cost so much more than input?
Output tokens are generated one at a time, each requiring its own forward pass through the model. Input tokens are processed in parallel in a single pass. That asymmetry is why output tokens typically cost 3 to 5 times more than input tokens on most frontier models.
Asking Cursor to explain how a module works costs much less than asking it to rewrite that module. The agent has to read to understand (cheap), then generate code to respond (expensive). Reading is cheaper than writing, for the same technical reason it is faster for a human to read a page than to write one.
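To make that asymmetry concrete, here is a back-of-the-envelope sketch. The prices are illustrative placeholders in the 3-to-5x range described above, not quoted rates for any specific model, and the token counts are invented:

```typescript
// Rough session cost. Prices are illustrative placeholders, not quoted rates.
const INPUT_PRICE_PER_MTOK = 3;   // assumed dollars per million input tokens
const OUTPUT_PRICE_PER_MTOK = 15; // assumed dollars per million output tokens

function sessionCost(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1_000_000) * INPUT_PRICE_PER_MTOK +
    (outputTokens / 1_000_000) * OUTPUT_PRICE_PER_MTOK
  );
}

// "Explain this module": lots of reading, a short answer.
console.log(sessionCost(40_000, 2_000).toFixed(2)); // ~$0.15

// "Rewrite this module": the same reading, plus a lot of generated code.
console.log(sessionCost(40_000, 20_000).toFixed(2)); // ~$0.42
```

The reading is identical in both cases; the bill diverges entirely on the output side.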
That gap shapes every decision I make about how I use the agent.
What actually drives the cost of an agentic loop?
A single prompt-response pair is not how agents work. In Cursor's Agent mode, when you give it a non-trivial task, it runs a loop:
- Reads your codebase to gather context (input tokens)
- Plans what to do (output tokens)
- Writes changes across multiple files (output tokens)
- Reads the result and checks for errors (input tokens)
- Corrects and continues until it is satisfied (output tokens)
Both meters run for the full duration of that loop. The longer the loop, the higher the bill. A poorly specified task that sends the agent searching across ten files before writing a single line will cost significantly more than a well-specified task with a clear plan already in place.
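A rough sketch of how that plays out over a loop, using the same illustrative prices as above. The per-step token counts are invented for the comparison, not measurements from Cursor:

```typescript
// Each loop step reads some context (input tokens) and generates something (output tokens).
// Token counts are invented for illustration; prices are placeholders, not quoted rates.
type Step = { input: number; output: number };

const INPUT_PRICE = 3 / 1_000_000;   // assumed dollars per input token
const OUTPUT_PRICE = 15 / 1_000_000; // assumed dollars per output token

const loopCost = (steps: Step[]): number =>
  steps.reduce((sum, s) => sum + s.input * INPUT_PRICE + s.output * OUTPUT_PRICE, 0);

// Well-specified task: read the two files named in the task, then write the change.
const focused: Step[] = [
  { input: 8_000, output: 1_000 },
  { input: 0, output: 6_000 },
];

// Vague task: broad search, speculative edits, then a correction round.
const exploratory: Step[] = [
  { input: 60_000, output: 2_000 },
  { input: 0, output: 8_000 },
  { input: 20_000, output: 7_000 },
];

console.log(loopCost(focused).toFixed(2));     // roughly $0.13
console.log(loopCost(exploratory).toFixed(2)); // roughly $0.50
```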
The model choice compounds this. Claude Sonnet and GPT-4o-mini handle mechanical work well at a fraction of the cost of Claude Opus or GPT-4o. Using a frontier model for every task, including renaming a variable or reformatting a config file, is the equivalent of hiring a specialist to do your grocery shopping.
What does vibe coding actually cost you?
Vibe coding is when you open an agent, describe something loosely, and let it run until something usable comes out. It can feel productive. The economics are closer to a slot machine.
Sometimes the agent figures out what you meant, finds the right files, writes clean code, and exits the loop in two minutes. Sometimes it spends eight minutes exploring the wrong parts of the codebase, writes code based on a wrong assumption, and you spend another five minutes correcting it. You cannot tell which session you are in at the start.
The problem is not the agent. The problem is that ambiguity is expensive. When the task is vague, the agent covers ground broadly: more file reads, more speculative output, longer loops. When the output is wrong, you iterate: more corrections, more output tokens. The cost scales with the uncertainty.
Vibe engineering is different. You do the ambiguity resolution yourself, before the agent starts. You know what you want, you know which files matter, you have a plan. The agent executes a clear task. The loop is short and the cost is predictable.
The distinction is not about using less AI. It is about spending the expensive part (output tokens) on executing something correct rather than on discovering that your assumption was wrong.
What are the practical changes that actually reduce costs?
Write context files and keep them current. I keep an ARCHITECTURE.md and a DECISIONS.md in every project I use seriously with an agent. When I start a new Cursor session, the agent reads these instead of re-discovering what I already know by exploring the codebase. Every session that used to start with the agent searching for orientation now skips that entire step.
The compounding effect is real. With a clear ARCHITECTURE.md, the agent reaches the right files on the first try. Without one, it reads five files to find the one it actually needed, and I pay for all five.
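What the file contains matters less than that it answers the questions the agent would otherwise answer by searching: what the stack is, where things live, which conventions hold. A minimal sketch of the kind of ARCHITECTURE.md I mean; the stack and layout here are invented, and yours will differ:

```markdown
# Architecture

## Stack
- TypeScript, React, Node API, Postgres

## Where things live
- `lib/hooks/`: shared React hooks (`useToast.ts` is the canonical pattern)
- `lib/services/`: business logic, one file per domain
- `components/`: presentational components only, no data access
- `lib/logger`: the only sanctioned way to log

## Conventions
- Components never talk to the database directly; they go through `lib/services/`
- New hooks follow the structure of an existing hook before inventing a new shape
```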
Create a todos.md before asking the agent to write code. I stopped going from requirement to implementation in one step. I ask for a plan first, read it, confirm it makes sense, then ask the agent to execute against it.
A wrong plan discovered at the planning stage costs almost nothing. Wrong code discovered after three loops of corrections costs significantly more. Output tokens spent executing a correct plan are cheap compared to output tokens spent discovering that the assumption was wrong.
The planning conversation is cheap. The correction loop is not.
What does a todos.md actually need to contain?
The quality of the todos file determines whether a standard model can execute it cleanly or whether it needs to guess. A standard model is not limited by intelligence on execution tasks. It is limited by the specificity of the instructions it receives. A well-formed todos removes every decision the model would otherwise make on its own.
A todos that lets a standard model execute without guessing answers four questions per task: what to build, where to put it, what pattern to follow, and what the acceptance criteria are.
In practice, that looks like this:
# Todos: Add Disclosure Hook
## Context
- Pattern reference: `lib/hooks/useToast.ts`
- Types live in: `lib/hooks/types.ts`
- Export barrel: `lib/hooks/index.ts`
---
## Tasks
### 1. Create `useDisclosure` hook
- File: `lib/hooks/useDisclosure.ts`
- Follow the exact same structure as `lib/hooks/useToast.ts`
- State: `isOpen: boolean`, default `false`
- Actions: `open()`, `close()`, `toggle()`
- Return shape: `{ isOpen, open, close, toggle }`
### 2. Add type definition
- File: `lib/hooks/types.ts`
- Add `UseDisclosureReturn` interface matching the return shape above
- Use it as the return type annotation in the hook
### 3. Export from barrel
- File: `lib/hooks/index.ts`
- Add: `export { useDisclosure } from './useDisclosure'`
- Place it alphabetically between existing exports
### 4. Wire into Modal component
- File: `components/modal/Modal.tsx`
- Replace the local `useState(false)` with `useDisclosure()`
- No other changes to the component logic or JSX
Notice what is in there: file paths, a named reference file to copy the pattern from, exact field names, the exact return shape, where in the barrel file to insert the export, and a scope constraint on the last task. The model has zero decisions to make.
The things that force a standard model to guess: "add a hook for modal state" with no reference file (the model invents a pattern that may not match yours), "update the types" without a file path (it may create a new file or modify the wrong one), "no other changes" left unstated (it may refactor adjacent code it considers related).
A useful mental test: could a competent but new team member execute this todos in one pass without asking a single question? If they would ask "which file?", "which pattern?", or "how far should I go?", those are gaps the model fills with a guess. A guess in the planning stage costs nothing to catch. A guess that surfaces after three loops of generated code costs significantly more.
Use project and global rules for recurring instructions. I used to repeat the same instructions at the start of every session: always check the existing pattern before introducing a new one, use the logger from `/lib/logger`, not `console.log`, never modify shared types without updating the corresponding tests. Now those live in `.cursor/rules/` files. The agent reads them automatically.
A rule written once is context that is always present and never forgotten. The sessions where I used to forget to type it and the agent did something inconsistent no longer happen.
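A sketch of what one of those rule files can look like. The frontmatter fields vary with how you want the rule applied (and with Cursor versions), so treat the header as illustrative; the body is the same instructions I used to retype:

```markdown
---
description: Project conventions the agent should always follow
alwaysApply: true
---

- Check for an existing pattern before introducing a new one. If a similar
  hook, service, or component already exists, follow its structure.
- Use the logger from `/lib/logger`. Never use `console.log`.
- Never modify shared types without updating the corresponding tests in the
  same change.
```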
Keep sessions focused and context windows lean. The more files open, the more context the agent processes as background input. I close irrelevant files before starting a session, scope each session to one task, and start fresh rather than letting a session sprawl into adjacent work. A session that has accumulated unrelated context is almost always more expensive than a new one.
Match the model to the task. I use a fast, cheaper model for mechanical work: reformatting, renaming, small refactors, generating boilerplate. I use the stronger model for decisions that actually need it: complex debugging, architecture questions, writing logic with subtle constraints. In Cursor, you can switch models per request. Most people do not, including me for a while. Switching consistently is one of the higher-leverage changes I made.
When does a thinking model actually earn its cost?
Thinking models are a specific category worth understanding separately, because they are easy to misapply and the cost difference is not trivial.
A thinking model is one that reasons internally before producing a response. In Anthropic's API, this is called extended thinking. The model is given a token budget to work through the problem before it writes the reply. Those internal reasoning steps are called thinking tokens, and they count as output tokens, billed at the same rate as regular output.
Claude Sonnet 4.6 (standard and thinking-capable) is priced at $3 per million input tokens and $15 per million output tokens. When extended thinking is active, the thinking tokens stack on top of the response tokens, both billed at $15/MTok. A complex query that burns 45,000 thinking tokens to produce a 3,000-token reply costs around $0.72 in output alone. The same query without thinking, producing only the 3,000-token reply, costs $0.045. Extended thinking at full depth can multiply the output cost of a single query several times over, depending on how long the model reasons.
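The same arithmetic written out as a check; the token counts are the ones from the example above, not a general rule:

```typescript
// Output-side cost of one query at $15 per million output tokens.
const OUTPUT_PRICE_PER_TOKEN = 15 / 1_000_000;

const withThinking = (45_000 + 3_000) * OUTPUT_PRICE_PER_TOKEN; // 48k output tokens
const withoutThinking = 3_000 * OUTPUT_PRICE_PER_TOKEN;         // 3k output tokens

console.log(withThinking.toFixed(3));    // 0.720
console.log(withoutThinking.toFixed(3)); // 0.045
console.log(withThinking / withoutThinking); // 16: the thinking budget dominates the bill
```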
That cost is justified when the problem requires multi-step reasoning where the path to the answer is genuinely non-obvious: debugging a subtle race condition, designing a system with competing constraints, working through a logic problem with many interdependencies. These are cases where the model needs to hold many things simultaneously and reason before committing to output. The thinking budget changes the quality of the answer in a way that matters.
It is not justified for tasks where the constraints are already fully specified. Writing prose with a defined style guide, generating boilerplate, following a clear pattern to add a new endpoint. These are generation tasks, not reasoning tasks. A capable standard model with good instructions will produce output that is nearly indistinguishable from a thinking model at a fraction of the cost. The thinking overhead gets spent on things the model already knows how to do.
The practical question to ask before reaching for a thinking model: is the difficulty here in figuring out what the correct answer is, or in producing the correct answer once the path is clear? If it is the former, the thinking budget is working. If it is the latter, you are paying for reasoning you did not need.
What is the actual pattern underneath all of this?
AI agent cost is a function of token consumption. Token consumption is a function of how clearly specified the work is before the loop starts. The agent is not slower or more expensive when the work is hard. It is slower and more expensive when the work is unclear.
Everything I do to reduce ambiguity before the loop starts reduces cost: context files, clear task definitions, planning before coding, rules that carry recurring instructions automatically, sessions that stay on one thing. The common thread is doing the thinking before the tokens start running.
The agents are getting better at handling ambiguity. The cost structure that makes ambiguity expensive is not improving at the same pace. Knowing the economics and working with them is just cheaper.
