I Spent $500 in a Week on AI-Assisted Coding. Here's What I Learned About Not Doing That.
Large contexts, MAX mode, Opus 4.6 Thinking — the most powerful AI coding tools are also the most expensive. After a brutal invoice, I figured out how to get 90% of the performance at 20% of the cost.
The invoice that changed my behavior
I opened my billing dashboard on a Monday morning and stared at the number. Five hundred and twelve dollars. One week. Not a month — a week.
I hadn't done anything unusual. I was building a feature, debugging an integration, refactoring some tests. Normal engineering work. But I'd been doing it with every setting cranked to maximum: Opus 4.6 Thinking in MAX mode, full repository context, long multi-turn conversations that accumulated thousands of tokens per message.
Each individual request felt harmless. A quick "refactor this module" here, a "debug this error with full context" there. But at the token rates for frontier thinking models, "harmless" adds up fast. I was essentially running a small GPU cluster every time I asked a question.
That invoice was the beginning of a very deliberate process to understand where the money was going — and more importantly, where it didn't need to go.
Where the money actually goes
AI coding costs aren't evenly distributed. After tracking my usage for two weeks, the breakdown was clear:
Context size is the multiplier. Every request sends your conversation history, attached files, and system prompts to the model. A fresh conversation with a small question might use 2,000 tokens. A long conversation with 15 files attached and a multi-turn debugging session can easily hit 100,000+ tokens per request. That's a 50x cost difference for a single message.
Thinking models compound the problem. Models like Opus 4.6 Thinking don't just read your input — they generate an internal chain-of-thought before producing the visible response. That reasoning chain can be 3-5x the length of the final answer, and you're paying for every token of it. A response that looks like 500 tokens might have cost 3,000 tokens behind the scenes.
MAX mode is the premium tier. Running a thinking model in MAX mode removes the output cap and gives you the full reasoning depth. It's extraordinarily capable — and extraordinarily expensive. A single complex request in MAX mode can cost more than an entire day of normal usage.
Here's a rough mental model of the cost tiers:
| Configuration | Relative Cost | When It Shines |
|---|---|---|
| Fast model, small context | 1x | Quick questions, simple edits |
| Standard model, medium context | 5-10x | Feature implementation, code review |
| Thinking model, large context | 30-50x | Complex debugging, architecture decisions |
| Thinking model, MAX mode, full repo context | 100-200x | Multi-file refactors, deep analysis |
That bottom row is where my $500 went. I was using the 200x configuration for tasks that a 5x configuration would have handled just as well.
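To make the multipliers in the table concrete, here's a back-of-the-envelope estimator. The numbers in `RATES` are made-up placeholders, not any provider's actual pricing, and `thinking_multiplier` approximates the hidden chain-of-thought as a multiple of the visible output:

```python
# Back-of-the-envelope request cost estimator.
# RATES are hypothetical placeholders, not real provider pricing.
RATES = {  # dollars per 1M tokens: (input, output)
    "fast": (0.25, 1.25),
    "standard": (3.00, 15.00),
    "thinking": (15.00, 75.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int,
                 thinking_multiplier: float = 0.0) -> float:
    """Estimate one request's cost. thinking_multiplier models hidden
    chain-of-thought tokens as a multiple of the visible output."""
    in_rate, out_rate = RATES[model]
    hidden = output_tokens * thinking_multiplier  # billed like output
    return (input_tokens * in_rate + (output_tokens + hidden) * out_rate) / 1e6

# A small question vs. a max-context thinking request:
cheap = request_cost("fast", 2_000, 500)
heavy = request_cost("thinking", 100_000, 500, thinking_multiplier=4.0)
print(f"${cheap:.4f} vs ${heavy:.2f}")  # → $0.0011 vs $1.69
```

Plug in your own rates; the exact multiplier will vary, but the shape is the same: large context plus hidden reasoning tokens is what turns a fraction of a cent into real money per request.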
The cost-aware workflow
After the invoice shock, I developed a tiered approach. The core insight: match the model to the task, not the other way around. Using the most powerful model for everything is like taking a helicopter to the grocery store. It works, but you're paying for capabilities you don't need.
Tier 1: Fast model for mechanical tasks
Most of what we do with AI coding assistants is mechanical. Renaming a variable across a file. Generating a type from a JSON sample. Writing a unit test for a pure function. Wrapping a risky call in a try/catch block.
These tasks don't require reasoning. They require pattern matching and code generation — exactly what fast, cheap models excel at. I switched to using the fastest available model for anything that fits this description:
- Boilerplate generation
- Simple refactors (rename, extract function, inline variable)
- Writing tests for straightforward functions
- Generating types, interfaces, or schemas
- Formatting or restructuring code
- Documentation and comments
This alone cut my daily cost by 60%. The output quality for these tasks is virtually identical between a fast model and a frontier thinking model.
Tier 2: Standard model for feature work
When I'm implementing a feature — writing new logic, integrating an API, building a component — I use a standard-tier model without thinking mode. It's smart enough to understand intent, generate idiomatic code, and handle moderate complexity.
The key discipline here is context management. Instead of attaching my entire codebase and asking "build this feature," I attach only the files that are directly relevant:
- The file I'm editing
- The types/interfaces it depends on
- One or two examples of similar patterns in the codebase
Three to five files, not thirty. This keeps the context window small and the cost predictable. It also produces better results — models perform worse with too much irrelevant context, not better.
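The "three to five files" rule can even be encoded as a small guardrail. This is just a sketch; the file paths are hypothetical and the helper isn't part of any real tool:

```python
# Sketch of deliberate context selection: attach only the file being
# edited, its direct dependencies, and a pattern example or two.
# Paths are hypothetical, for illustration only.
def build_context(editing: str, deps: list[str], examples: list[str],
                  max_files: int = 5) -> list[str]:
    files = [editing, *deps, *examples]
    if len(files) > max_files:
        raise ValueError("context too large; trim before sending")
    return files

context = build_context(
    "src/order-service.ts",
    deps=["src/types/order.ts"],
    examples=["src/payment-service.ts"],
)
print(context)  # three files, not thirty
```

The point isn't the helper itself; it's making "how many files am I about to send?" an explicit question instead of a default.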
Tier 3: Thinking model for hard problems
I reserve the expensive thinking models for genuinely hard problems — the ones where I need the model to reason, not just generate:
- Debugging a race condition across multiple services
- Designing the architecture for a new system component
- Understanding a complex error with a deep stack trace
- Reviewing critical code for subtle bugs
- Untangling a gnarly type error in a generic TypeScript function
These are the tasks where thinking models earn their cost. The extended chain-of-thought lets them consider edge cases, weigh trade-offs, and catch issues that standard models miss. But they represent maybe 10-15% of my daily work.
Tier 4: MAX mode — the nuclear option
MAX mode with full context gets used once or twice a week, tops. It's for moments when I'm genuinely stuck and need the model to analyze a large surface area of code with deep reasoning:
- A bug that spans five files and three abstraction layers
- A major refactor where the model needs to understand the entire module to suggest a safe approach
- Reviewing an entire PR for architectural issues
Before I reach for MAX mode, I ask myself: "Have I tried solving this with a cheaper model first?" If the answer is no, I start there. Most of the time, Tier 2 or 3 gets me to the answer.
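The four tiers above amount to a simple routing table: default to the cheap tier, escalate only for known-hard categories. The task labels and tier names here are my own, not any tool's API:

```python
# Illustrative routing table for the tiered workflow described above.
# Task categories and tier names are my own labels, not a real API.
TIER_FOR_TASK = {
    "boilerplate": "fast", "rename": "fast", "simple-tests": "fast",
    "feature": "standard", "api-integration": "standard",
    "cross-service-debug": "thinking", "architecture": "thinking",
    "multi-file-refactor": "max", "full-pr-review": "max",
}

def pick_tier(task: str) -> str:
    # Unknown tasks default to the cheapest tier; escalate deliberately,
    # never reflexively.
    return TIER_FOR_TASK.get(task, "fast")
```

The useful property is the default: anything you haven't consciously classified as hard lands on the cheap tier, which is the opposite of what I was doing before the invoice.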
Practical strategies that actually save money
Beyond the tiered model approach, a few habits made a significant difference:
Start fresh conversations frequently. Long conversations accumulate context. Every new message includes the entire conversation history. By message 20, you're sending a novel-length prompt for every request. I now start a new conversation every time I switch tasks — and sometimes mid-task when the conversation gets long.
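A quick calculation shows why this habit pays off. If each message adds roughly the same number of tokens (1,000 per message is an assumption for illustration), cumulative input grows quadratically with conversation length:

```python
# Why long conversations get expensive: each message re-sends the whole
# history, so total input tokens grow quadratically with message count.
def cumulative_input_tokens(messages: int, tokens_per_message: int) -> int:
    # Message k sends the k-1 previous messages plus its own text.
    return sum(k * tokens_per_message for k in range(1, messages + 1))

print(cumulative_input_tokens(5, 1_000))   # → 15000
print(cumulative_input_tokens(20, 1_000))  # → 210000
```

Four times as many messages costs fourteen times as many input tokens. Splitting that 20-message session into four fresh 5-message conversations would send 60,000 tokens instead of 210,000.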
Be specific in your prompts. "Fix this bug" with 10 files attached is expensive and slow. "The processOrder function in order-service.ts throws a null reference on line 47 when customer.address is undefined — add a guard clause" is cheap and fast. Specificity reduces the work the model needs to do, which reduces tokens, which reduces cost.
Use the AI for planning, then execute yourself. For complex features, I'll use a thinking model once to design the approach — which files to change, what patterns to follow, what edge cases to handle. Then I execute the plan using a fast model (or just my own hands). One expensive planning call replaces ten expensive implementation calls.
Read the code yourself first. This sounds obvious, but it's the habit I lost. When every answer is a prompt away, you stop reading the code. You ask the AI "what does this function do?" instead of spending two minutes reading it. Those two-minute questions, at thinking-model token rates, cost real money. And reading the code yourself builds understanding that no model can substitute.
Leverage cached and indexed context. Many AI coding tools maintain a local index of your codebase. Queries against the index are cheap or free. Use search, symbol lookup, and go-to-definition before attaching files manually. Let the tool find the relevant context instead of dumping everything into the prompt.
The 90/10 rule
After a month of deliberate cost management, my weekly spend dropped from $500 to around $80-100 — an 80% reduction. My productivity didn't noticeably change. If anything, it improved, because I was thinking more carefully about what I was asking and why.
The uncomfortable truth is that most AI-assisted coding doesn't need frontier models. It needs fast, cheap models applied to well-scoped tasks. The frontier models are genuinely transformative for the hard 10% of problems — the ones where you're stuck, confused, or making a decision with significant consequences. Using them for everything is not just expensive; it's a crutch that atrophies your own engineering judgment.
The best AI-assisted workflow I've found is one where I do the thinking about what to build and the AI helps me build it faster. When I reverse that — when I outsource the thinking to the AI and become a prompt jockey — both the quality and the cost go in the wrong direction.
A note on the economics
AI model pricing will continue to drop. What costs $500 today might cost $50 in a year. But the principle will remain: there will always be a hierarchy of model capabilities and costs, and the expensive tier will always be tempting. The discipline of matching the tool to the task — of not reflexively reaching for the most powerful option — is a skill that pays dividends regardless of the price per token.
And if you're expensing this to your company, you have an even stronger reason to be intentional. A team of ten engineers each burning $500/week on AI tools is $260,000 per year. That's a senior engineer's salary. At some point, someone in finance will notice — and you'd rather have a story about deliberate, optimized usage than "we just had everything on MAX mode."
This article is part of a series on AI engineering and developer productivity.