# Cost guide
Akmon is explicit about token usage and estimated spend so you can manage AI work as an engineering budget, not a surprise invoice.
## What actually drives costs
For coding agents, the largest cost driver is usually cumulative input tokens, not output tokens. Each model call resends core context plus recent conversation history.
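This growth can be sketched with a toy model (the token counts here are illustrative assumptions, not Akmon's internals): because the full history is resent on every call, cumulative input grows roughly quadratically with the number of turns, while output grows only linearly.

```python
# Illustrative sketch: why cumulative input tokens dominate for agents.
# SYSTEM_TOKENS and TOKENS_PER_TURN are made-up numbers, not Akmon's accounting.

SYSTEM_TOKENS = 3_000    # core context resent on every call
TOKENS_PER_TURN = 800    # average tokens each exchange adds to history

def cumulative_input_tokens(num_calls: int) -> int:
    """Total input tokens billed after num_calls, resending history each time."""
    total = 0
    history = 0
    for _ in range(num_calls):
        total += SYSTEM_TOKENS + history  # full context resent on this call
        history += TOKENS_PER_TURN        # conversation keeps growing
    return total

print(cumulative_input_tokens(10))  # 66000
print(cumulative_input_tokens(35))  # 581000
```

With these assumed numbers, 35 calls already resend about 581k input tokens, in the same ballpark as the real session below, while output stays comparatively small.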
Real session example:
- 35 API calls
- 672k input tokens
- 35k output tokens
- 258k cache-read tokens
- total ≈ $0.70

Using Haiku rates:
- input: 672,000 × $0.80 / 1M = $0.5376
- output: 35,000 × $4.00 / 1M = $0.1400
- cache reads: 258,000 × $0.08 / 1M = $0.0206
- total: $0.5376 + $0.1400 + $0.0206 ≈ $0.6982
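The arithmetic above can be wrapped in a small estimator. The hardcoded rates are the per-million-token Haiku prices used in the breakdown; swap them out for other models.

```python
# Estimate session cost from token counts, using the per-1M-token
# Haiku rates from the example above.
RATES = {"input": 0.80, "output": 4.00, "cache_read": 0.08}  # USD per 1M tokens

def estimate_cost(input_tokens: int, output_tokens: int, cache_read_tokens: int) -> float:
    """Return the estimated USD cost for one session."""
    return (
        input_tokens * RATES["input"]
        + output_tokens * RATES["output"]
        + cache_read_tokens * RATES["cache_read"]
    ) / 1_000_000

# The session above:
print(round(estimate_cost(672_000, 35_000, 258_000), 4))  # 0.6982
```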
## Prompt caching and why it matters
Cached prompt reads are much cheaper than fresh prompt tokens. Akmon surfaces cache usage in the footer and session summary so you can see when repeated context is becoming efficient.
Interpretation:
- a high cache-read ratio means repeated shared context is being billed at the discounted cache rate,
- a low cache ratio combined with high input usage often indicates noisy or volatile context: the prompt prefix keeps changing, so the cache rarely gets a hit.
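One way to quantify this is the fraction of prompt tokens served from cache (a sketch; the function name is illustrative, not Akmon's API):

```python
def cache_read_ratio(cache_read_tokens: int, input_tokens: int) -> float:
    """Fraction of prompt tokens served from cache, in [0.0, 1.0]."""
    total = cache_read_tokens + input_tokens
    return cache_read_tokens / total if total else 0.0

# The session above: 258k cache reads vs 672k fresh input tokens.
ratio = cache_read_ratio(258_000, 672_000)
print(f"{ratio:.0%}")  # 28%
```

A ratio trending upward over a session is a sign that the shared context is stable and caching is paying off.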
## Cost by task type
| Task | Model | Typical cost | Notes |
|---|---|---|---|
| Single-file edit | Haiku | $0.01-$0.03 | few turns |
| 3-5 file feature | Haiku | $0.05-$0.20 | moderate context |
| Build small app from scratch | Haiku | $0.30-$0.80 | many turns |
| Complex refactor | Haiku | $0.20-$0.50 | exploration heavy |
| Architecture design | Sonnet | $0.50-$2.00 | stronger reasoning |
## Model selection strategy
- Haiku: default for most implementation work.
- Sonnet: architecture and hard reasoning spikes.
- GPT-4o-mini: strong budget option if OpenAI is preferred.
- Ollama local models: free token cost, but lower capability and potentially higher latency.
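As a sketch, the strategy above could be encoded as a simple routing table. The task labels and the default fallback here are illustrative assumptions, not an Akmon feature:

```python
# Hypothetical routing table mirroring the strategy above.
MODEL_BY_TASK = {
    "single_file_edit": "haiku",
    "feature": "haiku",
    "refactor": "haiku",
    "architecture": "sonnet",       # stronger reasoning for design work
    "offline_experiment": "ollama",  # free tokens, lower capability
}

def choose_model(task_type: str) -> str:
    """Default to Haiku for implementation work; escalate only when needed."""
    return MODEL_BY_TASK.get(task_type, "haiku")

print(choose_model("architecture"))  # sonnet
print(choose_model("bugfix"))        # haiku (falls back to the default)
```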
## Practical cost controls
Use multiple levers together:
- `--max-budget-usd` for a hard stop,
- a plan/spec workflow to avoid repeated exploratory context,
- smaller, focused tasks,
- context hygiene (`/clear` when a session gets noisy),
- use `/context` and `/cost` during long runs.
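A hard stop like `--max-budget-usd` can be modelled as a guard around the call loop. This is a sketch of the concept, not Akmon's implementation, and the per-call costs are made up:

```python
class BudgetExceeded(RuntimeError):
    """Raised once cumulative spend reaches the configured ceiling."""

class BudgetGuard:
    """Accumulate per-call cost and halt the session at the budget ceiling."""
    def __init__(self, max_budget_usd: float):
        self.max_budget_usd = max_budget_usd
        self.spent = 0.0

    def record(self, call_cost_usd: float) -> None:
        self.spent += call_cost_usd
        if self.spent >= self.max_budget_usd:
            raise BudgetExceeded(
                f"spent ${self.spent:.2f} of ${self.max_budget_usd:.2f}"
            )

guard = BudgetGuard(max_budget_usd=0.50)
for cost in [0.02, 0.05, 0.11, 0.19, 0.21]:  # hypothetical per-call costs
    try:
        guard.record(cost)
    except BudgetExceeded as exc:
        print("stopping:", exc)
        break
```

The guard stops after the call that crosses the ceiling; a stricter variant could estimate the next call's cost up front and refuse to make it at all.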
## Common mistakes and troubleshooting
- Mistake: using premium models for trivial edits.
- Mistake: allowing sessions to drift into repeated read loops.
- Mistake: ignoring cache/read metrics and only watching final cost.
- Fix: split work by phase and use cheaper models for discovery.