Cost guide

Akmon is explicit about token usage and estimated spend so you can manage AI work as an engineering budget, not a surprise invoice.

What actually drives costs

For coding agents, the largest cost driver is usually cumulative input tokens, not output tokens. Each model call resends core context plus recent conversation history.
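This compounding is easy to see in a toy model. A minimal sketch, assuming an 8k-token base context and 1.5k new tokens per turn (illustrative numbers, not Akmon's actual accounting):

```python
def cumulative_input_tokens(turns, base_context=8_000, per_turn=1_500):
    """Total input tokens billed across a session where every call
    resends the base context plus all prior conversation history."""
    total = 0
    history = 0
    for _ in range(turns):
        total += base_context + history   # this call's input
        history += per_turn               # conversation grows each turn
    return total

# 35 turns bills over a million input tokens even though each
# individual turn only adds 1.5k of new conversation.
print(cumulative_input_tokens(35))  # 1172500
```

Because every call replays the history, cumulative input grows roughly quadratically with turn count, which is why input tokens dominate the bill.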

Real session example:

  • 35 API calls
  • 672k input tokens
  • 35k output tokens
  • 258k cache-read tokens
  • total around $0.70

Using Haiku rates:

  • input: 672000 * $0.80 / 1M = $0.5376
  • output: 35000 * $4.00 / 1M = $0.1400
  • cache reads: 258000 * $0.08 / 1M = $0.0206
  • sum: $0.6982, i.e. roughly $0.70
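The breakdown above can be reproduced mechanically. A small sketch using the Haiku rates quoted in this guide (the function and dict names are illustrative, not an Akmon API):

```python
# $ per 1M tokens, as quoted in this guide for Haiku.
RATES = {"input": 0.80, "output": 4.00, "cache_read": 0.08}

def session_cost(usage):
    """Estimated spend in USD given token counts per category."""
    return sum(usage[kind] * RATES[kind] / 1_000_000 for kind in usage)

usage = {"input": 672_000, "output": 35_000, "cache_read": 258_000}
print(round(session_cost(usage), 4))  # 0.6982
```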

Prompt caching and why it matters

Cached prompt reads are much cheaper than fresh prompt tokens. Akmon surfaces cache usage in the footer and session summary so you can see when repeated context is becoming efficient.

Interpretation:

  • high cache read ratio often means repeated shared context is being billed at discount rates,
  • low cache ratio with high input often indicates noisy/volatile context.
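One way to quantify this from the footer numbers is a cache-read ratio. The definition below is this guide's illustration, not a metric Akmon reports directly:

```python
def cache_read_ratio(cache_read, fresh_input):
    """Fraction of prompt tokens served from cache.
    Assumes cache reads and fresh input are reported separately."""
    total = cache_read + fresh_input
    return cache_read / total if total else 0.0

# Using the example session above: 258k cached vs 672k fresh input.
ratio = cache_read_ratio(258_000, 672_000)
print(f"{ratio:.0%}")  # 28%
```

A rising ratio over a long session suggests the shared context is stable and being billed at the discounted cache rate; a flat, low ratio with growing input suggests volatile context.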

Cost by task type

  Task                          Model   Typical cost  Notes
  Single-file edit              Haiku   $0.01-$0.03   few turns
  3-5 file feature              Haiku   $0.05-$0.20   moderate context
  Build small app from scratch  Haiku   $0.30-$0.80   many turns
  Complex refactor              Haiku   $0.20-$0.50   exploration heavy
  Architecture design           Sonnet  $0.50-$2.00   stronger reasoning

Model selection strategy

  • Haiku: default for most implementation work.
  • Sonnet: architecture and hard reasoning spikes.
  • GPT-4o-mini: strong budget option if OpenAI is preferred.
  • Ollama local models: free token cost, but lower capability and potentially higher latency.
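The strategy above reads naturally as a routing rule. A hypothetical sketch; the task labels, model identifiers, and parameters are illustrative, not Akmon configuration:

```python
def pick_model(task, prefer_openai=False, offline=False):
    """Route a task to a model tier following the strategy above.
    Labels and model names are placeholders, not real Akmon flags."""
    if offline:
        return "ollama-local"     # free tokens, lower capability, maybe slower
    if task in ("architecture", "hard-reasoning"):
        return "sonnet"           # pay for stronger reasoning only here
    if prefer_openai:
        return "gpt-4o-mini"      # budget option on the OpenAI side
    return "haiku"                # default for implementation work

print(pick_model("implementation"))            # haiku
print(pick_model("architecture"))              # sonnet
```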

Practical cost controls

Use multiple levers together:

  • --max-budget-usd for hard stop,
  • plan/spec workflow to avoid repeated exploratory context,
  • smaller focused tasks,
  • context hygiene (/clear when a session gets noisy),
  • use /context and /cost during long runs.
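The hard-stop lever can be pictured as a running tally that halts work once estimated spend crosses the cap. A sketch of the idea only; the class and exception are hypothetical, not Akmon internals:

```python
class BudgetExceeded(RuntimeError):
    """Raised when estimated spend passes the configured cap."""

class BudgetGuard:
    """Illustrative model of a --max-budget-usd style hard stop."""

    def __init__(self, max_usd):
        self.max_usd = max_usd
        self.spent = 0.0

    def charge(self, usd):
        self.spent += usd
        if self.spent > self.max_usd:
            raise BudgetExceeded(
                f"spent ${self.spent:.2f} exceeds cap ${self.max_usd:.2f}"
            )

guard = BudgetGuard(0.50)
guard.charge(0.30)          # fine, under the cap
try:
    guard.charge(0.30)      # would push spend past $0.50
except BudgetExceeded as exc:
    print("stopped:", exc)
```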

Common mistakes and troubleshooting

  • Mistake: using premium models for trivial edits.
  • Mistake: allowing sessions to drift into repeated read loops.
  • Mistake: ignoring cache/read metrics and only watching final cost.
  • Fix: split work by phase and use cheaper models for discovery.