Models & cost optimisation
This tab is about which models power your agents and how to keep the bill sensible as you scale up usage. It's the “tune it after you've built it” layer — read Tabs 1–5 and build something first, then come back here when you want to optimise.
The short answer: our recommended stack
For most builders doing personal AI agent work with development/technical tasks:
| Tier | What it does | Recommended model | How to access |
|---|---|---|---|
| Orchestration | Planning, architecture, hard decisions, writing | Claude Opus | Claude subscription or API |
| Coding & execution | Writing code, reviewing, refactoring, debugging | GPT Codex | ChatGPT Plus/Pro subscription |
| Grunt work | Summarisation, classification, routing, cleanup | Local LLM (Gemma, Qwen) via Ollama, OR Claude Haiku | Ollama on your Mac/PC, or Anthropic API |
Why this specific mix?
- Claude Opus for orchestration — unmatched at strategic reasoning and long-context analysis. Low volume, high value per call. Cost stays manageable.
- GPT Codex for coding — a ChatGPT Plus subscription (~US$20/month) gives flat-rate, high-allowance access to coding models. The equivalent volume on Claude Opus via API could run to hundreds of dollars. The capability gap is small; the cost gap is massive.
- Local LLM or Haiku for grunt work — classification and routing are high-volume, low-stakes tasks. Running them locally (free, private) or on Haiku (cheap, fast) is the sensible call.
The exception: Claude Code itself only runs on Claude models. When you're using the copilot pattern from Tab 4, you're using Claude whether you like it or not. That's fine — the copilot pattern is high-value, low-volume.
Understanding the three tiers
The orchestration tier
The model that makes decisions. Characteristics: low volume, long and context-heavy conversations, quality matters much more than speed, mistakes are expensive.
Right model: The best reasoning model you have access to. Claude Opus is the current top choice.
The execution tier
The model that does things. Characteristics: medium volume, clear input/output, speed and cost matter alongside quality, mistakes are cheap to fix.
Right model: For code, GPT Codex via subscription. For general drafting, Claude Sonnet or GPT-4-class both work well.
The grunt work tier
The model that handles repetitive, high-volume, low-stakes tasks. Characteristics: very high volume, tiny and simple requests, speed matters enormously, quality threshold is “good enough.”
Right model: Whatever is cheapest and fastest. Local Ollama (free, private) or Claude Haiku (cheap, fast, cloud).
Model provider options in depth
Anthropic Claude
Models: Opus (top tier), Sonnet (mid tier), Haiku (fast tier).
Strengths: Best-in-class at long-context reasoning and nuanced instruction-following. Excellent at staying in character (important for personalised agents). Claude Code copilot is Claude-only.
Weaknesses: Pay-per-token pricing via API gets expensive at volume. No flat-rate subscription for API access.
When to use: Orchestration tier (Opus). Personalised agents. Long-form writing. The copilot pattern.
OpenAI GPT
Strengths: Flat-rate subscription for coding work via ChatGPT Plus/Pro — the killer feature for cost-conscious builders. Strong at structured output and function calling.
Weaknesses: Tone and character can drift more than Claude. Chat subscription doesn't cover API access.
When to use: Coding tier (the clear cost winner via subscription). Structured output tasks. Function calling.
Local models via Ollama
Ollama runs open-source models on your own computer. Download once, run locally — no API calls, no per-token costs, no data leaving your machine.
| Model | Size | Strengths | When to use |
|---|---|---|---|
| Gemma 3 (4B) | ~3 GB | Very fast on Apple Silicon, good at classification | Fast helper tier — routing, “is this urgent?” |
| Gemma 3 (27B) | ~18 GB | Near-Sonnet quality for many tasks | Grunt work with higher quality bar |
| Qwen 3 (7B) | ~5 GB | Strong at reasoning for its size | Mid-tier when cloud isn't an option |
| Qwen 3 (14B) | ~9 GB | Approaches Sonnet-class for many tasks | Execution tier on capable hardware |
| Llama 3.3 (70B) | ~40 GB | High quality, needs serious hardware | Orchestration-adjacent for offline work |
Getting started with Ollama:
```shell
# Install Ollama
brew install ollama

# Pull a model
ollama pull gemma3:4b

# Test it
ollama run gemma3:4b "Classify this as urgent or not: 'reminder your domain renews tomorrow'"
```

LiteLLM — the model routing proxy
If you're running a multi-model stack, you need a way to route different tasks to different models. LiteLLM is a thin proxy server that presents a unified OpenAI-compatible API. Your agent talks to LiteLLM as if it's one model. LiteLLM routes the request based on rules you define.
Why you want it
- Centralised routing — change model choices in one config, not across every agent
- Cost tracking — logs token usage per model
- Fallbacks — if Ollama is down, automatically fall back to Haiku
- Caching — cache responses to identical prompts
- Unified API — one interface regardless of provider
- Model swaps — one-line config change, not a code rewrite
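LiteLLM handles the provider-side routing from its config, but your agent code still decides which alias each task hits. Here is a minimal sketch of that tier-picking decision, reusing the alias names from the example config below (`orchestrator`, `coder`, `helper-local`, `helper-cloud`); the task categories themselves are illustrative, not part of LiteLLM:

```python
# Illustrative tier routing: map a task type to the LiteLLM model alias it
# should hit. Alias names match the example LiteLLM config; the task
# categories are made up for this sketch.
TIER_ROUTES = {
    "plan": "orchestrator",      # low volume, high value -> Claude Opus
    "code": "coder",             # medium volume -> Codex via OpenAI
    "classify": "helper-local",  # high volume, low stakes -> local Gemma
    "summarise": "helper-local",
}

def route(task_type: str, local_available: bool = True) -> str:
    """Pick a model alias; fall back to the cloud helper if Ollama is down."""
    alias = TIER_ROUTES.get(task_type, "orchestrator")  # unknown task: safest tier
    if alias == "helper-local" and not local_available:
        return "helper-cloud"  # Haiku fallback
    return alias
```

In practice LiteLLM can also express fallbacks in its own config, which keeps agent code even simpler; the function above just makes the decision visible.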
Basic setup
```shell
# Install LiteLLM with proxy support
pipx install 'litellm[proxy]'

# Create config directory
mkdir -p ~/.litellm
```

Example config routing three tiers to three providers:
```yaml
model_list:
  # Tier 1: Orchestration
  - model_name: orchestrator
    litellm_params:
      model: anthropic/claude-opus-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

  # Tier 2: Execution (coding via OpenAI)
  - model_name: coder
    litellm_params:
      model: openai/gpt-5-codex
      api_key: os.environ/OPENAI_API_KEY

  # Tier 3a: Fast helper (local)
  - model_name: helper-local
    litellm_params:
      model: ollama/gemma3:4b
      api_base: http://localhost:11434

  # Tier 3b: Fast helper (cloud fallback)
  - model_name: helper-cloud
    litellm_params:
      model: anthropic/claude-haiku-4-5
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  drop_params: true
  cache: true
  cache_params:
    type: local
```

Claude Code prompt to set this up:
The identity consistency principle
Your identity files (SOUL.md and friends) are what keep your agent recognisably “yours” across model swaps. The model is the voice; the identity is the character.
- Invest heavily in SOUL.md before optimising models. Tab 2 Phase 5 is the most important hour of the whole build.
- Expect to iterate when you change models. A SOUL.md that produces perfect tone on Opus might need tightening on GPT-4 or Gemma.
- Test identity consistency when you swap models. Send the same knowledge question before and after.
- Keep rules explicit, not implicit. Don't write “be professional” — write “no preamble, lead with the answer, use Australian English, avoid corporate filler.”
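As a hypothetical illustration of that last rule, an explicit tone section in a SOUL.md might look like this (the exact wording is an example, not a prescribed template):

```markdown
## Tone rules
- No preamble. Lead with the answer, then supporting detail.
- Australian English spelling (optimise, colour, organise).
- No corporate filler: "leverage", "circle back", "touch base".
- If unsure, say so explicitly rather than guessing.
```

Rules at this level of specificity are testable: you can send the same prompt to two models and check whether each rule held.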
Claude Code prompt to test identity consistency across models:
Other valid stacks
The “simplicity” stack — all Anthropic
Opus + Sonnet + Haiku. Single provider, single API key, consistent voice. More expensive at volume but simplest mental model. Best for: people who value simplicity over cost optimisation.
The “privacy” stack — local-first
Opus (cloud, unavoidable) + Qwen 14B or Llama 70B (local) + Gemma 3 4B (local). Minimises what goes to cloud providers. Requires 32GB+ unified memory. Best for: sensitive content (legal, medical, financial).
The “budget” stack — cheap and cheerful
Sonnet (not Opus) + GPT-4-class via ChatGPT + Gemma 3 4B local or Haiku. Probably under $30/month total. Best for: starting out or lower-stakes work.
The “premium” stack — capability at any cost
Opus for everything. Budget $100–300 USD/month. Best for: professional work where the cost of a bad answer far exceeds the cost of tokens.
Cost optimisation patterns
1. Separate API keys per agent for cost tracking
Create a distinct API key for each agent. See which one is burning tokens and investigate.
2. Cache aggressively
LiteLLM's cache catches identical prompts. For scheduled workflows with similar inputs, this can cut costs by 30–60%.
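To see why identical prompts cache so well, here is a toy sketch of the idea (illustrative only, not LiteLLM's actual implementation): hash the prompt, and only pay for a model call on a cache miss.

```python
import hashlib

# Toy sketch of response caching: identical prompts hash to the same key,
# so a repeated scheduled prompt only pays for the model call once.
_cache: dict[str, str] = {}

def cached_call(prompt: str, call_model) -> str:
    """Return a cached response if this exact prompt has been seen before."""
    key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = call_model(prompt)  # cache miss: pay for the call
    return _cache[key]
```

This is also why scheduled workflows benefit most: a heartbeat or classification pass tends to send byte-identical prompts, which is exactly what an exact-match cache catches.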
3. Don't over-schedule
Start with: heartbeat every 30 minutes, daily briefing once per morning, classification passes every hour. Add frequency only when you feel a gap.
4. Monitor weekly, not monthly
Check API usage dashboards weekly. Catch disproportionate token burn early — don't wait for the monthly invoice.
5. Don't downgrade during builds
Use your best model while building or troubleshooting. The token savings from downgrading are trivial compared to the debugging time wasted on cheaper model mistakes.
Recommendations by starting situation
“I'm just starting”
Use Claude Sonnet for everything. Don't set up LiteLLM yet. Build the agent, use it for a few weeks, learn your usage patterns, then come back here.
“I've been running a month and want to cut costs”
Introduce LiteLLM and route grunt work to Haiku. Most savings come from pushing high-volume tasks to the cheapest tier.
“I care about privacy”
Set up Ollama and route sensitive tasks locally. Start with Gemma 3 4B. If you have 32GB+ unified memory, add Qwen 14B.
“I want the best possible output”
Run everything on Claude Opus. Budget for $100–300 USD/month in API costs. Accept that you're paying for quality.
“I'm doing heavy coding work”
Opus + Codex is the winning combination. Opus for architecture via Claude Code, Codex for code generation via ChatGPT subscription. This is the stack that kept costs manageable during the real build.
Tab complete
The short version: start simple (Sonnet for everything), tune when you feel the cost. When you tune, the recommended stack is Opus orchestration + Codex coding + local/Haiku grunt work, proxied via LiteLLM. Keep your identity files strong so model swaps don't break your agent's character.