TAB 6 — TUNE IT

Models & cost optimisation.

Which models should power each tier of your agents' work, and how to keep the bill sensible as usage scales. This tab is the “tune it after you've built it” layer — read Tabs 1–5 and build something first, then come back here when you want to optimise.

The short answer: our recommended stack

For most builders doing personal AI agent work with development/technical tasks:

| Tier | What it does | Recommended model | How to access |
| --- | --- | --- | --- |
| Orchestration | Planning, architecture, hard decisions, writing | Claude Opus | Claude subscription or API |
| Coding & execution | Writing code, reviewing, refactoring, debugging | GPT Codex | ChatGPT Plus/Pro subscription |
| Grunt work | Summarisation, classification, routing, cleanup | Local LLM (Gemma, Qwen) via Ollama, or Claude Haiku | Ollama on your Mac/PC, or Anthropic API |

Why this specific mix?

  • Claude Opus for orchestration — unmatched at strategic reasoning and long-context analysis. Low volume, high value per call. Cost stays manageable.
  • GPT Codex for coding — a ChatGPT Plus subscription (~$30 USD/month) gives effectively unlimited access to coding models. The equivalent volume on Claude Opus via API could run to hundreds of dollars. The capability gap is small; the cost gap is massive.
  • Local LLM or Haiku for grunt work — classification and routing are high-volume, low-stakes tasks. Running them locally (free, private) or on Haiku (cheap, fast) is the sensible call.

The exception: Claude Code itself only runs on Claude models. When you're using the copilot pattern from Tab 4, you're using Claude whether you like it or not. That's fine — the copilot pattern is high-value, low-volume.

Understanding the three tiers

The orchestration tier

The model that makes decisions. Characteristics: low volume, long and context-heavy conversations, quality matters much more than speed, mistakes are expensive.

Right model: The best reasoning model you have access to. Claude Opus is the current top choice.

The execution tier

The model that does things. Characteristics: medium volume, clear input/output, speed and cost matter alongside quality, mistakes are cheap to fix.

Right model: For code, GPT Codex via subscription. For general drafting, Claude Sonnet or GPT-4-class both work well.

The grunt work tier

The model that handles repetitive, high-volume, low-stakes tasks. Characteristics: very high volume, tiny and simple prompts, speed matters enormously, quality threshold is “good enough.”

Right model: Whatever is cheapest and fastest. Local Ollama (free, private) or Claude Haiku (cheap, fast, cloud).
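The three tiers translate directly into a routing decision. A minimal sketch in shell, using the route names from the LiteLLM config later in this tab (the task labels and the default are made up for illustration):

```bash
# Map a task type to a model tier. Task labels are illustrative;
# the route names match the LiteLLM config later in this tab.
route_model() {
  case "$1" in
    plan|architect|decide)    echo "orchestrator" ;;  # tier 1: best reasoning
    code|refactor|debug)      echo "coder"        ;;  # tier 2: cheap at volume
    classify|route|summarise) echo "helper-local" ;;  # tier 3: free and fast
    *)                        echo "helper-cloud" ;;  # unknown: cheap cloud fallback
  esac
}

route_model plan       # prints: orchestrator
route_model classify   # prints: helper-local
```

The value of centralising this mapping is that retiering a task is a one-line change.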

Model provider options in depth

Anthropic Claude

Models: Opus (top tier), Sonnet (mid tier), Haiku (fast tier).

Strengths: Best-in-class at long-context reasoning and nuanced instruction-following. Excellent at staying in character (important for personalised agents). Claude Code copilot is Claude-only.

Weaknesses: Pay-per-token pricing via API gets expensive at volume. No flat-rate subscription for API access.

When to use: Orchestration tier (Opus). Personalised agents. Long-form writing. The copilot pattern.

OpenAI GPT

Strengths: Flat-rate subscription for coding work via ChatGPT Plus/Pro — the killer feature for cost-conscious builders. Strong at structured output and function calling.

Weaknesses: Tone and character can drift more than Claude. Chat subscription doesn't cover API access.

When to use: Coding tier (the clear cost winner via subscription). Structured output tasks. Function calling.

Local models via Ollama

Ollama runs open-source models on your own computer. Download once, run locally — no API calls, no per-token costs, no data leaving your machine.

| Model | Size | Strengths | When to use |
| --- | --- | --- | --- |
| Gemma 3 (4B) | ~3 GB | Very fast on Apple Silicon, good at classification | Fast helper tier — routing, “is this urgent?” |
| Gemma 3 (27B) | ~18 GB | Near-Sonnet quality for many tasks | Grunt work with a higher quality bar |
| Qwen 3 (7B) | ~5 GB | Strong at reasoning for its size | Mid-tier when cloud isn't an option |
| Qwen 3 (14B) | ~9 GB | Approaches Sonnet-class for many tasks | Execution tier on capable hardware |
| Llama 3.3 (70B) | ~40 GB | High quality, needs serious hardware | Orchestration-adjacent for offline work |

Getting started with Ollama:

```bash
# Install Ollama
brew install ollama

# Pull a model
ollama pull gemma3:4b

# Test it
ollama run gemma3:4b "Classify this as urgent or not: 'reminder your domain renews tomorrow'"
```

LiteLLM — the model routing proxy

If you're running a multi-model stack, you need a way to route different tasks to different models. LiteLLM is a thin proxy server that presents a unified OpenAI-compatible API. Your agent talks to LiteLLM as if it's one model. LiteLLM routes the request based on rules you define.

Why you want it

  1. Centralised routing — change model choices in one config, not across every agent
  2. Cost tracking — logs token usage per model
  3. Fallbacks — if Ollama is down, automatically fall back to Haiku
  4. Caching — cache responses to identical prompts
  5. Unified API — one interface regardless of provider
  6. Model swaps — one-line config change, not a code rewrite
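Benefit 3 is configured rather than coded. A sketch of the relevant fragment, alongside the model routes defined in the config below; I believe LiteLLM reads fallback rules from the proxy config roughly like this, but verify the exact key names against the current LiteLLM docs:

```yaml
# Fallback routing: if a request to helper-local (Ollama) fails,
# retry it against helper-cloud (Haiku). Key names follow LiteLLM's
# proxy config conventions; check the current docs before relying on them.
router_settings:
  fallbacks:
    - helper-local: ["helper-cloud"]
```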

Basic setup

```bash
# Install LiteLLM with proxy support
pipx install 'litellm[proxy]'

# Create config directory
mkdir -p ~/.litellm
```

Example config routing three tiers to three providers:

~/.litellm/config.yaml:

```yaml
model_list:
  # Tier 1: Orchestration
  - model_name: orchestrator
    litellm_params:
      model: anthropic/claude-opus-4-6
      api_key: os.environ/ANTHROPIC_API_KEY

  # Tier 2: Execution (coding via OpenAI)
  - model_name: coder
    litellm_params:
      model: openai/gpt-5-codex
      api_key: os.environ/OPENAI_API_KEY

  # Tier 3a: Fast helper (local)
  - model_name: helper-local
    litellm_params:
      model: ollama/gemma3:4b
      api_base: http://localhost:11434

  # Tier 3b: Fast helper (cloud fallback)
  - model_name: helper-cloud
    litellm_params:
      model: anthropic/claude-haiku-4-5
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  drop_params: true
  cache: true
  cache_params:
    type: local

Claude Code prompt to set this up:

▶ Claude Code Prompt
I want to install and configure LiteLLM as a model routing proxy on this server. Please:
1. Install LiteLLM via pipx: `pipx install 'litellm[proxy]'`
2. Create the config directory at `~/.litellm`
3. Create a config.yaml that defines four model routes: `orchestrator` (Claude Opus), `coder` (GPT Codex), `helper-local` (Ollama Gemma 3 4B if available), `helper-cloud` (Claude Haiku as fallback)
4. Set up LiteLLM as a system-level systemd service listening on `127.0.0.1:4000`
5. Make sure it loads API keys from `/etc/agent-board/shared.env`
6. Start the service and verify it's responding on port 4000
7. Tell me how to update my existing agents to route through LiteLLM
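For the last step, the usual change is small: an agent that already speaks an OpenAI-compatible API just needs its base URL pointed at the proxy. A sketch, assuming your agent framework honours the common OpenAI SDK environment variables (yours may use different names):

```bash
# Route an existing OpenAI-compatible agent through the proxy.
export OPENAI_BASE_URL="http://127.0.0.1:4000"  # LiteLLM proxy, not api.openai.com
export OPENAI_API_KEY="sk-anything"             # placeholder; the proxy injects real provider keys

# Then request models by route name instead of provider name:
#   "orchestrator", "coder", "helper-local", "helper-cloud"
```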

The identity consistency principle

Your identity files (SOUL.md and friends) are what keep your agent recognisably “yours” across model swaps. The model is the voice; the identity is the character.

Practical implications
  1. Invest heavily in SOUL.md before optimising models. Tab 2 Phase 5 is the most important hour of the whole build.
  2. Expect to iterate when you change models. A SOUL.md that produces perfect tone on Opus might need tightening on GPT-4 or Gemma.
  3. Test identity consistency when you swap models. Send the same knowledge question before and after.
  4. Keep rules explicit, not implicit. Don't write “be professional” — write “no preamble, lead with the answer, use Australian English, avoid corporate filler.”

Claude Code prompt to test identity consistency across models:

▶ Claude Code Prompt
I just swapped my agent's model from [old model] to [new model]. Please help me test whether the identity is still consistent:
1. Remind me of the exact model change I made
2. Suggest 3-5 identity test prompts I should send the agent (knowledge questions, voice tests, rule tests)
3. After I send each one and paste the response, help me compare against what I'd expect from my SOUL.md
4. If the responses drift from my intended voice, help me tighten SOUL.md to make the rules more explicit
5. Restart the service after any identity updates and retest

Other valid stacks

The “simplicity” stack — all Anthropic

Opus + Sonnet + Haiku. Single provider, single API key, consistent voice. More expensive at volume but simplest mental model. Best for: people who value simplicity over cost optimisation.

The “privacy” stack — local-first

Opus (cloud, unavoidable) + Qwen 14B or Llama 70B (local) + Gemma 3 4B (local). Minimises what goes to cloud providers. Requires 32GB+ unified memory. Best for: sensitive content (legal, medical, financial).

The “budget” stack — cheap and cheerful

Sonnet (not Opus) + GPT-4-class via ChatGPT + Gemma 3 4B local or Haiku. Probably under $30/month total. Best for: starting out or lower-stakes work.

The “premium” stack — capability at any cost

Opus for everything. Budget $100–300 USD/month. Best for: professional work where the cost of a bad answer far exceeds the cost of tokens.

Cost optimisation patterns

1. Separate API keys per agent for cost tracking

Create a distinct API key for each agent. See which one is burning tokens and investigate.

2. Cache aggressively

LiteLLM's cache catches identical prompts. For scheduled workflows with similar inputs, this can cut costs by 30–60%.
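The saving is linear in the exact-hit rate, which is worth sanity-checking against your own workflows. The baseline cost and hit rate below are illustrative assumptions:

```bash
# Effect of cache hit rate on a scheduled workflow's bill.
# Baseline cost and hit rate are illustrative assumptions.
awk 'BEGIN {
  base = 20.0   # assumed $/month on grunt-work calls without caching
  hit  = 0.45   # assumed fraction of prompts that repeat exactly
  printf "45%% hit rate: $%.2f/month, saving $%.2f\n", base*(1-hit), base*hit
}'
# prints: 45% hit rate: $11.00/month, saving $9.00
```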

3. Don't over-schedule

Start with: heartbeat every 30 minutes, daily briefing once per morning, classification passes every hour. Add frequency only when you feel a gap.
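That starting cadence maps directly onto three cron entries. The script paths are placeholders for whatever triggers your agent's workflows:

```
# Illustrative crontab for the suggested starting cadence
# (paths are placeholders, not real scripts from this guide).
*/30 * * * *  /usr/local/bin/agent-heartbeat        # every 30 minutes
0 7 * * *     /usr/local/bin/agent-daily-briefing   # once per morning, 7am
0 * * * *     /usr/local/bin/agent-classify-inbox   # hourly classification pass
```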

4. Monitor weekly, not monthly

Check API usage dashboards weekly. Catch disproportionate token burn early — don't wait for the monthly invoice.

5. Don't downgrade during builds

Use your best model while building or troubleshooting. The token savings from downgrading are trivial compared to the debugging time wasted on cheaper model mistakes.

Recommendations by starting situation

“I'm just starting”

Use Claude Sonnet for everything. Don't set up LiteLLM yet. Build the agent, use it for a few weeks, learn your usage patterns, then come back here.

“I've been running a month and want to cut costs”

Introduce LiteLLM and route grunt work to Haiku. Most savings come from pushing high-volume tasks to the cheapest tier.

“I care about privacy”

Set up Ollama and route sensitive tasks locally. Start with Gemma 3 4B. If you have 32GB+ unified memory, add Qwen 14B.

“I want the best possible output”

Run everything on Claude Opus. Budget for $100–300 USD/month in API costs. Accept that you're paying for quality.

“I'm doing heavy coding work”

Opus + Codex is the winning combination. Opus for architecture via Claude Code, Codex for code generation via ChatGPT subscription. This is the stack that kept costs manageable during the real build.

Tab complete

The short version: start simple (Sonnet for everything), tune when you feel the cost. When you tune, the recommended stack is Opus orchestration + Codex coding + local/Haiku grunt work, proxied via LiteLLM. Keep your identity files strong so model swaps don't break your agent's character.