The AI Cost Iceberg: Transitioning to Total Cost of Operation (TCOp)

Stop focusing on sticker prices; manage hidden 'thinking' tokens and context taxes through a dynamic total cost of operation framework.

Conclusion

Organisations frequently underestimate the financial implications of generative artificial intelligence (GenAI) by focusing solely on model inference rates. In practice, the sticker price per token represents only a fraction of actual spend. To avoid budget overruns, leaders must move beyond the traditional total cost of ownership (TCO) models too often used to guesstimate the ongoing cost of running an AI initiative, and adopt a dynamic total cost of operation (TCOp) framework. This approach accounts for hidden consumption drivers – including thinking tokens, context window taxes, the cost multiplication of agentic loops, retrieval infrastructure, and ongoing human verification and testing.

By understanding the true TCOp, IT leaders will be able to demonstrate the genuine value proposition GenAI offers the organisation.

Observations

Evolving Economics of Model Pricing: The foundational error in AI budgeting is assuming a linear relationship between input volume and cost. While vendors price models per million tokens, distinct factors distort this calculation in practice, often leading to unexpected usage taxes.

  • The Reasoning Multiplier (The Write Tax): Advanced System 2 models achieve higher performance not merely through better training, but by generating thousands of internal thinking tokens before producing a visible response. A user query that needs only 500 visible output tokens may consume a further 1,500 hidden tokens to work through the reasoning path. Consequently, users are billed for compute volume they never see, effectively creating a hidden multiplier on output costs.
  • The Read Tax and Context Penalties: Pricing structures are increasingly tiered based on context length. For example, costs for premium models like Gemini 2.5 Pro and Claude Sonnet 4 can increase by over 50 per cent for input and 20 per cent for output if a prompt exceeds 200,000 tokens. Table 3 summarises the context taxes for popular models. This penalises the lazy architectural pattern of dumping entire codebases or 50-page legal PDFs into the context window – a practice that was financially viable with previous generations but is now cost-prohibitive. For pure reading and comprehension tasks where deep reasoning is not required, previous-generation models can offer far better value.
  • The Hidden Layers of Total Cost of Operation (TCOp): The most significant consideration in a TCOp framework is the Accuracy Trap, where the perceived savings of a low-cost model are erased by the operational costs of its failures. While the sticker price of a premium model like Claude Sonnet 4.5 is approximately 45 times higher per task than a model like Mistral Small ($0.054 vs. $0.0012), this delta only accounts for compute. A robust budget must look beyond inference to the full TCOp, defined as the marginal cost of delivering an AI-enabled outcome. This comprises several often-overlooked categories:
  • Technical Consumption Costs:
    • Context and RAG Bloat: Chatty user experiences with long retained context mean each new message pays to resend previous content. Similarly, retrieval-augmented generation (RAG) often suffers from over-retrieval – fetching 20 documents where four would suffice – or poor chunking that inflates input tokens.
    • Observability Tax: Modern AI applications log full prompts, responses, and tool parameters for debugging. At scale, the storage and indexing costs for these verbose logs, especially for agentic traces, can rival the compute spend itself.
    • Continuous Evaluation: Quality assurance is not a one-time project phase but a continuous operational cost. Every new prompt template or model variant requires re-running evaluation datasets, incurring extra inference and retrieval costs purely for testing.
  • Quality and Risk Costs:
    • Human-in-the-Loop: High-stakes workflows often require pre-commit verification, where a human checks an AI draft, or post-hoc sampling for compliance. The time cost per reviewed item, multiplied by volume, must be factored into the unit economics.
    • Error and Rework: Every low-quality output triggers remediation, whether it is a user re-running a query, a support ticket, or a manual correction. If a cheaper model has a higher failure rate, the operational cost of this rework often exceeds the savings from the lower token price.
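The write tax above can be made concrete with a short sketch. The rates and token counts below are illustrative assumptions, not quoted vendor prices; the key point is that hidden thinking tokens are billed at the output rate.

```python
# Sketch: how hidden reasoning tokens inflate the bill. Token counts and
# per-million rates are illustrative assumptions, not quoted prices.

def query_cost(input_tokens, visible_output_tokens, thinking_tokens,
               input_rate, output_rate):
    """Billed cost of one query, with rates per 1M tokens.

    Thinking tokens are charged at the output rate even though the user
    never sees them.
    """
    billed_output = visible_output_tokens + thinking_tokens
    return (input_tokens * input_rate + billed_output * output_rate) / 1_000_000

# A 500-token visible answer that consumed 1,500 hidden reasoning tokens:
naive = query_cost(2_000, 500, 0, input_rate=3.0, output_rate=15.0)
actual = query_cost(2_000, 500, 1_500, input_rate=3.0, output_rate=15.0)
# Output-side spend is 4x what the visible answer alone would suggest.
```

Under these assumed rates, the naive estimate is $0.0135 per query while the actual bill is $0.036 – the visible answer understates output-side spend by a factor of four.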

Agentic AI – The Cost Mega-Multiplier

Agentic systems amplify operational costs by adding new dimensions of consumption. A single user goal, such as 'book a trip', is no longer a single inference but a chain of events: planning, tool calls, parsing, and potential self-correction.

  • Planning and Looping: Agents often use step-by-step thinking or self-reflection loops, where the model critiques its own output. Each step is another call to the large language model (LLM). Without strict budgets, an agent that gets stuck can rapidly burn through significant resources.
  • Simulation and Staging: Responsible deployments often run agents in simulation modes or shadow runs before allowing them to act in production. This means for every real task, there may be multiple simulated runs, multiplying usage.
  • Cost Variance: The choice of model for agents is critical. A 5-step agent task using a premium model like Claude Sonnet 4.5 might cost around $0.054 per task, whereas the same task using a small model like Mistral Small could cost as little as $0.0012 – a 45x delta. This difference is material and highlights the importance of setting model-selection parameters for multi-step agentic workflows.
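The per-task delta can be sketched by summing LLM costs across an agent chain. The step token counts and the per-million rates below are illustrative assumptions (they do not reproduce the exact figures above), but they show why the model choice compounds across steps: each step re-sends a growing context.

```python
# Sketch: per-task cost of a multi-step agent chain under two models.
# Step token counts and per-million rates are illustrative assumptions.

PRICES = {  # (input, output) cost per 1M tokens, indicative only
    "premium": (3.00, 15.00),
    "small":   (0.10, 0.30),
}

def agent_task_cost(steps, model):
    """Sum LLM cost across every step of one agent run.

    Each step is (input_tokens, output_tokens); agents re-send growing
    context, so later steps carry larger inputs.
    """
    in_rate, out_rate = PRICES[model]
    return sum(i * in_rate + o * out_rate for i, o in steps) / 1_000_000

# Five steps with context growing on each call:
steps = [(1_000, 300), (1_800, 300), (2_600, 300), (3_400, 300), (4_200, 300)]
premium = agent_task_cost(steps, "premium")
small = agent_task_cost(steps, "small")
ratio = premium / small  # a 30x-plus delta under these assumptions
```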

Best Practices for Accurate Budgeting

To maintain fiscal control, organisations should adopt the following architectural and financial strategies:

  • Implement Tiered Routing: Avoid the one-model-fits-all fallacy. A two-tier cascade strategy – first routing everything to a cheap model (e.g., Mistral Small) and only escalating to a premium model (e.g., o3) if the first model signals low confidence – can reduce costs by 76 per cent to 91 per cent compared with an all-premium baseline.
  • Split Roles: Use small models for structure, big models for language. A cheap model can extract entities, classify intent, and structure facts, while the expensive model is reserved only for generating nuanced narratives or complex reasoning based on those facts. See Tables 1 and 2 for a list of popular models.
  • Establish Budgets per Execution: For agentic workflows, enforce hard limits on the number of steps, tool calls, and tokens allowed per task. Treat cost as a constraint, instructing the system to return a partial result or ask for human help if the budget is exhausted.
  • Monitor Unit Economics: Move reporting from total monthly spend to cost per outcome (e.g., cost per support ticket resolved). This granularity exposes inefficient workflows where the cost of AI exceeds the business value generated.
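The tiered-routing cascade can be sketched as follows. The model names, the call_model helper, and the confidence threshold are hypothetical stand-ins for whatever provider SDK and calibration an organisation actually uses.

```python
# Sketch of a two-tier cascade: route every request to a cheap model
# first, escalate only on low confidence. Model names, the call_model
# signature, and the threshold are hypothetical assumptions.

CONFIDENCE_FLOOR = 0.7  # assumed threshold; tune against an eval set

def cascade(prompt, call_model):
    """call_model(model_name, prompt) -> (answer, confidence in [0, 1])."""
    answer, confidence = call_model("cheap-small-model", prompt)
    if confidence >= CONFIDENCE_FLOOR:
        return answer, "tier-1"
    # Escalate: pay premium rates only for the hard minority of requests.
    answer, _ = call_model("premium-reasoning-model", prompt)
    return answer, "tier-2"
```

The savings figure above depends on what fraction of traffic the cheap tier resolves; the cascade only wins if the confidence signal is reasonably calibrated, which is why it should be validated against an evaluation set first.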

Next Steps

  • Audit current AI workloads to identify lazy context usage and implement summarisation or windowing strategies.
  • Deploy a model router to direct simple classification and extraction tasks to lightweight models.
  • Define explicit per-run budgets (token limits and step counts) for all autonomous agent workflows.
  • Update financial forecasting to include line items for hidden costs: vector storage, evaluation runs, and verification time.
  • Establish a cost-per-outcome metric for key use cases to track economic viability.
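A per-run budget guard for agent workflows might look like the sketch below. The default caps are illustrative assumptions; the important property is that every step is charged against hard limits, and the agent returns a partial result or escalates to a human once any cap is exhausted.

```python
# Sketch: hard per-run caps for an agent loop, treating cost as a
# constraint. Default limits are illustrative assumptions.

class RunBudget:
    """Hard caps for one agent run."""

    def __init__(self, max_steps=8, max_tokens=50_000, max_tool_calls=10):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.max_tool_calls = max_tool_calls
        self.steps = self.tokens = self.tool_calls = 0

    def charge(self, tokens=0, tool_calls=0):
        """Record one step; return False once any cap is exhausted.

        On False, the agent should stop, return a partial result,
        or ask for human help rather than keep looping.
        """
        self.steps += 1
        self.tokens += tokens
        self.tool_calls += tool_calls
        return (self.steps <= self.max_steps
                and self.tokens <= self.max_tokens
                and self.tool_calls <= self.max_tool_calls)
```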

AI Model Cost Comparison (Q4 2025)

Table 1: High-Reasoning/Premium LLMs

Designed for complex reasoning, planning, and high-stakes tasks (AUD pricing as at December 2025).

| Model | Provider | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Notes |
|---|---|---|---|---|
| o3 | OpenAI | $2.00 | $8.00 | Flagship reasoning model; pricing reflects post-June 2025 cuts. |
| Claude Sonnet 4.5 | Anthropic | $3.00 | $15.00 | Premium tier; highest cost per token in this class. |
| Gemini 2.5 Pro | Google | $1.25 | $10.00 | Pricing for standard context (<200k tokens); costs rise for long context. |
| o3-mini | OpenAI | $1.10 | $4.40 | Cost-efficient reasoning alternative. |
| DeepSeek-R1 | DeepSeek | $0.55 | $2.19 | Cache-miss price; significantly cheaper for structured thought tasks. |

Table 2: Low-Cost/Lightweight LLMs

Suitable for classification, routing, extraction, and high-volume tasks (AUD pricing as at December 2025).

| Model | Provider | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Notes |
|---|---|---|---|---|
| Mistral Medium 3 | Mistral | $0.40 | $2.00 | Balanced mid-tier performance. |
| Gemini 2.5 Flash-Lite | Google | $0.10 | $0.40 | Ultra-low cost; ideal for routing and metadata extraction. |
| Mistral Small 3.1 | Mistral | $0.10 | $0.30 | Extremely efficient for basic tasks. |
| Phi-3-mini | Microsoft | $0.13 | $0.52 | Edge/low-latency focus. |
| GLM-4-9B | Zhipu/SiliconFlow | $0.086 | $0.086 | Lowest commercial rate for constrained tasks. |

Table 3: Large Language Model Context Window Taxes

| Vendor | Model | Threshold where pricing steps up | Price at/under threshold | Price over threshold |
|---|---|---|---|---|
| Anthropic (Claude) | Claude Sonnet 4 / Sonnet 4.5 (when 1M context window is enabled) | > 200,000 input tokens (all tokens billed at long-context rates once exceeded) | Input: $3/1M tokens; Output: $15/1M tokens | Input: $6/1M tokens; Output: $22.50/1M tokens |
| Google (Gemini API / AI Studio) (Google AI for Developers) | Gemini 2.5 Pro | > 200,000 tokens per prompt | Input: $1.25/1M tokens; Output: $10/1M tokens | Input: $2.50/1M tokens; Output: $15/1M tokens |
| Google Cloud (Vertex AI) (Google Cloud) | Gemini 1.5 Pro | > 128,000 input tokens | Text input: $0.0003125/1k characters; Text output: $0.00125/1k characters | Text input: $0.000625/1k characters; Text output: $0.0025/1k characters |
| Google Cloud (Vertex AI) (Google Cloud) | Gemini 1.5 Flash | > 128,000 input tokens | Text input: $0.00001875/1k characters; Text output: $0.000075/1k characters | Text input: $0.0000375/1k characters; Text output: $0.00015/1k characters |
| xAI (Grok) (OpenRouter) | Grok 4 | > 128,000 total tokens in request (pricing increases beyond this point) | Reported: Input: $3/1M tokens; Output: $15/1M tokens | Reported: Input: $6/1M tokens; Output: $30/1M tokens |
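The step-up pattern in Table 3 can be encoded directly. The sketch below mirrors the Claude Sonnet row purely for illustration (rates per 1M tokens, all tokens repriced once the threshold is crossed, as that row describes); current vendor pricing should always be confirmed before budgeting.

```python
# Sketch: billed cost either side of a long-context threshold, following
# the step-up pattern in Table 3. Rates mirror the Claude Sonnet row for
# illustration only; confirm current vendor pricing before use.

TIERS = {
    "claude-sonnet": {
        "threshold": 200_000,
        "base": (3.00, 15.00),   # (input, output) per 1M, at/under threshold
        "long": (6.00, 22.50),   # all tokens repriced once exceeded
    },
}

def tiered_cost(model, input_tokens, output_tokens):
    """Billed cost in dollars for one call under a step-up pricing tier."""
    tier = TIERS[model]
    in_rate, out_rate = (tier["long"] if input_tokens > tier["threshold"]
                         else tier["base"])
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

under = tiered_cost("claude-sonnet", 199_000, 2_000)  # just under threshold
over = tiered_cost("claude-sonnet", 201_000, 2_000)   # just over threshold
# Two thousand extra input tokens nearly double the bill.
```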

Generative AI Budgeting Checklist – Have You Considered…

1. Inference & Compute

  • Model Selection: Have we right-sized the model for the task? (e.g., are we using a reasoning model for simple text extraction?)
  • Token Estimation: Have we accounted for hidden thinking tokens (billed at the output rate) in reasoning models?
  • Context Management: Is there a mechanism to truncate or summarise long conversation histories to prevent context bloat?
  • Long-Context Penalty: Does the use case require >200k context? If so, have we budgeted for the premium pricing tier?
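One concrete mechanism for the context-management item above is a token-budgeted sliding window over conversation history. The sketch below is a minimal illustration; the 4-characters-per-token heuristic is a rough assumption and a real tokenizer should be used in practice.

```python
# Sketch: token-budgeted sliding window over conversation history,
# keeping the system prompt and the most recent turns. The 4-chars-per-
# token heuristic is a rough assumption; use a real tokenizer in practice.

def count_tokens(text):
    return max(1, len(text) // 4)  # crude estimate only

def window_history(system_prompt, turns, budget_tokens=8_000):
    """Drop oldest turns until the prompt fits the token budget."""
    kept = list(turns)
    def total():
        return count_tokens(system_prompt) + sum(count_tokens(t) for t in kept)
    while kept and total() > budget_tokens:
        kept.pop(0)  # evict the oldest turn first
    return [system_prompt] + kept
```

Summarising evicted turns into a short rolling digest (rather than discarding them) is a common refinement when older context still matters.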

2. Infrastructure & Operations (Technical TCOp)

  • Retrieval (RAG): Have we budgeted for vector database storage, query costs, and embedding updates?
  • Orchestration: Are costs for API gateways, Cloud functions, and queuing systems included?
  • Observability: Have we estimated the storage costs for logging full prompts/responses and agent traces?
  • Networking: Are there cross-region data transfer or egress fees associated with the model API calls?

3. Quality & Risk (Operational TCOp)

  • Human-in-the-Loop: Have we calculated the labour cost for human verification of AI outputs?
  • Error Recovery: Is there a budget buffer for re-running failed queries or handling user escalations due to poor AI performance?
  • Safety Checks: Are we paying for separate API calls for content moderation (toxicity, PII redaction) before/after the main LLM call?
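The accuracy trap described in the Observations can be expressed as unit economics. The failure rates and rework cost below are illustrative assumptions; the per-task token costs reuse the $0.054 vs. $0.0012 figures from the Observations.

```python
# Sketch: the accuracy trap as unit economics. Failure rates and the
# rework labour cost are illustrative assumptions; token costs reuse
# the per-task figures quoted earlier in this note.

def cost_per_good_outcome(token_cost, failure_rate, rework_cost):
    """Expected cost of one accepted output, folding in remediation.

    Each failure incurs the remediation cost (human correction, support
    ticket, or re-run) before the task is actually done.
    """
    return token_cost + failure_rate * rework_cost

cheap = cost_per_good_outcome(0.0012, failure_rate=0.15, rework_cost=4.00)
premium = cost_per_good_outcome(0.054, failure_rate=0.02, rework_cost=4.00)
# Under these assumptions the "45x cheaper" model is the more expensive one.
```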

4. Agentic Specifics

  • Loop Limits: Are there hard caps on the number of steps an agent can take per task?
  • Tool Costs: Have we budgeted for the API costs of the tools the agent calls (e.g., paid search APIs, database queries)?
  • Simulation: Have we budgeted for shadow runs or simulation environments used to test agent behaviour before production?

5. Lifecycle & Governance

  • Evaluation: Is there a recurring budget for running evaluation sets to test prompt/model updates?
  • Maintenance: Have we allocated resources for prompt ops – updating and refining templates as edge cases are discovered?
  • Shadow AI: Is there a contingency for uncontrolled consumption (e.g., teams bypassing central governance)?
