AI & LLMs · February 12, 2026 · 9 min read

Why We Use 7 AI Providers (Not Just One) — And How We Track Every Cent

Gonzalo Monzón

Founder & Lead Architect

When OpenAI has an outage, your whole product goes down. When Google changes Gemini pricing, your margins evaporate. When Anthropic rate-limits you during peak hours, your users wait. We decided early on: no single point of failure for AI. So we built an AI Gateway that routes every call across 7+ providers — and our total AI bill last month was ~$184 for 11,200+ calls.

The Single-Provider Trap

Most companies pick one AI provider — usually OpenAI — and build everything around it. One SDK, one billing dashboard, one set of docs. Easy at first. But then reality hits:

  • Outages: Every major provider has had multi-hour outages in the past year. OpenAI alone had 4 significant incidents in 2025.
  • Price changes: GPT-4o pricing has changed 3 times since launch. Gemini went from free to paid to "free with limits" in a single quarter.
  • Rate limits: Hit your tier ceiling and your users get errors. Upgrading to the next tier costs hundreds of dollars more per month.
  • Quality variance: Different models excel at different tasks. Claude is great at structured output. Gemini Flash is fastest for classification. DeepSeek dominates code generation. No single model wins everywhere.

We run 14+ interconnected products — from travel agencies to radiology systems. A single provider outage could take down patient appointment confirmations, travel booking responses, and content generation simultaneously. That's not acceptable.

Our Solution: The AI Gateway

Every AI call in the Cadences ecosystem — text generation, image creation, voice synthesis, transcription, vision analysis — passes through a centralized AI Gateway running on Cloudflare Workers. The gateway adds less than 5ms of overhead per request and handles everything: validation, routing, cost tracking, error detection, and automatic fallback.

The Pipeline (Every Request)

Here's exactly what happens when any service in our ecosystem makes an AI call:

  1. Validation: Is the model active and not marked as failed? Does the organization have access? Are parameters valid?
  2. Rate Limiting: Check against the org's plan tier (FREE: 60 req/min → ENTERPRISE: unlimited). Per-model limits also apply.
  3. API Key Resolution: Three-level cascade — organization key first, then user key, then system fallback. This lets clients use their own API keys if they want lower costs or higher limits.
  4. Provider Call: Route to the appropriate provider with a provider-specific adapter, configured timeout, and retry logic.
  5. Logging: Record everything — input/output tokens, calculated cost, latency, provider, model, status, category (LLM, TTI, I2I, TTS, STT, Vision, Web).
  6. Error Handling: If it fails → detect error pattern (deprecated, rate_limit, timeout, auth_error, content_filter) → if pattern is recurrent, mark model as failed → automatic fallback.
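The six steps above can be sketched as a single handler. This is a minimal illustration with hypothetical names (`handleAICall`, `workers-ai-llama` as the fallback model id) and stubbed-out steps, not the production Worker:

```typescript
// A minimal sketch of the six-step pipeline. All names, types, and the
// provider callback are hypothetical stand-ins, not the production code.
type GatewayRequest = { org: string; model: string; prompt: string };
type GatewayResult = { output: string; provider: string; fallbackUsed: boolean };
type ProviderCall = (model: string, prompt: string, apiKey: string) => Promise<string>;

function validate(req: GatewayRequest): void {
  // 1. Validation: model present and active, org has access (simplified)
  if (!req.model || !req.org) throw new Error("invalid request");
}

async function enforceRateLimit(_org: string): Promise<void> {
  // 2. Rate limiting against the org's plan tier (always allows in this sketch)
}

function resolveApiKey(_org: string): string {
  // 3. Three-level cascade: organization key -> user key -> system key
  return "SYSTEM_KEY";
}

async function handleAICall(req: GatewayRequest, callProvider: ProviderCall): Promise<GatewayResult> {
  validate(req);
  await enforceRateLimit(req.org);
  const apiKey = resolveApiKey(req.org);
  try {
    // 4. Provider call (timeout and retry logic omitted in this sketch)
    const output = await callProvider(req.model, req.prompt, apiKey);
    // 5. Logging of tokens, cost, latency, provider, status would happen here
    return { output, provider: req.model, fallbackUsed: false };
  } catch {
    // 6. Error detection + automatic fallback to the always-available edge model
    const output = await callProvider("workers-ai-llama", req.prompt, apiKey);
    return { output, provider: "workers-ai-llama", fallbackUsed: true };
  }
}
```

Injecting the provider call as a callback keeps the pipeline itself provider-agnostic, which is what makes the adapter-per-provider design workable.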

7+ Providers, 30+ Models, 7 Modalities

This isn't just about text. We integrate AI across 7 distinct modalities:

LLM (Text Generation)

Google Gemini (2.5 Flash, 2.5 Pro, 2.0 Flash, 1.5 Pro) — Our workhorse. ~5,200 calls last month, $28. Fast, cheap, excellent for most tasks.

Anthropic Claude (Sonnet 4, Haiku) — Best for structured output, long documents, and nuanced analysis. ~820 calls, $52.

OpenAI GPT-4o (GPT-4o, GPT-4o-mini) — Strong at creative content and vision. ~580 calls, $38.

DeepSeek (Chat, Reasoner) — Excellent for code generation. Very fast, very cheap.

Groq (Llama 4 Scout, Llama 4 Maverick, Compound Beta) — Blazing fast inference. We use Compound Beta for agentic web search.

Cloudflare Workers AI (Llama 3.x, Mistral, Phi-3) — Free. Always available. ~1,100 calls as fallback.

xAI Grok — For specific use cases requiring real-time information.

Image Generation (TTI + I2I)

FLUX Schnell on Cloudflare — Free, fast, our default. ~1,800 images last month at $0 cost. DALL-E 3, Imagen 3, and Recraft V3 for premium quality when needed.

Voice (TTS + STT)

ElevenLabs for production TTS (Multilingual v2, Turbo v2.5) and STT with speaker diarization (Scribe). OpenAI Whisper and Groq Whisper for fast transcription. Edge TTS for free local TTS. Twilio for actual phone calls.

Real Numbers: February 2026

This isn't a projection or an estimate. These are actual numbers from our production dashboard:

  • 📊 Total calls: 11,200+
  • 💰 Total cost: ~$184
  • Average latency: 1.2s (gateway overhead: <5ms)
  • Error rate: 2.4%
  • 🔄 Fallback activations: ~4.5% of requests

Cost Breakdown by Provider

  • 🟢 Gemini 2.5 Flash: ~5,200 calls · $28 · p95 latency: 0.8s
  • 🟣 Claude Sonnet 4: ~820 calls · $52 · p95: 2.1s
  • 🔵 GPT-4o: ~580 calls · $38 · p95: 1.5s
  • 🟡 FLUX Schnell: ~1,800 calls · $0.00 · p95: 3.2s
  • Workers AI Llama: ~1,100 calls · $0.00 · p95: 0.6s
  • 🔴 Other (DeepSeek, Groq, ElevenLabs, etc.): ~1,700 calls · $66

The insight: Gemini handles 46% of our calls at just 15% of the cost. Claude handles 7% of calls but accounts for 28% of the cost; we route only complex analytical tasks there. Workers AI and FLUX handle 26% of calls for free. Intelligent routing saves us ~35% compared to sending everything to GPT-4o.

The Fallback Chain: Never Lose a Call

When a provider fails, the Gateway doesn't just return an error. It cascades:

  1. Primary provider (e.g., Gemini 2.5 Flash) → fails
  2. Secondary provider (e.g., Groq Llama 4) → if also fails...
  3. Cloudflare Workers AI (Llama 3.x) → always available, zero cost, on our own edge infrastructure

The user always gets a response. The system marks whether fallback was used for transparency and quality tracking.

If a model fails 3+ times consecutively, it's automatically marked as "failed" and removed from rotation until an admin manually re-tests it. No silent degradation.
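The consecutive-failure rule can be sketched as a small tracker. The threshold of 3 comes from the text above; the in-memory map and set are illustrative stand-ins for state the gateway would actually persist:

```typescript
// Sketch of the consecutive-failure tracker. The in-memory stores are
// hypothetical; the production gateway persists this state.
const FAILURE_THRESHOLD = 3;
const consecutiveFailures = new Map<string, number>();
const failedModels = new Set<string>();

function recordResult(model: string, ok: boolean): void {
  if (ok) {
    consecutiveFailures.set(model, 0); // any success resets the streak
    return;
  }
  const n = (consecutiveFailures.get(model) ?? 0) + 1;
  consecutiveFailures.set(model, n);
  // 3+ consecutive failures: remove from rotation until manually re-tested
  if (n >= FAILURE_THRESHOLD) failedModels.add(model);
}

function pickModel(chain: string[]): string {
  // Walk the cascade; the last entry (Workers AI) is treated as always available.
  for (const model of chain) {
    if (!failedModels.has(model)) return model;
  }
  return chain[chain.length - 1];
}
```

Counting *consecutive* failures rather than totals is what keeps a single transient timeout from benching an otherwise healthy model.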

Smart Error Detection

The Gateway doesn't just log errors — it detects patterns:

  • deprecated: Model being phased out → alert + migration suggestion
  • rate_limit: Hitting provider limits → automatic throttle + request queue
  • timeout: Slow responses → increase timeout or trigger fallback
  • auth_error: API key issues → verify key, notify admin immediately
  • content_filter: Provider rejected the prompt → log + retry with modified prompt
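A sketch of how such pattern detection might look. The five categories are taken from the list above; the specific string and status-code matching rules are assumptions for illustration:

```typescript
// Illustrative error classifier. Categories mirror the gateway's five
// patterns; the matching heuristics here are assumed, not the real rules.
type ErrorPattern = "deprecated" | "rate_limit" | "timeout" | "auth_error" | "content_filter" | "unknown";

function classifyError(message: string, status?: number): ErrorPattern {
  const m = message.toLowerCase();
  if (status === 429 || m.includes("rate limit") || m.includes("quota")) return "rate_limit";
  if (status === 401 || status === 403 || m.includes("api key")) return "auth_error";
  if (m.includes("deprecat")) return "deprecated";
  if (m.includes("timeout") || m.includes("timed out")) return "timeout";
  if (m.includes("content") && (m.includes("filter") || m.includes("policy"))) return "content_filter";
  return "unknown"; // unrecognized errors still get logged, just untagged
}
```

Mapping raw provider errors onto a small closed set of patterns is what makes the automated responses (throttle, fallback, admin alert) possible.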

We have 6 pre-built SQL views in D1 that give us instant answers: cost by model, by provider, by category, by day, errors by model, and performance percentiles (p50/p95/p99) by model. When something changes, we see it in real time.
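As an illustration of what the performance view computes, here is a nearest-rank percentile calculation over logged latencies in plain TypeScript. The function names are hypothetical, and the production version lives in SQL over the D1 log table rather than application code:

```typescript
// Illustrative per-model latency percentiles (p50/p95/p99), mirroring what
// one of the D1 views reports. Nearest-rank method over sorted samples.
function percentile(sortedMs: number[], p: number): number {
  if (sortedMs.length === 0) throw new Error("no samples");
  const rank = Math.ceil((p / 100) * sortedMs.length); // nearest-rank: 1-based
  return sortedMs[Math.max(0, rank - 1)];
}

function latencyPercentiles(latenciesMs: number[]): { p50: number; p95: number; p99: number } {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  return {
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  };
}
```

Percentiles rather than averages matter here: a 1.2s mean can hide a 10s p99, and it is the tail that users actually feel.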

The Cost Dashboard

Every cent is tracked. Our dashboard shows:

  • Overview: Total monthly cost, top models, trend line
  • By Provider: Breakdown with actual spend per provider
  • By Category: Distribution across LLM vs Image vs Voice vs Vision
  • By Model: Detailed table — calls, tokens, cost, error rate, latency percentiles
  • Timeline: Daily cost trend with usage spikes visualized
  • Errors: Error patterns, failed models, automatic recommendations

This data directly informs our billing tiers. We know exactly how much a FREE user costs us (mostly Workers AI = $0) vs a PROFESSIONAL user ($0.15-0.40/day average).

The API Key Cascade

One design decision that proved invaluable: the three-level API key resolution.

  1. Organization key: If the client has their own OpenAI/Gemini key, we use it. They pay provider costs directly, we charge only for the platform.
  2. User key: Individual users can add their own keys for specific providers.
  3. System key: Our default keys, included in the subscription cost.
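The cascade boils down to a simple precedence lookup. The record-based key stores below are hypothetical; the point is the order — organization, then user, then system:

```typescript
// Sketch of the three-level API key cascade. Key stores are illustrative;
// the precedence order (organization -> user -> system) is the design.
type KeySource = "organization" | "user" | "system";

function resolveApiKey(
  provider: string,
  orgKeys: Record<string, string>,
  userKeys: Record<string, string>,
  systemKeys: Record<string, string>,
): { key: string; source: KeySource } {
  if (orgKeys[provider]) return { key: orgKeys[provider], source: "organization" };
  if (userKeys[provider]) return { key: userKeys[provider], source: "user" };
  if (systemKeys[provider]) return { key: systemKeys[provider], source: "system" };
  throw new Error(`no API key available for provider: ${provider}`);
}
```

Returning the `source` alongside the key is useful downstream: calls made on a client's own key should be excluded from our cost accounting, since the client pays the provider directly.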

This flexibility lets us serve everything from free-tier hobbyists (Workers AI only) to enterprise clients who want to use their own negotiated API contracts.

Rate Limiting by Plan

  • 🆓 FREE: 60 req/min · Workers AI only · 10 images/day
  • 🟢 STARTER: 120 req/min · Groq + DeepSeek · 50 images/day
  • 🔵 PROFESSIONAL: 300 req/min · All providers · 200 images/day
  • 🟣 ENTERPRISE: Unlimited · All providers + priority routing · Unlimited images
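The tiers above translate directly into a config table. This sketch uses the numbers from the plan list; enforcement details (sliding windows, per-model limits, provider allow-lists) are omitted:

```typescript
// Plan-tier limits as data, taken from the tiers above. The gating
// function is a simplified sketch; real enforcement uses sliding windows.
type Plan = "FREE" | "STARTER" | "PROFESSIONAL" | "ENTERPRISE";

const PLAN_RULES: Record<Plan, { reqPerMin: number; imagesPerDay: number }> = {
  FREE: { reqPerMin: 60, imagesPerDay: 10 },
  STARTER: { reqPerMin: 120, imagesPerDay: 50 },
  PROFESSIONAL: { reqPerMin: 300, imagesPerDay: 200 },
  ENTERPRISE: { reqPerMin: Infinity, imagesPerDay: Infinity }, // unlimited
};

function allowRequest(plan: Plan, usedThisMinute: number): boolean {
  return usedThisMinute < PLAN_RULES[plan].reqPerMin;
}
```

Keeping limits as data rather than code means a pricing change is a one-line edit, not a redeploy of the gating logic.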

Key Takeaways

After running this system in production across 14+ products:

  1. Abstract early. Don't call OpenAI directly from components. Build a service layer from day one.
  2. Log everything. Tokens, cost, latency, provider, status. You'll discover that 30% of your AI spend goes to tasks that a cheaper model handles equally well.
  3. Free fallback is essential. Cloudflare Workers AI is free and runs on your own edge. It's lower quality than GPT-4o, but it's infinitely better than an error.
  4. Let clients bring their own keys. Enterprise clients with negotiated API contracts save money. You reduce your infrastructure costs. Everyone wins.
  5. Track errors as patterns, not incidents. A single timeout is noise. Five timeouts from the same model in an hour is a signal. Automate the response.

The total investment: one Cloudflare Worker (~200 lines of routing logic), 4 D1 tables, and 6 SQL views. Gateway overhead: <5ms per request. Monthly cost to track everything: included in the Cloudflare Workers plan ($5/month). The savings from intelligent routing pay for it 10x over.

Tags

AI Gateway · Multi-Provider · Cost Optimization · Cloudflare Workers · LLMs · Fallback · Edge Computing

About the Author

Gonzalo Monzón

Founder & Lead Architect

Gonzalo Monzón is a Senior Solutions Architect & AI Engineer with over 26 years building mission-critical systems in Healthcare, Industrial Automation, and enterprise AI. Founder of Cadences Lab, he specializes in bridging legacy infrastructure with cutting-edge technology.
