Why We Use 7 AI Providers (Not Just One) — And How We Track Every Cent
Gonzalo Monzón
Founder & Lead Architect
When OpenAI has an outage, your whole product goes down. When Google changes Gemini pricing, your margins evaporate. When Anthropic rate-limits you during peak hours, your users wait. We decided early on: no single point of failure for AI. So we built an AI Gateway that routes every call across 7+ providers — and our total AI bill last month was ~$184 for 11,200+ calls.
The Single-Provider Trap
Most companies pick one AI provider — usually OpenAI — and build everything around it. One SDK, one billing dashboard, one set of docs. Easy at first. But then reality hits:
- Outages: Every major provider has had multi-hour outages in the past year. OpenAI alone had 4 significant incidents in 2025.
- Price changes: GPT-4o pricing has changed 3 times since launch. Gemini went from free to paid to "free with limits" in a single quarter.
- Rate limits: Hit your tier ceiling? Your users get errors. Upgrading to the next tier costs hundreds more per month.
- Quality variance: Different models excel at different tasks. Claude is great at structured output. Gemini Flash is fastest for classification. DeepSeek dominates code generation. No single model wins everywhere.
We run 14+ interconnected products — from travel agencies to radiology systems. A single provider outage could take down patient appointment confirmations, travel booking responses, and content generation simultaneously. That's not acceptable.
Our Solution: The AI Gateway
Every AI call in the Cadences ecosystem — text generation, image creation, voice synthesis, transcription, vision analysis — passes through a centralized AI Gateway running on Cloudflare Workers. The gateway adds less than 5ms of overhead per request and handles everything: validation, routing, cost tracking, error detection, and automatic fallback.
The Pipeline (Every Request)
Here's exactly what happens when any service in our ecosystem makes an AI call:
- Validation: Is the model active and not marked as failed? Does the organization have access? Are parameters valid?
- Rate Limiting: Check against the org's plan tier (FREE: 60 req/min → ENTERPRISE: unlimited). Per-model limits also apply.
- API Key Resolution: Three-level cascade — organization key first, then user key, then system fallback. This lets clients use their own API keys if they want lower costs or higher limits.
- Provider Call: Route to the appropriate provider with a provider-specific adapter, configured timeout, and retry logic.
- Logging: Record everything — input/output tokens, calculated cost, latency, provider, model, status, category (LLM, TTI, I2I, TTS, STT, Vision, Web).
- Error Handling: If the call fails → detect the error pattern (deprecated, rate_limit, timeout, auth_error, content_filter) → if the pattern recurs, mark the model as failed → trigger automatic fallback.
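The six stages above can be sketched in a few dozen lines. This is a minimal, self-contained sketch with illustrative names (`handleAICall`, `callProvider`, the `deps` shape are assumptions, not our production API):

```javascript
// Minimal sketch of the gateway pipeline. Each stage can short-circuit
// with a structured error instead of throwing to the caller.
async function handleAICall(req, deps) {
  const { models, rateLimiter, systemKeys, callProvider, log } = deps;
  const model = models[req.model];

  // 1. Validation: the model must exist, be active, and not be marked as failed.
  if (!model || !model.active || model.failed) {
    return { ok: false, error: "model_unavailable" };
  }
  // 2. Rate limiting against the org's plan tier.
  if (!rateLimiter.allow(req.org.id, req.org.plan)) {
    return { ok: false, error: "rate_limited" };
  }
  // 3. API key resolution: organization key → user key → system fallback.
  const p = model.provider;
  const apiKey = req.org.apiKeys?.[p] ?? req.user.apiKeys?.[p] ?? systemKeys[p];

  // 4.–5. Provider call plus logging of usage, latency, and status.
  const started = Date.now();
  try {
    const result = await callProvider(model, req.params, apiKey);
    log({ model: req.model, status: "ok", latencyMs: Date.now() - started, usage: result.usage });
    return { ok: true, result };
  } catch (err) {
    // 6. Error handling: the caller classifies the error and triggers fallback.
    log({ model: req.model, status: "error", latencyMs: Date.now() - started, error: err.message });
    return { ok: false, error: "provider_error" };
  }
}
```

The key property is that every exit path, success or failure, passes through the same `log` call — that's what makes the cost views later in this article possible.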
7+ Providers, 30+ Models, 7 Modalities
This isn't just about text. We integrate AI across 7 distinct modalities:
LLM (Text Generation)
Google Gemini (2.5 Flash, 2.5 Pro, 2.0 Flash, 1.5 Pro) — Our workhorse. ~5,200 calls last month. Fast, cheap, excellent for most tasks.
Anthropic Claude (Sonnet 4, Haiku) — Best for structured output, long documents, and nuanced analysis. ~820 calls, $52.
OpenAI (GPT-4o, GPT-4o-mini) — Strong at creative content and vision. ~580 calls, $38.
DeepSeek (Chat, Reasoner) — Excellent for code generation. Very fast, very cheap.
Groq (Llama 4 Scout, Llama 4 Maverick, Compound Beta) — Blazing fast inference. We use Compound Beta for agentic web search.
Cloudflare Workers AI (Llama 3.x, Mistral, Phi-3) — Free. Always available. ~1,100 calls as fallback.
xAI Grok — For specific use cases requiring real-time information.
Image Generation (TTI + I2I)
FLUX Schnell on Cloudflare — Free, fast, our default. ~1,800 images last month at $0 cost. We reach for DALL-E 3, Imagen 3, and Recraft V3 when a job needs premium quality.
Voice (TTS + STT)
ElevenLabs for production TTS (Multilingual v2, Turbo v2.5) and STT with speaker diarization (Scribe). OpenAI Whisper and Groq Whisper for fast transcription. Edge TTS for free local TTS. Twilio for actual phone calls.
Real Numbers: February 2026
This isn't a projection or an estimate. These are actual numbers from our production dashboard:
📊 Total calls: 11,200+
💰 Total cost: ~$184
⚡ Average latency: 1.2s (gateway overhead: <5ms)
❌ Error rate: 2.4%
🔄 Fallback activations: ~4.5% of requests
Cost Breakdown by Provider
🟢 Gemini 2.5 Flash: ~5,200 calls · $28 · p95 latency: 0.8s
🟣 Claude Sonnet 4: ~820 calls · $52 · p95: 2.1s
🔵 GPT-4o: ~580 calls · $38 · p95: 1.5s
🟡 FLUX Schnell: ~1,800 calls · $0.00 · p95: 3.2s
⚪ Workers AI Llama: ~1,100 calls · $0.00 · p95: 0.6s
🔴 Other (DeepSeek, Groq, ElevenLabs, etc.): ~1,700 calls · $66
The insight: Gemini handles 46% of our calls for just 15% of the cost. Claude handles 7% of calls but accounts for 28% of spend — we route only complex analytical tasks there. Workers AI and FLUX handle 26% of calls for free. Intelligent routing saves us ~35% compared to sending everything to GPT-4o.
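For anyone who wants to check the arithmetic, the shares fall straight out of the raw numbers above:

```javascript
// Recomputing the routing shares from the February dashboard numbers.
const totalCalls = 11200;
const totalCost = 184; // USD

const geminiCallShare = 5200 / totalCalls;        // ≈ 0.46
const geminiCostShare = 28 / totalCost;           // ≈ 0.15
const claudeCallShare = 820 / totalCalls;         // ≈ 0.07
const claudeCostShare = 52 / totalCost;           // ≈ 0.28
const freeCallShare = (1800 + 1100) / totalCalls; // FLUX + Workers AI ≈ 0.26
```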
The Fallback Chain: Never Lose a Call
When a provider fails, the Gateway doesn't just return an error. It cascades:
- Primary provider (e.g., Gemini 2.5 Flash) → fails
- Secondary provider (e.g., Groq Llama 4) → if that also fails...
- Cloudflare Workers AI (Llama 3.x) → always available, zero cost, on our own edge infrastructure
The user always gets a response. The system marks whether fallback was used for transparency and quality tracking.
If a model fails 3+ times consecutively, it's automatically marked as "failed" and removed from rotation until an admin manually re-tests it. No silent degradation.
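The cascade plus the consecutive-failure counter fits in a few lines. The chain order and the 3-failure threshold come from this article; the identifiers are illustrative, not our production code:

```javascript
// Fallback cascade with a consecutive-failure counter. Once a model fails
// FAIL_THRESHOLD times in a row, it is removed from rotation until re-tested.
const FAIL_THRESHOLD = 3;

function makeFallbackRunner(chain, callModel, registry) {
  return async function run(params) {
    for (const model of chain) {
      const entry = registry[model] ?? (registry[model] = { fails: 0, failed: false });
      if (entry.failed) continue; // skip models marked as failed
      try {
        const result = await callModel(model, params);
        entry.fails = 0; // any success resets the counter
        return { model, result, usedFallback: model !== chain[0] };
      } catch {
        if (++entry.fails >= FAIL_THRESHOLD) entry.failed = true;
      }
    }
    throw new Error("all providers failed");
  };
}
```

Note the `usedFallback` flag in the return value — that's what lets the system report fallback usage transparently instead of silently degrading.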
Smart Error Detection
The Gateway doesn't just log errors — it detects patterns:
- deprecated: Model being phased out → alert + migration suggestion
- rate_limit: Hitting provider limits → automatic throttle + request queue
- timeout: Slow responses → increase timeout or trigger fallback
- auth_error: API key issues → verify key, notify admin immediately
- content_filter: Provider rejected the prompt → log + retry with modified prompt
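Pattern detection boils down to matching provider error strings against a rule table. A rough sketch — the regexes here are illustrative; real provider error messages vary, so production rules need per-provider tuning:

```javascript
// Classify a provider error message into one of the gateway's error patterns.
// First matching rule wins; unmatched errors fall through to "unknown".
const ERROR_PATTERNS = [
  { pattern: /deprecat/i,                        type: "deprecated" },
  { pattern: /rate.?limit|429|quota/i,           type: "rate_limit" },
  { pattern: /timed?.?out|deadline/i,            type: "timeout" },
  { pattern: /api.?key|unauthorized|401|403/i,   type: "auth_error" },
  { pattern: /content.?(filter|policy)|safety/i, type: "content_filter" },
];

function classifyError(message) {
  for (const { pattern, type } of ERROR_PATTERNS) {
    if (pattern.test(message)) return type;
  }
  return "unknown";
}
```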
We have 6 pre-built SQL views in D1 that give us instant answers: cost by model, by provider, by category, by day, errors by model, and performance percentiles (p50/p95/p99) by model. When something changes, we see it in real-time.
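The percentile views reduce to a nearest-rank calculation over sorted latency samples. D1 does this in SQL server-side; the same math in plain JS, with made-up sample latencies:

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of the
// samples at or below it. This is what a p50/p95/p99 view computes per model.
function percentile(samples, p) {
  if (samples.length === 0) return null;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Illustrative latency samples in milliseconds (not real dashboard data).
const latencies = [120, 200, 250, 300, 340, 400, 800, 950, 1200, 4000];
const p50 = percentile(latencies, 50); // 340
const p95 = percentile(latencies, 95); // 4000
```

The gap between p50 and p95 here illustrates why we track percentiles rather than averages: a single slow provider call dominates the tail without moving the median.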
The Cost Dashboard
Every cent is tracked. Our dashboard shows:
- Overview: Total monthly cost, top models, trend line
- By Provider: Breakdown with actual spend per provider
- By Category: Distribution across LLM vs Image vs Voice vs Vision
- By Model: Detailed table — calls, tokens, cost, error rate, latency percentiles
- Timeline: Daily cost trend with usage spikes visualized
- Errors: Error patterns, failed models, automatic recommendations
This data directly informs our billing tiers. We know exactly how much a FREE user costs us (mostly Workers AI = $0) vs a PROFESSIONAL user ($0.15-0.40/day average).
The API Key Cascade
One design decision that proved invaluable: the three-level API key resolution.
- Organization key: If the client has their own OpenAI/Gemini key, we use it. They pay provider costs directly, we charge only for the platform.
- User key: Individual users can add their own keys for specific providers.
- System key: Our default keys, included in the subscription cost.
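The cascade itself is trivial; the valuable part is attaching billing attribution to whichever level supplied the key. A sketch with illustrative names (`billedTo` and the function shape are assumptions):

```javascript
// Three-level API key cascade: organization → user → system.
// The key's owner is who pays the provider for the call.
function resolveApiKey(org, user, systemKeys, provider) {
  if (org.apiKeys?.[provider]) {
    return { key: org.apiKeys[provider], billedTo: "organization" };
  }
  if (user.apiKeys?.[provider]) {
    return { key: user.apiKeys[provider], billedTo: "user" };
  }
  return { key: systemKeys[provider], billedTo: "platform" };
}
```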
This flexibility lets us serve everything from free-tier hobbyists (Workers AI only) to enterprise clients who want to use their own negotiated API contracts.
Rate Limiting by Plan
🆓 FREE: 60 req/min · Workers AI only · 10 images/day
🟢 STARTER: 120 req/min · Groq + DeepSeek · 50 images/day
🔵 PROFESSIONAL: 300 req/min · All providers · 200 images/day
🟣 ENTERPRISE: Unlimited · All providers + priority routing · Unlimited images
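The per-minute limits in the table map to a simple counter per org. The article doesn't specify the windowing algorithm, so this sketch assumes a fixed 60-second window (a sliding window or token bucket would also work):

```javascript
// Plan limits from the tier table above.
const PLAN_LIMITS = {
  FREE:         { reqPerMin: 60,       imagesPerDay: 10 },
  STARTER:      { reqPerMin: 120,      imagesPerDay: 50 },
  PROFESSIONAL: { reqPerMin: 300,      imagesPerDay: 200 },
  ENTERPRISE:   { reqPerMin: Infinity, imagesPerDay: Infinity },
};

// Fixed-window rate limiter; `now` is injectable for testing.
function makeRateLimiter(now = Date.now) {
  const windows = new Map(); // orgId → { windowStart, count }
  return function allow(orgId, plan) {
    const limit = PLAN_LIMITS[plan].reqPerMin;
    const t = now();
    let w = windows.get(orgId);
    if (!w || t - w.windowStart >= 60_000) {
      w = { windowStart: t, count: 0 };
      windows.set(orgId, w);
    }
    return ++w.count <= limit;
  };
}
```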
Key Takeaways
After running this system in production across 14+ products:
- Abstract early. Don't call OpenAI directly from components. Build a service layer from day one.
- Log everything. Tokens, cost, latency, provider, status. You'll discover that 30% of your AI spend goes to tasks that a cheaper model handles equally well.
- Free fallback is essential. Cloudflare Workers AI is free and runs on your own edge. It's lower quality than GPT-4o, but it's infinitely better than an error.
- Let clients bring their own keys. Enterprise clients with negotiated API contracts save money. You reduce your infrastructure costs. Everyone wins.
- Track errors as patterns, not incidents. A single timeout is noise. Five timeouts from the same model in an hour is a signal. Automate the response.
The total investment: one Cloudflare Worker (~200 lines of routing logic), 4 D1 tables, and 6 SQL views. Gateway overhead: <5ms per request. Monthly cost to track everything: included in the Cloudflare Workers plan ($5/month). The savings from intelligent routing pay for it 10x over.
About the Author
Gonzalo Monzón
Founder & Lead Architect
Gonzalo Monzón is a Senior Solutions Architect & AI Engineer with over 26 years building mission-critical systems in Healthcare, Industrial Automation, and enterprise AI. Founder of Cadences Lab, he specializes in bridging legacy infrastructure with cutting-edge technology.
Related Articles
Edge Computing: Why We Bet Everything on Cloudflare (And What $65/Month Gets You)
No servers, no containers, no Kubernetes. We run 14+ interconnected products on 9 Cloudflare products — Workers, D1, R2, Durable Objects, Pages, KV, Vectorize, Workers AI and WAF. $65/month for what would cost $400-600 on AWS. Here's the full architecture.
Synapse Studio: A 2D Virtual Office Where AI Agents Do the Real Work
We built a SimTower-style animated office where AI agents with multimodal capabilities — vision, image generation, web search, iterative image evolution — collaborate on real tasks. Zero dependencies, pure Vanilla JS, running on Cloudflare.
Perspectiva Studio: 19,000 Lines of Vanilla JS That Create Audiobooks, Blogs, and AI Coach Sessions
We built a full content creation engine — audiobooks with 15+ ElevenLabs voices, blog articles with AI-generated images from 5 providers, PDF documents, and real-time AI Coach sessions — all in zero-dependency Vanilla JS running on Cloudflare.