AI & Automation
2023 — Present

AI Gateway

Centralized Multi-Provider AI Infrastructure

~34K lines of AI-specific code managing all AI calls across the ecosystem. Smart routing with 4 strategies, 6-level API key resolution, cost tracking with 7 SQL views, alerting system, neuron budget management, and automatic fallback to free Workers AI.

Year

2023 — Present

Role

Architect & Developer

Tech Stack

11 technologies

The Challenge

An ecosystem with 14+ products calling multiple AI providers needs centralized management — not scattered API calls.

  • Cost control across 9+ providers with different pricing models, token counting, and billing units
  • Intelligent routing — choosing the cheapest, fastest, or highest-quality model depending on the request context
  • API key resolution across organizations, users, and system tiers — with fallback to free models when all else fails
  • Real-time monitoring: alerting when spend spikes, error pattern detection, and model health tracking

The Approach

A gateway layer that intercepts every AI call in the ecosystem:

  • Smart Router with 4 strategies — cheapest, fastest, quality, or balanced (scoring: quality×2 + speed - log₁₀(cost)×2). Each request routes to the optimal model based on the chosen strategy
  • 6-level API key resolution — user → org → system-tier → system-all → env → free Workers AI. The first valid key wins; if all fail, the call falls back to Cloudflare Workers AI at zero cost
  • Provider-specific adapters — each provider has its own adapter normalizing requests/responses to a common format. Custom providers (LMStudio, Ollama, vLLM) use an openai_compatible adapter
  • Neuron budget management — Workers AI's free tier gives 10K neurons/day. The gateway tracks consumption and selects models that fit the remaining budget
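The strategy scoring and budget-aware selection above can be sketched as follows. This is an illustrative sketch, not the gateway's actual code: the `ModelInfo` fields, rating scales, and function names are assumptions; only the balanced-strategy formula (quality×2 + speed − log₁₀(cost)×2) comes from the description above.

```typescript
// Sketch of the 4-strategy Smart Router described above.
// ModelInfo fields and value ranges are illustrative assumptions.
interface ModelInfo {
  id: string;
  quality: number;         // e.g. 0–10 benchmark-derived quality rating
  speed: number;           // e.g. 0–10, higher = faster
  costPerMTok: number;     // USD per million tokens
  neuronsPerCall?: number; // set only for Workers AI models
}

type Strategy = "cheapest" | "fastest" | "quality" | "balanced";

function score(m: ModelInfo, strategy: Strategy): number {
  switch (strategy) {
    case "cheapest": return -m.costPerMTok;
    case "fastest":  return m.speed;
    case "quality":  return m.quality;
    case "balanced":
      // quality×2 + speed − log₁₀(cost)×2, per the formula above
      return m.quality * 2 + m.speed - Math.log10(m.costPerMTok) * 2;
  }
}

function pickModel(
  candidates: ModelInfo[],
  strategy: Strategy,
  remainingNeurons: number
): ModelInfo | undefined {
  return candidates
    // budget-aware: skip Workers AI models that would exceed today's neurons
    .filter(m => (m.neuronsPerCall ?? 0) <= remainingNeurons)
    .sort((a, b) => score(b, strategy) - score(a, strategy))[0];
}
```

Note how the log scale keeps a 100× cost difference from completely drowning out quality and speed in the balanced score: it contributes only ±4 points rather than a raw dollar delta.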

The Solution

Full request lifecycle managed end-to-end:

  • Pipeline — Request → Tier Verification (DB) → Rate Limiting (tier-based, fail-open) → Smart Routing → API Key Resolution (6 levels) → Provider Adapter → callAI() wrapper → Response Normalization (all providers → OpenAI format)
  • 9 modalities — LLM chat, Vision/OCR, Image Generation (TTI), TTS, STT, Video Generation, Embeddings, Gemini Live (bidirectional voice), ElevenLabs Conversational AI
  • Error handling — 9 error types detected (rate_limit, auth, timeout, model_not_found, content_filter, quota_exceeded...), model health auto-updated, category-specific fallback chains
  • Alerting — $1/hr warning, $5/hr critical thresholds. 7 SQL views for cost analytics, per-model usage, and error patterns
  • Rate limiting tiers — FREE: 10/day, PERSONAL: 100/day, PROFESSIONAL: 1000/day, BUSINESS: unlimited pay-as-you-go
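The 6-level key resolution step in the pipeline above reduces to a first-hit chain. A minimal sketch, assuming hypothetical lookup callbacks (only the level order and the zero-cost Workers AI fallback come from the description above):

```typescript
// Sketch of the 6-level key resolution chain:
// user → org → system-tier → system-all → env → free Workers AI.
// KeySource and the level names' lookup callbacks are illustrative.
interface KeySource {
  level: string;
  lookup: () => string | undefined; // returns an API key if one exists
}

function resolveApiKey(sources: KeySource[]): { level: string; key?: string } {
  for (const s of sources) {
    const key = s.lookup();
    if (key) return { level: s.level, key }; // first valid key wins
  }
  // All levels failed: fall back to Cloudflare Workers AI at zero cost
  return { level: "workers-ai-free" };
}
```

Walking the chain in a fixed order keeps precedence obvious: a user-supplied key always beats an org key, and the free fallback is unreachable unless every paid option is exhausted.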

Key Results

  • ~34K lines across 66+ files (backend 15.5K, frontend 14K, voice 2.5K)
  • 9 cloud providers + custom (LMStudio, Ollama, vLLM, any OpenAI-compatible)
  • 9 modalities: LLM, Vision/OCR, Image Gen, TTS, STT, Video Gen, Embeddings, Gemini Live, Conversational AI
  • Smart Router: 4 strategies with quality×2 + speed - log₁₀(cost)×2 scoring
  • 6-level API key resolution: user → org → system-tier → system-all → env → Workers AI
  • 12+ D1 tables, 7 SQL views, alerting system ($1/hr warning, $5/hr critical)
  • Workers AI neuron budget tracking (10K/day free, budget-aware model selection)
  • Tier-based rate limiting: FREE 10/day → BUSINESS unlimited pay-as-you-go

Tech Stack

Cloudflare Pages Functions · D1 · Gemini · OpenAI · Anthropic · DeepSeek · Groq · xAI · Perplexity · Together AI · ElevenLabs
$ cat project.json
{
  "name": "AI Gateway",
  "status": "production",
  "stack": [11],
  "results": [8]
}