From 4-Hour Response Time to Instant: How Our AI Voice Agents Make Real Phone Calls
Gonzalo Monzón
Founder & Lead Architect
The first time we let an AI call a real person, we held our breath. The agent introduced itself, confirmed the appointment details, asked if the time still worked, and smoothly handled a reschedule — all in 47 seconds. Our team had been spending 15 minutes per call doing the exact same thing. Six months and 1,200+ calls/month later, most people on the other end don't realize they're talking to AI.
Why Voice Agents, Not Just Chatbots
In Spain and Latin America — our primary markets — phone calls still dominate business communication, especially in healthcare, travel, and real estate:
- Medical clinics: Patients prefer calling to confirm or reschedule appointments. They don't check email.
- Travel agencies: Leads expect a call within hours of an inquiry. A 24h delay means they've booked elsewhere.
- Real estate: Viewing schedules, document reminders, and follow-ups happen over the phone.
Chatbots are great for self-service. But when the human expects to talk, you need something that talks back — naturally, in their language, at conversation speed.
The Technical Stack: Three Technologies, One Conversation
Our voice agent system layers three core technologies on top of Cloudflare Workers:
1. Twilio — The Phone Infrastructure
Twilio handles the telephony layer: making outbound calls, receiving inbound calls, managing phone numbers across countries. We use Twilio's Media Streams API to get real-time audio in/out of calls as WebSocket streams — this is what makes real-time AI conversation possible.
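To make that concrete, here's a minimal sketch of what consuming a Media Streams WebSocket looks like. The frame shapes ("connected", "start", "media", "stop") are Twilio's documented events; the handler name and the `onAudio` callback are our own illustration:

```typescript
// Sketch: demultiplexing Twilio Media Streams frames arriving over a WebSocket.
// Twilio sends JSON frames; "media" frames carry base64-encoded 8 kHz mu-law audio.
type MediaStreamFrame =
  | { event: "connected" }
  | { event: "start"; start: { streamSid: string; callSid: string } }
  | { event: "media"; media: { payload: string } }
  | { event: "stop" };

// Parse one raw WebSocket message and forward audio chunks to the STT pipeline.
// Returns a label describing what was handled (useful for logging).
function handleFrame(raw: string, onAudio: (base64Chunk: string) => void): string {
  const frame = JSON.parse(raw) as MediaStreamFrame;
  switch (frame.event) {
    case "start":
      return `started:${frame.start.streamSid}`;
    case "media":
      onAudio(frame.media.payload); // hand the chunk to speech-to-text
      return "media";
    case "stop":
      return "stopped";
    default:
      return "ignored";
  }
}
```

In production this runs inside a Worker's WebSocket handler, with `onAudio` feeding the transcription pipeline.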
2. Gemini 2.5 Flash — The Brain
The conversation logic runs on Gemini 2.5 Flash specifically. Why not GPT-4o or Claude? Latency. In voice conversations, anything over 500ms feels like the other person is "not listening." Gemini Flash consistently delivers under 200ms for responses, which creates the illusion of natural conversation flow.
Each call gets a system prompt tailored to the business context: clinic name, appointment details, patient history, available times for rescheduling, escalation rules, and the voice agent's "personality" (professional but warm, never pushy).
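A simplified sketch of how that per-call prompt is assembled from the context fetched from D1. The field names here are illustrative, not our actual schema:

```typescript
// Sketch: assembling a per-call system prompt from business context.
// Field names are illustrative placeholders, not the real D1 schema.
interface CallContext {
  clinicName: string;
  patientName: string;
  appointmentTime: string;
  doctor: string;
  availableSlots: string[];
}

function buildSystemPrompt(ctx: CallContext): string {
  return [
    `You are a phone assistant for ${ctx.clinicName}.`,
    `You are calling ${ctx.patientName} to confirm the appointment`,
    `on ${ctx.appointmentTime} with ${ctx.doctor}.`,
    `If they ask to reschedule, offer only: ${ctx.availableSlots.join(", ")}.`,
    `Be professional but warm, never pushy.`,
    `If the caller raises a complaint or health concern, escalate to a human.`,
  ].join("\n");
}
```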
3. ElevenLabs — The Voice
ElevenLabs provides 15+ realistic voices in Spanish (Castilian and Latin American variants), English, Catalan, and French. We use the Turbo v2.5 model for real-time TTS with streaming — the voice starts speaking before the full sentence is generated, reducing perceived latency to near-zero.
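For reference, a request against ElevenLabs' streaming endpoint looks roughly like this. The endpoint path and `xi-api-key` header follow their public API; treat the exact body fields as an assumption to verify against the current API reference:

```typescript
// Sketch: building a streaming TTS request for ElevenLabs'
// /v1/text-to-speech/{voice_id}/stream endpoint. Body fields are our
// best understanding of the API, not a verified contract.
function buildTtsRequest(voiceId: string, text: string, apiKey: string) {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream`,
    init: {
      method: "POST",
      headers: { "xi-api-key": apiKey, "Content-Type": "application/json" },
      body: JSON.stringify({ text, model_id: "eleven_turbo_v2_5" }),
    },
  };
}
```

The response body is a stream: the first audio chunk gets piped back into the call while the rest is still generating, which is what kills the perceived latency.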
Voice selection matters more than you'd think. We tested extensively: a female voice with mid-range pitch and moderate pace scored 23% higher in satisfaction surveys than a "standard" male voice. Each client can choose and customize their agent's voice.
Anatomy of an AI Call: Step by Step
Here's what happens when the system makes an outbound appointment confirmation call:
Workflow trigger: "Appointment in 48h"
│
├── Worker fetches context from D1:
│ Patient name, appointment time, doctor, clinic address
│
├── Twilio places outbound call
│ └── Twilio Media Streams → WebSocket audio stream
│
├── Patient picks up (or voicemail detected)
│ ├── Voicemail → Leave pre-recorded message → End
│ └── Human detected → Start conversation
│
├── AI agent speaks (in Catalan here):
│ "Hola, María: truco del Centre Mèdic per confirmar
│ la seva cita de dimarts a les 10h amb el Dr. Martí.
│ Li va bé l'horari?"
│ ("Hello, María, I'm calling from the Centre Mèdic to confirm
│ your Tuesday 10 AM appointment with Dr. Martí.
│ Does the time still work for you?")
│
├── Patient responds → Groq Whisper STT → Text
│ └── Gemini Flash processes → Generates response
│ └── ElevenLabs TTS → Audio back to call
│
├── Conversation continues (avg 35 seconds)
│ ├── Confirmed → Log in D1 + send WhatsApp confirmation
│ ├── Reschedule → Show available slots, book new time
│ └── Complex request → Transfer to human
│
└── Call ends → Full transcript saved to D1
→ Workflow continues with result
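The STT → LLM → TTS loop in the middle of that diagram can be sketched as one function, with the three services passed in as dependencies so the pipeline stays testable. All names here are illustrative:

```typescript
// Sketch of one conversation turn. STT, LLM, and TTS are injected as
// functions; in production they are Groq Whisper, Gemini 2.5 Flash,
// and ElevenLabs streaming TTS respectively.
type Turn = { transcript: string; reply: string };

async function runTurn(
  audioChunk: string,
  stt: (audio: string) => Promise<string>,
  llm: (text: string) => Promise<string>,
  tts: (text: string) => Promise<void>, // streams audio back into the call
): Promise<Turn> {
  const transcript = await stt(audioChunk); // caller's words as text
  const reply = await llm(transcript);      // agent's next line
  await tts(reply);                         // speak it into the call
  return { transcript, reply };             // both sides get logged to D1
}
```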
What Actually Works in Production
After 6+ months of real deployment across medical clinics, travel agencies, and a real estate group, here's what AI voice agents handle well:
- ✅ Appointment confirmations: 89% resolution rate, average call 35 seconds. The agent confirms time, doctor, location, and handles simple rescheduling.
- ✅ Lead qualification: Asks 3-4 screening questions (budget, timeline, preferences), scores the lead 0-100 with AI, routes to the right human agent — or schedules a detailed call.
- ✅ After-hours support: 24/7 availability for basic queries: clinic hours, directions, available services, pricing information.
- ✅ Multilingual support: Seamlessly switches between Spanish, English, and Catalan mid-conversation. The agent detects the language from the first response and adapts.
- ✅ Follow-up calls: "You didn't confirm your appointment — would you like to reschedule?" These recovery calls save ~15% of appointments that would otherwise be no-shows.
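On lead qualification: in production the 0-100 score comes from the model, but a deterministic rubric of the same shape makes the idea concrete. This is an illustrative sketch, not our actual scoring logic:

```typescript
// Sketch: an illustrative 0-100 lead-scoring rubric. The real score is
// produced by the LLM from the screening answers; this shows the shape.
interface LeadAnswers {
  hasBudget: boolean;
  timelineDays: number; // how soon they intend to book or buy
  answeredAllQuestions: boolean;
}

function scoreLead(a: LeadAnswers): number {
  let score = 0;
  if (a.hasBudget) score += 40;
  if (a.timelineDays <= 30) score += 40;       // hot: wants it this month
  else if (a.timelineDays <= 90) score += 20;  // warm: this quarter
  if (a.answeredAllQuestions) score += 20;     // engaged caller
  return Math.min(score, 100);
}
```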
What Doesn't Work (Yet) — Honesty Time
Not everything is sunshine. Here's where AI voice agents still struggle:
- ❌ Complex negotiations: Anything involving back-and-forth on pricing, custom packages, or exceptions. The AI can present options but can't negotiate creatively. Always transfers to human.
- ❌ Emotional situations: Complaints, health concerns, frustrated clients. The AI detects negative sentiment and escalates immediately — trying to "solve" emotional situations with a robot voice makes everything worse.
- ❌ Long conversations: Beyond 3 minutes, conversation quality degrades. The context window fills up, responses get less contextual, and the uncanny valley effect kicks in. Our hard limit is 4 minutes — after that, transfer to human.
- ❌ Background noise: Construction sites, cars, crowds. Whisper STT accuracy drops significantly. We detect low-confidence transcriptions and ask the caller to repeat or offer to call back later.
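The low-confidence detection in that last point looks roughly like this. Whisper-style output exposes per-segment `avg_logprob` and `no_speech_prob`; the thresholds below are illustrative, not our tuned values:

```typescript
// Sketch: gating a turn on transcription confidence. Whisper-style STT
// reports avg_logprob and no_speech_prob per segment; thresholds here
// are illustrative examples.
interface SttSegment {
  text: string;
  avgLogprob: number;   // closer to 0 = more confident
  noSpeechProb: number; // probability the segment is noise, not speech
}

function shouldAskToRepeat(segments: SttSegment[]): boolean {
  if (segments.length === 0) return true; // heard nothing usable
  // Flag the turn when any segment looks like noise or a low-confidence guess.
  return segments.some((s) => s.avgLogprob < -1.0 || s.noSpeechProb > 0.5);
}
```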
God Mode: Human Supervision in Real Time
This is our secret weapon, and it's what makes deployment viable for clients who are nervous about "letting AI talk to customers." God Mode is a real-time supervision dashboard in Cadences Nexus:
Live Monitoring
A supervisor sees all active AI calls in a dashboard. Each call shows: live transcription of both parties, the AI's "thinking" (what it's about to say), sentiment indicators, and a confidence score.
Whisper Mode
The supervisor can "whisper" instructions to the AI agent that the caller cannot hear. Example: the AI is about to say "we don't have availability this week" — the supervisor whispers "offer Thursday 4pm, Dr. García just had a cancellation." The AI integrates this instruction naturally into its next response. The caller has no idea a human intervened.
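Mechanically, a whisper can be modeled as a system message appended to the conversation context before the next turn, so the model treats it as guidance rather than something the caller said. A minimal sketch, with illustrative names:

```typescript
// Sketch: injecting a supervisor "whisper" into the model context.
// Roles follow the usual chat-API convention (system/user/assistant).
type Msg = { role: "system" | "user" | "assistant"; content: string };

function injectWhisper(history: Msg[], instruction: string): Msg[] {
  // Returned as a new array so the original call history stays untouched.
  return [...history, { role: "system", content: `[Supervisor]: ${instruction}` }];
}
```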
Takeover Mode
If things go sideways, the supervisor clicks "Take Over" and their voice replaces the AI's. The transition is seamless — the caller hears a slightly different voice but the conversation continues without interruption. The AI drops to transcription-only mode, logging the rest of the call.
Call History & Analytics
Every call is recorded, transcribed, and scored. Supervisors can review calls later, flag issues, and the AI learns from corrections. Performance metrics include: resolution rate, average duration, sentiment score, and transfer rate (lower is better).
The Latency Challenge: Why 200ms Changes Everything
Natural conversation has a rhythm. Humans tolerate ~500ms of silence between dialogue turns. Go beyond that, and the caller starts saying "hello? are you there?" — the universal sign that the conversation feels robotic.
Our total latency budget per turn:
- Groq Whisper STT: ~80ms for speech-to-text (Groq's hardware-accelerated Whisper is the fastest option)
- Gemini Flash response: ~120-180ms for generating the reply text
- ElevenLabs TTS: Streaming mode — first audio chunk available in ~90ms, plays while the rest generates
- Total perceived latency: ~200ms from end of human speech to start of AI speech (less than the stage times summed, because the stages stream and overlap rather than run strictly in sequence)
We tried GPT-4o (350-500ms response time) and Claude Haiku (280-400ms). Both are too slow for natural-feeling conversation. Gemini Flash at 120-180ms is the sweet spot — fast enough to feel natural, smart enough to handle complex dialogue.
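A sketch of the kind of per-turn latency check we run against that ~500ms conversational budget. Because the stages overlap via streaming, the naive sum is a conservative upper bound; stage timings would come from real instrumentation:

```typescript
// Sketch: checking per-turn latency against the conversational budget.
// Summing the stages is a conservative upper bound, since streaming
// lets TTS begin before the LLM has finished the full sentence.
interface TurnTiming {
  sttMs: number;          // Groq Whisper transcription time
  llmMs: number;          // Gemini Flash response time
  ttsFirstChunkMs: number; // time to first ElevenLabs audio chunk
}

function upperBoundLatencyMs(t: TurnTiming): number {
  return t.sttMs + t.llmMs + t.ttsFirstChunkMs;
}

function withinBudget(t: TurnTiming, budgetMs = 500): boolean {
  return upperBoundLatencyMs(t) <= budgetMs;
}
```

Turns that blow the budget get flagged so we can spot which stage regressed.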
Voice Selection: More Important Than You Think
We offer 15+ voices via ElevenLabs, and we learned that voice selection dramatically impacts call outcomes:
- Healthcare: Female voice, calm pace, mid-pitch — scored highest for trust and compliance
- Sales/Travel: Male voice, slightly upbeat, natural enthusiasm — best for lead qualification
- Reminders/Follow-ups: Gender-neutral, professional, brief — people don't want a chatty reminder call
Each client configures their preferred voice, greeting style, and communication pace. The voice becomes part of their brand identity — some clients' customers refer to the AI by the name we gave it ("Ana from the clinic called me to confirm").
Integration with Cadences Workflows
Voice calls aren't standalone — they're nodes in the workflow engine. This enables powerful automations:
- AI Voice Call node: Place outbound call with dynamic script, receive structured result (confirmed/rescheduled/transferred/no-answer)
- Inbound Call Trigger: When a client calls a Twilio number, the AI answers and routes based on intent
- Post-call actions: Based on call result → update CRM, send WhatsApp confirmation, schedule follow-up, notify sales team
- Escalation workflows: If AI transfers to human and human isn't available → voicemail → email → SMS cascade
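The post-call routing above boils down to mapping the structured call result to a list of follow-up actions. A sketch with illustrative action names (the real ones live as nodes in the workflow engine):

```typescript
// Sketch: routing post-call workflow actions from the structured result
// returned by the AI Voice Call node. Action names are illustrative.
type CallResult = "confirmed" | "rescheduled" | "transferred" | "no-answer";

function postCallActions(result: CallResult): string[] {
  switch (result) {
    case "confirmed":
      return ["log_to_d1", "send_whatsapp_confirmation"];
    case "rescheduled":
      return ["log_to_d1", "update_calendar", "send_whatsapp_confirmation"];
    case "transferred":
      return ["log_to_d1", "notify_sales_team"];
    case "no-answer":
      return ["log_to_d1", "schedule_retry_call"];
  }
}
```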
Cost Breakdown: €0.08 vs €2.50 Per Call
📞 Twilio cost: ~€0.02/min for outbound calls in Spain
🗣️ ElevenLabs TTS: ~€0.03 per call (avg 35 sec of generated speech)
🧠 Gemini Flash: ~€0.01 per call (conversation processing)
🎤 Groq Whisper STT: ~€0.02 per call (transcription)
💰 Total per call: ~€0.08
👤 Human equivalent: ~€2.50 (15 min at €10/hr including overhead)
At 1,200+ calls/month, that's ~€96/month for AI vs ~€3,000/month for a dedicated phone operator. The AI handles routine calls 24/7 — the human team focuses on complex cases and sales.
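The arithmetic above, kept in euro cents to avoid floating-point drift (the per-component figures are the article's own estimates):

```typescript
// Sketch: the per-call cost arithmetic from the breakdown above,
// in euro cents. Figures are the estimates quoted in this article.
const COST_CENTS = { twilio: 2, elevenlabsTts: 3, geminiFlash: 1, groqStt: 2 };

function perCallCents(): number {
  return Object.values(COST_CENTS).reduce((sum, c) => sum + c, 0);
}

function monthlyCostEuros(callsPerMonth: number): number {
  return (perCallCents() * callsPerMonth) / 100;
}
```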
Production Numbers After 6 Months
📞 Calls handled: 1,200+ per month across all clients
⚡ Average response latency: ~200ms end-to-end
✅ Resolution rate: 89% without human intervention
📅 Appointment confirmation: 92% success rate
⏱️ Average call duration: 35 seconds (confirmations), 2.5 min (qualification)
😊 Satisfaction score: 4.2/5 (post-call survey — most didn't realize it was AI)
🔄 Transfer to human: 11% of calls
🗓️ No-show reduction: 15% fewer missed appointments
What We've Learned
- Speed beats perfection: A 200ms "good enough" response feels more natural than a 2-second "perfect" one. Optimize for latency first, quality second.
- Know when to shut up: The AI agents that perform best are the ones with strict boundaries. Don't try to handle everything — detect complexity and transfer fast.
- Voice is brand: Clients get attached to their AI's voice. Changing it mid-deployment caused complaints. Treat voice selection like a branding decision.
- God Mode is the closer: Without live supervision capability, no client would have deployed. The safety net of "I can always take over" is what gets the initial yes.
- After-hours calls are pure gold: The AI handles calls at 11 PM on a Sunday. That appointment confirmation that would have waited until Monday morning? Confirmed instantly. Clients love this more than anything.
About the Author
Gonzalo Monzón
Founder & Lead Architect
Gonzalo Monzón is a Senior Solutions Architect & AI Engineer with over 26 years building mission-critical systems in Healthcare, Industrial Automation, and enterprise AI. Founder of Cadences Lab, he specializes in bridging legacy infrastructure with cutting-edge technology.
Related Articles
No-Code Workflows That Actually Work in Production — 7,000 Lines of Execution Engine
Most "no-code" tools break at the first real-world edge case. We built a visual workflow engine with 20+ node types, Canvas API at 60fps, Durable Objects for long-running execution, and step-by-step debugging. Here's how 7,073 lines of engine code make drag-and-drop actually production-grade.
Perspectiva Studio: 19,000 Lines of Vanilla JS That Create Audiobooks, Blogs, and AI Coach Sessions
We built a full content creation engine — audiobooks with 15+ ElevenLabs voices, blog articles with AI-generated images from 5 providers, PDF documents, and real-time AI Coach sessions — all in zero-dependency Vanilla JS running on Cloudflare.
NutriNen Baby: A Gamified Baby Nutrition App with a 33-Tool AI Chatbot and a Music Box
We built a mobile-first baby nutrition app with meal tracking, a virtual fridge (6 categories), an AI chatbot with 33 function-calling tools, a music box with 22 melodies generated via Web Audio API, WHO growth charts, and full gamification — all in 19,600 lines of zero-dependency Vanilla JS running as a native Android app via Capacitor.