Building a Voice AI Customer Intake Workflow with the Realtime API
Replace your form-and-follow-up intake cycle with a single structured voice conversation.
Key points
- The Five Components
- Cost at Scale
What You'll Learn
- The five components of a minimum viable voice intake workflow
- The system prompt template that produces consistent, professional intake conversations
- How to scope this as a two-week development sprint
Most inbound customer intake processes have the same problem: they ask customers to fill out a form, wait for a human to review it, and then schedule a follow-up call to clarify what the form should have captured the first time. That two-step process adds days to every new customer relationship and costs real staff time on calls that could be replaced by a structured voice conversation.
The OpenAI Realtime API makes it practical to replace that cycle with a single voice conversation that collects structured data in real time. This Gem shows the workflow structure and the system prompt that drives it.
A proof-of-concept implementation requires one backend developer, a Twilio account, and Realtime API access. Realistic timeline: five working days for a functional POC, two additional weeks to reach production quality.
The Five Components
- Inbound routing. A phone number via Twilio or similar that forwards audio to your WebSocket server running the Realtime API connection. Setup time: two hours for a developer familiar with Twilio.
- System prompt. Instructions that tell the AI what information to collect, in what order, and how to handle edge cases. The prompt is the entire behavior specification. See the template below.
- Structured output handler. A function call definition that the model invokes when all required fields are collected. The function call carries intake data as structured JSON to your CRM or database.
- Escalation trigger. A condition, such as the customer asking to speak to a human or three failed clarification attempts, that routes the call to a live agent and passes along the data collected so far, so the agent has context immediately.
- Confirmation message. The AI reads back the collected information before ending the call. The customer confirms or corrects. This single step eliminates most data quality issues.
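The structured output handler from the list above can be sketched as a tool definition plus a small validator. This is a minimal sketch, not a definitive implementation: the function name `submit_intake`, the field names, and the helper `handle_submit_intake` are hypothetical, and the flat `type`/`name`/`parameters` tool shape is an assumption about how the Realtime API accepts function tools in a session update, so check it against the current API reference.

```python
import json

# Hypothetical field names mirroring the items the intake conversation collects.
REQUIRED_FIELDS = [
    "full_name",
    "email",
    "company",
    "role",
    "issue_summary",
    "preferred_window",
]

# Assumed tool shape for a Realtime API session; verify against current docs.
SUBMIT_INTAKE_TOOL = {
    "type": "function",
    "name": "submit_intake",
    "description": "Call once the caller has confirmed every intake field.",
    "parameters": {
        "type": "object",
        "properties": {name: {"type": "string"} for name in REQUIRED_FIELDS},
        "required": REQUIRED_FIELDS,
    },
}


def handle_submit_intake(arguments_json: str) -> dict:
    """Parse the model's function-call arguments and reject incomplete records."""
    data = json.loads(arguments_json)
    record = {}
    missing = []
    for name in REQUIRED_FIELDS:
        value = data.get(name)
        if isinstance(value, str) and value.strip():
            record[name] = value.strip()
        else:
            missing.append(name)
    if missing:
        # Reporting the gaps lets the server ask the model to collect them again.
        return {"ok": False, "missing": missing}
    return {"ok": True, "record": record}
```

When `ok` is true, `record` is the payload you would forward to your CRM or database; when it is false, the `missing` list can be returned to the model as the function result so the conversation continues instead of ending with partial data.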
The System Prompt Template
You are an intake specialist for [Company]. Your job is to welcome callers and collect the information needed to prepare for their first consultation.
Collect the following in order, one question at a time:
1. Full name
2. Email address (spell it back to confirm)
3. Company name and role
4. Brief description of what they are looking to address (one to two sentences)
5. Preferred appointment time window (morning, afternoon, or specific days)
Rules:
- Be warm and professional. Never robotic.
- Confirm each piece of information before moving to the next.
- If a caller is unclear, ask one clarifying question.
- If the caller asks a question you cannot answer, say: "I will make sure the team is prepared to address that in your consultation."
- Do not make promises about pricing, timelines, or specific outcomes.
- After all fields are collected, summarize and ask: "Does that look right?"
- Once confirmed, say: "Perfect. You will receive a calendar invite within the next two hours."
- If at any point the caller asks to speak with a person, trigger the escalation function immediately.
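The escalation rule that closes the prompt, together with the three-failed-clarifications condition from the component list, can also be checked server-side so a transfer does not depend on the model alone. Here is a minimal sketch, assuming your server sees a text transcript of each caller turn; the class name, phrase list, and threshold constant are illustrative and not part of any API.

```python
from dataclasses import dataclass, field

# Illustrative phrases only; a production list would be tuned on real transcripts.
HUMAN_REQUEST_PHRASES = (
    "speak to a human",
    "speak with a person",
    "talk to a person",
    "real person",
)
MAX_FAILED_CLARIFICATIONS = 3  # threshold named in the escalation component


@dataclass
class EscalationTracker:
    failed_clarifications: int = 0
    collected: dict = field(default_factory=dict)

    def record_failed_clarification(self) -> None:
        """Call when a clarifying question did not produce a usable answer."""
        self.failed_clarifications += 1

    def should_escalate(self, caller_utterance: str) -> bool:
        """True if the caller asked for a person or clarification keeps failing."""
        text = caller_utterance.lower()
        asked_for_human = any(phrase in text for phrase in HUMAN_REQUEST_PHRASES)
        too_many_failures = self.failed_clarifications >= MAX_FAILED_CLARIFICATIONS
        return asked_for_human or too_many_failures

    def handoff_context(self) -> dict:
        """Data passed to the live agent so they start with context."""
        return {
            "collected": dict(self.collected),
            "failed_clarifications": self.failed_clarifications,
        }
```

Keeping the collected fields on the tracker means the handoff payload is ready the moment `should_escalate` returns true, whichever condition fired.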
Cost at Scale
At 100 calls per month averaging five minutes each, API costs run approximately $15-25 depending on audio token rates. A single human intake specialist handling the same volume at 15 minutes per call costs 10-20x more in labor. The ROI case is straightforward at any meaningful call volume.
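The arithmetic behind these figures can be made explicit. The sketch below only rearranges the numbers stated above (100 calls, five and 15 minutes per call, $15-25 in API cost, a 10-20x labor multiple). The per-minute and per-hour rates it prints are implied values derived from those numbers, not quoted prices, so substitute your own rates before using it for budgeting.

```python
CALLS_PER_MONTH = 100
AI_MINUTES_PER_CALL = 5
HUMAN_MINUTES_PER_CALL = 15
API_COST_RANGE = (15.0, 25.0)    # USD per month, as stated in the text
LABOR_MULTIPLE_RANGE = (10, 20)  # the "10-20x" figure from the text

# Total monthly talk time for each approach.
ai_minutes = CALLS_PER_MONTH * AI_MINUTES_PER_CALL            # 500 minutes
human_hours = CALLS_PER_MONTH * HUMAN_MINUTES_PER_CALL / 60   # 25.0 hours

# API cost per minute implied by the stated monthly range.
implied_api_rate = tuple(cost / ai_minutes for cost in API_COST_RANGE)

# Labor cost implied by the multiple: low end pairs the lowest API cost with
# the lowest multiple, high end pairs the highest with the highest.
implied_labor_cost = (
    API_COST_RANGE[0] * LABOR_MULTIPLE_RANGE[0],
    API_COST_RANGE[1] * LABOR_MULTIPLE_RANGE[1],
)
implied_hourly_rate = tuple(cost / human_hours for cost in implied_labor_cost)

print(f"AI talk time: {ai_minutes} min/month")
print(f"Implied API rate: ${implied_api_rate[0]:.2f}-{implied_api_rate[1]:.2f} per minute")
print(f"Human talk time: {human_hours:.0f} h/month")
print(f"Implied labor cost: ${implied_labor_cost[0]:.0f}-{implied_labor_cost[1]:.0f} per month")
print(f"Implied hourly rate: ${implied_hourly_rate[0]:.0f}-{implied_hourly_rate[1]:.0f} per hour")
```

If your loaded hourly cost for intake staff sits above the implied hourly range, the labor multiple in your case is larger than the 10-20x stated here.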