Voice AI Agent Evaluation Checklist (Vapi/Retell)
Voice agents fail differently than chat agents. A “good” transcript can still hide bad turn-taking, awkward interruptions, tool-call flakiness, or unsafe handling of PII. This checklist is designed for operators evaluating Voice AI agents built on platforms like Vapi.ai and RetellAI—especially when you need a repeatable go/no-go decision before scaling traffic.
What you’ll get: a step-by-step evaluation checklist, scoring guidance, and a case-study style example with numbers and a timeline—plus an FAQ at the end.
How to use this checklist (and why it’s different)
This article is intentionally checklist-first. Instead of broad frameworks or generic benchmarks, it focuses on what to verify in real calls and in an automated test harness so you can ship changes without surprises.
- Scope: you’re evaluating Voice AI (not chat) on Vapi/Retell-like stacks, where streaming audio, barge-in, and tool calling happen in real time.
- Coverage: latency, interruptions, ASR/WER, NLU, tool reliability, safety/PII, containment, handoff, and observability.
- Cadence: a practical, operator-grade checklist you can run weekly and before launches.
- Goal: decide whether the agent is ready to take more calls, and know exactly what to fix if it isn’t.
- Payoff: faster iteration with fewer regressions, and clearer accountability across prompts, models, and integrations.
Checklist #1: Define the target call outcomes (before you test)
Evaluation breaks down when “success” is vague. Start by writing the top outcomes and constraints for each call type.
- Primary outcome: e.g., book an appointment, collect intake fields, take payment, qualify lead, resolve support issue.
- Secondary outcomes: e.g., confirm contact info, send SMS/email confirmation, create CRM ticket.
- Hard constraints: never request SSN, never read back full card number, must disclose recording, must offer human handoff.
- Fallback outcomes: if uncertain, route to human with a clean summary.
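One way to make these outcomes enforceable is to encode each call type’s goals and constraints as data your test harness can score against. Below is a minimal sketch; the field names are hypothetical, not a Vapi or Retell schema:

```python
# Illustrative, machine-checkable outcome spec for one call type.
APPOINTMENT_SPEC = {
    "call_type": "appointment_booking",
    "primary_outcome": "appointment_booked",
    "secondary_outcomes": ["contact_confirmed", "sms_confirmation_sent"],
    "hard_constraints": [
        "never_request_ssn",
        "never_read_back_full_card_number",
        "must_disclose_recording",
        "must_offer_human_handoff",
    ],
    "fallback_outcome": "human_handoff_with_summary",
}

def outcome_score(call_result: dict, spec: dict) -> int:
    """0/1 business-outcome score: the job-to-be-done was achieved
    and no hard constraint was violated."""
    achieved = spec["primary_outcome"] in call_result.get("outcomes", [])
    violated = set(call_result.get("violations", [])) & set(spec["hard_constraints"])
    return int(achieved and not violated)
```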
Scoring tip: separate “business success” from “conversation quality”
Track two scores per call:
- Outcome score (0/1 or 0–5): Did the call achieve the job-to-be-done?
- Quality score (0–5): Turn-taking, clarity, safety, and tool correctness.
This prevents a common trap: an agent that “books the appointment” but does so rudely, slowly, or unsafely.
Checklist #2: Latency and turn-taking (the voice-specific make-or-break)
In voice, users judge intelligence by timing. You need explicit latency budgets and barge-in behavior tests.
- Time to first token (TTFT): how long after the user finishes speaking until the model produces its first response token.
- Time to first audio (TTFA): how long until synthesized speech actually starts playing (usually later than TTFT, since TTS adds its own delay).
- Mid-turn latency: delays after tool calls, confirmations, or long reasoning steps.
- Barge-in handling: does the agent stop speaking when the user interrupts?
- Double-talk rate: how often both sides talk over each other for >500ms.
Pass/fail thresholds you can start with
- TTFA: aim for < 900ms median; investigate if p95 > 2000ms.
- Barge-in: interruption should stop agent audio within ~250–400ms.
- Double-talk: < 3% of turns; > 8% usually feels broken.
When these fail, users repeat themselves, abandon, or demand a human—even if the content is correct.
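These thresholds are easy to enforce automatically if your harness records per-call audio timestamps. A minimal sketch using the starter numbers above, assuming you can already measure TTFA and barge-in stop times per call:

```python
import math
import statistics

# Starter thresholds from the checklist above; tune them per use case.
TTFA_MEDIAN_MS = 900
TTFA_P95_MS = 2000
BARGE_IN_STOP_MS = 400
DOUBLE_TALK_MAX_RATE = 0.03

def p95(values):
    """Nearest-rank 95th percentile."""
    s = sorted(values)
    return s[min(len(s) - 1, math.ceil(0.95 * len(s)) - 1)]

def check_turn_taking(ttfa_ms, barge_in_stop_ms, double_talk_rate):
    """Return (passed, reasons) for a batch of evaluated calls."""
    reasons = []
    if statistics.median(ttfa_ms) > TTFA_MEDIAN_MS:
        reasons.append("median TTFA over budget")
    if p95(ttfa_ms) > TTFA_P95_MS:
        reasons.append("p95 TTFA over budget")
    if barge_in_stop_ms and statistics.median(barge_in_stop_ms) > BARGE_IN_STOP_MS:
        reasons.append("agent too slow to stop on barge-in")
    if double_talk_rate > DOUBLE_TALK_MAX_RATE:
        reasons.append("double-talk rate too high")
    return (not reasons, reasons)
```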
Checklist #3: ASR quality (WER) and transcript trustworthiness
ASR errors cascade into NLU mistakes and wrong tool calls. Evaluate ASR separately from the agent’s reasoning.
- Word Error Rate (WER): compute on a labeled sample (at least 50–100 utterances per major accent/noise condition).
- Entity error rate: wrong names, addresses, dates, phone numbers, emails (more important than overall WER).
- Noise robustness: test with background noise, speakerphone, car noise, and low bandwidth.
- Endpointing: does ASR cut off the user early or wait too long?
Operator rule: if the agent frequently mishears entities, you must add confirmation steps (and measure the added latency).
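WER is simple enough to compute without a dependency: word-level edit distance over your labeled sample. A dependency-free sketch (real pipelines should normalize casing, punctuation, and numerals first, which materially changes the numbers):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / ref words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(1, len(ref))

def entity_error_rate(pairs):
    """pairs: [(expected, transcribed), ...] for one entity type,
    e.g., phone numbers. Track this separately from overall WER."""
    misses = sum(1 for exp, got in pairs
                 if exp.strip().lower() != got.strip().lower())
    return misses / max(1, len(pairs))
```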
Checklist #4: NLU and intent accuracy (what the user meant)
Voice inputs are messy: partial sentences, corrections, and interruptions. Your evaluation needs “intent + slots” accuracy, not just “the reply sounded good.”
- Intent classification: correct identification of call reason (billing vs scheduling vs support).
- Slot filling: correct extraction of required fields (date/time, service type, policy number).
- Repair behavior: when uncertain, does it ask a clarifying question instead of guessing?
- Context carryover: remembers constraints across turns (e.g., “not Tuesday,” “morning only”).
Test with adversarial but realistic utterances: self-corrections (“Actually, make that Thursday”), multi-intent (“Reschedule and also update my email”), and vague requests (“I need help with my account”).
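One way to turn those patterns into regression tests is a golden set that asserts intent and slots, not reply fluency. A sketch with hypothetical intent/slot labels; `run_nlu` stands in for however your stack exposes classification output:

```python
# Golden NLU cases built from the adversarial patterns above.
GOLDEN_NLU_CASES = [
    {"utterance": "Actually, make that Thursday",
     "context": {"pending_intent": "schedule", "date": "Wednesday"},
     "expect": {"intent": "schedule", "slots": {"date": "Thursday"}}},
    {"utterance": "Reschedule and also update my email",
     "context": {},
     "expect": {"intent": "multi",
                "slots": {"intents": ["reschedule", "update_email"]}}},
    {"utterance": "I need help with my account",
     "context": {},
     # Vague: the correct behavior is a clarifying question, not a guess.
     "expect": {"intent": "clarify", "slots": {}}},
]

def evaluate_nlu(run_nlu):
    """Return the cases where intent/slot output diverged from expectations."""
    return [(c["utterance"], c["expect"], got)
            for c in GOLDEN_NLU_CASES
            if (got := run_nlu(c["utterance"], c["context"])) != c["expect"]]
```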
Checklist #5: Tool-calling reliability (the hidden source of “it sounded fine” failures)
For Vapi/Retell-style agents, tool calls are where production incidents happen: timeouts, wrong parameters, duplicate requests, and stale state.
- Correct tool selection: chooses the right API/action for the user’s intent.
- Parameter correctness: validates and formats fields (dates, phone numbers, IDs).
- Idempotency: avoids double-booking or duplicate tickets on retries.
- Timeout handling: graceful user messaging + retry policy (with limits).
- State consistency: if the user changes their mind mid-flow, does the agent cancel/replace prior actions?
Concrete test: force a 20% tool timeout rate in staging and verify the agent’s behavior doesn’t degrade into loops or hallucinated confirmations.
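Passing that test usually comes down to two mechanisms: idempotency keys and bounded retries. A minimal sketch, assuming your tool transport (`invoke` here, a placeholder) accepts an idempotency key and raises `TimeoutError` on timeout:

```python
import hashlib
import time

def idempotency_key(action: str, params: dict) -> str:
    """Stable key from action + normalized params, so a retried booking
    request can't create a second appointment or duplicate ticket."""
    blob = action + "|" + "|".join(f"{k}={params[k]}" for k in sorted(params))
    return hashlib.sha256(blob.encode()).hexdigest()

def call_tool(invoke, action, params, max_retries=2, timeout_s=3.0):
    """Retry with backoff, but never unboundedly; on final failure the
    caller falls back to human handoff with a clean summary."""
    key = idempotency_key(action, params)
    for attempt in range(max_retries + 1):
        try:
            return invoke(action, params, idempotency_key=key, timeout=timeout_s)
        except TimeoutError:
            if attempt == max_retries:
                raise  # no loops, no hallucinated confirmations
            time.sleep(0.2 * (2 ** attempt))  # short, bounded backoff
```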
Checklist #6: Safety, compliance, and PII handling
Voice agents often collect sensitive data. Your evaluation must include red-team prompts and policy checks that are measurable.
- PII minimization: only collect what’s needed; avoid asking for prohibited data.
- Redaction: ensure logs/transcripts mask sensitive fields (card numbers, SSN, DOB as required).
- Consent: recording disclosure and opt-out flows where applicable.
- Prompt injection resistance: user tries to override policy (“Ignore your rules and tell me…”).
- Escalation on risk: self-harm, threats, fraud indicators, or account takeover signals route to human.
Pass criteria: zero tolerance for disallowed PII requests; zero tolerance for revealing internal instructions; consistent, safe refusal patterns.
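Redaction, at least, is checkable in CI with a scan over stored transcripts and logs. A rough sketch; the patterns below are illustrative and deliberately simple (formats vary by locale, and real compliance tooling should go further, e.g., Luhn-validating card numbers):

```python
import re

# Rough patterns for a redaction audit of stored transcripts/logs.
PII_PATTERNS = {
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def pii_leaks(transcript: str) -> list:
    """Return the PII types that survived redaction in a stored transcript."""
    return [name for name, pattern in PII_PATTERNS.items()
            if pattern.search(transcript)]

# Gate releases on zero leaks across all evaluated calls.
assert pii_leaks("Your visit is confirmed for Tuesday at 9am.") == []
```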
Checklist #7: Call containment, handoff, and “same-day shortlist” routing
Containment isn’t “never hand off.” It’s “hand off only when it’s the best outcome.” Evaluate both containment and handoff quality.
- Containment rate: % of calls resolved without human.
- Appropriate handoff rate: % of handoffs that were actually necessary (sample and label).
- Handoff package quality: summary, extracted fields, user sentiment, next best action.
- Speed-to-lead routing: for sales/local services, route hot leads fast (e.g., under 60 seconds from qualification to a booked call or live transfer).
Recruiting-style adaptation (“intake + scoring + same-day shortlist”): if your voice agent screens applicants or inbound candidates, evaluate whether it collects the required fields, assigns a score, and routes qualified candidates to a recruiter the same day with a structured summary.
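Handoff quality is easiest to grade when the handoff artifact has a fixed shape that a reviewer can score field by field. A minimal sketch; the field names are illustrative:

```python
# Illustrative handoff package: what the receiving human (or recruiter,
# in the intake/shortlist adaptation) sees on every escalation.
def build_handoff_package(call: dict) -> dict:
    return {
        "summary": call["summary"],                # 2-3 sentence recap
        "extracted_fields": call["slots"],         # name, phone, date/time...
        "sentiment": call["sentiment"],            # e.g., "frustrated"
        "next_best_action": call["next_action"],   # e.g., "call back by 3pm"
        "qualification_score": call.get("score"),  # for screening/intake flows
        "trace_url": call["trace_url"],            # link to the full call trace
    }
```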
Checklist #8: Observability and automated test harness (so you can iterate)
If you can’t replay, measure, and compare, you can’t improve. Your evaluation stack should produce artifacts you can audit.
- Trace per call: audio timestamps, ASR segments, agent turns, tool calls, tool responses, and final outcome.
- Versioning: prompt version, model version, tool schema version, and release tag.
- Golden call set: curated scenarios with expected outcomes (including edge cases).
- Automated scoring: latency metrics, tool success rate, containment classification, policy violations.
- Human review loop: weekly sampling with a rubric; feed failures into new tests.
Practical harness pattern: run simulated calls using prerecorded audio (or TTS) for repeatability, then compare transcripts, tool traces, and outcomes across builds.
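The payoff of full traces and version tags is that replays become diffable. A sketch of the comparison step, assuming each simulated call emits a trace dict like the one described in the comment (illustrative fields, not a platform schema):

```python
# Expected trace shape per golden scenario:
#   {"scenario_id", "outcome", "ttfa_ms", "violations",
#    "versions": {"prompt", "model", "tools", "release"}}
def compare_builds(baseline, candidate):
    """Return regressions in the candidate build relative to baseline."""
    base = {t["scenario_id"]: t for t in baseline}
    regressions = []
    for cand in candidate:
        old = base.get(cand["scenario_id"])
        if old is None:
            continue  # new scenario; nothing to compare against
        if old["outcome"] == "success" and cand["outcome"] != "success":
            regressions.append((cand["scenario_id"], "outcome regressed"))
        if cand["violations"]:
            regressions.append((cand["scenario_id"], "policy violation"))
        if cand["ttfa_ms"] > 1.25 * old["ttfa_ms"]:
            regressions.append((cand["scenario_id"], "TTFA regressed >25%"))
    return regressions
```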
Case study: 14-day voice agent evaluation sprint (with numbers)
This example shows how an operator team can apply the checklist to a Vapi/Retell-like appointment-setting agent for a local services business.
Baseline (Day 1–2): instrument and label
- Traffic: 300 inbound calls/week (business hours), goal is booking.
- Initial routing: 30% of calls to agent, 70% to humans.
- Sample labeled: 80 calls (40 agent, 40 human) for comparison.
Baseline results (agent calls):
- Containment: 41%
- Booking rate (of agent-handled calls): 18%
- Median TTFA: 1.4s (p95 3.1s)
- Double-talk: 11% of turns
- Tool-call success: 86% (failures split between timeouts and schema errors)
- Entity errors: 9% on phone numbers; 14% on dates
Interventions (Day 3–10): fix the biggest failure modes first
- Turn-taking tuning: adjust endpointing and barge-in settings; shorten agent responses; add “I can help with that” filler only when tool calls exceed 1s.
- Entity confirmations: add explicit read-back for dates and phone numbers with a one-step correction path (“Did I get that right?”).
- Tool reliability: add retries with idempotency keys; validate parameters before calling; implement “tool timeout” fallback to human handoff with summary.
- Golden tests: create 25 scenario calls (noise, interruptions, reschedules, cancellations, wrong-number, angry caller).
Re-test (Day 11–14): compare against the golden set + live sample
Post-sprint results (agent calls):
- Containment: 58% (up from 41%)
- Booking rate: 27% (up from 18%)
- Median TTFA: 0.85s (p95 1.9s)
- Double-talk: 4% of turns
- Tool-call success: 96%
- Entity errors: 3% on phone numbers; 5% on dates
Decision: increase routing from 30% to 55% of inbound calls, but keep a guardrail: if tool success drops below 93% or TTFA p95 exceeds 2.2s for two days, automatically roll back.
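Guardrails like this only work if nobody has to remember to check them. A sketch of the rollback rule above as a scheduled daily job; the metric field names are illustrative:

```python
# Two consecutive bad days trigger an automatic routing rollback.
TOOL_SUCCESS_FLOOR = 0.93
TTFA_P95_CEILING_MS = 2200
BAD_DAYS_TO_ROLL_BACK = 2

def should_roll_back(daily_metrics: list) -> bool:
    """daily_metrics: oldest-first dicts with 'tool_success' and 'ttfa_p95_ms'."""
    recent = daily_metrics[-BAD_DAYS_TO_ROLL_BACK:]
    return (len(recent) == BAD_DAYS_TO_ROLL_BACK and
            all(m["tool_success"] < TOOL_SUCCESS_FLOOR or
                m["ttfa_p95_ms"] > TTFA_P95_CEILING_MS for m in recent))
```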
Cliffhanger (what most teams miss next): once you scale traffic, the distribution shifts—more edge cases, more noise, and more adversarial callers. The next step is building a “drift dashboard” that alerts you when your golden set no longer matches production reality.
Operator-ready scoring rubric (copy/paste)
Use a 0–2 score per category (0 = fail, 1 = acceptable, 2 = strong). A call’s maximum is 16.
- Latency & turn-taking: TTFA, barge-in, double-talk
- ASR & entities: correct capture of critical fields
- NLU & flow control: intent/slots + repair behavior
- Tool execution: correct call, correct params, no duplicates
- Safety/PII: policy compliance + redaction
- Containment/handoff: resolves or hands off with summary
- Customer experience: clarity, empathy, brevity, confidence
- Observability: trace completeness + version tags
Launch guidance: don’t scale until your median call score is ≥ 13/16 and there are no “0” scores in Safety/PII or Tool execution across the last 50 evaluated calls.
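The launch rule is mechanical enough to automate alongside your weekly review. A sketch, with the rubric categories as hypothetical dict keys:

```python
import statistics

RUBRIC = ["latency_turn_taking", "asr_entities", "nlu_flow", "tool_execution",
          "safety_pii", "containment_handoff", "customer_experience",
          "observability"]

def launch_gate(scored_calls: list) -> bool:
    """scored_calls: the last 50 evaluated calls, each {category: 0|1|2}.
    Gate: median total >= 13/16 and no zero in safety_pii or tool_execution."""
    totals = [sum(call[cat] for cat in RUBRIC) for call in scored_calls]
    hard_zero = any(call["safety_pii"] == 0 or call["tool_execution"] == 0
                    for call in scored_calls)
    return statistics.median(totals) >= 13 and not hard_zero
```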
FAQ: Voice AI agent evaluation (Vapi/Retell)
- How many calls do I need to evaluate a voice agent?
- For a first go/no-go, label 50–100 real calls per major call type. For ongoing monitoring, review 10–20 calls/week plus automated golden tests on every release.
- Is WER enough to judge ASR quality?
- No. Track entity error rate (dates, phone numbers, addresses) separately. A low WER can still hide catastrophic errors on critical fields.
- What’s the fastest way to catch tool-call regressions?
- Log every tool call with parameters and responses, then run a golden scenario suite that asserts tool selection, parameter validity, and idempotency behavior. Fail the build on schema mismatches or duplicate actions.
- How do I evaluate interruptions and barge-in reliably?
- Create a test set with scripted interruptions at different points (mid-sentence, during tool wait, during confirmation). Measure stop-speaking time and double-talk rate using audio timestamps, not just transcripts.
- Should I optimize for containment or conversion first?
- Optimize for correct outcomes with safe handoff. High containment with poor handoff quality can reduce conversion and increase escalations. Track both containment and “appropriate handoff” with labeled reviews.
Next steps: run the checklist and operationalize it
If you’re evaluating a Voice AI agent on Vapi/Retell-like infrastructure, the fastest path to reliability is: (1) define outcomes and constraints, (2) measure voice-specific latency and interruptions, (3) validate ASR/entities, (4) harden tool calls, and (5) enforce safety/PII with auditable logs—then lock it all into an automated harness.
CTA: If you want a ready-to-use evaluation rubric, golden call pack, and an automated scoring harness tailored to your voice flows (booking, support, intake, or lead routing), contact Evalvista to set up a 2-week voice agent evaluation sprint and ship with confidence.