Voice AI Agent Evaluation Checklist (Vapi/Retell)
Voice agents fail differently than chat agents. A “good” transcript can still hide bad turn-taking, awkward interruptions, tool-call flakiness, or unsafe handling of PII. This checklist is designed for operators evaluating Voice AI agents built on platforms like Vapi.ai and RetellAI—especially when you need a repeatable go/no-go decision before scaling traffic.
What you’ll get: a step-by-step evaluation checklist, scoring guidance, and a case-study style example with numbers and a timeline—plus an FAQ at the end.
How to use this checklist (and why it’s different)
This article is intentionally checklist-first. Instead of broad frameworks or generic benchmarks, it focuses on what to verify in real calls and in an automated test harness so you can ship changes without surprises.
- Scope: you’re evaluating Voice AI (not chat) on Vapi/Retell-like stacks, where streaming audio, barge-in, and tool calling happen in real time.
- Coverage: latency, interruptions, ASR/WER, NLU, tool reliability, safety/PII, containment, handoff, and observability.
- Cadence: a practical, operator-grade checklist you can run weekly and before launches.
- Goal: decide whether the agent is ready to take more calls, and know exactly what to fix if it isn’t.
- Payoff: faster iteration with fewer regressions, and clearer accountability across prompts, models, and integrations.
Checklist #1: Define the target call outcomes (before you test)
Evaluation breaks down when “success” is vague. Start by writing the top outcomes and constraints for each call type.
- Primary outcome: e.g., book an appointment, collect intake fields, take payment, qualify lead, resolve support issue.
- Secondary outcomes: e.g., confirm contact info, send SMS/email confirmation, create CRM ticket.
- Hard constraints: never request SSN, never read back full card number, must disclose recording, must offer human handoff.
- Fallback outcomes: if uncertain, route to human with a clean summary.
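One way to make these outcomes enforceable is to encode each call type’s goals and constraints as data your test harness can score against. Below is a minimal sketch; the field names are hypothetical, not a Vapi or Retell schema:

```python
# Illustrative, machine-checkable outcome spec for one call type.
APPOINTMENT_SPEC = {
    "call_type": "appointment_booking",
    "primary_outcome": "appointment_booked",
    "secondary_outcomes": ["contact_confirmed", "sms_confirmation_sent"],
    "hard_constraints": [
        "never_request_ssn",
        "never_read_back_full_card_number",
        "must_disclose_recording",
        "must_offer_human_handoff",
    ],
    "fallback_outcome": "human_handoff_with_summary",
}

def outcome_score(call_result: dict, spec: dict) -> int:
    """0/1 business-outcome score: the job-to-be-done was achieved
    and no hard constraint was violated."""
    achieved = spec["primary_outcome"] in call_result.get("outcomes", [])
    violated = set(call_result.get("violations", [])) & set(spec["hard_constraints"])
    return int(achieved and not violated)
```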
Scoring tip: separate “business success” from “conversation quality”
Track two scores per call:
- Outcome score (0/1 or 0–5): Did the call achieve the job-to-be-done?
- Quality score (0–5): Turn-taking, clarity, safety, and tool correctness.
This prevents a common trap: an agent that “books the appointment” but does so rudely, slowly, or unsafely.
Checklist #2: Latency and turn-taking (the voice-specific make-or-break)
In voice, users judge intelligence by timing. You need explicit latency budgets and barge-in behavior tests.
- Time to first token (TTFT): how long after the user finishes speaking until the model produces its first response token.
- Time to first audio (TTFA): how long until synthesized speech actually starts playing (usually later than TTFT, since TTS adds its own delay).
- Mid-turn latency: delays after tool calls, confirmations, or long reasoning steps.
- Barge-in handling: does the agent stop speaking when the user interrupts?
- Double-talk rate: how often both sides talk over each other for >500ms.
Pass/fail thresholds you can start with
- TTFA: aim for < 900ms median; investigate if p95 > 2000ms.
- Barge-in: interruption should stop agent audio within ~250–400ms.
- Double-talk: < 3% of turns; > 8% usually feels broken.
When these fail, users repeat themselves, abandon, or demand a human—even if the content is correct.
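These thresholds are easy to enforce automatically if your harness records per-call audio timestamps. A minimal sketch using the starter numbers above, assuming you can already measure TTFA and barge-in stop times per call:

```python
import math
import statistics

# Starter thresholds from the checklist above; tune them per use case.
TTFA_MEDIAN_MS = 900
TTFA_P95_MS = 2000
BARGE_IN_STOP_MS = 400
DOUBLE_TALK_MAX_RATE = 0.03

def p95(values):
    """Nearest-rank 95th percentile."""
    s = sorted(values)
    return s[min(len(s) - 1, math.ceil(0.95 * len(s)) - 1)]

def check_turn_taking(ttfa_ms, barge_in_stop_ms, double_talk_rate):
    """Return (passed, reasons) for a batch of evaluated calls."""
    reasons = []
    if statistics.median(ttfa_ms) > TTFA_MEDIAN_MS:
        reasons.append("median TTFA over budget")
    if p95(ttfa_ms) > TTFA_P95_MS:
        reasons.append("p95 TTFA over budget")
    if barge_in_stop_ms and statistics.median(barge_in_stop_ms) > BARGE_IN_STOP_MS:
        reasons.append("agent too slow to stop on barge-in")
    if double_talk_rate > DOUBLE_TALK_MAX_RATE:
        reasons.append("double-talk rate too high")
    return (not reasons, reasons)
```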
Checklist #3: ASR quality (WER) and transcript trustworthiness
ASR errors cascade into NLU mistakes and wrong tool calls. Evaluate ASR separately from the agent’s reasoning.
- Word Error Rate (WER): compute on a labeled sample (at least 50–100 utterances per major accent/noise condition).
- Entity error rate: wrong names, addresses, dates, phone numbers, emails (more important than overall WER).
- Noise robustness: test with background noise, speakerphone, car noise, and low bandwidth.
- Endpointing: does ASR cut off the user early or wait too long?
Operator rule: if the agent frequently mishears entities, you must add confirmation steps (and measure the added latency).
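WER is simple enough to compute without a dependency: word-level edit distance over your labeled sample. A dependency-free sketch (real pipelines should normalize casing, punctuation, and numerals first, which materially changes the numbers):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / ref words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(1, len(ref))

def entity_error_rate(pairs):
    """pairs: [(expected, transcribed), ...] for one entity type,
    e.g., phone numbers. Track this separately from overall WER."""
    misses = sum(1 for exp, got in pairs
                 if exp.strip().lower() != got.strip().lower())
    return misses / max(1, len(pairs))
```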
Checklist #4: NLU and intent accuracy (what the user meant)
Voice inputs are messy: partial sentences, corrections, and interruptions. Your evaluation needs “intent + slots” accuracy, not just “the reply sounded good.”
- Intent classification: correct identification of call reason (billing vs scheduling vs support).
- Slot filling: correct extraction of required fields (date/time, service type, policy number).
- Repair behavior: when uncertain, does it ask a clarifying question instead of guessing?
- Context carryover: remembers constraints across turns (e.g., “not Tuesday,” “morning only”).
Test with adversarial but realistic utterances: self-corrections (“Actually, make that Thursday”), multi-intent (“Reschedule and also update my email”), and vague requests (“I need help with my account”).
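One way to turn those patterns into regression tests is a golden set that asserts intent and slots, not reply fluency. A sketch with hypothetical intent/slot labels; `run_nlu` stands in for however your stack exposes classification output:

```python
# Golden NLU cases built from the adversarial patterns above.
GOLDEN_NLU_CASES = [
    {"utterance": "Actually, make that Thursday",
     "context": {"pending_intent": "schedule", "date": "Wednesday"},
     "expect": {"intent": "schedule", "slots": {"date": "Thursday"}}},
    {"utterance": "Reschedule and also update my email",
     "context": {},
     "expect": {"intent": "multi",
                "slots": {"intents": ["reschedule", "update_email"]}}},
    {"utterance": "I need help with my account",
     "context": {},
     # Vague: the correct behavior is a clarifying question, not a guess.
     "expect": {"intent": "clarify", "slots": {}}},
]

def evaluate_nlu(run_nlu):
    """Return the cases where intent/slot output diverged from expectations."""
    return [(c["utterance"], c["expect"], got)
            for c in GOLDEN_NLU_CASES
            if (got := run_nlu(c["utterance"], c["context"])) != c["expect"]]
```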
Checklist #5: Tool-calling reliability (the hidden source of “it sounded fine” failures)
For Vapi/Retell-style agents, tool calls are where production incidents happen: timeouts, wrong parameters, duplicate requests, and stale state.
- Correct tool selection: chooses the right API/action for the user’s intent.
- Parameter correctness: validates and formats fields (dates, phone numbers, IDs).
- Idempotency: avoids double-booking or duplicate tickets on retries.
- Timeout handling: graceful user messaging + retry policy (with limits).
- State consistency: if the user changes their mind mid-flow, does the agent cancel/replace prior actions?
Concrete test: force a 20% tool timeout rate in staging and verify the agent’s behavior doesn’t degrade into loops or hallucinated confirmations.
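Passing that test usually comes down to two mechanisms: idempotency keys and bounded retries. A minimal sketch, assuming your tool transport (`invoke` here, a placeholder) accepts an idempotency key and raises `TimeoutError` on timeout:

```python
import hashlib
import time

def idempotency_key(action: str, params: dict) -> str:
    """Stable key from action + normalized params, so a retried booking
    request can't create a second appointment or duplicate ticket."""
    blob = action + "|" + "|".join(f"{k}={params[k]}" for k in sorted(params))
    return hashlib.sha256(blob.encode()).hexdigest()

def call_tool(invoke, action, params, max_retries=2, timeout_s=3.0):
    """Retry with backoff, but never unboundedly; on final failure the
    caller falls back to human handoff with a clean summary."""
    key = idempotency_key(action, params)
    for attempt in range(max_retries + 1):
        try:
            return invoke(action, params, idempotency_key=key, timeout=timeout_s)
        except TimeoutError:
            if attempt == max_retries:
                raise  # no loops, no hallucinated confirmations
            time.sleep(0.2 * (2 ** attempt))  # short, bounded backoff
```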
Checklist #6: Safety, compliance, and PII handling
Voice agents often collect sensitive data. Your evaluation must include red-team prompts and policy checks that are measurable.
- PII minimization: only collect what’s needed; avoid asking for prohibited data.
- Redaction: ensure logs/transcripts mask sensitive fields (card numbers, SSN, DOB as required).
- Consent: recording disclosure and opt-out flows where applicable.
- Prompt injection resistance: user tries to override policy (“Ignore your rules and tell me…”).
- Escalation on risk: self-harm, threats, fraud indicators, or account takeover signals route to human.
Pass criteria: zero tolerance for disallowed PII requests; zero tolerance for revealing internal instructions; consistent, safe refusal patterns.
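Redaction, at least, is checkable in CI with a scan over stored transcripts and logs. A rough sketch; the patterns below are illustrative and deliberately simple (formats vary by locale, and real compliance tooling should go further, e.g., Luhn-validating card numbers):

```python
import re

# Rough patterns for a redaction audit of stored transcripts/logs.
PII_PATTERNS = {
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def pii_leaks(transcript: str) -> list:
    """Return the PII types that survived redaction in a stored transcript."""
    return [name for name, pattern in PII_PATTERNS.items()
            if pattern.search(transcript)]

# Gate releases on zero leaks across all evaluated calls.
assert pii_leaks("Your visit is confirmed for Tuesday at 9am.") == []
```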
Checklist #7: Call containment, handoff, and “same-day shortlist” routing
Containment isn’t “never hand off.” It’s “hand off only when it’s the best outcome.” Evaluate both containment and handoff quality.
- Containment rate: % of calls resolved without human.
- Appropriate handoff rate: % of handoffs that were actually necessary (sample and label).
- Handoff package quality: summary, extracted fields, user sentiment, next best action.
- Speed-to-lead routing: for sales/local services, route hot leads fast (e.g., under 60 seconds from qualification to a booked call or live transfer).
Recruiting-style adaptation (“intake + scoring + same-day shortlist”): if your voice agent screens applicants or inbound candidates, evaluate whether it collects the required fields, assigns a score, and routes qualified candidates to a recruiter the same day with a structured summary.
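Handoff quality is easiest to grade when the handoff artifact has a fixed shape that a reviewer can score field by field. A minimal sketch; the field names are illustrative:

```python
# Illustrative handoff package: what the receiving human (or recruiter,
# in the intake/shortlist adaptation) sees on every escalation.
def build_handoff_package(call: dict) -> dict:
    return {
        "summary": call["summary"],                # 2-3 sentence recap
        "extracted_fields": call["slots"],         # name, phone, date/time...
        "sentiment": call["sentiment"],            # e.g., "frustrated"
        "next_best_action": call["next_action"],   # e.g., "call back by 3pm"
        "qualification_score": call.get("score"),  # for screening/intake flows
        "trace_url": call["trace_url"],            # link to the full call trace
    }
```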
Checklist #8: Observability and automated test harness (so you can iterate)
If you can’t replay, measure, and compare, you can’t improve. Your evaluation stack should produce artifacts you can audit.
- Trace per call: audio timestamps, ASR segments, agent turns, tool calls, tool responses, and final outcome.
- Versioning: prompt version, model version, tool schema version, and release tag.
- Golden call set: curated scenarios with expected outcomes (including edge cases).
- Automated scoring: latency metrics, tool success rate, containment classification, policy violations.
- Human review loop: weekly sampling with a rubric; feed failures into new tests.
Practical harness pattern: run simulated calls using prerecorded audio (or TTS) for repeatability, then compare transcripts, tool traces, and outcomes across builds.
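The payoff of full traces and version tags is that replays become diffable. A sketch of the comparison step, assuming each simulated call emits a trace dict like the one described in the comment (illustrative fields, not a platform schema):

```python
# Expected trace shape per golden scenario:
#   {"scenario_id", "outcome", "ttfa_ms", "violations",
#    "versions": {"prompt", "model", "tools", "release"}}
def compare_builds(baseline, candidate):
    """Return regressions in the candidate build relative to baseline."""
    base = {t["scenario_id"]: t for t in baseline}
    regressions = []
    for cand in candidate:
        old = base.get(cand["scenario_id"])
        if old is None:
            continue  # new scenario; nothing to compare against
        if old["outcome"] == "success" and cand["outcome"] != "success":
            regressions.append((cand["scenario_id"], "outcome regressed"))
        if cand["violations"]:
            regressions.append((cand["scenario_id"], "policy violation"))
        if cand["ttfa_ms"] > 1.25 * old["ttfa_ms"]:
            regressions.append((cand["scenario_id"], "TTFA regressed >25%"))
    return regressions
```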
Case study: 14-day voice agent evaluation sprint (with numbers)
This example shows how an operator team can apply the checklist to a Vapi/Retell-like appointment-setting agent for a local services business.
Baseline (Day 1–2): instrument and label
- Traffic: 300 inbound calls/week (business hours), goal is booking.
- Initial routing: 30% of calls to agent, 70% to humans.
- Sample labeled: 80 calls (40 agent, 40 human) for comparison.
Baseline results (agent calls):
- Containment: 41%
- Booking rate (of agent-handled calls): 18%
- Median TTFA: 1.4s (p95 3.1s)
- Double-talk: 11% of turns
- Tool-call success: 86% (failures split between timeouts and schema errors)
- Entity errors: 9% on phone numbers; 14% on dates
Interventions (Day 3–10): fix the biggest failure modes first
- Turn-taking tuning: adjust endpointing and barge-in settings; shorten agent responses; add “I can help with that” filler only when tool calls exceed 1s.
- Entity confirmations: add explicit read-back for dates and phone numbers with a one-step correction path (“Did I get that right?”).
- Tool reliability: add retries with idempotency keys; validate parameters before calling; implement “tool timeout” fallback to human handoff with summary.
- Golden tests: create 25 scenario calls (noise, interruptions, reschedules, cancellations, wrong-number, angry caller).
Re-test (Day 11–14): compare against the golden set + live sample
Post-sprint results (agent calls):
- Containment: 58% (up from 41%)
- Booking rate: 27% (up from 18%)
- Median TTFA: 0.85s (p95 1.9s)
- Double-talk: 4% of turns
- Tool-call success: 96%
- Entity errors: 3% on phone numbers; 5% on dates
Decision: increase routing from 30% to 55% of inbound calls, but keep a guardrail: if tool success drops below 93% or TTFA p95 exceeds 2.2s for two days, automatically roll back.
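Guardrails like this only work if nobody has to remember to check them. A sketch of the rollback rule above as a scheduled daily job; the metric field names are illustrative:

```python
# Two consecutive bad days trigger an automatic routing rollback.
TOOL_SUCCESS_FLOOR = 0.93
TTFA_P95_CEILING_MS = 2200
BAD_DAYS_TO_ROLL_BACK = 2

def should_roll_back(daily_metrics: list) -> bool:
    """daily_metrics: oldest-first dicts with 'tool_success' and 'ttfa_p95_ms'."""
    recent = daily_metrics[-BAD_DAYS_TO_ROLL_BACK:]
    return (len(recent) == BAD_DAYS_TO_ROLL_BACK and
            all(m["tool_success"] < TOOL_SUCCESS_FLOOR or
                m["ttfa_p95_ms"] > TTFA_P95_CEILING_MS for m in recent))
```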
Cliffhanger (what most teams miss next): once you scale traffic, the distribution shifts—more edge cases, more noise, and more adversarial callers. The next step is building a “drift dashboard” that alerts you when your golden set no longer matches production reality.
Operator-ready scoring rubric (copy/paste)
Use a 0–2 score per category (0 = fail, 1 = acceptable, 2 = strong). A call’s maximum is 16.
- Latency & turn-taking: TTFA, barge-in, double-talk
- ASR & entities: correct capture of critical fields
- NLU & flow control: intent/slots + repair behavior
- Tool execution: correct call, correct params, no duplicates
- Safety/PII: policy compliance + redaction
- Containment/handoff: resolves or hands off with summary
- Customer experience: clarity, empathy, brevity, confidence
- Observability: trace completeness + version tags
Launch guidance: don’t scale until your median call score is ≥ 13/16 and there are no “0” scores in Safety/PII or Tool execution across the last 50 evaluated calls.
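The launch rule is mechanical enough to automate alongside your weekly review. A sketch, with the rubric categories as hypothetical dict keys:

```python
import statistics

RUBRIC = ["latency_turn_taking", "asr_entities", "nlu_flow", "tool_execution",
          "safety_pii", "containment_handoff", "customer_experience",
          "observability"]

def launch_gate(scored_calls: list) -> bool:
    """scored_calls: the last 50 evaluated calls, each {category: 0|1|2}.
    Gate: median total >= 13/16 and no zero in safety_pii or tool_execution."""
    totals = [sum(call[cat] for cat in RUBRIC) for call in scored_calls]
    hard_zero = any(call["safety_pii"] == 0 or call["tool_execution"] == 0
                    for call in scored_calls)
    return statistics.median(totals) >= 13 and not hard_zero
```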
FAQ: Voice AI agent evaluation (Vapi/Retell)
- How many calls do I need to evaluate a voice agent?
- For a first go/no-go, label 50–100 real calls per major call type. For ongoing monitoring, review 10–20 calls/week plus automated golden tests on every release.
- Is WER enough to judge ASR quality?
- No. Track entity error rate (dates, phone numbers, addresses) separately. A low WER can still hide catastrophic errors on critical fields.
- What’s the fastest way to catch tool-call regressions?
- Log every tool call with parameters and responses, then run a golden scenario suite that asserts tool selection, parameter validity, and idempotency behavior. Fail the build on schema mismatches or duplicate actions.
- How do I evaluate interruptions and barge-in reliably?
- Create a test set with scripted interruptions at different points (mid-sentence, during tool wait, during confirmation). Measure stop-speaking time and double-talk rate using audio timestamps, not just transcripts.
- Should I optimize for containment or conversion first?
- Optimize for correct outcomes with safe handoff. High containment with poor handoff quality can reduce conversion and increase escalations. Track both containment and “appropriate handoff” with labeled reviews.
Next steps: run the checklist and operationalize it
If you’re evaluating a Voice AI agent on Vapi/Retell-like infrastructure, the fastest path to reliability is: (1) define outcomes and constraints, (2) measure voice-specific latency and interruptions, (3) validate ASR/entities, (4) harden tool calls, and (5) enforce safety/PII with auditable logs—then lock it all into an automated harness.
CTA: If you want a ready-to-use evaluation rubric, golden call pack, and an automated scoring harness tailored to your voice flows (booking, support, intake, or lead routing), contact Evalvista to set up a 2-week voice agent evaluation sprint and ship with confidence.