    Agent Regression Testing: Shadow Mode vs Replay vs Sim

    April 7, 2026 · admin

    Agent regression testing is hard for one reason: your “software” is a probabilistic system interacting with messy users, tools, and policies. When you change a prompt, tool schema, model, memory, or routing logic, you need confidence you didn’t break what already worked—without slowing iteration to a crawl.

    This guide compares three approaches teams actually use in production: shadow mode (a.k.a. dark launches), conversation replay (log-based re-execution), and simulation (synthetic users and environments). You’ll see what each catches, what it misses, how to measure pass/fail, and how to combine them into a repeatable workflow.

    Why this comparison matters

    If you’re shipping an AI agent that answers customers, qualifies leads, books meetings, or triages tickets, your goal is usually the same: increase capability while keeping reliability flat or improving. The value prop of regression testing is straightforward: detect degradations early, quantify risk, and ship changes faster with fewer rollbacks.

    But teams get stuck picking a single method. In practice, the best programs use all three—at different points in the lifecycle—because they answer different questions:

    • Replay: “Did we break known, real conversations?”
    • Simulation: “How do we behave in edge cases we haven’t seen yet?”
    • Shadow mode: “Will this work on today’s live distribution without harming users?”

    Who this is for

    Evalvista’s core audience is teams building AI agents with tools (CRM, ticketing, calendars, databases), policies (PII rules, escalation), and business outcomes (conversion, time-to-resolution). For these agents, “regression” isn’t just response quality—it includes:

    • Tool reliability: correct API calls, parameters, and retries
    • Policy compliance: safe handling of PII, refusals, disclosure
    • Workflow correctness: state transitions, handoffs, escalations
    • Business KPIs: bookings, deflection, CSAT proxies, average handle time (AHT)

    Your goal is to ship changes (model upgrades, prompt edits, new tools) with measurable confidence—not “it looked good in a few chats.”

    Three approaches compared at a glance

    Here’s a practical comparison you can use to choose your next step. Most teams start with replay, add simulation for coverage, then use shadow mode for final validation.

    • Conversation replay: Re-run historical conversations through the new agent version and compare outcomes.
    • Simulation: Generate synthetic conversations (users, environments, tool responses) to stress specific behaviors.
    • Shadow mode: Run the new agent alongside production on live traffic, but don’t let it affect the user; compare decisions and outcomes.

    Method 1: Conversation replay (log-based regression)

    What it is: You take real past interactions (messages, tool calls, tool outputs, final resolutions) and re-execute them using the candidate agent version. You score the run against expected outcomes or against the prior version’s behavior.

    Where replay shines

    • High realism: It reflects real user language, real intent mix, and real tool data patterns.
    • Fast iteration: You can run hundreds/thousands of conversations overnight as a gating step.
    • Great for “did we break X?”: particularly strong for known workflows such as password reset, refund eligibility, meeting booking, and lead qualification.

    Where replay fails (common traps)

    • Stale context: Tool data changes. A replay may diverge because the CRM record is different today than it was then.
    • Distribution shift: New product launches, new user segments, or new policies can make historical logs less representative.
    • Hidden coupling: If your agent depends on timing, concurrency, or external side effects, naive replay is misleading.

    Best practice: Store tool call inputs/outputs (or snapshots) so replay is deterministic. If that’s not possible, classify tests into:

    • Deterministic replay: tool outputs are fixed; ideal for regression gates
    • Live replay: tools are called live; useful for integration testing, not strict regression
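    The deterministic variant can be sketched in a few lines. This is an illustrative harness, not any particular product’s API: the agent interface, turn format, and snapshot keying are all assumptions.

```python
import json

def tool_key(call):
    """Canonical identity for a tool call: name plus sorted-JSON args."""
    return (call["name"], json.dumps(call["args"], sort_keys=True))

def replay_tool_calls(candidate_agent, logged_turns, snapshots):
    """Deterministic replay: walk a logged conversation, ask the candidate
    agent what it would do at each tool-call step, and feed back the frozen
    snapshot output so no live system is ever hit. Returns the list of
    divergences from the logged tool calls."""
    history, divergences = [], []
    for turn in logged_turns:
        if turn["role"] == "user":
            history.append(turn)
        elif turn["role"] == "tool_call":
            logged = tool_key(turn["call"])
            proposed = candidate_agent(history)  # candidate's next tool call
            got = tool_key(proposed) if proposed else None
            if got != logged:
                divergences.append({"expected": logged, "got": got})
            # Inject the snapshotted output either way, for determinism.
            history.append({"role": "tool", "content": snapshots[logged]})
    return divergences
```

    In a regression gate, you would run this over every snapshotted conversation and block the merge on any divergence in a critical workflow.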

    Method 2: Simulation (synthetic users + environments)

    What it is: You create simulated conversations that target behaviors you care about—edge cases, adversarial prompts, policy constraints, multi-turn confusion, tool failures. Simulations can be authored (handcrafted), generated (LLM-based user simulators), or templated (parameterized scenarios).

    Where simulation shines

    • Coverage for rare but costly failures: PII leakage, jailbreak attempts, incorrect refunds, wrong meeting scheduling.
    • Systematic exploration: You can vary one factor at a time (tone, ambiguity, missing fields) to see sensitivity.
    • Early-stage testing: Before you have enough production logs, simulation can bootstrap a regression suite.

    Where simulation fails (and how to mitigate)

    • Unrealistic users: LLM-generated users can be too cooperative. Mitigation: seed with real transcripts, add “friction” behaviors (non-answers, contradictions, impatience).
    • Overfitting to your simulator: The agent learns to “beat” a narrow set of simulated patterns. Mitigation: diversify generators and keep a holdout set.
    • False confidence: Passing synthetic tests doesn’t prove live performance. Mitigation: treat simulation as coverage and safety, not final validation.

    Best practice: Build a scenario library tied to explicit risk categories: compliance, tool correctness, escalation, and business outcomes. Each scenario should specify: starting context, user goal, allowed tools, and success criteria.
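    As a sketch, one entry in such a scenario library might look like the following; the field names and the example scenario are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """One entry in a simulation scenario library. Fields mirror the four
    parts named above: starting context, user goal, allowed tools, and
    success criteria, tagged with an explicit risk category."""
    name: str
    risk_category: str      # e.g. "compliance", "tool_correctness", "escalation", "business"
    starting_context: dict  # seed state: CRM record, order data, prior messages
    user_goal: str
    allowed_tools: list
    success_criteria: list  # named predicates checked after the simulated run

# A compliance-boundary case: refund requested outside the policy window.
REFUND_EDGE = Scenario(
    name="refund_outside_window",
    risk_category="compliance",
    starting_context={"order_age_days": 45, "policy_window_days": 30},
    user_goal="Get a refund for an order past the refund window",
    allowed_tools=["orders.lookup"],
    success_criteria=["no_refund_issued", "policy_explained", "offered_escalation"],
)
```

    Keeping success criteria as named predicates (rather than free text) is what lets the same scenario be scored automatically run after run.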

    Method 3: Shadow mode (live traffic, zero user impact)

    What it is: The candidate agent runs in parallel on live requests. The user still receives responses from the production agent (or a human). You log what the shadow agent would have done: messages, tool calls, decisions, and predicted outcomes.

    Shadow mode is the closest you get to “production truth” without risking customer experience.

    Where shadow mode shines

    • Real distribution: You test on today’s traffic mix, not last month’s.
    • Integration realism: Real tool latency, partial outages, permission errors, rate limits.
    • Pre-launch confidence: Especially valuable for model upgrades or major routing changes.

    Where shadow mode fails

    • Harder to score: You may not know the “correct” answer immediately. You often need proxy metrics or delayed labels.
    • Cost: You pay for running two agents (tokens + tool calls) unless you stub tools.
    • Operational complexity: Requires routing, logging, and strict isolation so the shadow agent cannot mutate state.

    Best practice: In shadow mode, default to read-only tools or sandboxed tool credentials. If you must call write tools, add idempotency keys and strict no-op modes.
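    A minimal guard can enforce that discipline at the tool boundary. The tool names and the executor interface below are hypothetical; the point is the shape: reads pass through, writes become logged no-ops with idempotency keys.

```python
import uuid

READ_ONLY = {"orders.lookup", "calendar.list_slots", "crm.get_contact"}

class ShadowToolGuard:
    """Wraps a tool executor for shadow runs. Read-only tools pass through
    to the real executor; write tools are recorded as no-ops with an
    idempotency key so the intended call can be audited later (and safely
    executed at most once if ever promoted)."""

    def __init__(self, execute):
        self._execute = execute        # real tool runtime: (name, args) -> result
        self.blocked_writes = []       # audit log of what the shadow *would* have done

    def call(self, name, args):
        if name in READ_ONLY:
            return self._execute(name, args)
        record = {"tool": name, "args": args, "idempotency_key": str(uuid.uuid4())}
        self.blocked_writes.append(record)   # log instead of mutating state
        return {"shadow_noop": True, "idempotency_key": record["idempotency_key"]}
```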

    How to choose: a decision framework

    Use this framework to pick the right method per change type. The key is aligning risk with evidence strength.

    • Small prompt edits / rubric tweaks: Start with replay. Add a small simulation set for known failure modes.
    • Model upgrade (e.g., new base model): Replay + simulation, then shadow mode for 24–72 hours before rollout.
    • New tool integration or schema changes: Simulation with tool-failure scenarios + replay with deterministic tool snapshots; shadow mode if tool behavior is variable.
    • Policy/compliance changes: Simulation-heavy (adversarial and boundary cases), then replay on any historical compliance incidents, then shadow mode with strict monitoring.

    Another practical lens: what question are you trying to answer?

    • “Did we regress known workflows?” → Replay
    • “Are we safe under stress and weird inputs?” → Simulation
    • “Will this work on the current live mix?” → Shadow mode

    Scoring and pass/fail: make regressions measurable

    All three methods fail if you can’t score outcomes consistently. A practical scoring stack for agent regression testing includes:

    • Task success: Did the agent complete the user goal (booking created, ticket resolved, refund eligibility determined)?
    • Tool correctness: Correct tool selected, correct parameters, correct sequence, bounded retries.
    • Policy compliance: No disallowed content, correct refusal, PII redaction, correct disclosures.
    • Conversation quality: Clarity, brevity, tone, and grounding (where relevant).

    Operationally, teams use a mix of:

    • Deterministic checks: JSON schema validation, tool-call assertions, regex/PII detectors, state machine invariants.
    • LLM-as-judge rubrics: For nuanced quality and goal completion—paired with calibration and spot-checking.
    • Human review sampling: Focused on high-risk categories and disagreements between versions.
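    The deterministic layer is the cheapest to build first. Two illustrative checks, assuming a simple message/tool-call log format (the SSN regex is a deliberately narrow stand-in for a real PII detector):

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def check_no_pii(messages):
    """Policy check: return any assistant messages that echo an SSN-shaped
    string. A real deployment would use a broader PII detector."""
    return [m for m in messages
            if m["role"] == "assistant" and SSN_RE.search(m["content"])]

def check_bounded_retries(tool_calls, limit=3):
    """Tool check: return any tool called more than `limit` times in one
    conversation (a proxy for runaway retry loops)."""
    counts = {}
    for call in tool_calls:
        counts[call["name"]] = counts.get(call["name"], 0) + 1
    return {name: n for name, n in counts.items() if n > limit}
```

    Checks like these run identically across replay, simulation, and shadow logs, which is what makes version-to-version deltas comparable.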

    Case study: pipeline-fill agent regression using replay + sim + shadow

    This example uses the “Agencies: pipeline fill and booked calls” template, because it’s a common agent workflow: qualify inbound leads, route to the right rep, and book meetings via calendar tools.

    Company: 25-person marketing agency running an AI SDR agent on website chat and email.
    Baseline: 1,200 inbound conversations/week; 18% lead-to-meeting rate; 9% escalation to humans.
    Change: Upgrade model + new qualification rubric + new calendar tool schema.

    Timeline and implementation

    1. Days 1–2 (Replay suite build): Collected 800 past conversations, labeled 200 for “meeting booked / not booked” and “correct routing.” Snapshotted tool outputs for deterministic replay.
    2. Days 3–4 (Simulation coverage): Added 120 synthetic scenarios: ambiguous budgets, multiple stakeholders, time zone confusion, and tool failures (calendar API 500s, rate limits). Built assertions for tool-call correctness and escalation rules.
    3. Days 5–7 (Shadow mode): Ran shadow agent on 30% of live traffic (read-only calendar). Logged predicted routing, proposed meeting times, and escalation decisions. Compared to production outcomes and human reviews on a 10% sample.

    Results (numbers)

    • Replay: Task success improved from 82% → 89% on the labeled subset; but tool-call correctness dropped from 96% → 88% due to the new calendar schema.
    • Simulation: Escalation compliance improved from 71% → 93% on “policy boundary” cases (e.g., user asks for pricing promises); tool failure handling improved (infinite retry incidents reduced from 14 cases to 1 in the sim suite).
    • Shadow mode: On live traffic, the shadow agent proposed a meeting for 21% of conversations vs 19% in production, but initially had a 6% higher “wrong rep” routing rate. After adjusting routing rules and rerunning shadow for 48 hours, wrong-rep routing fell to within +0.5% of baseline.

    Rollout outcome: After launch, lead-to-meeting rate increased from 18% → 20.7% over two weeks (relative +15%), while escalation stayed flat at ~9%. The key was that replay caught the schema regression early, simulation hardened failure handling, and shadow mode validated the new traffic mix before exposing users.

    Common operational pitfalls (and how to avoid them)

    • Only testing “happy paths”: Add a minimum quota of adversarial and failure-mode simulations (tool timeouts, missing fields, contradictory user inputs).
    • Not versioning everything: Version prompts, tool schemas, routing policies, memory settings, and evaluation rubrics so regressions are attributable.
    • Conflating quality with success: A polite answer that doesn’t book the meeting is still a failure. Separate “task success” from “style.”
    • No ownership for eval failures: Assign a DRI for each failure category (tools, policy, routing) and track fix SLAs.

    FAQ

    What’s the best starting point for agent regression testing?
    Start with conversation replay on a small, high-signal set (50–200 real conversations) with deterministic tool snapshots. It gives fast, realistic feedback.
    How many tests do we need before we can trust results?
    Enough to cover your top workflows and top risks. Many teams begin with 200–1,000 replay conversations plus 50–200 targeted simulations, then expand based on failures and new features.
    Can we skip shadow mode if replay and simulation look good?
    For low-risk prompt tweaks, sometimes. For model upgrades, routing changes, or new tool integrations, shadow mode is the safest way to validate performance on today’s traffic distribution without user impact.
    How do we score regressions when there isn’t a single “right answer”?
    Use layered scoring: deterministic checks for tools/policy + rubric-based judging for goal completion and clarity. Track deltas versus baseline and require no statistically meaningful drop on critical metrics.
    What should we log to enable replay and shadow mode?
    At minimum: user messages, agent messages, tool call names/arguments, tool outputs (or snapshots), timestamps, version identifiers (model/prompt/tool schema), and final outcome labels when available.
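    As an illustration, one event in such a log might carry the fields below (the names are assumptions, not a standard schema), plus a quick completeness check for replayability:

```python
# Illustrative shape for one logged event in an agent transcript.
EVENT = {
    "conversation_id": "c_123",
    "timestamp": "2026-04-07T14:03:22Z",
    "role": "assistant",
    "content": None,
    "tool_call": {"name": "calendar.create_event", "args": {"slot": "tue-10am"}},
    "tool_output_snapshot": {"event_id": "evt_9", "status": "confirmed"},
    "versions": {"model": "m-2026-03", "prompt": "v41", "tool_schema": "cal-v3"},
    "outcome_label": None,  # filled in later, once the resolution is known
}

REQUIRED = {"conversation_id", "timestamp", "role", "versions"}

def replayable(event):
    """True if an event carries the minimum fields for deterministic replay:
    the required identifiers, and a frozen output for any tool call."""
    return REQUIRED <= event.keys() and (
        event.get("tool_call") is None or "tool_output_snapshot" in event)
```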

    Putting it together: the repeatable workflow

    If you want a practical, repeatable program, combine the three methods into a single release pipeline:

    1. Replay gate (every change): Run deterministic replay; block merges on critical metric regressions (tool correctness, policy violations, task success).
    2. Simulation gate (weekly + before risky launches): Expand scenario coverage for new features and newly discovered failure modes.
    3. Shadow validation (before rollout): Run 24–72 hours on live traffic with read-only tools; investigate deltas and only then expose users.

    The missing piece is consistency: a single evaluation harness that versions datasets, runs judges, aggregates metrics, and makes regressions obvious in CI—not buried in logs.

    CTA: make regression testing a release habit

    If you’re ready to turn agent regression testing into a predictable release process, Evalvista helps you build repeatable eval suites (replay, simulation, and shadow-mode comparisons), benchmark versions, and ship changes with confidence.

    Book a demo to see how to set up an agent evaluation framework that catches regressions before your users do—and accelerates iteration without sacrificing reliability.
