What is Session Replay?

Session Replay records every put, find, and ask operation during an agent session, then lets you replay it with different parameters, models, or frozen context for debugging and auditing.

Overview

1. Record: Start a session and perform operations (put, find, ask). Every action is captured with full context.
2. Save: End the session. Frames, results, answers, tokens, cost, and grounding scores are persisted.
3. Replay: Re-run the session with different parameters or frozen context to debug or audit.

Debug Mode

Re-execute searches with different --top-k or --adaptive settings to diagnose why expected results were missed.

Audit Mode

Freeze the retrieval context and replay with different LLMs using --audit --use-model --diff.

Key Features

| Feature | Description |
| --- | --- |
| Frozen Context | Replay with the exact same frames - no retrieval drift |
| Model A/B Testing | Compare GPT-4 vs Claude vs Gemini with identical input |
| Cost Tracking | Token counts and USD cost per query |
| Grounding Scores | Detect hallucination risk (0-100%) |
| Answer Caching | Skip redundant LLM calls, save money |
| Diff Reports | See exactly how answers changed |

Quick Example

# 1. Start recording
memvid session start knowledge.mv2 --name "Audit 2024-12"

# 2. Ask questions (tokens, cost, grounding tracked)
memvid ask knowledge.mv2 --question "What was the revenue?" --use-model openai:gpt-4o-mini
# tokens: 3112 + 42 = 3154   cost: $0.0005   grounding: 95% (HIGH)

# 3. End session
memvid session end knowledge.mv2
# Session ended. 5 actions recorded.

# 4. Replay with different model + diff
memvid session replay knowledge.mv2 --session <id> \
  --audit --use-model claude:claude-4-sonnet --diff
# Diff: IDENTICAL ✓

How It Works

1. Start Recording

# CLI
memvid session start knowledge.mv2 --name "Audit Session"

# Python SDK
session_id = mem.session_start("Audit Session")

2. Perform Operations

All operations are recorded with full context:
# Ask questions - frames, tokens, and answers are captured
memvid ask knowledge.mv2 --question "What was the acquisition price?" \
  --use-model openai:gpt-4o-mini

# Output shows cost and grounding
# tokens: 3112 input + 19 output = 3131   cost: $0.000478
# grounding: 100% (HIGH) - 2/2 sentences grounded

3. End Session

memvid session end knowledge.mv2
# Output: Session ended. 12 actions recorded.

4. View Session Details

memvid session view knowledge.mv2 --session <session-id>

# Output:
# Actions:
#   [0] FIND - Find { query: "acquisition price", mode: "Hybrid", result_count: 8 }
#   [1] ASK - Ask { query: "What was the acquisition price?", provider: "openai", model: "gpt-4o-mini" }

Replay Modes

Standard Replay (Debug Mode)

Standard replay re-runs retrieval so you can compare new results against the original run:
memvid session replay knowledge.mv2 --session <id> --adaptive --verbose

Audit Replay (Frozen Context)

Audit mode uses the exact frames from the original session:
memvid session replay knowledge.mv2 --session <id> --audit
Output shows frozen frames:
✓ Step 3/12 ask
   Question: "What was the acquisition price?"
   Mode: AUDIT (frozen retrieval)
   Original Model: openai:gpt-4o-mini
   Frozen frames: [66, 68, 61, 170, 22, 67, 57, 0]
   Context: VERIFIED (frames frozen)
   Original Answer: "The acquisition was valued at $2 billion."

Model A/B Testing

Compare different models with identical context:
# Original used GPT-4o-mini, replay with Claude
memvid session replay knowledge.mv2 --session <id> \
  --audit \
  --use-model claude:claude-3-5-sonnet \
  --diff
Output shows comparison:
✓ Step 3/12 ask
   Question: "What was the acquisition price?"
   Mode: AUDIT (frozen retrieval)
   Original Model: openai:gpt-4o-mini
   Frozen frames: [66, 68, 61, 170, 22, 67, 57, 0]
   Override Model: claude:claude-3-5-sonnet
   Original Answer: "The acquisition was valued at $2 billion."
   Context: VERIFIED (frames frozen)
   New Answer: "The acquisition price was $2B according to the documents."
   Diff: CHANGED

Replay Options

| Option | Description |
| --- | --- |
| --audit | Freeze retrieval - use recorded frames instead of re-searching |
| --use-model <model> | Override the LLM model for comparison (e.g., openai:gpt-4o, gemini:gemini-2.5-flash) |
| --diff | Generate diff report comparing original vs new answers |
| --adaptive | Enable adaptive retrieval (debug mode only) |
| --top-k N | Override top-k value (debug mode only) |
| --skip-asks | Skip LLM operations during replay |
| --skip-finds | Skip search operations during replay |
| --from-checkpoint | Start replay from a specific checkpoint |
| --web | Launch Time Machine web UI |

Token & Cost Tracking

Every ask operation tracks token usage and estimated cost:
memvid ask knowledge.mv2 --question "Summarize the report" \
  --use-model openai:gpt-4o-mini --json
{
  "answer": "The report covers...",
  "usage": {
    "input_tokens": 3648,
    "output_tokens": 36,
    "total_tokens": 3684,
    "cost_usd": 0.000569
  },
  "grounding": {
    "score": 1.0,
    "label": "HIGH",
    "sentence_count": 1,
    "grounded_sentences": 1,
    "has_warning": false
  },
  "cached": false
}
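If you script against the CLI, the --json output above can be consumed directly. A minimal sketch, assuming the memvid binary is on your PATH and writes the JSON object to stdout:

import json
import subprocess

# Run the documented CLI command with --json and parse its output.
proc = subprocess.run(
    ["memvid", "ask", "knowledge.mv2",
     "--question", "Summarize the report",
     "--use-model", "openai:gpt-4o-mini",
     "--json"],
    capture_output=True, text=True, check=True,
)
result = json.loads(proc.stdout)

print(f"Cost: ${result['usage']['cost_usd']:.6f}")
if result["grounding"]["has_warning"]:
    print("Low grounding - review the answer against the source context")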

Supported Models & Pricing (Dec 2025)

| Provider | Model | Input / 1M tokens | Output / 1M tokens |
| --- | --- | --- | --- |
| OpenAI | gpt-4o-mini | $0.15 | $0.60 |
| OpenAI | gpt-4o | $2.50 | $10.00 |
| OpenAI | gpt-4.5 | $75.00 | $150.00 |
| Claude | claude-3-haiku | $0.25 | $1.25 |
| Claude | claude-4-sonnet | $3.00 | $15.00 |
| Claude | claude-4-opus | $15.00 | $75.00 |
| Gemini | gemini-2.5-flash | $0.15 | $3.50 |
| Gemini | gemini-2.5-pro | $1.25 | $10.00 |
| xAI | grok-4 | $3.00 | $15.00 |
| Groq | llama-3.3-70b | $0.59 | $0.79 |
| Mistral | mistral-large | $0.50 | $1.50 |
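The cost_usd field is the token counts multiplied by the per-million prices above. A worked sketch of that arithmetic (prices copied from the table; the CLI's own rounding may differ slightly):

# Per-1M-token prices (USD) taken from the table above.
PRICING = {
    "openai:gpt-4o-mini": (0.15, 0.60),
    "openai:gpt-4o": (2.50, 10.00),
    "claude:claude-4-sonnet": (3.00, 15.00),
}

def estimate_cost(model, input_tokens, output_tokens):
    input_price, output_price = PRICING[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# The earlier example: 3648 input + 36 output tokens on gpt-4o-mini.
print(round(estimate_cost("openai:gpt-4o-mini", 3648, 36), 6))  # 0.000569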

Grounding & Hallucination Detection

Every answer is scored for grounding - how well it’s supported by the retrieved context:
grounding: 100% (HIGH) - 2/2 sentences grounded
| Score | Label | Meaning |
| --- | --- | --- |
| 70-100% | HIGH | Well-grounded in context |
| 40-69% | MEDIUM | Partially grounded |
| 0-39% | LOW | Potential hallucination - warning shown |
When grounding is low, you’ll see a warning:
grounding: 25% (LOW) - 1/4 sentences grounded
⚠ Warning: Some statements may not be supported by context
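The score is the fraction of answer sentences supported by the retrieved context, mapped to a label using the thresholds above. A minimal sketch of that mapping (the sentence-level matching itself is performed by memvid; grounding_label is an illustrative helper):

def grounding_label(grounded_sentences, sentence_count):
    # Score is the fraction of answer sentences supported by the retrieved context.
    score = grounded_sentences / sentence_count if sentence_count else 0.0
    if score >= 0.70:
        return score, "HIGH"
    if score >= 0.40:
        return score, "MEDIUM"
    return score, "LOW"  # potential hallucination - warning shown

print(grounding_label(2, 2))  # (1.0, 'HIGH')  -> "100% (HIGH) - 2/2 sentences grounded"
print(grounding_label(1, 4))  # (0.25, 'LOW') -> warning shown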

Answer Caching

Repeated questions with the same context return cached answers instantly:
# First call - hits LLM
memvid ask knowledge.mv2 --question "What is the revenue?" --use-model openai
# tokens: 2500 input + 50 output   cost: $0.00042

# Second call - cached
memvid ask knowledge.mv2 --question "What is the revenue?" --use-model openai
# cached: true   cost: $0.00 (saved $0.00042)
Cache key is based on: model + query + context hash
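A sketch of how such a key can be derived; the exact hashing scheme is not specified here, so sha256 and the field ordering are assumptions for illustration:

import hashlib

def cache_key(model, query, context):
    # Illustrative: combine the model, the query, and a hash of the retrieved context.
    context_hash = hashlib.sha256(context.encode("utf-8")).hexdigest()
    return hashlib.sha256(f"{model}|{query}|{context_hash}".encode("utf-8")).hexdigest()

# Same model + same question + same retrieved context -> same key -> cached answer.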

Use Case Examples

1. Debug Missing Results

# Record the failing scenario
memvid session start knowledge.mv2 --name "Missing Results Debug"
memvid ask knowledge.mv2 --question "What did Databricks purchase?" --use-model openai
memvid session end knowledge.mv2

# Replay with adaptive retrieval
memvid session replay knowledge.mv2 --session <id> --adaptive --verbose
# Reveals: Document existed at rank 12, adaptive found it

2. Compliance Audit Trail

# Record all decisions for audit
memvid session start knowledge.mv2 --name "Compliance Review 2024-12"
memvid ask knowledge.mv2 --question "Is this transaction fraudulent?" --use-model openai
memvid session end knowledge.mv2

# Later: Replay with frozen context to verify decision
memvid session replay knowledge.mv2 --session <id> --audit
# Shows exact frames and answer - reproducible for auditors

3. Model Comparison

# Ask with GPT-4o
memvid session start knowledge.mv2 --name "Model Comparison"
memvid ask knowledge.mv2 --question "Summarize the key findings" --use-model openai:gpt-4o
memvid session end knowledge.mv2

# Replay with different models to compare
memvid session replay knowledge.mv2 --session <id> --audit --use-model gemini:gemini-2.5-pro --diff
memvid session replay knowledge.mv2 --session <id> --audit --use-model claude:claude-4-sonnet --diff
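The same comparison can be scripted with the Python SDK (see below); a sketch using the documented session_replay call, where the session id placeholder and model list are illustrative:

from memvid_sdk import create

mem = create('knowledge.mv2', enable_vec=True)
session_id = "<session-id>"  # id of the recorded "Model Comparison" session

# Replay the same frozen context against several models and compare the answers.
for model in ["gemini:gemini-2.5-pro", "claude:claude-4-sonnet"]:
    replay = mem.session_replay(session_id, audit=True, use_model=model, diff=True)
    for action in replay.ask_results:
        print(f"{model}: {action.diff_status}")
        print(f"  Original: {action.original_answer}")
        print(f"  New:      {action.new_answer}")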

CLI Commands Reference

| Command | Description |
| --- | --- |
| memvid session start <file> --name <name> | Start recording |
| memvid session end <file> | End recording and save |
| memvid session list <file> | List all sessions |
| memvid session view <file> --session <id> | View session details |
| memvid session replay <file> --session <id> | Replay session |
| memvid session delete <file> --session <id> | Delete session |
| memvid session checkpoint <file> | Create checkpoint |
| memvid session compare <file> -a <id1> -b <id2> | Compare two sessions |

SDK Support

Python SDK (Full Support)

from memvid_sdk import create

mem = create('knowledge.mv2', enable_vec=True)

# Record session
session_id = mem.session_start("Audit Session")
result = mem.ask("What was the revenue?", model="openai:gpt-4o-mini")
print(f"Cost: ${result.usage.cost_usd:.6f}")
print(f"Grounding: {result.grounding.score:.0%}")
summary = mem.session_end()

# Replay with audit mode
replay = mem.session_replay(
    session_id,
    audit=True,
    use_model="claude:claude-4-sonnet",
    diff=True
)
for action in replay.ask_results:
    print(f"Original: {action.original_answer}")
    print(f"New: {action.new_answer}")
    print(f"Diff: {action.diff_status}")

Node.js SDK

import { create } from '@anthropic/memvid-sdk';

const mem = await create('knowledge.mv2', { enableVec: true });

// Record session
const sessionId = await mem.sessionStart("Audit Session");
const result = await mem.ask("What was the revenue?", { model: "openai:gpt-4o-mini" });
console.log(`Cost: $${result.usage.costUsd.toFixed(6)}`);
await mem.sessionEnd();

// Replay
const replay = await mem.sessionReplay(sessionId, {
  audit: true,
  useModel: "gemini:gemini-2.5-flash",
  diff: true
});

Best Practices

  1. Use descriptive session names: Include date and purpose, e.g., “Fraud Detection Audit 2024-12-27”
  2. Record minimal reproductions: Capture just enough to reproduce the issue
  3. Use audit mode for compliance: Frozen context ensures reproducibility
  4. Compare models with identical context: Use --audit --use-model --diff for fair comparisons
  5. Monitor grounding scores: Low scores indicate potential hallucination

Next Steps