What is Session Replay?
Record every put, find, and ask operation during an agent session, then replay it with different parameters, models, or frozen context for debugging and auditing.
Overview
Record
Start a session and perform operations (put, find, ask). Every action is captured with full context.
Save
End the session. Frames, results, answers, tokens, cost, and grounding scores are persisted.
Replay
Re-run the session with different parameters or frozen context to debug or audit.
Debug Mode: Re-execute searches with different --top-k or --adaptive settings to find out why results were missed.
Audit Mode: Freeze the retrieval context and replay with different LLMs using --audit --use-model --diff.
Key Features
| Feature | Description |
| --- | --- |
| Frozen Context | Replay with the exact same frames - no retrieval drift |
| Model A/B Testing | Compare GPT-4 vs Claude vs Gemini with identical input |
| Cost Tracking | Token counts and USD cost per query |
| Grounding Scores | Detect hallucination risk (0-100%) |
| Answer Caching | Skip redundant LLM calls, save money |
| Diff Reports | See exactly how answers changed |
Quick Example
# 1. Start recording
memvid session start knowledge.mv2 --name "Audit 2024-12"
# 2. Ask questions (tokens, cost, grounding tracked)
memvid ask knowledge.mv2 --question "What was the revenue?" --use-model openai:gpt-4o-mini
# tokens: 3112 + 42 = 3154 cost: $0.0005 grounding: 95% (HIGH)
# 3. End session
memvid session end knowledge.mv2
# Session ended. 5 actions recorded.
# 4. Replay with different model + diff
memvid session replay knowledge.mv2 --session <id> \
--audit --use-model claude:claude-4-sonnet --diff
# Diff: IDENTICAL ✓
How It Works
1. Start Recording
# CLI
memvid session start knowledge.mv2 --name "Audit Session"
# Python SDK
session_id = mem.session_start("Audit Session")
2. Record Operations
All operations are recorded with full context:
# Ask questions - frames, tokens, and answers are captured
memvid ask knowledge.mv2 --question "What was the acquisition price?" \
--use-model openai:gpt-4o-mini
# Output shows cost and grounding
# tokens: 3112 input + 19 output = 3131 cost: $0.000478
# grounding: 100% (HIGH) - 2/2 sentences grounded
3. End Session
memvid session end knowledge.mv2
# Output: Session ended. 12 actions recorded.
4. View Session Details
memvid session view knowledge.mv2 --session <session-id>
# Output:
# Actions:
# [0] FIND - Find { query: "acquisition price", mode: "Hybrid", result_count: 8 }
# [1] ASK - Ask { query: "What was the acquisition price?", provider: "openai", model: "gpt-4o-mini" }
Replay Modes
Debug Replay (Re-executes Search)
Standard replay re-runs retrieval to compare results:
memvid session replay knowledge.mv2 --session <id> --adaptive --verbose
Audit Replay (Frozen Context)
Audit mode uses the exact frames from the original session:
memvid session replay knowledge.mv2 --session <id> --audit
Output shows frozen frames:
✓ Step 3/12 ask
Question: "What was the acquisition price?"
Mode: AUDIT (frozen retrieval)
Original Model: openai:gpt-4o-mini
Frozen frames: [66, 68, 61, 170, 22, 67, 57, 0]
Context: VERIFIED (frames frozen)
Original Answer: "The acquisition was valued at $2 billion."
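The practical difference between the two modes is where the context frames come from. The sketch below is illustrative only - recorded_action, search_index, and fetch_frames are stand-ins written for this example, not part of the memvid API:
# Conceptual sketch of debug vs. audit replay (stand-in code, not the memvid implementation)
recorded_action = {
    "query": "What was the acquisition price?",
    "frozen_frame_ids": [66, 68, 61, 170, 22, 67, 57, 0],
    "top_k": 8,
}

def search_index(query: str, top_k: int) -> list[int]:
    # Stand-in for live retrieval: may return a different frame set on every run
    return list(range(top_k))

def fetch_frames(frame_ids: list[int]) -> list[int]:
    # Stand-in for loading the recorded frames verbatim
    return frame_ids

def replay_ask(action: dict, *, audit: bool, top_k: int | None = None) -> list[int]:
    if audit:
        # Audit mode: frozen context - the LLM sees exactly the recorded frames
        return fetch_frames(action["frozen_frame_ids"])
    # Debug mode: retrieval is re-executed, possibly with new parameters
    return search_index(action["query"], top_k=top_k or action["top_k"])

print(replay_ask(recorded_action, audit=True))             # frozen frames, reproducible
print(replay_ask(recorded_action, audit=False, top_k=12))  # fresh search, results may drift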
Model A/B Testing
Compare different models with identical context:
# Original used GPT-4o-mini, replay with Claude
memvid session replay knowledge.mv2 --session <id> \
--audit \
--use-model claude:claude-3-5-sonnet \
--diff
Output shows comparison:
✓ Step 3/12 ask
Question: "What was the acquisition price?"
Mode: AUDIT (frozen retrieval)
Original Model: openai:gpt-4o-mini
Frozen frames: [66, 68, 61, 170, 22, 67, 57, 0]
Override Model: claude:claude-3-5-sonnet
Original Answer: "The acquisition was valued at $2 billion."
Context: VERIFIED (frames frozen)
New Answer: "The acquisition price was $2B according to the documents."
Diff: CHANGED
Replay Options
| Option | Description |
| --- | --- |
| --audit | Freeze retrieval - use recorded frames instead of re-searching |
| --use-model <model> | Override the LLM model for comparison (e.g., openai:gpt-4o, gemini:gemini-2.5-flash) |
| --diff | Generate a diff report comparing original vs new answers |
| --adaptive | Enable adaptive retrieval (debug mode only) |
| --top-k N | Override the top-k value (debug mode only) |
| --skip-asks | Skip LLM operations during replay |
| --skip-finds | Skip search operations during replay |
| --from-checkpoint | Start replay from a specific checkpoint |
| --web | Launch the Time Machine web UI |
Token & Cost Tracking
Every ask operation tracks token usage and estimated cost:
memvid ask knowledge.mv2 --question "Summarize the report" \
--use-model openai:gpt-4o-mini --json
{
  "answer": "The report covers...",
  "usage": {
    "input_tokens": 3648,
    "output_tokens": 36,
    "total_tokens": 3684,
    "cost_usd": 0.000569
  },
  "grounding": {
    "score": 1.0,
    "label": "HIGH",
    "sentence_count": 1,
    "grounded_sentences": 1,
    "has_warning": false
  },
  "cached": false
}
Supported Models & Pricing (Dec 2025)
| Provider | Model | Input / 1M tokens | Output / 1M tokens |
| --- | --- | --- | --- |
| OpenAI | gpt-4o-mini | $0.15 | $0.60 |
| OpenAI | gpt-4o | $2.50 | $10.00 |
| OpenAI | gpt-4.5 | $75.00 | $150.00 |
| Claude | claude-3-haiku | $0.25 | $1.25 |
| Claude | claude-4-sonnet | $3.00 | $15.00 |
| Claude | claude-4-opus | $15.00 | $75.00 |
| Gemini | gemini-2.5-flash | $0.15 | $3.50 |
| Gemini | gemini-2.5-pro | $1.25 | $10.00 |
| xAI | grok-4 | $3.00 | $15.00 |
| Groq | llama-3.3-70b | $0.59 | $0.79 |
| Mistral | mistral-large | $0.50 | $1.50 |
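As a sanity check, the cost_usd value in the JSON example above can be reproduced from this table. The snippet below is a sketch - the prices are transcribed from the table, not read from memvid itself:
# Recompute cost_usd for the gpt-4o-mini example above from the pricing table
# (prices transcribed from the table, USD per 1M tokens - not fetched from memvid)
PER_MILLION = {
    "openai:gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude:claude-4-sonnet": {"input": 3.00, "output": 15.00},
}

def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PER_MILLION[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# 3648 * $0.15/1M + 36 * $0.60/1M ≈ $0.000569 - matches the JSON output above
print(round(estimate_cost_usd("openai:gpt-4o-mini", 3648, 36), 6))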
Grounding & Hallucination Detection
Every answer is scored for grounding - how well it’s supported by the retrieved context:
grounding: 100% (HIGH) - 2/2 sentences grounded
| Score | Label | Meaning |
| --- | --- | --- |
| 70-100% | HIGH | Well-grounded in context |
| 40-69% | MEDIUM | Partially grounded |
| 0-39% | LOW | Potential hallucination - warning shown |
When grounding is low, you’ll see a warning:
grounding: 25% (LOW) - 1/4 sentences grounded
⚠ Warning: Some statements may not be supported by context
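The labels map directly onto the score = grounded_sentences / sentence_count ratio reported in the JSON output. A minimal sketch of that mapping, with thresholds transcribed from the table above (how individual sentences are matched against the retrieved context is memvid's job and is not shown here):
# Map a grounding ratio to the HIGH/MEDIUM/LOW labels from the table above
# (thresholds transcribed from the table; sentence-level checking is done by memvid)
def grounding_label(grounded_sentences: int, sentence_count: int) -> tuple[float, str]:
    score = grounded_sentences / sentence_count
    if score >= 0.70:
        return score, "HIGH"
    if score >= 0.40:
        return score, "MEDIUM"
    return score, "LOW"  # potential hallucination - memvid prints a warning

print(grounding_label(2, 2))  # (1.0, 'HIGH') - "2/2 sentences grounded"
print(grounding_label(1, 4))  # (0.25, 'LOW') - the warning example above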
Answer Caching
Repeated questions with the same context return cached answers instantly:
# First call - hits LLM
memvid ask knowledge.mv2 --question "What is the revenue?" --use-model openai
# tokens: 2500 input + 50 output cost: $0.00042
# Second call - cached
memvid ask knowledge.mv2 --question "What is the revenue?" --use-model openai
# cached: true cost: $0.00 (saved $0.00042)
The cache key is based on: model + query + context hash
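Because the key combines all three ingredients, changing the model, the wording of the question, or the retrieved context forces a fresh LLM call. A minimal sketch of the idea (the exact key format memvid uses is not documented here, so treat this as illustrative):
import hashlib

# Illustrative cache key built from model + query + a hash of the retrieved context
# (the real key format used by memvid may differ; only the ingredients are documented)
def answer_cache_key(model: str, query: str, context_frames: list[str]) -> str:
    context_hash = hashlib.sha256("\n".join(context_frames).encode()).hexdigest()
    return hashlib.sha256(f"{model}|{query}|{context_hash}".encode()).hexdigest()

key = answer_cache_key("openai:gpt-4o-mini", "What is the revenue?", ["frame 66 text", "frame 68 text"])
print(key[:16])  # same model + query + context -> same key -> cached answer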
Use Case Examples
1. Debug Missing Results
# Record the failing scenario
memvid session start knowledge.mv2 --name "Missing Results Debug"
memvid ask knowledge.mv2 --question "What did Databricks purchase?" --use-model openai
memvid session end knowledge.mv2
# Replay with adaptive retrieval
memvid session replay knowledge.mv2 --session <id> --adaptive --verbose
# Reveals: Document existed at rank 12, adaptive found it
2. Compliance Audit Trail
# Record all decisions for audit
memvid session start knowledge.mv2 --name "Compliance Review 2024-12"
memvid ask knowledge.mv2 --question "Is this transaction fraudulent?" --use-model openai
memvid session end knowledge.mv2
# Later: Replay with frozen context to verify decision
memvid session replay knowledge.mv2 --session <id> --audit
# Shows exact frames and answer - reproducible for auditors
3. Model Comparison
# Ask with GPT-4o
memvid session start knowledge.mv2 --name "Model Comparison"
memvid ask knowledge.mv2 --question "Summarize the key findings" --use-model openai:gpt-4o
memvid session end knowledge.mv2
# Replay with different models to compare
memvid session replay knowledge.mv2 --session <id> --audit --use-model gemini:gemini-2.5-pro --diff
memvid session replay knowledge.mv2 --session <id> --audit --use-model claude:claude-4-sonnet --diff
CLI Commands Reference
| Command | Description |
| --- | --- |
| memvid session start <file> --name <name> | Start recording |
| memvid session end <file> | End recording and save |
| memvid session list <file> | List all sessions |
| memvid session view <file> --session <id> | View session details |
| memvid session replay <file> --session <id> | Replay a session |
| memvid session delete <file> --session <id> | Delete a session |
| memvid session checkpoint <file> | Create a checkpoint |
| memvid session compare <file> -a <id1> -b <id2> | Compare two sessions |
SDK Support
Python SDK (Full Support)
from memvid_sdk import create

mem = create('knowledge.mv2', enable_vec=True)

# Record session
session_id = mem.session_start("Audit Session")
result = mem.ask("What was the revenue?", model="openai:gpt-4o-mini")
print(f"Cost: ${result.usage.cost_usd:.6f}")
print(f"Grounding: {result.grounding.score:.0%}")
summary = mem.session_end()

# Replay with audit mode
replay = mem.session_replay(
    session_id,
    audit=True,
    use_model="claude:claude-4-sonnet",
    diff=True,
)

for action in replay.ask_results:
    print(f"Original: {action.original_answer}")
    print(f"New: {action.new_answer}")
    print(f"Diff: {action.diff_status}")
Node.js SDK
import { create } from '@anthropic/memvid-sdk';

const mem = await create('knowledge.mv2', { enableVec: true });

// Record session
const sessionId = await mem.sessionStart("Audit Session");
const result = await mem.ask("What was the revenue?", { model: "openai:gpt-4o-mini" });
console.log(`Cost: $${result.usage.costUsd.toFixed(6)}`);
await mem.sessionEnd();

// Replay
const replay = await mem.sessionReplay(sessionId, {
  audit: true,
  useModel: "gemini:gemini-2.5-flash",
  diff: true,
});
Best Practices
- Use descriptive session names: include the date and purpose, e.g., "Fraud Detection Audit 2024-12-27"
- Record minimal reproductions: capture just enough to reproduce the issue
- Use audit mode for compliance: frozen context ensures reproducibility
- Compare models with identical context: use --audit --use-model --diff for fair comparisons
- Monitor grounding scores: low scores indicate potential hallucination
Next Steps