What is Session Replay?
Record every put, find, and ask operation during an agent session, then replay it with different parameters, models, or frozen context for debugging and auditing.
Overview
Record
Start a session and perform operations (put, find, ask). Every action is captured with full context.
Save
End the session. Frames, results, answers, tokens, cost, and grounding scores are persisted.
Replay
Re-run the session with different parameters or frozen context to debug or audit.
Debug Mode: Re-execute searches with different --top-k or --adaptive settings to find out why results were missed.
Audit Mode: Freeze the retrieval context and replay with different LLMs using --audit --use-model --diff.
Key Features
| Feature | Description |
| --- | --- |
| Frozen Context | Replay with the exact same frames - no retrieval drift |
| Model A/B Testing | Compare GPT-4 vs Claude vs Gemini with identical input |
| Cost Tracking | Token counts and USD cost per query |
| Grounding Scores | Detect hallucination risk (0-100%) |
| Answer Caching | Skip redundant LLM calls, save money |
| Diff Reports | See exactly how answers changed |
Quick Example
# 1. Start recording
memvid session start knowledge.mv2 --name "Audit 2024-12"
# 2. Ask questions (tokens, cost, grounding tracked)
memvid ask knowledge.mv2 --question "What was the revenue?" --use-model openai:gpt-4o-mini
# tokens: 3112 + 42 = 3154 cost: $0.0005 grounding: 95% (HIGH)
# 3. End session
memvid session end knowledge.mv2
# Session ended. 5 actions recorded.
# 4. Replay with different model + diff
memvid session replay knowledge.mv2 --session <id> \
--audit --use-model claude:claude-4-sonnet --diff
# Diff: IDENTICAL ✓
How It Works
1. Start Recording
# CLI
memvid session start knowledge.mv2 --name "Audit Session"
# Python SDK
session_id = mem.session_start("Audit Session")
2. Record Operations
All operations are recorded with full context:
# Ask questions - frames, tokens, and answers are captured
memvid ask knowledge.mv2 --question "What was the acquisition price?" \
--use-model openai:gpt-4o-mini
# Output shows cost and grounding
# tokens: 3112 input + 19 output = 3131 cost: $0.000478
# grounding: 100% (HIGH) - 2/2 sentences grounded
3. End Session
memvid session end knowledge.mv2
# Output: Session ended. 12 actions recorded.
4. View Session Details
memvid session view knowledge.mv2 --session <session-id>
# Output:
# Actions:
# [0] FIND - Find { query: "acquisition price", mode: "Hybrid", result_count: 8 }
# [1] ASK - Ask { query: "What was the acquisition price?", provider: "openai", model: "gpt-4o-mini" }
Replay Modes
Debug Replay (Re-executes Search)
Standard replay re-runs retrieval to compare results:
memvid session replay knowledge.mv2 --session <id> --adaptive --verbose
Audit Replay (Frozen Context)
Audit mode uses the exact frames from the original session:
memvid session replay knowledge.mv2 --session <id> --audit
Output shows frozen frames:
✓ Step 3/12 ask
Question: "What was the acquisition price?"
Mode: AUDIT (frozen retrieval)
Original Model: openai:gpt-4o-mini
Frozen frames: [66, 68, 61, 170, 22, 67, 57, 0]
Context: VERIFIED (frames frozen)
Original Answer: "The acquisition was valued at $2 billion."
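The practical difference between the two modes is where the context frames come from. The sketch below is illustrative only - recorded_action, search_index, and fetch_frames are stand-ins written for this example, not part of the memvid API:
# Conceptual sketch of debug vs. audit replay (stand-in code, not the memvid implementation)
recorded_action = {
    "query": "What was the acquisition price?",
    "frozen_frame_ids": [66, 68, 61, 170, 22, 67, 57, 0],
    "top_k": 8,
}

def search_index(query: str, top_k: int) -> list[int]:
    # Stand-in for live retrieval: may return a different frame set on every run
    return list(range(top_k))

def fetch_frames(frame_ids: list[int]) -> list[int]:
    # Stand-in for loading the recorded frames verbatim
    return frame_ids

def replay_ask(action: dict, *, audit: bool, top_k: int | None = None) -> list[int]:
    if audit:
        # Audit mode: frozen context - the LLM sees exactly the recorded frames
        return fetch_frames(action["frozen_frame_ids"])
    # Debug mode: retrieval is re-executed, possibly with new parameters
    return search_index(action["query"], top_k=top_k or action["top_k"])

print(replay_ask(recorded_action, audit=True))             # frozen frames, reproducible
print(replay_ask(recorded_action, audit=False, top_k=12))  # fresh search, results may drift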
Model A/B Testing
Compare different models with identical context:
# Original used GPT-4o-mini, replay with Claude
memvid session replay knowledge.mv2 --session <id> \
--audit \
--use-model claude:claude-3-5-sonnet \
--diff
Output shows comparison:
✓ Step 3/12 ask
Question: "What was the acquisition price?"
Mode: AUDIT (frozen retrieval)
Original Model: openai:gpt-4o-mini
Frozen frames: [66, 68, 61, 170, 22, 67, 57, 0]
Override Model: claude:claude-3-5-sonnet
Original Answer: "The acquisition was valued at $2 billion."
Context: VERIFIED (frames frozen)
New Answer: "The acquisition price was $2B according to the documents."
Diff: CHANGED
Replay Options
| Option | Description |
| --- | --- |
| --audit | Freeze retrieval - use recorded frames instead of re-searching |
| --use-model <model> | Override the LLM model for comparison (e.g., openai:gpt-4o, gemini:gemini-2.5-flash) |
| --diff | Generate a diff report comparing original vs new answers |
| --adaptive | Enable adaptive retrieval (debug mode only) |
| --top-k N | Override the top-k value (debug mode only) |
| --skip-asks | Skip LLM operations during replay |
| --skip-finds | Skip search operations during replay |
| --from-checkpoint | Start replay from a specific checkpoint |
| --web | Launch the Time Machine web UI |
Token & Cost Tracking
Every ask operation tracks token usage and estimated cost:
memvid ask knowledge.mv2 --question "Summarize the report" \
--use-model openai:gpt-4o-mini --json
{
  "answer": "The report covers...",
  "usage": {
    "input_tokens": 3648,
    "output_tokens": 36,
    "total_tokens": 3684,
    "cost_usd": 0.000569
  },
  "grounding": {
    "score": 1.0,
    "label": "HIGH",
    "sentence_count": 1,
    "grounded_sentences": 1,
    "has_warning": false
  },
  "cached": false
}
Supported Models & Pricing (Dec 2025)
| Provider | Model | Input / 1M tokens | Output / 1M tokens |
| --- | --- | --- | --- |
| OpenAI | gpt-4o-mini | $0.15 | $0.60 |
| OpenAI | gpt-4o | $2.50 | $10.00 |
| OpenAI | gpt-4.5 | $75.00 | $150.00 |
| Claude | claude-3-haiku | $0.25 | $1.25 |
| Claude | claude-4-sonnet | $3.00 | $15.00 |
| Claude | claude-4-opus | $15.00 | $75.00 |
| Gemini | gemini-2.5-flash | $0.15 | $3.50 |
| Gemini | gemini-2.5-pro | $1.25 | $10.00 |
| xAI | grok-4 | $3.00 | $15.00 |
| Groq | llama-3.3-70b | $0.59 | $0.79 |
| Mistral | mistral-large | $0.50 | $1.50 |
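As a sanity check, the cost_usd value in the JSON example above can be reproduced from this table. The snippet below is a sketch - the prices are transcribed from the table, not read from memvid itself:
# Recompute cost_usd for the gpt-4o-mini example above from the pricing table
# (prices transcribed from the table, USD per 1M tokens - not fetched from memvid)
PER_MILLION = {
    "openai:gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude:claude-4-sonnet": {"input": 3.00, "output": 15.00},
}

def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PER_MILLION[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# 3648 * $0.15/1M + 36 * $0.60/1M ≈ $0.000569 - matches the JSON output above
print(round(estimate_cost_usd("openai:gpt-4o-mini", 3648, 36), 6))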
Grounding & Hallucination Detection
Every answer is scored for grounding - how well it’s supported by the retrieved context:
grounding: 100% (HIGH) - 2/2 sentences grounded
| Score | Label | Meaning |
| --- | --- | --- |
| 70-100% | HIGH | Well-grounded in context |
| 40-69% | MEDIUM | Partially grounded |
| 0-39% | LOW | Potential hallucination - warning shown |
When grounding is low, you’ll see a warning:
grounding: 25% (LOW) - 1/4 sentences grounded
⚠ Warning: Some statements may not be supported by context
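The labels map directly onto the score = grounded_sentences / sentence_count ratio reported in the JSON output. A minimal sketch of that mapping, with thresholds transcribed from the table above (how individual sentences are matched against the retrieved context is memvid's job and is not shown here):
# Map a grounding ratio to the HIGH/MEDIUM/LOW labels from the table above
# (thresholds transcribed from the table; sentence-level checking is done by memvid)
def grounding_label(grounded_sentences: int, sentence_count: int) -> tuple[float, str]:
    score = grounded_sentences / sentence_count
    if score >= 0.70:
        return score, "HIGH"
    if score >= 0.40:
        return score, "MEDIUM"
    return score, "LOW"  # potential hallucination - memvid prints a warning

print(grounding_label(2, 2))  # (1.0, 'HIGH') - "2/2 sentences grounded"
print(grounding_label(1, 4))  # (0.25, 'LOW') - the warning example above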
Answer Caching
Repeated questions with the same context return cached answers instantly:
# First call - hits LLM
memvid ask knowledge.mv2 --question "What is the revenue?" --use-model openai
# tokens: 2500 input + 50 output cost: $0.00042
# Second call - cached
memvid ask knowledge.mv2 --question "What is the revenue?" --use-model openai
# cached: true cost: $0.00 (saved $0.00042)
The cache key is based on: model + query + context hash
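Because the key combines all three ingredients, changing the model, the wording of the question, or the retrieved context forces a fresh LLM call. A minimal sketch of the idea (the exact key format memvid uses is not documented here, so treat this as illustrative):
import hashlib

# Illustrative cache key built from model + query + a hash of the retrieved context
# (the real key format used by memvid may differ; only the ingredients are documented)
def answer_cache_key(model: str, query: str, context_frames: list[str]) -> str:
    context_hash = hashlib.sha256("\n".join(context_frames).encode()).hexdigest()
    return hashlib.sha256(f"{model}|{query}|{context_hash}".encode()).hexdigest()

key = answer_cache_key("openai:gpt-4o-mini", "What is the revenue?", ["frame 66 text", "frame 68 text"])
print(key[:16])  # same model + query + context -> same key -> cached answer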
Use Case Examples
1. Debug Missing Results
# Record the failing scenario
memvid session start knowledge.mv2 --name "Missing Results Debug"
memvid ask knowledge.mv2 --question "What did Databricks purchase?" --use-model openai
memvid session end knowledge.mv2
# Replay with adaptive retrieval
memvid session replay knowledge.mv2 --session <id> --adaptive --verbose
# Reveals: Document existed at rank 12, adaptive found it
2. Compliance Audit Trail
# Record all decisions for audit
memvid session start knowledge.mv2 --name "Compliance Review 2024-12"
memvid ask knowledge.mv2 --question "Is this transaction fraudulent?" --use-model openai
memvid session end knowledge.mv2
# Later: Replay with frozen context to verify decision
memvid session replay knowledge.mv2 --session <id> --audit
# Shows exact frames and answer - reproducible for auditors
3. Model Comparison
# Ask with GPT-4o
memvid session start knowledge.mv2 --name "Model Comparison"
memvid ask knowledge.mv2 --question "Summarize the key findings" --use-model openai:gpt-4o
memvid session end knowledge.mv2
# Replay with different models to compare
memvid session replay knowledge.mv2 --session <id> --audit --use-model gemini:gemini-2.5-pro --diff
memvid session replay knowledge.mv2 --session <id> --audit --use-model claude:claude-4-sonnet --diff
CLI Commands Reference
| Command | Description |
| --- | --- |
| memvid session start <file> --name <name> | Start recording |
| memvid session end <file> | End recording and save |
| memvid session list <file> | List all sessions |
| memvid session view <file> --session <id> | View session details |
| memvid session replay <file> --session <id> | Replay a session |
| memvid session delete <file> --session <id> | Delete a session |
| memvid session checkpoint <file> | Create a checkpoint |
| memvid session compare <file> -a <id1> -b <id2> | Compare two sessions |
SDK Support
Python SDK (Full Support)
from memvid_sdk import create

mem = create('knowledge.mv2', enable_vec=True)

# Record session
session_id = mem.session_start("Audit Session")
result = mem.ask("What was the revenue?", model="openai:gpt-4o-mini")
print(f"Cost: ${result.usage.cost_usd:.6f}")
print(f"Grounding: {result.grounding.score:.0%}")
summary = mem.session_end()

# Replay with audit mode
replay = mem.session_replay(
    session_id,
    audit=True,
    use_model="claude:claude-4-sonnet",
    diff=True,
)

for action in replay.ask_results:
    print(f"Original: {action.original_answer}")
    print(f"New: {action.new_answer}")
    print(f"Diff: {action.diff_status}")
Node.js SDK
import { create } from '@anthropic/memvid-sdk';

const mem = await create('knowledge.mv2', { enableVec: true });

// Record session
const sessionId = await mem.sessionStart("Audit Session");
const result = await mem.ask("What was the revenue?", { model: "openai:gpt-4o-mini" });
console.log(`Cost: $${result.usage.costUsd.toFixed(6)}`);
await mem.sessionEnd();

// Replay
const replay = await mem.sessionReplay(sessionId, {
  audit: true,
  useModel: "gemini:gemini-2.5-flash",
  diff: true,
});
Best Practices
- Use descriptive session names: include the date and purpose, e.g., "Fraud Detection Audit 2024-12-27"
- Record minimal reproductions: capture just enough to reproduce the issue
- Use audit mode for compliance: frozen context ensures reproducibility
- Compare models with identical context: use --audit --use-model --diff for fair comparisons
- Monitor grounding scores: low scores indicate potential hallucination
Next Steps