Core Design Principles
1. Single-File Guarantee
Every .mv2 file is completely self-contained:
- No sidecars - Never creates .wal, .shm, .lock, or journal files
- Fully portable - Copy, move, or share the file freely
- No database - No external services required
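One way to check this guarantee from the outside is to look for sibling files next to the .mv2 file. The sketch below is a minimal, hypothetical helper: `find_sidecars` is not part of Memvid, and the suffix list is an assumption (the dotted suffixes come from the list above, the dashed ones mirror how SQLite names its auxiliary files).

```python
from pathlib import Path

# Assumed sidecar suffixes: the dotted ones are named in the text,
# the dashed ones follow SQLite's naming and are included as a precaution.
SIDECAR_SUFFIXES = (".wal", ".shm", ".lock", ".journal", "-wal", "-shm", "-journal")

def find_sidecars(mv2_path):
    """Return sibling files that look like sidecars of the given .mv2 file."""
    path = Path(mv2_path)
    hits = []
    for sibling in path.parent.iterdir():
        if not sibling.is_file():
            continue
        name = sibling.name
        # A sidecar shares the file's name (or stem) and ends in a known suffix.
        if name.startswith((path.name, path.stem)) and name.endswith(SIDECAR_SUFFIXES):
            hits.append(sibling)
    return sorted(hits)
```

An empty result after a write-heavy session is consistent with the single-file guarantee; a non-empty one lists the files to investigate.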
2. Crash Safety
The embedded Write-Ahead Log (WAL) ensures data integrity:
- Writes go to the WAL first, then to permanent storage
- Automatic recovery on file open after crashes
- Recovery completes in under 250ms even for large files
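The write-then-recover pattern can be illustrated with a toy log. This is a sketch of the general write-ahead technique, not Memvid's actual record format: the layout (length, CRC32, payload) and the function names are assumptions, and CRC32 stands in for whatever per-record checksum the real WAL uses.

```python
import struct
import zlib

# Hypothetical record layout: 4-byte length, 4-byte CRC32, then the payload.
_HEADER = struct.Struct("<II")

def wal_append(wal: bytearray, payload: bytes) -> None:
    """Log the payload before anything is written to permanent storage."""
    wal += _HEADER.pack(len(payload), zlib.crc32(payload)) + payload

def wal_replay(wal: bytes) -> list:
    """Recovery: return every intact record, stopping at the first torn one."""
    records, pos = [], 0
    while pos + _HEADER.size <= len(wal):
        length, crc = _HEADER.unpack_from(wal, pos)
        start = pos + _HEADER.size
        payload = bytes(wal[start:start + length])
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # incomplete or corrupted tail, e.g. left by a crash mid-write
        records.append(payload)
        pos = start + length
    return records
```

Because each record carries its own checksum, a crash in the middle of an append leaves a tail that replay can detect and ignore, while every earlier record is still recovered.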
3. Determinism
Same inputs produce identical bytes on the same platform:
- Reproducible builds for testing and QA
- Verifiable file integrity with checksums
- Predictable behavior across runs
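Determinism of this kind can be tested by building the same file twice and comparing digests. The helper names below are hypothetical, and `hashlib.blake2b` is used as a stand-in because BLAKE3 is not in the Python standard library.

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """Hex digest of a file, read in chunks so large files are not loaded whole."""
    h = hashlib.blake2b(digest_size=32)  # stand-in for BLAKE3 (not in the stdlib)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def same_bytes(path_a, path_b):
    """True when two files have identical digests, i.e. identical bytes."""
    return file_digest(path_a) == file_digest(path_b)
```

In a QA setting, two builds from the same inputs on the same platform would be expected to make `same_bytes` return `True`.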
4. Performance
Optimized for fast search and retrieval:
- Search latency: ~5ms for 50K documents
- Cold start: under 200ms
- WAL append: under 0.1ms per write
File Layout
The .mv2 file format has a well-defined structure:
Header
The 4 KB header contains:
| Field | Description |
|---|---|
| Magic | MV2 identifier |
| Version | File format version |
| WAL Offset | Start of embedded WAL region |
| WAL Size | Size of WAL ring buffer |
| Checkpoint Position | Last committed WAL position |
| TOC Checksum | BLAKE3 hash for integrity |
Embedded WAL
The WAL is sized based on total file capacity:
| File Size | WAL Size |
|---|---|
| Under 100 MB | 1 MB |
| Under 1 GB | 4 MB |
| Under 10 GB | 16 MB |
| 10 GB or more | 64 MB |
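The sizing table maps directly onto a small lookup function. This is a sketch of the table only, not Memvid's code: `wal_size_for` is a hypothetical name, and treating MB and GB as binary units (1024-based) is an assumption, since the table does not say which is meant.

```python
MB = 1024 * 1024  # assumption: binary megabytes
GB = 1024 * MB

def wal_size_for(file_size_bytes: int) -> int:
    """Return the WAL size in bytes for a given file size, following the table."""
    if file_size_bytes < 100 * MB:
        return 1 * MB
    if file_size_bytes < 1 * GB:
        return 4 * MB
    if file_size_bytes < 10 * GB:
        return 16 * MB
    return 64 * MB
```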
A checkpoint is triggered when any of the following occurs:
- WAL reaches 75% capacity
- User calls seal()
- Every 1,000 transactions
- Clean shutdown
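Reading the four conditions above as triggers for a WAL checkpoint, they can be combined into a single predicate. The function and parameter names are hypothetical, and how Memvid tracks these counters internally is not specified here.

```python
def should_checkpoint(wal_used, wal_size, txns_since_checkpoint,
                      seal_requested=False, shutting_down=False):
    """Return True when any of the four listed conditions holds."""
    wal_nearly_full = wal_used * 4 >= wal_size * 3   # 75% capacity, integer math
    txn_limit_hit = txns_since_checkpoint >= 1000    # every 1,000 transactions
    return wal_nearly_full or seal_requested or txn_limit_hit or shutting_down
```

For a 1 MB WAL, the capacity condition alone fires once 768 KB of the ring buffer is in use.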
Frames
Frames are the fundamental unit of storage. Each frame contains:
- Payload - The actual content (text, binary, media)
- Metadata - Title, URI, timestamps, tags, labels
- Checksum - BLAKE3 hash for verification
- Encoding - Plain or Zstd compressed
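As a rough mental model, a frame can be pictured as a record with these four parts. The sketch below is illustrative only and does not reflect the on-disk layout: the class and function names are hypothetical, `zlib` stands in for Zstd, `hashlib.blake2b` stands in for BLAKE3, and checksumming the stored (possibly compressed) bytes is an assumption.

```python
import hashlib
import zlib
from dataclasses import dataclass, field

def _digest(data: bytes) -> str:
    # Stand-in for BLAKE3, which is not in the standard library.
    return hashlib.blake2b(data, digest_size=32).hexdigest()

@dataclass
class Frame:
    payload: bytes                                 # stored bytes, compressed if encoding says so
    metadata: dict = field(default_factory=dict)   # title, URI, timestamps, tags, labels
    encoding: str = "plain"                        # "plain" or "zlib" (stand-in for Zstd)
    checksum: str = ""                             # digest of the stored payload

def make_frame(content: bytes, metadata=None, compress=False) -> Frame:
    """Build a frame, optionally compressing the payload before checksumming."""
    stored = zlib.compress(content) if compress else content
    return Frame(payload=stored, metadata=dict(metadata or {}),
                 encoding="zlib" if compress else "plain",
                 checksum=_digest(stored))

def read_frame(frame: Frame) -> bytes:
    """Verify the checksum, then return the original content."""
    if _digest(frame.payload) != frame.checksum:
        raise ValueError("frame checksum mismatch")
    return zlib.decompress(frame.payload) if frame.encoding == "zlib" else frame.payload
```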
Search Architecture
Memvid supports three search modes:
Lexical Search (BM25)
Fast keyword search using BM25 ranking:
- Full-text search with term frequency scoring
- Date range filters: date:[2024-01-01 TO 2024-12-31]
- Tokenization and stemming
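For readers unfamiliar with BM25, the scoring can be sketched in a few lines. This is the textbook Okapi BM25 formula with common default parameters (`k1=1.5`, `b=0.75`), not Memvid's implementation; it tokenizes on whitespace only and omits the stemming and date filters mentioned above.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Documents that never mention a query term score zero, and repeated occurrences raise the score with diminishing returns.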
Vector Search
Semantic similarity search using embeddings:
- Fast approximate nearest neighbor search
- Optional Product Quantization (PQ) for 16x compression
- Configurable embedding models
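To see where a figure like 16x can come from, it helps to count bytes. The sketch below shows one configuration that yields that ratio; the 384-dimension float32 embedding and the choice of one 8-bit code per four dimensions are assumptions for illustration, not documented Memvid settings.

```python
def pq_compression_ratio(dim, n_subvectors, bits_per_code=8, bytes_per_float=4):
    """Ratio of raw vector size to PQ code size for a single embedding."""
    raw_bytes = dim * bytes_per_float               # uncompressed float32 vector
    code_bytes = n_subvectors * bits_per_code / 8   # one code per subvector
    return raw_bytes / code_bytes

# Assumed example: 384-dim float32 vector (1536 bytes) split into 96 subvectors,
# each replaced by a 1-byte codebook index (96 bytes total).
ratio = pq_compression_ratio(dim=384, n_subvectors=96)
```

The trade-off is that distances are then computed against codebook centroids rather than the original vectors, so recall can drop slightly in exchange for the smaller index.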
Hybrid Search
Combines both approaches:
- Run lexical search for keyword matches
- Run vector search for semantic similarity
- Merge and rerank results
- Return top-k hits
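The merge step can be done in several ways, and the text does not say which one Memvid uses. As an illustration, the sketch below uses reciprocal rank fusion (RRF), a common technique for combining ranked lists; the function name and the constant `k=60` are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60, top_k=10):
    """Merge several ranked lists of document IDs into one top-k list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks higher.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for a stable result.
    ordered = sorted(scores, key=lambda d: (-scores[d], str(d)))
    return ordered[:top_k]

lexical_hits = ["a", "b", "c"]   # hypothetical output of the lexical search
vector_hits = ["b", "d", "a"]    # hypothetical output of the vector search
merged = reciprocal_rank_fusion([lexical_hits, vector_hits], top_k=4)
```

Documents that appear in both lists rise to the top, which is the behavior a hybrid search is after.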
Developer Walkthrough
Here’s how to work with Memvid in practice:
Using the CLI
Using the Python SDK
Using the Node.js SDK
Verification and Repair
Memvid includes built-in tools for file health:
Verify
Check file integrity without modification:
Doctor
Diagnose and repair issues:
Single-File Check
Ensure no auxiliary files were created:
Checksums and Integrity
Defense in depth with cascading checksums:
| Level | What’s Checked |
|---|---|
| Header | TOC checksum (BLAKE3) |
| WAL Records | Per-record checksum |
| Index Segments | Per-segment checksum |
| Frames | Per-frame payload checksum |
Next Steps
- File Format Details - Deep dive into the MV2 structure
- CLI Commands - Complete CLI reference
- Python SDK - Python bindings guide
- Node.js SDK - Node.js bindings guide