File Structure
A .mv2 file is a single, self-contained binary with five main layers:
1. Header (4 KB)
The header contains:
- Magic bytes: Identifies the file as .mv2 format
- Version: File format version
- WAL metadata: Position and size of write-ahead log
- Footer offset: Points to the table of contents
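As a rough illustration of how a fixed-size header like this can be read, here is a minimal Python sketch. The field order, field widths, byte order, and the magic value `MV2\0` are all assumptions made for the example; the real .mv2 header layout is not specified here.

```python
import struct
from typing import NamedTuple

HEADER_SIZE = 4096  # the header occupies the first 4 KB of the file

# Hypothetical layout (NOT the real .mv2 layout): little-endian
# 4-byte magic, u32 version, u64 WAL offset, u64 WAL size, u64 footer offset.
HEADER_FORMAT = "<4sIQQQ"
HYPOTHETICAL_MAGIC = b"MV2\x00"  # placeholder value for illustration only


class Header(NamedTuple):
    version: int
    wal_offset: int
    wal_size: int
    footer_offset: int


def pack_header(header: Header) -> bytes:
    """Serialize the fields and zero-pad the result to HEADER_SIZE bytes."""
    fields = struct.pack(HEADER_FORMAT, HYPOTHETICAL_MAGIC, *header)
    return fields.ljust(HEADER_SIZE, b"\x00")


def parse_header(raw: bytes) -> Header:
    """Validate the magic bytes and decode the remaining fields."""
    if len(raw) < HEADER_SIZE:
        raise ValueError("header is truncated")
    magic, *rest = struct.unpack_from(HEADER_FORMAT, raw)
    if magic != HYPOTHETICAL_MAGIC:
        raise ValueError("bad magic bytes")
    return Header(*rest)
```

Checking the magic bytes before anything else lets a reader reject unrelated files early, before trusting any offsets stored in the header.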
2. Embedded WAL
The write-ahead log (WAL) ensures crash safety:
- All mutations are written to WAL first
- On recovery, uncommitted changes are replayed
- Size scales with file capacity (1 MB to 64 MB)
3. Segments (Frames)
Your actual data lives in segments, which contain frames (the fundamental unit of storage):
- Text segments: Document content and metadata stored as frames
- Blob segments: Binary data (images, PDFs) as frames
- Media segments: Audio and video content as frames
- Vector segments: Embeddings for semantic search (optional)
4. Indices
Memvid maintains multiple indices for fast search:
- Lexical index (BM25): Full-text keyword search - works out of the box
- Time index: Temporal ordering of frames
- Vector index: Semantic similarity search - optional, add when needed
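To make the lexical index more concrete, here is a minimal sketch of BM25 scoring in Python. It uses the standard Okapi BM25 formula with whitespace tokenization and the common defaults `k1 = 1.5` and `b = 0.75`; the tokenizer and parameters Memvid actually uses are not stated here, so treat these as assumptions.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with the Okapi BM25 formula."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed inverse document frequency (the "+ 1" keeps it non-negative).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Longer-than-average documents are penalized via b.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores
```

Documents that share no terms with the query score exactly zero, which is why lexical search is a good fit for exact keywords and a poor fit for paraphrases.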
5. Table of Contents + Footer
The TOC maps everything:
- Segment locations and sizes
- Index offsets
- Checksums for integrity verification
The footer is identified by a magic marker (MV2FOOT!).
Data Lifecycle
Writing Data
When you add documents:
- put() - Adds frames (documents) to pending state
- Indices updated - Lexical and vector indices are built
- Time entries queued - Timestamps recorded for timeline
- WAL appended - Transaction logged for crash safety
- seal() - Commits everything to disk with checksums
Reading Data
When you search or retrieve:
- Open file - Locate latest valid footer
- Load TOC - Map segments and indices
- Replay WAL - Apply any uncommitted changes
- Query indices - Search lexical/vector/time indices
- Return results - Ranked documents with snippets
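The first step, locating the latest valid footer, can be sketched as a backward scan for the MV2FOOT! marker. The footer layout below (marker, TOC offset, TOC length, CRC32 of the TOC) is a hypothetical one chosen for the example, not the documented format.

```python
import struct
import zlib
from typing import Optional, Tuple

FOOTER_MAGIC = b"MV2FOOT!"
# Hypothetical footer layout: magic, u64 TOC offset, u64 TOC length, u32 CRC32 of the TOC.
FOOTER_FORMAT = "<8sQQI"
FOOTER_SIZE = struct.calcsize(FOOTER_FORMAT)


def make_footer(toc_offset: int, toc: bytes) -> bytes:
    """Build a footer describing a TOC stored at toc_offset."""
    return struct.pack(FOOTER_FORMAT, FOOTER_MAGIC, toc_offset, len(toc), zlib.crc32(toc))


def find_latest_valid_footer(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (toc_offset, toc_length) for the last footer whose checksum matches."""
    end = len(data)
    while True:
        pos = data.rfind(FOOTER_MAGIC, 0, end)
        if pos == -1:
            return None
        if pos + FOOTER_SIZE <= len(data):
            _, toc_offset, toc_length, crc = struct.unpack_from(FOOTER_FORMAT, data, pos)
            toc = data[toc_offset:toc_offset + toc_length]
            if len(toc) == toc_length and zlib.crc32(toc) == crc:
                return toc_offset, toc_length
        # This candidate was torn or corrupt: keep searching earlier in the file.
        end = pos + len(FOOTER_MAGIC) - 1
```

Scanning from the end means an interrupted commit does not lose the file: the reader falls back to the previous footer that still verifies.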
Single-File Guarantee
Memvid’s core promise is single-file portability:
What It Means
- No sidecar files: No .wal, .lock, .shm files
- No external state: Everything is in the .mv2 file
- Portable: Copy the file to transfer the entire memory
Why It Matters
How It Works
Traditional databases use separate files for journals, locks, and indices. Memvid embeds all of these inside the .mv2 file:
| Traditional DB | Memvid |
|---|---|
| data.db + data.db-wal + data.db-shm | knowledge.mv2 |
| Requires careful copying | Just copy the file |
Crash Safety
The embedded WAL ensures data survives unexpected shutdowns.
Write-Ahead Logging
Every mutation is logged before being applied:
- Transaction written to WAL region
- WAL synced to disk (fsync)
- Changes applied to segments
- Checksum updated
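The ordering of these steps can be sketched in a few lines of Python. This is an illustration of the general write-ahead pattern, not Memvid's implementation: the record framing (length plus CRC32) is an assumption, and the sketch writes to a standalone file object, whereas Memvid keeps its WAL inside the .mv2 file.

```python
import os
import struct
import zlib

# Hypothetical record framing: u32 payload length, u32 CRC32, then the payload.
RECORD_HEADER = struct.Struct("<II")


def wal_append(wal_file, payload: bytes) -> None:
    """Append one record and force it to stable storage before returning."""
    wal_file.write(RECORD_HEADER.pack(len(payload), zlib.crc32(payload)))
    wal_file.write(payload)
    wal_file.flush()             # push Python's buffer to the OS
    os.fsync(wal_file.fileno())  # ask the OS to flush its cache to disk


def apply_mutation(wal_file, segments: list, payload: bytes) -> int:
    """Log first, apply second, then return the updated checksum."""
    wal_append(wal_file, payload)           # transaction written and synced
    segments.append(payload)                # change applied to segments
    return zlib.crc32(b"".join(segments))   # checksum updated
```

Because the record is durable before the segment changes, a crash between the two steps leaves enough information in the log to redo the mutation on the next open.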
Recovery Process
On open, Memvid:
- Locates the last valid footer
- Loads the table of contents
- Scans WAL for uncommitted entries
- Replays any pending transactions
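The WAL scan in the last two steps might look like the following sketch. The record framing (u32 length, u32 CRC32, payload) and the convention that a zero length marks unused space are assumptions for the example; the point is that the scan stops at the first record that fails verification.

```python
import struct
import zlib

RECORD_HEADER = struct.Struct("<II")  # hypothetical framing: u32 length, u32 CRC32


def encode_record(payload: bytes) -> bytes:
    """Frame a payload the way the hypothetical WAL stores it."""
    return RECORD_HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def replay_wal(wal: bytes) -> list:
    """Return every intact record, stopping at the first torn or corrupt one."""
    records = []
    pos = 0
    while pos + RECORD_HEADER.size <= len(wal):
        length, crc = RECORD_HEADER.unpack_from(wal, pos)
        if length == 0:
            break  # zero-filled space: end of the log (assumed convention)
        start = pos + RECORD_HEADER.size
        payload = wal[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # a crash interrupted this write; ignore it and what follows
        records.append(payload)
        pos = start + length
    return records
```

For example, `replay_wal(encode_record(b"a") + encode_record(b"bc")[:-1])` yields only the first record, because the second one was cut short.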
WAL Sizing
WAL size scales with file capacity:
| File Size | WAL Size |
|---|---|
| Under 100 MB | 1 MB |
| Under 1 GB | 4 MB |
| Under 10 GB | 16 MB |
| 10 GB or more | 64 MB |
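The table translates directly into a lookup function. The sketch below assumes binary units (1 MB = 1024 × 1024 bytes); the function name `wal_size_for` is hypothetical.

```python
MB = 1024 * 1024  # assumption: binary megabytes
GB = 1024 * MB


def wal_size_for(file_size_bytes: int) -> int:
    """Return the WAL size in bytes for a given file size, per the table above."""
    if file_size_bytes < 100 * MB:
        return 1 * MB
    if file_size_bytes < 1 * GB:
        return 4 * MB
    if file_size_bytes < 10 * GB:
        return 16 * MB
    return 64 * MB
```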
Locking and Concurrency
File Locking
Memvid uses OS-level file locks:
- Shared locks: Multiple readers allowed
- Exclusive locks: Single writer at a time
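The shared/exclusive behaviour can be demonstrated with advisory locks from Python's standard library. This sketch uses `fcntl.flock`, which is available on POSIX systems only; which locking primitive Memvid uses internally is not stated here, and the helper names are hypothetical.

```python
import fcntl


def try_lock(f, exclusive: bool) -> bool:
    """Try to take a shared or exclusive advisory lock without blocking."""
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    try:
        fcntl.flock(f, mode | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        return False  # another handle holds a conflicting lock


def unlock(f) -> None:
    """Release whatever lock this handle holds."""
    fcntl.flock(f, fcntl.LOCK_UN)
```

Any number of handles can hold the shared lock at once, but an exclusive request succeeds only when no other handle holds a lock of either kind.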
Read-Only Mode
For concurrent read access:
Writer Conflicts
If a writer holds the lock:
Determinism
Given the same inputs, Memvid produces the same outputs.
Why Determinism Matters
- Reproducible builds: Same data → same file
- Reliable testing: Predictable behavior
- Easy debugging: Consistent results
How It’s Achieved
- Segments written in deterministic order
- Timestamps explicit, not system-derived
- Checksums verify integrity
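A toy example shows why these three properties together give byte-identical output. The serialization format and the function name `build_image` are invented for illustration and are unrelated to the real .mv2 encoding.

```python
import hashlib
import struct


def build_image(frames: list) -> bytes:
    """Serialize (frame_id, timestamp, payload) tuples into a reproducible blob."""
    out = bytearray()
    # Deterministic order: sort instead of relying on insertion order.
    for frame_id, timestamp, payload in sorted(frames):
        key = frame_id.encode("utf-8")
        # The timestamp is supplied by the caller, never read from the clock.
        out += struct.pack("<HQI", len(key), timestamp, len(payload))
        out += key + payload
    # A trailing checksum lets a reader verify integrity.
    out += hashlib.sha256(bytes(out)).digest()
    return bytes(out)
```

Feeding the same frames in a different order produces the same bytes, so two builds can be compared with a plain file hash.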
Performance Considerations
Memory Usage
Memvid keeps some data in memory:
- Table of contents
- WAL handle
- Pending time entries
Reduce memory usage by:
- Closing handles when done
- Using read-only mode for queries
Index Building
Building indices is CPU-intensive:
- Lexical index: BM25 tokenization and indexing
- Vector index: Graph construction for similarity search
Search Optimization
- Lexical search: Fast for exact keywords
- Vector search: Slower, but matches by meaning rather than exact terms
- Hybrid search: Combines lexical and vector results to balance speed and relevance
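One common way to combine a lexical ranking with a vector ranking is reciprocal rank fusion (RRF). How Memvid merges the two result lists is not described here, so the sketch below is an illustration of the general idea under that assumption, with the conventional constant `k = 60`.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Merge several ranked lists of document IDs into one fused ranking."""
    scores: dict = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks higher.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable result.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

RRF only needs ranks, not raw scores, which sidesteps the problem that BM25 scores and vector similarities live on different scales.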