Understanding how Memvid stores data helps you make better decisions about ingestion, search, and performance optimization.

File Structure

A .mv2 file is a single, self-contained binary with five main layers:

1. Header (4 KB)

The header contains:
  • Magic bytes: Identifies the file as .mv2 format
  • Version: File format version
  • WAL metadata: Position and size of write-ahead log
  • Footer offset: Points to the table of contents
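
As a rough illustration of how a fixed-size header like this can be read, the sketch below packs and parses a 4 KB block with Python's struct module. The field order, field widths, the placeholder magic value, and the names HEADER_FORMAT, pack_header, and parse_header are all assumptions made for this example; the actual .mv2 byte layout is not specified on this page.

```python
import struct

HEADER_SIZE = 4096  # the header occupies 4 KB, as described above

# Hypothetical layout (NOT the real .mv2 layout): little-endian
# 4-byte magic, u16 version, u64 WAL offset, u64 WAL size, u64 footer offset.
HEADER_FORMAT = "<4sHQQQ"
MAGIC = b"MV2\x00"  # assumed placeholder value


def pack_header(version, wal_offset, wal_size, footer_offset):
    """Serialize the fields and zero-pad the result to HEADER_SIZE bytes."""
    fields = struct.pack(
        HEADER_FORMAT, MAGIC, version, wal_offset, wal_size, footer_offset
    )
    return fields.ljust(HEADER_SIZE, b"\x00")


def parse_header(raw):
    """Check the size and magic bytes, then return the fields as a dict."""
    if len(raw) != HEADER_SIZE:
        raise ValueError("header must be exactly 4096 bytes")
    used = struct.calcsize(HEADER_FORMAT)
    magic, version, wal_offset, wal_size, footer_offset = struct.unpack(
        HEADER_FORMAT, raw[:used]
    )
    if magic != MAGIC:
        raise ValueError("bad magic bytes")
    return {
        "version": version,
        "wal_offset": wal_offset,
        "wal_size": wal_size,
        "footer_offset": footer_offset,
    }
```

Checking the magic bytes first lets a reader reject a foreign file before trusting any of the offsets that follow.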

2. Embedded WAL

The write-ahead log (WAL) ensures crash safety:
  • All mutations are written to WAL first
  • On recovery, uncommitted changes are replayed
  • Size scales with file capacity (1 MB to 64 MB)

3. Segments (Frames)

Your actual data lives in segments, which contain frames (the fundamental unit of storage):
  • Text segments: Document content and metadata stored as frames
  • Blob segments: Binary data (images, PDFs) as frames
  • Media segments: Audio and video content as frames
  • Vector segments: Embeddings for semantic search (optional)
Each frame contains payload, metadata, timestamp, URI, and checksum. Segments are written in deterministic order for reproducibility.
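
To make the frame fields concrete, here is a minimal sketch of a frame record whose checksum covers its other fields. The Frame class, the helper names, and the choice of SHA-256 over a canonical JSON encoding are assumptions for illustration; this page does not say which checksum algorithm or encoding Memvid uses.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass
class Frame:
    payload: bytes
    metadata: dict
    timestamp: int
    uri: str
    checksum: str = ""


def frame_checksum(payload, metadata, timestamp, uri):
    """Hash the descriptive fields (canonical JSON) followed by the payload."""
    described = json.dumps(
        {"metadata": metadata, "timestamp": timestamp, "uri": uri},
        sort_keys=True,
        separators=(",", ":"),
    ).encode("utf-8")
    digest = hashlib.sha256()
    # Length prefix keeps the boundary between the two parts unambiguous.
    digest.update(len(described).to_bytes(8, "little"))
    digest.update(described)
    digest.update(payload)
    return digest.hexdigest()


def make_frame(payload, metadata, timestamp, uri):
    """Build a frame and stamp it with its checksum."""
    checksum = frame_checksum(payload, metadata, timestamp, uri)
    return Frame(payload, metadata, timestamp, uri, checksum)


def verify_frame(frame):
    """Return True if the stored checksum still matches the frame's contents."""
    expected = frame_checksum(
        frame.payload, frame.metadata, frame.timestamp, frame.uri
    )
    return frame.checksum == expected
```

Because the timestamp is passed in rather than read from the clock, the same inputs always produce the same checksum.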

4. Indices

Memvid maintains multiple indices for fast search:
  • Lexical index (BM25): Full-text keyword search - works out of the box
  • Time index: Temporal ordering of frames
  • Vector index: Semantic similarity search - optional, add when needed

5. Table of Contents (TOC) and Footer

The TOC maps everything:
  • Segment locations and sizes
  • Index offsets
  • Checksums for integrity verification
The footer contains a final checksum and magic trailer (MV2FOOT!).
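
The sketch below shows one way a reader could locate the last valid footer by scanning backward for the MV2FOOT! trailer and verifying a checksum. Only the trailer string comes from this page; the assumed footer layout (a 4-byte CRC32 of everything before the footer, followed by the trailer) and the function names are hypothetical.

```python
import struct
import zlib

TRAILER = b"MV2FOOT!"  # magic trailer named in the text above


def append_footer(body):
    """Append the assumed footer: CRC32 of the body, then the trailer."""
    return body + struct.pack("<I", zlib.crc32(body)) + TRAILER


def find_last_valid_footer(data):
    """Return the offset where the last valid footer starts, or None."""
    end = len(data)
    while True:
        pos = data.rfind(TRAILER, 0, end)
        if pos < 4:
            # No trailer left, or no room for a checksum in front of it.
            return None
        footer_start = pos - 4
        (stored,) = struct.unpack("<I", data[footer_start:pos])
        if stored == zlib.crc32(data[:footer_start]):
            return footer_start
        # Checksum mismatch: keep searching for an earlier trailer.
        end = pos + len(TRAILER) - 1
```

If a later write was torn, its trailer fails the checksum and the search falls back to the previous intact footer.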

Data Lifecycle

Writing Data

When you add documents:
  1. put() - Adds frames (documents) to pending state
  2. Indices updated - Lexical and vector indices are built
  3. Time entries queued - Timestamps recorded for timeline
  4. WAL appended - Transaction logged for crash safety
  5. seal() - Commits everything to disk with checksums
from memvid_sdk import use

mem = use('basic', 'knowledge.mv2')

# 1. Add documents (pending)
mem.put(title="Doc 1", label="docs", metadata={}, text="Your content")
mem.put(title="Report", label="docs", metadata={}, file="report.pdf")

# 2. Commit to disk
mem.seal()

Reading Data

When you search or retrieve:
  1. Open file - Locate latest valid footer
  2. Load TOC - Map segments and indices
  3. Replay WAL - Apply any uncommitted changes
  4. Query indices - Search lexical/vector/time indices
  5. Return results - Ranked documents with snippets

Single-File Guarantee

Memvid’s core promise is single-file portability:

What It Means

  • No sidecar files: No .wal, .lock, .shm files
  • No external state: Everything is in the .mv2 file
  • Portable: Copy the file to transfer the entire memory

Why It Matters

# Your entire knowledge base
ls ~/project/
# → knowledge.mv2

# Share it anywhere
cp knowledge.mv2 /team/shared/
scp knowledge.mv2 user@server:/data/
git add knowledge.mv2

How It Works

Traditional databases use separate files for journals, locks, and indices. Memvid embeds all of these inside the .mv2 file:
| Traditional DB | Memvid |
|---|---|
| data.db + data.db-wal + data.db-shm | knowledge.mv2 |
| Requires careful copying | Just copy the file |

Crash Safety

The embedded WAL ensures data survives unexpected shutdowns.

Write-Ahead Logging

Every mutation is logged before being applied:
  1. Transaction written to WAL region
  2. WAL synced to disk (fsync)
  3. Changes applied to segments
  4. Checksum updated
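
The ordering above (log, sync, then apply) can be sketched in a few lines. This is a generic write-ahead logging illustration, not Memvid's implementation: the record format ([length][crc32][JSON payload]), the function names, and the use of a standalone file object are assumptions. Memvid keeps its WAL inside the .mv2 file rather than in a separate file.

```python
import json
import os
import struct
import zlib


def wal_append(wal_file, record):
    """Append one record as [length][crc32][payload] and force it to disk."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    wal_file.write(struct.pack("<II", len(payload), zlib.crc32(payload)))
    wal_file.write(payload)
    wal_file.flush()             # push Python's buffer to the OS
    os.fsync(wal_file.fileno())  # ask the OS to push it to stable storage


def apply_mutation(state, record):
    """Apply a logged 'put' to the in-memory state (stand-in for segments)."""
    state[record["key"]] = record["value"]


def logged_put(wal_file, state, key, value):
    """Log the mutation durably first, and only then apply it."""
    record = {"op": "put", "key": key, "value": value}
    wal_append(wal_file, record)   # steps 1-2: write to the log, then fsync
    apply_mutation(state, record)  # step 3: apply the change
```

If the process dies between the two calls in logged_put, the record is already on disk and can be replayed on the next open.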

Recovery Process

On open, Memvid:
  1. Locates the last valid footer
  2. Loads the table of contents
  3. Scans WAL for uncommitted entries
  4. Replays any pending transactions
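Because every mutation is written to the WAL and synced before it is applied, logged transactions survive a crash or power failure and are restored the next time the file is opened.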
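
A replay loop for steps 3 and 4 might look like the sketch below. It assumes a hypothetical record format ([length][crc32][JSON payload]) and stops at the first record that is truncated or fails its checksum, which is a common way to handle a write that was interrupted mid-record. None of these names or formats are taken from Memvid itself.

```python
import json
import struct
import zlib

RECORD_HEADER = struct.Struct("<II")  # payload length, CRC32 of payload


def encode_record(record):
    """Serialize one record in the assumed [length][crc32][payload] format."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return RECORD_HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def replay(wal_bytes, state):
    """Apply intact records in order; return how many were applied."""
    offset = 0
    applied = 0
    while offset + RECORD_HEADER.size <= len(wal_bytes):
        length, crc = RECORD_HEADER.unpack_from(wal_bytes, offset)
        if length == 0:
            break  # zeroed space: treat as the end of the log
        start = offset + RECORD_HEADER.size
        payload = wal_bytes[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # torn or corrupt record: stop replaying here
        record = json.loads(payload)
        state[record["key"]] = record["value"]
        applied += 1
        offset = start + length
    return applied
```

Stopping at the first bad record keeps recovery conservative: nothing after a damaged entry is trusted.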

WAL Sizing

WAL size scales with file capacity:
| File Size | WAL Size |
|---|---|
| Under 100 MB | 1 MB |
| Under 1 GB | 4 MB |
| Under 10 GB | 16 MB |
| 10 GB or more | 64 MB |
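
The sizing table translates directly into a lookup function. The sketch below assumes binary units (1 MB = 1024 × 1024 bytes) and strict "under" boundaries; the function name wal_size_for is hypothetical and not part of the SDK.

```python
MB = 1024 * 1024  # assumed: binary megabytes
GB = 1024 * MB


def wal_size_for(file_size_bytes):
    """Return the WAL size in bytes for a given file size, per the table."""
    if file_size_bytes < 100 * MB:
        return 1 * MB
    if file_size_bytes < 1 * GB:
        return 4 * MB
    if file_size_bytes < 10 * GB:
        return 16 * MB
    return 64 * MB
```

For example, under these assumptions a 500 MB file falls in the second tier and gets a 4 MB WAL.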

Locking and Concurrency

File Locking

Memvid uses OS-level file locks:
  • Shared locks: Multiple readers allowed
  • Exclusive locks: Single writer at a time
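
To show how shared and exclusive locks interact, here is a small POSIX-only sketch using Python's fcntl.flock. Which locking primitive Memvid actually uses is not stated on this page, so treat the mechanism and the helper names try_lock and unlock as assumptions.

```python
import fcntl  # POSIX only; not available on Windows


def try_lock(file_obj, exclusive):
    """Try to take a lock without blocking; return True on success."""
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    try:
        fcntl.flock(file_obj.fileno(), mode | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        # Another open handle holds a conflicting lock.
        return False


def unlock(file_obj):
    """Release whatever lock this handle holds."""
    fcntl.flock(file_obj.fileno(), fcntl.LOCK_UN)
```

Any number of handles can hold a shared lock at once, but an exclusive request succeeds only when no other handle holds a lock of either kind.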

Read-Only Mode

For concurrent read access:
from memvid_sdk import use

# Multiple processes can read simultaneously
mem = use('basic', 'knowledge.mv2', read_only=True)
results = mem.find('query')

Writer Conflicts

If a writer holds the lock:
from memvid_sdk import use, LockedError

try:
    mem = use('basic', 'knowledge.mv2')
except LockedError:
    print("File is locked by another process")

Determinism

Given the same inputs, Memvid produces the same outputs.

Why Determinism Matters

  • Reproducible builds: Same data → same file
  • Reliable testing: Predictable behavior
  • Easy debugging: Consistent results

How It’s Achieved

  • Segments written in deterministic order
  • Timestamps explicit, not system-derived
  • Checksums verify integrity
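
The principle can be demonstrated with a toy builder: if frames are written in a fixed order and carry explicit timestamps, the output bytes (and therefore the digest) do not depend on arrival order or on the wall clock. The function names and the JSON-lines encoding below are assumptions for illustration and say nothing about the real .mv2 encoding.

```python
import hashlib
import json


def build_image(frames):
    """Serialize frames in a fixed order so the output is reproducible."""
    # Sort on explicit fields only; nothing is read from the system clock.
    ordered = sorted(frames, key=lambda f: (f["timestamp"], f["uri"]))
    out = bytearray()
    for frame in ordered:
        line = json.dumps(frame, sort_keys=True, separators=(",", ":"))
        out += line.encode("utf-8") + b"\n"
    return bytes(out)


def digest(image):
    """Checksum of the whole image, used to compare two builds."""
    return hashlib.sha256(image).hexdigest()
```

Two builds from the same frames can then be compared by digest alone, which is what makes reproducible builds easy to check in tests.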

Performance Considerations

Memory Usage

Memvid keeps some data in memory:
  • Table of contents
  • WAL handle
  • Pending time entries
For large files, consider:
  • Closing handles when done
  • Using read-only mode for queries

Index Building

Building indices is CPU-intensive:
  • Lexical index: BM25 tokenization and indexing
  • Vector index: Graph construction for similarity search
Use parallel ingestion for large datasets:
memvid put knowledge.mv2 --input ./large-dataset/ \
  --vector-compression \
  --parallel-segments

Search Optimization

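  • Lexical search: Fast; best for exact keywords and known terms
  • Vector search: Slower, but matches by meaning rather than exact wording
  • Hybrid search: Combines lexical and vector results
Choose the mode that fits the query: lexical for precise terms, vector for conceptual questions, and hybrid when you need both.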
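
For intuition about how two result lists can be merged, the sketch below uses reciprocal rank fusion (RRF), a common technique for combining rankings. This page does not describe how Memvid merges lexical and vector results, so RRF, the function name, and the constant k=60 are assumptions chosen for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Items near the top of any list contribute the most.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score (descending), breaking ties by ID for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

A document that ranks well in both lists rises above one that ranks first in only a single list.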
