Glossary - Memvid

This glossary provides definitions for all key terms, components, and concepts in the Memvid ecosystem. Whether you’re just getting started or diving deep into the architecture, this reference will help you understand how everything fits together.

Architecture Components

Memvid Core

The heart of Memvid, written in Rust. memvid-core is the foundational library that implements all core functionality:

File format handling - Reading/writing .mv2 files
Indexing engines - Lexical (Tantivy), vector (HNSW), and hybrid search
WAL management - Write-ahead logging for crash safety
Enrichment pipeline - Background processing for embeddings and extraction
Memory management - Frame storage, versioning, and lifecycle

The core is compiled to native binaries and exposed through language bindings (Node.js, Python) for cross-platform use.

CLI (Command Line Interface)

The memvid CLI tool provides direct access to all Memvid operations from your terminal:

# Create a new memory file
memvid create my-knowledge.mv2

# Add content
memvid put my-knowledge.mv2 --input document.pdf

# Search
memvid find my-knowledge.mv2 --query "machine learning"

# Ask questions
memvid ask my-knowledge.mv2 --question "What is the main thesis?"

The CLI is ideal for scripting, automation, and quick interactions without writing code.

SDKs (Software Development Kits)

Language-specific libraries that wrap the Memvid core for seamless integration:

Node.js SDK

@memvid/sdk - Native N-API bindings for Node.js applications

Python SDK

memvid-sdk - PyO3 bindings for Python applications

Both SDKs provide identical APIs:

create() / use() - Create or open memory files
put() / put_many() - Insert documents
find() - Search with various modes
ask() - AI-powered Q&A
timeline() - Browse insertion history

File Format

MV2 (Memory File)

The .mv2 file extension represents a Memvid memory file. It’s a single, self-contained binary file that stores:

All your documents and data (frames)
Search indices (lexical, vector, temporal)
Metadata and checksums
Write-ahead log for crash recovery

Key characteristics:

Portable - Copy, move, or share as a single file
Serverless - No database server required
Crash-safe - WAL ensures data integrity
Deterministic - Reproducible builds for verification

MV2E (Encrypted Memory File)

The .mv2e extension indicates an encrypted memory file (Capsule). It uses:

Argon2 for password-based key derivation
AES-GCM for authenticated encryption

Requires a password to open; transparent once unlocked.

Core Concepts

Frame

The atomic unit of data in Memvid. Every piece of content you store becomes a frame. Properties:

Property	Description
`frame_id`	Unique monotonic identifier (u64)
`content`	The actual text/data stored
`metadata`	Tags, timestamps, source info
`status`	Active, Superseded, or Deleted
`role`	Document, DocumentChunk, or ExtractedImage

Lifecycle:

Insert - Frame created with frame_id, status = Active
Update - Original marked Superseded, new frame created
Delete - Status changed to Deleted (soft delete)

Frames are immutable once committed - updates create new frames.
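
The lifecycle above can be sketched as an append-only log. This is a minimal in-memory model, not the memvid-core implementation: the FrameLog class and its method names are hypothetical, and a plain number stands in for the u64 frame_id. Content is never edited in place; only the status field changes.

```javascript
// Hypothetical in-memory model of the frame lifecycle (illustration only).
class FrameLog {
  constructor() {
    this.frames = []; // append-only list of frames
    this.nextId = 0;  // monotonic frame_id counter
  }

  // Insert: create a new Active frame and return its id.
  insert(content) {
    const frame = { frame_id: this.nextId++, content, status: "Active" };
    this.frames.push(frame);
    return frame.frame_id;
  }

  // Update: mark the original Superseded and append a replacement frame.
  update(frameId, content) {
    this.requireActive(frameId).status = "Superseded";
    return this.insert(content);
  }

  // Delete: soft delete, so the frame stays in the log.
  remove(frameId) {
    this.requireActive(frameId).status = "Deleted";
  }

  requireActive(frameId) {
    const frame = this.frames[frameId];
    if (!frame || frame.status !== "Active") {
      throw new Error(`frame ${frameId} is not active`);
    }
    return frame;
  }

  // Frames that are currently visible.
  active() {
    return this.frames.filter((f) => f.status === "Active");
  }
}

const log = new FrameLog();
const first = log.insert("draft");          // frame 0, Active
const second = log.update(first, "final");  // frame 0 Superseded, frame 1 Active
log.remove(second);                         // frame 1 Deleted, still stored
```

Because nothing is overwritten, earlier versions of a document remain available for inspection after an update or delete.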

Memory

A “memory” in Memvid refers to the runtime instance managing an .mv2 file. When you open a memory file, you get a Memory object that handles:

Reading and writing frames
Managing indices
Coordinating search
Handling commits and checkpoints

// Open a memory
const mem = await memvid.use("knowledge.mv2");

// The 'mem' object is your Memory instance
await mem.put({ text: "Hello world" });
await mem.find({ query: "hello" });

Commit

The process of persisting pending changes to the .mv2 file:

WAL entries written to disk
Indices updated
Footer rewritten with new checksums
File synced to storage

Commits happen automatically on close, or can be triggered manually for durability guarantees.

Checkpoint

A checkpoint purges committed WAL entries and updates the header:

Frees WAL space for new writes
Marks transactions as permanently durable
Triggered automatically when WAL reaches 75% capacity
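
As a rough illustration of the 75% trigger, the check below compares used WAL bytes against capacity. The needsCheckpoint helper is hypothetical, and treating "reaches 75%" as greater-than-or-equal is an assumption.

```javascript
const CHECKPOINT_RATIO = 0.75; // from the text: checkpoint at 75% of WAL capacity

// Hypothetical helper: true once used WAL space reaches the threshold.
function needsCheckpoint(usedBytes, capacityBytes) {
  return usedBytes >= capacityBytes * CHECKPOINT_RATIO;
}

// Thresholds for the smallest and largest WAL sizes this glossary mentions.
const KB = 1024;
const MB = 1024 * KB;
console.log(needsCheckpoint(48 * KB, 64 * KB)); // true: exactly 75% of 64 KB
console.log(needsCheckpoint(47 * MB, 64 * MB)); // false: below the 48 MB threshold
```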

Search & Retrieval

Lexical Search

Traditional keyword-based search using the BM25 ranking algorithm:

How it works: Builds an inverted index of terms, scores documents by term frequency and inverse document frequency
Best for: Exact matches, specific keywords, technical terms
Engine: Tantivy (Rust full-text search library)

memvid find data.mv2 --query "API authentication" --mode lex
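
To make the scoring idea concrete, here is a small BM25 sketch in plain JavaScript. It does not call Tantivy or Memvid; the tokenizer, the function names, and the k1 = 1.2 and b = 0.75 parameters are assumptions chosen because they are common defaults.

```javascript
// Lowercase and split on non-alphanumeric characters (simplistic tokenizer).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Returns one BM25 score per document for the given query.
function bm25Scores(docs, query, k1 = 1.2, b = 0.75) {
  const tokenized = docs.map(tokenize);
  const avgLen = tokenized.reduce((sum, d) => sum + d.length, 0) / docs.length;

  // Document frequency: number of documents containing each term.
  const df = new Map();
  for (const doc of tokenized) {
    for (const term of new Set(doc)) df.set(term, (df.get(term) ?? 0) + 1);
  }

  return tokenized.map((doc) => {
    let score = 0;
    for (const term of new Set(tokenize(query))) {
      const tf = doc.filter((t) => t === term).length; // term frequency
      if (tf === 0) continue;
      const n = df.get(term);
      const idf = Math.log(1 + (docs.length - n + 0.5) / (n + 0.5));
      // Length normalization penalizes long documents.
      score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * doc.length) / avgLen));
    }
    return score;
  });
}

const docs = [
  "API authentication with tokens",
  "Cooking pasta at home",
  "Authentication best practices for any API",
];
const scores = bm25Scores(docs, "API authentication");
console.log(scores);
```

The second document shares no terms with the query and scores zero. The first outranks the third because both match the same terms but the first is shorter.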

Semantic Search (Vector Search)

Meaning-based search using embeddings:

How it works: Converts text to vector embeddings, finds similar vectors using cosine similarity
Best for: Conceptual queries, finding related content, natural language questions
Index: HNSW (Hierarchical Navigable Small World) graph

memvid find data.mv2 --query "how to secure endpoints" --mode vec
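
The similarity step can be illustrated with a brute-force scan. The vectors below are toy three-dimensional values rather than real embeddings, and the function names are hypothetical. An HNSW index approximates the same top-k result without comparing the query to every stored vector.

```javascript
// Cosine similarity: dot product divided by the product of vector lengths.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Exact top-k by scanning every vector (what HNSW avoids at scale).
function nearest(query, vectors, k) {
  return vectors
    .map((vector, id) => ({ id, score: cosineSimilarity(query, vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const stored = [
  [0, 1, 0],
  [0.9, 0.1, 0],
  [1, 0, 0],
];
console.log(nearest([1, 0, 0], stored, 2).map((hit) => hit.id)); // [ 2, 1 ]
```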

Hybrid Search

Combines lexical and semantic search for best results:

Runs both search types in parallel
Normalizes scores from each
Merges using RRF (Reciprocal Rank Fusion)
Returns unified ranked results

memvid find data.mv2 --query "authentication best practices" --mode hybrid
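
The fusion step can be sketched as follows. This is a generic Reciprocal Rank Fusion, not Memvid's code: each document earns 1 / (k + rank) from every list it appears in, and the constant k = 60 is an assumption taken from common practice. The document ids are made up.

```javascript
// rankings: array of ranked id lists, best result first in each list.
function reciprocalRankFusion(rankings, k = 60) {
  const fused = new Map();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      fused.set(id, (fused.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...fused.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id, score]) => ({ id, score }));
}

const lexical = ["doc-a", "doc-b", "doc-c"];
const semantic = ["doc-b", "doc-d", "doc-a"];
const merged = reciprocalRankFusion([lexical, semantic]);
console.log(merged.map((r) => r.id)); // [ 'doc-b', 'doc-a', 'doc-d', 'doc-c' ]
```

Documents that appear in both lists rise to the top even if neither list ranked them first, which is why RRF is a popular way to merge lexical and semantic results.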

Sketch Pre-filtering

Ultra-fast candidate filtering before expensive ranking:

SimHash: 64-bit locality-sensitive hash for quick similarity checks
Term Filter: Compact bitset for query term overlap
Top Terms: Hashed IDs of highest-weight terms

Reduces search candidates by 10-100x, enabling sub-millisecond filtering on million-frame memories.
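
A 64-bit SimHash can be sketched with BigInt arithmetic. The token hash used here (64-bit FNV-1a) and the tokenizer are assumptions for illustration; the glossary does not say which hash or weighting Memvid uses. Similar texts tend to produce fingerprints that differ in few bits, so a Hamming distance check is a cheap first filter.

```javascript
const MASK64 = (1n << 64n) - 1n;

// 64-bit FNV-1a hash of a string (stand-in token hash).
function fnv1a64(text) {
  let hash = 0xcbf29ce484222325n;
  for (const byte of new TextEncoder().encode(text)) {
    hash ^= BigInt(byte);
    hash = (hash * 0x100000001b3n) & MASK64;
  }
  return hash;
}

// Each token votes +1 or -1 on every bit; the sign of the tally sets the bit.
function simhash64(text) {
  const counts = new Array(64).fill(0);
  for (const token of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    const h = fnv1a64(token);
    for (let bit = 0; bit < 64; bit++) {
      counts[bit] += (h >> BigInt(bit)) & 1n ? 1 : -1;
    }
  }
  let fingerprint = 0n;
  for (let bit = 0; bit < 64; bit++) {
    if (counts[bit] > 0) fingerprint |= 1n << BigInt(bit);
  }
  return fingerprint;
}

// Number of differing bits between two fingerprints.
function hammingDistance(a, b) {
  let x = a ^ b;
  let distance = 0;
  while (x > 0n) {
    distance += Number(x & 1n);
    x >>= 1n;
  }
  return distance;
}

const left = simhash64("how to secure API endpoints with tokens");
const right = simhash64("how to secure API endpoints with keys");
console.log(hammingDistance(left, right));
```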

Indexing

Lex Index

The lexical search index built on Tantivy:

Tokenizes text into terms
Builds inverted index (term → document list)
Supports field-specific queries (title, content, tags)
Deterministic chunking for reproducibility
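
The term-to-document mapping can be sketched with a Map. This is a toy inverted index, not Tantivy: the function names and the simple tokenizer are assumptions, and field-specific queries are left out.

```javascript
// Build term -> ascending list of document ids.
function buildInvertedIndex(docs) {
  const index = new Map();
  docs.forEach((text, docId) => {
    const terms = new Set(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
    for (const term of terms) {
      if (!index.has(term)) index.set(term, []);
      index.get(term).push(docId);
    }
  });
  return index;
}

// AND query: ids of documents that contain every term.
function lookupAll(index, terms) {
  if (terms.length === 0) return [];
  const lists = terms.map((t) => index.get(t.toLowerCase()) ?? []);
  return lists.reduce((acc, list) => acc.filter((id) => list.includes(id)));
}

const index = buildInvertedIndex([
  "Memory file format",
  "Search index format",
  "Memory search",
]);
console.log(index.get("memory"));                    // [ 0, 2 ]
console.log(lookupAll(index, ["memory", "search"])); // [ 2 ]
```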

Vec Index

The vector/embedding search index:

Stores document embeddings
Uses HNSW graph for approximate nearest neighbor search
Supports multiple embedding models (BGE, Nomic, OpenAI)
Optional product quantization for compression

Time Index

Temporal index for frame ordering:

Tracks insertion timestamps
Enables range queries (since/until)
Powers timeline() navigation
Supports forward and reverse traversal
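
One way to support range queries and two-way traversal is a timestamp-sorted array with binary search. The TimeIndex class below is a hypothetical sketch; treating since as inclusive and until as exclusive is an assumption, as is the newestFirst option name.

```javascript
class TimeIndex {
  constructor() {
    this.entries = []; // kept sorted by timestamp
  }

  // Assumes inserts arrive in non-decreasing timestamp order.
  add(timestamp, frameId) {
    this.entries.push({ timestamp, frameId });
  }

  // First position whose timestamp is >= target.
  lowerBound(target) {
    let lo = 0;
    let hi = this.entries.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.entries[mid].timestamp < target) lo = mid + 1;
      else hi = mid;
    }
    return lo;
  }

  // Frame ids with since <= timestamp < until, oldest first unless newestFirst is set.
  range({ since = -Infinity, until = Infinity, newestFirst = false } = {}) {
    const ids = this.entries
      .slice(this.lowerBound(since), this.lowerBound(until))
      .map((entry) => entry.frameId);
    return newestFirst ? ids.reverse() : ids;
  }
}

const timeIndex = new TimeIndex();
timeIndex.add(100, 0);
timeIndex.add(200, 1);
timeIndex.add(300, 2);
console.log(timeIndex.range({ since: 150 }));         // [ 1, 2 ]
console.log(timeIndex.range({ newestFirst: true }));  // [ 2, 1, 0 ]
```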

Sketch Track

Per-frame micro-indices for fast filtering:

Variant	Size	Use Case
Small	32 bytes/frame	Memory-constrained
Medium	64 bytes/frame	Balanced (default)
Large	96 bytes/frame	Maximum precision
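
The per-frame sizes in the table translate directly into total sketch overhead. The helper below is a hypothetical back-of-the-envelope calculation that ignores any container or alignment overhead.

```javascript
// Bytes per frame for each variant, taken from the table above.
const SKETCH_BYTES_PER_FRAME = { small: 32, medium: 64, large: 96 };

function sketchTrackBytes(variant, frameCount) {
  const perFrame = SKETCH_BYTES_PER_FRAME[variant];
  if (perFrame === undefined) throw new Error(`unknown variant: ${variant}`);
  return perFrame * frameCount;
}

for (const variant of Object.keys(SKETCH_BYTES_PER_FRAME)) {
  const mib = sketchTrackBytes(variant, 1000000) / (1024 * 1024);
  console.log(`${variant}: ${mib.toFixed(1)} MiB per million frames`);
}
```

For a million frames this works out to roughly 30.5 MiB (Small), 61.0 MiB (Medium), and 91.6 MiB (Large).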

Enrichment

Enrichment Pipeline

Background processing that enhances frames after insertion.

Phases:

Searchable (instant) - Skim text extracted, basic indexing
Enriched (background) - Full text, embeddings, memory cards, entities

The two-phase approach means search works immediately while richer features process in the background.
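
The two phases can be modeled as a small queue. Everything here is hypothetical (class name, the 80-character skim, the embed callback); the point is only that a frame becomes searchable as soon as put returns, while the expensive work is deferred.

```javascript
class EnrichmentQueue {
  constructor() {
    this.frames = new Map(); // frameId -> frame record
    this.pending = [];       // frame ids waiting for phase 2
  }

  // Phase 1: the frame is searchable immediately using a cheap skim of the text.
  put(frameId, text) {
    this.frames.set(frameId, { text, state: "Searchable", skim: text.slice(0, 80) });
    this.pending.push(frameId);
  }

  // Phase 2: a background worker would call this; here it runs inline.
  // Returns false when there is nothing left to enrich.
  enrichNext(embed) {
    const frameId = this.pending.shift();
    if (frameId === undefined) return false;
    const frame = this.frames.get(frameId);
    frame.embedding = embed(frame.text);
    frame.state = "Enriched";
    return true;
  }
}

const queue = new EnrichmentQueue();
queue.put(0, "Quarterly report on storage costs");
console.log(queue.frames.get(0).state); // Searchable
queue.enrichNext((text) => [text.length]); // toy "embedding"
console.log(queue.frames.get(0).state); // Enriched
```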

Memory Cards

Structured units of extracted knowledge:

Field	Description
`kind`	Fact, Preference, Event, Profile, Relationship, Goal
`content`	The extracted information
`polarity`	Positive, Negative, or Neutral
`version_relation`	Sets, Updates, Extends, or Retracts

Memory cards enable semantic querying beyond raw text search.

Entity Extraction (NER)

Named Entity Recognition identifies and links entities:

Types: Person, Organization, Location, Date, Money, URL, etc.
Model: DistilBERT-NER (ONNX)
Output: Entities with confidence scores and frame references

Logic Mesh

Entity-relationship graph connecting extracted entities:

Bidirectional graph structure
Nodes: Entities with types and mentions
Edges: Relationships with confidence
Enables: “follow” queries for fact traversal
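
A bidirectional entity graph with a "follow" lookup might look like the sketch below. The EntityGraph class, the relation names, and the default 0.5 confidence cutoff are all assumptions for illustration; the entities are made up.

```javascript
class EntityGraph {
  constructor() {
    this.edges = new Map(); // entity -> list of edges touching it
  }

  // Store the edge on both endpoints so it can be walked in either direction.
  link(source, relation, target, confidence) {
    this.addEdge(source, { relation, target, confidence, direction: "out" });
    this.addEdge(target, { relation, target: source, confidence, direction: "in" });
  }

  addEdge(node, edge) {
    if (!this.edges.has(node)) this.edges.set(node, []);
    this.edges.get(node).push(edge);
  }

  // Follow edges of one relation from an entity, dropping low-confidence ones.
  follow(entity, relation, { direction = "out", minConfidence = 0.5 } = {}) {
    return (this.edges.get(entity) ?? [])
      .filter((e) => e.direction === direction)
      .filter((e) => e.relation === relation && e.confidence >= minConfidence)
      .map((e) => e.target);
  }
}

const graph = new EntityGraph();
graph.link("Ada", "works_at", "Acme", 0.9);
graph.link("Bob", "works_at", "Acme", 0.4);
console.log(graph.follow("Ada", "works_at"));                        // [ 'Acme' ]
console.log(graph.follow("Acme", "works_at", { direction: "in" })); // [ 'Ada' ]
```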

Persistence

WAL (Write-Ahead Log)

Embedded circular buffer ensuring crash safety:

Purpose: Records mutations before they’re applied
Checksum: BLAKE3 hash for integrity verification
Recovery: Replays uncommitted entries after crash
Size: Configurable (64 KB to 64 MB)
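
The record-then-replay idea can be sketched in a few lines. This is a simplified model: it uses a growing array instead of a fixed-size circular buffer, and a 32-bit FNV-1a checksum stands in for BLAKE3, which is not available in the JavaScript standard library. Class and method names are hypothetical.

```javascript
// Stand-in checksum (32-bit FNV-1a). The real format uses BLAKE3.
function checksum(text) {
  let hash = 0x811c9dc5;
  for (const byte of new TextEncoder().encode(text)) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

class WriteAheadLog {
  constructor() {
    this.entries = [];
    this.committedSeq = 0; // highest sequence number known to be committed
    this.nextSeq = 1;
  }

  // Record the mutation before it is applied.
  append(payload) {
    const entry = { seq: this.nextSeq++, payload, checksum: checksum(payload) };
    this.entries.push(entry);
    return entry.seq;
  }

  commit(seq) {
    this.committedSeq = seq;
  }

  // Checkpoint: drop entries that are already committed.
  checkpoint() {
    this.entries = this.entries.filter((e) => e.seq > this.committedSeq);
  }

  // Recovery: replay uncommitted entries, stopping at the first corrupt one.
  replay(apply) {
    for (const entry of this.entries) {
      if (entry.seq <= this.committedSeq) continue;
      if (checksum(entry.payload) !== entry.checksum) break;
      apply(entry.payload);
    }
  }
}

const wal = new WriteAheadLog();
wal.commit(wal.append("put frame 0"));
wal.append("put frame 1"); // written but not committed before the "crash"
wal.replay((payload) => console.log("replaying:", payload));
```

Stopping at the first checksum mismatch reflects the usual assumption that a torn write at the tail of the log marks the point where the crash happened.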

Header

Fixed 4 KB structure at file offset 0:

Magic bytes (MV2\0)
Spec and format versions
WAL offset and size
Footer offset pointer
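
A reader would typically validate the magic bytes before trusting anything else in the file. The check below assumes the magic occupies the first four bytes of the header; the offsets of the remaining fields are not given in this glossary, so they are deliberately not parsed. The function name is hypothetical.

```javascript
const HEADER_SIZE = 4096;               // fixed 4 KB header
const MAGIC = [0x4d, 0x56, 0x32, 0x00]; // ASCII "MV2" followed by a NUL byte

// bytes: Uint8Array holding at least the first 4 KB of the file.
function hasValidMagic(bytes) {
  if (bytes.length < HEADER_SIZE) return false; // truncated header
  return MAGIC.every((value, i) => bytes[i] === value);
}

// Build a fake header in memory to exercise the check.
const header = new Uint8Array(HEADER_SIZE);
header.set(MAGIC);
console.log(hasValidMagic(header));                      // true
console.log(hasValidMagic(new Uint8Array(HEADER_SIZE))); // false: all zeros
```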

Footer

Variable-length CBOR-serialized metadata at end of file:

Table of Contents (TOC)
Manifest pointers
Segment catalog
Checksums for validation

TOC (Table of Contents)

Master index structure in the footer:

Lists all frames with metadata
References to index manifests (lex, vec, time)
Segment catalog for published indices

Capacity & Licensing

Ticket

Signed proof of capacity grant:

Issuer: Authority that granted the capacity
Sequence: Monotonic identifier
Capacity: Bytes allowed
Signature: ED25519 digital signature

Tickets are validated cryptographically and cannot be forged.

Capacity Tiers

Storage limits based on plan:

Tier	Capacity	Memory Files	Queries/Month
Free	50 MB	—	—
Starter	25 GB	5	250k
Pro	125 GB	25	20M
Enterprise	Unlimited	Unlimited	Unlimited

Embedding Models

Local Models

Run entirely on your machine:

Model	Dimensions	Speed	Quality
BGE-Small	384	Fast	Good
BGE-Base	768	Medium	Better
Nomic-Embed	768	Medium	Better

Cloud Models

API-based embedding providers:

Provider	Model	Dimensions
OpenAI	text-embedding-3-small	1536
OpenAI	text-embedding-3-large	3072
NVIDIA	NV-Embed-v2	4096

CLIP Models

For visual/image embeddings:

Model	Dimensions	Use Case
SigLIP	768	High-quality image search
MobileCLIP	384	Fast, lightweight

Operations

put / put_many

Insert documents into memory:

// Single document
const frameId = await mem.put({ text: "Hello world" });

// Batch insert
const frameIds = await mem.put_many([
  { text: "Doc 1" },
  { text: "Doc 2" },
  { file: "document.pdf" }
]);

find

Search the memory:

const results = await mem.find({
  query: "machine learning",
  k: 10,
  mode: "hybrid",
  snippet_chars: 200
});

ask

AI-powered question answering:

const answer = await mem.ask({
  question: "What are the key findings?",
  use_model: "openai"
});

Returns a synthesized answer with source citations.

timeline

Browse frames by insertion order:

// Recent frames
const recent = await mem.timeline({ limit: 20 });

// Oldest first
const oldest = await mem.timeline({ limit: 20, reverse: true });

seal

Close and commit the memory file:

await mem.seal();

Commits pending changes and releases the file lock.

Feature Flags

Cargo features that enable optional functionality:

Feature	Description
`lex`	Tantivy full-text search
`vec`	Vector embeddings + HNSW
`clip`	CLIP visual search
`whisper`	Audio transcription
`encryption`	Capsule encryption
`logic_mesh`	Entity-relationship graph + NER
`replay`	Session recording/replay
`temporal_track`	Temporal mention tracking
`parallel_segments`	Parallel index building

Common Workflows

Ingestion → Search → Ask

Create memory       → memvid.create("data.mv2")
Insert documents    → mem.put({ file: "docs/*.pdf" })
Wait for enrichment → (automatic background processing)
Search              → mem.find({ query: "..." })
Ask questions       → mem.ask({ question: "..." })
Close               → mem.seal()

Enrichment Pipeline

put()           → Frame enters Searchable state
Background      → Worker extracts full text
Embedding       → Generate vector embeddings
Extraction      → Memory cards, entities
Indexed         → Frame now fully Enriched

See Also

Five-Minute Guide

Get started with Memvid in minutes

Architecture Overview

Deep dive into how Memvid works

CLI Reference

Complete CLI command reference

SDK Comparison

Compare features across SDKs
