Memvid supports CLIP (Contrastive Language-Image Pre-training) embeddings for visual search. This lets you search documents, PDFs, and images by their visual content (charts, diagrams, photos, and other visual elements) using natural language queries.
Overview
CLIP models learn to associate images with text descriptions, enabling:
- Text-to-image search: Find images using natural language (“sustainability charts”, “team photos”)
- Visual document search: Search PDF pages by their visual content, not just text
- Cross-modal retrieval: Query with text, retrieve visual content
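All three capabilities reduce to the same operation: text and images are projected into a shared vector space, and relevance is measured by cosine similarity between embeddings. A minimal sketch using the provider API described on this page (the image path is a placeholder, and numpy is used only for the math):

```python
import numpy as np
from memvid_sdk.clip import get_clip_provider

clip = get_clip_provider('local')  # any provider works; see the table below

# Embed one image and two candidate text queries into the same vector space
image_vec = np.array(clip.embed_image('chart.png'))  # placeholder path
queries = ['revenue growth chart', 'team photo at the beach']

def cosine(a, b):
    """Cosine similarity: higher means the text better describes the image."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for q in queries:
    text_vec = np.array(clip.embed_text(q))
    print(f"{q!r}: {cosine(image_vec, text_vec):.3f}")
```

The available providers are summarized below: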
| Provider | Model | Dimensions | Best For |
|---|---|---|---|
| Local | MobileCLIP-S2 | 512 | Offline, privacy-first |
| OpenAI | text-embedding-3-small | 1536 | General purpose |
| OpenAI | text-embedding-3-large | 3072 | Highest quality |
| Gemini | embedding-001 | 768 | Google ecosystem |
Quick Start
Python SDK
from memvid_sdk import create
from memvid_sdk.clip import get_clip_provider
# Initialize CLIP provider
clip = get_clip_provider('openai') # or 'local', 'gemini'
print(f"Provider: {clip.name} ({clip.dimension} dimensions)")
# Create memory and store a PDF
mem = create('visual_search.mv2')
mem.enable_lex()
frame_id = mem.put(
title="Annual Report 2024",
label="report",
metadata={"year": 2024},
file="report.pdf",
)
# Generate text embedding for visual search
query_embedding = clip.embed_text("revenue growth charts")
print(f"Query embedding: {len(query_embedding)} dimensions")
# Search (visual search requires vector index)
results = mem.find("revenue", k=10)
Node.js SDK
import { create, getClipProvider } from '@memvid/sdk';
// Initialize CLIP provider
const clip = getClipProvider('openai'); // or 'local', 'gemini'
console.log(`Provider: ${clip.name} (${clip.dimension} dimensions)`);
// Create memory and store a PDF
const mem = await create('visual_search.mv2');
await mem.enableLex();
const frameId = await mem.put({
title: 'Annual Report 2024',
label: 'report',
metadata: { year: 2024 },
file: 'report.pdf',
});
// Generate text embedding for visual search
const queryEmbedding = await clip.embedText('revenue growth charts');
console.log(`Query embedding: ${queryEmbedding.length} dimensions`);
// Search
const results = await mem.find('revenue', { k: 10 });
Providers
Local CLIP (MobileCLIP-S2)
The default provider uses MobileCLIP-S2, a lightweight CLIP model optimized for mobile and edge devices.
Characteristics:
- Dimensions: 512
- Size: ~200 MB (downloaded on first use)
- Inference: CPU-based, no GPU required
- Privacy: All processing happens locally
- Offline: Works without internet after initial download
from memvid_sdk.clip import get_clip_provider, LocalClip
# Using factory
clip = get_clip_provider('local')
# Or direct instantiation
clip = LocalClip(model='mobileclip-s2')
# Embed an image
image_embedding = clip.embed_image('photo.jpg')
# Embed text for search
text_embedding = clip.embed_text('sunset over ocean')
# Batch embed multiple images
embeddings = clip.embed_images(['img1.jpg', 'img2.jpg', 'img3.jpg'])
import { getClipProvider, LocalClip } from '@memvid/sdk';
// Using factory
const clip = getClipProvider('local');
// Or direct instantiation (use a different name to avoid redeclaring `clip`)
const localClip = new LocalClip({ model: 'mobileclip-s2' });
// Embed an image
const imageEmbedding = await clip.embedImage('photo.jpg');
// Embed text for search
const textEmbedding = await clip.embedText('sunset over ocean');
// Batch embed multiple images
const embeddings = await clip.embedImages(['img1.jpg', 'img2.jpg', 'img3.jpg']);
Local CLIP is supported in memvid-core and the Python SDK. In Node.js, LocalClip requires a native build that exports ClipModel (the prebuilt npm binaries may not include it). Cloud providers work out of the box.
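If your code has to run both where the local model is available and where it is not, one defensive pattern (not built-in SDK behavior, just a sketch) is to try the local provider first and fall back to a cloud provider:

```python
import os
from memvid_sdk.clip import get_clip_provider

def pick_clip_provider():
    """Prefer local CLIP; fall back to OpenAI if the local model is unavailable."""
    try:
        return get_clip_provider('local')
    except Exception as exc:  # the exact exception type depends on the SDK build
        if not os.environ.get('OPENAI_API_KEY'):
            raise RuntimeError('No local CLIP model and no OPENAI_API_KEY set') from exc
        print(f'Local CLIP unavailable ({exc}); falling back to OpenAI')
        return get_clip_provider('openai')

clip = pick_clip_provider()
print(f'Using {clip.name} ({clip.dimension} dims)')
```

Note that falling back changes the embedding dimension (512 vs. 1536), so the index must be built with the same provider you query with.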
OpenAI CLIP
OpenAI’s embedding models provide excellent quality for visual search queries.
Setup:
export OPENAI_API_KEY=sk-your-key-here
Usage:
from memvid_sdk.clip import get_clip_provider, OpenAIClip
# Using factory
clip = get_clip_provider('openai')
# Or with specific model
clip = get_clip_provider('openai:text-embedding-3-large')
# Direct instantiation
clip = OpenAIClip(model='text-embedding-3-small')
# Embed text for visual search
embedding = clip.embed_text('executive team photo')
print(f"Dimensions: {len(embedding)}")
import { getClipProvider, OpenAIClip } from '@memvid/sdk';
// Using factory
const clip = getClipProvider('openai');
// Override embedding/vision models via config
const clip2 = getClipProvider('openai', { embeddingModel: 'text-embedding-3-large', visionModel: 'gpt-4o-mini' });
// Direct instantiation
const clip3 = new OpenAIClip({ embeddingModel: 'text-embedding-3-small', visionModel: 'gpt-4o-mini' });
// Embed text for visual search
const embedding = await clip.embedText('executive team photo');
console.log(`Dimensions: ${embedding.length}`);
Model Comparison:
| Model | Dimensions | Quality |
|---|---|---|
| text-embedding-3-small | 1536 | Good |
| text-embedding-3-large | 3072 | Best |
Gemini CLIP
Google’s Gemini provides multimodal embeddings for visual search.
Setup:
export GEMINI_API_KEY=your-key-here
Usage:
from memvid_sdk.clip import get_clip_provider, GeminiClip
# Using factory
clip = get_clip_provider('gemini')
# Or with specific model
clip = get_clip_provider('gemini:embedding-001')
# Direct instantiation
clip = GeminiClip(model='embedding-001')
# Embed text
embedding = clip.embed_text('data visualization dashboard')
import { getClipProvider, GeminiClip } from '@memvid/sdk';
const clip = getClipProvider('gemini');
const embedding = await clip.embedText('data visualization dashboard');
Complete Example
Here’s a full workflow for visual document search:
from pathlib import Path
from memvid_sdk import create
from memvid_sdk.clip import get_clip_provider
# Configuration
PROVIDER = 'openai' # 'local', 'openai', 'gemini'
PDF_PATH = 'annual_report.pdf'
OUTPUT_PATH = 'visual_search.mv2'
# Initialize
clip = get_clip_provider(PROVIDER)
print(f"CLIP Provider: {clip.name} ({clip.dimension} dims)")
# Create memory
if Path(OUTPUT_PATH).exists():
Path(OUTPUT_PATH).unlink()
mem = create(OUTPUT_PATH)
mem.enable_lex()
# Ingest PDF
frame_id = mem.put(
title=Path(PDF_PATH).stem,
label='report',
metadata={'source': 'finance', 'year': 2024},
file=PDF_PATH,
)
print(f"Stored PDF as frame {frame_id}")
# Visual search queries
queries = [
'revenue growth charts',
'organizational structure',
'sustainability initiatives',
'executive portraits',
]
print("\nGenerating embeddings for visual search:")
for query in queries:
embedding = clip.embed_text(query)
print(f" '{query}' -> {len(embedding)} dims")
# Seal and show stats
mem.seal()
stats = mem.stats()
print(f"\nFinal: {stats.get('frame_count', 0)} frames")
API Reference
ClipProvider Interface
All CLIP providers implement this interface:
| Member | Description |
|---|---|
| name | Provider identifier (e.g., openai:text-embedding-3-small) |
| dimension | Embedding vector dimension |
| embed_image(path) | Generate embedding for a single image |
| embed_text(text) | Generate text embedding for visual search |
| embed_images(paths) | Batch embed multiple images |
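Any object exposing these members can be used as a provider. A duck-typed sketch for testing (the hash-like "embedder" below is a stand-in, not a real model, and the SDK may not export a formal base class):

```python
from typing import List

class FakeClipProvider:
    """Illustrative provider satisfying the interface above, useful in unit tests."""
    name = 'fake:demo'
    dimension = 8

    def _pseudo_embed(self, data: bytes) -> List[float]:
        # Not a real embedding: spreads byte values over `dimension` buckets
        vec = [0.0] * self.dimension
        for i, b in enumerate(data):
            vec[i % self.dimension] += b / 255.0
        return vec

    def embed_text(self, text: str) -> List[float]:
        return self._pseudo_embed(text.encode('utf-8'))

    def embed_image(self, path: str) -> List[float]:
        with open(path, 'rb') as f:
            return self._pseudo_embed(f.read())

    def embed_images(self, paths: List[str]) -> List[List[float]]:
        return [self.embed_image(p) for p in paths]
```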
Factory Function
# Python
from memvid_sdk.clip import get_clip_provider
clip = get_clip_provider(provider) # 'local', 'openai', 'gemini', 'openai:model-name'
// Node.js
import { getClipProvider } from '@memvid/sdk';
const clip = getClipProvider(provider); // 'local' | 'openai' | 'gemini'
Environment Variables
| Variable | Description |
|---|---|
| OPENAI_API_KEY | OpenAI API key for OpenAI CLIP |
| GEMINI_API_KEY | Google AI API key for Gemini CLIP |
| MEMVID_MODELS_DIR | Local model cache directory |
| MEMVID_OFFLINE=1 | Skip model downloads (local CLIP) |
| MEMVID_CLIP_MODEL | Override default CLIP model |
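For example, to force the local provider to use a pre-populated model cache and never attempt a download (the paths are illustrative; set the variables before the provider is constructed):

```python
import os

# Set before creating the local provider
os.environ['MEMVID_MODELS_DIR'] = '/opt/memvid/models'  # pre-populated cache (example path)
os.environ['MEMVID_OFFLINE'] = '1'                      # skip model downloads

from memvid_sdk.clip import get_clip_provider

clip = get_clip_provider('local')
print(clip.name, clip.dimension)
```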
Use Cases
Visual Document Search
Search PDFs by their visual content (charts, diagrams, tables):
# Find pages with specific visual elements
clip = get_clip_provider('openai')
query_embedding = clip.embed_text('pie chart showing market share')
# Use with memory search (lexical here; visual search additionally requires a vector index)
results = mem.find('market share', k=10)
Image Gallery Search
Build searchable image galleries with natural language:
# Embed and store images
for image_path in Path('photos/').glob('*.jpg'):
embedding = clip.embed_image(str(image_path))
mem.put(
title=image_path.stem,
label='photo',
file=str(image_path),
metadata={'clip_embedding': embedding},
)
# Search by description
query_embedding = clip.embed_text('beach sunset')
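Ranking the gallery against that query still needs a vector comparison step, which the snippet above leaves out. A minimal in-memory sketch (in practice you would reuse the embeddings computed during ingestion instead of re-embedding here):

```python
import numpy as np
from pathlib import Path

# Side index: filename -> image embedding
index = {
    p.name: np.array(clip.embed_image(str(p)))
    for p in Path('photos/').glob('*.jpg')
}

# Rank by cosine similarity to the text query
query = np.array(clip.embed_text('beach sunset'))
query = query / np.linalg.norm(query)
scores = {
    name: float(vec @ query / np.linalg.norm(vec))
    for name, vec in index.items()
}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"{score:.3f}  {name}")
```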
Multimodal RAG
Combine visual and text search for richer retrieval:
# Store documents with visual embeddings
for pdf in pdfs:
    # Visual embedding of a rendered page or thumbnail (thumbnail_path assumed to exist)
    visual_embedding = clip.embed_image(pdf.thumbnail_path)
    # Text for lexical search; keep the visual embedding alongside it in metadata
    frame_id = mem.put(
        title=pdf.name,
        file=str(pdf),
        metadata={'clip_embedding': visual_embedding},
    )
# Hybrid search combines both modalities
results = mem.find(query, mode='auto')
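One simple way to fuse the two signals yourself (not a built-in memvid feature, just a sketch) is reciprocal rank fusion over the lexical result list and a CLIP-similarity ranking:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: lists of frame ids, each ordered best-first; returns fused order."""
    scores = {}
    for ranking in rankings:
        for rank, frame_id in enumerate(ranking):
            scores[frame_id] = scores.get(frame_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# lexical_ids: frame ids ordered by mem.find(...)
# visual_ids:  frame ids ordered by CLIP cosine similarity
# fused = reciprocal_rank_fusion([lexical_ids, visual_ids])
```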
Best Practices
- Choose the right provider: Use local CLIP for privacy/offline, OpenAI for quality
- Batch embeddings: Use embed_images() for multiple images to reduce API calls
- Cache embeddings: Store visual embeddings in metadata for reuse
- Consistent models: Use the same model for indexing and querying
- Dimension matching: Ensure query and document embeddings have same dimensions
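The last two points are cheap to enforce with a check at query time, since every provider exposes its name and dimension:

```python
def checked_query_embedding(clip, text, expected_dim):
    """Fail fast if the query-time provider does not match the one used for indexing."""
    embedding = clip.embed_text(text)
    if len(embedding) != expected_dim:
        raise ValueError(
            f"{clip.name} produced a {len(embedding)}-dim embedding, "
            f"but the index expects {expected_dim} dims"
        )
    return embedding

# Example: an index built with local MobileCLIP-S2 expects 512 dims
# query_vec = checked_query_embedding(clip, 'revenue growth charts', expected_dim=512)
```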
Limitations
- Local CLIP (Node.js): Requires a native build with CLIP support; prebuilt npm binaries may be cloud-only
- Image formats: Supports JPEG, PNG, WebP, GIF
- PDF visual search: Requires extracting page images first (see the sketch after this list)
- Model size: Local CLIP downloads ~200 MB on first use
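For the PDF limitation, one common approach is to render each page to an image and embed the rendered pages. A sketch using PyMuPDF (an external dependency chosen here for illustration, not part of memvid):

```python
import fitz  # PyMuPDF: pip install pymupdf

def embed_pdf_pages(clip, pdf_path, dpi=150):
    """Render each PDF page to a PNG and return one CLIP embedding per page."""
    embeddings = []
    with fitz.open(pdf_path) as doc:
        for i, page in enumerate(doc):
            pix = page.get_pixmap(dpi=dpi)
            image_path = f"{pdf_path}.page{i}.png"
            pix.save(image_path)
            embeddings.append(clip.embed_image(image_path))
    return embeddings
```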
Next Steps