Memvid supports CLIP (Contrastive Language-Image Pre-training) embeddings for visual search. This lets you search documents, PDFs, and images by their visual content, such as charts, diagrams, and photos, using natural language queries.

Overview

CLIP models learn to associate images with text descriptions, enabling:
  • Text-to-image search: Find images using natural language (“sustainability charts”, “team photos”)
  • Visual document search: Search PDF pages by their visual content, not just text
  • Cross-modal retrieval: Query with text, retrieve visual content
Provider | Model                  | Dimensions | Best For
Local    | MobileCLIP-S2          | 512        | Offline, privacy-first
OpenAI   | text-embedding-3-small | 1536       | General purpose
OpenAI   | text-embedding-3-large | 3072       | Highest quality
Gemini   | embedding-001          | 768        | Google ecosystem

Quick Start

Python SDK

from memvid_sdk import create
from memvid_sdk.clip import get_clip_provider

# Initialize CLIP provider
clip = get_clip_provider('openai')  # or 'local', 'gemini'
print(f"Provider: {clip.name} ({clip.dimension} dimensions)")

# Create memory and store a PDF
mem = create('visual_search.mv2')
mem.enable_lex()

frame_id = mem.put(
    title="Annual Report 2024",
    label="report",
    metadata={"year": 2024},
    file="report.pdf",
)

# Generate text embedding for visual search
query_embedding = clip.embed_text("revenue growth charts")
print(f"Query embedding: {len(query_embedding)} dimensions")

# Search (visual search requires vector index)
results = mem.find("revenue", k=10)
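
The query embedding above is generated but only a lexical find is run; wiring visual search end to end requires a vector index. As a rough stopgap sketch (not an SDK feature), you can score pre-extracted page images against the query yourself, using only the documented embed_images call plus plain cosine similarity. The cosine helper and the page image filenames below are hypothetical:

import math

def cosine(a, b):
    # Plain cosine similarity over the numeric sequences returned by the provider.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical pre-extracted page images (see the PDF note under Limitations).
page_images = ["page_001.png", "page_002.png", "page_003.png"]
page_embeddings = clip.embed_images(page_images)

ranked = sorted(
    zip(page_images, (cosine(query_embedding, e) for e in page_embeddings)),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked[:3])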

Node.js SDK

import { create, getClipProvider } from '@memvid/sdk';

// Initialize CLIP provider
const clip = getClipProvider('openai');  // or 'local', 'gemini'
console.log(`Provider: ${clip.name} (${clip.dimension} dimensions)`);

// Create memory and store a PDF
const mem = await create('visual_search.mv2');
await mem.enableLex();

const frameId = await mem.put({
  title: 'Annual Report 2024',
  label: 'report',
  metadata: { year: 2024 },
  file: 'report.pdf',
});

// Generate text embedding for visual search
const queryEmbedding = await clip.embedText('revenue growth charts');
console.log(`Query embedding: ${queryEmbedding.length} dimensions`);

// Search
const results = await mem.find('revenue', { k: 10 });

Providers

Local CLIP (MobileCLIP-S2)

The default provider uses MobileCLIP-S2, a lightweight CLIP model optimized for mobile and edge devices. Characteristics:
  • Dimensions: 512
  • Size: ~200 MB (downloaded on first use)
  • Inference: CPU-based, no GPU required
  • Privacy: All processing happens locally
  • Offline: Works without internet after initial download
from memvid_sdk.clip import get_clip_provider, LocalClip

# Using factory
clip = get_clip_provider('local')

# Or direct instantiation
clip = LocalClip(model='mobileclip-s2')

# Embed an image
image_embedding = clip.embed_image('photo.jpg')

# Embed text for search
text_embedding = clip.embed_text('sunset over ocean')

# Batch embed multiple images
embeddings = clip.embed_images(['img1.jpg', 'img2.jpg', 'img3.jpg'])
import { getClipProvider, LocalClip } from '@memvid/sdk';

// Using factory
const clip = getClipProvider('local');

// Or direct instantiation
const clip2 = new LocalClip({ model: 'mobileclip-s2' });

// Embed an image
const imageEmbedding = await clip.embedImage('photo.jpg');

// Embed text for search
const textEmbedding = await clip.embedText('sunset over ocean');

// Batch embed multiple images
const embeddings = await clip.embedImages(['img1.jpg', 'img2.jpg', 'img3.jpg']);
Local CLIP is supported in memvid-core and the Python SDK. In Node.js, LocalClip requires a native build that exports ClipModel (the prebuilt npm binaries may not include it). Cloud providers work out of the box.

OpenAI CLIP

OpenAI’s embedding models provide excellent quality for visual search queries. Setup:
export OPENAI_API_KEY=sk-your-key-here
Usage:
from memvid_sdk.clip import get_clip_provider, OpenAIClip

# Using factory
clip = get_clip_provider('openai')

# Or with specific model
clip = get_clip_provider('openai:text-embedding-3-large')

# Direct instantiation
clip = OpenAIClip(model='text-embedding-3-small')

# Embed text for visual search
embedding = clip.embed_text('executive team photo')
print(f"Dimensions: {len(embedding)}")
import { getClipProvider, OpenAIClip } from '@memvid/sdk';

// Using factory
const clip = getClipProvider('openai');

// Override embedding/vision models via config
const clip2 = getClipProvider('openai', { embeddingModel: 'text-embedding-3-large', visionModel: 'gpt-4o-mini' });

// Direct instantiation
const clip3 = new OpenAIClip({ embeddingModel: 'text-embedding-3-small', visionModel: 'gpt-4o-mini' });

// Embed text for visual search
const embedding = await clip.embedText('executive team photo');
console.log(`Dimensions: ${embedding.length}`);
Model Comparison:
Model                  | Dimensions | Quality
text-embedding-3-small | 1536       | Good
text-embedding-3-large | 3072       | Best

Gemini CLIP

Google’s Gemini provides multimodal embeddings for visual search. Setup:
export GEMINI_API_KEY=your-key-here
Usage:
from memvid_sdk.clip import get_clip_provider, GeminiClip

# Using factory
clip = get_clip_provider('gemini')

# Or with specific model
clip = get_clip_provider('gemini:embedding-001')

# Direct instantiation
clip = GeminiClip(model='embedding-001')

# Embed text
embedding = clip.embed_text('data visualization dashboard')
import { getClipProvider, GeminiClip } from '@memvid/sdk';

const clip = getClipProvider('gemini');
const embedding = await clip.embedText('data visualization dashboard');

Complete Example

Here’s a full workflow for visual document search:
from pathlib import Path
from memvid_sdk import create
from memvid_sdk.clip import get_clip_provider

# Configuration
PROVIDER = 'openai'  # 'local', 'openai', 'gemini'
PDF_PATH = 'annual_report.pdf'
OUTPUT_PATH = 'visual_search.mv2'

# Initialize
clip = get_clip_provider(PROVIDER)
print(f"CLIP Provider: {clip.name} ({clip.dimension} dims)")

# Create memory
if Path(OUTPUT_PATH).exists():
    Path(OUTPUT_PATH).unlink()

mem = create(OUTPUT_PATH)
mem.enable_lex()

# Ingest PDF
frame_id = mem.put(
    title=Path(PDF_PATH).stem,
    label='report',
    metadata={'source': 'finance', 'year': 2024},
    file=PDF_PATH,
)
print(f"Stored PDF as frame {frame_id}")

# Visual search queries
queries = [
    'revenue growth charts',
    'organizational structure',
    'sustainability initiatives',
    'executive portraits',
]

print("\nGenerating embeddings for visual search:")
for query in queries:
    embedding = clip.embed_text(query)
    print(f"  '{query}' -> {len(embedding)} dims")

# Seal and show stats
mem.seal()
stats = mem.stats()
print(f"\nFinal: {stats.get('frame_count', 0)} frames")

API Reference

ClipProvider Interface

All CLIP providers implement this interface:
Method              | Description
name                | Provider identifier (e.g., openai:text-embedding-3-small)
dimension           | Embedding vector dimension
embed_image(path)   | Generate embedding for a single image
embed_text(text)    | Generate text embedding for visual search
embed_images(paths) | Batch embed multiple images
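
For orientation, a minimal custom provider that satisfies this interface could look like the sketch below. The class name and the use of plain Python lists are assumptions; only the attribute and method names come from the table above:

from typing import List

class DummyClipProvider:
    """Illustrative skeleton matching the ClipProvider interface."""

    name = "dummy:zeros"
    dimension = 512

    def embed_text(self, text: str) -> List[float]:
        # A real provider returns a CLIP text embedding; this stub returns zeros.
        return [0.0] * self.dimension

    def embed_image(self, path: str) -> List[float]:
        # A real provider loads the image and returns its CLIP embedding.
        return [0.0] * self.dimension

    def embed_images(self, paths: List[str]) -> List[List[float]]:
        return [self.embed_image(p) for p in paths]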

Factory Function

# Python
from memvid_sdk.clip import get_clip_provider

clip = get_clip_provider(provider)  # 'local', 'openai', 'gemini', 'openai:model-name'
// Node.js
import { getClipProvider } from '@memvid/sdk';

const clip = getClipProvider(provider); // 'local' | 'openai' | 'gemini'

Environment Variables

Variable          | Description
OPENAI_API_KEY    | OpenAI API key for OpenAI CLIP
GEMINI_API_KEY    | Google AI API key for Gemini CLIP
MEMVID_MODELS_DIR | Local model cache directory
MEMVID_OFFLINE=1  | Skip model downloads (local CLIP)
MEMVID_CLIP_MODEL | Override default CLIP model
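
For example, to run the local provider fully offline, one option is to set the cache directory and offline flag in the process environment before creating the provider. This is a sketch based on the variables above; the directory path is illustrative:

import os
from memvid_sdk.clip import get_clip_provider

# Illustrative: pin the model cache and skip downloads (assumes the model was
# already downloaded into this directory on a previous run).
os.environ["MEMVID_MODELS_DIR"] = "/opt/memvid/models"
os.environ["MEMVID_OFFLINE"] = "1"

clip = get_clip_provider("local")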

Use Cases

Visual Document Search

Search PDFs by their visual content (charts, diagrams, tables):
# Find pages with specific visual elements
clip = get_clip_provider('openai')
query = clip.embed_text('pie chart showing market share')

# Use with memory search
results = mem.find('market share', k=10)

Image Gallery Search

Build searchable image galleries with natural language:
# Embed and store images
for image_path in Path('photos/').glob('*.jpg'):
    embedding = clip.embed_image(str(image_path))
    mem.put(
        title=image_path.stem,
        label='photo',
        file=str(image_path),
        metadata={'clip_embedding': embedding},
    )

# Search by description
query_embedding = clip.embed_text('beach sunset')
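
To complete the retrieval side without a vector index, one simple (non-SDK) approach is to keep a local map from filename to embedding and rank it against the query embedding, reusing the cosine() helper sketched in the Quick Start:

# In practice, populate this map inside the ingestion loop above instead of
# re-embedding; it is shown separately here for clarity.
gallery = {p.name: clip.embed_image(str(p)) for p in Path('photos/').glob('*.jpg')}

query_embedding = clip.embed_text('beach sunset')
ranked = sorted(
    gallery.items(),
    key=lambda item: cosine(query_embedding, item[1]),  # cosine() from the Quick Start sketch
    reverse=True,
)
print([name for name, _ in ranked[:5]])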

Multimodal RAG

Combine visual and text search for richer retrieval:
# Store documents with visual embeddings
for pdf in pdfs:
    # Text for lexical search
    frame_id = mem.put(title=pdf.name, file=str(pdf))

    # Visual embedding for image search
    visual_embedding = clip.embed_image(pdf.thumbnail_path)

# Hybrid search combines both modalities
results = mem.find(query, mode='auto')
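
The visual embedding above is computed but never stored. One hedged option, reusing the clip_embedding metadata convention from the gallery example, is to attach it to the frame when calling put() (the key name is a convention from this page, not a reserved field):

for pdf in pdfs:
    visual_embedding = clip.embed_image(pdf.thumbnail_path)
    frame_id = mem.put(
        title=pdf.name,
        file=str(pdf),
        metadata={'clip_embedding': visual_embedding},  # same convention as the gallery example
    )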

Best Practices

  1. Choose the right provider: Use local CLIP for privacy/offline, OpenAI for quality
  2. Batch embeddings: Use embed_images() for multiple images to reduce API calls
  3. Cache embeddings: Store visual embeddings in metadata for reuse
  4. Consistent models: Use the same model for indexing and querying
  5. Dimension matching: Ensure query and document embeddings have the same dimensions (see the check sketched after this list)
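
As a quick sanity check for consistent models and dimension matching, you can compare the provider used at index time with the one used at query time. This is a sketch using only the documented name and dimension attributes:

from memvid_sdk.clip import get_clip_provider

index_clip = get_clip_provider('openai')   # provider used when documents were embedded
query_clip = get_clip_provider('openai')   # provider used at query time; must match

if (query_clip.name, query_clip.dimension) != (index_clip.name, index_clip.dimension):
    raise ValueError(
        f"Embedding mismatch: indexed with {index_clip.name} ({index_clip.dimension}d), "
        f"querying with {query_clip.name} ({query_clip.dimension}d)"
    )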

Limitations

  • Local CLIP (Node.js): Requires a native build with CLIP support; prebuilt npm binaries may be cloud-only
  • Image formats: Supports JPEG, PNG, WebP, GIF
  • PDF visual search: Requires extracting page images first (see the sketch after this list)
  • Model size: Local CLIP downloads ~200 MB on first use
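
For the PDF limitation above, a common approach outside the SDK is to render pages to images first, for example with the third-party pdf2image library (which requires poppler to be installed), and then embed the page images:

from pathlib import Path
from pdf2image import convert_from_path  # third-party; needs poppler installed
from memvid_sdk.clip import get_clip_provider

clip = get_clip_provider('local')
Path('pages').mkdir(exist_ok=True)

# Render each PDF page to a PNG, then embed the page images with CLIP.
page_paths = []
for i, page in enumerate(convert_from_path('annual_report.pdf', dpi=150), start=1):
    path = f'pages/page_{i:03d}.png'
    page.save(path, 'PNG')
    page_paths.append(path)

page_embeddings = clip.embed_images(page_paths)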

Next Steps