Memvid supports local LLM inference through Ollama, allowing you to run AI-powered Q&A without sending your data to external APIs. This is ideal for:
  • Privacy-sensitive data - Keep everything on your machine
  • Offline usage - No internet connection required after setup
  • Cost savings - No API fees for inference
  • Low latency - No network round-trips

Quick Setup

1. Install Ollama

brew install ollama

2. Start Ollama Server

# Start in foreground (see logs)
ollama serve

# Or run as background service (macOS)
brew services start ollama

3. Pull a Model

# Recommended: Qwen2.5 1.5B (best quality/size ratio)
ollama pull qwen2.5:1.5b

4. Use with Memvid

memvid ask knowledge.mv2 \
  --question "What is the main topic?" \
  --use-model "ollama:qwen2.5:1.5b"

Recommended Models

| Model        | Size   | Speed  | Quality   | Best For                   |
|--------------|--------|--------|-----------|----------------------------|
| qwen2.5:0.5b | ~400MB | Fast   | Good      | Quick queries, limited RAM |
| qwen2.5:1.5b | ~1GB   | Fast   | Great     | Recommended default        |
| qwen2.5:3b   | ~2GB   | Medium | Excellent | Complex questions          |
| phi3:mini    | ~2GB   | Medium | Great     | Reasoning tasks            |
| gemma2:2b    | ~1.6GB | Medium | Great     | General purpose            |
| llama3.2:1b  | ~1.3GB | Fast   | Good      | Conversational             |
| llama3.2:3b  | ~2GB   | Medium | Great     | Balanced performance       |

Pull Commands

# Small & fast
ollama pull qwen2.5:0.5b

# Recommended (best balance)
ollama pull qwen2.5:1.5b

# Higher quality
ollama pull qwen2.5:3b
ollama pull phi3:mini
ollama pull gemma2:2b

# Meta's Llama
ollama pull llama3.2:1b
ollama pull llama3.2:3b

CLI Usage

Basic Q&A

# Ask with local model
memvid ask knowledge.mv2 \
  --question "What are the key findings?" \
  --use-model "ollama:qwen2.5:1.5b"

# With JSON output
memvid ask knowledge.mv2 \
  --question "Summarize the main points" \
  --use-model "ollama:qwen2.5:1.5b" \
  --json

Advanced Options

# More context for complex questions
memvid ask knowledge.mv2 \
  --question "Explain the architecture in detail" \
  --use-model "ollama:qwen2.5:3b" \
  --top-k 15 \
  --snippet-chars 800

# Filter by scope
memvid ask knowledge.mv2 \
  --question "What API endpoints exist?" \
  --use-model "ollama:qwen2.5:1.5b" \
  --scope "mv2://api/"

# Time-travel query
memvid ask knowledge.mv2 \
  --question "What was the status?" \
  --use-model "ollama:qwen2.5:1.5b" \
  --as-of-frame 100
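
If you would rather script the CLI than use an SDK, the --json flag shown above makes the output machine-readable. A minimal Python sketch (the exact JSON schema is not documented on this page, so it simply pretty-prints whatever the command returns):

import json
import subprocess

# Run `memvid ask` with --json so the output can be parsed programmatically.
result = subprocess.run(
    [
        "memvid", "ask", "knowledge.mv2",
        "--question", "What are the key findings?",
        "--use-model", "ollama:qwen2.5:1.5b",
        "--json",
    ],
    capture_output=True,
    text=True,
    check=True,
)

# Schema is not specified here, so just pretty-print the parsed response.
print(json.dumps(json.loads(result.stdout), indent=2))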

Python SDK Usage

from memvid_sdk import use

mem = use('basic', 'knowledge.mv2')

# Ask with local Ollama model
response = mem.ask(
    "What are the main conclusions?",
    model="ollama:qwen2.5:1.5b",
    k=10
)

print(response['answer'])

With Different Models

# Quick answer with small model
quick_response = mem.ask(
    "What is this document about?",
    model="ollama:qwen2.5:0.5b"
)

# Detailed analysis with larger model
detailed_response = mem.ask(
    "Provide a comprehensive analysis of the findings",
    model="ollama:qwen2.5:3b",
    k=15
)

Node.js SDK Usage

import { use } from '@memvid/sdk';

const mem = await use('basic', 'knowledge.mv2');

// Ask with local Ollama model
const response = await mem.ask(
  'What are the key takeaways?',
  { model: 'ollama:qwen2.5:1.5b', k: 10 }
);

console.log(response.answer);

Model Selection Guide

By Use Case

| Use Case        | Recommended Model | Why              |
|-----------------|-------------------|------------------|
| Quick lookups   | qwen2.5:0.5b      | Fastest response |
| General Q&A     | qwen2.5:1.5b      | Best balance     |
| Technical docs  | qwen2.5:3b        | Better reasoning |
| Code analysis   | phi3:mini         | Strong at code   |
| Research papers | qwen2.5:3b        | Complex content  |

By Hardware

| RAM Available | Recommended Model       |
|---------------|-------------------------|
| 4GB           | qwen2.5:0.5b            |
| 8GB           | qwen2.5:1.5b            |
| 16GB+         | qwen2.5:3b or phi3:mini |
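
As a rough illustration of the hardware table above, here is a minimal sketch that picks a model tag from available RAM. It assumes the third-party psutil package is installed; the thresholds simply mirror the table and are not part of the Memvid API.

import psutil  # third-party: pip install psutil

def pick_ollama_model() -> str:
    """Pick a model tag from the hardware table above based on total RAM."""
    total_gb = psutil.virtual_memory().total / (1024 ** 3)
    if total_gb >= 16:
        return "ollama:qwen2.5:3b"
    if total_gb >= 8:
        return "ollama:qwen2.5:1.5b"
    return "ollama:qwen2.5:0.5b"

print(pick_ollama_model())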

Ollama Management

List Downloaded Models

ollama list

Remove a Model

ollama rm qwen2.5:0.5b

Update a Model

ollama pull qwen2.5:1.5b

Check Ollama Status

# Check if server is running
curl http://localhost:11434/api/tags
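
The same check can be done from Python before issuing a query. A minimal sketch using only the standard library and the /api/tags endpoint shown above (the response is expected to contain a models list; treat that schema as an assumption):

import json
import urllib.request

# Ollama's HTTP API lists pulled models at /api/tags when the server is up.
try:
    with urllib.request.urlopen("http://localhost:11434/api/tags", timeout=2) as resp:
        models = [m.get("name") for m in json.load(resp).get("models", [])]
        print("Ollama is running. Pulled models:", models)
except OSError:
    print("Ollama server not reachable; start it with `ollama serve`.")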

Run as Background Service

# Start service
brew services start ollama

# Stop service
brew services stop ollama

# Check status
brew services list | grep ollama

Troubleshooting

Ollama Not Running

Error: Failed to contact LLM provider
Solution:
# Start Ollama server
ollama serve

# Or as background service (macOS)
brew services start ollama
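
From the Python SDK you can surface this failure explicitly. The exact exception type the SDK raises is not documented on this page, so this sketch catches broadly and prints a hint:

from memvid_sdk import use

mem = use('basic', 'knowledge.mv2')

try:
    response = mem.ask("What is the main topic?", model="ollama:qwen2.5:1.5b")
    print(response['answer'])
except Exception as exc:  # exact SDK exception type not documented here
    print(f"LLM call failed ({exc}). Is Ollama running? Try `ollama serve`.")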

Model Not Found

Error: model 'qwen2.5:1.5b' not found
Solution:
# Pull the model first
ollama pull qwen2.5:1.5b
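
To automate this check, a minimal sketch that inspects `ollama list` output and pulls the model when it is missing (plain subprocess calls to the CLI commands shown above):

import subprocess

MODEL = "qwen2.5:1.5b"

# `ollama list` prints the pulled models; pull the model if it is absent.
installed = subprocess.run(["ollama", "list"], capture_output=True, text=True).stdout
if MODEL not in installed:
    subprocess.run(["ollama", "pull", MODEL], check=True)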

Slow Response Times

Solutions:
  • Use a smaller model: ollama:qwen2.5:0.5b
  • Reduce context: --top-k 5 --snippet-chars 300 (or a smaller k in the SDK; see the sketch below)
  • Close other memory-intensive applications
  • Ensure you have enough RAM for the model
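
For the Python SDK, the equivalent of reducing context is a smaller k, optionally combined with a smaller model. A minimal sketch:

from memvid_sdk import use

mem = use('basic', 'knowledge.mv2')

# Smaller model plus less retrieved context keeps latency down.
response = mem.ask(
    "What is the main topic?",
    model="ollama:qwen2.5:0.5b",
    k=5,
)
print(response['answer'])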

Out of Memory

Solutions:
  • Use a smaller model
  • Close other applications
  • Increase swap space (not recommended for performance)

Comparison: Local vs Cloud Models

| Aspect  | Local (Ollama)        | Cloud (OpenAI, Claude, Gemini) |
|---------|-----------------------|--------------------------------|
| Privacy | Data stays local      | Data sent to API               |
| Cost    | Free after setup      | Per-token pricing              |
| Speed   | Depends on hardware   | Usually faster                 |
| Quality | Good to great         | Excellent                      |
| Offline | Yes                   | No                             |
| Setup   | Requires installation | Just API key                   |

When to Use Local Models

  • Sensitive/confidential data
  • Offline environments
  • Cost-sensitive applications
  • Privacy requirements

When to Use Cloud Models

  • Best possible answer quality
  • Limited local compute
  • Quick prototyping
  • Complex reasoning tasks

Next Steps