Memvid processes audio and video files using Whisper for transcription, making spoken content searchable. Video files also support key frame extraction and playback from within the CLI.

How It Works

Key features:
  • Automatic transcription - Whisper converts speech to text
  • Timestamp alignment - Text segments linked to audio/video timecodes
  • Key frame extraction - Important frames from video
  • In-CLI playback - Play segments directly from terminal
  • Visual search - CLIP embeddings for video frames
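
Timestamp alignment means each transcript segment carries its offset into the source file; the HH:MM:SS timecodes shown in search results are derived from those offsets. A minimal sketch of that conversion (the helper name is illustrative, not part of memvid):

```python
def to_timecode(seconds: float) -> str:
    """Convert a segment offset in seconds to an HH:MM:SS timecode."""
    total = int(seconds)
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

print(to_timecode(932.5))  # → 00:15:32
```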

Supported Formats

Audio

Format | Extension  | Notes
MP3    | .mp3       | Most common, lossy
WAV    | .wav       | Uncompressed, best quality
FLAC   | .flac      | Lossless compression
AAC    | .aac, .m4a | Apple/iTunes format
OGG    | .ogg       | Open format, Vorbis codec
ALAC   | .m4a       | Apple Lossless

Video

Format | Extension | Notes
MP4    | .mp4      | Most common, H.264/H.265
WebM   | .webm     | Web-optimized, VP8/VP9
MOV    | .mov      | Apple QuickTime
AVI    | .avi      | Legacy Windows format
MKV    | .mkv      | Matroska container
FLV    | .flv      | Flash video (legacy)

Ingesting Audio

Basic Usage

# Ingest audio file (auto-transcribes)
memvid put memory.mv2 --input podcast.mp3

# Ingest multiple audio files
memvid put memory.mv2 --input ./recordings/

# With metadata
memvid put memory.mv2 --input interview.wav --metadata '{"speaker": "John Doe", "date": "2024-01-15"}'

Transcription Options

# Specify language (faster, more accurate)
memvid put memory.mv2 --input audio.mp3 --language en

# Force language detection
memvid put memory.mv2 --input audio.mp3 --detect-language

# Use larger model for better accuracy
memvid put memory.mv2 --input audio.mp3 --whisper-model medium

Whisper Models

Model  | Size   | Speed   | Accuracy  | Best For
tiny   | 39 MB  | Fastest | Basic     | Quick previews
base   | 74 MB  | Fast    | Good      | General use
small  | 244 MB | Medium  | Better    | Default choice
medium | 769 MB | Slower  | Excellent | Important content
large  | 1.5 GB | Slowest | Best      | Critical accuracy

# Install specific model
memvid models install whisper-medium

# Use installed model
memvid put memory.mv2 --input audio.mp3 --whisper-model medium
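
When scripting ingestion, the model table above can be captured as a small lookup. The sizes are the approximate figures listed there; `largest_model_under` is a hypothetical helper, only a sketch of how to pick a model programmatically:

```python
# Approximate download sizes (MB) from the model table above, smallest to largest.
WHISPER_MODELS = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large": 1536}

def largest_model_under(budget_mb: int) -> str:
    """Pick the most accurate Whisper model that fits a size budget."""
    fitting = [name for name, mb in WHISPER_MODELS.items() if mb <= budget_mb]
    return fitting[-1] if fitting else "tiny"  # dicts preserve insertion order

print(largest_model_under(800))  # → medium
```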

Language Support

Whisper supports 99 languages. Specifying the language up front is faster and more accurate than auto-detection:
# English
memvid put memory.mv2 --input audio.mp3 --language en

# Spanish
memvid put memory.mv2 --input audio.mp3 --language es

# Mandarin Chinese
memvid put memory.mv2 --input audio.mp3 --language zh

# Auto-detect (slower)
memvid put memory.mv2 --input audio.mp3 --detect-language

Ingesting Video

Basic Usage

# Ingest video (transcribes audio + extracts frames)
memvid put memory.mv2 --input meeting.mp4

# Video only (no audio transcription)
memvid put memory.mv2 --input silent-video.mp4 --no-transcribe

# Audio only (skip frame extraction)
memvid put memory.mv2 --input video.mp4 --audio-only

Frame Extraction

# Extract key frames for visual search
memvid put memory.mv2 --input video.mp4 --extract-frames

# Control frame density
memvid put memory.mv2 --input video.mp4 --frame-interval 30  # Every 30 seconds

# Extract specific number of frames
memvid put memory.mv2 --input video.mp4 --max-frames 50
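
The two flags trade off against each other: --frame-interval sets sampling density, while --max-frames caps the total. Assuming interval-based sampling starting at t=0 (how memvid actually selects key frames may differ), the frame count works out as:

```python
import math
from typing import Optional

def planned_frames(duration_s: float, interval_s: float,
                   max_frames: Optional[int] = None) -> int:
    """Estimate key-frame count: one frame at t=0, then one per interval."""
    n = math.floor(duration_s / interval_s) + 1
    return min(n, max_frames) if max_frames is not None else n

print(planned_frames(3600, 30))      # → 121 (1-hour video, 30 s intervals)
print(planned_frames(3600, 30, 50))  # → 50  (capped by --max-frames)
```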

Visual Embeddings

Enable CLIP embeddings for visual search:
# Enable visual embeddings for frames
memvid put memory.mv2 --input video.mp4 --clip-embeddings

# Search by visual content
memvid find memory.mv2 --query "person at whiteboard" --mode clip

Searching Transcribed Content

# Search transcription text
memvid find memory.mv2 --query "quarterly revenue"

# Search with timestamp context
memvid find memory.mv2 --query "action items" --json

# Results include timecodes:
# {
#   "frame_id": "frame_abc123",
#   "text": "The action items from this meeting are...",
#   "timestamp": "00:15:32",
#   "source": "meeting.mp4"
# }
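
The timecodes in JSON results can be converted back to seconds and fed to --start-seconds for playback. A minimal sketch, where the sample result mirrors the shape shown above:

```python
import json

def timecode_to_seconds(tc: str) -> int:
    """Convert an HH:MM:SS timecode from search results to seconds."""
    h, m, s = (int(part) for part in tc.split(":"))
    return h * 3600 + m * 60 + s

# Shaped like a single result from `memvid find ... --json`.
result = json.loads('{"frame_id": "frame_abc123", "timestamp": "00:15:32"}')
print(timecode_to_seconds(result["timestamp"]))  # → 932
```

The value can then be passed straight to playback, e.g. `memvid view memory.mv2 --frame-id frame_abc123 --play --start-seconds 932`.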

Ask Questions

# Ask about audio/video content
memvid ask memory.mv2 --question "What were the main decisions from the meeting?"

# Get context with timestamps
memvid ask memory.mv2 --question "What did John say about the budget?" --sources

Visual Search (Video)

# Search by visual description
memvid find memory.mv2 --query "chart showing growth" --mode clip

# Combine text and visual
memvid find memory.mv2 --query "presentation slide" --mode auto

Playback

Playing Audio

# Play entire audio
memvid view memory.mv2 --frame-id frame_abc --play

# Play specific segment
memvid view memory.mv2 --frame-id frame_abc --play --start-seconds 30 --end-seconds 60

# Play from timestamp
memvid view memory.mv2 --frame-id frame_abc --play --start-seconds 120

Playing Video

# Play video
memvid view memory.mv2 --frame-id frame_xyz --play

# Play segment
memvid view memory.mv2 --frame-id frame_xyz --play --start-seconds 0 --end-seconds 30

# Preview mode (thumbnail)
memvid view memory.mv2 --frame-id frame_xyz --preview

Playback Controls

Option                   | Description
--play                   | Start playback
--start-seconds N        | Start at N seconds
--end-seconds N          | Stop at N seconds
--preview                | Show thumbnail/preview
--preview-start HH:MM:SS | Preview window start
--preview-end HH:MM:SS   | Preview window end

Use Cases

Meeting Recordings

# Create meeting memory
memvid create meetings.mv2

# Ingest meeting recordings
memvid put meetings.mv2 --input ./recordings/ --language en

# Find specific discussions
memvid find meetings.mv2 --query "budget approval"

# Ask about decisions
memvid ask meetings.mv2 --question "What was decided about the Q4 budget?"

# Play the relevant segment
memvid view meetings.mv2 --frame-id frame_abc --play --start-seconds 1234

Podcast Library

# Create podcast memory
memvid create podcasts.mv2

# Ingest episodes with metadata
memvid put podcasts.mv2 --input episode-42.mp3 \
  --metadata '{"show": "Tech Talk", "episode": 42, "guests": ["Alice", "Bob"]}'

# Search across all episodes
memvid find podcasts.mv2 --query "machine learning trends"

# Timeline of episodes
memvid timeline podcasts.mv2 --reverse

Video Tutorials

# Create tutorial library
memvid create tutorials.mv2

# Ingest with frame extraction
memvid put tutorials.mv2 --input ./tutorials/ --extract-frames --clip-embeddings

# Find by spoken content
memvid find tutorials.mv2 --query "how to configure webpack"

# Find by visual content
memvid find tutorials.mv2 --query "terminal with npm commands" --mode clip

Lecture Archive

# Create lecture memory
memvid create lectures.mv2

# Ingest lecture videos
memvid put lectures.mv2 --input ./cs101/ --whisper-model medium

# Search for topics
memvid find lectures.mv2 --query "binary search algorithm"

# Ask study questions
memvid ask lectures.mv2 --question "Explain the time complexity of quicksort"

Voicemail/Call Logs

# Create call memory
memvid create calls.mv2

# Ingest voicemails
memvid put calls.mv2 --input ./voicemails/ --metadata '{"type": "voicemail"}'

# Find by caller mention
memvid find calls.mv2 --query "callback number"

# Enrich with entity extraction
memvid enrich calls.mv2 --engine groq
memvid state calls.mv2 --entity "Customer Service"

GPU Acceleration

Transcription is CPU-intensive. Enable GPU for faster processing:

macOS (Apple Silicon)

# Install with Metal support
cargo install memvid-cli --features metal

# Or via Homebrew (includes Metal)
brew install memvid/tap/memvid

Linux/Windows (NVIDIA CUDA)

# Install with CUDA support
cargo install memvid-cli --features cuda

# Requires CUDA toolkit and cuDNN

Performance Comparison

Hardware        | 1 hr Audio | 1 hr Video
CPU (M1)        | ~15 min    | ~25 min
Metal (M1)      | ~3 min     | ~8 min
CPU (Intel i7)  | ~20 min    | ~35 min
CUDA (RTX 3080) | ~2 min     | ~5 min
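
These rates make batch planning straightforward. A back-of-the-envelope sketch using the approximate per-hour audio figures from the table (the hardware keys and helper are illustrative):

```python
# Approximate minutes to transcribe one hour of audio, from the table above.
RATES = {"cpu_m1": 15, "metal_m1": 3, "cpu_i7": 20, "cuda_3080": 2}

def batch_minutes(audio_hours: float, hardware: str) -> float:
    """Estimate wall-clock minutes to transcribe a batch of audio."""
    return audio_hours * RATES[hardware]

print(batch_minutes(10, "metal_m1"))  # → 30.0
```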

Batch Processing

Parallel Ingestion

# Process multiple files in parallel
memvid put memory.mv2 --input ./media/ --parallel-segments

# Limit concurrent transcriptions (manage memory)
memvid put memory.mv2 --input ./media/ --max-concurrent 2

Large Libraries

# Ingest incrementally
memvid put memory.mv2 --input ./media/2024-01/
memvid put memory.mv2 --input ./media/2024-02/

# Check progress
memvid stats memory.mv2
memvid timeline memory.mv2 --limit 10
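
Incremental ingestion is easy to script. A sketch that builds one `memvid put` invocation per monthly subdirectory (the directory layout is hypothetical, and the helpers are illustrative; the commands mirror the ones above):

```python
from pathlib import Path

def put_command(memory: str, input_path: str) -> list:
    """One `memvid put` invocation for a directory of media."""
    return ["memvid", "put", memory, "--input", input_path]

def ingest_commands(root: str, memory: str = "memory.mv2") -> list:
    """Build one command per subdirectory of media, oldest month first."""
    return [put_command(memory, str(d))
            for d in sorted(Path(root).iterdir()) if d.is_dir()]

print(put_command("memory.mv2", "./media/2024-01"))
# → ['memvid', 'put', 'memory.mv2', '--input', './media/2024-01']
```

Each command can then be run with `subprocess.run(cmd, check=True)`, pausing between months to check progress with `memvid stats`.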

Frame Metadata

Each transcribed segment includes metadata:
{
  "frame_id": "frame_abc123",
  "uri": "mv2://media/meeting.mp4",
  "content_type": "audio/transcript",
  "metadata": {
    "source_file": "meeting.mp4",
    "duration_seconds": 3600,
    "segment_start": 932.5,
    "segment_end": 945.2,
    "timestamp": "00:15:32",
    "language": "en",
    "confidence": 0.94,
    "whisper_model": "small"
  }
}
Access metadata:
# View frame with metadata
memvid view memory.mv2 --frame-id frame_abc --json

# Filter by media type
memvid timeline memory.mv2 --filter "content_type:audio/transcript"
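
The per-segment confidence score is useful for filtering noisy transcripts. A sketch over segment metadata shaped like the example above (the sample data and the 0.6 threshold are illustrative):

```python
import json

# Segment metadata shaped like the frame metadata example above.
segments = json.loads("""[
  {"timestamp": "00:15:32", "confidence": 0.94, "text": "The action items are..."},
  {"timestamp": "00:16:01", "confidence": 0.41, "text": "[inaudible]"}
]""")

# Keep only segments Whisper was reasonably confident about.
reliable = [s for s in segments if s["confidence"] >= 0.6]
print([s["timestamp"] for s in reliable])  # → ['00:15:32']
```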

Troubleshooting

No Transcription Output

# Check if Whisper model is installed
memvid models list

# Install required model
memvid models install whisper-small

# Try with explicit language
memvid put memory.mv2 --input audio.mp3 --language en

Poor Transcription Quality

# Use larger model
memvid put memory.mv2 --input audio.mp3 --whisper-model medium

# Specify correct language
memvid put memory.mv2 --input audio.mp3 --language es

# Check audio quality - Whisper works best with clear audio

Playback Issues

# Check frame exists
memvid view memory.mv2 --frame-id frame_abc

# Try preview mode first
memvid view memory.mv2 --frame-id frame_abc --preview

# Check file format support
memvid stats memory.mv2 --json | jq '.frames[] | select(.uri | contains("video"))'

Out of Memory

# Reduce concurrent processing
memvid put memory.mv2 --input video.mp4 --max-concurrent 1

# Use smaller model
memvid put memory.mv2 --input video.mp4 --whisper-model base

# Process in segments
memvid put memory.mv2 --input video.mp4 --segment-duration 300

Limitations

Limitation             | Workaround
No real-time streaming | Pre-record content
Large file sizes       | Use compression before ingestion
Multiple speakers      | Manual speaker tagging via metadata
Background noise       | Pre-process audio for noise reduction
Non-speech audio       | Not transcribed (music, effects)

SDK Support

Audio/video processing is currently CLI-only; SDK support is planned. As a workaround, ingest via the CLI and search the resulting transcriptions via the SDK:
import subprocess

# Ingest via CLI (check=True raises CalledProcessError if ingestion fails)
subprocess.run([
    'memvid', 'put', 'memory.mv2',
    '--input', 'audio.mp3',
    '--language', 'en'
], check=True)

# Search transcriptions via SDK
from memvid import use

mem = use('basic', 'memory.mv2')
results = mem.find("meeting action items")

Next Steps