Memvid processes audio and video files using Whisper for transcription, making spoken content searchable. Video files also support key frame extraction and playback from within the CLI.
How It Works
Key features:
Automatic transcription - Whisper converts speech to text
Timestamp alignment - Text segments linked to audio/video timecodes
Key frame extraction - Important frames from video
In-CLI playback - Play segments directly from terminal
Visual search - CLIP embeddings for video frames
Audio
| Format | Extension | Notes |
|--------|-----------|-------|
| MP3 | .mp3 | Most common, lossy |
| WAV | .wav | Uncompressed, best quality |
| FLAC | .flac | Lossless compression |
| AAC | .aac, .m4a | Apple/iTunes format |
| OGG | .ogg | Open format, Vorbis codec |
| ALAC | .m4a | Apple Lossless |
Video
| Format | Extension | Notes |
|--------|-----------|-------|
| MP4 | .mp4 | Most common, H.264/H.265 |
| WebM | .webm | Web-optimized, VP8/VP9 |
| MOV | .mov | Apple QuickTime |
| AVI | .avi | Legacy Windows format |
| MKV | .mkv | Matroska container |
| FLV | .flv | Flash video (legacy) |
Ingesting Audio
Basic Usage
# Ingest audio file (auto-transcribes)
memvid put memory.mv2 --input podcast.mp3
# Ingest multiple audio files
memvid put memory.mv2 --input ./recordings/
# With metadata
memvid put memory.mv2 --input interview.wav --metadata '{"speaker": "John Doe", "date": "2024-01-15"}'
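Hand-quoting JSON for `--metadata` in a shell is easy to get wrong. The following is a minimal sketch, assuming you drive the CLI from Python. `build_put_command` is a hypothetical helper, not part of Memvid, and it uses only the `put`, `--input`, and `--metadata` arguments shown above.

```python
import json


def build_put_command(memory_path, input_path, metadata=None):
    """Return an argv list for `memvid put` (hypothetical helper, not a Memvid API)."""
    cmd = ["memvid", "put", memory_path, "--input", input_path]
    if metadata:
        # json.dumps handles quoting and escaping, so no shell quoting is needed
        cmd += ["--metadata", json.dumps(metadata)]
    return cmd


cmd = build_put_command(
    "memory.mv2",
    "interview.wav",
    {"speaker": "John Doe", "date": "2024-01-15"},
)
print(cmd)
```

Passing the resulting list to `subprocess.run(cmd, check=True)` avoids shell interpretation of the JSON string entirely.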
Transcription Options
# Specify language (faster, more accurate)
memvid put memory.mv2 --input audio.mp3 --language en
# Force language detection
memvid put memory.mv2 --input audio.mp3 --detect-language
# Use larger model for better accuracy
memvid put memory.mv2 --input audio.mp3 --whisper-model medium
Whisper Models
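| Model | Size | Speed | Accuracy | Best For |
|-------|------|-------|----------|----------|
| tiny | 39 MB | Fastest | Basic | Quick previews |
| base | 74 MB | Fast | Good | General use |
| small | 244 MB | Medium | Better | Default choice |
| medium | 769 MB | Slower | Excellent | Important content |
| large | 1.5 GB | Slowest | Best | Critical accuracy |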
# Install specific model
memvid models install whisper-medium
# Use installed model
memvid put memory.mv2 --input audio.mp3 --whisper-model medium
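If disk space or download size is the constraint, the model table can be turned into a simple lookup. This is a rough sketch, not Memvid functionality: `largest_model_within` is a hypothetical helper, the sizes are copied from the table (with 1.5 GB taken as roughly 1500 MB), and runtime memory use may be higher than the listed size.

```python
# Model sizes in MB, copied from the table above (large: 1.5 GB, roughly 1500 MB)
WHISPER_MODELS = [
    ("tiny", 39),
    ("base", 74),
    ("small", 244),
    ("medium", 769),
    ("large", 1500),
]


def largest_model_within(budget_mb):
    """Return the largest listed model whose size fits the budget, or None."""
    fitting = [name for name, size_mb in WHISPER_MODELS if size_mb <= budget_mb]
    # The list is ordered smallest to largest, so the last fit is the biggest
    return fitting[-1] if fitting else None


print(largest_model_within(500))  # small
print(largest_model_within(20))   # None
```

The returned name can then be passed to `--whisper-model`.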
Language Support
Whisper supports 99 languages. Specify the language explicitly for better accuracy:
# English
memvid put memory.mv2 --input audio.mp3 --language en
# Spanish
memvid put memory.mv2 --input audio.mp3 --language es
# Mandarin Chinese
memvid put memory.mv2 --input audio.mp3 --language zh
# Auto-detect (slower)
memvid put memory.mv2 --input audio.mp3 --detect-language
Ingesting Video
Basic Usage
# Ingest video (transcribes audio + extracts frames)
memvid put memory.mv2 --input meeting.mp4
# Video only (no audio transcription)
memvid put memory.mv2 --input silent-video.mp4 --no-transcribe
# Audio only (skip frame extraction)
memvid put memory.mv2 --input video.mp4 --audio-only
# Extract key frames for visual search
memvid put memory.mv2 --input video.mp4 --extract-frames
# Control frame density
memvid put memory.mv2 --input video.mp4 --frame-interval 30 # Every 30 seconds
# Extract specific number of frames
memvid put memory.mv2 --input video.mp4 --max-frames 50
Visual Embeddings
Enable CLIP embeddings for visual search:
# Enable visual embeddings for frames
memvid put memory.mv2 --input video.mp4 --clip-embeddings
# Search by visual content
memvid find memory.mv2 --query "person at whiteboard" --mode clip
Searching Transcribed Content
Text Search
# Search transcription text
memvid find memory.mv2 --query "quarterly revenue"
# Search with timestamp context
memvid find memory.mv2 --query "action items" --json
# Results include timecodes:
# {
# "frame_id": "frame_abc123",
# "text": "The action items from this meeting are...",
# "timestamp": "00:15:32",
# "source": "meeting.mp4"
# }
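Search results report timecodes as `HH:MM:SS`, while playback takes `--start-seconds`. A small sketch of the conversion, assuming the result has the shape shown above; `timestamp_to_seconds` is an illustrative helper, not part of Memvid.

```python
def timestamp_to_seconds(timestamp):
    """Convert an "HH:MM:SS" string to whole seconds."""
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# A result with the same fields as the JSON example above
result = {
    "frame_id": "frame_abc123",
    "timestamp": "00:15:32",
    "source": "meeting.mp4",
}

start = timestamp_to_seconds(result["timestamp"])
print(start)  # 932
```

The value `932` is what you would pass to `--start-seconds` to jump to that point in the recording.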
Ask Questions
# Ask about audio/video content
memvid ask memory.mv2 --question "What were the main decisions from the meeting?"
# Get context with timestamps
memvid ask memory.mv2 --question "What did John say about the budget?" --sources
Visual Search (Video)
# Search by visual description
memvid find memory.mv2 --query "chart showing growth" --mode clip
# Combine text and visual
memvid find memory.mv2 --query "presentation slide" --mode auto
Playback
Playing Audio
# Play entire audio
memvid view memory.mv2 --frame-id frame_abc --play
# Play specific segment
memvid view memory.mv2 --frame-id frame_abc --play --start-seconds 30 --end-seconds 60
# Play from timestamp
memvid view memory.mv2 --frame-id frame_abc --play --start-seconds 120
Playing Video
# Play video
memvid view memory.mv2 --frame-id frame_xyz --play
# Play segment
memvid view memory.mv2 --frame-id frame_xyz --play --start-seconds 0 --end-seconds 30
# Preview mode (thumbnail)
memvid view memory.mv2 --frame-id frame_xyz --preview
Playback Controls
| Option | Description |
|--------|-------------|
| --play | Start playback |
| --start-seconds N | Start at N seconds |
| --end-seconds N | Stop at N seconds |
| --preview | Show thumbnail/preview |
| --preview-start HH:MM:SS | Preview window start |
| --preview-end HH:MM:SS | Preview window end |
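When playback is scripted, it can help to assemble the `memvid view` arguments in one place and reject an inverted time window before calling the CLI. A minimal sketch using only the `--frame-id`, `--play`, `--start-seconds`, and `--end-seconds` options listed above; `build_play_command` is a hypothetical helper.

```python
def build_play_command(memory_path, frame_id, start_seconds=None, end_seconds=None):
    """Return an argv list for `memvid view ... --play` (hypothetical helper)."""
    cmd = ["memvid", "view", memory_path, "--frame-id", frame_id, "--play"]
    if start_seconds is not None:
        cmd += ["--start-seconds", str(start_seconds)]
    if end_seconds is not None:
        # Catch an inverted window here instead of relying on the CLI
        if start_seconds is not None and end_seconds <= start_seconds:
            raise ValueError("end_seconds must be greater than start_seconds")
        cmd += ["--end-seconds", str(end_seconds)]
    return cmd


print(build_play_command("memory.mv2", "frame_abc", start_seconds=30, end_seconds=60))
```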
Use Cases
Meeting Recordings
# Create meeting memory
memvid create meetings.mv2
# Ingest meeting recordings
memvid put meetings.mv2 --input ./recordings/ --language en
# Find specific discussions
memvid find meetings.mv2 --query "budget approval"
# Ask about decisions
memvid ask meetings.mv2 --question "What was decided about the Q4 budget?"
# Play the relevant segment
memvid view meetings.mv2 --frame-id frame_abc --play --start-seconds 1234
Podcast Library
# Create podcast memory
memvid create podcasts.mv2
# Ingest episodes with metadata
memvid put podcasts.mv2 --input episode-42.mp3 \
--metadata '{"show": "Tech Talk", "episode": 42, "guests": ["Alice", "Bob"]}'
# Search across all episodes
memvid find podcasts.mv2 --query "machine learning trends"
# Timeline of episodes
memvid timeline podcasts.mv2 --reverse
Video Tutorials
# Create tutorial library
memvid create tutorials.mv2
# Ingest with frame extraction
memvid put tutorials.mv2 --input ./tutorials/ --extract-frames --clip-embeddings
# Find by spoken content
memvid find tutorials.mv2 --query "how to configure webpack"
# Find by visual content
memvid find tutorials.mv2 --query "terminal with npm commands" --mode clip
Lecture Archive
# Create lecture memory
memvid create lectures.mv2
# Ingest lecture videos
memvid put lectures.mv2 --input ./cs101/ --whisper-model medium
# Search for topics
memvid find lectures.mv2 --query "binary search algorithm"
# Ask study questions
memvid ask lectures.mv2 --question "Explain the time complexity of quicksort"
Voicemail/Call Logs
# Create call memory
memvid create calls.mv2
# Ingest voicemails
memvid put calls.mv2 --input ./voicemails/ --metadata '{"type": "voicemail"}'
# Find by caller mention
memvid find calls.mv2 --query "callback number"
# Enrich with entity extraction
memvid enrich calls.mv2 --engine groq
memvid state calls.mv2 --entity "Customer Service"
GPU Acceleration
Transcription is CPU-intensive. Enable GPU acceleration for faster processing:
macOS (Apple Silicon)
# Install with Metal support
cargo install memvid-cli --features metal
# Or via Homebrew (includes Metal)
brew install memvid/tap/memvid
Linux/Windows (NVIDIA CUDA)
# Install with CUDA support
cargo install memvid-cli --features cuda
# Requires CUDA toolkit and cuDNN
| Hardware | 1hr Audio | 1hr Video |
|----------|-----------|-----------|
| CPU (M1) | ~15 min | ~25 min |
| Metal (M1) | ~3 min | ~8 min |
| CPU (Intel i7) | ~20 min | ~35 min |
| CUDA (RTX 3080) | ~2 min | ~5 min |
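The figures above can be used for a back-of-the-envelope estimate of how long a whole library will take. This sketch assumes processing time scales linearly with media duration, which the table does not guarantee, and the approximate numbers will vary with the Whisper model and the content. `estimate_minutes` is an illustrative helper.

```python
# Approximate minutes of processing per hour of media, from the table above
MINUTES_PER_HOUR = {
    "CPU (M1)": {"audio": 15, "video": 25},
    "Metal (M1)": {"audio": 3, "video": 8},
    "CPU (Intel i7)": {"audio": 20, "video": 35},
    "CUDA (RTX 3080)": {"audio": 2, "video": 5},
}


def estimate_minutes(hardware, audio_hours=0, video_hours=0):
    """Estimate total processing minutes, assuming linear scaling."""
    rates = MINUTES_PER_HOUR[hardware]
    return audio_hours * rates["audio"] + video_hours * rates["video"]


# 10 hours of podcasts plus 4 hours of video
print(estimate_minutes("CPU (M1)", audio_hours=10, video_hours=4))    # 250
print(estimate_minutes("Metal (M1)", audio_hours=10, video_hours=4))  # 62
```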
Batch Processing
Parallel Ingestion
# Process multiple files in parallel
memvid put memory.mv2 --input ./media/ --parallel-segments
# Limit concurrent transcriptions (manage memory)
memvid put memory.mv2 --input ./media/ --max-concurrent 2
Large Libraries
# Ingest incrementally
memvid put memory.mv2 --input ./media/2024-01/
memvid put memory.mv2 --input ./media/2024-02/
# Check progress
memvid stats memory.mv2
memvid timeline memory.mv2 --limit 10
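The month-by-month pattern above can be scripted so that each subdirectory is ingested as its own batch. A minimal sketch, assuming one subdirectory per batch under a media root; both function names are hypothetical, and only the `memvid put ... --input` form shown above is used.

```python
import subprocess
from pathlib import Path


def incremental_put_commands(memory_path, media_root):
    """Return one `memvid put` argv per subdirectory, in sorted order."""
    root = Path(media_root)
    batches = sorted(p for p in root.iterdir() if p.is_dir())
    return [["memvid", "put", memory_path, "--input", str(batch)] for batch in batches]


def ingest_incrementally(memory_path, media_root):
    """Run the batches one at a time, stopping at the first failure."""
    for cmd in incremental_put_commands(memory_path, media_root):
        # check=True raises CalledProcessError if memvid exits non-zero
        subprocess.run(cmd, check=True)
```

Calling `ingest_incrementally("memory.mv2", "./media")` would then process `./media/2024-01`, `./media/2024-02`, and so on in order. Because the run stops at the first failed batch, you can fix the problem and resume from that directory.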
Each transcribed segment includes metadata:
{
  "frame_id": "frame_abc123",
  "uri": "mv2://media/meeting.mp4",
  "content_type": "audio/transcript",
  "metadata": {
    "source_file": "meeting.mp4",
    "duration_seconds": 3600,
    "segment_start": 932.5,
    "segment_end": 945.2,
    "timestamp": "00:15:32",
    "language": "en",
    "confidence": 0.94,
    "whisper_model": "small"
  }
}
Access metadata:
# View frame with metadata
memvid view memory.mv2 --frame-id frame_abc --json
# Filter by media type
memvid timeline memory.mv2 --filter "content_type:audio/transcript"
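Once you have a segment's JSON, the timing fields are easy to work with programmatically. The sketch below parses a trimmed copy of the example segment, derives the segment length, and shows how `segment_start` maps to the `timestamp` field. `format_timestamp` is an illustrative helper, and truncating fractional seconds is an assumption that happens to match the example.

```python
import json

# Trimmed copy of the example segment shown above
segment_json = """
{
  "frame_id": "frame_abc123",
  "content_type": "audio/transcript",
  "metadata": {
    "segment_start": 932.5,
    "segment_end": 945.2,
    "timestamp": "00:15:32",
    "confidence": 0.94
  }
}
"""


def format_timestamp(seconds):
    """Format a number of seconds as HH:MM:SS, dropping fractional seconds."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


meta = json.loads(segment_json)["metadata"]
length = round(meta["segment_end"] - meta["segment_start"], 1)

print(format_timestamp(meta["segment_start"]))  # 00:15:32
print(length)  # 12.7
```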
Troubleshooting
No Transcription Output
# Check if Whisper model is installed
memvid models list
# Install required model
memvid models install whisper-small
# Try with explicit language
memvid put memory.mv2 --input audio.mp3 --language en
Poor Transcription Quality
# Use larger model
memvid put memory.mv2 --input audio.mp3 --whisper-model medium
# Specify correct language
memvid put memory.mv2 --input audio.mp3 --language es
# Check audio quality - Whisper works best with clear audio
Playback Issues
# Check frame exists
memvid view memory.mv2 --frame-id frame_abc
# Try preview mode first
memvid view memory.mv2 --frame-id frame_abc --preview
# Check file format support
memvid stats memory.mv2 --json | jq '.frames[] | select(.uri | contains("video"))'
Out of Memory
# Reduce concurrent processing
memvid put memory.mv2 --input video.mp4 --max-concurrent 1
# Use smaller model
memvid put memory.mv2 --input video.mp4 --whisper-model base
# Process in segments
memvid put memory.mv2 --input video.mp4 --segment-duration 300
Limitations
| Limitation | Workaround |
|------------|------------|
| No real-time streaming | Pre-record content |
| Large file sizes | Use compression before ingestion |
| Multiple speakers | Manual speaker tagging via metadata |
| Background noise | Pre-process audio for noise reduction |
| Non-speech audio | Not transcribed (music, effects) |
SDK Support
Audio/video processing is currently CLI-only. SDK support is planned.
Workaround:
import subprocess

from memvid import use

# Ingest via the CLI; check=True raises CalledProcessError if memvid fails
subprocess.run(
    ["memvid", "put", "memory.mv2", "--input", "audio.mp3", "--language", "en"],
    check=True,
)

# Search transcriptions via SDK
mem = use("basic", "memory.mv2")
results = mem.find("meeting action items")
Next Steps
Visual Embeddings: CLIP search for images and video frames
Memory Cards: Extract entities from transcriptions