Memvid processes audio and video files using Whisper for transcription, making spoken content searchable. Video files also support key frame extraction and playback from within the CLI.

How It Works

Key features:
  • Automatic transcription - Whisper converts speech to text
  • Timestamp alignment - Text segments linked to audio/video timecodes
  • Key frame extraction - Important frames from video
  • In-CLI playback - Play segments directly from terminal
  • Visual search - CLIP embeddings for video frames
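
Timestamp alignment means each transcript segment carries its offset into the source file; the HH:MM:SS timecodes shown in search results are derived from those offsets. A minimal sketch of that conversion (the helper name is illustrative, not part of memvid):

```python
def to_timecode(seconds: float) -> str:
    """Convert a segment offset in seconds to an HH:MM:SS timecode."""
    total = int(seconds)
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

print(to_timecode(932.5))  # → 00:15:32
```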

Supported Formats

Audio

Format | Extension  | Notes
MP3    | .mp3       | Most common, lossy
WAV    | .wav       | Uncompressed, best quality
FLAC   | .flac      | Lossless compression
AAC    | .aac, .m4a | Apple/iTunes format
OGG    | .ogg       | Open format, Vorbis codec
ALAC   | .m4a       | Apple Lossless

Video

Format | Extension | Notes
MP4    | .mp4      | Most common, H.264/H.265
WebM   | .webm     | Web-optimized, VP8/VP9
MOV    | .mov      | Apple QuickTime
AVI    | .avi      | Legacy Windows format
MKV    | .mkv      | Matroska container
FLV    | .flv      | Flash video (legacy)

Ingesting Audio

Basic Usage

# Ingest audio file (auto-transcribes)
memvid put memory.mv2 --input podcast.mp3

# Ingest multiple audio files
memvid put memory.mv2 --input ./recordings/

# With metadata
memvid put memory.mv2 --input interview.wav --metadata '{"speaker": "John Doe", "date": "2024-01-15"}'

Transcription Options

# Specify language (faster, more accurate)
memvid put memory.mv2 --input audio.mp3 --language en

# Force language detection
memvid put memory.mv2 --input audio.mp3 --detect-language

# Use larger model for better accuracy
memvid put memory.mv2 --input audio.mp3 --whisper-model medium

Whisper Models

Model  | Size   | Speed   | Accuracy  | Best For
tiny   | 39 MB  | Fastest | Basic     | Quick previews
base   | 74 MB  | Fast    | Good      | General use
small  | 244 MB | Medium  | Better    | Default choice
medium | 769 MB | Slower  | Excellent | Important content
large  | 1.5 GB | Slowest | Best      | Critical accuracy

# Install specific model
memvid models install whisper-medium

# Use installed model
memvid put memory.mv2 --input audio.mp3 --whisper-model medium
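
When scripting ingestion, the model table above can be captured as a small lookup. The sizes are the approximate figures listed there; `largest_model_under` is a hypothetical helper, only a sketch of how to pick a model programmatically:

```python
# Approximate download sizes (MB) from the model table above, smallest to largest.
WHISPER_MODELS = {"tiny": 39, "base": 74, "small": 244, "medium": 769, "large": 1536}

def largest_model_under(budget_mb: int) -> str:
    """Pick the most accurate Whisper model that fits a size budget."""
    fitting = [name for name, mb in WHISPER_MODELS.items() if mb <= budget_mb]
    return fitting[-1] if fitting else "tiny"  # dicts preserve insertion order

print(largest_model_under(800))  # → medium
```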

Language Support

Whisper supports 99 languages. Specifying the language up front is faster and more accurate than auto-detection:
# English
memvid put memory.mv2 --input audio.mp3 --language en

# Spanish
memvid put memory.mv2 --input audio.mp3 --language es

# Mandarin Chinese
memvid put memory.mv2 --input audio.mp3 --language zh

# Auto-detect (slower)
memvid put memory.mv2 --input audio.mp3 --detect-language

Ingesting Video

Basic Usage

# Ingest video (transcribes audio + extracts frames)
memvid put memory.mv2 --input meeting.mp4

# Video only (no audio transcription)
memvid put memory.mv2 --input silent-video.mp4 --no-transcribe

# Audio only (skip frame extraction)
memvid put memory.mv2 --input video.mp4 --audio-only

Frame Extraction

# Extract key frames for visual search
memvid put memory.mv2 --input video.mp4 --extract-frames

# Control frame density
memvid put memory.mv2 --input video.mp4 --frame-interval 30  # Every 30 seconds

# Extract specific number of frames
memvid put memory.mv2 --input video.mp4 --max-frames 50
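
The two flags trade off against each other: --frame-interval sets sampling density, while --max-frames caps the total. Assuming interval-based sampling starting at t=0 (how memvid actually selects key frames may differ), the frame count works out as:

```python
import math
from typing import Optional

def planned_frames(duration_s: float, interval_s: float,
                   max_frames: Optional[int] = None) -> int:
    """Estimate key-frame count: one frame at t=0, then one per interval."""
    n = math.floor(duration_s / interval_s) + 1
    return min(n, max_frames) if max_frames is not None else n

print(planned_frames(3600, 30))      # → 121 (1-hour video, 30 s intervals)
print(planned_frames(3600, 30, 50))  # → 50  (capped by --max-frames)
```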

Visual Embeddings

Enable CLIP embeddings for visual search:
# Enable visual embeddings for frames
memvid put memory.mv2 --input video.mp4 --clip-embeddings

# Search by visual content
memvid find memory.mv2 --query "person at whiteboard" --mode clip

Searching Transcribed Content

# Search transcription text
memvid find memory.mv2 --query "quarterly revenue"

# Search with timestamp context
memvid find memory.mv2 --query "action items" --json

# Results include timecodes:
# {
#   "frame_id": "frame_abc123",
#   "text": "The action items from this meeting are...",
#   "timestamp": "00:15:32",
#   "source": "meeting.mp4"
# }
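
The timecodes in JSON results can be converted back to seconds and fed to --start-seconds for playback. A minimal sketch, where the sample result mirrors the shape shown above:

```python
import json

def timecode_to_seconds(tc: str) -> int:
    """Convert an HH:MM:SS timecode from search results to seconds."""
    h, m, s = (int(part) for part in tc.split(":"))
    return h * 3600 + m * 60 + s

# Shaped like a single result from `memvid find ... --json`.
result = json.loads('{"frame_id": "frame_abc123", "timestamp": "00:15:32"}')
print(timecode_to_seconds(result["timestamp"]))  # → 932
```

The value can then be passed straight to playback, e.g. `memvid view memory.mv2 --frame-id frame_abc123 --play --start-seconds 932`.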

Ask Questions

# Ask about audio/video content
memvid ask memory.mv2 --question "What were the main decisions from the meeting?"

# Get context with timestamps
memvid ask memory.mv2 --question "What did John say about the budget?" --sources

Visual Search (Video)

# Search by visual description
memvid find memory.mv2 --query "chart showing growth" --mode clip

# Combine text and visual
memvid find memory.mv2 --query "presentation slide" --mode auto

Playback

Playing Audio

# Play entire audio
memvid view memory.mv2 --frame-id frame_abc --play

# Play specific segment
memvid view memory.mv2 --frame-id frame_abc --play --start-seconds 30 --end-seconds 60

# Play from timestamp
memvid view memory.mv2 --frame-id frame_abc --play --start-seconds 120

Playing Video

# Play video
memvid view memory.mv2 --frame-id frame_xyz --play

# Play segment
memvid view memory.mv2 --frame-id frame_xyz --play --start-seconds 0 --end-seconds 30

# Preview mode (thumbnail)
memvid view memory.mv2 --frame-id frame_xyz --preview

Playback Controls

Option                   | Description
--play                   | Start playback
--start-seconds N        | Start at N seconds
--end-seconds N          | Stop at N seconds
--preview                | Show thumbnail/preview
--preview-start HH:MM:SS | Preview window start
--preview-end HH:MM:SS   | Preview window end

Use Cases

Meeting Recordings

# Create meeting memory
memvid create meetings.mv2

# Ingest meeting recordings
memvid put meetings.mv2 --input ./recordings/ --language en

# Find specific discussions
memvid find meetings.mv2 --query "budget approval"

# Ask about decisions
memvid ask meetings.mv2 --question "What was decided about the Q4 budget?"

# Play the relevant segment
memvid view meetings.mv2 --frame-id frame_abc --play --start-seconds 1234

Podcast Library

# Create podcast memory
memvid create podcasts.mv2

# Ingest episodes with metadata
memvid put podcasts.mv2 --input episode-42.mp3 \
  --metadata '{"show": "Tech Talk", "episode": 42, "guests": ["Alice", "Bob"]}'

# Search across all episodes
memvid find podcasts.mv2 --query "machine learning trends"

# Timeline of episodes
memvid timeline podcasts.mv2 --reverse

Video Tutorials

# Create tutorial library
memvid create tutorials.mv2

# Ingest with frame extraction
memvid put tutorials.mv2 --input ./tutorials/ --extract-frames --clip-embeddings

# Find by spoken content
memvid find tutorials.mv2 --query "how to configure webpack"

# Find by visual content
memvid find tutorials.mv2 --query "terminal with npm commands" --mode clip

Lecture Archive

# Create lecture memory
memvid create lectures.mv2

# Ingest lecture videos
memvid put lectures.mv2 --input ./cs101/ --whisper-model medium

# Search for topics
memvid find lectures.mv2 --query "binary search algorithm"

# Ask study questions
memvid ask lectures.mv2 --question "Explain the time complexity of quicksort"

Voicemail/Call Logs

# Create call memory
memvid create calls.mv2

# Ingest voicemails
memvid put calls.mv2 --input ./voicemails/ --metadata '{"type": "voicemail"}'

# Find by caller mention
memvid find calls.mv2 --query "callback number"

# Enrich with entity extraction
memvid enrich calls.mv2 --engine groq
memvid state calls.mv2 --entity "Customer Service"

GPU Acceleration

Transcription is CPU-intensive. Enable GPU for faster processing:

macOS (Apple Silicon)

# Install with Metal support
cargo install memvid-cli --features metal

# Or via Homebrew (includes Metal)
brew install memvid/tap/memvid

Linux/Windows (NVIDIA CUDA)

# Install with CUDA support
cargo install memvid-cli --features cuda

# Requires CUDA toolkit and cuDNN

Performance Comparison

Hardware        | 1 hr Audio | 1 hr Video
CPU (M1)        | ~15 min    | ~25 min
Metal (M1)      | ~3 min     | ~8 min
CPU (Intel i7)  | ~20 min    | ~35 min
CUDA (RTX 3080) | ~2 min     | ~5 min
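
These rates make batch planning straightforward. A back-of-the-envelope sketch using the approximate per-hour audio figures from the table (the hardware keys and helper are illustrative):

```python
# Approximate minutes to transcribe one hour of audio, from the table above.
RATES = {"cpu_m1": 15, "metal_m1": 3, "cpu_i7": 20, "cuda_3080": 2}

def batch_minutes(audio_hours: float, hardware: str) -> float:
    """Estimate wall-clock minutes to transcribe a batch of audio."""
    return audio_hours * RATES[hardware]

print(batch_minutes(10, "metal_m1"))  # → 30.0
```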

Batch Processing

Parallel Ingestion

# Process multiple files in parallel
memvid put memory.mv2 --input ./media/ --parallel-segments

# Limit concurrent transcriptions (manage memory)
memvid put memory.mv2 --input ./media/ --max-concurrent 2

Large Libraries

# Ingest incrementally
memvid put memory.mv2 --input ./media/2024-01/
memvid put memory.mv2 --input ./media/2024-02/

# Check progress
memvid stats memory.mv2
memvid timeline memory.mv2 --limit 10
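
Incremental ingestion is easy to script. A sketch that builds one `memvid put` invocation per monthly subdirectory (the directory layout is hypothetical, and the helpers are illustrative; the commands mirror the ones above):

```python
from pathlib import Path

def put_command(memory: str, input_path: str) -> list:
    """One `memvid put` invocation for a directory of media."""
    return ["memvid", "put", memory, "--input", input_path]

def ingest_commands(root: str, memory: str = "memory.mv2") -> list:
    """Build one command per subdirectory of media, oldest month first."""
    return [put_command(memory, str(d))
            for d in sorted(Path(root).iterdir()) if d.is_dir()]

print(put_command("memory.mv2", "./media/2024-01"))
# → ['memvid', 'put', 'memory.mv2', '--input', './media/2024-01']
```

Each command can then be run with `subprocess.run(cmd, check=True)`, pausing between months to check progress with `memvid stats`.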

Frame Metadata

Each transcribed segment includes metadata:
{
  "frame_id": "frame_abc123",
  "uri": "mv2://media/meeting.mp4",
  "content_type": "audio/transcript",
  "metadata": {
    "source_file": "meeting.mp4",
    "duration_seconds": 3600,
    "segment_start": 932.5,
    "segment_end": 945.2,
    "timestamp": "00:15:32",
    "language": "en",
    "confidence": 0.94,
    "whisper_model": "small"
  }
}
Access metadata:
# View frame with metadata
memvid view memory.mv2 --frame-id frame_abc --json

# Filter by media type
memvid timeline memory.mv2 --filter "content_type:audio/transcript"
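
The per-segment confidence score is useful for filtering noisy transcripts. A sketch over segment metadata shaped like the example above (the sample data and the 0.6 threshold are illustrative):

```python
import json

# Segment metadata shaped like the frame metadata example above.
segments = json.loads("""[
  {"timestamp": "00:15:32", "confidence": 0.94, "text": "The action items are..."},
  {"timestamp": "00:16:01", "confidence": 0.41, "text": "[inaudible]"}
]""")

# Keep only segments Whisper was reasonably confident about.
reliable = [s for s in segments if s["confidence"] >= 0.6]
print([s["timestamp"] for s in reliable])  # → ['00:15:32']
```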

Troubleshooting

No Transcription Output

# Check if Whisper model is installed
memvid models list

# Install required model
memvid models install whisper-small

# Try with explicit language
memvid put memory.mv2 --input audio.mp3 --language en

Poor Transcription Quality

# Use larger model
memvid put memory.mv2 --input audio.mp3 --whisper-model medium

# Specify correct language
memvid put memory.mv2 --input audio.mp3 --language es

# Check audio quality - Whisper works best with clear audio

Playback Issues

# Check frame exists
memvid view memory.mv2 --frame-id frame_abc

# Try preview mode first
memvid view memory.mv2 --frame-id frame_abc --preview

# Check file format support
memvid stats memory.mv2 --json | jq '.frames[] | select(.uri | contains("video"))'

Out of Memory

# Reduce concurrent processing
memvid put memory.mv2 --input video.mp4 --max-concurrent 1

# Use smaller model
memvid put memory.mv2 --input video.mp4 --whisper-model base

# Process in segments
memvid put memory.mv2 --input video.mp4 --segment-duration 300

Limitations

Limitation             | Workaround
No real-time streaming | Pre-record content
Large file sizes       | Use compression before ingestion
Multiple speakers      | Manual speaker tagging via metadata
Background noise       | Pre-process audio for noise reduction
Non-speech audio       | Not transcribed (music, effects)

SDK Support

Audio/video processing is currently CLI-only; SDK support is planned. As a workaround, ingest via the CLI and search the resulting transcriptions via the SDK:
import subprocess

# Ingest via CLI (check=True raises CalledProcessError if ingestion fails)
subprocess.run([
    'memvid', 'put', 'memory.mv2',
    '--input', 'audio.mp3',
    '--language', 'en'
], check=True)

# Search transcriptions via SDK
from memvid import use

mem = use('basic', 'memory.mv2')
results = mem.find("meeting action items")

Next Steps