Memvid automatically prevents duplicate content from bloating your memory files using two complementary techniques: content hashing for exact duplicates and SimHash for near-duplicates.
How Deduplication Works
When you add content to a memory, Memvid performs two checks:
| Check | Algorithm | Catches |
|---|---|---|
| Exact | BLAKE3 hash | Identical content |
| Near | SimHash (64-bit LSH) | Similar content with minor variations |
Both checks happen automatically during put operations with no configuration required.
Exact Deduplication
Every frame stores a BLAKE3 content hash. When you add new content:
- The hash of the new content is computed
- The hash is checked against the hashes of existing frames
- If a match is found, the existing frame ID is returned
- No duplicate frame is created
# First put - creates new frame
memvid put memory.mv2 --input document.pdf
# Output: Created frame_abc123
# Second put of same file - returns existing frame
memvid put memory.mv2 --input document.pdf
# Output: Duplicate detected, returning existing frame_abc123
# Python SDK
frame_id_1 = mem.put("The quick brown fox")
frame_id_2 = mem.put("The quick brown fox") # Same content
assert frame_id_1 == frame_id_2 # True - no duplicate created
// Node.js SDK
const id1 = await mem.put({ content: "The quick brown fox" })
const id2 = await mem.put({ content: "The quick brown fox" })
console.log(id1 === id2) // true
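The exact-match flow described above can be sketched as a small in-memory store. This is an illustration only, not Memvid's implementation: the `ExactDedupStore` class and its frame ID format are hypothetical, and `hashlib.blake2b` stands in for BLAKE3 because BLAKE3 is not part of the Python standard library.

```python
import hashlib


class ExactDedupStore:
    """Toy in-memory store that returns the existing frame ID for repeated content."""

    def __init__(self):
        self._frame_by_hash = {}  # content hash -> frame ID
        self._next_id = 1

    def put(self, content: str) -> str:
        # Hash the new content (BLAKE2b is a stand-in for BLAKE3 here).
        digest = hashlib.blake2b(content.encode("utf-8"), digest_size=32).hexdigest()
        # Check the hash against existing frames; on a match, return that frame ID.
        existing = self._frame_by_hash.get(digest)
        if existing is not None:
            return existing
        # No match: create and record a new frame.
        frame_id = f"frame_{self._next_id:03d}"
        self._next_id += 1
        self._frame_by_hash[digest] = frame_id
        return frame_id


store = ExactDedupStore()
first = store.put("The quick brown fox")
second = store.put("The quick brown fox")   # identical bytes, same frame
third = store.put("The quick brown fox!")   # one extra character, new frame
print(first, second, third)  # frame_001 frame_001 frame_002
```

Because the lookup is keyed on the hash of the raw bytes, any single-character difference produces a new frame, which is why near-duplicates need a separate check.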
SimHash (Near-Duplicate Detection)
SimHash is a locality-sensitive hashing algorithm that detects near-duplicate content - documents that are almost identical but have minor differences like:
- Whitespace changes
- Punctuation variations
- Minor edits or typos
- Reformatted text
How SimHash Works
- Tokenize: Break content into word n-grams (shingles)
- Hash shingles: Each shingle gets a 64-bit hash
- Combine: A weighted per-bit vote across all shingle hashes produces the final 64-bit fingerprint
- Compare: The Hamming distance between two fingerprints measures how much the documents differ
Two documents are considered near-duplicates if their SimHash fingerprints differ by fewer than 32 bits (out of 64).
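The four steps above can be sketched in a few lines of Python. This is a minimal textbook SimHash under stated assumptions, not Memvid's actual code: the shingle size of 3 words, the uniform weight of 1 per shingle, and the use of `hashlib.blake2b` as the 64-bit shingle hash are all choices made for illustration.

```python
import hashlib


def shingles(text: str, n: int = 3) -> list[str]:
    """Split text into overlapping word n-grams (one shingle if the text is short)."""
    words = text.lower().split()
    if len(words) < n:
        return [" ".join(words)]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def hash64(shingle: str) -> int:
    """64-bit hash of a shingle (BLAKE2b with an 8-byte digest, for illustration)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(text: str, n: int = 3) -> int:
    """Combine shingle hashes into one 64-bit fingerprint by per-bit voting."""
    votes = [0] * 64
    for sh in shingles(text, n):
        h = hash64(sh)
        for bit in range(64):
            # Each shingle votes +1 where its hash has a 1 bit, -1 where it has a 0 bit.
            votes[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


a = simhash("The quick brown fox jumps over the lazy dog.")
b = simhash("The quick brown fox jumps over the lazy dog")
print(hamming_distance(a, b))
```

Documents that share most of their shingles cast mostly the same votes, so their fingerprints differ in only a few bits. Note that this sketch keeps punctuation attached to words; a real tokenizer may strip it first.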
Hamming Distance Thresholds
| Distance | Similarity | Classification |
|---|---|---|
| 0-10 bits | 85-100% | Near-identical |
| 11-20 bits | 70-85% | Very similar |
| 21-31 bits | 50-70% | Somewhat similar |
| 32+ bits | < 50% | Different documents |
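The thresholds in the table translate directly into a lookup. The helper names below are hypothetical, and the `1 - distance / 64` similarity estimate is an assumption about how the percentage column was derived (it gives roughly 84% at 10 bits, 69% at 20 bits, and 52% at 31 bits, close to the table's boundaries).

```python
def classify(distance: int) -> str:
    """Map a Hamming distance (0-64) to the classification used in the table."""
    if not 0 <= distance <= 64:
        raise ValueError("distance must be between 0 and 64")
    if distance <= 10:
        return "Near-identical"
    if distance <= 20:
        return "Very similar"
    if distance <= 31:
        return "Somewhat similar"
    return "Different documents"


def is_near_duplicate(distance: int) -> bool:
    """Near-duplicate means fewer than 32 differing bits out of 64."""
    return distance < 32


def approx_similarity(distance: int) -> float:
    """Fraction of matching fingerprint bits (an assumed reading of the table)."""
    return 1 - distance / 64


for d in (0, 16, 31, 32):
    print(d, classify(d), is_near_duplicate(d), round(approx_similarity(d), 2))
```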
Example: Near-Duplicate Detection
# Original document
echo "The quick brown fox jumps over the lazy dog." | memvid put memory.mv2 --input -
# Output: Created frame_001
# Minor variation (trailing period removed)
echo "The quick brown fox jumps over the lazy dog" | memvid put memory.mv2 --input -
# Output: Near-duplicate of frame_001 detected, skipping
# Different document (Hamming distance is at or above the threshold)
echo "A slow red cat sleeps under the busy cat." | memvid put memory.mv2 --input -
# Output: Created frame_002
Sketch Track (Fast Pre-filtering)
For large memories (10k+ frames), Memvid uses sketch tracks to accelerate duplicate detection. Sketches are compact fingerprints that enable sub-millisecond candidate filtering.
Sketch Variants
| Variant | Size | Speed | Accuracy | Best For |
|---|---|---|---|---|
| small | 32 bytes | Fastest | Good | < 50k frames |
| medium | 64 bytes | Fast | Better | 50k-200k frames |
| large | 96 bytes | Moderate | Best | 200k+ frames |
Building Sketches
# Build sketch index (recommended for large memories)
memvid sketch build memory.mv2 --variant medium
# Check sketch stats
memvid sketch info memory.mv2
Output:
Sketch Track Info
Variant: medium (64 bytes)
Frames indexed: 45,230
Index size: 2.9 MB
Avg lookup time: 0.3ms
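The reported index size is consistent with one fixed-size sketch per frame. The short calculation below checks that arithmetic; `estimate_index_bytes` is a hypothetical helper, and it assumes no per-frame overhead beyond the sketch itself.

```python
# Sketch sizes in bytes, taken from the Sketch Variants table.
SKETCH_BYTES = {"small": 32, "medium": 64, "large": 96}


def estimate_index_bytes(frame_count: int, variant: str) -> int:
    """Lower-bound estimate: one fixed-size sketch per frame, no extra overhead."""
    return frame_count * SKETCH_BYTES[variant]


size = estimate_index_bytes(45_230, "medium")
print(size)                  # 2894720
print(round(size / 1e6, 1))  # 2.9 (MB, decimal)
```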
How Sketches Speed Up Search
Without sketches:
- Compare query against all 45,230 frames
- Full SimHash comparison for each
- ~450ms total
With sketches:
- Compare query sketch against sketch index
- Get ~100 candidates in 0.3ms
- Full comparison only on candidates
- ~5ms total (90x faster)
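The two-stage idea (cheap candidate lookup, then a full comparison only on candidates) can be illustrated with a banded index over 64-bit fingerprints. This is a generic locality-sensitive hashing technique and not Memvid's sketch format, which is not described here; the `BandedIndex` class and the 4 x 16-bit band layout are assumptions for illustration.

```python
from collections import defaultdict

BANDS = 4
BAND_BITS = 16


def bands(fingerprint: int) -> list[tuple[int, int]]:
    """Split a 64-bit fingerprint into (band index, 16-bit value) keys."""
    mask = (1 << BAND_BITS) - 1
    return [(i, (fingerprint >> (i * BAND_BITS)) & mask) for i in range(BANDS)]


class BandedIndex:
    def __init__(self):
        self._buckets = defaultdict(set)  # band key -> frame IDs
        self._fingerprints = {}           # frame ID -> full fingerprint

    def add(self, frame_id: str, fingerprint: int) -> None:
        self._fingerprints[frame_id] = fingerprint
        for key in bands(fingerprint):
            self._buckets[key].add(frame_id)

    def candidates(self, fingerprint: int) -> set[str]:
        """Stage 1: frames that share at least one band with the query."""
        found = set()
        for key in bands(fingerprint):
            found |= self._buckets.get(key, set())
        return found

    def near_duplicates(self, fingerprint: int, max_distance: int) -> list[str]:
        """Stage 2: full Hamming comparison, restricted to the candidates."""
        return sorted(
            fid for fid in self.candidates(fingerprint)
            if bin(self._fingerprints[fid] ^ fingerprint).count("1") <= max_distance
        )


index = BandedIndex()
index.add("frame_001", 0x0123456789ABCDEF)
index.add("frame_002", 0xFEDCBA9876543210)
query = 0x0123456789ABCDEF ^ 0b101  # two bits flipped in the lowest band
print(index.near_duplicates(query, max_distance=3))  # ['frame_001']
```

With four bands, any fingerprint within 3 bits of the query is guaranteed to share at least one band with it, so the candidate stage cannot miss it. Larger distances trade recall for speed, which is the kind of accuracy/speed trade-off the variant table describes.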
Deduplication Statistics
Check deduplication stats for your memory:
memvid stats memory.mv2 --json
{
  "frame_count": 1250,
  "unique_content_hashes": 1248,
  "duplicate_frames_prevented": 127,
  "has_sketch_track": true,
  "sketch_variant": "medium"
}
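If you collect these stats in a script, a few derived numbers are often easier to read than the raw counters. The sketch below parses the sample output shown above; `dedup_summary` is a hypothetical helper, and its reading of the fields (for example, that each prevented duplicate would otherwise have added exactly one frame) is an assumption rather than documented behavior.

```python
import json

# Sample payload copied from the `memvid stats --json` output shown above.
stats_json = """
{
  "frame_count": 1250,
  "unique_content_hashes": 1248,
  "duplicate_frames_prevented": 127,
  "has_sketch_track": true,
  "sketch_variant": "medium"
}
"""


def dedup_summary(stats: dict) -> dict:
    """Derive a few ratios from the stats fields (interpretation is an assumption)."""
    prevented = stats["duplicate_frames_prevented"]
    stored = stats["frame_count"]
    attempted = stored + prevented  # assumes one frame per prevented put
    return {
        "attempted_puts": attempted,
        "prevented_share": prevented / attempted if attempted else 0.0,
        "frames_sharing_a_hash": stored - stats["unique_content_hashes"],
    }


summary = dedup_summary(json.loads(stats_json))
print(summary["attempted_puts"])             # 1377
print(round(summary["prevented_share"], 3))  # 0.092
print(summary["frames_sharing_a_hash"])      # 2
```

A nonzero `frames_sharing_a_hash` would be expected when some frames were stored with deduplication bypassed.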
When Duplicates Are Allowed
Some use cases require keeping duplicates:
Audit Trails
When you need to track every submission regardless of content:
# Add timestamp to make each entry unique
memvid put memory.mv2 --input report.pdf --timestamp "$(date -u +%s)"
# Python - unique URI bypasses dedup
import time
mem.put(
    text="Daily report content",
    uri=f"report-{int(time.time())}"
)
Versioning
Track document versions explicitly:
# Version in metadata distinguishes duplicates
memvid put memory.mv2 --input contract.pdf --metadata '{"version": "1.0"}'
memvid put memory.mv2 --input contract.pdf --metadata '{"version": "1.1"}'
Disabling Deduplication
For specific use cases where you want all content stored:
# Python SDK - force creation
frame_id = mem.put(
    text="Content that might be duplicate",
    skip_dedup=True  # Force new frame creation
)
Disabling deduplication can significantly increase storage usage. Only disable when you have a specific need to store duplicate content.
Deduplication Across Memories
Deduplication only works within a single .mv2 file. The same content in different memory files will be stored separately.
# These are independent - both will store the content
memvid put work.mv2 --input document.pdf
memvid put personal.mv2 --input document.pdf
Performance Impact
| Operation | With Dedup | Without Dedup |
|---|---|---|
| Single put | +2ms | Baseline |
| Batch put (1000) | +50ms total | Baseline |
| Storage (duplicates) | 0 bytes | Full size |
The overhead is minimal and the storage savings are typically significant - especially for:
- Chat logs with repeated messages
- Documentation with boilerplate sections
- Logs with repeated patterns
- Meeting notes with agenda templates
Best Practices
For Most Use Cases
Let deduplication work automatically:
# Just add content - dedup handles the rest
memvid put memory.mv2 --input ./documents/
For Large Collections
Build sketch indices for faster dedup checking:
# After initial bulk import
memvid sketch build memory.mv2 --variant medium
# Future puts will be faster
memvid put memory.mv2 --input ./new-documents/
For Audit Requirements
Use unique identifiers when duplicates matter:
# Each entry gets a unique URI (log_entry and timestamp are defined by the caller)
from uuid import uuid4

mem.put(
    text=log_entry,
    uri=f"log/{timestamp}/{uuid4()}"
)
Troubleshooting
“Why isn’t my duplicate being detected?”
- Content differs slightly: Check for hidden whitespace, encoding differences
- Different metadata: URI or timestamp makes entries unique
- Sketch not built: For large memories, build sketch index
# Check if content hashes match
memvid view memory.mv2 --frame-id frame_001 --json | jq '.content_hash'
memvid view memory.mv2 --frame-id frame_002 --json | jq '.content_hash'
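When two hashes differ for text that looks identical, the cause is usually invisible: a different Unicode composition, a zero-width character, or extra whitespace. The sketch below shows how such differences change a content hash and how normalizing before ingestion removes them. It is a standalone illustration: `normalize` and `content_digest` are hypothetical helpers, BLAKE2b stands in for BLAKE3, and whether Memvid applies any normalization internally is not stated here.

```python
import hashlib
import unicodedata


def content_digest(text: str) -> str:
    """Hash the UTF-8 bytes of the text (BLAKE2b as a stand-in for BLAKE3)."""
    return hashlib.blake2b(text.encode("utf-8"), digest_size=16).hexdigest()


def normalize(text: str) -> str:
    """NFC-normalize, drop zero-width characters, and collapse whitespace runs."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\u200b", "").replace("\ufeff", "")
    return " ".join(text.split())


a = "Caf\u00e9 menu"          # precomposed e-acute
b = "Cafe\u0301  menu\u200b"  # e + combining accent, double space, zero-width space

print(repr(a), repr(b))  # repr() exposes the hidden differences
print(content_digest(a) == content_digest(b))                        # False
print(content_digest(normalize(a)) == content_digest(normalize(b)))  # True
```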
“Why was my unique content marked as duplicate?”
SimHash can have false positives for very short content or content with similar structure:
# Very short content may collide
echo "yes" | memvid put memory.mv2 --input -
echo "no" | memvid put memory.mv2 --input - # Might be seen as near-duplicate
Solution: Add distinguishing context or use unique URIs.