Memvid automatically prevents duplicate content from bloating your memory files using two complementary techniques: content hashing for exact duplicates and SimHash for near-duplicates.
How Deduplication Works
When you add content to a memory, Memvid performs two checks:

| Check | Algorithm | Catches |
|---|---|---|
| Exact | BLAKE3 hash | Identical content |
| Near | SimHash (64-bit LSH) | Similar content with minor variations |
Both checks run automatically during `put` operations with no configuration required.
Exact Deduplication
Every frame stores a BLAKE3 content hash. When you add new content:
- Hash is computed for the new content
- Hash is checked against existing frames
- If match found, the existing frame ID is returned
- No duplicate frame is created
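
The four steps above can be illustrated with a minimal in-memory sketch. This is not Memvid's implementation: `FrameStore` and its `put` method are hypothetical names, and BLAKE2b from the Python standard library stands in for BLAKE3, which would need a third-party package.

```python
import hashlib


class FrameStore:
    """Toy in-memory store that deduplicates by content hash (hypothetical)."""

    def __init__(self) -> None:
        self._frames: list[bytes] = []       # frame ID is the list index
        self._by_hash: dict[bytes, int] = {}  # content hash -> frame ID

    def put(self, content: bytes) -> int:
        # Assumption: BLAKE2b stands in for BLAKE3 (not in the stdlib).
        digest = hashlib.blake2b(content, digest_size=32).digest()
        existing = self._by_hash.get(digest)
        if existing is not None:
            # Exact duplicate: return the existing frame ID, store nothing.
            return existing
        frame_id = len(self._frames)
        self._frames.append(content)
        self._by_hash[digest] = frame_id
        return frame_id

    def __len__(self) -> int:
        return len(self._frames)
```

Calling `put` twice with the same bytes returns the same frame ID and leaves the frame count unchanged, while a single changed byte produces a new frame.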
SimHash (Near-Duplicate Detection)
SimHash is a locality-sensitive hashing algorithm that detects near-duplicate content: documents that are almost identical but have minor differences like:
- Whitespace changes
- Punctuation variations
- Minor edits or typos
- Reformatted text
How SimHash Works
- Tokenize: Break content into word n-grams (shingles)
- Hash shingles: Each shingle gets a 64-bit hash
- Combine: Weighted combination produces final 64-bit fingerprint
- Compare: Hamming distance measures similarity
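
A generic SimHash following these four steps might look like the sketch below. It is illustrative only: the shingle size of 3 words, lowercasing, uniform shingle weights, and the use of BLAKE2b as the 64-bit shingle hash are all assumptions, not details confirmed for Memvid.

```python
import hashlib


def shingles(text: str, n: int = 3) -> list[str]:
    """Tokenize: split into lowercase words and build word n-grams."""
    words = text.lower().split()
    if len(words) < n:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def hash64(token: str) -> int:
    """Hash shingles: map a shingle to a 64-bit integer (assumed hash choice)."""
    raw = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(raw, "big")


def simhash(text: str, n: int = 3) -> int:
    """Combine: each shingle votes +1/-1 per bit; the sign gives the final bit."""
    votes = [0] * 64
    for shingle in shingles(text, n):
        h = hash64(shingle)
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Compare: count the bits in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Because the tokenizer splits on any whitespace and lowercases, texts that differ only in spacing or letter case yield identical fingerprints here; small wording changes alter only a few shingles and therefore tend to flip only a few bits.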
Hamming Distance Thresholds
| Distance | Similarity | Classification |
|---|---|---|
| 0-10 bits | 85-100% | Near-identical |
| 11-20 bits | 70-85% | Very similar |
| 21-31 bits | 50-70% | Somewhat similar |
| 32+ bits | < 50% | Different documents |
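
The thresholds in this table can be expressed as a small lookup. The function names are hypothetical, and the `bit_agreement` helper reflects an assumption that the Similarity column roughly tracks the fraction of matching bits in a 64-bit fingerprint; the table's percentages are rounded, so the two do not match exactly.

```python
def classify(distance: int) -> str:
    """Map a Hamming distance between 64-bit fingerprints to the table's labels."""
    if not 0 <= distance <= 64:
        raise ValueError("distance must be between 0 and 64 bits")
    if distance <= 10:
        return "near-identical"
    if distance <= 20:
        return "very similar"
    if distance <= 31:
        return "somewhat similar"
    return "different"


def bit_agreement(distance: int) -> float:
    """Fraction of the 64 fingerprint bits that match (assumed similarity proxy)."""
    return 1 - distance / 64
```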
Example: Near-Duplicate Detection
Sketch Track (Fast Pre-filtering)
For large memories (10k+ frames), Memvid uses sketch tracks to accelerate duplicate detection. Sketches are compact fingerprints that enable sub-millisecond candidate filtering.
Sketch Variants
| Variant | Size | Speed | Accuracy | Best For |
|---|---|---|---|---|
| small | 32 bytes | Fastest | Good | < 50k frames |
| medium | 64 bytes | Fast | Better | 50k-200k frames |
| large | 96 bytes | Moderate | Best | 200k+ frames |
Building Sketches
How Sketches Speed Up Search
Without sketches:
- Compare query against all 45,230 frames
- Full SimHash comparison for each
- ~450ms total

With sketches:
- Compare query sketch against sketch index
- Get ~100 candidates in 0.3ms
- Full comparison only on candidates
- ~5ms total (90x faster)
Deduplication Statistics
Check deduplication stats for your memory:
When Duplicates Are Allowed
Some use cases require keeping duplicates:
Audit Trails
When you need to track every submission regardless of content:
Versioning
Track document versions explicitly:
Disabling Deduplication
For specific use cases where you want all content stored:
Deduplication Across Memories
Deduplication only works within a single `.mv2` file. The same content in different memory files will be stored separately.
Performance Impact
| Operation | With Dedup | Without Dedup |
|---|---|---|
| Single put | +2ms | Baseline |
| Batch put (1000) | +50ms total | Baseline |
| Storage (duplicates) | 0 bytes | Full size |

Deduplication helps most with repetitive content, such as:
- Chat logs with repeated messages
- Documentation with boilerplate sections
- Logs with repeated patterns
- Meeting notes with agenda templates
Best Practices
For Most Use Cases
Let deduplication work automatically:
For Large Collections
Build sketch indices for faster dedup checking:
For Audit Requirements
Use unique identifiers when duplicates matter:
Troubleshooting
"Why isn't my duplicate being detected?"
- Content differs slightly: Check for hidden whitespace, encoding differences
- Different metadata: URI or timestamp makes entries unique
- Sketch not built: For large memories, build sketch index
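
To check the first cause, it can help to compare the raw bytes of two strings that look identical. The sketch below is a standalone diagnostic, not part of Memvid: the `digest` and `normalize` helpers are hypothetical, and BLAKE2b again stands in for BLAKE3. It shows that Unicode composition and a non-breaking space change the hash, and that normalizing before hashing removes the difference.

```python
import hashlib
import unicodedata


def digest(text: str) -> str:
    """Hash the UTF-8 bytes of the text (BLAKE2b as a stand-in for BLAKE3)."""
    return hashlib.blake2b(text.encode("utf-8"), digest_size=16).hexdigest()


def normalize(text: str) -> str:
    """Fold Unicode variants with NFKC, then collapse runs of whitespace."""
    folded = unicodedata.normalize("NFKC", text)
    return " ".join(folded.split())


# Visually the same, but different code points:
a = "Caf\u00e9 menu"            # precomposed "é", ordinary space
b = "Cafe\u0301\u00a0menu "     # "e" + combining accent, non-breaking space, trailing space

print(repr(a), digest(a))
print(repr(b), digest(b))
print(digest(normalize(a)) == digest(normalize(b)))
```

Printing with `repr` exposes the hidden characters; if normalized digests match while raw digests differ, the content is only an encoding or whitespace variant of the original.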
"Why was my unique content marked as duplicate?"
SimHash can have false positives for very short content or content with similar structure:
Next Steps
- Adaptive Retrieval: Automatically determine optimal result counts
- Indices & Tracks: Understand how content is indexed