Deduplication & SimHash

Memvid automatically prevents duplicate content from bloating your memory files using two complementary techniques: content hashing for exact duplicates and SimHash for near-duplicates.

How Deduplication Works

When you add content to a memory, Memvid performs two checks:

Check	Algorithm	Catches
Exact	BLAKE3 hash	Identical content
Near	SimHash (64-bit LSH)	Similar content with minor variations

Both checks happen automatically during put operations with no configuration required.

Exact Deduplication

Every frame stores a BLAKE3 content hash. When you add new content:

1. A hash is computed for the new content.
2. The hash is checked against existing frames.
3. If a match is found, the existing frame ID is returned.
4. No duplicate frame is created.

# First put - creates new frame
memvid put memory.mv2 --input document.pdf
# Output: Created frame_abc123

# Second put of same file - returns existing frame
memvid put memory.mv2 --input document.pdf
# Output: Duplicate detected, returning existing frame_abc123

# Python SDK
frame_id_1 = mem.put("The quick brown fox")
frame_id_2 = mem.put("The quick brown fox")  # Same content

assert frame_id_1 == frame_id_2  # True - no duplicate created

// Node.js SDK
const id1 = await mem.put({ content: "The quick brown fox" })
const id2 = await mem.put({ content: "The quick brown fox" })

console.log(id1 === id2)  // true
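
The exact check amounts to a lookup keyed by content hash. The following is a minimal Python sketch of that idea, not Memvid's implementation. BLAKE3 is not in the Python standard library, so `hashlib.blake2b` stands in for it here. `FrameStore`, its `put` method, and the `frame_NNN` ID format are hypothetical names used only for illustration.

```python
import hashlib


class FrameStore:
    """Toy in-memory store that deduplicates by content hash (illustrative only)."""

    def __init__(self):
        self._by_hash = {}   # content hash (hex) -> frame ID
        self._frames = {}    # frame ID -> raw content bytes

    def put(self, content: str) -> str:
        data = content.encode("utf-8")
        # Stand-in for BLAKE3: any collision-resistant hash shows the idea.
        digest = hashlib.blake2b(data, digest_size=32).hexdigest()
        existing = self._by_hash.get(digest)
        if existing is not None:
            return existing  # exact duplicate: hand back the existing frame ID
        frame_id = f"frame_{len(self._frames) + 1:03d}"
        self._by_hash[digest] = frame_id
        self._frames[frame_id] = data
        return frame_id

    def __len__(self):
        return len(self._frames)


store = FrameStore()
first = store.put("The quick brown fox")
second = store.put("The quick brown fox")   # same bytes, same hash
third = store.put("The quick brown fox ")   # trailing space changes the hash
print(first == second, first == third, len(store))
```

Because the hash covers the raw bytes, even a single trailing space produces a new frame here. That gap is what the near-duplicate check is meant to cover.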

SimHash (Near-Duplicate Detection)

SimHash is a locality-sensitive hashing algorithm that detects near-duplicate content - documents that are almost identical but have minor differences like:

Whitespace changes
Punctuation variations
Minor edits or typos
Reformatted text

How SimHash Works

1. Tokenize: Break the content into word n-grams (shingles).
2. Hash shingles: Each shingle gets a 64-bit hash.
3. Combine: A weighted combination of the shingle hashes produces the final 64-bit fingerprint.
4. Compare: The Hamming distance between two fingerprints measures how similar the documents are.

Two documents are considered near-duplicates if their SimHash fingerprints differ by fewer than 32 bits (out of 64).
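
The four steps above can be sketched in a few lines of Python. This is a generic SimHash, not Memvid's code, and the details are assumptions: lowercased `\w+` tokens, 3-word shingles, `hashlib.blake2b` truncated to 64 bits as the shingle hash, and a uniform weight of 1 per shingle occurrence. The function names are hypothetical.

```python
import hashlib
import re


def shingles(text: str, n: int = 3) -> list[str]:
    """Lowercase the text, split it into words, and return word n-grams."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def hash64(token: str) -> int:
    """64-bit hash of one shingle (blake2b truncated to 8 bytes)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def simhash(text: str, n: int = 3) -> int:
    """Combine shingle hashes into one 64-bit fingerprint by per-bit voting."""
    votes = [0] * 64
    for token in shingles(text, n):
        h = hash64(token)
        for bit in range(64):
            # Each shingle votes +1 where its hash has a 1 bit, -1 otherwise.
            votes[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


a = simhash("The quick brown fox jumps over the lazy dog.")
b = simhash("The quick brown fox jumps over the lazy dog")
c = simhash("A slow red cat sleeps under the busy cat.")
print(hamming(a, b), hamming(a, c))
```

With this tokenizer the trailing period is dropped before shingling, so the first two sentences produce the same fingerprint. The distance to the third sentence depends on the hash function and is not stated here.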

Hamming Distance Thresholds

Distance	Similarity	Classification
0-10 bits	85-100%	Near-identical
11-20 bits	70-85%	Very similar
21-31 bits	50-70%	Somewhat similar
32+ bits	< 50%	Different documents
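
As a small illustration, the table above maps directly onto a lookup function. The function names are hypothetical. The 32-bit cutoff comes from the near-duplicate rule stated earlier in this section.

```python
def classify(distance: int) -> str:
    """Map a Hamming distance (0-64 bits) to the classification in the table."""
    if not 0 <= distance <= 64:
        raise ValueError("distance must be between 0 and 64 bits")
    if distance <= 10:
        return "Near-identical"
    if distance <= 20:
        return "Very similar"
    if distance <= 31:
        return "Somewhat similar"
    return "Different documents"


def is_near_duplicate(distance: int, threshold: int = 32) -> bool:
    """Near-duplicates differ by fewer than `threshold` bits."""
    return distance < threshold


for d in (0, 10, 11, 31, 32):
    print(d, classify(d), is_near_duplicate(d))
```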

Example: Near-Duplicate Detection

# Original document
echo "The quick brown fox jumps over the lazy dog." | memvid put memory.mv2 --input -
# Output: Created frame_001

# Minor variation (trailing period removed)
echo "The quick brown fox jumps over the lazy dog" | memvid put memory.mv2 --input -
# Output: Near-duplicate of frame_001 detected, skipping

# Different document (too far apart to count as a near-duplicate)
echo "A slow red cat sleeps under the busy cat." | memvid put memory.mv2 --input -
# Output: Created frame_002

Sketch Track (Fast Pre-filtering)

For large memories (10k+ frames), Memvid uses sketch tracks to accelerate duplicate detection. Sketches are compact fingerprints that enable sub-millisecond candidate filtering.

Sketch Variants

Variant	Size	Speed	Accuracy	Best For
`small`	32 bytes	Fastest	Good	< 50k frames
`medium`	64 bytes	Fast	Better	50k-200k frames
`large`	96 bytes	Moderate	Best	200k+ frames

Building Sketches

# Build sketch index (recommended for large memories)
memvid sketch build memory.mv2 --variant medium

# Check sketch stats
memvid sketch info memory.mv2

Output:

Sketch Track Info
  Variant: medium (64 bytes)
  Frames indexed: 45,230
  Index size: 2.9 MB
  Avg lookup time: 0.3ms

How Sketches Speed Up Search

Without sketches:

Compare query against all 45,230 frames
Full SimHash comparison for each
~450ms total

With sketches:

Compare query sketch against sketch index
Get ~100 candidates in 0.3ms
Full comparison only on candidates
~5ms total (90x faster)
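
This page does not describe how sketches are laid out internally. One common way to get this kind of two-stage filtering for 64-bit fingerprints is banding: split each fingerprint into bands, bucket frames by band value, and run the full comparison only on frames that share at least one band with the query. The sketch below shows that generic technique, not Memvid's sketch format. `BandIndex`, its methods, and the 4 x 16-bit split are assumptions.

```python
from collections import defaultdict

BANDS = 4                      # assumption: 4 bands of 16 bits each
BAND_BITS = 64 // BANDS
BAND_MASK = (1 << BAND_BITS) - 1


def bands(fingerprint: int) -> list[tuple[int, int]]:
    """Split a 64-bit fingerprint into (band index, band value) keys."""
    return [(i, (fingerprint >> (i * BAND_BITS)) & BAND_MASK) for i in range(BANDS)]


class BandIndex:
    """Hypothetical pre-filter: bucket frames by each band of their fingerprint."""

    def __init__(self):
        self._buckets = defaultdict(set)   # (band index, band value) -> frame IDs
        self._fingerprints = {}            # frame ID -> full fingerprint

    def add(self, frame_id: str, fingerprint: int) -> None:
        self._fingerprints[frame_id] = fingerprint
        for key in bands(fingerprint):
            self._buckets[key].add(frame_id)

    def candidates(self, fingerprint: int) -> set[str]:
        """Cheap stage: frames that share at least one band with the query."""
        found = set()
        for key in bands(fingerprint):
            found |= self._buckets.get(key, set())
        return found

    def near_duplicates(self, fingerprint: int, max_distance: int) -> list[str]:
        """Full stage: exact Hamming distance, but only on the candidates."""
        return sorted(
            fid for fid in self.candidates(fingerprint)
            if bin(self._fingerprints[fid] ^ fingerprint).count("1") <= max_distance
        )


index = BandIndex()
index.add("frame_001", 0x0123456789ABCDEF)
index.add("frame_002", 0xFEDCBA9876543210)
query = 0x0123456789ABCDEF ^ 0b101   # frame_001 with two low bits flipped
print(index.candidates(query), index.near_duplicates(query, max_distance=3))
```

With four bands, any two fingerprints that differ by at most three bits must share at least one identical band, so they always land in the candidate set. Pairs that are further apart can be missed, so a looser threshold would need more bands or a different scheme.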

Deduplication Statistics

Check deduplication stats for your memory:

memvid stats memory.mv2 --json

{
  "frame_count": 1250,
  "unique_content_hashes": 1248,
  "duplicate_frames_prevented": 127,
  "has_sketch_track": true,
  "sketch_variant": "medium"
}
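
To turn these numbers into a single figure, the JSON can be parsed and the share of rejected puts computed. The snippet below is a sketch that works on the sample output above. It assumes `frame_count` counts only stored frames, so attempted puts are `frame_count + duplicate_frames_prevented`. `duplicate_share` is a hypothetical helper.

```python
import json

# Sample output copied from the stats example above.
raw = """
{
  "frame_count": 1250,
  "unique_content_hashes": 1248,
  "duplicate_frames_prevented": 127,
  "has_sketch_track": true,
  "sketch_variant": "medium"
}
"""


def duplicate_share(stats: dict) -> float:
    """Fraction of attempted puts that were rejected as duplicates."""
    stored = stats["frame_count"]
    prevented = stats["duplicate_frames_prevented"]
    attempted = stored + prevented  # assumption: frame_count excludes rejected puts
    return prevented / attempted if attempted else 0.0


stats = json.loads(raw)
share = duplicate_share(stats)
print(f"{share:.1%} of puts were duplicates")
```

For the sample above this is 127 out of 1,377 attempted puts, or about 9.2%.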

When Duplicates Are Allowed

Some use cases require keeping duplicates:

Audit Trails

When you need to track every submission regardless of content:

# Add timestamp to make each entry unique
memvid put memory.mv2 --input report.pdf --timestamp "$(date -u +%s)"

# Python - unique URI bypasses dedup
import time
mem.put(
    text="Daily report content",
    uri=f"report-{int(time.time())}"
)

Versioning

Track document versions explicitly:

# Version in metadata distinguishes duplicates
memvid put memory.mv2 --input contract.pdf --metadata '{"version": "1.0"}'
memvid put memory.mv2 --input contract.pdf --metadata '{"version": "1.1"}'

Disabling Deduplication

For specific use cases where you want all content stored:

# Python SDK - force creation
frame_id = mem.put(
    text="Content that might be duplicate",
    skip_dedup=True  # Force new frame creation
)

Disabling deduplication can significantly increase storage usage. Only disable when you have a specific need to store duplicate content.

Deduplication Across Memories

Deduplication only works within a single .mv2 file. The same content in different memory files will be stored separately.

# These are independent - both will store the content
memvid put work.mv2 --input document.pdf
memvid put personal.mv2 --input document.pdf

Performance Impact

Operation	With Dedup	Without Dedup
Single put	+2ms	Baseline
Batch put (1000)	+50ms total	Baseline
Storage (duplicates)	0 bytes	Full size

The overhead is minimal and the storage savings are typically significant - especially for:

Chat logs with repeated messages
Documentation with boilerplate sections
Logs with repeated patterns
Meeting notes with agenda templates

Best Practices

For Most Use Cases

Let deduplication work automatically:

# Just add content - dedup handles the rest
memvid put memory.mv2 --input ./documents/

For Large Collections

Build sketch indices for faster dedup checking:

# After initial bulk import
memvid sketch build memory.mv2 --variant medium

# Future puts will be faster
memvid put memory.mv2 --input ./new-documents/

For Audit Requirements

Use unique identifiers when duplicates matter:

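# Each entry gets a unique URI
import time
from uuid import uuid4

timestamp = int(time.time())
mem.put(
    text=log_entry,  # log_entry: the text of the entry being recorded
    uri=f"log/{timestamp}/{uuid4()}"
)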

Troubleshooting

“Why isn’t my duplicate being detected?”

Content differs slightly: Check for hidden whitespace, encoding differences
Different metadata: URI or timestamp makes entries unique
Sketch not built: For large memories, build sketch index

# Check if content hashes match
memvid view memory.mv2 --frame-id frame_001 --json | jq '.content_hash'
memvid view memory.mv2 --frame-id frame_002 --json | jq '.content_hash'
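
Hidden differences are easy to reproduce. The Python sketch below hashes two strings that look the same on screen but differ in Unicode composition and trailing whitespace. `hashlib.blake2b` stands in for BLAKE3, and the `normalize` helper is a hypothetical preprocessing step you could apply before calling put, not something Memvid is documented to do.

```python
import hashlib
import unicodedata


def content_hash(text: str) -> str:
    """Hex digest of the UTF-8 bytes (blake2b as a stand-in for BLAKE3)."""
    return hashlib.blake2b(text.encode("utf-8"), digest_size=32).hexdigest()


def normalize(text: str) -> str:
    """Hypothetical cleanup: NFC-normalize and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


composed = "caf\u00e9 menu"        # accented "e" as a single code point
decomposed = "cafe\u0301 menu "    # "e" + combining accent, plus a trailing space

# Raw strings hash differently; normalized strings hash the same.
print(content_hash(composed) == content_hash(decomposed))
print(content_hash(normalize(composed)) == content_hash(normalize(decomposed)))
```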

“Why was my unique content marked as duplicate?”

SimHash can have false positives for very short content or content with similar structure:

# Very short content may collide
echo "yes" | memvid put memory.mv2 --input -
echo "no" | memvid put memory.mv2 --input -  # Might be seen as near-duplicate

Solution: Add distinguishing context or use unique URIs.

Next Steps

Adaptive Retrieval: Automatically determine optimal result counts
Indices & Tracks: Understand how content is indexed
