Memvid automatically prevents duplicate content from bloating your memory files using two complementary techniques: content hashing for exact duplicates and SimHash for near-duplicates.

How Deduplication Works

When you add content to a memory, Memvid performs two checks:
| Check | Algorithm | Catches |
|-------|-----------|---------|
| Exact | BLAKE3 hash | Identical content |
| Near | SimHash (64-bit LSH) | Similar content with minor variations |
Both checks happen automatically during put operations with no configuration required.

Exact Deduplication

Every frame stores a BLAKE3 content hash. When you add new content:
  1. Hash is computed for the new content
  2. Hash is checked against existing frames
  3. If match found, the existing frame ID is returned
  4. No duplicate frame is created
# First put - creates new frame
memvid put memory.mv2 --input document.pdf
# Output: Created frame_abc123

# Second put of same file - returns existing frame
memvid put memory.mv2 --input document.pdf
# Output: Duplicate detected, returning existing frame_abc123
# Python SDK
frame_id_1 = mem.put("The quick brown fox")
frame_id_2 = mem.put("The quick brown fox")  # Same content

assert frame_id_1 == frame_id_2  # True - no duplicate created
// Node.js SDK
const id1 = await mem.put({ content: "The quick brown fox" })
const id2 = await mem.put({ content: "The quick brown fox" })

console.log(id1 === id2)  // true
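
The four steps listed above can be sketched as a toy in-memory store. This is only an illustration of hash-based exact deduplication, not Memvid's implementation: the `TinyStore` class and its frame ID format are hypothetical, and `hashlib.blake2b` from the standard library stands in for BLAKE3, which is not part of the Python standard library.

```python
import hashlib


class TinyStore:
    """Toy in-memory store illustrating exact deduplication by content hash."""

    def __init__(self):
        self._frames = {}   # frame ID -> content bytes
        self._by_hash = {}  # content hash -> frame ID

    def put(self, content: str) -> str:
        data = content.encode("utf-8")
        # Step 1: hash the new content (BLAKE2b stands in for BLAKE3 here).
        digest = hashlib.blake2b(data, digest_size=32).hexdigest()
        # Steps 2-3: if the hash is already known, return the existing frame ID.
        existing = self._by_hash.get(digest)
        if existing is not None:
            return existing
        # Otherwise create a new frame and remember its hash.
        frame_id = f"frame_{len(self._frames) + 1:03d}"
        self._frames[frame_id] = data
        self._by_hash[digest] = frame_id
        return frame_id


store = TinyStore()
first = store.put("The quick brown fox")
second = store.put("The quick brown fox")   # identical content
third = store.put("The quick brown fox!")   # one character differs
print(first, second, third)  # frame_001 frame_001 frame_002
```

Because the lookup is keyed on the hash of the raw bytes, a single changed character produces a different digest and therefore a new frame. Catching that kind of variation is the job of near-duplicate detection.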

SimHash (Near-Duplicate Detection)

SimHash is a locality-sensitive hashing algorithm that detects near-duplicate content, that is, documents that are almost identical but differ in minor ways such as:
  • Whitespace changes
  • Punctuation variations
  • Minor edits or typos
  • Reformatted text

How SimHash Works

  1. Tokenize: Break content into word n-grams (shingles)
  2. Hash shingles: Each shingle gets a 64-bit hash
  3. Combine: Weighted combination produces final 64-bit fingerprint
  4. Compare: Hamming distance measures similarity
Two documents are considered near-duplicates if their SimHash fingerprints differ by fewer than 32 bits (out of 64).
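
The four steps above can be sketched in a few lines of Python. This is a minimal, unweighted SimHash under stated assumptions, not Memvid's actual code: the shingle size (2 words), the normalization (lowercasing and stripping punctuation), and the per-shingle hash (`hashlib.blake2b` truncated to 64 bits) are all choices made for illustration, and every shingle gets a weight of 1.

```python
import hashlib


def shingles(text: str, n: int = 2) -> list[str]:
    """Lowercase, strip punctuation, split on whitespace, and form word n-grams."""
    words = ["".join(ch for ch in w if ch.isalnum()) for w in text.lower().split()]
    words = [w for w in words if w]
    if len(words) < n:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def hash64(token: str) -> int:
    """64-bit hash of a shingle (BLAKE2b is an arbitrary stand-in)."""
    raw = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(raw, "big")


def simhash(text: str, n: int = 2) -> int:
    """Combine shingle hashes into one 64-bit fingerprint by per-bit voting."""
    votes = [0] * 64
    for token in shingles(text, n):
        h = hash64(token)
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if votes[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


a = simhash("The quick brown fox jumps over the lazy dog.")
b = simhash("The quick brown fox  jumps over the lazy dog")
print(hamming(a, b))  # 0: both texts normalize to the same shingles
```

In this sketch, punctuation and whitespace differences vanish during tokenization, so the two fingerprints are identical. A small edit such as a typo changes only the few shingles that contain the edited word, so most per-bit votes are unaffected.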

Hamming Distance Thresholds

| Distance | Similarity | Classification |
|----------|------------|----------------|
| 0-10 bits | 85-100% | Near-identical |
| 11-20 bits | 70-85% | Very similar |
| 21-31 bits | 50-70% | Somewhat similar |
| 32+ bits | < 50% | Different documents |
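
As a small illustration, the bands in the table above translate directly into a lookup function. The function names `classify` and `is_near_duplicate` are hypothetical helpers for this sketch and are not part of the Memvid SDK.

```python
def classify(distance: int) -> str:
    """Map a 64-bit Hamming distance to the bands in the table above."""
    if not 0 <= distance <= 64:
        raise ValueError("distance must be between 0 and 64")
    if distance <= 10:
        return "Near-identical"
    if distance <= 20:
        return "Very similar"
    if distance <= 31:
        return "Somewhat similar"
    return "Different documents"


def is_near_duplicate(distance: int) -> bool:
    """Near-duplicate rule from the text: fewer than 32 differing bits."""
    return distance < 32


for d in (0, 10, 11, 31, 32):
    print(d, classify(d), is_near_duplicate(d))
```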

Example: Near-Duplicate Detection

# Original document
echo "The quick brown fox jumps over the lazy dog." | memvid put memory.mv2 --input -
# Output: Created frame_001

# Minor variation (trailing period removed)
echo "The quick brown fox jumps over the lazy dog" | memvid put memory.mv2 --input -
# Output: Near-duplicate of frame_001 detected, skipping

# Different document (distance is at or above the 32-bit threshold)
echo "A slow red cat sleeps under the busy cat." | memvid put memory.mv2 --input -
# Output: Created frame_002

Sketch Track (Fast Pre-filtering)

For large memories (10k+ frames), Memvid uses sketch tracks to accelerate duplicate detection. Sketches are compact fingerprints that enable sub-millisecond candidate filtering.

Sketch Variants

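| Variant | Size | Speed | Accuracy | Best For |
|---------|------|-------|----------|----------|
| small | 32 bytes | Fastest | Good | < 50k frames |
| medium | 64 bytes | Fast | Better | 50k-200k frames |
| large | 96 bytes | Moderate | Best | 200k+ frames |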

Building Sketches

# Build sketch index (recommended for large memories)
memvid sketch build memory.mv2 --variant medium

# Check sketch stats
memvid sketch info memory.mv2
Output:
Sketch Track Info
  Variant: medium (64 bytes)
  Frames indexed: 45,230
  Index size: 2.9 MB
  Avg lookup time: 0.3ms
Without sketches:
  1. Compare query against all 45,230 frames
  2. Full SimHash comparison for each
  3. ~450ms total
With sketches:
  1. Compare query sketch against sketch index
  2. Get ~100 candidates in 0.3ms
  3. Full comparison only on candidates
  4. ~5ms total (90x faster)
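
The two-stage pattern described above (cheap candidate filtering, then full comparison on the candidates only) can be illustrated with a generic banded index over 64-bit fingerprints. This is a sketch of the general idea only: Memvid's sketch format is not documented here, and the `BandIndex` class, the 4 x 16-bit banding, and the frame IDs are assumptions made for the example. Banding of this kind guarantees that a match is found only when the fingerprints differ in fewer bits than there are bands (here, at most 3), so a real pre-filter for looser thresholds would need a different scheme.

```python
from collections import defaultdict

BANDS = 4
BAND_BITS = 16


def bands(fp: int) -> list[tuple[int, int]]:
    """Split a 64-bit fingerprint into four (band index, 16-bit value) keys."""
    mask = (1 << BAND_BITS) - 1
    return [(i, (fp >> (i * BAND_BITS)) & mask) for i in range(BANDS)]


class BandIndex:
    """Hypothetical pre-filter: frames sharing any band with the query are candidates."""

    def __init__(self):
        self._buckets = defaultdict(set)  # band key -> frame IDs
        self._fingerprints = {}           # frame ID -> full fingerprint

    def add(self, frame_id: str, fp: int) -> None:
        self._fingerprints[frame_id] = fp
        for key in bands(fp):
            self._buckets[key].add(frame_id)

    def candidates(self, fp: int) -> set[str]:
        """Stage 1: a few bucket lookups instead of scanning every frame."""
        found = set()
        for key in bands(fp):
            found |= self._buckets.get(key, set())
        return found

    def near(self, fp: int, max_distance: int) -> list[str]:
        """Stage 2: full Hamming comparison on the candidates only."""
        return sorted(
            fid for fid in self.candidates(fp)
            if bin(self._fingerprints[fid] ^ fp).count("1") <= max_distance
        )


index = BandIndex()
index.add("frame_001", 0x0123456789ABCDEF)
index.add("frame_002", 0xFEDCBA9876543210)
query = 0x0123456789ABCDEF ^ 0b101  # flip two low bits of frame_001's fingerprint
print(index.near(query, max_distance=3))  # ['frame_001']
```

The query never touches `frame_002` because none of its bands match, which is where the speedup over a full scan comes from.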

Deduplication Statistics

Check deduplication stats for your memory:
memvid stats memory.mv2 --json
{
  "frame_count": 1250,
  "unique_content_hashes": 1248,
  "duplicate_frames_prevented": 127,
  "has_sketch_track": true,
  "sketch_variant": "medium"
}
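
As a rough sketch, the JSON above can be post-processed to estimate how many put calls were duplicates. The field names come from the sample output. The interpretation that each prevented duplicate corresponds to exactly one put call is an assumption made for this example.

```python
import json

# Sample output copied from the example above.
raw = """
{
  "frame_count": 1250,
  "unique_content_hashes": 1248,
  "duplicate_frames_prevented": 127,
  "has_sketch_track": true,
  "sketch_variant": "medium"
}
"""

stats = json.loads(raw)
stored = stats["frame_count"]
prevented = stats["duplicate_frames_prevented"]

# Assumption: every prevented duplicate corresponds to one put call.
attempted = stored + prevented
rate = prevented / attempted
print(f"{prevented} of {attempted} puts were duplicates ({rate:.1%})")
# 127 of 1377 puts were duplicates (9.2%)
```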

When Duplicates Are Allowed

Some use cases require keeping duplicates:

Audit Trails

When you need to track every submission regardless of content:
# Add timestamp to make each entry unique
memvid put memory.mv2 --input report.pdf --timestamp "$(date -u +%s)"
# Python - unique URI bypasses dedup
import time
mem.put(
    text="Daily report content",
    uri=f"report-{int(time.time())}"
)

Versioning

Track document versions explicitly:
# Version in metadata distinguishes duplicates
memvid put memory.mv2 --input contract.pdf --metadata '{"version": "1.0"}'
memvid put memory.mv2 --input contract.pdf --metadata '{"version": "1.1"}'

Disabling Deduplication

For specific use cases where you want all content stored:
# Python SDK - force creation
frame_id = mem.put(
    text="Content that might be duplicate",
    skip_dedup=True  # Force new frame creation
)
Disabling deduplication can significantly increase storage usage. Only disable when you have a specific need to store duplicate content.

Deduplication Across Memories

Deduplication only works within a single .mv2 file. The same content in different memory files will be stored separately.
# These are independent - both will store the content
memvid put work.mv2 --input document.pdf
memvid put personal.mv2 --input document.pdf

Performance Impact

| Operation | With Dedup | Without Dedup |
|-----------|------------|---------------|
| Single put | +2 ms | Baseline |
| Batch put (1000) | +50 ms total | Baseline |
| Storage (duplicates) | 0 bytes | Full size |
The overhead is small, and the storage savings are typically significant, especially for:
  • Chat logs with repeated messages
  • Documentation with boilerplate sections
  • Logs with repeated patterns
  • Meeting notes with agenda templates

Best Practices

For Most Use Cases

Let deduplication work automatically:
# Just add content - dedup handles the rest
memvid put memory.mv2 --input ./documents/

For Large Collections

Build sketch indices for faster dedup checking:
# After initial bulk import
memvid sketch build memory.mv2 --variant medium

# Future puts will be faster
memvid put memory.mv2 --input ./new-documents/

For Audit Requirements

Use unique identifiers when duplicates matter:
# Each entry gets a unique URI
import time
from uuid import uuid4

timestamp = int(time.time())
mem.put(
    text=log_entry,
    uri=f"log/{timestamp}/{uuid4()}"
)

Troubleshooting

“Why isn’t my duplicate being detected?”

  1. Content differs slightly: Check for hidden whitespace, encoding differences
  2. Different metadata: URI or timestamp makes entries unique
  3. Sketch not built: For large memories, build sketch index
# Check if content hashes match
memvid view memory.mv2 --frame-id frame_001 --json | jq '.content_hash'
memvid view memory.mv2 --frame-id frame_002 --json | jq '.content_hash'
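
When two hashes differ for text that looks identical, hidden whitespace or Unicode encoding differences are a common cause. The following sketch shows one way to check for them locally. It uses `hashlib.blake2b` as a stand-in for BLAKE3, and the `normalize` helper is a hypothetical cleanup step, not something Memvid is known to apply.

```python
import hashlib
import unicodedata


def digest(text: str) -> str:
    """Hex digest of the UTF-8 bytes (BLAKE2b as a stand-in for BLAKE3)."""
    return hashlib.blake2b(text.encode("utf-8"), digest_size=16).hexdigest()


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


a = "Caf\u00e9 menu\n"   # precomposed e-acute, trailing newline
b = "Cafe\u0301  menu"   # e + combining accent, double space

print(digest(a) == digest(b))                        # False: the bytes differ
print(digest(normalize(a)) == digest(normalize(b)))  # True: same text after cleanup
print(ascii(a), ascii(b))                            # ascii() exposes hidden characters
```

If normalizing both inputs makes the digests match, the content differs only in encoding or whitespace, and cleaning the text before calling put should let exact deduplication catch it.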

“Why was my unique content marked as duplicate?”

SimHash can have false positives for very short content or content with similar structure:
# Very short content may collide
echo "yes" | memvid put memory.mv2 --input -
echo "no" | memvid put memory.mv2 --input -  # Might be seen as near-duplicate
Solution: Add distinguishing context or use unique URIs.
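
A back-of-the-envelope calculation suggests why very short inputs are at risk. A one-word input yields a single shingle, so its fingerprint is effectively just that shingle's hash. Assuming (as a simplification) that two unrelated 64-bit hashes differ in each bit independently with probability 1/2, the chance that they land under the 32-bit threshold is:

```python
from math import comb

BITS = 64
THRESHOLD = 32  # near-duplicate if fewer than this many bits differ

# Probability that two unrelated 64-bit hashes differ in fewer than 32 bits,
# assuming each bit differs independently with probability 1/2.
p_collision = sum(comb(BITS, k) for k in range(THRESHOLD)) / 2**BITS
print(f"{p_collision:.3f}")  # 0.450
```

Under that simplified model, nearly half of all unrelated single-shingle pairs would fall below the threshold. Longer content has more shingles in common or not in common, so its fingerprints separate more reliably. This is why adding distinguishing context helps.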
