How Memvid automatically detects and prevents duplicate content
Memvid automatically prevents duplicate content from bloating your memory files using two complementary techniques: content hashing for exact duplicates and SimHash for near-duplicates.
Every frame stores a BLAKE3 content hash. When you add new content:
1. A hash is computed for the new content.
2. The hash is checked against existing frames.
3. If a match is found, the existing frame ID is returned.
4. No duplicate frame is created.
# First put - creates new frame
memvid put memory.mv2 --input document.pdf
# Output: Created frame_abc123

# Second put of same file - returns existing frame
memvid put memory.mv2 --input document.pdf
# Output: Duplicate detected, returning existing frame_abc123
# Python SDK
frame_id_1 = mem.put("The quick brown fox")
frame_id_2 = mem.put("The quick brown fox")  # Same content
assert frame_id_1 == frame_id_2  # True - no duplicate created
// Node.js SDK
const id1 = await mem.put({ content: "The quick brown fox" })
const id2 = await mem.put({ content: "The quick brown fox" })
console.log(id1 === id2) // true
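The exact-duplicate check described above can be sketched in a few lines of Python. This is a minimal illustration, not Memvid's implementation: the `FrameStore` class, its `put` signature, and the frame ID format are hypothetical, and `hashlib.blake2b` stands in for BLAKE3, which is not part of the Python standard library.

```python
import hashlib
from typing import Tuple


class FrameStore:
    """Toy in-memory store that deduplicates frames by content hash.

    Hypothetical illustration only; not Memvid's API.
    """

    def __init__(self):
        self._by_hash = {}  # content hash (hex) -> frame ID
        self._next_id = 1

    def put(self, content: bytes) -> Tuple[str, bool]:
        """Return (frame_id, created); created is False for an exact duplicate."""
        # Stand-in digest: Memvid stores BLAKE3, hashlib only ships BLAKE2.
        digest = hashlib.blake2b(content, digest_size=32).hexdigest()
        existing = self._by_hash.get(digest)
        if existing is not None:
            # Match found: hand back the existing frame, create nothing.
            return existing, False
        frame_id = f"frame_{self._next_id:03d}"
        self._next_id += 1
        self._by_hash[digest] = frame_id
        return frame_id, True


store = FrameStore()
print(store.put(b"The quick brown fox"))  # ('frame_001', True)
print(store.put(b"The quick brown fox"))  # ('frame_001', False)
```

Because the lookup is keyed on the digest, the cost of the check does not depend on how many frames are already stored. A single changed byte yields a different digest, which is why exact hashing alone cannot catch near-duplicates.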
SimHash is a locality-sensitive hashing algorithm that detects near-duplicate content: documents that are almost identical but differ in minor ways, such as punctuation or whitespace.
# Original document
echo "The quick brown fox jumps over the lazy dog." | memvid put memory.mv2 --input -
# Output: Created frame_001

# Minor variation (punctuation + whitespace)
echo "The quick brown fox jumps over the lazy dog" | memvid put memory.mv2 --input -
# Output: Near-duplicate of frame_001 detected, skipping

# Different document (passes threshold)
echo "A slow red cat sleeps under the busy cat." | memvid put memory.mv2 --input -
# Output: Created frame_002
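To make the idea concrete, here is a small Python sketch of a textbook 64-bit SimHash over word tokens, compared by Hamming distance. It is an assumption-laden illustration rather than Memvid's code: the tokenizer, the use of `hashlib.blake2b` as the per-token hash, and the `max_distance=3` threshold are all hypothetical choices.

```python
import hashlib
import re
from collections import Counter

BITS = 64


def tokens(text):
    # Lowercase and keep only word characters, so punctuation and
    # whitespace differences do not change the token list.
    return re.findall(r"\w+", text.lower())


def simhash(text):
    """64-bit SimHash over word tokens, weighted by token frequency."""
    votes = [0] * BITS
    for token, weight in Counter(tokens(text)).items():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each token votes +weight on bits it sets and -weight on the rest.
        for bit in range(BITS):
            votes[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit in range(BITS):
        if votes[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a, b, max_distance=3):
    # max_distance is a hypothetical threshold, not a documented Memvid setting.
    return hamming(simhash(a), simhash(b)) <= max_distance
```

In this sketch the punctuation and whitespace variant from the example normalizes to the same token list, so both texts get the same fingerprint. Texts that share most but not all tokens tend to differ in only a few bits, while unrelated texts usually differ in many.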
For large memories (10k+ frames), Memvid uses sketch tracks to accelerate duplicate detection. Sketches are compact fingerprints that enable sub-millisecond candidate filtering.
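The source does not describe how sketch tracks work internally. As one possible way to get fast candidate filtering, the Python sketch below indexes 64-bit fingerprints by four 16-bit bands, a common technique for Hamming-distance search. The `BandIndex` class and its parameters are hypothetical and should not be read as Memvid's design.

```python
from collections import defaultdict

BANDS = 4
BAND_BITS = 16  # 4 bands x 16 bits = one 64-bit fingerprint


def band_keys(fingerprint):
    """Split a 64-bit fingerprint into (band index, band value) pairs."""
    mask = (1 << BAND_BITS) - 1
    return [(i, (fingerprint >> (i * BAND_BITS)) & mask) for i in range(BANDS)]


class BandIndex:
    """Hypothetical candidate filter; not Memvid's sketch-track format."""

    def __init__(self):
        self._buckets = defaultdict(set)  # (band index, band value) -> frame IDs
        self._fingerprints = {}           # frame ID -> full fingerprint

    def add(self, frame_id, fingerprint):
        self._fingerprints[frame_id] = fingerprint
        for key in band_keys(fingerprint):
            self._buckets[key].add(frame_id)

    def candidates(self, fingerprint):
        """Frames that share at least one identical band with the probe."""
        found = set()
        for key in band_keys(fingerprint):
            found |= self._buckets.get(key, set())
        return found

    def near_duplicates(self, fingerprint, max_distance=3):
        # The filter is only exhaustive while max_distance < BANDS: flipping
        # at most 3 bits leaves at least one of the 4 bands untouched.
        return sorted(
            fid
            for fid in self.candidates(fingerprint)
            if bin(self._fingerprints[fid] ^ fingerprint).count("1") <= max_distance
        )
```

Only the frames returned by `candidates` need a full Hamming-distance comparison, so a lookup touches four buckets instead of scanning every stored fingerprint.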
# Version in metadata distinguishes duplicates
memvid put memory.mv2 --input contract.pdf --metadata '{"version": "1.0"}'
memvid put memory.mv2 --input contract.pdf --metadata '{"version": "1.1"}'
SimHash can have false positives for very short content or content with similar structure:
# Very short content may collide
echo "yes" | memvid put memory.mv2 --input -
echo "no" | memvid put memory.mv2 --input -  # Might be seen as near-duplicate
Solution: Add distinguishing context or use unique URIs.