Entity Extraction (Logic Mesh)

Memvid supports Named Entity Recognition (NER) for extracting structured entities from documents. This enables building knowledge graphs, entity-based search, and relationship mapping, turning unstructured text into structured knowledge.

Overview

Entity extraction identifies and classifies named entities in text:

People: Names of individuals (CEOs, executives, authors)
Organizations: Companies, institutions, agencies
Locations: Cities, countries, addresses
Dates: Temporal references, deadlines, events
Money: Currency amounts, valuations, prices
Custom types: Domain-specific entities (products, deals, regulations)

Provider	Model	Entity Types	Best For
Local	DistilBERT-NER	PERSON, ORG, LOCATION, MISC	Offline, privacy-first
OpenAI	GPT-4o-mini	Custom	High accuracy, custom entities
OpenAI	GPT-4o	Custom	Best quality
Claude	Claude 3.5 Sonnet	Custom	Nuanced extraction
Gemini	Gemini 2.0 Flash	Custom	Fast, cost-effective

Quick Start

Python SDK

from memvid_sdk import create
from memvid_sdk.entities import get_entity_extractor

# Initialize entity extractor
ner = get_entity_extractor('openai', entity_types=['COMPANY', 'PERSON', 'LOCATION', 'MONEY', 'DATE'])
print(f"Provider: {ner.name}")
print(f"Entity types: {ner.entity_types}")

# Extract entities from text
text = """
Microsoft CEO Satya Nadella announced a $50 million investment in Seattle.
The deal closes December 2024 with Pinnacle Financial as lead investor.
"""

entities = ner.extract(text, min_confidence=0.5)
for entity in entities:
    print(f"  {entity['name']} ({entity['type']}, {entity['confidence']:.2f})")

# Example output (confidence scores are illustrative):
#   Microsoft (COMPANY, 0.95)
#   Satya Nadella (PERSON, 0.97)
#   $50 million (MONEY, 0.95)
#   Seattle (LOCATION, 0.90)
#   December 2024 (DATE, 0.88)
#   Pinnacle Financial (COMPANY, 0.92)
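The extractor returns a flat list, so downstream code often needs the names grouped by type. The sketch below is plain Python over the dictionary shape shown in the output above. `group_by_type` is a hypothetical helper, not part of the Memvid SDK, and the sample list is copied from that output rather than produced by a live call.

```python
from collections import defaultdict

def group_by_type(entities, min_confidence=0.0):
    """Group entity names by type, keeping entries at or above min_confidence."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["confidence"] >= min_confidence:
            grouped[entity["type"]].append(entity["name"])
    return dict(grouped)

# Sample data mirroring the Quick Start output (not a live API call)
sample = [
    {"name": "Microsoft", "type": "COMPANY", "confidence": 0.95},
    {"name": "Satya Nadella", "type": "PERSON", "confidence": 0.97},
    {"name": "$50 million", "type": "MONEY", "confidence": 0.95},
    {"name": "Seattle", "type": "LOCATION", "confidence": 0.90},
    {"name": "December 2024", "type": "DATE", "confidence": 0.88},
    {"name": "Pinnacle Financial", "type": "COMPANY", "confidence": 0.92},
]

grouped = group_by_type(sample, min_confidence=0.9)
print(grouped["COMPANY"])
```

With a threshold of 0.9 the DATE entry (0.88) is dropped, while both companies are kept in their original order.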

Node.js SDK

import { create, getEntityExtractor } from '@memvid/sdk';

// Initialize entity extractor
const ner = getEntityExtractor('openai', {
  entityTypes: ['COMPANY', 'PERSON', 'LOCATION', 'MONEY', 'DATE'],
});
console.log(`Provider: ${ner.name}`);
console.log(`Entity types: ${ner.entityTypes}`);

// Extract entities from text
const text = `
Microsoft CEO Satya Nadella announced a $50 million investment in Seattle.
The deal closes December 2024 with Pinnacle Financial as lead investor.
`;

const entities = await ner.extract(text, 0.5);
for (const entity of entities) {
  console.log(`  ${entity.name} (${entity.type}, ${entity.confidence.toFixed(2)})`);
}

Providers

Local NER (DistilBERT)

The default provider uses DistilBERT-NER, a lightweight model for offline entity extraction. Characteristics:

Model: DistilBERT fine-tuned on CoNLL-2003
Size: ~261 MB (downloaded on first use)
Entity types: PERSON, ORG, LOCATION, MISC (fixed)
Inference: CPU-based, no GPU required
Privacy: All processing happens locally

from memvid_sdk.entities import get_entity_extractor, LocalNER

# Using factory
ner = get_entity_extractor('local')

# Or direct instantiation
ner = LocalNER(model='distilbert-ner')

# Extract entities
entities = ner.extract("Apple CEO Tim Cook visited Paris headquarters.")
# [
#   {'name': 'Apple', 'type': 'ORG', 'confidence': 0.98},
#   {'name': 'Tim Cook', 'type': 'PERSON', 'confidence': 0.97},
#   {'name': 'Paris', 'type': 'LOCATION', 'confidence': 0.95},
# ]

import { getEntityExtractor, LocalNER } from '@memvid/sdk';

const ner = getEntityExtractor('local');
const entities = await ner.extract('Apple CEO Tim Cook visited Paris headquarters.');

Local NER uses fixed entity types (PERSON, ORG, LOCATION, MISC). For custom entity types, use cloud providers. In Node.js, LocalNER requires a native build that exports NerModel (the prebuilt npm binaries may not include it).
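Because the local model only knows its four fixed labels, one way to pick a provider is to check whether every requested type falls inside that set. The following is a minimal sketch of that decision. `choose_provider` and `LOCAL_TYPES` are hypothetical names introduced for illustration, and the default cloud fallback of `'openai'` is an assumption.

```python
# Fixed label set of the local model, as listed in this section
LOCAL_TYPES = {"PERSON", "ORG", "LOCATION", "MISC"}

def choose_provider(entity_types, cloud_provider="openai"):
    """Return 'local' when every requested type is a fixed local type,
    otherwise fall back to the given cloud provider."""
    requested = {t.upper() for t in entity_types}
    if requested <= LOCAL_TYPES:  # subset test; an empty request also fits
        return "local"
    return cloud_provider

print(choose_provider(["PERSON", "ORG"]))
print(choose_provider(["PERSON", "DEAL_TYPE"]))
```

The returned string is meant to be passed as the first argument to `get_entity_extractor`.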

OpenAI Entities

OpenAI’s models provide high-accuracy extraction with custom entity types. Setup:

export OPENAI_API_KEY=sk-your-key-here

Usage:

from memvid_sdk.entities import get_entity_extractor, OpenAIEntities

# Using factory with custom entity types
ner = get_entity_extractor('openai', entity_types=[
    'COMPANY',
    'PERSON',
    'LOCATION',
    'MONEY',
    'DATE',
    'PRODUCT',
    'DEAL_TYPE',
])

# Or with specific model
ner = get_entity_extractor('openai:gpt-4o-mini', entity_types=['COMPANY', 'PERSON'])

# Direct instantiation
ner = OpenAIEntities(
    model='gpt-4o-mini',
    entity_types=['COMPANY', 'EXECUTIVE', 'PRODUCT'],
)

# Extract entities
entities = ner.extract(text, min_confidence=0.5)

import { getEntityExtractor, OpenAIEntities } from '@memvid/sdk';

const ner = getEntityExtractor('openai', {
  entityTypes: ['COMPANY', 'PERSON', 'LOCATION', 'MONEY', 'DATE'],
});

// Or with a specific model (separate name, since a const cannot be redeclared)
const miniNer = getEntityExtractor('openai:gpt-4o-mini', {
  entityTypes: ['COMPANY', 'PERSON'],
});

const entities = await ner.extract(text, 0.5);

Model Comparison:

Model	Speed	Quality
`gpt-4o-mini`	Fast	Good
`gpt-4o`	Medium	Best
`gpt-4-turbo`	Medium	Excellent

Claude Entities

Anthropic’s Claude excels at nuanced entity extraction with context understanding. Setup:

export ANTHROPIC_API_KEY=your-key-here

Usage:

from memvid_sdk.entities import get_entity_extractor, ClaudeEntities

# Using factory
ner = get_entity_extractor('claude', entity_types=['COMPANY', 'PERSON', 'REGULATION'])

# With specific model
ner = get_entity_extractor('claude:claude-3-5-sonnet-20241022', entity_types=['COMPANY'])

# Direct instantiation
ner = ClaudeEntities(
    model='claude-3-5-sonnet-20241022',
    entity_types=['COMPANY', 'EXECUTIVE', 'DEAL'],
)

entities = ner.extract(text, min_confidence=0.6)

import { getEntityExtractor, ClaudeEntities } from '@memvid/sdk';

const ner = getEntityExtractor('claude', {
  entityTypes: ['COMPANY', 'PERSON', 'REGULATION'],
});

const entities = await ner.extract(text, 0.6);

Gemini Entities

Google’s Gemini provides fast, cost-effective entity extraction. Setup:

export GEMINI_API_KEY=your-key-here

Usage:

from memvid_sdk.entities import get_entity_extractor, GeminiEntities

# Using factory
ner = get_entity_extractor('gemini', entity_types=['COMPANY', 'PERSON'])

# With specific model
ner = get_entity_extractor('gemini:gemini-2.0-flash', entity_types=['COMPANY'])

entities = ner.extract(text, min_confidence=0.5)

import { getEntityExtractor, GeminiEntities } from '@memvid/sdk';

const ner = getEntityExtractor('gemini', {
  entityTypes: ['COMPANY', 'PERSON'],
});

const entities = await ner.extract(text, 0.5);

Complete Example

Here’s a full workflow for document entity extraction:

from pathlib import Path
from memvid_sdk import create
from memvid_sdk.entities import get_entity_extractor

# Configuration
PROVIDER = 'openai'
ENTITY_TYPES = ['COMPANY', 'PERSON', 'LOCATION', 'MONEY', 'DATE', 'DEAL_TYPE']
DATASET_DIR = Path('documents/')
OUTPUT_PATH = 'knowledge_base.mv2'

# Initialize
ner = get_entity_extractor(PROVIDER, entity_types=ENTITY_TYPES)
print(f"Entity Extractor: {ner.name}")
print(f"Entity Types: {', '.join(ner.entity_types)}")

# Create memory
if Path(OUTPUT_PATH).exists():
    Path(OUTPUT_PATH).unlink()

mem = create(OUTPUT_PATH)
mem.enable_lex()

# Process documents
all_entities = []
pdf_files = list(DATASET_DIR.glob('*.pdf'))

for i, pdf_path in enumerate(pdf_files):
    print(f"\n[{i+1}/{len(pdf_files)}] {pdf_path.name}")

    # Store document
    frame_id = mem.put(
        title=pdf_path.stem.replace('_', ' ').title(),
        label='document',
        metadata={},
        file=str(pdf_path),
    )
    print(f"    Stored as frame {frame_id}")

    # Extract entities (from document text or summary)
    document_text = f"Document: {pdf_path.stem}"  # Replace with actual text extraction
    entities = ner.extract(document_text, min_confidence=0.5)

    print(f"    Found {len(entities)} entities:")
    for e in entities[:4]:
        print(f"      - {e['name']} ({e['type']}, {e['confidence']:.2f})")

    all_entities.extend(entities)

# Entity statistics
print("\n--- Entity Summary ---")
counts = {}
for e in all_entities:
    t = e.get('type', 'UNKNOWN')
    counts[t] = counts.get(t, 0) + 1

for entity_type, count in sorted(counts.items(), key=lambda x: -x[1]):
    print(f"  {entity_type}: {count}")

# Seal
mem.seal()
stats = mem.stats()
print(f"\nFinal: {stats.get('frame_count', 0)} frames, {len(all_entities)} entities")

Custom Entity Types

Cloud providers support custom entity types tailored to your domain:

Finance Domain

ner = get_entity_extractor('openai', entity_types=[
    'COMPANY',
    'INVESTOR',
    'FUND',
    'MONEY',
    'DEAL_TYPE',      # IPO, M&A, Series A
    'VALUATION',
    'EXECUTIVE',
    'DATE',
])

Legal Domain

ner = get_entity_extractor('claude', entity_types=[
    'PARTY',
    'COURT',
    'JUDGE',
    'CASE_NUMBER',
    'STATUTE',
    'DATE',
    'JURISDICTION',
])

Healthcare Domain

ner = get_entity_extractor('openai:gpt-4o', entity_types=[
    'PATIENT',
    'PROVIDER',
    'MEDICATION',
    'DIAGNOSIS',
    'PROCEDURE',
    'DATE',
    'FACILITY',
])

API Reference

EntityExtractor Interface

All entity extractors implement this interface:

Member	Description
`name`	Provider identifier (e.g., `openai:gpt-4o-mini`)
`entity_types`	List of supported entity types
`extract(text, min_confidence)`	Extract entities from text
`extract_batch(texts, min_confidence)`	Batch extract from multiple texts
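To make the interface concrete, here is a hedged sketch that expresses the four members as a `typing.Protocol` and implements them with a toy extractor. Everything in it is illustrative: `RegexMoneyExtractor`, its fixed confidence of 0.9, and the default `min_confidence=0.5` are assumptions, not the SDK's actual classes or defaults.

```python
import re
from typing import Dict, List, Protocol, Union

Entity = Dict[str, Union[str, float]]

class EntityExtractor(Protocol):
    """Structural type mirroring the members listed in the table above."""
    name: str
    entity_types: List[str]

    def extract(self, text: str, min_confidence: float = 0.5) -> List[Entity]: ...
    def extract_batch(self, texts: List[str], min_confidence: float = 0.5) -> List[List[Entity]]: ...

class RegexMoneyExtractor:
    """Toy stand-in: finds dollar amounts with a regex and a made-up confidence."""
    name = "toy:regex-money"
    entity_types = ["MONEY"]
    _pattern = re.compile(r"\$\d+(?:[.,]\d+)*(?:\s(?:million|billion))?")

    def extract(self, text, min_confidence=0.5):
        confidence = 0.9  # hypothetical constant; real providers score each entity
        if confidence < min_confidence:
            return []
        return [
            {"name": m.group(0), "type": "MONEY", "confidence": confidence}
            for m in self._pattern.finditer(text)
        ]

    def extract_batch(self, texts, min_confidence=0.5):
        # One result list per input text, in the same order
        return [self.extract(t, min_confidence) for t in texts]

ner: EntityExtractor = RegexMoneyExtractor()
results = ner.extract_batch(["A $50 million round.", "No amounts here."])
print(results)
```

Code written against the protocol can swap in any provider that exposes the same four members, which is also handy for tests that should not call a cloud API.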

Entity Object

Each extracted entity contains:

Field	Type	Description
`name`	string	Entity text as it appears
`type`	string	Entity classification
`confidence`	float	Confidence score (0.0-1.0)
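The three fields above can be captured as a typed dictionary, with a small runtime check for entities loaded from storage or another system. This is a sketch only: `Entity` and `is_valid_entity` are hypothetical names, and the SDK may expose its own types.

```python
from typing import Any, TypedDict

class Entity(TypedDict):
    name: str          # entity text as it appears
    type: str          # entity classification
    confidence: float  # score between 0.0 and 1.0

def is_valid_entity(obj: Any) -> bool:
    """Check that obj has the three documented fields with plausible values."""
    if not isinstance(obj, dict):
        return False
    name, etype, conf = obj.get("name"), obj.get("type"), obj.get("confidence")
    return (
        isinstance(name, str) and name != ""
        and isinstance(etype, str) and etype != ""
        and isinstance(conf, (int, float)) and not isinstance(conf, bool)
        and 0.0 <= conf <= 1.0
    )

print(is_valid_entity({"name": "Apple", "type": "ORG", "confidence": 0.98}))
```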

Factory Function

# Python
from memvid_sdk.entities import get_entity_extractor

ner = get_entity_extractor(
    provider,           # 'local', 'openai', 'claude', 'gemini', 'openai:model-name'
    entity_types=None,  # Custom entity types (cloud providers only)
    api_key=None,       # Override env var
)

// Node.js
import { getEntityExtractor } from '@memvid/sdk';

const ner = getEntityExtractor(provider, {
  entityTypes: ['COMPANY', 'PERSON'],  // Custom entity types
  apiKey: undefined,                    // Override env var
});

Environment Variables

Variable	Description
`OPENAI_API_KEY`	OpenAI API key
`ANTHROPIC_API_KEY`	Anthropic API key for Claude
`GEMINI_API_KEY`	Google AI API key for Gemini
`MEMVID_MODELS_DIR`	Local model cache directory
`MEMVID_OFFLINE=1`	Skip model downloads (local NER)
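A quick preflight check can report which providers are usable before any extraction runs. The sketch below only inspects the API key variables from the table. `available_providers` is a hypothetical helper, and it assumes the local provider needs no key (it does not verify that the local model is installed or cached).

```python
import os

# Environment variable each cloud provider reads, per the table above
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "claude": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}

def available_providers(env=None):
    """Return providers usable with the given environment mapping.

    'local' is always listed because it needs no API key."""
    env = os.environ if env is None else env
    ready = ["local"]
    for provider, var in PROVIDER_ENV_VARS.items():
        if env.get(var):  # unset or empty values count as missing
            ready.append(provider)
    return ready

print(available_providers())
```

Passing a plain dictionary instead of relying on `os.environ` keeps the function easy to test.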

Use Cases

Document Intelligence

Extract structured data from unstructured documents:

# Process legal contracts
ner = get_entity_extractor('claude', entity_types=[
    'PARTY', 'DATE', 'MONEY', 'TERM', 'JURISDICTION'
])

contract_text = "Agreement between Acme Corp and Beta Inc dated January 15, 2024..."
entities = ner.extract(contract_text)

# Build structured contract summary
parties = [e['name'] for e in entities if e['type'] == 'PARTY']
dates = [e['name'] for e in entities if e['type'] == 'DATE']

Knowledge Graph Building

Create entity-relationship graphs from documents:

# Extract entities from multiple documents
all_entities = []
for doc in documents:
    entities = ner.extract(doc.text)
    for e in entities:
        e['source_doc'] = doc.id
    all_entities.extend(entities)

# Build co-occurrence graph
from collections import defaultdict
co_occurrences = defaultdict(int)
for doc_id in set(e['source_doc'] for e in all_entities):
    doc_entities = [e for e in all_entities if e['source_doc'] == doc_id]
    for i, e1 in enumerate(doc_entities):
        for e2 in doc_entities[i+1:]:
            pair = tuple(sorted([e1['name'], e2['name']]))
            co_occurrences[pair] += 1
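One caveat with the loop above: an entity mentioned twice in the same document is paired with itself and inflates the counts. A variant that first collapses each document to a set of names avoids this. The sketch below is self-contained and uses made-up sample data; `co_occurrence_counts` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations

def co_occurrence_counts(all_entities):
    """Count, for each pair of entity names, how many documents mention both."""
    names_by_doc = {}
    for e in all_entities:
        names_by_doc.setdefault(e["source_doc"], set()).add(e["name"])

    counts = Counter()
    for names in names_by_doc.values():
        # sorted() gives each pair a canonical order, so (A, B) and (B, A) match
        for pair in combinations(sorted(names), 2):
            counts[pair] += 1
    return counts

# Illustrative data only
sample = [
    {"name": "Acme Corp", "type": "COMPANY", "source_doc": "d1"},
    {"name": "Beta Inc", "type": "COMPANY", "source_doc": "d1"},
    {"name": "Acme Corp", "type": "COMPANY", "source_doc": "d1"},  # repeated mention
    {"name": "Acme Corp", "type": "COMPANY", "source_doc": "d2"},
    {"name": "Beta Inc", "type": "COMPANY", "source_doc": "d2"},
    {"name": "Seattle", "type": "LOCATION", "source_doc": "d2"},
]

counts = co_occurrence_counts(sample)
print(counts.most_common(1))
```

The resulting counter can serve as weighted edges for a graph library of your choice.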

Entity-Based Search

Find documents by the entities they mention:

# Store entities with documents
for doc in documents:
    entities = ner.extract(doc.text)

    frame_id = mem.put(
        title=doc.title,
        label='document',
        metadata={
            'entities': entities,
            'companies': [e['name'] for e in entities if e['type'] == 'COMPANY'],
            'people': [e['name'] for e in entities if e['type'] == 'PERSON'],
        },
        text=doc.text,
    )

# Search by entity
results = mem.find('Microsoft', k=10)
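When an exact lookup by entity is needed rather than a text search, a small inverted index kept alongside the memory is one option. The following sketch is independent of the Memvid API: `build_entity_index` and `docs_mentioning` are hypothetical helpers, and the document ids and entities are made up.

```python
from collections import defaultdict

def build_entity_index(doc_entities):
    """Map (type, lowercased name) to the set of document ids mentioning it."""
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for e in entities:
            index[(e["type"], e["name"].lower())].add(doc_id)
    return index

def docs_mentioning(index, entity_type, name):
    """Return sorted document ids for an exact, case-insensitive entity match."""
    return sorted(index.get((entity_type, name.lower()), set()))

# Illustrative data only
doc_entities = {
    "memo-1": [{"name": "Microsoft", "type": "COMPANY", "confidence": 0.95}],
    "memo-2": [
        {"name": "microsoft", "type": "COMPANY", "confidence": 0.91},
        {"name": "Satya Nadella", "type": "PERSON", "confidence": 0.97},
    ],
}

index = build_entity_index(doc_entities)
print(docs_mentioning(index, "COMPANY", "Microsoft"))
```

Keying on both type and name keeps, for example, a PERSON and a COMPANY with the same name apart.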

Deal Memo Analysis

Extract structured deal information:

ner = get_entity_extractor('openai', entity_types=[
    'COMPANY', 'INVESTOR', 'MONEY', 'DEAL_TYPE', 'DATE', 'LOCATION'
])

deal_text = """
Series B Funding: Atlas Logistics
Atlas Logistics, headquartered in Seattle, announced a $50 million Series B round.
Lead investor Pinnacle Capital. Deal closes Q1 2025.
"""

entities = ner.extract(deal_text)
# Structured output:
# - COMPANY: Atlas Logistics
# - LOCATION: Seattle
# - MONEY: $50 million
# - DEAL_TYPE: Series B
# - INVESTOR: Pinnacle Capital
# - DATE: Q1 2025
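To turn the flat list into a single deal record, one simple policy is to keep the highest-confidence name for each type. The sketch below shows that policy. `to_deal_record` is a hypothetical helper and the confidence values in the sample are invented for illustration.

```python
def to_deal_record(entities, min_confidence=0.5):
    """Collapse a flat entity list into one record,
    keeping the top-scoring name per entity type."""
    best = {}
    for e in entities:
        if e["confidence"] < min_confidence:
            continue
        current = best.get(e["type"])
        if current is None or e["confidence"] > current["confidence"]:
            best[e["type"]] = e
    # Lowercase the type names so they read like record fields
    return {etype.lower(): e["name"] for etype, e in best.items()}

# Illustrative entities; confidence scores are made up
sample = [
    {"name": "Atlas Logistics", "type": "COMPANY", "confidence": 0.96},
    {"name": "Atlas", "type": "COMPANY", "confidence": 0.61},
    {"name": "Seattle", "type": "LOCATION", "confidence": 0.93},
    {"name": "$50 million", "type": "MONEY", "confidence": 0.95},
    {"name": "Series B", "type": "DEAL_TYPE", "confidence": 0.94},
    {"name": "Pinnacle Capital", "type": "INVESTOR", "confidence": 0.90},
    {"name": "Q1 2025", "type": "DATE", "confidence": 0.42},
]

record = to_deal_record(sample)
print(record)
```

A one-value-per-type policy suits fields like the deal type. Memos with several investors or companies would need lists instead.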

Best Practices

Choose appropriate entity types: Define types specific to your domain
Set confidence thresholds: Use higher thresholds (0.7+) for critical applications
Batch extraction: Use extract_batch() for multiple texts
Cache results: Store extracted entities in document metadata
Validate entities: Review extracted entities for accuracy in critical workflows
Use local for privacy: Local NER processes data entirely on-device
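The threshold and caching advice can be combined in a thin wrapper that memoizes results by a hash of the input text. This is a sketch under assumptions: `CachedExtractor` and `FakeExtractor` are hypothetical classes, the cache lives only in memory, and the default threshold of 0.7 simply echoes the guidance above.

```python
import hashlib

class CachedExtractor:
    """Wrap any object with an extract(text, min_confidence) method
    and memoize its results per (text hash, threshold)."""

    def __init__(self, extractor):
        self._extractor = extractor
        self._cache = {}
        self.misses = 0  # number of calls forwarded to the wrapped extractor

    def extract(self, text, min_confidence=0.7):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        key = (digest, min_confidence)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._extractor.extract(text, min_confidence)
        return self._cache[key]

class FakeExtractor:
    """Stand-in for a real provider so the sketch runs without network access."""

    def extract(self, text, min_confidence):
        found = [{"name": "Acme Corp", "type": "COMPANY", "confidence": 0.8}]
        return [e for e in found if e["confidence"] >= min_confidence]

ner = CachedExtractor(FakeExtractor())
first = ner.extract("Acme Corp signed the lease.")
second = ner.extract("Acme Corp signed the lease.")  # served from the cache
print(ner.misses)
```

For persistence across runs, the same key could index entities stored in document metadata, as the list above suggests.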

Limitations

Local NER: Fixed entity types (PERSON, ORG, LOCATION, MISC)
Local NER in Node.js: Requires a native build that exports NerModel; the prebuilt npm binaries may not include it, in which case use a cloud provider
Cloud providers: Require API keys and internet connection
Rate limits: Cloud providers have rate limits based on plan
Context length: Very long texts may need chunking
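For the context-length limitation, a common approach is to split the text into overlapping windows, extract from each, and merge duplicates. The sketch below does this by character count, which is an assumption made for simplicity (providers limit tokens, not characters). `chunk_text` and `merge_entities` are hypothetical helpers.

```python
def chunk_text(text, max_chars=2000, overlap=200):
    """Split text into windows of at most max_chars that overlap by `overlap` characters."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    if not text:
        return []
    step = max_chars - overlap
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + max_chars] for i in range(0, last_start, step)]

def merge_entities(chunk_results):
    """Deduplicate entities found in several chunks, keeping the highest confidence."""
    best = {}
    for entities in chunk_results:
        for e in entities:
            key = (e["type"], e["name"])
            if key not in best or e["confidence"] > best[key]["confidence"]:
                best[key] = e
    return list(best.values())

chunks = chunk_text("abcdefghij", max_chars=4, overlap=1)
print(chunks)

# The same entity seen in two overlapping chunks collapses to one entry
merged = merge_entities([
    [{"name": "Seattle", "type": "LOCATION", "confidence": 0.80}],
    [{"name": "Seattle", "type": "LOCATION", "confidence": 0.92}],
])
print(merged)
```

The overlap should be longer than the longest entity you expect, so that a name cut at one window boundary still appears whole in the next window.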

Next Steps

Visual Embeddings

Enable image and visual search with CLIP

Embedding Models

Configure text embedding models for semantic search

Python SDK

Complete Python SDK reference

Node.js SDK

Complete Node.js SDK reference
