Memvid supports Named Entity Recognition (NER) for extracting structured entities from documents. This enables building knowledge graphs, entity-based search, and relationship mapping, turning unstructured text into structured knowledge.

Overview

Entity extraction identifies and classifies named entities in text:
  • People: Names of individuals (executives, authors)
  • Organizations: Companies, institutions, agencies
  • Locations: Cities, countries, addresses
  • Dates: Temporal references, deadlines, events
  • Money: Currency amounts, valuations, prices
  • Custom types: Domain-specific entities (products, deals, regulations)

| Provider | Model | Entity Types | Best For |
|----------|-------|--------------|----------|
| Local | DistilBERT-NER | PERSON, ORG, LOCATION, MISC | Offline, privacy-first |
| OpenAI | GPT-4o-mini | Custom | High accuracy, custom entities |
| OpenAI | GPT-4o | Custom | Best quality |
| Claude | Claude 3.5 Sonnet | Custom | Nuanced extraction |
| Gemini | Gemini 2.0 Flash | Custom | Fast, cost-effective |

Quick Start

Python SDK

from memvid_sdk import create
from memvid_sdk.entities import get_entity_extractor

# Initialize entity extractor
ner = get_entity_extractor('openai', entity_types=['COMPANY', 'PERSON', 'MONEY', 'DATE', 'LOCATION'])
print(f"Provider: {ner.name}")
print(f"Entity types: {ner.entity_types}")

# Extract entities from text
text = """
Microsoft CEO Satya Nadella announced a $50 million investment in Seattle.
The deal closes December 2024 with Pinnacle Financial as lead investor.
"""

entities = ner.extract(text, min_confidence=0.5)
for entity in entities:
    print(f"  {entity['name']} ({entity['type']}, {entity['confidence']:.2f})")

# Output:
#   Microsoft (COMPANY, 0.95)
#   Satya Nadella (PERSON, 0.97)
#   $50 million (MONEY, 0.95)
#   Seattle (LOCATION, 0.90)
#   December 2024 (DATE, 0.88)
#   Pinnacle Financial (COMPANY, 0.92)

Node.js SDK

import { create, getEntityExtractor } from '@memvid/sdk';

// Initialize entity extractor
const ner = getEntityExtractor('openai', {
  entityTypes: ['COMPANY', 'PERSON', 'MONEY', 'DATE', 'LOCATION'],
});
console.log(`Provider: ${ner.name}`);
console.log(`Entity types: ${ner.entityTypes}`);

// Extract entities from text
const text = `
Microsoft CEO Satya Nadella announced a $50 million investment in Seattle.
The deal closes December 2024 with Pinnacle Financial as lead investor.
`;

const entities = await ner.extract(text, 0.5);
for (const entity of entities) {
  console.log(`  ${entity.name} (${entity.type}, ${entity.confidence.toFixed(2)})`);
}

Providers

Local NER (DistilBERT)

The default provider uses DistilBERT-NER, a lightweight model for offline entity extraction. Characteristics:
  • Model: DistilBERT fine-tuned on CoNLL-2003
  • Size: ~261 MB (downloaded on first use)
  • Entity types: PERSON, ORG, LOCATION, MISC (fixed)
  • Inference: CPU-based, no GPU required
  • Privacy: All processing happens locally

Usage (Python):
from memvid_sdk.entities import get_entity_extractor, LocalNER

# Using factory
ner = get_entity_extractor('local')

# Or direct instantiation
ner = LocalNER(model='distilbert-ner')

# Extract entities
entities = ner.extract("Apple CEO Tim Cook visited Paris headquarters.")
# [
#   {'name': 'Apple', 'type': 'ORG', 'confidence': 0.98},
#   {'name': 'Tim Cook', 'type': 'PERSON', 'confidence': 0.97},
#   {'name': 'Paris', 'type': 'LOCATION', 'confidence': 0.95},
# ]

Usage (Node.js):
import { getEntityExtractor, LocalNER } from '@memvid/sdk';

const ner = getEntityExtractor('local');
const entities = await ner.extract('Apple CEO Tim Cook visited Paris headquarters.');

Note: Local NER uses fixed entity types (PERSON, ORG, LOCATION, MISC). For custom entity types, use a cloud provider. In Node.js, LocalNER requires a native build that exports NerModel; the prebuilt npm binaries may not include it.

OpenAI Entities

OpenAI’s models provide high-accuracy extraction with custom entity types. Setup:
export OPENAI_API_KEY=sk-your-key-here
Usage (Python):
from memvid_sdk.entities import get_entity_extractor, OpenAIEntities

# Using factory with custom entity types
ner = get_entity_extractor('openai', entity_types=[
    'COMPANY',
    'PERSON',
    'LOCATION',
    'MONEY',
    'DATE',
    'PRODUCT',
    'DEAL_TYPE',
])

# Or with specific model
ner = get_entity_extractor('openai:gpt-4o-mini', entity_types=['COMPANY', 'PERSON'])

# Direct instantiation
ner = OpenAIEntities(
    model='gpt-4o-mini',
    entity_types=['COMPANY', 'EXECUTIVE', 'PRODUCT'],
)

# Extract entities
entities = ner.extract(text, min_confidence=0.5)

Usage (Node.js):
import { getEntityExtractor, OpenAIEntities } from '@memvid/sdk';

const ner = getEntityExtractor('openai', {
  entityTypes: ['COMPANY', 'PERSON', 'LOCATION', 'MONEY', 'DATE'],
});

// Or with a specific model (separate name, since `const` cannot be redeclared)
const miniNer = getEntityExtractor('openai:gpt-4o-mini', {
  entityTypes: ['COMPANY', 'PERSON'],
});

const entities = await ner.extract(text, 0.5);

Model Comparison:

| Model | Speed | Quality |
|-------|-------|---------|
| gpt-4o-mini | Fast | Good |
| gpt-4o | Medium | Best |
| gpt-4-turbo | Medium | Excellent |

Claude Entities

Anthropic’s Claude excels at nuanced entity extraction with context understanding. Setup:
export ANTHROPIC_API_KEY=your-key-here
Usage (Python):
from memvid_sdk.entities import get_entity_extractor, ClaudeEntities

# Using factory
ner = get_entity_extractor('claude', entity_types=['COMPANY', 'PERSON', 'REGULATION'])

# With specific model
ner = get_entity_extractor('claude:claude-3-5-sonnet-20241022', entity_types=['COMPANY'])

# Direct instantiation
ner = ClaudeEntities(
    model='claude-3-5-sonnet-20241022',
    entity_types=['COMPANY', 'EXECUTIVE', 'DEAL'],
)

entities = ner.extract(text, min_confidence=0.6)

Usage (Node.js):
import { getEntityExtractor, ClaudeEntities } from '@memvid/sdk';

const ner = getEntityExtractor('claude', {
  entityTypes: ['COMPANY', 'PERSON', 'REGULATION'],
});

const entities = await ner.extract(text, 0.6);

Gemini Entities

Google’s Gemini provides fast, cost-effective entity extraction. Setup:
export GEMINI_API_KEY=your-key-here
Usage (Python):
from memvid_sdk.entities import get_entity_extractor, GeminiEntities

# Using factory
ner = get_entity_extractor('gemini', entity_types=['COMPANY', 'PERSON'])

# With specific model
ner = get_entity_extractor('gemini:gemini-2.0-flash', entity_types=['COMPANY'])

entities = ner.extract(text, min_confidence=0.5)

Usage (Node.js):
import { getEntityExtractor, GeminiEntities } from '@memvid/sdk';

const ner = getEntityExtractor('gemini', {
  entityTypes: ['COMPANY', 'PERSON'],
});

const entities = await ner.extract(text, 0.5);

Complete Example

Here’s a full workflow for document entity extraction:
from pathlib import Path
from memvid_sdk import create
from memvid_sdk.entities import get_entity_extractor

# Configuration
PROVIDER = 'openai'
ENTITY_TYPES = ['COMPANY', 'PERSON', 'LOCATION', 'MONEY', 'DATE', 'DEAL_TYPE']
DATASET_DIR = Path('documents/')
OUTPUT_PATH = 'knowledge_base.mv2'

# Initialize
ner = get_entity_extractor(PROVIDER, entity_types=ENTITY_TYPES)
print(f"Entity Extractor: {ner.name}")
print(f"Entity Types: {', '.join(ner.entity_types)}")

# Create memory
if Path(OUTPUT_PATH).exists():
    Path(OUTPUT_PATH).unlink()

mem = create(OUTPUT_PATH)
mem.enable_lex()

# Process documents
all_entities = []
pdf_files = list(DATASET_DIR.glob('*.pdf'))

for i, pdf_path in enumerate(pdf_files):
    print(f"\n[{i+1}/{len(pdf_files)}] {pdf_path.name}")

    # Store document
    frame_id = mem.put(
        title=pdf_path.stem.replace('_', ' ').title(),
        label='document',
        metadata={},
        file=str(pdf_path),
    )
    print(f"    Stored as frame {frame_id}")

    # Extract entities (from document text or summary)
    document_text = f"Document: {pdf_path.stem}"  # Replace with actual text extraction
    entities = ner.extract(document_text, min_confidence=0.5)

    print(f"    Found {len(entities)} entities:")
    for e in entities[:4]:
        print(f"      - {e['name']} ({e['type']}, {e['confidence']:.2f})")

    all_entities.extend(entities)

# Entity statistics
print("\n--- Entity Summary ---")
counts = {}
for e in all_entities:
    t = e.get('type', 'UNKNOWN')
    counts[t] = counts.get(t, 0) + 1

for entity_type, count in sorted(counts.items(), key=lambda x: -x[1]):
    print(f"  {entity_type}: {count}")

# Seal
mem.seal()
stats = mem.stats()
print(f"\nFinal: {stats.get('frame_count', 0)} frames, {len(all_entities)} entities")

Custom Entity Types

Cloud providers support custom entity types tailored to your domain:

Finance Domain

ner = get_entity_extractor('openai', entity_types=[
    'COMPANY',
    'INVESTOR',
    'FUND',
    'MONEY',
    'DEAL_TYPE',      # IPO, M&A, Series A
    'VALUATION',
    'EXECUTIVE',
    'DATE',
])

Legal Domain

ner = get_entity_extractor('claude', entity_types=[
    'PARTY',
    'COURT',
    'JUDGE',
    'CASE_NUMBER',
    'STATUTE',
    'DATE',
    'JURISDICTION',
])

Healthcare Domain

ner = get_entity_extractor('openai:gpt-4o', entity_types=[
    'PATIENT',
    'PROVIDER',
    'MEDICATION',
    'DIAGNOSIS',
    'PROCEDURE',
    'DATE',
    'FACILITY',
])

API Reference

EntityExtractor Interface

All entity extractors implement this interface:

| Member | Description |
|--------|-------------|
| name | Provider identifier (e.g., openai:gpt-4o-mini) |
| entity_types | List of supported entity types |
| extract(text, min_confidence) | Extract entities from a single text |
| extract_batch(texts, min_confidence) | Extract entities from multiple texts in one call |
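
The interface above could be modelled in Python roughly as follows. This is a sketch, not the SDK's actual class definitions: the `EntityExtractor` Protocol, the `KeywordExtractor` stand-in, and the `0.5` default threshold are assumptions made for illustration. Only the member names (`name`, `entity_types`, `extract`, `extract_batch`) and the entity fields come from this page.

```python
from typing import Dict, List, Protocol, Union

Entity = Dict[str, Union[str, float]]


class EntityExtractor(Protocol):
    """Hypothetical structural type mirroring the members listed above."""

    name: str
    entity_types: List[str]

    # The 0.5 default is an assumption; the SDK's real default is not documented here.
    def extract(self, text: str, min_confidence: float = 0.5) -> List[Entity]: ...

    def extract_batch(self, texts: List[str], min_confidence: float = 0.5) -> List[List[Entity]]: ...


class KeywordExtractor:
    """Toy stand-in that matches a fixed lookup table; not part of the SDK."""

    def __init__(self, lookup: Dict[str, str]) -> None:
        self.name = "toy:keyword"
        self.lookup = lookup
        self.entity_types = sorted(set(lookup.values()))

    def extract(self, text, min_confidence=0.5):
        # Every keyword hit gets a fixed, made-up confidence of 1.0.
        hits = [
            {"name": kw, "type": etype, "confidence": 1.0}
            for kw, etype in self.lookup.items()
            if kw in text
        ]
        return [e for e in hits if e["confidence"] >= min_confidence]

    def extract_batch(self, texts, min_confidence=0.5):
        return [self.extract(t, min_confidence) for t in texts]


def count_entities(extractor: EntityExtractor, texts: List[str]) -> int:
    """Works with any object that satisfies the Protocol."""
    return sum(len(batch) for batch in extractor.extract_batch(texts))


toy = KeywordExtractor({"Apple": "ORG", "Paris": "LOCATION"})
total = count_entities(toy, ["Apple opened an office in Paris.", "No entities here."])
```

Code written against the Protocol rather than a concrete class can swap a local extractor for a cloud one without changes, which is the point of a shared interface.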

Entity Object

Each extracted entity contains:

| Field | Type | Description |
|-------|------|-------------|
| name | string | Entity text as it appears |
| type | string | Entity classification |
| confidence | float | Confidence score (0.0-1.0) |
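
For code that consumes these objects, the three fields can be written down as a typed dictionary. The sketch below assumes entities arrive as plain dicts, as in the examples on this page; the `Entity` type and both helper functions are hypothetical, not SDK exports.

```python
from typing import List, TypedDict


class Entity(TypedDict):
    name: str          # entity text as it appears
    type: str          # entity classification
    confidence: float  # score in [0.0, 1.0]


def is_valid_entity(obj: dict) -> bool:
    """Check that a dict has the three documented fields with sane values."""
    return (
        isinstance(obj.get("name"), str)
        and isinstance(obj.get("type"), str)
        and isinstance(obj.get("confidence"), (int, float))
        and 0.0 <= obj["confidence"] <= 1.0
    )


def filter_by_confidence(entities: List[Entity], threshold: float) -> List[Entity]:
    """Keep only entities whose score meets the threshold."""
    return [e for e in entities if e["confidence"] >= threshold]


sample = [
    {"name": "Apple", "type": "ORG", "confidence": 0.98},
    {"name": "Paris", "type": "LOCATION", "confidence": 0.62},
]
kept = filter_by_confidence(sample, 0.7)
```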

Factory Function

# Python
from memvid_sdk.entities import get_entity_extractor

ner = get_entity_extractor(
    provider,           # 'local', 'openai', 'claude', 'gemini', 'openai:model-name'
    entity_types=None,  # Custom entity types (cloud providers only)
    api_key=None,       # Override env var
)

// Node.js
import { getEntityExtractor } from '@memvid/sdk';

const ner = getEntityExtractor(provider, {
  entityTypes: ['COMPANY', 'PERSON'],  // Custom entity types
  apiKey: undefined,                    // Override env var
});
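
The provider argument accepts either a bare provider name or a `provider:model` string. How the SDK parses it internally is not shown here; the following is a minimal sketch of one way such a string could be split, with `parse_provider_spec` and `KNOWN_PROVIDERS` as hypothetical names.

```python
from typing import Optional, Tuple

# Provider names taken from the factory comment above.
KNOWN_PROVIDERS = {"local", "openai", "claude", "gemini"}


def parse_provider_spec(spec: str) -> Tuple[str, Optional[str]]:
    """Split 'provider' or 'provider:model' into (provider, model or None)."""
    provider, sep, model = spec.partition(":")
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    return provider, (model if sep and model else None)


with_model = parse_provider_spec("openai:gpt-4o-mini")
bare = parse_provider_spec("local")
```

Splitting on the first colon only keeps model names that themselves contain dashes or dots intact.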

Environment Variables

| Variable | Description |
|----------|-------------|
| OPENAI_API_KEY | OpenAI API key |
| ANTHROPIC_API_KEY | Anthropic API key for Claude |
| GEMINI_API_KEY | Google AI API key for Gemini |
| MEMVID_MODELS_DIR | Local model cache directory |
| MEMVID_OFFLINE=1 | Skip model downloads (local NER) |
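
A missing key usually surfaces only when the first request fails, so it can help to check up front. The sketch below maps each cloud provider to the variable listed above; `missing_api_key` is a hypothetical helper, not an SDK function.

```python
import os
from typing import Mapping, Optional

# Variable names as listed in the table above.
API_KEY_VARS = {
    "openai": "OPENAI_API_KEY",
    "claude": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def missing_api_key(provider: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the name of the unset variable, or None if nothing is missing."""
    env = os.environ if env is None else env
    # Accept 'openai:gpt-4o' style strings by looking only at the provider part.
    var = API_KEY_VARS.get(provider.split(":", 1)[0])
    if var is None:  # e.g. 'local', which needs no API key
        return None
    return None if env.get(var) else var


# Passing an explicit mapping keeps the check independent of the real environment.
problem = missing_api_key("openai", env={})
```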

Use Cases

Document Intelligence

Extract structured data from unstructured documents:
# Process legal contracts
ner = get_entity_extractor('claude', entity_types=[
    'PARTY', 'DATE', 'MONEY', 'TERM', 'JURISDICTION'
])

contract_text = "Agreement between Acme Corp and Beta Inc dated January 15, 2024..."
entities = ner.extract(contract_text)

# Build structured contract summary
parties = [e['name'] for e in entities if e['type'] == 'PARTY']
dates = [e['name'] for e in entities if e['type'] == 'DATE']

Knowledge Graph Building

Create entity-relationship graphs from documents:
# Extract entities from multiple documents
all_entities = []
for doc in documents:
    entities = ner.extract(doc.text)
    for e in entities:
        e['source_doc'] = doc.id
    all_entities.extend(entities)

# Build co-occurrence graph
from collections import defaultdict
co_occurrences = defaultdict(int)
for doc_id in set(e['source_doc'] for e in all_entities):
    doc_entities = [e for e in all_entities if e['source_doc'] == doc_id]
    for i, e1 in enumerate(doc_entities):
        for e2 in doc_entities[i+1:]:
            pair = tuple(sorted([e1['name'], e2['name']]))
            co_occurrences[pair] += 1
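
The loop above counts one pair per pair of mentions, so an entity mentioned twice in a document is paired with itself and inflates its counts. A variant that counts each pair at most once per document might look like this; it assumes the same dict shape with a `source_doc` key, and `co_occurrence_counts` is a hypothetical name.

```python
from collections import Counter
from itertools import combinations


def co_occurrence_counts(entities):
    """Count, per pair of distinct names, the number of documents containing both."""
    names_by_doc = {}
    for e in entities:
        names_by_doc.setdefault(e["source_doc"], set()).add(e["name"])

    counts = Counter()
    for names in names_by_doc.values():
        # Sorting gives each pair one canonical order, e.g. ('A', 'B') never ('B', 'A').
        counts.update(combinations(sorted(names), 2))
    return counts


sample = [
    {"name": "Microsoft", "type": "COMPANY", "source_doc": "a"},
    {"name": "Satya Nadella", "type": "PERSON", "source_doc": "a"},
    {"name": "Microsoft", "type": "COMPANY", "source_doc": "a"},  # repeated mention
    {"name": "Microsoft", "type": "COMPANY", "source_doc": "b"},
    {"name": "Seattle", "type": "LOCATION", "source_doc": "b"},
]
pairs = co_occurrence_counts(sample)
```

Grouping names by document first also avoids rescanning the full entity list once per document.
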

Entity-Based Search

Find documents by the entities they mention:
# Store entities with documents
for doc in documents:
    entities = ner.extract(doc.text)

    frame_id = mem.put(
        title=doc.title,
        label='document',
        metadata={
            'entities': entities,
            'companies': [e['name'] for e in entities if e['type'] == 'COMPANY'],
            'people': [e['name'] for e in entities if e['type'] == 'PERSON'],
        },
        text=doc.text,
    )

# Search by entity
results = mem.find('Microsoft', k=10)
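
The snippet above relies on `mem.find` for retrieval. Where an exact lookup by entity type and name is wanted, one option is a small in-memory index kept alongside the memory file. The sketch below is independent of the SDK; `build_entity_index` and `lookup` are hypothetical helpers, and the document IDs are made up.

```python
from collections import defaultdict


def build_entity_index(docs):
    """Map (type, lowercased name) to the set of document IDs mentioning it.

    docs: iterable of (doc_id, entities) pairs.
    """
    index = defaultdict(set)
    for doc_id, entities in docs:
        for e in entities:
            index[(e["type"], e["name"].lower())].add(doc_id)
    return index


def lookup(index, entity_type, name):
    """Return sorted document IDs for an exact, case-insensitive match."""
    return sorted(index.get((entity_type, name.lower()), set()))


docs = [
    (1, [{"name": "Microsoft", "type": "COMPANY", "confidence": 0.95}]),
    (2, [{"name": "microsoft", "type": "COMPANY", "confidence": 0.91},
         {"name": "Seattle", "type": "LOCATION", "confidence": 0.90}]),
]
index = build_entity_index(docs)
hits = lookup(index, "COMPANY", "Microsoft")
```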

Deal Memo Analysis

Extract structured deal information:
ner = get_entity_extractor('openai', entity_types=[
    'COMPANY', 'INVESTOR', 'MONEY', 'DEAL_TYPE', 'DATE', 'LOCATION'
])

deal_text = """
Series B Funding: Atlas Logistics
Atlas Logistics, headquartered in Seattle, announced a $50 million Series B round.
Lead investor Pinnacle Capital. Deal closes Q1 2025.
"""

entities = ner.extract(deal_text)
# Structured output:
# - COMPANY: Atlas Logistics
# - LOCATION: Seattle
# - MONEY: $50 million
# - DEAL_TYPE: Series B
# - INVESTOR: Pinnacle Capital
# - DATE: Q1 2025
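
To turn the flat entity list into the structured view shown in the comments, the entities can be grouped by type. This is a sketch using the entity names from the example; the confidence values are illustrative placeholders, and `group_by_type` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_type(entities):
    """Group entity names by type, preserving order and dropping repeats."""
    grouped = defaultdict(list)
    for e in entities:
        if e["name"] not in grouped[e["type"]]:
            grouped[e["type"]].append(e["name"])
    return dict(grouped)


# Names taken from the example above; confidence scores are made up.
entities = [
    {"name": "Atlas Logistics", "type": "COMPANY", "confidence": 0.95},
    {"name": "Seattle", "type": "LOCATION", "confidence": 0.90},
    {"name": "$50 million", "type": "MONEY", "confidence": 0.95},
    {"name": "Series B", "type": "DEAL_TYPE", "confidence": 0.90},
    {"name": "Atlas Logistics", "type": "COMPANY", "confidence": 0.93},
    {"name": "Pinnacle Capital", "type": "INVESTOR", "confidence": 0.90},
    {"name": "Q1 2025", "type": "DATE", "confidence": 0.85},
]
deal = group_by_type(entities)
```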

Best Practices

  1. Choose appropriate entity types: Define types specific to your domain
  2. Set confidence thresholds: Use higher thresholds (0.7+) for critical applications
  3. Batch extraction: Use extract_batch() for multiple texts
  4. Cache results: Store extracted entities in document metadata
  5. Validate entities: Review extracted entities for accuracy in critical workflows
  6. Use local for privacy: Local NER processes data entirely on-device
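
Practices 2 and 4 can be combined in a thin wrapper: cache the raw extraction per text, then apply the threshold on read. The sketch below assumes the wrapped function takes `(text, min_confidence)` and that a threshold of `0.0` returns every entity; `CachedExtractor` and `fake_extract` are hypothetical, with `fake_extract` standing in for a real `ner.extract`.

```python
import hashlib


class CachedExtractor:
    """Cache unfiltered results by text hash; filter by threshold on each call."""

    def __init__(self, extract_fn):
        self._extract_fn = extract_fn
        self._cache = {}
        self.misses = 0  # number of times the underlying extractor was called

    def extract(self, text, min_confidence=0.7):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._cache:
            self.misses += 1
            # Store the unfiltered result so different thresholds share one call.
            self._cache[key] = self._extract_fn(text, 0.0)
        return [e for e in self._cache[key] if e["confidence"] >= min_confidence]


def fake_extract(text, min_confidence):
    """Stand-in for ner.extract; returns canned entities."""
    data = [
        {"name": "Acme Corp", "type": "COMPANY", "confidence": 0.9},
        {"name": "Q1", "type": "DATE", "confidence": 0.55},
    ]
    return [e for e in data if e["confidence"] >= min_confidence]


cached = CachedExtractor(fake_extract)
strict = cached.extract("Acme Corp closes in Q1.")
loose = cached.extract("Acme Corp closes in Q1.", min_confidence=0.5)
```

Here the second call is served from the cache, so tuning the threshold does not trigger another provider request.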

Limitations

  • Local NER: Fixed entity types (PERSON, ORG, LOCATION, MISC)
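  • Local NER in Node.js: Requires a native build that exports NerModel; the prebuilt npm binaries may not include it, so Node.js projects may need a cloud provider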
  • Cloud providers: Require API keys and internet connection
  • Rate limits: Cloud providers have rate limits based on plan
  • Context length: Very long texts may need chunking
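
For the context-length limitation, a common workaround is to split the text into overlapping chunks, extract from each, and merge duplicates. The sketch below splits by character count, which is a simplification: the chunk sizes are arbitrary, an entity can still be cut at a boundary if it is longer than the overlap, and `chunk_text` and `extract_chunked` are hypothetical helpers.

```python
def chunk_text(text, max_chars=2000, overlap=200):
    """Split text into windows of max_chars that overlap by `overlap` characters."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap
    return chunks


def extract_chunked(extract_fn, text, **chunk_kwargs):
    """Run extract_fn on each chunk; keep the best-scoring copy of each entity."""
    best = {}
    for chunk in chunk_text(text, **chunk_kwargs):
        for e in extract_fn(chunk):
            key = (e["name"], e["type"])
            if key not in best or e["confidence"] > best[key]["confidence"]:
                best[key] = e
    return list(best.values())


windows = chunk_text("abcdefghij", max_chars=4, overlap=1)
```

With a real extractor, `extract_fn` could be something like `lambda chunk: ner.extract(chunk, min_confidence=0.5)`.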

Next Steps

  • Visual Embeddings: Enable image and visual search with CLIP
  • Embedding Models: Configure text embedding models for semantic search
  • Python SDK: Complete Python SDK reference
  • Node.js SDK: Complete Node.js SDK reference