Memvid supports Named Entity Recognition (NER) for extracting structured entities from documents. This enables building knowledge graphs, entity-based search, and relationship mapping, turning unstructured text into structured knowledge.

Overview

Entity extraction identifies and classifies named entities in text:
  • People: Names of individuals (CEOs, executives, authors)
  • Organizations: Companies, institutions, agencies
  • Locations: Cities, countries, addresses
  • Dates: Temporal references, deadlines, events
  • Money: Currency amounts, valuations, prices
  • Custom types: Domain-specific entities (products, deals, regulations)

| Provider | Model | Entity Types | Best For |
|----------|-------|--------------|----------|
| Local | DistilBERT-NER | PERSON, ORG, LOCATION, MISC | Offline, privacy-first |
| OpenAI | GPT-4o-mini | Custom | High accuracy, custom entities |
| OpenAI | GPT-4o | Custom | Best quality |
| Claude | Claude 3.5 Sonnet | Custom | Nuanced extraction |
| Gemini | Gemini 2.0 Flash | Custom | Fast, cost-effective |

Quick Start

Python SDK

from memvid_sdk import create
from memvid_sdk.entities import get_entity_extractor

# Initialize entity extractor
# LOCATION is requested here because the example output below includes one
ner = get_entity_extractor('openai', entity_types=['COMPANY', 'PERSON', 'MONEY', 'DATE', 'LOCATION'])
print(f"Provider: {ner.name}")
print(f"Entity types: {ner.entity_types}")

# Extract entities from text
text = """
Microsoft CEO Satya Nadella announced a $50 million investment in Seattle.
The deal closes December 2024 with Pinnacle Financial as lead investor.
"""

entities = ner.extract(text, min_confidence=0.5)
for entity in entities:
    print(f"  {entity['name']} ({entity['type']}, {entity['confidence']:.2f})")

# Output:
#   Microsoft (COMPANY, 0.95)
#   Satya Nadella (PERSON, 0.97)
#   $50 million (MONEY, 0.95)
#   Seattle (LOCATION, 0.90)
#   December 2024 (DATE, 0.88)
#   Pinnacle Financial (COMPANY, 0.92)

Node.js SDK

import { create, getEntityExtractor } from '@memvid/sdk';

// Initialize entity extractor
const ner = getEntityExtractor('openai', {
  entityTypes: ['COMPANY', 'PERSON', 'MONEY', 'DATE', 'LOCATION'],
});
console.log(`Provider: ${ner.name}`);
console.log(`Entity types: ${ner.entityTypes}`);

// Extract entities from text
const text = `
Microsoft CEO Satya Nadella announced a $50 million investment in Seattle.
The deal closes December 2024 with Pinnacle Financial as lead investor.
`;

const entities = await ner.extract(text, 0.5);
for (const entity of entities) {
  console.log(`  ${entity.name} (${entity.type}, ${entity.confidence.toFixed(2)})`);
}
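
Once extraction returns, a common next step is grouping the results by type. The following is a minimal sketch that assumes only the entity shape shown in the example output (dicts with `name`, `type`, and `confidence`). `group_by_type` is a hypothetical helper, not part of the SDK, and the sample list is copied from the example output rather than produced by a live API call.

```python
from collections import defaultdict

# Sample entities copied from the Quick Start example output (not a live API call).
entities = [
    {"name": "Microsoft", "type": "COMPANY", "confidence": 0.95},
    {"name": "Satya Nadella", "type": "PERSON", "confidence": 0.97},
    {"name": "$50 million", "type": "MONEY", "confidence": 0.95},
    {"name": "Seattle", "type": "LOCATION", "confidence": 0.90},
    {"name": "December 2024", "type": "DATE", "confidence": 0.88},
    {"name": "Pinnacle Financial", "type": "COMPANY", "confidence": 0.92},
]


def group_by_type(entities, min_confidence=0.0):
    """Hypothetical helper: map each entity type to the names found for it."""
    grouped = defaultdict(list)
    for e in entities:
        # Drop anything below the caller's confidence threshold.
        if e["confidence"] >= min_confidence:
            grouped[e["type"]].append(e["name"])
    return dict(grouped)


grouped = group_by_type(entities, min_confidence=0.9)
print(grouped["COMPANY"])
```

With a threshold of 0.9 the DATE entity (0.88) is filtered out, so the resulting dict has no `DATE` key.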

Providers

Local NER (DistilBERT)

The default provider uses DistilBERT-NER, a lightweight model for offline entity extraction. Characteristics:
  • Model: DistilBERT fine-tuned on CoNLL-2003
  • Size: ~261 MB (downloaded on first use)
  • Entity types: PERSON, ORG, LOCATION, MISC (fixed)
  • Inference: CPU-based, no GPU required
  • Privacy: All processing happens locally

# Python
from memvid_sdk.entities import get_entity_extractor, LocalNER

# Using factory
ner = get_entity_extractor('local')

# Or direct instantiation
ner = LocalNER(model='distilbert-ner')

# Extract entities
entities = ner.extract("Apple CEO Tim Cook visited Paris headquarters.")
# [
#   {'name': 'Apple', 'type': 'ORG', 'confidence': 0.98},
#   {'name': 'Tim Cook', 'type': 'PERSON', 'confidence': 0.97},
#   {'name': 'Paris', 'type': 'LOCATION', 'confidence': 0.95},
# ]

// Node.js
import { getEntityExtractor, LocalNER } from '@memvid/sdk';

const ner = getEntityExtractor('local');
const entities = await ner.extract('Apple CEO Tim Cook visited Paris headquarters.');

Note: Local NER uses fixed entity types (PERSON, ORG, LOCATION, MISC). For custom entity types, use a cloud provider. In Node.js, LocalNER requires a native build that exports NerModel; the prebuilt npm binaries may not include it.

OpenAI Entities

OpenAI’s models provide high-accuracy extraction with custom entity types. Setup:
export OPENAI_API_KEY=sk-your-key-here
Usage:
from memvid_sdk.entities import get_entity_extractor, OpenAIEntities

# Using factory with custom entity types
ner = get_entity_extractor('openai', entity_types=[
    'COMPANY',
    'PERSON',
    'LOCATION',
    'MONEY',
    'DATE',
    'PRODUCT',
    'DEAL_TYPE',
])

# Or with specific model
ner = get_entity_extractor('openai:gpt-4o-mini', entity_types=['COMPANY', 'PERSON'])

# Direct instantiation
ner = OpenAIEntities(
    model='gpt-4o-mini',
    entity_types=['COMPANY', 'EXECUTIVE', 'PRODUCT'],
)

# Extract entities
entities = ner.extract(text, min_confidence=0.5)

// Node.js
import { getEntityExtractor, OpenAIEntities } from '@memvid/sdk';

const ner = getEntityExtractor('openai', {
  entityTypes: ['COMPANY', 'PERSON', 'LOCATION', 'MONEY', 'DATE'],
});

// Or with a specific model (a separate name, since `const ner` cannot be declared twice)
const nerMini = getEntityExtractor('openai:gpt-4o-mini', {
  entityTypes: ['COMPANY', 'PERSON'],
});

const entities = await ner.extract(text, 0.5);

Model Comparison:

| Model | Speed | Quality |
|-------|-------|---------|
| gpt-4o-mini | Fast | Good |
| gpt-4o | Medium | Best |
| gpt-4-turbo | Medium | Excellent |

Claude Entities

Anthropic’s Claude excels at nuanced entity extraction with context understanding. Setup:
export ANTHROPIC_API_KEY=your-key-here
Usage:
from memvid_sdk.entities import get_entity_extractor, ClaudeEntities

# Using factory
ner = get_entity_extractor('claude', entity_types=['COMPANY', 'PERSON', 'REGULATION'])

# With specific model
ner = get_entity_extractor('claude:claude-3-5-sonnet-20241022', entity_types=['COMPANY'])

# Direct instantiation
ner = ClaudeEntities(
    model='claude-3-5-sonnet-20241022',
    entity_types=['COMPANY', 'EXECUTIVE', 'DEAL'],
)

entities = ner.extract(text, min_confidence=0.6)

// Node.js
import { getEntityExtractor, ClaudeEntities } from '@memvid/sdk';

const ner = getEntityExtractor('claude', {
  entityTypes: ['COMPANY', 'PERSON', 'REGULATION'],
});

const entities = await ner.extract(text, 0.6);

Gemini Entities

Google’s Gemini provides fast, cost-effective entity extraction. Setup:
export GEMINI_API_KEY=your-key-here
Usage:
from memvid_sdk.entities import get_entity_extractor, GeminiEntities

# Using factory
ner = get_entity_extractor('gemini', entity_types=['COMPANY', 'PERSON'])

# With specific model
ner = get_entity_extractor('gemini:gemini-2.0-flash', entity_types=['COMPANY'])

entities = ner.extract(text, min_confidence=0.5)

// Node.js
import { getEntityExtractor, GeminiEntities } from '@memvid/sdk';

const ner = getEntityExtractor('gemini', {
  entityTypes: ['COMPANY', 'PERSON'],
});

const entities = await ner.extract(text, 0.5);

Complete Example

Here’s a full workflow for document entity extraction:
from pathlib import Path
from memvid_sdk import create
from memvid_sdk.entities import get_entity_extractor

# Configuration
PROVIDER = 'openai'
ENTITY_TYPES = ['COMPANY', 'PERSON', 'LOCATION', 'MONEY', 'DATE', 'DEAL_TYPE']
DATASET_DIR = Path('documents/')
OUTPUT_PATH = 'knowledge_base.mv2'

# Initialize
ner = get_entity_extractor(PROVIDER, entity_types=ENTITY_TYPES)
print(f"Entity Extractor: {ner.name}")
print(f"Entity Types: {', '.join(ner.entity_types)}")

# Create memory
if Path(OUTPUT_PATH).exists():
    Path(OUTPUT_PATH).unlink()

mem = create(OUTPUT_PATH)
mem.enable_lex()

# Process documents
all_entities = []
pdf_files = list(DATASET_DIR.glob('*.pdf'))

for i, pdf_path in enumerate(pdf_files):
    print(f"\n[{i+1}/{len(pdf_files)}] {pdf_path.name}")

    # Store document
    frame_id = mem.put(
        title=pdf_path.stem.replace('_', ' ').title(),
        label='document',
        metadata={},
        file=str(pdf_path),
    )
    print(f"    Stored as frame {frame_id}")

    # Extract entities (from document text or summary)
    document_text = f"Document: {pdf_path.stem}"  # Replace with actual text extraction
    entities = ner.extract(document_text, min_confidence=0.5)

    print(f"    Found {len(entities)} entities:")
    for e in entities[:4]:
        print(f"      - {e['name']} ({e['type']}, {e['confidence']:.2f})")

    all_entities.extend(entities)

# Entity statistics
print("\n--- Entity Summary ---")
counts = {}
for e in all_entities:
    t = e.get('type', 'UNKNOWN')
    counts[t] = counts.get(t, 0) + 1

for entity_type, count in sorted(counts.items(), key=lambda x: -x[1]):
    print(f"  {entity_type}: {count}")

# Seal
mem.seal()
stats = mem.stats()
print(f"\nFinal: {stats.get('frame_count', 0)} frames, {len(all_entities)} entities")
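
When the same entity appears in several documents, `all_entities` ends up with repeated entries. The following is a minimal sketch of collapsing them before reporting, assuming only the documented entity shape. `dedupe_entities` is a hypothetical helper, not an SDK function, and the sample data is made up for illustration.

```python
def dedupe_entities(entities):
    """Hypothetical helper: keep one entry per (name, type), with the highest confidence."""
    best = {}
    for e in entities:
        # Normalize the name so "Microsoft" and " microsoft " collapse together.
        key = (e["name"].strip().lower(), e["type"])
        if key not in best or e["confidence"] > best[key]["confidence"]:
            best[key] = e
    return list(best.values())


# Illustrative data: the same company extracted from two documents.
all_entities = [
    {"name": "Microsoft", "type": "COMPANY", "confidence": 0.91},
    {"name": "Seattle", "type": "LOCATION", "confidence": 0.90},
    {"name": "microsoft", "type": "COMPANY", "confidence": 0.97},
]

unique = dedupe_entities(all_entities)
print(len(unique))
```

Keying on the lowercased name is a deliberate simplification; it will not merge aliases such as "Microsoft" and "Microsoft Corp."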

Custom Entity Types

Cloud providers support custom entity types tailored to your domain:

Finance Domain

ner = get_entity_extractor('openai', entity_types=[
    'COMPANY',
    'INVESTOR',
    'FUND',
    'MONEY',
    'DEAL_TYPE',      # IPO, M&A, Series A
    'VALUATION',
    'EXECUTIVE',
    'DATE',
])

Legal Domain

ner = get_entity_extractor('claude', entity_types=[
    'PARTY',
    'COURT',
    'JUDGE',
    'CASE_NUMBER',
    'STATUTE',
    'DATE',
    'JURISDICTION',
])

Healthcare Domain

ner = get_entity_extractor('openai:gpt-4o', entity_types=[
    'PATIENT',
    'PROVIDER',
    'MEDICATION',
    'DIAGNOSIS',
    'PROCEDURE',
    'DATE',
    'FACILITY',
])

API Reference

EntityExtractor Interface

All entity extractors implement this interface:

| Member | Description |
|--------|-------------|
| name (property) | Provider identifier (e.g., openai:gpt-4o-mini) |
| entity_types (property) | List of supported entity types |
| extract(text, min_confidence) | Extract entities from text |
| extract_batch(texts, min_confidence) | Batch extract from multiple texts |
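
The interface can be written down as a `typing.Protocol`, which also makes it easy to substitute a stub in unit tests that should not call a cloud API. This is a sketch under stated assumptions: the default `min_confidence` of 0.5 and the list-of-lists return shape of `extract_batch` are guesses rather than documented behavior, and `KeywordExtractor` is a hypothetical test double, not an SDK class.

```python
from typing import Dict, List, Protocol, Union

Entity = Dict[str, Union[str, float]]


class EntityExtractor(Protocol):
    """Structural type mirroring the members listed in the table above."""

    name: str
    entity_types: List[str]

    def extract(self, text: str, min_confidence: float = 0.5) -> List[Entity]: ...

    def extract_batch(
        self, texts: List[str], min_confidence: float = 0.5
    ) -> List[List[Entity]]: ...


class KeywordExtractor:
    """Hypothetical test double: tags known keywords with one fixed confidence."""

    def __init__(self, keywords: Dict[str, str], confidence: float = 0.9):
        self.name = "stub:keyword"
        self.entity_types = sorted(set(keywords.values()))
        self._keywords = keywords
        self._confidence = confidence

    def extract(self, text: str, min_confidence: float = 0.5) -> List[Entity]:
        if self._confidence < min_confidence:
            return []
        return [
            {"name": kw, "type": etype, "confidence": self._confidence}
            for kw, etype in self._keywords.items()
            if kw in text
        ]

    def extract_batch(
        self, texts: List[str], min_confidence: float = 0.5
    ) -> List[List[Entity]]:
        # Assumption: one result list per input text, in input order.
        return [self.extract(t, min_confidence) for t in texts]


stub: EntityExtractor = KeywordExtractor({"Apple": "ORG", "Paris": "LOCATION"})
print(stub.extract("Apple CEO Tim Cook visited Paris headquarters."))
```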

Entity Object

Each extracted entity contains:

| Field | Type | Description |
|-------|------|-------------|
| name | string | Entity text as it appears |
| type | string | Entity classification |
| confidence | float | Confidence score (0.0-1.0) |
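
Because cloud providers return model-generated output, it can be worth checking each entity against this shape before storing it. The following is a minimal sketch: `Entity` and `is_valid_entity` are hypothetical names introduced here, and the checks cover only the three fields in the table.

```python
from typing import Any, TypedDict


class Entity(TypedDict):
    """The three documented fields of an extracted entity."""

    name: str
    type: str
    confidence: float


def is_valid_entity(obj: Any) -> bool:
    """Return True if obj has non-empty name/type strings and a confidence in [0, 1]."""
    if not isinstance(obj, dict):
        return False
    name = obj.get("name")
    etype = obj.get("type")
    conf = obj.get("confidence")
    return (
        isinstance(name, str) and bool(name)
        and isinstance(etype, str) and bool(etype)
        # bool is a subclass of int, so exclude it explicitly.
        and isinstance(conf, (int, float)) and not isinstance(conf, bool)
        and 0.0 <= conf <= 1.0
    )


good: Entity = {"name": "Tim Cook", "type": "PERSON", "confidence": 0.97}
print(is_valid_entity(good))
```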

Factory Function

# Python
from memvid_sdk.entities import get_entity_extractor

ner = get_entity_extractor(
    provider,           # 'local', 'openai', 'claude', 'gemini', 'openai:model-name'
    entity_types=None,  # Custom entity types (cloud providers only)
    api_key=None,       # Override env var
)

// Node.js
import { getEntityExtractor } from '@memvid/sdk';

const ner = getEntityExtractor(provider, {
  entityTypes: ['COMPANY', 'PERSON'],  // Custom entity types
  apiKey: undefined,                    // Override env var
});
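
The provider argument accepts either a bare provider name (`'openai'`) or a `provider:model` string (`'openai:gpt-4o-mini'`). The following is a minimal sketch of how such a string splits. `parse_provider_spec` is a hypothetical helper that illustrates the format only; how the SDK parses the argument internally is not documented here.

```python
def parse_provider_spec(spec):
    """Split 'provider' or 'provider:model' into (provider, model_or_None)."""
    # partition splits on the first colon only, so model names keep any later colons.
    provider, _, model = spec.partition(":")
    return provider, (model or None)


print(parse_provider_spec("openai:gpt-4o-mini"))
print(parse_provider_spec("local"))
```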

Environment Variables

| Variable | Description |
|----------|-------------|
| OPENAI_API_KEY | OpenAI API key |
| ANTHROPIC_API_KEY | Anthropic API key for Claude |
| GEMINI_API_KEY | Google AI API key for Gemini |
| MEMVID_MODELS_DIR | Local model cache directory |
| MEMVID_OFFLINE=1 | Skip model downloads (local NER) |
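
One way to use these variables is to pick a provider based on which key is present and fall back to local NER when none is set. The following is a minimal sketch: `pick_provider` and the preference order are assumptions for illustration, not SDK behavior. The returned string is meant to be passed to `get_entity_extractor`.

```python
import os

# Environment variable for each cloud provider, as listed in the table above.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "claude": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def pick_provider(env=None, preferred=("openai", "claude", "gemini")):
    """Hypothetical helper: first provider whose API key is set, else 'local'."""
    env = os.environ if env is None else env
    for provider in preferred:
        # Treat an empty string the same as an unset variable.
        if env.get(PROVIDER_KEYS[provider]):
            return provider
    return "local"


print(pick_provider({"ANTHROPIC_API_KEY": "your-key-here"}))
```

Keep in mind that falling back to `'local'` also means falling back to the fixed entity types (PERSON, ORG, LOCATION, MISC).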

Use Cases

Document Intelligence

Extract structured data from unstructured documents:
# Process legal contracts
ner = get_entity_extractor('claude', entity_types=[
    'PARTY', 'DATE', 'MONEY', 'TERM', 'JURISDICTION'
])

contract_text = "Agreement between Acme Corp and Beta Inc dated January 15, 2024..."
entities = ner.extract(contract_text)

# Build structured contract summary
parties = [e['name'] for e in entities if e['type'] == 'PARTY']
dates = [e['name'] for e in entities if e['type'] == 'DATE']

Knowledge Graph Building

Create entity-relationship graphs from documents:
# Extract entities from multiple documents
all_entities = []
for doc in documents:
    entities = ner.extract(doc.text)
    for e in entities:
        e['source_doc'] = doc.id
    all_entities.extend(entities)

# Build co-occurrence graph
from collections import defaultdict
co_occurrences = defaultdict(int)
for doc_id in set(e['source_doc'] for e in all_entities):
    doc_entities = [e for e in all_entities if e['source_doc'] == doc_id]
    for i, e1 in enumerate(doc_entities):
        for e2 in doc_entities[i+1:]:
            pair = tuple(sorted([e1['name'], e2['name']]))
            co_occurrences[pair] += 1
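
The nested loop above counts a pair once for every pair of mentions, so an entity mentioned twice in one document inflates its counts and can even pair with itself. A variant that counts each unordered pair at most once per document can be sketched with `itertools.combinations`. `build_cooccurrence` is a hypothetical helper, and the sample data is made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(entities):
    """Count, for each unordered pair of names, the documents mentioning both."""
    names_by_doc = defaultdict(set)
    for e in entities:
        # A set per document drops repeated mentions of the same name.
        names_by_doc[e["source_doc"]].add(e["name"])

    counts = Counter()
    for names in names_by_doc.values():
        # Sorting first makes each pair a canonical (a, b) tuple with a < b.
        counts.update(combinations(sorted(names), 2))
    return counts


sample = [
    {"name": "Microsoft", "type": "COMPANY", "source_doc": "d1"},
    {"name": "Seattle", "type": "LOCATION", "source_doc": "d1"},
    {"name": "Microsoft", "type": "COMPANY", "source_doc": "d1"},  # repeated mention
    {"name": "Microsoft", "type": "COMPANY", "source_doc": "d2"},
    {"name": "Seattle", "type": "LOCATION", "source_doc": "d2"},
    {"name": "Satya Nadella", "type": "PERSON", "source_doc": "d2"},
]

graph = build_cooccurrence(sample)
print(graph.most_common(1))
```

The resulting `Counter` can serve as a weighted edge list: each key is an edge between two entity names and each value is the number of documents in which they appear together.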

Entity-Based Search

Find documents by the entities they mention:
# Store entities with documents
for doc in documents:
    entities = ner.extract(doc.text)

    frame_id = mem.put(
        title=doc.title,
        label='document',
        metadata={
            'entities': entities,
            'companies': [e['name'] for e in entities if e['type'] == 'COMPANY'],
            'people': [e['name'] for e in entities if e['type'] == 'PERSON'],
        },
        text=doc.text,
    )

# Search by entity
results = mem.find('Microsoft', k=10)

Deal Memo Analysis

Extract structured deal information:
ner = get_entity_extractor('openai', entity_types=[
    'COMPANY', 'INVESTOR', 'MONEY', 'DEAL_TYPE', 'DATE', 'LOCATION'
])

deal_text = """
Series B Funding: Atlas Logistics
Atlas Logistics, headquartered in Seattle, announced a $50 million Series B round.
Lead investor Pinnacle Capital. Deal closes Q1 2025.
"""

entities = ner.extract(deal_text)
# Structured output:
# - COMPANY: Atlas Logistics
# - LOCATION: Seattle
# - MONEY: $50 million
# - DEAL_TYPE: Series B
# - INVESTOR: Pinnacle Capital
# - DATE: Q1 2025
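
To turn the flat entity list into a single deal record, each type can be mapped to a field. The following is a minimal sketch: `to_deal_record` and the choice of which types are single-valued are assumptions for illustration, and the confidence values in the sample are invented, since the example above does not show them.

```python
# Assumption: a deal memo has one value for these types; other types may repeat.
SINGLE_VALUED = {"COMPANY", "LOCATION", "MONEY", "DEAL_TYPE", "DATE"}


def to_deal_record(entities):
    """Hypothetical helper: build a dict keyed by lowercased entity type."""
    record = {}
    # Visit high-confidence entities first so they win single-valued fields.
    for e in sorted(entities, key=lambda e: -e["confidence"]):
        key = e["type"].lower()
        if e["type"] in SINGLE_VALUED:
            record.setdefault(key, e["name"])
        else:
            record.setdefault(key, []).append(e["name"])
    return record


# Entities from the example above; confidence values are illustrative only.
deal_entities = [
    {"name": "Atlas Logistics", "type": "COMPANY", "confidence": 0.96},
    {"name": "Seattle", "type": "LOCATION", "confidence": 0.91},
    {"name": "$50 million", "type": "MONEY", "confidence": 0.95},
    {"name": "Series B", "type": "DEAL_TYPE", "confidence": 0.93},
    {"name": "Pinnacle Capital", "type": "INVESTOR", "confidence": 0.90},
    {"name": "Q1 2025", "type": "DATE", "confidence": 0.87},
]

record = to_deal_record(deal_entities)
print(record["company"], record["investor"])
```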

Best Practices

  1. Choose appropriate entity types: Define types specific to your domain
  2. Set confidence thresholds: Use higher thresholds (0.7+) for critical applications
  3. Batch extraction: Use extract_batch() for multiple texts
  4. Cache results: Store extracted entities in document metadata
  5. Validate entities: Review extracted entities for accuracy in critical workflows
  6. Use local for privacy: Local NER processes data entirely on-device
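
For the caching practice, one option besides document metadata is an in-process cache keyed by a hash of the text, so repeated extraction of identical text does not trigger another API call. The following is a minimal sketch: `CachedExtractor` is a hypothetical wrapper, not an SDK class, and `_fake_extract` stands in for `ner.extract` so the example runs without an API key.

```python
import hashlib


class CachedExtractor:
    """Hypothetical wrapper that memoizes an extract(text, min_confidence) callable."""

    def __init__(self, extract_fn):
        self._extract = extract_fn
        self._cache = {}
        self.calls = 0  # number of times the wrapped function actually ran

    def extract(self, text, min_confidence=0.5):
        # Hash the text so long documents do not become large dict keys.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        key = (digest, min_confidence)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._extract(text, min_confidence)
        return self._cache[key]


def _fake_extract(text, min_confidence=0.5):
    """Stand-in for ner.extract so the sketch runs without an API key."""
    return [{"name": text.split()[0], "type": "COMPANY", "confidence": 0.9}]


cached = CachedExtractor(_fake_extract)
first = cached.extract("Microsoft announced a deal.")
second = cached.extract("Microsoft announced a deal.")  # served from the cache
print(cached.calls)
```

In real use the wrapper would be constructed as `CachedExtractor(ner.extract)`. The cache returns the same list object on every hit, so callers should copy it before mutating.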

Limitations

  • Local NER: Fixed entity types (PERSON, ORG, LOCATION, MISC)
  • Local NER: Available in the Python SDK; in Node.js it requires a native build that exports NerModel, otherwise use a cloud provider
  • Cloud providers: Require API keys and internet connection
  • Rate limits: Cloud providers have rate limits based on plan
  • Context length: Very long texts may need chunking
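
For the context-length limitation, a simple approach is to split long text into overlapping windows and run extraction on each. The following is a minimal sketch: `chunk_text` is a hypothetical helper, and the default sizes (4000 characters with a 200-character overlap) are arbitrary assumptions, not provider limits.

```python
def chunk_text(text, max_chars=4000, overlap=200):
    """Split text into character windows that overlap by `overlap` characters."""
    if max_chars <= overlap:
        raise ValueError("max_chars must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        # Stop once the current window reaches the end of the text.
        if start + max_chars >= len(text):
            break
        start += max_chars - overlap
    return chunks


chunks = chunk_text("abcdefghij", max_chars=4, overlap=1)
print(chunks)
```

The overlap reduces the chance of cutting an entity in half at a boundary, but it also means an entity inside the overlap can be extracted twice, so results from adjacent chunks should be deduplicated.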

Next Steps