Memvid supports Named Entity Recognition (NER) for extracting structured entities from documents. This enables building knowledge graphs, entity-based search, and relationship mapping, turning unstructured text into structured knowledge.
Overview
Entity extraction identifies and classifies named entities in text:
- People: Names of individuals (CEO, executives, authors)
- Organizations: Companies, institutions, agencies
- Locations: Cities, countries, addresses
- Dates: Temporal references, deadlines, events
- Money: Currency amounts, valuations, prices
- Custom types: Domain-specific entities (products, deals, regulations)
| Provider | Model | Entity Types | Best For |
|---|---|---|---|
| Local | DistilBERT-NER | PERSON, ORG, LOCATION, MISC | Offline, privacy-first |
| OpenAI | GPT-4o-mini | Custom | High accuracy, custom entities |
| OpenAI | GPT-4o | Custom | Best quality |
| Claude | Claude 3.5 Sonnet | Custom | Nuanced extraction |
| Gemini | Gemini 2.0 Flash | Custom | Fast, cost-effective |
Quick Start
Python SDK
from memvid_sdk import create
from memvid_sdk.entities import get_entity_extractor
# Initialize entity extractor
ner = get_entity_extractor('openai', entity_types=['COMPANY', 'PERSON', 'LOCATION', 'MONEY', 'DATE'])
print(f"Provider: {ner.name}")
print(f"Entity types: {ner.entity_types}")
# Extract entities from text
text = """
Microsoft CEO Satya Nadella announced a $50 million investment in Seattle.
The deal closes December 2024 with Pinnacle Financial as lead investor.
"""
entities = ner.extract(text, min_confidence=0.5)
for entity in entities:
    print(f" {entity['name']} ({entity['type']}, {entity['confidence']:.2f})")
# Output:
# Microsoft (COMPANY, 0.95)
# Satya Nadella (PERSON, 0.97)
# $50 million (MONEY, 0.95)
# Seattle (LOCATION, 0.90)
# December 2024 (DATE, 0.88)
# Pinnacle Financial (COMPANY, 0.92)
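Once extracted, entities are often easier to work with when grouped by type. The following is a minimal sketch that assumes only the entity dictionary shape shown in the output above (`name`, `type`, `confidence`). `group_by_type` is a hypothetical helper, not part of the SDK, and the sample list is hard-coded so the snippet runs without an API key.

```python
from collections import defaultdict

def group_by_type(entities):
    """Group entity names by their type, preserving extraction order."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["type"]].append(entity["name"])
    return dict(grouped)

# Hard-coded sample mirroring part of the illustrative output above.
sample = [
    {"name": "Microsoft", "type": "COMPANY", "confidence": 0.95},
    {"name": "Satya Nadella", "type": "PERSON", "confidence": 0.97},
    {"name": "Pinnacle Financial", "type": "COMPANY", "confidence": 0.92},
]

for entity_type, names in group_by_type(sample).items():
    print(f"{entity_type}: {', '.join(names)}")
```

In a real pipeline, the list returned by `ner.extract(...)` would take the place of `sample`.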
Node.js SDK
import { create, getEntityExtractor } from '@memvid/sdk';
// Initialize entity extractor
const ner = getEntityExtractor('openai', {
  entityTypes: ['COMPANY', 'PERSON', 'LOCATION', 'MONEY', 'DATE'],
});
console.log(`Provider: ${ner.name}`);
console.log(`Entity types: ${ner.entityTypes}`);
// Extract entities from text
const text = `
Microsoft CEO Satya Nadella announced a $50 million investment in Seattle.
The deal closes December 2024 with Pinnacle Financial as lead investor.
`;
const entities = await ner.extract(text, 0.5);
for (const entity of entities) {
  console.log(` ${entity.name} (${entity.type}, ${entity.confidence.toFixed(2)})`);
}
Providers
Local NER (DistilBERT)
The default provider uses DistilBERT-NER, a lightweight model for offline entity extraction.
Characteristics:
- Model: DistilBERT fine-tuned on CoNLL-03
- Size: ~261 MB (downloaded on first use)
- Entity types: PERSON, ORG, LOCATION, MISC (fixed)
- Inference: CPU-based, no GPU required
- Privacy: All processing happens locally
from memvid_sdk.entities import get_entity_extractor, LocalNER
# Using factory
ner = get_entity_extractor('local')
# Or direct instantiation
ner = LocalNER(model='distilbert-ner')
# Extract entities
entities = ner.extract("Apple CEO Tim Cook visited Paris headquarters.")
# [
# {'name': 'Apple', 'type': 'ORG', 'confidence': 0.98},
# {'name': 'Tim Cook', 'type': 'PERSON', 'confidence': 0.97},
# {'name': 'Paris', 'type': 'LOCATION', 'confidence': 0.95},
# ]
import { getEntityExtractor, LocalNER } from '@memvid/sdk';
const ner = getEntityExtractor('local');
const entities = await ner.extract('Apple CEO Tim Cook visited Paris headquarters.');
Local NER uses fixed entity types (PERSON, ORG, LOCATION, MISC). For custom entity types, use cloud providers. In Node.js, LocalNER requires a native build that exports NerModel (the prebuilt npm binaries may not include it).
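Because the local labels are fixed, pipelines that mix local and cloud results may want to relabel them into one vocabulary. The sketch below shows one way to do that on plain entity dictionaries. The mapping (`ORG` to `COMPANY`) and the helper `normalize_types` are illustrative assumptions, not SDK features.

```python
# Hypothetical mapping from the fixed local labels to labels used elsewhere in a pipeline.
LOCAL_TO_CUSTOM = {"ORG": "COMPANY"}

def normalize_types(entities, mapping=LOCAL_TO_CUSTOM):
    """Return copies of the entities with their type relabeled via `mapping`.

    Types that are not in the mapping are left unchanged.
    """
    return [{**e, "type": mapping.get(e["type"], e["type"])} for e in entities]

# Hard-coded sample in the shape of the local NER output shown above.
local_result = [
    {"name": "Apple", "type": "ORG", "confidence": 0.98},
    {"name": "Tim Cook", "type": "PERSON", "confidence": 0.97},
]
normalized = normalize_types(local_result)
print([e["type"] for e in normalized])
```

The helper returns new dictionaries, so the original extraction result stays untouched for auditing.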
OpenAI Entities
OpenAI’s models provide high-accuracy extraction with custom entity types.
Setup:
export OPENAI_API_KEY=sk-your-key-here
Usage:
from memvid_sdk.entities import get_entity_extractor, OpenAIEntities
# Using factory with custom entity types
ner = get_entity_extractor('openai', entity_types=[
    'COMPANY',
    'PERSON',
    'LOCATION',
    'MONEY',
    'DATE',
    'PRODUCT',
    'DEAL_TYPE',
])
# Or with specific model
ner = get_entity_extractor('openai:gpt-4o-mini', entity_types=['COMPANY', 'PERSON'])
# Direct instantiation
ner = OpenAIEntities(
    model='gpt-4o-mini',
    entity_types=['COMPANY', 'EXECUTIVE', 'PRODUCT'],
)
# Extract entities
entities = ner.extract(text, min_confidence=0.5)
import { getEntityExtractor, OpenAIEntities } from '@memvid/sdk';
const ner = getEntityExtractor('openai', {
  entityTypes: ['COMPANY', 'PERSON', 'LOCATION', 'MONEY', 'DATE'],
});
// Or with specific model (separate name, since `const` cannot be redeclared)
const nerMini = getEntityExtractor('openai:gpt-4o-mini', {
  entityTypes: ['COMPANY', 'PERSON'],
});
const entities = await ner.extract(text, 0.5);
Model Comparison:
| Model | Speed | Quality |
|---|---|---|
| gpt-4o-mini | Fast | Good |
| gpt-4o | Medium | Best |
| gpt-4-turbo | Medium | Excellent |
Claude Entities
Anthropic’s Claude excels at nuanced entity extraction with context understanding.
Setup:
export ANTHROPIC_API_KEY=your-key-here
Usage:
from memvid_sdk.entities import get_entity_extractor, ClaudeEntities
# Using factory
ner = get_entity_extractor('claude', entity_types=['COMPANY', 'PERSON', 'REGULATION'])
# With specific model
ner = get_entity_extractor('claude:claude-3-5-sonnet-20241022', entity_types=['COMPANY'])
# Direct instantiation
ner = ClaudeEntities(
    model='claude-3-5-sonnet-20241022',
    entity_types=['COMPANY', 'EXECUTIVE', 'DEAL'],
)
entities = ner.extract(text, min_confidence=0.6)
import { getEntityExtractor, ClaudeEntities } from '@memvid/sdk';
const ner = getEntityExtractor('claude', {
  entityTypes: ['COMPANY', 'PERSON', 'REGULATION'],
});
const entities = await ner.extract(text, 0.6);
Gemini Entities
Google’s Gemini provides fast, cost-effective entity extraction.
Setup:
export GEMINI_API_KEY=your-key-here
Usage:
from memvid_sdk.entities import get_entity_extractor, GeminiEntities
# Using factory
ner = get_entity_extractor('gemini', entity_types=['COMPANY', 'PERSON'])
# With specific model
ner = get_entity_extractor('gemini:gemini-2.0-flash', entity_types=['COMPANY'])
entities = ner.extract(text, min_confidence=0.5)
import { getEntityExtractor, GeminiEntities } from '@memvid/sdk';
const ner = getEntityExtractor('gemini', {
  entityTypes: ['COMPANY', 'PERSON'],
});
const entities = await ner.extract(text, 0.5);
Complete Example
Here’s a full workflow for document entity extraction:
from pathlib import Path
from memvid_sdk import create
from memvid_sdk.entities import get_entity_extractor
# Configuration
PROVIDER = 'openai'
ENTITY_TYPES = ['COMPANY', 'PERSON', 'LOCATION', 'MONEY', 'DATE', 'DEAL_TYPE']
DATASET_DIR = Path('documents/')
OUTPUT_PATH = 'knowledge_base.mv2'
# Initialize
ner = get_entity_extractor(PROVIDER, entity_types=ENTITY_TYPES)
print(f"Entity Extractor: {ner.name}")
print(f"Entity Types: {', '.join(ner.entity_types)}")
# Create memory
if Path(OUTPUT_PATH).exists():
    Path(OUTPUT_PATH).unlink()
mem = create(OUTPUT_PATH)
mem.enable_lex()
# Process documents
all_entities = []
pdf_files = list(DATASET_DIR.glob('*.pdf'))
for i, pdf_path in enumerate(pdf_files):
    print(f"\n[{i+1}/{len(pdf_files)}] {pdf_path.name}")
    # Store document
    frame_id = mem.put(
        title=pdf_path.stem.replace('_', ' ').title(),
        label='document',
        metadata={},
        file=str(pdf_path),
    )
    print(f" Stored as frame {frame_id}")
    # Extract entities (from document text or summary)
    document_text = f"Document: {pdf_path.stem}" # Replace with actual text extraction
    entities = ner.extract(document_text, min_confidence=0.5)
    print(f" Found {len(entities)} entities:")
    for e in entities[:4]:
        print(f" - {e['name']} ({e['type']}, {e['confidence']:.2f})")
    all_entities.extend(entities)
# Entity statistics
print("\n--- Entity Summary ---")
counts = {}
for e in all_entities:
    t = e.get('type', 'UNKNOWN')
    counts[t] = counts.get(t, 0) + 1
for entity_type, count in sorted(counts.items(), key=lambda x: -x[1]):
    print(f" {entity_type}: {count}")
# Seal
mem.seal()
stats = mem.stats()
print(f"\nFinal: {stats.get('frame_count', 0)} frames, {len(all_entities)} entities")
Custom Entity Types
Cloud providers support custom entity types tailored to your domain:
Finance Domain
ner = get_entity_extractor('openai', entity_types=[
    'COMPANY',
    'INVESTOR',
    'FUND',
    'MONEY',
    'DEAL_TYPE', # IPO, M&A, Series A
    'VALUATION',
    'EXECUTIVE',
    'DATE',
])
Legal Domain
ner = get_entity_extractor('claude', entity_types=[
    'PARTY',
    'COURT',
    'JUDGE',
    'CASE_NUMBER',
    'STATUTE',
    'DATE',
    'JURISDICTION',
])
Healthcare Domain
ner = get_entity_extractor('openai:gpt-4o', entity_types=[
    'PATIENT',
    'PROVIDER',
    'MEDICATION',
    'DIAGNOSIS',
    'PROCEDURE',
    'DATE',
    'FACILITY',
])
API Reference
All entity extractors implement this interface:
| Member | Description |
|---|---|
| name | Provider identifier (e.g., openai:gpt-4o-mini) |
| entity_types | List of supported entity types |
| extract(text, min_confidence) | Extract entities from text |
| extract_batch(texts, min_confidence) | Batch extract from multiple texts |
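Code that should work with any provider can be written against this shared interface. The sketch below expresses it as a `typing.Protocol` and adds a toy extractor that satisfies it, which is handy for tests that must run offline. The default value of `min_confidence`, the return type of `extract_batch` (one list per input text), and the class `KeywordExtractor` are assumptions made for illustration, not the SDK's definitions.

```python
from typing import Dict, List, Protocol

class EntityExtractor(Protocol):
    """Structural type mirroring the interface table (a sketch, not the SDK's own class)."""
    name: str
    entity_types: List[str]

    def extract(self, text: str, min_confidence: float = 0.5) -> List[dict]: ...
    def extract_batch(self, texts: List[str], min_confidence: float = 0.5) -> List[List[dict]]: ...


class KeywordExtractor:
    """Toy stand-in that tags known keywords by substring match."""

    def __init__(self, keywords: Dict[str, str]):
        self.name = "toy:keyword"
        self.keywords = keywords
        self.entity_types = sorted(set(keywords.values()))

    def extract(self, text, min_confidence=0.5):
        # Every keyword hit gets confidence 1.0, so the threshold only matters above 1.0.
        found = [
            {"name": word, "type": label, "confidence": 1.0}
            for word, label in self.keywords.items()
            if word in text
        ]
        return [e for e in found if e["confidence"] >= min_confidence]

    def extract_batch(self, texts, min_confidence=0.5):
        return [self.extract(t, min_confidence) for t in texts]


def count_entities(extractor: EntityExtractor, texts: List[str]) -> int:
    """Works with any object that provides the members listed in the table."""
    return sum(len(batch) for batch in extractor.extract_batch(texts))


toy = KeywordExtractor({"Acme Corp": "COMPANY", "Paris": "LOCATION"})
print(count_entities(toy, ["Acme Corp opened in Paris.", "No entities here."]))
```

Typing against the protocol rather than a concrete class keeps the calling code independent of which provider is configured.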
Entity Object
Each extracted entity contains:
| Field | Type | Description |
|---|---|---|
| name | string | Entity text as it appears |
| type | string | Entity classification |
| confidence | float | Confidence score (0.0-1.0) |
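For static checking or defensive parsing, the three fields can be captured in a `TypedDict` with a small validator. This is a sketch based only on the field table; the names `Entity` and `is_valid_entity` are hypothetical and do not come from the SDK.

```python
from typing import TypedDict

class Entity(TypedDict):
    """Shape of one extracted entity, as described in the field table."""
    name: str
    type: str
    confidence: float

def is_valid_entity(obj) -> bool:
    """Check that obj has the three documented fields with plausible values."""
    return (
        isinstance(obj, dict)
        and isinstance(obj.get("name"), str)
        and isinstance(obj.get("type"), str)
        and isinstance(obj.get("confidence"), (int, float))
        and 0.0 <= obj["confidence"] <= 1.0
    )

good: Entity = {"name": "Seattle", "type": "LOCATION", "confidence": 0.90}
bad = {"name": "Seattle", "confidence": 1.7}  # missing type, score out of range
print(is_valid_entity(good), is_valid_entity(bad))
```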
Factory Function
# Python
from memvid_sdk.entities import get_entity_extractor
ner = get_entity_extractor(
    provider, # 'local', 'openai', 'claude', 'gemini', 'openai:model-name'
    entity_types=None, # Custom entity types (cloud providers only)
    api_key=None, # Override env var
)
// Node.js
import { getEntityExtractor } from '@memvid/sdk';
const ner = getEntityExtractor(provider, {
  entityTypes: ['COMPANY', 'PERSON'], // Custom entity types
  apiKey: undefined, // Override env var
});
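The provider argument accepts either a bare provider name or a `provider:model` string. The sketch below only illustrates that string format, for example to validate configuration before calling the factory. `parse_provider_spec` is a hypothetical helper, and how the SDK itself parses the string is not shown in this document.

```python
from typing import Optional, Tuple

def parse_provider_spec(spec: str) -> Tuple[str, Optional[str]]:
    """Split 'provider' or 'provider:model' into (provider, model or None)."""
    provider, _, model = spec.partition(":")
    return provider, (model or None)

print(parse_provider_spec("openai"))
print(parse_provider_spec("claude:claude-3-5-sonnet-20241022"))
```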
Environment Variables
| Variable | Description |
|---|---|
| OPENAI_API_KEY | OpenAI API key |
| ANTHROPIC_API_KEY | Anthropic API key for Claude |
| GEMINI_API_KEY | Google AI API key for Gemini |
| MEMVID_MODELS_DIR | Local model cache directory |
| MEMVID_OFFLINE=1 | Skip model downloads (local NER) |
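A script can inspect these variables up front to decide which provider to request. The following sketch returns the first preferred cloud provider whose key is set. Falling back to `'local'` is a design assumption made here, and `pick_provider` is a hypothetical helper rather than an SDK function.

```python
import os

# Environment variable names as listed in the table above.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "claude": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
}

def pick_provider(preferred, env=None):
    """Return the first preferred provider whose API key is set, else 'local'."""
    env = os.environ if env is None else env
    for provider in preferred:
        if env.get(PROVIDER_ENV_VARS.get(provider, "")):
            return provider
    return "local"

# Result depends on which keys are exported in the current shell.
print(pick_provider(["openai", "claude", "gemini"]))
```

The returned string can then be passed as the provider argument of the factory function.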
Use Cases
Document Intelligence
Extract structured data from unstructured documents:
# Process legal contracts
ner = get_entity_extractor('claude', entity_types=[
    'PARTY', 'DATE', 'MONEY', 'TERM', 'JURISDICTION'
])
contract_text = "Agreement between Acme Corp and Beta Inc dated January 15, 2024..."
entities = ner.extract(contract_text)
# Build structured contract summary
parties = [e['name'] for e in entities if e['type'] == 'PARTY']
dates = [e['name'] for e in entities if e['type'] == 'DATE']
Knowledge Graph Building
Create entity-relationship graphs from documents:
# Extract entities from multiple documents
all_entities = []
for doc in documents:
    entities = ner.extract(doc.text)
    for e in entities:
        e['source_doc'] = doc.id
    all_entities.extend(entities)
# Build co-occurrence graph
from collections import defaultdict
co_occurrences = defaultdict(int)
for doc_id in set(e['source_doc'] for e in all_entities):
    doc_entities = [e for e in all_entities if e['source_doc'] == doc_id]
    for i, e1 in enumerate(doc_entities):
        for e2 in doc_entities[i+1:]:
            pair = tuple(sorted([e1['name'], e2['name']]))
            co_occurrences[pair] += 1
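The loop above counts one co-occurrence per pair of mentions, so an entity mentioned twice in a document is paired with itself and inflates its other pairs. A variant that counts each pair of distinct names once per document can be sketched with `Counter` and `itertools.combinations`. The helper name and sample data are illustrative, and the entities are assumed to carry the `source_doc` key added above.

```python
from collections import Counter
from itertools import combinations

def co_occurrence_counts(entities):
    """Count how many documents each pair of distinct entity names shares."""
    names_by_doc = {}
    for e in entities:
        names_by_doc.setdefault(e["source_doc"], set()).add(e["name"])
    counts = Counter()
    for names in names_by_doc.values():
        # Sorting gives each pair a canonical order, e.g. ('Acme Corp', 'Beta Inc').
        counts.update(combinations(sorted(names), 2))
    return counts

sample = [
    {"name": "Acme Corp", "type": "COMPANY", "source_doc": "d1"},
    {"name": "Beta Inc", "type": "COMPANY", "source_doc": "d1"},
    {"name": "Acme Corp", "type": "COMPANY", "source_doc": "d1"},  # repeated mention
    {"name": "Acme Corp", "type": "COMPANY", "source_doc": "d2"},
    {"name": "Beta Inc", "type": "COMPANY", "source_doc": "d2"},
    {"name": "Seattle", "type": "LOCATION", "source_doc": "d2"},
]
counts = co_occurrence_counts(sample)
for pair, count in counts.most_common(3):
    print(pair, count)
```

The resulting counter can serve as a weighted edge list for a graph library of your choice.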
Entity-Based Search
Find documents by entity type:
# Store entities with documents
for doc in documents:
    entities = ner.extract(doc.text)
    frame_id = mem.put(
        title=doc.title,
        label='document',
        metadata={
            'entities': entities,
            'companies': [e['name'] for e in entities if e['type'] == 'COMPANY'],
            'people': [e['name'] for e in entities if e['type'] == 'PERSON'],
        },
        text=doc.text,
    )
# Search by entity
results = mem.find('Microsoft', k=10)
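To look documents up strictly by entity type and name, independent of the memory's own search, an in-process inverted index over the extracted entities is one option. The sketch below assumes entities have already been extracted per document. `build_entity_index` and the document ids are hypothetical.

```python
from collections import defaultdict

def build_entity_index(doc_entities):
    """Map (type, lowercased name) -> set of document ids containing that entity."""
    index = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for e in entities:
            index[(e["type"], e["name"].lower())].add(doc_id)
    return index

# Hard-coded sample: document id -> entities extracted from that document.
doc_entities = {
    "memo-1": [{"name": "Microsoft", "type": "COMPANY", "confidence": 0.95}],
    "memo-2": [
        {"name": "Microsoft", "type": "COMPANY", "confidence": 0.91},
        {"name": "Satya Nadella", "type": "PERSON", "confidence": 0.97},
    ],
}
index = build_entity_index(doc_entities)
print(sorted(index[("COMPANY", "microsoft")]))
```

Keying on the type as well as the name keeps, for example, a company and a person with the same name apart.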
Deal Memo Analysis
Extract structured deal information:
ner = get_entity_extractor('openai', entity_types=[
    'COMPANY', 'INVESTOR', 'MONEY', 'DEAL_TYPE', 'DATE', 'LOCATION'
])
deal_text = """
Series B Funding: Atlas Logistics
Atlas Logistics, headquartered in Seattle, announced a $50 million Series B round.
Lead investor Pinnacle Capital. Deal closes Q1 2025.
"""
entities = ner.extract(deal_text)
# Structured output:
# - COMPANY: Atlas Logistics
# - LOCATION: Seattle
# - MONEY: $50 million
# - DEAL_TYPE: Series B
# - INVESTOR: Pinnacle Capital
# - DATE: Q1 2025
Best Practices
- Choose appropriate entity types: Define types specific to your domain
- Set confidence thresholds: Use higher thresholds (0.7+) for critical applications
- Batch extraction: Use extract_batch() for multiple texts
- Cache results: Store extracted entities in document metadata
- Validate entities: Review extracted entities for accuracy in critical workflows
- Use local for privacy: Local NER processes data entirely on-device
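The caching advice can also be applied in memory, so that repeated calls on identical text do not trigger another provider request. The sketch below wraps any `extract(text, min_confidence)` callable and keys the cache on a SHA-256 hash of the text. `CachedExtractor` is a hypothetical wrapper, and `fake_extract` stands in for `ner.extract` so the example runs offline.

```python
import hashlib

class CachedExtractor:
    """Wrap any extract(text, min_confidence) callable with an in-memory cache."""

    def __init__(self, extract_fn):
        self._extract_fn = extract_fn
        self._cache = {}
        self.misses = 0  # number of calls that reached the wrapped function

    def extract(self, text, min_confidence=0.5):
        key = (hashlib.sha256(text.encode("utf-8")).hexdigest(), min_confidence)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._extract_fn(text, min_confidence)
        return self._cache[key]

# Stand-in for ner.extract: tags the first word of the text.
def fake_extract(text, min_confidence):
    return [{"name": text.split()[0], "type": "MISC", "confidence": 0.9}]

cached = CachedExtractor(fake_extract)
cached.extract("Apple opened a store.")
cached.extract("Apple opened a store.")  # served from the cache
print(cached.misses)
```

Including the threshold in the key prevents results filtered at one confidence level from being returned for another.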
Limitations
- Local NER: Fixed entity types (PERSON, ORG, LOCATION, MISC)
- Local NER: In Node.js, requires a native build that exports NerModel (the prebuilt npm binaries may not include it); otherwise use a cloud provider
- Cloud providers: Require API keys and internet connection
- Rate limits: Cloud providers have rate limits based on plan
- Context length: Very long texts may need chunking
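For the context-length limitation, one common approach is to split long text into overlapping windows, extract from each window, and merge duplicates. The sketch below uses character-based windows. The window sizes and the helpers `chunk_text` and `merge_entities` are illustrative assumptions, and a real pipeline might prefer sentence or token boundaries.

```python
def chunk_text(text, max_chars=2000, overlap=200):
    """Split text into character windows that overlap by `overlap` characters."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    step = max_chars - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + max_chars] for i in range(0, max(len(text) - overlap, 1), step)]

def merge_entities(per_chunk_results):
    """Deduplicate by (name, type), keeping the highest confidence seen."""
    best = {}
    for entities in per_chunk_results:
        for e in entities:
            key = (e["name"], e["type"])
            if key not in best or e["confidence"] > best[key]["confidence"]:
                best[key] = e
    return list(best.values())

chunks = chunk_text("x" * 5000, max_chars=2000, overlap=200)
print(len(chunks))

# Hard-coded per-chunk results standing in for ner.extract(chunk) calls.
merged = merge_entities([
    [{"name": "Acme Corp", "type": "COMPANY", "confidence": 0.80}],
    [{"name": "Acme Corp", "type": "COMPANY", "confidence": 0.93}],
])
print(merged)
```

The overlap reduces the chance that an entity spanning a window boundary is cut in half, at the cost of some duplicate extraction that the merge step removes.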