Memvid extracts structured tables from PDFs, making tabular data searchable and exportable. Tables are detected automatically using multiple extraction methods, with quality scoring to ensure accurate results.

How It Works

Key features:
  • Multiple detection methods - Stream, Lattice, LineBased
  • Quality scoring - Filter low-confidence extractions
  • Row embedding - Make individual rows semantically searchable
  • Export formats - CSV, JSON, or view inline

Extraction Methods

Memvid tries multiple methods and uses the best result:
| Method | Description | Best For |
| --- | --- | --- |
| Stream | Text position analysis | Borderless tables, text layouts |
| Lattice | Grid line detection | Tables with visible borders |
| LineBased | Row/column inference | Mixed formatting |
The extractor automatically selects the method with the highest quality score.
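
The selection rule above can be pictured with a short sketch. The `Candidate` class and the scores are hypothetical and only illustrate "keep the highest-scoring attempt"; they are not Memvid's internal types.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical container for one method's extraction attempt."""
    method: str     # "Stream", "Lattice", or "LineBased"
    quality: float  # assumed to be a score between 0.0 and 1.0
    rows: list


def pick_best(candidates):
    """Return the candidate with the highest quality score, or None if there are none."""
    return max(candidates, key=lambda c: c.quality, default=None)


# Made-up scores for three attempts on the same table region
candidates = [
    Candidate("Stream", 0.62, [["a", "b"]]),
    Candidate("Lattice", 0.91, [["a", "b"]]),
    Candidate("LineBased", 0.74, [["a", "b"]]),
]
best = pick_best(candidates)
print(best.method)  # Lattice
```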

CLI Usage

Basic Table Extraction

# Extract tables from PDF
memvid put memory.mv2 --input report.pdf --tables

# Extract and embed rows for semantic search
memvid put memory.mv2 --input financial.pdf --tables --embed-rows

Extraction Modes

Control extraction aggressiveness:
# Conservative - high confidence only
memvid tables import memory.mv2 --input report.pdf --mode conservative

# Aggressive - extract everything possible
memvid tables import memory.mv2 --input messy.pdf --mode aggressive

# Default - balanced approach
memvid tables import memory.mv2 --input report.pdf
| Mode | Description | Use Case |
| --- | --- | --- |
| conservative | High confidence only | Clean, formal documents |
| balanced | Default behavior | General purpose |
| aggressive | Extract everything | Messy/scanned documents |

Quality Filters

Filter by table quality:
# Only high-quality tables
memvid tables import memory.mv2 --input report.pdf --min-quality high

# Include medium quality
memvid tables import memory.mv2 --input report.pdf --min-quality medium

# Accept all (including low quality)
memvid tables import memory.mv2 --input report.pdf --min-quality low

Size Filters

Filter by table dimensions:
# Minimum 3 rows and 2 columns
memvid tables import memory.mv2 --input report.pdf --min-rows 3 --min-cols 2

# Skip single-row headers
memvid tables import memory.mv2 --input report.pdf --min-rows 2
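
When the mode, quality, and size filters are combined from a script, it can help to assemble the argument list in one place. The helper below is a hypothetical convenience wrapper, not part of Memvid; it uses only the flags shown above and emits no `--mode` flag for the default `balanced` mode.

```python
import shlex

MODES = ("conservative", "balanced", "aggressive")
QUALITIES = ("low", "medium", "high")


def build_import_command(memory, pdf, mode="balanced", min_quality=None,
                         min_rows=None, min_cols=None):
    """Assemble argv for `memvid tables import` from the documented options."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if min_quality is not None and min_quality not in QUALITIES:
        raise ValueError(f"unknown quality: {min_quality!r}")
    cmd = ["memvid", "tables", "import", memory, "--input", pdf]
    if mode != "balanced":  # balanced is the default, so the flag is omitted
        cmd += ["--mode", mode]
    if min_quality is not None:
        cmd += ["--min-quality", min_quality]
    if min_rows is not None:
        cmd += ["--min-rows", str(min_rows)]
    if min_cols is not None:
        cmd += ["--min-cols", str(min_cols)]
    return cmd


cmd = build_import_command("memory.mv2", "report.pdf", mode="conservative",
                           min_quality="high", min_rows=3, min_cols=2)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with file names that contain spaces.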

Managing Tables

List Tables

# List all tables in memory
memvid tables list memory.mv2

# Output:
# Found 5 tables:
#   - pdf_table_1_page1: 12 rows x 4 cols (Stream) [high]
#   - pdf_table_2_page1: 8 rows x 3 cols (Lattice) [high]
#   - pdf_table_3_page2: 5 rows x 6 cols (LineBased) [medium]
#   - pdf_table_4_page3: 20 rows x 2 cols (Stream) [high]
#   - pdf_table_5_page5: 3 rows x 4 cols (Lattice) [low]

# JSON output for scripting
memvid tables list memory.mv2 --json

View Table

# View table contents
memvid tables view memory.mv2 --table-id pdf_table_1_page1

# Output:
# ┌──────────────┬──────────┬──────────┬──────────┐
# │ Product      │ Qty      │ Price    │ Total    │
# ├──────────────┼──────────┼──────────┼──────────┤
# │ Widget A     │ 10       │ $5.00    │ $50.00   │
# │ Widget B     │ 5        │ $10.00   │ $50.00   │
# │ Widget C     │ 2        │ $25.00   │ $50.00   │
# └──────────────┴──────────┴──────────┴──────────┘

# JSON output
memvid tables view memory.mv2 --table-id pdf_table_1_page1 --json

Export Table

# Export to CSV
memvid tables export memory.mv2 --table-id pdf_table_1_page1 --format csv --out data.csv

# Export to JSON (array of arrays)
memvid tables export memory.mv2 --table-id pdf_table_1_page1 --format json --out data.json

# Export to JSON (array of objects/records)
memvid tables export memory.mv2 --table-id pdf_table_1_page1 --format json --as-records --out data.json

# Export all tables
memvid tables export memory.mv2 --all --format csv --out-dir ./tables/
JSON array format:
[
  ["Product", "Qty", "Price", "Total"],
  ["Widget A", "10", "$5.00", "$50.00"],
  ["Widget B", "5", "$10.00", "$50.00"]
]
JSON records format (--as-records):
[
  {"Product": "Widget A", "Qty": "10", "Price": "$5.00", "Total": "$50.00"},
  {"Product": "Widget B", "Qty": "5", "Price": "$10.00", "Total": "$50.00"}
]
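
The two JSON shapes carry the same data. As a minimal sketch, the array format can be turned into records by treating the first row as the header. Cell values are strings in both examples above, so numeric work needs an explicit conversion; the `parse_money` helper is an illustrative assumption about the currency format.

```python
def rows_to_records(rows):
    """Convert [[header...], [row...], ...] into a list of dicts keyed by header."""
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def parse_money(text):
    """Turn a string such as "$1,234.50" into a float (assumes US-style formatting)."""
    return float(text.replace("$", "").replace(",", ""))


rows = [
    ["Product", "Qty", "Price", "Total"],
    ["Widget A", "10", "$5.00", "$50.00"],
    ["Widget B", "5", "$10.00", "$50.00"],
]
records = rows_to_records(rows)
grand_total = sum(parse_money(r["Total"]) for r in records)
print(records[0]["Product"], grand_total)  # Widget A 100.0
```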

Searching Table Data

Row Embedding

When --embed-rows is enabled (default), individual table rows are embedded for semantic search:
# Ingest with row embedding
memvid put memory.mv2 --input financial.pdf --tables --embed-rows

# Search finds relevant rows
memvid find memory.mv2 --query "Q4 revenue"

# Results include table row matches:
# [0.89] Row from pdf_table_2_page3: "Q4 2024 | Revenue | $1,234,567"

Searching Table Content

# Search across all content including tables
memvid find memory.mv2 --query "total sales"

# Filter to table content only
memvid find memory.mv2 --query "total sales" --scope "table:"

Use Cases

Invoice Processing

# Create invoice memory
memvid create invoices.mv2

# Ingest invoices with table extraction
memvid put invoices.mv2 --input ./invoices/ --tables --embed-rows

# Find specific line items
memvid find invoices.mv2 --query "shipping charges"

# List invoice tables, then export one by ID (inv_001_table1 is a placeholder)
memvid tables list invoices.mv2
memvid tables export invoices.mv2 --table-id inv_001_table1 --format csv --out line_items.csv

Financial Reports

# Ingest quarterly reports
memvid put finance.mv2 --input quarterly-reports/ --tables

# Search for metrics
memvid find finance.mv2 --query "EBITDA margin"

# Export data for analysis
memvid tables export finance.mv2 --all --format csv --out-dir ./financial-data/

Research Papers

# Extract data tables from papers
memvid put research.mv2 --input papers/ --tables --min-quality medium

# Find experimental results
memvid find research.mv2 --query "p-value significance"

# Export for meta-analysis
memvid tables export research.mv2 --table-id paper_xyz_table3 --format json --as-records

Payroll/HR Documents

# Process pay stubs
memvid put payroll.mv2 --input paystubs/ --tables --mode conservative

# Search for deductions
memvid find payroll.mv2 --query "401k contribution"

# Export earnings data
memvid tables export payroll.mv2 --table-id stub_jan_table1 --format csv

Quality Scoring

Each extracted table receives a quality score based on:
| Factor | Description |
| --- | --- |
| Structure consistency | Regular row/column counts |
| Cell alignment | Properly aligned content |
| Header detection | Clear header row identified |
| Empty cells | Low percentage of empty cells |
| Content coherence | Related data in columns |

Quality levels:

| Level | Score | Description |
| --- | --- | --- |
| high | 0.8 - 1.0 | Reliable, well-structured |
| medium | 0.5 - 0.8 | Usable, may need review |
| low | 0.0 - 0.5 | Possible extraction errors |
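
The score-to-level mapping can be sketched as follows. The listed ranges share their endpoints (0.5 and 0.8), so this sketch assumes the lower bound of each range is inclusive; that choice is an assumption, not documented behavior.

```python
LEVEL_ORDER = ["low", "medium", "high"]


def quality_level(score):
    """Map a score in [0.0, 1.0] to a level, treating lower bounds as inclusive."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"


def meets_min_quality(score, minimum):
    """Mimic a --min-quality style filter: keep tables at or above `minimum`."""
    return LEVEL_ORDER.index(quality_level(score)) >= LEVEL_ORDER.index(minimum)


print(quality_level(0.85), quality_level(0.5), quality_level(0.49))  # high medium low
print(meets_min_quality(0.6, "medium"), meets_min_quality(0.6, "high"))  # True False
```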

Handling Edge Cases

Merged Cells

Merged cells are expanded to fill all covered positions:
Original:           Extracted:
┌───────────┐       ┌──────┬──────┐
│  Header   │  →    │Header│Header│
├─────┬─────┤       ├──────┼──────┤
│  A  │  B  │       │  A   │  B   │
└─────┴─────┘       └──────┴──────┘
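
The expansion in the diagram amounts to copying a merged cell's text into every grid position it covers. The sketch below assumes a hypothetical cell representation of `(row, col, rowspan, colspan, text)` tuples; Memvid's internal representation is not documented here.

```python
def expand_merged(cells, n_rows, n_cols):
    """Fill an n_rows x n_cols grid, repeating each cell's text across its span."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for row, col, rowspan, colspan, text in cells:
        for r in range(row, row + rowspan):
            for c in range(col, col + colspan):
                grid[r][c] = text
    return grid


# The table from the diagram: one header cell spanning two columns
cells = [
    (0, 0, 1, 2, "Header"),
    (1, 0, 1, 1, "A"),
    (1, 1, 1, 1, "B"),
]
grid = expand_merged(cells, n_rows=2, n_cols=2)
print(grid)  # [['Header', 'Header'], ['A', 'B']]
```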

Multi-Page Tables

Tables spanning multiple pages are detected and merged when possible:
# Enable cross-page merging (default)
memvid tables import memory.mv2 --input report.pdf --merge-pages

# Disable merging (treat as separate tables)
memvid tables import memory.mv2 --input report.pdf --no-merge-pages
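
To give a feel for what cross-page merging involves, here is one plausible heuristic: treat a table as a continuation when it sits on the next page and has the same column count, and drop a repeated header row. This is an illustrative assumption, not Memvid's actual merging logic.

```python
def merge_page_tables(fragments):
    """Merge table fragments given as (page, rows) pairs sorted by page number.

    A fragment continues the previous table when it is on the following page
    and has the same number of columns. A repeated header row is dropped.
    """
    merged = []
    prev_page = None
    for page, rows in fragments:
        if not rows:
            continue  # ignore empty fragments
        continues = (
            bool(merged)
            and prev_page is not None
            and page == prev_page + 1
            and len(rows[0]) == len(merged[-1][0])
        )
        if continues:
            header = merged[-1][0]
            merged[-1].extend(rows[1:] if rows[0] == header else rows)
        else:
            merged.append([list(r) for r in rows])
        prev_page = page
    return merged


fragments = [
    (1, [["Item", "Qty"], ["A", "1"]]),
    (2, [["Item", "Qty"], ["B", "2"]]),       # same header repeated on page 2
    (4, [["Name", "Age", "City"], ["X", "3", "Y"]]),
]
tables = merge_page_tables(fragments)
print(len(tables), tables[0])
```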

Nested Tables

Nested tables are extracted as separate tables with parent reference:
memvid tables list memory.mv2

# Output:
# - main_table_page1: 10 rows x 4 cols
#   └─ nested_table_1: 3 rows x 2 cols (parent: main_table_page1)

Rotated/Sideways Tables

Landscape-oriented tables are automatically detected and rotated:
# Auto-rotation is enabled by default
memvid tables import memory.mv2 --input landscape-report.pdf

# Disable auto-rotation
memvid tables import memory.mv2 --input report.pdf --no-auto-rotate

Performance Tips

Large PDFs

For PDFs with many pages:
# Process specific pages only
memvid tables import memory.mv2 --input large.pdf --pages 1-10

# Skip pages without tables
memvid tables import memory.mv2 --input large.pdf --skip-empty-pages
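
For a very large PDF, the `--pages` option can be driven from a script that walks the document in fixed-size chunks. The helper below is a hypothetical sketch; it assumes the `start-end` range syntax shown above also accepts a range whose start and end are equal.

```python
def page_ranges(total_pages, chunk):
    """Split pages 1..total_pages into "start-end" strings of at most `chunk` pages."""
    if total_pages < 1 or chunk < 1:
        raise ValueError("total_pages and chunk must be positive")
    return [
        f"{start}-{min(start + chunk - 1, total_pages)}"
        for start in range(1, total_pages + 1, chunk)
    ]


ranges = page_ranges(25, 10)
print(ranges)  # ['1-10', '11-20', '21-25']

# One argv list per chunk, ready for subprocess.run
commands = [
    ["memvid", "tables", "import", "memory.mv2", "--input", "large.pdf", "--pages", r]
    for r in ranges
]
```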

Batch Processing

For many PDFs:
# Process folder with parallel extraction
memvid put memory.mv2 --input ./pdfs/ --tables --parallel-segments

# Import tables only (no text extraction)
memvid tables import memory.mv2 --input ./pdfs/ --tables-only

Memory Usage

Table extraction can be memory-intensive for complex PDFs:
# Limit concurrent extractions
memvid tables import memory.mv2 --input large.pdf --max-concurrent 2

# Process page-by-page (lower memory)
memvid tables import memory.mv2 --input large.pdf --streaming

Troubleshooting

No Tables Detected

# Try aggressive mode
memvid tables import memory.mv2 --input report.pdf --mode aggressive

# Try specific method
memvid tables import memory.mv2 --input report.pdf --method stream
memvid tables import memory.mv2 --input report.pdf --method lattice

Poor Quality Extraction

# Check quality scores
memvid tables list memory.mv2 --json | jq '.[] | {id, quality}'

# Re-extract with different settings
memvid tables import memory.mv2 --input report.pdf --mode conservative --min-quality high
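
The same check can be done without `jq`. This sketch assumes the `--json` output is an array of objects with `id` and `quality` fields (as the `jq` filter above implies) and that `quality` holds one of the labels `low`, `medium`, or `high`; the sample payload is made up for illustration.

```python
import json

ORDER = {"low": 0, "medium": 1, "high": 2}


def tables_below(list_json, minimum="high"):
    """Return the IDs of tables whose quality label is below `minimum`."""
    tables = json.loads(list_json)
    return [t["id"] for t in tables if ORDER[t["quality"]] < ORDER[minimum]]


# Illustrative stand-in for the output of `memvid tables list memory.mv2 --json`
sample = json.dumps([
    {"id": "pdf_table_1_page1", "quality": "high"},
    {"id": "pdf_table_3_page2", "quality": "medium"},
    {"id": "pdf_table_5_page5", "quality": "low"},
])
print(tables_below(sample))            # ['pdf_table_3_page2', 'pdf_table_5_page5']
print(tables_below(sample, "medium"))  # ['pdf_table_5_page5']
```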

Missing Rows/Columns

# Adjust detection sensitivity
memvid tables import memory.mv2 --input report.pdf --sensitivity high

# Try lattice method for bordered tables
memvid tables import memory.mv2 --input report.pdf --method lattice

Limitations

| Limitation | Workaround |
| --- | --- |
| Scanned PDFs | Use OCR preprocessing first |
| Complex nested tables | May extract as multiple tables |
| Very small text | Increase DPI in source |
| Decorative borders | Use stream method |
| Non-standard layouts | Use aggressive mode |

SDK Support

Currently, table extraction is CLI-only; SDK support is planned. As a workaround, call the CLI from your own code.

Python:
import json
import subprocess

# List tables via the CLI; check=True raises CalledProcessError on a non-zero exit
result = subprocess.run(
    ['memvid', 'tables', 'list', 'memory.mv2', '--json'],
    capture_output=True, text=True, check=True,
)

tables = json.loads(result.stdout)
for table in tables:
    print(f"Table: {table['id']} - {table['rows']}x{table['cols']}")

Node.js:
import { execSync } from 'child_process'

// List tables via the CLI (execSync throws on a non-zero exit)
const output = execSync('memvid tables list memory.mv2 --json')
const tables = JSON.parse(output.toString())

tables.forEach(table => {
  console.log(`Table: ${table.id} - ${table.rows}x${table.cols}`)
})
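
For anything beyond a quick script, it may be worth wrapping the CLI call in a small function with explicit error handling. The `list_tables` wrapper below is a hypothetical sketch, not an official API. The runner is injectable so the example runs without Memvid installed; the stub and its payload are made up for illustration.

```python
import json
import subprocess


def list_tables(memory_path, run=subprocess.run):
    """Call `memvid tables list <memory> --json` and return the parsed result."""
    proc = run(
        ["memvid", "tables", "list", memory_path, "--json"],
        capture_output=True, text=True,
    )
    if proc.returncode != 0:
        raise RuntimeError(f"memvid failed ({proc.returncode}): {proc.stderr.strip()}")
    return json.loads(proc.stdout)


def fake_run(cmd, **kwargs):
    """Stub runner standing in for the real CLI (illustrative payload)."""
    payload = [{"id": "pdf_table_1_page1", "rows": 12, "cols": 4}]
    return subprocess.CompletedProcess(cmd, 0, stdout=json.dumps(payload), stderr="")


for table in list_tables("memory.mv2", run=fake_run):
    print(f"Table: {table['id']} - {table['rows']}x{table['cols']}")
```

In real use, omit the `run` argument so the function calls the installed `memvid` binary.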
