Table Extraction

Memvid extracts structured tables from PDFs, making tabular data searchable and exportable. Tables are detected automatically using several extraction methods, and each result is given a quality score so that low-confidence extractions can be filtered out.

How It Works

Key features:

Multiple detection methods - Stream, Lattice, LineBased
Quality scoring - Filter low-confidence extractions
Row embedding - Make individual rows semantically searchable
Export formats - CSV, JSON, or view inline

Extraction Methods

Memvid tries multiple methods and uses the best result:

Method	Description	Best For
Stream	Text position analysis	Borderless tables, text layouts
Lattice	Grid line detection	Tables with visible borders
LineBased	Row/column inference	Mixed formatting

The extractor automatically selects the method with the highest quality score.
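The selection step can be pictured as taking a maximum over the per-method results. The sketch below is illustrative only: the `Candidate` type, the `pick_best` helper, and the scores are hypothetical and are not part of Memvid's API.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One extraction attempt (hypothetical type, for illustration only)."""
    method: str     # "Stream", "Lattice", or "LineBased"
    quality: float  # quality score in the range 0.0 to 1.0


def pick_best(candidates: list[Candidate]) -> Candidate:
    """Return the attempt with the highest quality score."""
    if not candidates:
        raise ValueError("no extraction candidates")
    return max(candidates, key=lambda c: c.quality)


# Made-up scores for a table with visible borders.
attempts = [
    Candidate("Stream", 0.62),
    Candidate("Lattice", 0.91),
    Candidate("LineBased", 0.74),
]
best = pick_best(attempts)
print(best.method)
```

With these made-up scores the Lattice result wins, which matches the "Best For" guidance for bordered tables.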

CLI Usage

Basic Table Extraction

# Extract tables from PDF
memvid put memory.mv2 --input report.pdf --tables

# Extract and embed rows for semantic search
memvid put memory.mv2 --input financial.pdf --tables --embed-rows

Extraction Modes

Control extraction aggressiveness:

# Conservative - high confidence only
memvid tables import memory.mv2 --input report.pdf --mode conservative

# Aggressive - extract everything possible
memvid tables import memory.mv2 --input messy.pdf --mode aggressive

# Default - balanced approach
memvid tables import memory.mv2 --input report.pdf

Mode	Description	Use Case
`conservative`	High confidence only	Clean, formal documents
`balanced`	Default behavior	General purpose
`aggressive`	Extract everything	Messy/scanned documents

Quality Filters

Filter by table quality:

# Only high-quality tables
memvid tables import memory.mv2 --input report.pdf --min-quality high

# Include medium quality
memvid tables import memory.mv2 --input report.pdf --min-quality medium

# Accept all (including low quality)
memvid tables import memory.mv2 --input report.pdf --min-quality low

Size Filters

Filter by table dimensions:

# Minimum 3 rows and 2 columns
memvid tables import memory.mv2 --input report.pdf --min-rows 3 --min-cols 2

# Skip single-row tables (for example, stray header rows)
memvid tables import memory.mv2 --input report.pdf --min-rows 2
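When imports are scripted, it can help to assemble the argument list in one place. The following is a minimal sketch: `build_import_command` is a hypothetical helper, it uses only the flags shown above, and it assumes that the mode, quality, and size flags can be combined in a single call and that `--mode balanced` may be passed explicitly.

```python
import shlex
from typing import Optional

MODES = {"conservative", "balanced", "aggressive"}
QUALITIES = {"high", "medium", "low"}


def build_import_command(
    memory: str,
    pdf: str,
    mode: Optional[str] = None,
    min_quality: Optional[str] = None,
    min_rows: Optional[int] = None,
    min_cols: Optional[int] = None,
) -> list[str]:
    """Assemble argv for `memvid tables import` (hypothetical helper)."""
    if mode is not None and mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if min_quality is not None and min_quality not in QUALITIES:
        raise ValueError(f"unknown quality level: {min_quality}")

    cmd = ["memvid", "tables", "import", memory, "--input", pdf]
    if mode is not None:
        cmd += ["--mode", mode]
    if min_quality is not None:
        cmd += ["--min-quality", min_quality]
    if min_rows is not None:
        cmd += ["--min-rows", str(min_rows)]
    if min_cols is not None:
        cmd += ["--min-cols", str(min_cols)]
    return cmd


cmd = build_import_command(
    "memory.mv2", "report.pdf",
    mode="conservative", min_quality="high", min_rows=3, min_cols=2,
)
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(...) to execute it
```

Building a list rather than a shell string avoids quoting problems when file names contain spaces.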

Managing Tables

List Tables

# List all tables in memory
memvid tables list memory.mv2

# Output:
# Found 5 tables:
#   - pdf_table_1_page1: 12 rows x 4 cols (Stream) [high]
#   - pdf_table_2_page1: 8 rows x 3 cols (Lattice) [high]
#   - pdf_table_3_page2: 5 rows x 6 cols (LineBased) [medium]
#   - pdf_table_4_page3: 20 rows x 2 cols (Stream) [high]
#   - pdf_table_5_page5: 3 rows x 4 cols (Lattice) [low]

# JSON output for scripting
memvid tables list memory.mv2 --json

View Table

# View table contents
memvid tables view memory.mv2 --table-id pdf_table_1_page1

# Output:
# ┌──────────────┬──────────┬──────────┬──────────┐
# │ Product      │ Qty      │ Price    │ Total    │
# ├──────────────┼──────────┼──────────┼──────────┤
# │ Widget A     │ 10       │ $5.00    │ $50.00   │
# │ Widget B     │ 5        │ $10.00   │ $50.00   │
# │ Widget C     │ 2        │ $25.00   │ $50.00   │
# └──────────────┴──────────┴──────────┴──────────┘

# JSON output
memvid tables view memory.mv2 --table-id pdf_table_1_page1 --json

Export Table

# Export to CSV
memvid tables export memory.mv2 --table-id pdf_table_1_page1 --format csv --out data.csv

# Export to JSON (array of arrays)
memvid tables export memory.mv2 --table-id pdf_table_1_page1 --format json --out data.json

# Export to JSON (array of objects/records)
memvid tables export memory.mv2 --table-id pdf_table_1_page1 --format json --as-records --out data.json

# Export all tables
memvid tables export memory.mv2 --all --format csv --out-dir ./tables/

JSON array format:

[
  ["Product", "Qty", "Price", "Total"],
  ["Widget A", "10", "$5.00", "$50.00"],
  ["Widget B", "5", "$10.00", "$50.00"]
]

JSON records format (--as-records):

[
  {"Product": "Widget A", "Qty": "10", "Price": "$5.00", "Total": "$50.00"},
  {"Product": "Widget B", "Qty": "5", "Price": "$10.00", "Total": "$50.00"}
]
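The two JSON shapes carry the same data: the records form uses the first row of the array form as keys. As a rough illustration of that relationship (not Memvid's own code; `rows_to_records` is a hypothetical helper), an array export can be converted after the fact:

```python
def rows_to_records(table: list[list[str]]) -> list[dict[str, str]]:
    """Treat the first row as the header and map each later row onto it."""
    if not table:
        return []
    header, *rows = table
    # zip() stops at the shorter sequence, so ragged rows lose trailing cells.
    return [dict(zip(header, row)) for row in rows]


array_form = [
    ["Product", "Qty", "Price", "Total"],
    ["Widget A", "10", "$5.00", "$50.00"],
    ["Widget B", "5", "$10.00", "$50.00"],
]
records = rows_to_records(array_form)
print(records[0]["Price"])
```

Note that all values stay strings in both formats, so numeric columns such as Qty need explicit conversion before analysis.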

Searching Table Data

Row Embedding

When --embed-rows is enabled (default), individual table rows are embedded for semantic search:

# Ingest with row embedding
memvid put memory.mv2 --input financial.pdf --tables --embed-rows

# Search finds relevant rows
memvid find memory.mv2 --query "Q4 revenue"

# Results include table row matches:
# [0.89] Row from pdf_table_2_page3: "Q4 2024 | Revenue | $1,234,567"

Searching Table Content

# Search across all content including tables
memvid find memory.mv2 --query "total sales"

# Filter to table content only
memvid find memory.mv2 --query "total sales" --scope "table:"

Use Cases

Invoice Processing

# Create invoice memory
memvid create invoices.mv2

# Ingest invoices with table extraction
memvid put invoices.mv2 --input ./invoices/ --tables --embed-rows

# Find specific line items
memvid find invoices.mv2 --query "shipping charges"

# List invoice tables, then export one by ID
memvid tables list invoices.mv2
memvid tables export invoices.mv2 --table-id inv_001_table1 --format csv --out line_items.csv

Financial Reports

# Ingest quarterly reports
memvid put finance.mv2 --input quarterly-reports/ --tables

# Search for metrics
memvid find finance.mv2 --query "EBITDA margin"

# Export data for analysis
memvid tables export finance.mv2 --all --format csv --out-dir ./financial-data/
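Once the CSV files are on disk they can be loaded with the standard library. This is a minimal sketch: `load_csv_tables` is a hypothetical helper, and it assumes the export directory contains one UTF-8 `.csv` file per table. The file name used in the demo is a stand-in, not a documented naming scheme.

```python
import csv
import tempfile
from pathlib import Path


def load_csv_tables(directory: str) -> dict[str, list[list[str]]]:
    """Read every *.csv file in `directory` into {file stem: list of rows}."""
    tables = {}
    for path in sorted(Path(directory).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            tables[path.stem] = list(csv.reader(handle))
    return tables


# Demo with a stand-in file; real exports would come from --out-dir.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "pdf_table_1_page1.csv"
    sample.write_text(
        "Product,Qty,Price,Total\nWidget A,10,$5.00,$50.00\n",
        encoding="utf-8",
    )
    loaded = load_csv_tables(tmp)

print(loaded["pdf_table_1_page1"][1])
```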

Research Papers

# Extract data tables from papers
memvid put research.mv2 --input papers/ --tables --min-quality medium

# Find experimental results
memvid find research.mv2 --query "p-value significance"

# Export for meta-analysis
memvid tables export research.mv2 --table-id paper_xyz_table3 --format json --as-records

Payroll/HR Documents

# Process pay stubs
memvid put payroll.mv2 --input paystubs/ --tables --mode conservative

# Search for deductions
memvid find payroll.mv2 --query "401k contribution"

# Export earnings data
memvid tables export payroll.mv2 --table-id stub_jan_table1 --format csv

Quality Scoring

Each extracted table receives a quality score based on:

Factor	Description
Structure consistency	Regular row/column counts
Cell alignment	Properly aligned content
Header detection	Clear header row identified
Empty cells	Low percentage of empty cells
Content coherence	Related data in columns
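Two of these factors are easy to measure directly on an extracted table. The functions below are an illustrative sketch under assumed definitions, not Memvid's actual scoring code; both names are hypothetical.

```python
from collections import Counter


def structure_consistency(table: list[list[str]]) -> float:
    """Fraction of rows whose length equals the most common row length."""
    if not table:
        raise ValueError("empty table")
    lengths = Counter(len(row) for row in table)
    most_common_count = lengths.most_common(1)[0][1]
    return most_common_count / len(table)


def empty_cell_ratio(table: list[list[str]]) -> float:
    """Fraction of cells that are empty or whitespace-only."""
    cells = [cell for row in table for cell in row]
    if not cells:
        raise ValueError("table has no cells")
    return sum(1 for cell in cells if not cell.strip()) / len(cells)


table = [
    ["Product", "Qty"],
    ["Widget A", "10"],
    ["Widget B", ""],   # one empty cell
    ["Widget C"],       # one short row
]
consistency = structure_consistency(table)
empty_ratio = empty_cell_ratio(table)
print(consistency, round(empty_ratio, 3))
```

A real scorer would presumably weight several such factors together; how Memvid combines them is not specified here.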

Quality levels:

Level	Score	Description
`high`	0.8 - 1.0	Reliable, well-structured
`medium`	0.5 - 0.8	Usable, may need review
`low`	0.0 - 0.5	Possible extraction errors
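The ranges in the table share their endpoints (0.5 and 0.8), so a mapping has to choose which level owns each boundary. The sketch below assumes the lower bound of each range is inclusive. That choice is an assumption, not documented behavior, and `quality_level` is a hypothetical helper.

```python
def quality_level(score: float) -> str:
    """Map a score in [0.0, 1.0] to a level (lower bounds assumed inclusive)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"


for score in (0.92, 0.8, 0.65, 0.5, 0.31):
    print(score, quality_level(score))
```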

Handling Edge Cases

Merged Cells

Merged cells are expanded to fill all covered positions:

Original:                 Extracted:
┌─────────────────┐       ┌────────┬────────┐
│     Header      │  →    │ Header │ Header │
├────────┬────────┤       ├────────┼────────┤
│   A    │   B    │       │   A    │   B    │
└────────┴────────┘       └────────┴────────┘
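The expansion shown in the diagram can be sketched as copying a merged cell's text into every grid position it covers. The `Cell` type and `expand_merged` function are hypothetical and serve only to illustrate the idea.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """A detected cell with its top-left position and span (hypothetical type)."""
    row: int
    col: int
    text: str
    rowspan: int = 1
    colspan: int = 1


def expand_merged(cells: list[Cell], n_rows: int, n_cols: int) -> list[list[str]]:
    """Build a rectangular grid, repeating merged text in each covered position."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        for r in range(cell.row, cell.row + cell.rowspan):
            for c in range(cell.col, cell.col + cell.colspan):
                grid[r][c] = cell.text
    return grid


cells = [
    Cell(0, 0, "Header", colspan=2),  # merged header spanning both columns
    Cell(1, 0, "A"),
    Cell(1, 1, "B"),
]
grid = expand_merged(cells, n_rows=2, n_cols=2)
print(grid)
```

Repeating the text keeps every row the same width, so CSV and records exports stay rectangular.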

Multi-Page Tables

Tables spanning multiple pages are detected and merged when possible:

# Enable cross-page merging (default)
memvid tables import memory.mv2 --input report.pdf --merge-pages

# Disable merging (treat as separate tables)
memvid tables import memory.mv2 --input report.pdf --no-merge-pages
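One plausible way to picture cross-page merging: two fragments are joined when their column counts match, and a header row repeated at the top of the second page is dropped. This heuristic and the `merge_fragments` name are assumptions for illustration; Memvid's actual merge criteria are not described here.

```python
from typing import Optional

Table = list[list[str]]


def merge_fragments(first: Table, second: Table) -> Optional[Table]:
    """Join two page fragments, or return None if they do not look related."""
    if not first or not second:
        return None
    if len(first[0]) != len(second[0]):
        return None  # different column counts: treat as separate tables
    # Drop a header row that is repeated at the top of the next page.
    body = second[1:] if second[0] == first[0] else second
    return first + body


page1 = [["Item", "Qty"], ["A", "1"]]
page2 = [["Item", "Qty"], ["B", "2"]]
merged = merge_fragments(page1, page2)
print(merged)
```

When a merge produces a wrong result, `--no-merge-pages` keeps the fragments as separate tables.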

Nested Tables

Nested tables are extracted as separate tables with parent reference:

memvid tables list memory.mv2

# Output:
# - main_table_page1: 10 rows x 4 cols
#   └─ nested_table_1: 3 rows x 2 cols (parent: main_table_page1)

Rotated/Sideways Tables

Landscape-oriented tables are automatically detected and rotated:

# Auto-rotation is enabled by default
memvid tables import memory.mv2 --input landscape-report.pdf

# Disable auto-rotation
memvid tables import memory.mv2 --input report.pdf --no-auto-rotate

Performance Tips

Large PDFs

For PDFs with many pages:

# Process specific pages only
memvid tables import memory.mv2 --input large.pdf --pages 1-10

# Skip pages without tables
memvid tables import memory.mv2 --input large.pdf --skip-empty-pages

Batch Processing

For many PDFs:

# Process folder with parallel extraction
memvid put memory.mv2 --input ./pdfs/ --tables --parallel-segments

# Import tables only (no text extraction)
memvid tables import memory.mv2 --input ./pdfs/ --tables-only

Memory Usage

Table extraction can be memory-intensive for complex PDFs:

# Limit concurrent extractions
memvid tables import memory.mv2 --input large.pdf --max-concurrent 2

# Process page-by-page (lower memory)
memvid tables import memory.mv2 --input large.pdf --streaming

Troubleshooting

No Tables Detected

# Try aggressive mode
memvid tables import memory.mv2 --input report.pdf --mode aggressive

# Try specific method
memvid tables import memory.mv2 --input report.pdf --method stream
memvid tables import memory.mv2 --input report.pdf --method lattice

Poor Quality Extraction

# Check quality scores
memvid tables list memory.mv2 --json | jq '.[] | {id, quality}'

# Re-extract with different settings
memvid tables import memory.mv2 --input report.pdf --mode conservative --min-quality high
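The same check can be done in Python when jq is not available. The sketch assumes the `--json` listing is an array of objects with `id` and `quality` fields, as the jq filter above implies, and that `quality` holds the level name. The sample data is made up, and `tables_below` is a hypothetical helper.

```python
import json

QUALITY_RANK = {"low": 0, "medium": 1, "high": 2}


def tables_below(listing_json: str, minimum: str) -> list[str]:
    """Return IDs of tables whose quality level is below `minimum`."""
    tables = json.loads(listing_json)
    threshold = QUALITY_RANK[minimum]
    return [t["id"] for t in tables if QUALITY_RANK[t["quality"]] < threshold]


# Stand-in for the output of `memvid tables list memory.mv2 --json`.
sample = json.dumps([
    {"id": "pdf_table_1_page1", "quality": "high"},
    {"id": "pdf_table_3_page2", "quality": "medium"},
    {"id": "pdf_table_5_page5", "quality": "low"},
])
needs_review = tables_below(sample, "high")
print(needs_review)
```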

Missing Rows/Columns

# Adjust detection sensitivity
memvid tables import memory.mv2 --input report.pdf --sensitivity high

# Try lattice method for bordered tables
memvid tables import memory.mv2 --input report.pdf --method lattice

Limitations

Limitation	Workaround
Scanned PDFs	Use OCR preprocessing first
Complex nested tables	May extract as multiple tables
Very small text	Increase DPI in source
Decorative borders	Use stream method
Non-standard layouts	Use aggressive mode

SDK Support

Table extraction is currently CLI-only, with SDK support coming soon. Until then, you can call the CLI from Python or Node.js:

import json
import subprocess

# List extracted tables via the CLI; check=True raises if memvid exits non-zero
result = subprocess.run(
    ["memvid", "tables", "list", "memory.mv2", "--json"],
    capture_output=True,
    text=True,
    check=True,
)

# Parse the JSON listing and print each table's ID and dimensions
tables = json.loads(result.stdout)
for table in tables:
    print(f"Table: {table['id']} - {table['rows']}x{table['cols']}")

import { execSync } from 'child_process'

// List extracted tables via the CLI
const output = execSync('memvid tables list memory.mv2 --json')
const tables = JSON.parse(output.toString())

tables.forEach(table => {
  console.log(`Table: ${table.id} - ${table.rows}x${table.cols}`)
})

Next Steps

CLI Reference
Visual Embeddings