Memvid extracts structured tables from PDFs, making tabular data searchable and exportable. Tables are detected automatically using multiple extraction methods, with quality scoring to ensure accurate results.
How It Works
Key features:
Multiple detection methods - Stream, Lattice, LineBased
Quality scoring - Filter low-confidence extractions
Row embedding - Make individual rows semantically searchable
Export formats - CSV, JSON, or view inline
Memvid tries multiple methods and uses the best result:
| Method | Description | Best For |
|--------|-------------|----------|
| Stream | Text position analysis | Borderless tables, text layouts |
| Lattice | Grid line detection | Tables with visible borders |
| LineBased | Row/column inference | Mixed formatting |
The extractor automatically selects the method with the highest quality score.
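The selection step can be pictured as an arg-max over the candidate extractions. The sketch below is illustrative only: `Candidate` and `select_best` are hypothetical names, not part of Memvid, and the scores are made up.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One extraction attempt (hypothetical structure, for illustration)."""
    method: str     # "Stream", "Lattice", or "LineBased"
    quality: float  # quality score in [0.0, 1.0]


def select_best(candidates):
    """Return the candidate with the highest quality score."""
    if not candidates:
        raise ValueError("no candidates to choose from")
    return max(candidates, key=lambda c: c.quality)


# Made-up scores for a table with visible borders
attempts = [
    Candidate("Stream", 0.62),
    Candidate("Lattice", 0.91),
    Candidate("LineBased", 0.74),
]
best = select_best(attempts)
print(best.method)  # prints: Lattice
```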
CLI Usage
# Extract tables from PDF
memvid put memory.mv2 --input report.pdf --tables
# Extract and embed rows for semantic search
memvid put memory.mv2 --input financial.pdf --tables --embed-rows
Control extraction aggressiveness:
# Conservative - high confidence only
memvid tables import memory.mv2 --input report.pdf --mode conservative
# Aggressive - extract everything possible
memvid tables import memory.mv2 --input messy.pdf --mode aggressive
# Default - balanced approach
memvid tables import memory.mv2 --input report.pdf
| Mode | Description | Use Case |
|------|-------------|----------|
| conservative | High confidence only | Clean, formal documents |
| balanced | Default behavior | General purpose |
| aggressive | Extract everything | Messy/scanned documents |
Quality Filters
Filter by table quality:
# Only high-quality tables
memvid tables import memory.mv2 --input report.pdf --min-quality high
# Include medium quality
memvid tables import memory.mv2 --input report.pdf --min-quality medium
# Accept all (including low quality)
memvid tables import memory.mv2 --input report.pdf --min-quality low
Size Filters
Filter by table dimensions:
# Minimum 3 rows and 2 columns
memvid tables import memory.mv2 --input report.pdf --min-rows 3 --min-cols 2
# Skip single-row headers
memvid tables import memory.mv2 --input report.pdf --min-rows 2
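The same thresholds can also be applied after extraction, to the parsed output of `memvid tables list --json`. This is a minimal sketch, assuming each entry carries `id`, `rows`, `cols`, and a `quality` level string; those field names are inferred from the examples on this page rather than from a documented schema, and `filter_tables` is a hypothetical helper.

```python
# Ordering of quality levels, lowest to highest
QUALITY_RANK = {"low": 0, "medium": 1, "high": 2}


def filter_tables(tables, min_quality="low", min_rows=1, min_cols=1):
    """Keep tables that meet the quality and size thresholds."""
    threshold = QUALITY_RANK[min_quality]
    return [
        t for t in tables
        if QUALITY_RANK[t["quality"]] >= threshold
        and t["rows"] >= min_rows
        and t["cols"] >= min_cols
    ]


# Sample entries (assumed shape of the parsed JSON listing)
tables = [
    {"id": "pdf_table_1_page1", "rows": 12, "cols": 4, "quality": "high"},
    {"id": "pdf_table_2_page1", "rows": 8, "cols": 3, "quality": "high"},
    {"id": "pdf_table_3_page2", "rows": 5, "cols": 6, "quality": "medium"},
    {"id": "pdf_table_4_page3", "rows": 20, "cols": 2, "quality": "high"},
    {"id": "pdf_table_5_page5", "rows": 3, "cols": 4, "quality": "low"},
]

kept = filter_tables(tables, min_quality="high", min_rows=3, min_cols=2)
print([t["id"] for t in kept])
```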
Managing Tables
List Tables
# List all tables in memory
memvid tables list memory.mv2
# Output:
# Found 5 tables:
# - pdf_table_1_page1: 12 rows x 4 cols (Stream) [high]
# - pdf_table_2_page1: 8 rows x 3 cols (Lattice) [high]
# - pdf_table_3_page2: 5 rows x 6 cols (LineBased) [medium]
# - pdf_table_4_page3: 20 rows x 2 cols (Stream) [high]
# - pdf_table_5_page5: 3 rows x 4 cols (Lattice) [low]
# JSON output for scripting
memvid tables list memory.mv2 --json
View Table
# View table contents
memvid tables view memory.mv2 --table-id pdf_table_1_page1
# Output:
# ┌──────────────┬──────────┬──────────┬──────────┐
# │ Product      │ Qty      │ Price    │ Total    │
# ├──────────────┼──────────┼──────────┼──────────┤
# │ Widget A     │ 10       │ $5.00    │ $50.00   │
# │ Widget B     │ 5        │ $10.00   │ $50.00   │
# │ Widget C     │ 2        │ $25.00   │ $50.00   │
# └──────────────┴──────────┴──────────┴──────────┘
# JSON output
memvid tables view memory.mv2 --table-id pdf_table_1_page1 --json
Export Table
# Export to CSV
memvid tables export memory.mv2 --table-id pdf_table_1_page1 --format csv --out data.csv
# Export to JSON (array of arrays)
memvid tables export memory.mv2 --table-id pdf_table_1_page1 --format json --out data.json
# Export to JSON (array of objects/records)
memvid tables export memory.mv2 --table-id pdf_table_1_page1 --format json --as-records --out data.json
# Export all tables
memvid tables export memory.mv2 --all --format csv --out-dir ./tables/
JSON array format:
[
  ["Product", "Qty", "Price", "Total"],
  ["Widget A", "10", "$5.00", "$50.00"],
  ["Widget B", "5", "$10.00", "$50.00"]
]
JSON records format (--as-records):
[
  {"Product": "Widget A", "Qty": "10", "Price": "$5.00", "Total": "$50.00"},
  {"Product": "Widget B", "Qty": "5", "Price": "$10.00", "Total": "$50.00"}
]
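The two layouts carry the same data. Assuming the first inner array is the header row, the array format can be turned into the records format with a few lines of Python. `rows_to_records` is a hypothetical helper for illustration, not a Memvid API.

```python
def rows_to_records(rows):
    """Convert [[header...], [cells...], ...] into a list of dicts keyed by header."""
    if not rows:
        return []
    header, *body = rows
    # zip() pairs each header with the cell in the same position;
    # cells beyond the header length are silently dropped.
    return [dict(zip(header, row)) for row in body]


rows = [
    ["Product", "Qty", "Price", "Total"],
    ["Widget A", "10", "$5.00", "$50.00"],
    ["Widget B", "5", "$10.00", "$50.00"],
]
records = rows_to_records(rows)
print(records[0]["Price"])  # prints: $5.00
```

Note that every value stays a string, as in the exported JSON shown above; converting `"$5.00"` to a number is left to the consumer.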
Searching Table Data
Row Embedding
When --embed-rows is enabled (default), individual table rows are embedded for semantic search:
# Ingest with row embedding
memvid put memory.mv2 --input financial.pdf --tables --embed-rows
# Search finds relevant rows
memvid find memory.mv2 --query "Q4 revenue"
# Results include table row matches:
# [0.89] Row from pdf_table_2_page3: "Q4 2024 | Revenue | $1,234,567"
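One way to picture row embedding is that each row is flattened to a single line of text before it is embedded. The sketch below reproduces the pipe-separated form shown in the search result; whether Memvid embeds exactly this string is an assumption, and `row_to_text` is a hypothetical helper.

```python
def row_to_text(cells, sep=" | "):
    """Join a row's non-empty cells into one line of text."""
    return sep.join(c.strip() for c in cells if c and c.strip())


print(row_to_text(["Q4 2024", "Revenue", "$1,234,567"]))
# prints: Q4 2024 | Revenue | $1,234,567
```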
Searching Table Content
# Search across all content including tables
memvid find memory.mv2 --query "total sales"
# Filter to table content only
memvid find memory.mv2 --query "total sales" --scope "table:"
Use Cases
Invoice Processing
# Create invoice memory
memvid create invoices.mv2
# Ingest invoices with table extraction
memvid put invoices.mv2 --input ./invoices/ --tables --embed-rows
# Find specific line items
memvid find invoices.mv2 --query "shipping charges"
# List invoice tables, then export one by ID
memvid tables list invoices.mv2
memvid tables export invoices.mv2 --table-id inv_001_table1 --format csv --out line_items.csv
Financial Reports
# Ingest quarterly reports
memvid put finance.mv2 --input quarterly-reports/ --tables
# Search for metrics
memvid find finance.mv2 --query "EBITDA margin"
# Export data for analysis
memvid tables export finance.mv2 --all --format csv --out-dir ./financial-data/
Research Papers
# Extract data tables from papers
memvid put research.mv2 --input papers/ --tables --min-quality medium
# Find experimental results
memvid find research.mv2 --query "p-value significance"
# Export for meta-analysis
memvid tables export research.mv2 --table-id paper_xyz_table3 --format json --as-records
Payroll/HR Documents
# Process pay stubs
memvid put payroll.mv2 --input paystubs/ --tables --mode conservative
# Search for deductions
memvid find payroll.mv2 --query "401k contribution"
# Export earnings data
memvid tables export payroll.mv2 --table-id stub_jan_table1 --format csv
Quality Scoring
Each extracted table receives a quality score based on:
| Factor | Description |
|--------|-------------|
| Structure consistency | Regular row/column counts |
| Cell alignment | Properly aligned content |
| Header detection | Clear header row identified |
| Empty cells | Low percentage of empty cells |
| Content coherence | Related data in columns |
Quality levels:
| Level | Score | Description |
|-------|-------|-------------|
| high | 0.8 - 1.0 | Reliable, well-structured |
| medium | 0.5 - 0.8 | Usable, may need review |
| low | 0.0 - 0.5 | Possible extraction errors |
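The mapping from score to level can be written as a small function. The listed ranges share their endpoints, so the sketch below assumes that a score of exactly 0.8 counts as high and exactly 0.5 as medium; that boundary choice and the `quality_level` name are assumptions, not documented behavior.

```python
def quality_level(score):
    """Map a quality score in [0.0, 1.0] to a level name."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 0.8:   # assumed: lower bound inclusive
        return "high"
    if score >= 0.5:   # assumed: lower bound inclusive
        return "medium"
    return "low"


for s in (0.92, 0.65, 0.30):
    print(s, quality_level(s))
```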
Handling Edge Cases
Merged Cells
Merged cells are expanded to fill all covered positions:
Original:          Extracted:
┌───────────┐      ┌──────┬──────┐
│  Header   │  →   │Header│Header│
├─────┬─────┤      ├──────┼──────┤
│  A  │  B  │      │  A   │  B   │
└─────┴─────┘      └──────┴──────┘
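The expansion shown above amounts to copying a merged cell's text into every grid position it covers. A minimal sketch, assuming each cell is described by a hypothetical `(row, col, rowspan, colspan, text)` tuple; this illustrates the idea and is not Memvid's internal representation.

```python
def expand_merged(cells, n_rows, n_cols):
    """Build a rectangular grid, repeating merged-cell text in each covered position."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for row, col, rowspan, colspan, text in cells:
        for r in range(row, row + rowspan):
            for c in range(col, col + colspan):
                grid[r][c] = text
    return grid


cells = [
    (0, 0, 1, 2, "Header"),  # spans both columns
    (1, 0, 1, 1, "A"),
    (1, 1, 1, 1, "B"),
]
grid = expand_merged(cells, n_rows=2, n_cols=2)
print(grid)  # prints: [['Header', 'Header'], ['A', 'B']]
```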
Multi-Page Tables
Tables spanning multiple pages are detected and merged when possible:
# Enable cross-page merging (default)
memvid tables import memory.mv2 --input report.pdf --merge-pages
# Disable merging (treat as separate tables)
memvid tables import memory.mv2 --input report.pdf --no-merge-pages
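How Memvid decides that two fragments belong to the same table is not described here. As a rough illustration of the idea, the sketch below joins a continuation to the previous fragment when the column counts match, and drops a header row that is repeated on the second page. Both the heuristic and the `merge_fragments` name are assumptions made for illustration.

```python
def merge_fragments(first, continuation):
    """Merge two row lists that look like one table split across pages.

    Returns the merged rows, or None when the fragments do not line up.
    """
    if not first or not continuation:
        return None
    if len(first[0]) != len(continuation[0]):
        return None  # different column counts: treat as separate tables
    # Drop the continuation's first row if it repeats the header
    body = continuation[1:] if continuation[0] == first[0] else continuation
    return first + body


page1 = [["Item", "Amount"], ["Rent", "1200"]]
page2 = [["Item", "Amount"], ["Utilities", "150"]]
print(merge_fragments(page1, page2))
# prints: [['Item', 'Amount'], ['Rent', '1200'], ['Utilities', '150']]
```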
Nested Tables
Nested tables are extracted as separate tables with parent reference:
memvid tables list memory.mv2
# Output:
# - main_table_page1: 10 rows x 4 cols
# └─ nested_table_1: 3 rows x 2 cols (parent: main_table_page1)
Rotated/Sideways Tables
Landscape-oriented tables are automatically detected and rotated:
# Auto-rotation is enabled by default
memvid tables import memory.mv2 --input landscape-report.pdf
# Disable auto-rotation
memvid tables import memory.mv2 --input report.pdf --no-auto-rotate
Large PDFs
For PDFs with many pages:
# Process specific pages only
memvid tables import memory.mv2 --input large.pdf --pages 1-10
# Skip pages without tables
memvid tables import memory.mv2 --input large.pdf --skip-empty-pages
Batch Processing
For many PDFs:
# Process folder with parallel extraction
memvid put memory.mv2 --input ./pdfs/ --tables --parallel-segments
# Import tables only (no text extraction)
memvid tables import memory.mv2 --input ./pdfs/ --tables-only
Memory Usage
Table extraction can be memory-intensive for complex PDFs:
# Limit concurrent extractions
memvid tables import memory.mv2 --input large.pdf --max-concurrent 2
# Process page-by-page (lower memory)
memvid tables import memory.mv2 --input large.pdf --streaming
Troubleshooting
No Tables Detected
# Try aggressive mode
memvid tables import memory.mv2 --input report.pdf --mode aggressive
# Try specific method
memvid tables import memory.mv2 --input report.pdf --method stream
memvid tables import memory.mv2 --input report.pdf --method lattice
# Check quality scores
memvid tables list memory.mv2 --json | jq '.[] | {id, quality}'
# Re-extract with different settings
memvid tables import memory.mv2 --input report.pdf --mode conservative --min-quality high
Missing Rows/Columns
# Adjust detection sensitivity
memvid tables import memory.mv2 --input report.pdf --sensitivity high
# Try lattice method for bordered tables
memvid tables import memory.mv2 --input report.pdf --method lattice
Limitations
| Limitation | Workaround |
|------------|------------|
| Scanned PDFs | Use OCR preprocessing first |
| Complex nested tables | May extract as multiple tables |
| Very small text | Increase DPI in source |
| Decorative borders | Use stream method |
| Non-standard layouts | Use aggressive mode |
SDK Support
Currently, table extraction is CLI-only. SDK support is coming soon.
Workaround for SDKs:
Python:
import json
import subprocess

# List tables via the CLI and parse the JSON output
result = subprocess.run(
    ["memvid", "tables", "list", "memory.mv2", "--json"],
    capture_output=True,
    text=True,
    check=True,  # raise if the CLI exits with an error
)
tables = json.loads(result.stdout)
for table in tables:
    print(f"Table: {table['id']} - {table['rows']} x {table['cols']}")
Node.js:
import { execSync } from 'child_process'

// List tables via the CLI and parse the JSON output
const output = execSync('memvid tables list memory.mv2 --json')
const tables = JSON.parse(output.toString())
tables.forEach(table => {
  console.log(`Table: ${table.id} - ${table.rows} x ${table.cols}`)
})
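The same approach extends to exports. The sketch below wraps `memvid tables export` using only the flags shown earlier on this page (`--table-id`, `--format`, `--as-records`, `--out`). `build_export_cmd` and `export_table` are hypothetical helper names; building the argument list separately keeps it easy to inspect without running the CLI.

```python
import subprocess


def build_export_cmd(memory, table_id, fmt="csv", out=None, as_records=False):
    """Build the argument list for `memvid tables export`."""
    cmd = ["memvid", "tables", "export", memory,
           "--table-id", table_id, "--format", fmt]
    if as_records:
        cmd.append("--as-records")
    if out:
        cmd += ["--out", out]
    return cmd


def export_table(memory, table_id, **options):
    """Run the export. Requires the memvid CLI to be installed and on PATH."""
    subprocess.run(build_export_cmd(memory, table_id, **options), check=True)


cmd = build_export_cmd("memory.mv2", "pdf_table_1_page1",
                       fmt="json", out="data.json", as_records=True)
print(" ".join(cmd))
```

Passing the command as a list, rather than a single shell string, avoids quoting problems when table IDs or paths contain spaces.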
Next Steps
CLI Reference: Full put command options
Visual Embeddings: Image and visual search