How It Works
Key features:
- Multiple detection methods - Stream, Lattice, LineBased
- Quality scoring - Filter low-confidence extractions
- Row embedding - Make individual rows semantically searchable
- Export formats - CSV, JSON, or view inline
Extraction Methods
Memvid tries multiple methods and uses the best result:

| Method | Description | Best For |
|---|---|---|
| Stream | Text position analysis | Borderless tables, text layouts |
| Lattice | Grid line detection | Tables with visible borders |
| LineBased | Row/column inference | Mixed formatting |
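The "try several methods, keep the best" idea can be sketched in a few lines. This is only an illustration and not Memvid's internal code: the `Candidate` type, the `pick_best` helper, and the scores below are hypothetical names and values invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

Table = List[List[str]]


@dataclass
class Candidate:
    """One extraction attempt (hypothetical structure, not Memvid's API)."""
    method: str   # e.g. "stream", "lattice", "line_based"
    table: Table  # extracted cells, row by row
    score: float  # quality score in [0.0, 1.0]


def pick_best(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the highest-scoring candidate, or None if nothing was extracted."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.score)


# Made-up scores for illustration only.
candidates = [
    Candidate("stream", [["a", "b"], ["1", "2"]], 0.62),
    Candidate("lattice", [["a", "b"], ["1", "2"]], 0.91),
    Candidate("line_based", [["a b"], ["1 2"]], 0.40),
]
best = pick_best(candidates)
print(best.method)  # lattice
```

Selecting by a single score keeps the comparison simple; a real implementation may weigh additional signals.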
CLI Usage
Basic Table Extraction
Extraction Modes
Control extraction aggressiveness:

| Mode | Description | Use Case |
|---|---|---|
| conservative | High confidence only | Clean, formal documents |
| balanced | Default behavior | General purpose |
| aggressive | Extract everything | Messy/scanned documents |
Quality Filters
Filter by table quality:
Size Filters
Filter by table dimensions:
Managing Tables
List Tables
View Table
Export Table
--as-records):
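As a rough illustration of what a records-style export usually looks like, the sketch below turns a header row plus data rows into one object per row. It assumes "records" means a list of objects keyed by column header; that reading, and the `table_to_records` helper, are assumptions for the example and not a description of Memvid's actual output.

```python
import json
from typing import Dict, List


def table_to_records(header: List[str], rows: List[List[str]]) -> List[Dict[str, str]]:
    """Pair each cell with its column header, producing one dict per row."""
    return [dict(zip(header, row)) for row in rows]


header = ["Item", "Qty"]
rows = [["Widget", "2"], ["Gadget", "5"]]
records = table_to_records(header, rows)
print(json.dumps(records, indent=2))
```

Records are convenient when downstream code looks fields up by name instead of by column position.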
Searching Table Data
Row Embedding
When --embed-rows is enabled (default), individual table rows are embedded for semantic search:
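One common way to make a row searchable is to turn it into a short text string that pairs each value with its column header, and then embed that string. The sketch below shows only this text-building step; the `rows_to_texts` helper and the "Header: value" format are assumptions for illustration, not Memvid's actual representation.

```python
from typing import List


def rows_to_texts(header: List[str], rows: List[List[str]]) -> List[str]:
    """Build one 'Header: value; ...' string per row, skipping empty cells."""
    texts = []
    for row in rows:
        pairs = [f"{h}: {v}" for h, v in zip(header, row) if v.strip()]
        texts.append("; ".join(pairs))
    return texts


header = ["Item", "Qty", "Price"]
rows = [["Widget", "2", "9.99"], ["Gadget", "", "24.50"]]
for text in rows_to_texts(header, rows):
    print(text)
# Item: Widget; Qty: 2; Price: 9.99
# Item: Gadget; Price: 24.50
```

Including the header in each string lets a query such as "price of Gadget" match a single row without needing the rest of the table for context.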
Searching Table Content
Use Cases
Invoice Processing
Financial Reports
Research Papers
Payroll/HR Documents
Quality Scoring
Each extracted table receives a quality score based on:

| Factor | Description |
|---|---|
| Structure consistency | Regular row/column counts |
| Cell alignment | Properly aligned content |
| Header detection | Clear header row identified |
| Empty cells | Low percentage of empty cells |
| Content coherence | Related data in columns |
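To make two of these factors concrete, here is a toy score built from structure consistency and the share of non-empty cells. The equal weighting and the function names are assumptions chosen for the example; the source does not say how Memvid combines the factors.

```python
from collections import Counter
from typing import List

Table = List[List[str]]


def structure_consistency(table: Table) -> float:
    """Fraction of rows whose length equals the most common row length."""
    if not table:
        return 0.0
    lengths = Counter(len(row) for row in table)
    most_common_count = lengths.most_common(1)[0][1]
    return most_common_count / len(table)


def filled_ratio(table: Table) -> float:
    """Fraction of cells that contain non-whitespace text."""
    cells = [cell for row in table for cell in row]
    if not cells:
        return 0.0
    return sum(1 for cell in cells if cell.strip()) / len(cells)


def toy_quality_score(table: Table) -> float:
    """Equal-weight blend of the two factors (weights are an assumption)."""
    return 0.5 * structure_consistency(table) + 0.5 * filled_ratio(table)


clean = [["Name", "Qty"], ["Widget", "2"]]
ragged = [["Name", "Qty"], ["Widget", "2"], ["Gadget", ""], ["Stray"]]
print(toy_quality_score(clean))   # 1.0
print(toy_quality_score(ragged) < toy_quality_score(clean))  # True
```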
Scores fall into three levels:

| Level | Score | Description |
|---|---|---|
| high | 0.8 - 1.0 | Reliable, well-structured |
| medium | 0.5 - 0.8 | Usable, may need review |
| low | 0.0 - 0.5 | Possible extraction errors |
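The level table can be read as a simple threshold function. The ranges as listed share their boundary values (0.5 and 0.8), so this sketch assumes a boundary score belongs to the higher level; that choice and the `quality_level` name are assumptions, not documented behavior.

```python
def quality_level(score: float) -> str:
    """Map a score in [0.0, 1.0] to 'high', 'medium', or 'low'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    if score >= 0.8:   # assumption: 0.8 itself counts as high
        return "high"
    if score >= 0.5:   # assumption: 0.5 itself counts as medium
        return "medium"
    return "low"


for s in (0.92, 0.8, 0.65, 0.3):
    print(s, quality_level(s))
# 0.92 high
# 0.8 high
# 0.65 medium
# 0.3 low
```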
Handling Edge Cases
Merged Cells
Merged cells are expanded to fill all covered positions:
Multi-Page Tables
Tables spanning multiple pages are detected and merged when possible:
Nested Tables
Nested tables are extracted as separate tables, each with a reference to its parent:
Rotated/Sideways Tables
Landscape-oriented tables are automatically detected and rotated:
Performance Tips
Large PDFs
For PDFs with many pages:
Batch Processing
For many PDFs:
Memory Usage
Table extraction can be memory-intensive for complex PDFs:
Troubleshooting
No Tables Detected
Poor Quality Extraction
Missing Rows/Columns
Limitations
| Limitation | Workaround |
|---|---|
| Scanned PDFs | Use OCR preprocessing first |
| Complex nested tables | May extract as multiple tables |
| Very small text | Increase DPI in source |
| Decorative borders | Use stream method |
| Non-standard layouts | Use aggressive mode |