Invoice Parser
Upload a PDF invoice and get structured JSON back: vendor name, address, line items, subtotal, tax, and total. Rule-based extraction runs first, with an LLM fallback for unstructured layouts. Every field is validated against a strict output schema.
Extraction
Rule-based + LLM
Accuracy
94%
Output
Validated JSON
Security
Sandboxed + rate-limited
How it works
Pipeline Architecture
Three-tier extraction strategy. Tier 1 uses direct text extraction with layout analysis and handles 60% of invoices. Tier 2 applies regex patterns for common invoice formats and handles another 28%. Tier 3 sends the text to an LLM with a constrained output schema for the remaining 12%. Tiers are tried in order, so the cheapest, fastest method always runs first.
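A minimal sketch of the try-in-order idea, assuming each tier is a function that returns a dict of fields or None when it cannot handle the text. The three tier functions, their patterns, and the TIERS list are hypothetical stand-ins for illustration, not the project's actual extractors.

```python
import re
from typing import Callable, Optional

# An extractor returns a dict of fields, or None when it cannot handle the text.
Extractor = Callable[[str], Optional[dict]]


def layout_tier(text: str) -> Optional[dict]:
    # Hypothetical Tier 1 stand-in: only accepts a cleanly labelled "TOTAL DUE" line.
    match = re.search(r"^TOTAL DUE\s+(\d+\.\d{2})$", text, re.MULTILINE)
    return {"total": match.group(1)} if match else None


def regex_tier(text: str) -> Optional[dict]:
    # Hypothetical Tier 2 stand-in: one looser pattern for a "Total: $x" format.
    match = re.search(r"total[:\s]+\$?(\d+\.\d{2})", text, re.IGNORECASE)
    return {"total": match.group(1)} if match else None


def llm_tier(text: str) -> Optional[dict]:
    # Hypothetical Tier 3 stand-in: a real version would call an LLM with a
    # constrained output schema. Here it only flags the invoice for that path.
    return {"total": None, "needs_llm": True}


# Cheapest tier first, most expensive last.
TIERS: list[tuple[str, Extractor]] = [
    ("layout", layout_tier),
    ("regex", regex_tier),
    ("llm", llm_tier),
]


def extract_with_fallback(text: str) -> tuple[str, dict]:
    # Try each tier in order and stop at the first one that returns a result.
    for name, extractor in TIERS:
        result = extractor(text)
        if result is not None:
            return name, result
    raise ValueError("no tier could extract the invoice")
```

Returning the tier name alongside the result makes it easy to log which tier handled each invoice, which is how per-tier coverage figures can be measured.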
Extraction Engine
Core logic normalises every invoice into a standard schema regardless of format. Line-item parsing, tax calculation, and currency detection all happen before validation. Output is Pydantic-validated with cross-checks: line item totals must sum to the subtotal, and subtotal + tax must equal the total.
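The two cross-checks can be sketched as below. To stay dependency-free this uses standard-library dataclasses and Decimal rather than Pydantic, so it is a stand-in for the validation described above, not the project's model. The field names and the one-cent rounding tolerance are assumptions.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LineItem:
    description: str
    quantity: Decimal
    unit_price: Decimal

    @property
    def amount(self) -> Decimal:
        return self.quantity * self.unit_price


@dataclass(frozen=True)
class Invoice:
    vendor: str
    line_items: tuple[LineItem, ...]
    subtotal: Decimal
    tax: Decimal
    total: Decimal


def cross_check(invoice: Invoice, tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return a list of failed checks; an empty list means the invoice is consistent."""
    errors = []
    # Check 1: line item amounts must sum to the subtotal.
    items_sum = sum((item.amount for item in invoice.line_items), Decimal("0"))
    if abs(items_sum - invoice.subtotal) > tolerance:
        errors.append(f"line items sum to {items_sum}, subtotal is {invoice.subtotal}")
    # Check 2: subtotal + tax must equal the total.
    if abs(invoice.subtotal + invoice.tax - invoice.total) > tolerance:
        errors.append(f"subtotal + tax is {invoice.subtotal + invoice.tax}, total is {invoice.total}")
    return errors
```

Decimal is used instead of float so that sums of currency amounts compare exactly rather than drifting by binary rounding error.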
Output Validation & Confidence
Every field gets a confidence score. Results with overall confidence below 0.80 are routed to a review queue rather than auto-accepted. Human corrections feed back into pattern matching, improving Tier 2 regex coverage over time.
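A minimal sketch of the routing rule. The 0.80 threshold comes from the description above. How per-field scores combine into an overall confidence is not stated, so taking the minimum (the weakest field decides) is an assumption, as are the function and queue names.

```python
REVIEW_THRESHOLD = 0.80  # from the description: below this, route to review


def overall_confidence(field_scores: dict[str, float]) -> float:
    # Assumption: the overall score is the weakest field's score.
    # An invoice with no scored fields is treated as zero confidence.
    if not field_scores:
        return 0.0
    return min(field_scores.values())


def route(field_scores: dict[str, float]) -> str:
    # Strictly below the threshold goes to the review queue.
    if overall_confidence(field_scores) < REVIEW_THRESHOLD:
        return "review"
    return "auto_accept"
```

Using the minimum is the conservative choice: one doubtful total is enough to send the whole invoice to a human, where a mean could hide it behind several confident fields.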
Security
- File validation by magic bytes, not file extension
- Subprocess isolation for PDF parsing
- LLM prompt injection defence with XML delimiters
- Rate limiting: 5 requests/min per IP
- No long-term file persistence: files are processed, then discarded
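Two of the measures above, sketched under stated assumptions. The magic-byte check relies on the fact that PDF files begin with the bytes %PDF-; checking strictly at offset 0 is a simplification. The limiter uses the 5 requests per minute from the list, but the sliding-window approach, the in-memory store, and all names are assumptions, not the project's implementation.

```python
import time
from collections import defaultdict, deque

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    # Inspect the file's leading bytes instead of trusting its extension.
    return data.startswith(PDF_MAGIC)


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` for each IP."""

    def __init__(self, limit: int = 5, window_seconds: float = 60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock  # injectable so tests can control time
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, ip: str) -> bool:
        now = self.clock()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

An in-memory limiter like this only works for a single process. A deployment with several workers would need a shared store for the counters.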
Want something like this built for your business?
I'll look at your problem, figure out the right approach, and ship working software. No slideshows.
Book a free consultation