PythonpdfplumberLLMAutomation

Invoice Parser

Upload a PDF invoice, get structured JSON back: vendor name, address, line items, subtotal, tax, and total. Uses rule-based extraction first, falls back to an LLM for unstructured layouts. Every field validated against a strict output schema.

Extraction

Rule-based + LLM

Accuracy

94%

Output

Validated JSON

Security

Sandboxed + rate-limited

How it works

1PDF Upload
2Text Extract
3Layout Analysis
4Regex Patterns
5LLM Fallback
6Pydantic Validation
7JSON Output

Pipeline Architecture

Three-tier extraction strategy. Tier 1 uses direct text extraction with layout analysis and handles 60% of invoices. Tier 2 applies regex patterns for common invoice formats (28%). Tier 3 sends text to an LLM with a constrained output schema for the remaining 12%. Each tier is tried in order — the cheapest, fastest method runs first.

Extraction Engine

Core logic normalises every invoice into a standard schema regardless of format. Line items, tax calculations, and currency detection all happen before validation. Output is Pydantic-validated with cross-checks: line item totals must sum to subtotal, subtotal + tax must equal total.

Output Validation & Confidence

Every field gets a confidence score. Results with overall confidence below 0.80 are routed to a review queue rather than auto-accepted. Human corrections feed back into pattern matching, improving Tier 2 regex coverage over time.

Security

  • File validation by magic bytes, not file extension
  • Subprocess isolation for PDF parsing
  • LLM prompt injection defence with XML delimiters
  • Rate limiting: 5 requests/min per IP
  • No long-term file persistence — processed and discarded

Want something like this built for your business?

I'll look at your problem, figure out the right approach, and ship working software. No slideshows.

Book a free consultation