PythonpdfplumberLLMAutomation

Invoice Parser

Upload a PDF invoice, get structured JSON back: vendor name, address, line items, subtotal, tax, and total. Uses rule-based extraction first, falls back to an LLM for unstructured layouts. Every field validated against a strict output schema.

Paper receipts feeding through a metallic funnel into a glowing purple data cube, representing PDF invoice parsing into structured JSON.

Extraction

Rule-based + LLM

Accuracy

94%

Output

Validated JSON

Security

Sandboxed + rate-limited

How it works

1PDF Upload

→

2Text Extract

→

3Layout Analysis

→

4Regex Patterns

→

5LLM Fallback

→

6Pydantic Validation

→

7JSON Output

Pipeline Architecture

Three-tier extraction strategy. Tier 1 uses direct text extraction with layout analysis and handles 60% of invoices. Tier 2 applies regex patterns for common invoice formats (28%). Tier 3 sends text to an LLM with a constrained output schema for the remaining 12%. Each tier is tried in order - the cheapest, fastest method runs first.

Extraction Engine

Core logic normalises every invoice into a standard schema regardless of format. Line items, tax calculations, and currency detection all happen before validation. Output is Pydantic-validated with cross-checks: line item totals must sum to subtotal, subtotal + tax must equal total.

Output Validation & Confidence

Every field gets a confidence score. Results with overall confidence below 0.80 are routed to a review queue rather than auto-accepted. Human corrections feed back into pattern matching, improving Tier 2 regex coverage over time.

Security

File validation by magic bytes, not file extension
Subprocess isolation for PDF parsing
LLM prompt injection defence with XML delimiters
Rate limiting: 5 requests/min per IP
No long-term file persistence - processed and discarded

Related service

This project demonstrates the kind of work I do under AI Automation.

Learn more →

← Back to all projects

Want something like this built for your business?

I'll look at your problem, figure out the right approach, and ship working software. No slideshows.

Book a free consultation