
PDF parsing in 2026: what actually works


A practical comparison of PDF parsing tools and approaches in 2026. From pdfplumber to vision models, here's what works for invoices, contracts, and forms.


PDFs are still the worst part of any document automation project. It's 2026 and businesses still receive invoices, contracts, purchase orders, and forms as PDF attachments. From the outside, a PDF looks like a structured document. From the inside, it's a bag of positioned text fragments, embedded fonts, and sometimes just a scanned image.

I've spent a lot of time working through the PDF parsing landscape to figure out what actually works. Here's the honest breakdown.

The three types of PDF you'll encounter

Not all PDFs are created equal, and the type determines your approach entirely.

Type 1: Text-based, well-structured

The PDF was generated digitally (exported from accounting software, generated by an ERP system). The text is selectable, tables have consistent columns, and the layout follows a template.

Best approach: pdfplumber or PyMuPDF. These libraries can extract text with position data, identify table structures, and give you clean output in most cases. Fast, free, no API calls needed.

Type 2: Text-based, inconsistent

The PDF has selectable text but the layout varies wildly. Different suppliers use different invoice templates. Columns shift, headers change, line items appear in unexpected places.

Best approach: pdfplumber for extraction, then a model for classification and normalisation. You extract the raw text and positions, then use a fine-tuned classifier or an LLM to map the messy output to your target schema.
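As a sketch of that second step, here's the shape of the normalisation pass, with regex heuristics standing in for the classifier or LLM call. The field names and patterns are illustrative, not a real schema:

```python
import re

# Hypothetical target schema: map messy extracted text to these fields.
# In production this mapping would be a fine-tuned classifier or an LLM
# call; two regex heuristics stand in for it here.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"(?:invoice|inv)\s*(?:no\.?|#|number)?\s*[:\-]?\s*([A-Z0-9\-]+)", re.I
    ),
    "total": re.compile(
        r"(?:total|amount due)\s*[:\-]?\s*[$£€]?\s*([\d,]+\.\d{2})", re.I
    ),
}

def normalise(raw_text: str) -> dict:
    """Map raw pdfplumber output to the target schema, None where no match."""
    result = {}
    for field, pattern in FIELD_PATTERNS.items():
        m = pattern.search(raw_text)
        result[field] = m.group(1) if m else None
    return result
```

In practice the regex tier handles the suppliers you've profiled, and anything it can't match gets handed to the model.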

Type 3: Scanned or image-based

The PDF is essentially a photograph. No selectable text, no structure. Someone scanned a paper document or the PDF was generated from an image.

Best approach: OCR first (Tesseract, Google Vision API, or AWS Textract), then the same classification step as Type 2. Quality depends heavily on scan quality. Faded ink, skewed pages, and handwriting all make this harder.
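A minimal sketch of that pipeline using the Tesseract route, assuming Tesseract and poppler are installed locally (pdf2image and pytesseract are the usual Python wrappers). The cleanup helper is a hypothetical post-processing step, not part of either library:

```python
def ocr_pdf(path: str) -> list:
    """Return one OCR'd text string per page of a scanned PDF.
    Assumes Tesseract and poppler are installed on the machine."""
    from pdf2image import convert_from_path  # third-party poppler wrapper
    import pytesseract
    pages = convert_from_path(path, dpi=300)  # higher DPI helps faded scans
    return [pytesseract.image_to_string(img) for img in pages]

def clean_ocr_text(text: str) -> str:
    """Collapse the stray blank lines and trailing whitespace OCR tends to emit."""
    lines = [ln.strip() for ln in text.splitlines()]
    return "\n".join(ln for ln in lines if ln)
```

The cleaned text then flows into the same classification step as Type 2.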

The tools, ranked by practicality

pdfplumber (Python)

My default starting point. Extracts text with bounding box positions, can identify tables automatically, handles most well-structured PDFs reliably. Free, open source, no API dependency.

Good for: Invoices, reports, forms from known templates. Bad for: Scanned documents, heavily formatted marketing PDFs.

import pdfplumber

# Extract the first page of a digitally generated PDF
with pdfplumber.open("invoice.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()  # list of detected tables, as rows of cells
    text = page.extract_text()      # raw page text in reading order

PyMuPDF (Python)

Faster than pdfplumber, better at handling large files. Gives lower-level access to the PDF structure. The text extraction is solid but the table detection isn't as good out of the box.

Good for: High-volume processing, large PDFs, when speed matters. Bad for: Table extraction without custom logic.
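A sketch of what that custom logic can look like: PyMuPDF's page.get_text("blocks") returns positioned text blocks as (x0, y0, x1, y1, text, ...) tuples, and sorting them yourself recovers reading order. The helper below assumes that tuple layout:

```python
def extract_pages(path: str) -> list:
    """Fast whole-document text extraction with PyMuPDF (import name: fitz)."""
    import fitz  # third-party: pip install pymupdf
    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]

def reading_order(blocks: list) -> list:
    """Sort (x0, y0, x1, y1, text, ...) tuples from page.get_text("blocks")
    into top-to-bottom, left-to-right reading order. Rounding y0 groups
    blocks that sit on roughly the same line."""
    return [b[4] for b in sorted(blocks, key=lambda b: (round(b[1]), b[0]))]
```

For table extraction you'd build on the same positional data, clustering blocks into columns by x-coordinate.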

Google Document AI

Cloud API that handles all three PDF types. Good OCR, reasonable table extraction, and it can learn custom document types. The pricing is per-page and adds up at volume.

Good for: Mixed document types, scanned PDFs, when you need a managed service. Bad for: Tight budgets, offline processing, data sensitivity concerns.

AWS Textract

Similar to Google Document AI but with stronger table and form extraction out of the box. The AnalyzeDocument API returns structured key-value pairs for forms. Pricing is also per-page.

Good for: Forms with labelled fields, tables, US tax documents. Bad for: Same budget and privacy concerns as Google.
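For illustration, here's roughly what that call looks like with boto3, plus a helper that joins a block's WORD children into a string. The helper is one simplified piece of full key-value pairing, which also needs to follow each key's VALUE relationship:

```python
def analyze_form(pdf_bytes: bytes) -> dict:
    """Call Textract's AnalyzeDocument API; requires AWS credentials.
    Returns the raw response (a dict with a "Blocks" list)."""
    import boto3  # third-party AWS SDK
    client = boto3.client("textract")
    return client.analyze_document(
        Document={"Bytes": pdf_bytes},
        FeatureTypes=["FORMS", "TABLES"],
    )

def block_text(block: dict, blocks_by_id: dict) -> str:
    """Join the WORD children of a Textract block into one string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for cid in rel["Ids"]:
                child = blocks_by_id[cid]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)
```

blocks_by_id is just {b["Id"]: b for b in response["Blocks"]}, built once per response.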

Vision models (GPT-4o, Gemini, Claude)

Send the PDF page as an image to a multimodal LLM and ask it to extract the data. This is the "nuclear option": it works on basically anything, but it's slow and expensive per page.

Good for: The 5% of documents that nothing else can handle. Handwritten notes, complex layouts, mixed content. Bad for: High volume. At $0.01-0.03 per page, processing 10,000 invoices a month gets expensive.

Tip

The best approach is usually a cascade: try pdfplumber first (free, fast). If it fails to extract structured data, fall back to a cloud API or vision model. This keeps costs low while handling edge cases.
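The cascade itself is a few lines once your parsers share an interface. A sketch, assuming each parser is a callable that returns a dict on success and None on failure:

```python
def cascade(parsers, document):
    """Try each parser in order of cost; return the first non-None result,
    or None if every tier fails (route to manual review)."""
    for parse in parsers:
        try:
            result = parse(document)
        except Exception:
            continue  # a parser crash just means "move to the next tier"
        if result is not None:
            return result
    return None
```

You'd call it as cascade([pdfplumber_parser, cloud_api_parser, vision_model_parser], path), with the list ordered cheapest first.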

The real problems aren't the tools

The tools work fine for most individual PDFs. The hard part is building a system that handles variation at scale.

Template detection. If you're processing invoices from 50 different suppliers, each with a different layout, you need a way to detect which template you're looking at and apply the right extraction logic. I maintain a small library of parser profiles: one per common template, plus a fallback "generic" parser that uses an LLM.
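A minimal sketch of such a profile registry, assuming each supplier's template can be recognised by a distinctive string on the first page. The fingerprint idea is my own convention, not a library feature:

```python
PROFILES = {}  # fingerprint substring -> parser function

def register(fingerprint):
    """Decorator: associate a parser with a template fingerprint."""
    def deco(fn):
        PROFILES[fingerprint.lower()] = fn
        return fn
    return deco

def pick_parser(first_page_text, fallback):
    """Return the registered parser whose fingerprint appears on the page,
    or the generic (e.g. LLM-backed) fallback if none match."""
    text = first_page_text.lower()
    for fingerprint, parser in PROFILES.items():
        if fingerprint in text:
            return parser
    return fallback
```

Each registered parser then encodes the column positions and labels of one known template.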

Validation. Extracted data needs checking. Do the line items add up to the total? Is the date in a reasonable range? Does the supplier name match what's in your system? Automated validation catches extraction errors before they propagate.
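For example, the first two of those checks in code. The field names are assumptions about your target schema, not a standard:

```python
from datetime import date

def validate_invoice(inv: dict) -> list:
    """Return a list of validation errors; an empty list means the
    extraction passes and can flow onward without review."""
    errors = []
    total = sum(item["amount"] for item in inv.get("line_items", []))
    if abs(total - inv.get("total", 0)) > 0.01:  # tolerate float rounding
        errors.append(f"line items sum to {total}, header says {inv.get('total')}")
    d = inv.get("date")
    if d and not (date(2000, 1, 1) <= d <= date.today()):
        errors.append(f"date {d} outside reasonable range")
    return errors
```

Anything that returns a non-empty error list goes to the review queue rather than into your accounting system.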

Error handling. Some PDFs will fail. The page is blank, the scan is illegible, the format is completely novel. The system needs to gracefully flag these for manual review instead of silently producing garbage.

Feedback loops. When a human corrects an extraction error, that correction should improve future parsing. This can be as simple as updating a template profile or as complex as fine-tuning a model on corrected examples.

My recommended stack for document processing

For most small business automation projects, this stack covers 95% of cases:

  1. pdfplumber for initial text and table extraction
  2. Tesseract (via pytesseract) for OCR on scanned pages
  3. A small text classifier to identify document type and template
  4. Template-specific parsers for common formats
  5. An LLM fallback (Gemini Flash or similar) for documents that don't match any template
  6. Validation rules to catch extraction errors
  7. A review queue for failed or low-confidence extractions

This cascade keeps the per-document cost close to zero for the majority of documents, only hitting the expensive LLM layer for the tricky ones.

Key Takeaways

  • Know your PDF type: text-based/structured, text-based/inconsistent, or scanned. The approach differs completely.
  • pdfplumber is the best free starting point for text-based PDFs in Python.
  • Use a cascade: try cheap extraction first, fall back to cloud APIs or LLMs for edge cases.
  • The hard part isn't extracting one PDF. It's handling variation across hundreds of different templates.
  • Always validate extracted data and build a review queue for failures.

Working on a document processing project?

PDF extraction is usually one piece of a bigger automation pipeline. If you're trying to automate invoice processing, contract review, or form data entry, get in touch. I can help you figure out the right approach for your document types and volume.
