Document Extraction Benchmark

In today's data-driven world, we're all swimming in documents. From complex PDFs and scanned images to standard DOCX files, the challenge isn't just storing them; it's getting the information out accurately and efficiently. This is a crucial step for any data processing pipeline.

To find the best tool for the job, we conducted a comprehensive comparative analysis of various document extraction packages. We tested their ability to convert a diverse set of documents into Markdown, focusing on three key metrics: content quality, accuracy, and the preservation of the original structure and formatting.

This report evaluates popular open-source and proprietary packages to give you a clear comparison of their strengths and weaknesses, helping you choose the best tool for your specific needs.

We put nine popular packages to the test:

Docling
PDFPlumber
PyMuPDF
PDFMiner
Tesseract
EasyOCR
PaddleOCR
PythonDocx
Mammoth

To see how they would compare, we created a set of 50 documents covering five common real-world scenarios:

Complex Scanned PDF: A scanned exam registration document with tables and a multi-column layout.
PDF with Mixed Content: A report containing text, tables, and images.
Certificate PDF: A visually rich PDF of a certificate.
Invoice Image: A PNG screenshot of a sales invoice.
Standard DOCX: A technical manual with standard formatting.

Part 1: Overall Performance

First, we looked at the overall performance based on average extraction time across all 50 documents. When it comes to pure speed, the results are stark.

Table 1: Overall Performance Metrics

Extractor	Avg. Time (s)
Docling	8.678
PDFPlumber	0.332
PyMuPDF	0.480
PDFMiner	0.127
Tesseract	1.612
EasyOCR	8.282
PaddleOCR	21.515
PythonDocx	0.050
Mammoth	0.097

The specialized DOCX and text-based PDF tools all averaged well under half a second per document, while the OCR-based tools took significantly longer, from about 1.6 seconds (Tesseract) to more than 21 seconds (PaddleOCR).
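For readers who want to reproduce this kind of measurement, here is a minimal sketch of a timing harness. It is not the harness used for this benchmark. It assumes each package is wrapped in a hypothetical callable that takes a file path and returns the extracted text, and the name average_extraction_time is ours.

```python
import time
from statistics import mean
from typing import Callable, Dict, Iterable, List

# Hypothetical type: an extractor is any callable that takes a file path
# and returns the extracted text (for example, as Markdown).
Extractor = Callable[[str], str]


def average_extraction_time(
    extractors: Dict[str, Extractor], paths: Iterable[str]
) -> Dict[str, float]:
    """Return the mean wall-clock seconds per document for each extractor."""
    paths = list(paths)
    results: Dict[str, float] = {}
    for name, extract in extractors.items():
        timings: List[float] = []
        for path in paths:
            start = time.perf_counter()
            try:
                extract(path)
            except Exception:
                # Assumption: a failed extraction still contributes its
                # elapsed time. The benchmark's actual policy is not stated.
                pass
            timings.append(time.perf_counter() - start)
        # Guard against an empty document list, where mean() would raise.
        results[name] = mean(timings) if timings else float("nan")
    return results
```

Wall-clock averages like these depend heavily on hardware and on whether model loading is included in the timed region, so treat absolute numbers as indicative rather than portable.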

Part 2: Detailed Analysis by Document Type

Speed isn't everything. A fast extraction is useless if the output is wrong. Here’s how the packages handled our specific challenges.

The PDF Gauntlet: Scanned, Mixed, and Stylized

Performance on PDFs varied significantly based on the file's complexity.

Complex Scanned PDF: This proved to be a major hurdle for non-OCR tools.
- Docling, Tesseract, and EasyOCR successfully extracted the text, though with some character recognition and formatting errors. Docling notably managed to preserve some of the tabular structure.
- PDFPlumber, PyMuPDF, and PDFMiner failed to extract any meaningful content, as they simply aren't designed for scanned, image-based PDFs.
- PaddleOCR did extract text, but its accuracy was lower than the other OCR tools.
PDF with Mixed Content (Text, Tables, Images): This document was handled well by most packages.
- Docling, PDFPlumber, PyMuPDF, and PDFMiner all extracted the text with high fidelity.
- PDFPlumber and PyMuPDF also accurately extracted the tables.
- Tesseract and EasyOCR also performed well, showing they can handle text-based PDFs too.
Certificate PDF: This test focused on non-standard layouts and stylized text.
- All packages successfully extracted the main text content.
- Docling, PDFPlumber, and PyMuPDF were the most successful at preserving the certificate's layout and formatting.
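The scanned-PDF result above suggests a common pattern: try a fast text-layer extractor first, and fall back to OCR only when it returns almost nothing. The sketch below illustrates that idea under stated assumptions. Both extractors are hypothetical callables you would supply yourself, and the min_chars threshold is an arbitrary illustrative value, not something we measured.

```python
from typing import Callable

# Hypothetical type: any callable mapping a file path to extracted text.
Extractor = Callable[[str], str]


def extract_with_ocr_fallback(
    path: str,
    text_extractor: Extractor,
    ocr_extractor: Extractor,
    min_chars: int = 50,
) -> str:
    """Use the text-layer extractor unless it finds too little text."""
    text = text_extractor(path)
    # Assumption: fewer than min_chars non-whitespace-trimmed characters
    # means the PDF is probably scanned (image-only), so OCR is needed.
    if len(text.strip()) >= min_chars:
        return text
    return ocr_extractor(path)
```

With this shape, text-based PDFs keep the sub-second speed of the text-layer tools, and only image-based files pay the OCR cost.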

The Image-Only Test: PNG Invoice

This test required pure OCR capabilities to extract data from an invoice image.

Tesseract and EasyOCR provided the most accurate extraction. EasyOCR was slightly better at recognizing the line items and prices.
Docling also performed well, demonstrating a robust OCR engine.
PaddleOCR struggled significantly with this document, producing an output that was largely inaccurate and unusable.

The DOCX Showdown: Standard Technical Manual

As you might expect, the packages designed specifically for DOCX files performed flawlessly.

PythonDocx and Mammoth both excelled, extracting the text with perfect accuracy. They also preserved the original formatting, including headings and lists.
Docling also handled the DOCX file effectively, providing a clean Markdown output.

Part 3: The Verdict and Recommendations

This benchmark makes one thing clear: the best document extraction package depends on your specific use case and the types of documents you process.

To help you decide, here is a summary of our findings:

Table 2: Summary of extractor performance, based on benchmark report data.

Extractor	Success Rate	Avg. Time (s)	Best Use Case
PythonDocx	100.0%	0.050	.docx files with tables
Mammoth	100.0%	0.097	.docx files (text-only)
PDFMiner	66.7%	0.127	Basic text extraction from PDFs
PDFPlumber	66.7%	0.332	Text-based PDFs with tables
PyMuPDF	66.7%	0.480	Fast text extraction from PDFs
Tesseract	100.0%	1.612	Scanned documents and images (fastest OCR)
EasyOCR	100.0%	8.282	Scanned documents (high accuracy OCR)
Docling	100.0%	8.678	Versatile all-in-one for mixed document types
PaddleOCR	100.0%	21.515	Not Recommended (failed quality tests)
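If you want to work with these numbers directly, the short sketch below copies the success rates and average times from Table 2 into a list and derives two things: relative slowdowns between packages, and the fully successful packages ordered by speed. The helper names slowdown and fully_successful_by_speed are ours, chosen for illustration.

```python
# (name, success rate in %, average seconds), copied from Table 2.
TABLE_2 = [
    ("PythonDocx", 100.0, 0.050),
    ("Mammoth", 100.0, 0.097),
    ("PDFMiner", 66.7, 0.127),
    ("PDFPlumber", 66.7, 0.332),
    ("PyMuPDF", 66.7, 0.480),
    ("Tesseract", 100.0, 1.612),
    ("EasyOCR", 100.0, 8.282),
    ("Docling", 100.0, 8.678),
    ("PaddleOCR", 100.0, 21.515),
]


def slowdown(name, baseline, rows=TABLE_2):
    """How many times slower `name` is than `baseline`, by average time."""
    times = {n: t for n, _, t in rows}
    return times[name] / times[baseline]


def fully_successful_by_speed(rows=TABLE_2):
    """Names of packages with a 100% success rate, fastest first."""
    ordered = sorted(rows, key=lambda row: row[2])
    return [n for n, rate, _ in ordered if rate == 100.0]


print(round(slowdown("EasyOCR", "Tesseract"), 1))    # roughly 5x slower
print(round(slowdown("PaddleOCR", "Tesseract"), 1))  # roughly 13x slower
print(fully_successful_by_speed())
```

Keep in mind that success rate alone does not capture output quality: PaddleOCR shows 100% here yet is not recommended because of its quality results.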

Key Takeaways:

The All-Rounder: Docling stands out as a versatile and robust solution that performs well across all tested document types (DOCX, scanned PDFs, text PDFs, and images). This makes it a strong contender for a general-purpose document extraction tool.
The Specialists: If your task is highly specific, a specialized library is better.
- For high-speed DOCX processing, PythonDocx is the clear winner.
- For fast extraction from text-based PDFs (especially with tables), PDFPlumber is an excellent choice.
- For high-accuracy OCR on scanned images, EasyOCR had a slight edge, while Tesseract provided the fastest OCR performance.
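The takeaways above can be sketched as a small routing function that returns the name of the package we would reach for first. This is only an illustration of the recommendations, not a tested dispatcher. The function name, the is_scanned and prefer_speed flags, and the .jpg/.jpeg extensions are our assumptions; only PNG images were part of the benchmark.

```python
from pathlib import Path

# Assumption: .jpg and .jpeg are treated like the PNG test case.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def recommend_extractor(
    path: str, is_scanned: bool = False, prefer_speed: bool = False
) -> str:
    """Return the name of the package suggested by the takeaways."""
    suffix = Path(path).suffix.lower()
    # Tesseract was the fastest OCR; EasyOCR had a slight accuracy edge.
    ocr_choice = "Tesseract" if prefer_speed else "EasyOCR"

    if suffix == ".docx":
        return "PythonDocx"
    if suffix in IMAGE_SUFFIXES:
        return ocr_choice
    if suffix == ".pdf":
        # Scanned PDFs need OCR; text-based PDFs go to PDFPlumber.
        return ocr_choice if is_scanned else "PDFPlumber"
    # Anything else falls back to the all-rounder.
    return "Docling"
```

For example, recommend_extractor("invoice.png", prefer_speed=True) returns "Tesseract", while a mixed or unknown input falls through to "Docling".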
