Document Extraction Benchmark
In today's data-driven world, we're all swimming in documents. From complex PDFs and scanned images to standard DOCX files, the challenge isn't just storing them; it's getting the information out accurately and efficiently. This is a crucial step for any data processing pipeline.
To find the best tool for the job, we conducted a comprehensive comparative analysis of various document extraction packages. We tested their ability to convert a diverse set of documents into Markdown, focusing on three key metrics: content quality, accuracy, and the preservation of the original structure and formatting.
This report evaluates popular open-source and proprietary packages to give you a clear comparison of their strengths and weaknesses, helping you choose the best tool for your specific needs.
We put nine popular packages to the test:
Docling
PDFPlumber
PyMuPDF
PDFMiner
Tesseract
EasyOCR
PaddleOCR
PythonDocx
Mammoth
To compare them, we assembled a set of 50 documents representing five common real-world scenarios:
Complex Scanned PDF: A scanned exam registration document with tables and a multi-column layout.
PDF with Mixed Content: A report containing text, tables, and images.
Certificate PDF: A visually rich PDF of a certificate.
Invoice Image: A PNG screenshot of a sales invoice.
Standard DOCX: A technical manual with standard formatting.
Part 1: Overall Performance
First, we looked at the overall performance based on average extraction time across all 50 documents. When it comes to pure speed, the results are stark.
Table 1: Overall Performance Metrics
| Extractor | Avg. Time (s) |
| --- | --- |
| Docling | 8.678 |
| PDFPlumber | 0.332 |
| PyMuPDF | 0.480 |
| PDFMiner | 0.127 |
| Tesseract | 1.612 |
| EasyOCR | 8.282 |
| PaddleOCR | 21.515 |
| PythonDocx | 0.050 |
| Mammoth | 0.097 |
The specialized DOCX and text-based PDF tools all averaged under half a second per document, while the OCR-capable tools took considerably longer, from about 1.6 seconds (Tesseract) to about 21.5 seconds (PaddleOCR).
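For readers who want to reproduce this kind of measurement, here is a minimal sketch of how an average extraction time could be computed. It is not the harness used for this report: the function name `average_extraction_time` and the `extract` callable are hypothetical, and the sketch assumes each package has been wrapped in a function that takes a file path and returns the extracted text.

```python
import time
from statistics import mean
from typing import Callable, Iterable


def average_extraction_time(extract: Callable[[str], str], paths: Iterable[str]) -> float:
    """Return the mean wall-clock seconds that `extract` takes per document.

    `extract` is a hypothetical wrapper around one extraction package.
    Raises statistics.StatisticsError if `paths` is empty.
    """
    durations = []
    for path in paths:
        start = time.perf_counter()
        extract(path)  # output is discarded; only the elapsed time is recorded
        durations.append(time.perf_counter() - start)
    return mean(durations)
```

Calling it once per package over the same list of files would give numbers comparable in spirit to Table 1, although absolute timings depend on hardware, model loading, and caching.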

Part 2: Detailed Analysis by Document Type
Speed isn't everything. A fast extraction is useless if the output is wrong. Here’s how the packages handled our specific challenges.
The PDF Gauntlet: Scanned, Mixed, and Stylized
Performance on PDFs varied significantly based on the file's complexity.
Complex Scanned PDF: This proved to be a major hurdle for non-OCR tools.
Docling, Tesseract, and EasyOCR successfully extracted the text, though with some character recognition and formatting errors. Docling notably managed to preserve some of the tabular structure.
PDFPlumber, PyMuPDF, and PDFMiner failed to extract any meaningful content, as they simply aren't designed for scanned, image-based PDFs.
PaddleOCR did extract text, but its accuracy was lower than the other OCR tools.
PDF with Mixed Content (Text, Tables, Images): This document was handled well by most packages.
Docling, PDFPlumber, PyMuPDF, and PDFMiner all extracted the text with high fidelity.
PDFPlumber and PyMuPDF also accurately extracted the tables.
Tesseract and EasyOCR also performed well, showing they can handle text-based PDFs too.
Certificate PDF: This test focused on non-standard layouts and stylized text.
All packages successfully extracted the main text content.
Docling, PDFPlumber, and PyMuPDF were the most successful at preserving the certificate's layout and formatting.
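The scanned-PDF result suggests a common pattern: try a fast text-layer extractor first and fall back to OCR only when it returns almost nothing. The sketch below illustrates that idea under stated assumptions. The function name, the two callables, and the `min_chars` threshold of 20 are all hypothetical choices for illustration, not part of any library tested here.

```python
from typing import Callable


def extract_with_ocr_fallback(
    path: str,
    text_extractor: Callable[[str], str],
    ocr_extractor: Callable[[str], str],
    min_chars: int = 20,  # assumed threshold; tune it for your documents
) -> str:
    """Use the text-layer extractor unless its output is nearly empty, then use OCR."""
    text = text_extractor(path)
    if len(text.strip()) >= min_chars:
        return text
    # Little or no text layer found, which is typical of scanned, image-based PDFs.
    return ocr_extractor(path)
```

In practice `text_extractor` might wrap a tool such as PDFPlumber and `ocr_extractor` a tool such as Tesseract, so that the slower OCR path only runs on documents that need it.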

The Image-Only Test: PNG Invoice
This test required pure OCR capabilities to extract data from an invoice image.
Tesseract and EasyOCR provided the most accurate extraction. EasyOCR was slightly better at recognizing the line items and prices.
Docling also performed well, demonstrating a robust OCR engine.
PaddleOCR struggled significantly with this document, producing an output that was largely inaccurate and unusable.

The DOCX Showdown: Standard Technical Manual
As you might expect, the packages designed specifically for DOCX files performed flawlessly.
PythonDocx and Mammoth both excelled, extracting the text with perfect accuracy. They also preserved the original formatting, including headings and lists.
Docling also handled the DOCX file effectively, providing a clean Markdown output.

Part 3: The Verdict and Recommendations
This benchmark analysis reveals one clear truth: the best document extraction package is highly dependent on your specific use case and the types of documents you process.
To help you decide, here is a summary of our findings:
Table 2: Summary of extractor performance, based on benchmark report data.
| Extractor | Success Rate | Avg. Time (s) | Best Use Case |
| --- | --- | --- | --- |
| PythonDocx | 100.0% | 0.050 | .docx files with tables |
| Mammoth | 100.0% | 0.097 | .docx files (text-only) |
| PDFMiner | 66.7% | 0.127 | Basic text extraction from PDFs |
| PDFPlumber | 66.7% | 0.332 | Text-based PDFs with tables |
| PyMuPDF | 66.7% | 0.480 | Fast text extraction from PDFs |
| Tesseract | 100.0% | 1.612 | Scanned documents and images (fastest OCR) |
| EasyOCR | 100.0% | 8.282 | Scanned documents (high accuracy OCR) |
| Docling | 100.0% | 8.678 | Versatile all-in-one for mixed document types |
| PaddleOCR | 100.0% | 21.515 | Not Recommended (failed quality tests) |
Key Takeaways:
The All-Rounder: Docling stands out as a versatile and robust solution that performs well across all tested document types (DOCX, scanned PDFs, text PDFs, and images). This makes it a strong contender for a general-purpose document extraction tool.
The Specialists: If your task is highly specific, a specialized library is better.
For high-speed DOCX processing, PythonDocx is the clear winner.
For fast extraction from text-based PDFs (especially with tables), PDFPlumber is an excellent choice.
For high-accuracy OCR on scanned images, EasyOCR had a slight edge, while Tesseract provided the fastest OCR performance.
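These recommendations can be written down as a simple routing rule. The sketch below is one hypothetical way to encode them: the `RECOMMENDED` mapping, the `recommend_extractor` function, and the `scanned` flag are illustrative names, and the function only returns the name of a package rather than calling it.

```python
from pathlib import Path

# File extension -> package recommended in the takeaways for that file type.
RECOMMENDED = {
    ".docx": "PythonDocx",
    ".pdf": "PDFPlumber",  # text-based PDFs only; scanned PDFs are handled below
    ".png": "EasyOCR",
}


def recommend_extractor(path: str, scanned: bool = False, default: str = "Docling") -> str:
    """Return the name of the package suggested for `path`.

    `scanned` is a caller-supplied hint that a PDF is image-based.
    Unknown file types fall back to the general-purpose option.
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf" and scanned:
        return "EasyOCR"  # swap in "Tesseract" if speed matters more than accuracy
    return RECOMMENDED.get(suffix, default)
```

For example, `recommend_extractor("manual.docx")` returns `"PythonDocx"`, while a file type outside the mapping falls back to `"Docling"`.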


