Tesseract
Open-source OCR for text extraction
The Tesseract open-source OCR engine is the first step in our pipeline and extracts the bulk of text from document pages.
Since v4, Tesseract uses an LSTM (Long Short-Term Memory) neural network for recognition, combining traditional OCR techniques with modern neural networks.
Configuration
Scaling factor for image processing before OCR. Higher values increase resolution but require more processing power.
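To make the cost of the scaling factor concrete, here is a minimal sketch in Python. The name `scale_factor` and both helpers are hypothetical, not the pipeline's actual setting or API. It only illustrates that the number of pixels handed to the OCR engine grows with the square of the factor, which is roughly why higher values need more processing power.

```python
from typing import Tuple


def scaled_size(width: int, height: int, scale_factor: float) -> Tuple[int, int]:
    """Return the page size in pixels after scaling by scale_factor."""
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    return round(width * scale_factor), round(height * scale_factor)


def pixel_cost_ratio(scale_factor: float) -> float:
    """Pixels to process relative to the unscaled page (grows quadratically)."""
    return scale_factor ** 2


# A hypothetical 1000 x 1400 px page rendered at a factor of 2
print(scaled_size(1000, 1400, 2.0))   # (2000, 2800)
print(pixel_cost_ratio(2.0))          # 4.0
```

Doubling the factor quadruples the pixel count, so it is usually worth raising the value only as far as the text on the page needs.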
Minimum confidence threshold percentage for OCR results. Pages with confidence below this threshold may be processed by the Vision Language Model if enabled.
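The threshold check might look like the sketch below. How the pipeline actually aggregates confidence for a page is not stated here, so averaging per-word confidences is an assumption, and `min_confidence`, `vlm_enabled`, and both function names are hypothetical. Tesseract reports word confidence on a 0 to 100 scale and uses -1 in its TSV output for rows that are not words, so those are skipped.

```python
from statistics import mean
from typing import Iterable, Optional


def page_confidence(word_confidences: Iterable[float]) -> Optional[float]:
    """Average per-word confidence (0-100), ignoring negative placeholders."""
    scores = [c for c in word_confidences if c >= 0]
    return mean(scores) if scores else None


def needs_vlm(confidence: Optional[float], min_confidence: float,
              vlm_enabled: bool) -> bool:
    """True when the page falls below the threshold and the VLM is enabled."""
    if not vlm_enabled:
        return False
    # Assumption: a page with no recognized words counts as low confidence.
    return confidence is None or confidence < min_confidence


words = [96, 91, -1, 42, 88]          # -1 marks a row that is not a word
conf = page_confidence(words)         # (96 + 91 + 42 + 88) / 4 = 79.25
print(needs_vlm(conf, min_confidence=80, vlm_enabled=True))    # True
print(needs_vlm(conf, min_confidence=80, vlm_enabled=False))   # False
```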
Hardware Requirements
Tesseract runs on the CPU and does not require a GPU. It is CPU-intensive but optimized for modern processors.
Installation
Tesseract is a system dependency and needs to be installed separately from Python.
On macOS, we recommend installing with Homebrew. See the Tesseract documentation for other installation methods.
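Because Tesseract is installed outside Python, it can help to confirm that the binary is reachable before running the pipeline. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and it assumes the executable is named `tesseract` and is on `PATH`.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the path of the tesseract binary on PATH, or None if absent."""
    return shutil.which("tesseract")


def tesseract_version(binary: str) -> str:
    """Return the first line printed by `<binary> --version`."""
    result = subprocess.run(
        [binary, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some releases print the version to stderr
        text=True,
        check=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


path = find_tesseract()
if path is None:
    # On macOS with Homebrew: brew install tesseract
    print("tesseract not found on PATH")
else:
    print(tesseract_version(path))
```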