The Tesseract open-source OCR engine is the first step in our pipeline and extracts the bulk of text from document pages.

Since v4, Tesseract has used an LSTM (Long Short-Term Memory) neural network for recognition, combining traditional OCR techniques with modern neural networks.

Configuration

ZOOM_FACTOR
number, default: 2.0

Scaling factor applied to page images before OCR. Higher values increase resolution but require more processing time and memory.
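
To make the cost of a higher scaling factor concrete, here is a minimal sketch of how the rendered image grows with it. It assumes pages are rasterized from PDF points (72 per inch), which this page does not state; the helper names `scaled_size`, `effective_dpi`, and `pixel_cost_ratio` are hypothetical and not part of the pipeline.

```python
ZOOM_FACTOR = 2.0  # default value documented above
BASE_DPI = 72      # assumption: page size is given in PDF points (72 per inch)


def scaled_size(width, height, zoom=ZOOM_FACTOR):
    """Return the (width, height) in pixels after scaling by `zoom`."""
    if zoom <= 0:
        raise ValueError("zoom must be positive")
    return round(width * zoom), round(height * zoom)


def effective_dpi(zoom=ZOOM_FACTOR, base_dpi=BASE_DPI):
    """Return the resolution the OCR engine sees for a given zoom."""
    return base_dpi * zoom


def pixel_cost_ratio(zoom=ZOOM_FACTOR):
    """Return how many times more pixels must be processed than at zoom 1.0."""
    return zoom ** 2


# A US Letter page is 612 x 792 points.
print(scaled_size(612, 792))   # pixel dimensions at the default zoom
print(effective_dpi())         # resolution at the default zoom
print(pixel_cost_ratio())      # pixel count grows with the square of the zoom
```

Because the pixel count grows quadratically, doubling the scaling factor roughly quadruples the amount of image data handed to the OCR engine.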

OCR_CONFIDENCE_THRESHOLD
number, default: 30.0

Minimum OCR confidence, as a percentage. Pages with confidence below this threshold may be processed by the Vision Language Model if it is enabled.
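
The threshold check can be sketched as follows. This page does not say how a page-level confidence is computed, so the sketch assumes it is the mean of the per-word confidences, with negative values (which Tesseract wrappers commonly report for non-text entries) ignored. The names `mean_confidence`, `needs_fallback`, and `vlm_enabled` are hypothetical.

```python
OCR_CONFIDENCE_THRESHOLD = 30.0  # default value documented above


def mean_confidence(word_confidences):
    """Average the per-word confidences, skipping negative (non-text) entries.

    Assumption: page confidence is the mean word confidence on a 0-100 scale.
    """
    valid = [float(c) for c in word_confidences if float(c) >= 0]
    if not valid:
        return 0.0  # no recognized words: treat the page as zero confidence
    return sum(valid) / len(valid)


def needs_fallback(word_confidences,
                   threshold=OCR_CONFIDENCE_THRESHOLD,
                   vlm_enabled=True):
    """Return True when a page should be handed to the fallback model."""
    return vlm_enabled and mean_confidence(word_confidences) < threshold


# Hypothetical per-word confidences for two pages.
clean_page = [96, 91, -1, 88]
noisy_page = [12, 25, -1, 18]
print(needs_fallback(clean_page))
print(needs_fallback(noisy_page))
```

A page with no recognized words is treated here as zero confidence, so it is routed to the fallback whenever the fallback is enabled; a real pipeline might handle blank pages differently.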

Hardware Requirements

Tesseract is CPU-intensive but optimized for modern hardware.

Installation

Tesseract is a system dependency and needs to be installed separately from Python.

On macOS, we recommend installing with Homebrew. For other ways to install it, see the Tesseract documentation.

brew install tesseract
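
Because Tesseract is installed outside Python, it can be useful to confirm from Python that the binary is on the `PATH` before running the pipeline. The following is a minimal sketch using only the standard library; the function name `tesseract_version` is hypothetical and not part of the pipeline.

```python
import shutil
import subprocess


def tesseract_version(binary="tesseract"):
    """Return the first line of `<binary> --version`, or None if unavailable."""
    path = shutil.which(binary)
    if path is None:
        return None  # binary is not on PATH
    result = subprocess.run(
        [path, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some builds print version information to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


version = tesseract_version()
if version is None:
    print("Tesseract not found; install it first (for example with Homebrew).")
else:
    print(f"Found {version}")
```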

References