The open-source Tesseract OCR engine is the first step in our pipeline and extracts the
bulk of the text from document pages. Since v4, Tesseract uses an LSTM (Long Short-Term
Memory) neural network, combining traditional OCR techniques with modern neural networks.
Minimum confidence threshold percentage for OCR results. Pages with confidence below
this threshold may be processed by the Vision Language Model if enabled.
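
The threshold check can be pictured with the minimal sketch below. It is not the pipeline's actual implementation: the function names, the default of 60, and the choice to average per-word confidences into one page score are all assumptions made for illustration. Tesseract reports per-word confidences on a 0 to 100 scale and uses -1 for entries that are not words, so those are dropped before averaging.

```python
from typing import Iterable, Union

Number = Union[int, float, str]


def page_confidence(word_confidences: Iterable[Number]) -> float:
    """Mean confidence (0-100) over recognized words, or 0.0 if there are none.

    Averaging per-word scores is an assumption of this sketch; the real
    pipeline may aggregate confidence differently.
    """
    scores = [float(c) for c in word_confidences]
    # Tesseract uses -1 for layout entries that carry no recognized word.
    scores = [c for c in scores if c >= 0]
    return sum(scores) / len(scores) if scores else 0.0


def needs_vlm_fallback(
    word_confidences: Iterable[Number],
    min_confidence: float = 60.0,  # hypothetical default, in percent
    vlm_enabled: bool = True,      # hypothetical flag name
) -> bool:
    """True when the page falls below the threshold and the VLM is enabled."""
    return vlm_enabled and page_confidence(word_confidences) < min_confidence
```

With the `pytesseract` wrapper, the per-word values could come from `pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)["conf"]`, so `needs_vlm_fallback(data["conf"], min_confidence=70)` would flag a page for the Vision Language Model only when its mean word confidence is under 70.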
Tesseract is a system dependency and needs to be installed separately from Python. On macOS,
we recommend installing it with Homebrew. See the Tesseract documentation for other ways to
install it.
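
Because Tesseract lives outside the Python environment, it can help to confirm that the binary is reachable before running the pipeline. The sketch below is one way to do that with the standard library only. The helper name `tesseract_version` is hypothetical, and it assumes the executable is called `tesseract` and is on `PATH`, which is the case after a typical Homebrew install (`brew install tesseract`).

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version(binary: str = "tesseract") -> Optional[str]:
    """Return the first line of `tesseract --version`, or None if not found."""
    path = shutil.which(binary)
    if path is None:
        return None  # binary is not on PATH
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some older releases print the version banner to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    version = tesseract_version()
    print(version or "Tesseract not found; install it before running the pipeline.")
```

A `None` result means the binary was not found, or that it printed no version banner.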