# Tesseract

> Open-source OCR for text extraction

The [Tesseract](https://tesseract-ocr.github.io/tessdoc/#introduction) open-source OCR engine is the first step in our pipeline and extracts the bulk of the text from document pages. Since v4, Tesseract has used an [LSTM (Long Short-Term Memory)](https://tesseract-ocr.github.io/tessdoc/tess4/NeuralNetsInTesseract4.00.html) neural network, combining traditional OCR techniques with modern neural networks.

## Configuration

* **Scaling factor**: applied to the page image before OCR. Higher values increase resolution but require more processing power.
* **Minimum confidence threshold**: a percentage for OCR results. Pages with confidence below this threshold may be processed by the Vision Language Model if enabled.

## Hardware Requirements

Tesseract is CPU-intensive but optimized for modern hardware.

## Installation

Tesseract is a system dependency and must be installed separately from Python. On macOS, we recommend installing it with Homebrew. For other installation methods, see the [Tesseract installation instructions](https://github.com/tesseract-ocr/tessdoc?tab=readme-ov-file#compiling-and-installation).

```bash
brew install tesseract
```

### References

* [Tesseract OCR Quality Improvement Guide](https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html)
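
## Example: Confidence Threshold

The minimum confidence threshold described under Configuration can be illustrated with a small sketch. This is not the pipeline's actual implementation: the function names `page_confidence` and `needs_vlm_fallback`, the default threshold of 60, and the choice to average per-word confidences into a page score are all assumptions made for illustration. The sketch relies only on the fact that Tesseract reports per-word confidences on a 0-100 scale and that its TSV output marks non-word rows with a confidence of -1.

```python
from statistics import mean


def page_confidence(word_confidences):
    """Return a page-level confidence (0-100) from per-word confidences.

    Assumption: the page score is the mean of the word scores. Entries of -1
    (layout rows without a recognized word) are skipped.
    """
    scored = [c for c in word_confidences if c >= 0]
    # Assumption: a page with no recognized words counts as zero confidence.
    return mean(scored) if scored else 0.0


def needs_vlm_fallback(word_confidences, min_confidence=60.0, vlm_enabled=True):
    """Decide whether a page should be handed to the Vision Language Model.

    `min_confidence` and `vlm_enabled` are hypothetical parameter names that
    stand in for the pipeline's real configuration options.
    """
    return vlm_enabled and page_confidence(word_confidences) < min_confidence


if __name__ == "__main__":
    clean_page = [96, 91, -1, 88, 93]
    noisy_page = [41, -1, 35, 58]
    print(needs_vlm_fallback(clean_page))  # False
    print(needs_vlm_fallback(noisy_page))  # True
```

In this sketch, a cleanly scanned page averages well above the threshold and stays with the Tesseract output, while a noisy page falls below it and is flagged for the Vision Language Model, provided that step is enabled.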