# Tesseract

> Open-source OCR for text extraction

The [Tesseract](https://tesseract-ocr.github.io/tessdoc/#introduction) open-source OCR
engine is the first step in our pipeline and extracts the bulk of text from document
pages.

Since v4, Tesseract uses an
[LSTM (Long Short-Term Memory)](https://tesseract-ocr.github.io/tessdoc/tess4/NeuralNetsInTesseract4.00.html)
neural network, combining traditional OCR techniques with neural-network-based recognition.

## Configuration

<ParamField query="ZOOM_FACTOR" type="number" default="2.0">
  Scaling factor for image processing before OCR. Higher values increase resolution but
  require more processing power.
</ParamField>
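
The trade-off behind `ZOOM_FACTOR` can be sketched with a little arithmetic. The helper names below (`scaled_size`, `pixel_cost`) are hypothetical and not part of the pipeline, and the 72 DPI baseline is an assumption that holds only if pages are rendered from PDF at the default resolution.

```python
ZOOM_FACTOR = 2.0  # default from the configuration above
BASE_DPI = 72      # assumption: pages render at 72 DPI when zoom is 1.0


def scaled_size(width: int, height: int, zoom: float = ZOOM_FACTOR) -> tuple[int, int]:
    """Return the pixel dimensions of a page after scaling by `zoom`."""
    if zoom <= 0:
        raise ValueError("zoom must be positive")
    return round(width * zoom), round(height * zoom)


def pixel_cost(zoom: float = ZOOM_FACTOR) -> float:
    """Pixel count relative to zoom 1.0; it grows with the square of zoom."""
    return zoom * zoom


# A US Letter page is 612 x 792 points, i.e. 612 x 792 pixels at 72 DPI.
width, height = scaled_size(612, 792)
print(width, height)           # scaled pixel dimensions
print(BASE_DPI * ZOOM_FACTOR)  # effective DPI under the 72 DPI assumption
print(pixel_cost())            # relative number of pixels to process
```

Because the pixel count scales with the square of the factor, doubling `ZOOM_FACTOR` roughly quadruples the amount of image data handed to the OCR step.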

<ParamField query="OCR_CONFIDENCE_THRESHOLD" type="number" default="30.0">
  Minimum confidence threshold percentage for OCR results. Pages with confidence below
  this threshold may be processed by the Vision Language Model if enabled.
</ParamField>
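
As a rough illustration of how such a threshold could be applied, the sketch below averages per-word confidences and compares the result with `OCR_CONFIDENCE_THRESHOLD`. The function names and the use of a simple mean are assumptions, not the pipeline's actual logic. Negative values are skipped because Tesseract's tabular output uses `-1` for rows that are not recognized words.

```python
from statistics import mean

OCR_CONFIDENCE_THRESHOLD = 30.0  # default from the configuration above


def page_confidence(word_confidences):
    """Mean confidence (0-100) over recognized words; 0.0 if there are none."""
    # Skip negative entries, which mark rows that are not words.
    valid = [c for c in word_confidences if c >= 0]
    return mean(valid) if valid else 0.0


def needs_vlm(word_confidences, threshold=OCR_CONFIDENCE_THRESHOLD, vlm_enabled=True):
    """True if the page falls below the threshold and the VLM step is enabled."""
    return vlm_enabled and page_confidence(word_confidences) < threshold


# Hypothetical per-word confidences for two pages.
print(needs_vlm([95, 88, -1, 91]))  # clean scan, stays with the OCR text
print(needs_vlm([12, 25, -1, 18]))  # noisy scan, candidate for the VLM
```

Under this sketch, a page with no recognized words gets a confidence of 0.0 and is therefore also routed to the Vision Language Model when it is enabled.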

## Hardware Requirements

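Tesseract runs on the CPU and does not need a GPU. Processing time grows with page count
and image resolution, so higher `ZOOM_FACTOR` values increase the CPU time spent per page.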

## Installation

Tesseract is a system dependency and needs to be installed separately from Python.

For other platforms, see the
[installation instructions](https://github.com/tesseract-ocr/tessdoc?tab=readme-ov-file#compiling-and-installation).
On macOS, we recommend installing it with Homebrew:

```bash
brew install tesseract
```
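
Because Tesseract lives outside the Python environment, it can help to confirm that the binary is reachable before running the pipeline. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and it assumes the executable is called `tesseract` and is on `PATH`.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable, or None if not on PATH."""
    return shutil.which("tesseract")


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if unavailable."""
    path = find_tesseract()
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some releases print the version to stderr rather than stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


print(tesseract_version() or "tesseract not found on PATH")
```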

## References

* [Tesseract OCR Quality Improvement Guide](https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html)
