
# Quickstart

Follow the steps below to set up your environment, configure essential settings, and run
the script.

### Install dependencies

* **Python** - For running the script
* **Tesseract** - The OCR engine for extracting text from PDF page images

#### Python and `uv`

We recommend `uv` to install and manage Python, project dependencies, and environments.
It's significantly faster than `pip` and handles all of this in one tool.

Homebrew is the easiest way to install it on macOS.
[Other methods](https://docs.astral.sh/uv/getting-started/installation/).

```bash theme={null}
brew install uv
```

#### Tesseract OCR engine

Tesseract needs to be installed separately from Python. Homebrew is the easiest way to
install it on macOS.
[Other methods](https://github.com/tesseract-ocr/tessdoc?tab=readme-ov-file#binaries).

```bash theme={null}
brew install tesseract
```

Confirm Tesseract is correctly installed by running `tesseract --version` in your
terminal.
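
If you'd rather run the same check from Python, here is a minimal sketch using only the standard library. The helper name `tool_version` is hypothetical and not part of the project; it looks up the executable on your `PATH` and returns the first line of its `--version` output.

```python
import shutil
import subprocess
from typing import Optional


def tool_version(executable: str) -> Optional[str]:
    """Return the first line of `<executable> --version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some tools print their version to stderr, so fall back to it.
    output = result.stdout or result.stderr
    lines = output.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    version = tool_version("tesseract")
    print(version if version is not None else "tesseract not found on PATH")
```

A `None` result usually means the install did not put `tesseract` on your `PATH`; opening a new terminal after installing is often enough to fix that.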

### Set Up The Project

```bash theme={null}
# Change into the project directory (get the project source on your local machine first)
cd pdf-toolbox

# Create an isolated environment, similar to `python -m venv .venv`
uv venv

# Activate the virtual environment
source .venv/bin/activate

# Install Python dependencies, listed in `pyproject.toml`
# Similar to `pip install -r requirements.txt`
uv sync
```

### Set Up Your OpenAI API Key

Access to OpenAI's API is required to run the script. Follow these steps to create an API
key and add it to the project:

1. Create your environment file by copying the example file:

```bash theme={null}
# In project root
cp .env.example .env
```

2. Create an account on the [OpenAI Platform](https://platform.openai.com/api-keys) if you
   don't have one

3. [Generate an API key](https://help.openai.com/en/articles/4936850-where-do-i-find-my-openai-api-key)
   and copy it

4. Open the `.env` file and paste your key. Make sure there are no spaces, quotes, or
   extra characters.

```
OPENAI_API_KEY=your_key_here
```
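
Stray spaces and quotes are easy to miss by eye. The sketch below checks a single `NAME=value` line for the problems listed in step 4. The function `check_env_line` is a hypothetical helper, not part of the project, and it is deliberately stricter than some `.env` parsers, which may tolerate quotes.

```python
from typing import List


def check_env_line(line: str, name: str = "OPENAI_API_KEY") -> List[str]:
    """Return a list of problems found in a `NAME=value` line (empty if none)."""
    problems = []
    if line != line.strip():
        problems.append("leading or trailing whitespace")
    key, sep, value = line.strip().partition("=")
    if not sep:
        return problems + ["missing '='"]
    if key != name:
        # Also catches a space before the '=' sign, e.g. "OPENAI_API_KEY =..."
        problems.append(f"expected variable name {name!r}, got {key!r}")
    if not value:
        problems.append("value is empty")
    if any(ch.isspace() for ch in value):
        problems.append("value contains whitespace")
    if any(quote in value for quote in "\"'"):
        problems.append("value contains quotes")
    return problems


if __name__ == "__main__":
    print(check_env_line('OPENAI_API_KEY="your_key_here"'))
```

An empty list means the line has the plain `OPENAI_API_KEY=your_key_here` shape shown above.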

The default model is set to `"gpt-4o-mini"` for the best balance of speed, quality and low
cost.

<Note>
  Our script caps input and output tokens as a safeguard against runaway billing costs.
  Always review the actual costs on your OpenAI Platform account when running the script
  for a large batch of files.
</Note>

### Prep Your PDFs

A sample book under a Creative Commons license is included in the `data` directory:
[Grey Systems Analysis, by Liu, Sifeng](https://directory.doabooks.org/handle/20.500.12854/150111)

You can add additional source PDFs by dragging or copying them into the
`[project-root]/data` directory, or copying them in the terminal:

```bash theme={null}
cp /path/to/your/pdfs/*.pdf [project-root]/data/
```
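
The same copy step can be scripted. This is a minimal sketch using the standard library; `copy_pdfs` is a hypothetical helper, not part of the project. It copies rather than moves, and it skips any file that already exists in the target directory instead of overwriting it.

```python
import shutil
from pathlib import Path
from typing import List


def copy_pdfs(source_dir: Path, data_dir: Path) -> List[Path]:
    """Copy every *.pdf in source_dir into data_dir, leaving the originals in place."""
    data_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    # Note: the "*.pdf" pattern is case-sensitive on most Linux filesystems.
    for pdf in sorted(source_dir.glob("*.pdf")):
        target = data_dir / pdf.name
        if target.exists():
            # Skip rather than overwrite a file already in the data directory.
            continue
        shutil.copy2(pdf, target)  # copy2 also preserves timestamps
        copied.append(target)
    return copied
```

For example, `copy_pdfs(Path("/path/to/your/pdfs"), Path("data"))` returns the list of files it copied.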

<Warning>
  <span className="warning-text">Why copy rather than move the files?</span>

  This project is still in early development and the script might not give the
  intended results. Keep your originals in a safe location to experiment freely without risk.
</Warning>

### Validate Your PDFs First

Before processing, validate your PDFs to catch structural issues early. Validation checks
for corrupt page trees, incremental save compatibility, and metadata accessibility.

Use the `--validate-only` flag to run validation without processing:

```bash theme={null}
uv run src/main.py --validate-only
```

You can also specify a custom directory with `--dir`:

```bash theme={null}
uv run src/main.py --validate-only --dir /path/to/your/pdfs
```

The validation output shows how many files are valid or invalid:

```bash [expandable] theme={null}
$ uv run src/main.py --validate-only

PyMuPDF version: 1.25.2
*************
2	Total PDFs to validate

*************
Validating: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 28.00it/s]

2	Valid
0	Invalid

Validation complete. Exiting (--validate-only mode).
```
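
If you run validation from another script, you may want the counts as numbers. The sketch below parses them from captured terminal output. It assumes the `<count><tab>Valid` / `<count><tab>Invalid` lines shown above, which may change between versions; `parse_validation_summary` is a hypothetical helper, not part of the project.

```python
import re
from typing import Dict

# Matches lines such as "2<TAB>Valid" or "0<TAB>Invalid" (format assumed from the sample output).
COUNT_LINE = re.compile(r"^(\d+)[ \t]+(Valid|Invalid)[ \t]*$", re.MULTILINE)


def parse_validation_summary(output: str) -> Dict[str, int]:
    """Extract the valid/invalid counts from captured validation output."""
    return {label.lower(): int(count) for count, label in COUNT_LINE.findall(output)}


if __name__ == "__main__":
    sample = "2\tTotal PDFs to validate\n\n2\tValid\n0\tInvalid\n"
    print(parse_validation_summary(sample))
```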

Files marked as invalid will be skipped during processing. Check the log file for details
about why files failed validation, including repair warnings and page tree corruption
errors.

### Run The Pipeline Script

The script takes four optional arguments:

* `--validate-only`: Run validation only and exit without processing files
* `--with-stats-export`: Saves an easy-to-read text file of proposed filenames, in
  addition to the log file, which is always saved.
* `--verbose-term`: Shows detailed progress in your terminal
* `--dir`: Absolute directory path to process PDFs from (default: ./data)

```bash theme={null}
# Recommended for your first run
uv run src/main.py --with-stats-export --verbose-term
```
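
To show how these flags relate to each other, here is how a command-line interface with the same four options could be declared with `argparse`. This is an illustrative sketch only: the flag names and the `./data` default come from the list above, but the project's actual parsing code and help texts may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the four optional flags described above (illustrative only)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--validate-only", action="store_true",
                        help="Run validation only and exit without processing files")
    parser.add_argument("--with-stats-export", action="store_true",
                        help="Also save a text file of proposed filenames")
    parser.add_argument("--verbose-term", action="store_true",
                        help="Show detailed progress in the terminal")
    parser.add_argument("--dir", default="./data",
                        help="Directory to process PDFs from")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--with-stats-export", "--verbose-term"])
    print(args)
```

The three switches default to off and `--dir` falls back to `./data`, so running the script with no arguments processes the bundled sample book without the extra exports.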

### Review The Results

You should see terminal output like the following:

```bash [expandable] theme={null}
$ uv run src/main.py --with-stats-export --verbose-term --with-annot-export  --dir 


PyMuPDF version: 1.25.2
*************
1	Total PDFs to process

~30 seconds and $0.001 per PDF with default settings...

*************

.
.
.
[full output omitted]
.
.
.

Processing: 100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:26<00:00, 26.96s/it]

Summary:
	Valid: 1
	Invalid: 0
	Successful: 1
	Failed: 0
	With Annotations: 0
	Files with API Timeouts: 0

Total:
	0m 26s
	Tokens: Input: 3909, Output: 719
	$0.001018

Average per PDF:
	0m 26s
	Tokens: Input: 3909.00, Output: 719.00
	$0.001018
```
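
The dollar figure in the summary can be sanity-checked from the token counts. The sketch below assumes `gpt-4o-mini` rates of $0.15 per 1M input tokens and $0.60 per 1M output tokens. These rates are an assumption, not values read from the project's code, so check OpenAI's pricing page before relying on them.

```python
# Assumed gpt-4o-mini prices in USD per 1M tokens. These are not taken from
# the project's code; check OpenAI's pricing page for current values.
INPUT_USD_PER_MILLION = 0.15
OUTPUT_USD_PER_MILLION = 0.60


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a run from its token counts."""
    return (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    ) / 1_000_000


if __name__ == "__main__":
    # Token counts from the sample run above.
    print(f"${estimate_cost(3909, 719):.6f}")
```

With the sample run's 3909 input and 719 output tokens, this prints `$0.001018`, which matches the total in the summary. Multiply by the number of PDFs for a rough batch estimate.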

Additional outputs to review are:

* Log file
* Page images with OCR text
* `stats.txt` file

The [Evaluation](/analysis-and-iteration/evaluation) guide has more details on how to
review these results and iterate on the process.

### Enable Write Mode and Rerun The Script

Once you're satisfied with the dry run, enable write mode and rerun the script to apply
the changes.

<Note>Write mode is not yet implemented. It will be available in the next update.</Note>

Open `config.py` and set:

```python config.py theme={null}
WRITE_PDF_CHANGES = True
```

Rerun the script. No arguments are necessary.

```bash theme={null}
uv run src/main.py
```

You should see terminal output like this:

```bash [expandable] theme={null}
$ uv run src/main.py

******************
1	Total PDFs to process

~30 seconds and $0.001 per PDF with default settings...

******************
Validating: 100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 28.00it/s]
Validating:   0%|                                                                                              | 0/1 [00:00<?, ?it/s]
1	Valid
0	Invalid

Processing valid files...
******************

        Current: grey_systems_analysis.pdf
        Proposed: Grey Systems Analysis, Methods, Models and Applications, (Sifeng Liu), Springer, (2nd Ed.), (2025).pdf

Processing: 100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:27<00:00, 27.23s/it]

Summary:
	Valid: 1
	Invalid: 0
	Successful: 1
	Failed: 0
	With Annotations: 0
	Files with API Timeouts: 0

Total:
	0m 27s
	Tokens: Input: 3909, Output: 745
	$0.001033

Average per PDF:
	0m 27s
	Tokens: Input: 3909.00, Output: 745.00
	$0.001033
```

With write mode enabled, the file would be renamed as follows:

```txt theme={null}
grey_systems_analysis.pdf -> Grey Systems Analysis, Methods, Models and Applications, (Sifeng Liu), Springer, (2nd Ed.), (2025).pdf
```
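
The difference between a dry run and write mode can be sketched as a single guarded rename. This is a hypothetical illustration, not the project's implementation: only the flag name `WRITE_PDF_CHANGES` comes from `config.py`, and `apply_rename` is an invented helper.

```python
from pathlib import Path

# Mirrors the flag name in config.py; False means dry run (assumed default).
WRITE_PDF_CHANGES = False


def apply_rename(pdf: Path, proposed_name: str, write: bool = WRITE_PDF_CHANGES) -> Path:
    """Rename pdf to proposed_name when write is True; otherwise only report the change."""
    target = pdf.with_name(proposed_name)
    if not write:
        print(f"[dry run] {pdf.name} -> {target.name}")
        return pdf
    if target.exists():
        # Never silently overwrite another file in the same directory.
        raise FileExistsError(f"Refusing to overwrite {target}")
    return pdf.rename(target)
```

Keeping the dry run as the default means nothing on disk changes until you opt in, which is why reviewing the proposed filenames first is worthwhile.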

<Note>Configuring the filename format is planned for the next update.</Note>

### Celebrate!

<p class="m-1 text-2xl">🚀 🎉 🎶 ✨</p>

## Next Steps

Now that you've got the basics working, explore these guides to get the most out of PDF
Toolbox:

<CardGroup cols={2}>
  <Card title="Evaluate The Results" icon="vial" href="/analysis-and-iteration/evaluation">
    Learn how to analyze the output and iterate on the process
  </Card>

  <Card title="Matter Pages" icon="book-open" href="/key-concepts/matter-pages">
    Learn about front, body and back matter pages
  </Card>

  <Card title="Key Concepts" icon="compass" href="/key-concepts/overview">
    Understand the project's core ideas and technologies
  </Card>

  <Card title="Vision Language Models" icon="compass" href="/config/ai-models/llama">
    Add an additional layer of understanding
  </Card>
</CardGroup>

### Need Help?

If you have questions or run into problems, we'll be glad to help with your specific use case.
