Quickstart

Follow the steps below to set up your environment, configure essential settings, and run the script.

Install dependencies

Python - For running the script
Tesseract - The OCR engine for extracting text from PDF page images

Python and `uv`

We recommend uv for installing and managing Python, project dependencies, and environments. It is significantly faster than pip and handles all three in one tool.

Homebrew is the easiest way to install it on macOS. Other methods are listed in the uv installation documentation.

brew install uv

Tesseract OCR engine

Tesseract must be installed separately from Python. Homebrew is the easiest way to install it on macOS. Other methods are listed in the Tesseract installation documentation.

brew install tesseract

Confirm Tesseract is correctly installed by running `tesseract --version` in your terminal.
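
The same check can be scripted before a first run. Below is a minimal sketch using only the Python standard library; the helper name `tool_version` is hypothetical and is not part of PDF Toolbox.

```python
import shutil
import subprocess
from typing import Optional


def tool_version(name: str) -> Optional[str]:
    """Return the first line of `<name> --version`, or None if the tool is not on PATH."""
    path = shutil.which(name)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some tools print version details to stderr, so fall back to it.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    for tool in ("uv", "tesseract"):
        version = tool_version(tool)
        print(f"{tool}: {version if version is not None else 'NOT FOUND'}")
```

A `NOT FOUND` line means the tool is missing or not on your `PATH`; revisit the install step for that tool.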

Set Up The Project

# Get the project source on your local machine
git clone https://github.com/lifeinchords/pdf-toolbox.git
cd pdf-toolbox

# Create an isolated environment in `.venv`, similar to `python -m venv .venv`
uv venv

# Activate the virtual environment
source .venv/bin/activate

# Install Python dependencies, listed in `pyproject.toml`
# Similar to `pip install -r requirements.txt`
uv sync

Set Up Your OpenAI API Key

Access to OpenAI’s API is required to run the script. Follow these steps to create an API key and add it to the project:

Create your environment file by copying the example file:

# In project root
cp .env.example .env

Create an account on the OpenAI Platform if you don't have one.
Generate an API key and copy it.
Open the `.env` file and paste your key. Make sure there are no spaces, quotes, or extra characters.

OPENAI_API_KEY=your_key_here
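
If the key is rejected at runtime, a stray quote or space in `.env` is a common cause. The sketch below mirrors the formatting advice above; the function name `read_api_key` is hypothetical, and the project may load its environment file differently.

```python
def read_api_key(env_text: str, name: str = "OPENAI_API_KEY") -> str:
    """Return the value for `name` from .env-style text, rejecting common paste mistakes."""
    for line in env_text.splitlines():
        # Skip comments and lines that are not KEY=value pairs.
        if line.lstrip().startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() != name:
            continue
        if key != name or value != value.strip():
            raise ValueError(f"{name}: remove spaces around the key or value")
        if any(ch in value for ch in "\"' \t"):
            raise ValueError(f"{name}: remove quotes and whitespace from the value")
        if not value or value == "your_key_here":
            raise ValueError(f"{name}: replace the placeholder with your real key")
        return value
    raise ValueError(f"{name} not found")
```

To check a real file, pass it the contents of `.env` read from the project root, for example `read_api_key(open(".env").read())`.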

The default model is set to "gpt-4o-mini" for the best balance of speed, quality and low cost.

Our script caps input and output tokens as a safeguard against runaway billing costs. Always review the actual costs on your OpenAI Platform account when running the script on a large batch of files.

Prep Your PDFs

A sample book under a Creative Commons license is included in the data directory: Grey Systems Analysis, by Sifeng Liu.

You can add more source PDFs by dragging or copying them into the [project-root]/data directory, or by copying them in the terminal:

cp /path/to/your/pdfs/*.pdf [project-root]/data/

Why copy rather than move the files?

This project is still in early development and the script might not give the intended results. Keep your originals in a safe location to experiment freely without risk.
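
For larger collections, the copy step can be scripted. This is a minimal sketch with the standard library; `copy_pdfs` is a hypothetical helper, not part of the project, and it assumes lowercase `.pdf` extensions.

```python
import shutil
from pathlib import Path


def copy_pdfs(source_dir: str, data_dir: str) -> list[Path]:
    """Copy every *.pdf in source_dir into data_dir; originals are left untouched."""
    destination = Path(data_dir)
    destination.mkdir(parents=True, exist_ok=True)
    copied = []
    for pdf in sorted(Path(source_dir).glob("*.pdf")):
        target = destination / pdf.name
        if target.exists():
            # Skip rather than overwrite a file that is already in place.
            continue
        shutil.copy2(pdf, target)  # copy2 also preserves file timestamps
        copied.append(target)
    return copied
```

Calling `copy_pdfs("/path/to/your/pdfs", "data")` from the project root returns the list of newly copied files, so a second call with the same arguments returns an empty list.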

Run The Pipeline Script

The script takes two optional arguments:

--with-stats-export: Saves an easy-to-read text file of proposed filenames. This is in addition to the log file, which is always saved.
--verbose-term: Shows detailed progress in your terminal.

    # recommended for your first run
    uv run src/main.py --with-stats-export --verbose-term
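
Both arguments are on/off switches. As an illustration only, this is how two such flags could be declared with `argparse`; it is not the project's actual parser, and the help strings are paraphrased from the descriptions above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the two optional boolean flags; both default to False."""
    parser = argparse.ArgumentParser(description="Propose new filenames for PDFs.")
    parser.add_argument(
        "--with-stats-export",
        action="store_true",
        help="also save a text file of proposed filenames",
    )
    parser.add_argument(
        "--verbose-term",
        action="store_true",
        help="show detailed progress in the terminal",
    )
    return parser


# argparse turns dashes into underscores for attribute names.
args = build_parser().parse_args(["--with-stats-export"])
print(args.with_stats_export, args.verbose_term)  # prints: True False
```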

Review The Results

You should see terminal output like the following:

$ uv run src/main.py --with-stats-export --verbose-term

PyMuPDF version: 1.25.2
*************
1	Total PDFs to process

~30 seconds and $0.001 per PDF with default settings...

*************

.
.
.
[full output omitted]
.
.
.

Processing: 100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:26<00:00, 26.96s/it]

Summary:
	Valid: 1
	Invalid: 0
	Successful: 1
	Failed: 0
	With Annotations: 0
	Files with API Timeouts: 0

Total:
	0m 26s
	Tokens: Input: 3909, Output: 719
	$0.001018

Average per PDF:
	0m 26s
	Tokens: Input: 3909.00, Output: 719.00
	$0.001018
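
The dollar figure can be sanity-checked from the token counts. The sketch below assumes gpt-4o-mini list prices of $0.15 per million input tokens and $0.60 per million output tokens; these prices are an assumption rather than something stated in this guide, so confirm them on OpenAI's pricing page. With those prices, the token counts reported above reproduce the reported total.

```python
# Assumed gpt-4o-mini list prices in USD per 1M tokens (verify before relying on them).
INPUT_PRICE_PER_M = 0.15
OUTPUT_PRICE_PER_M = 0.60


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one run from its token counts."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Token counts taken from the sample run above.
print(f"${estimate_cost(3909, 719):.6f}")  # prints: $0.001018
```

Multiplying the per-PDF estimate by the number of files gives a rough budget for a batch, though actual usage varies with each PDF's content.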

Additional outputs to review are:

Log file
Page images with OCR text
stats.txt file

The Evaluation guide has more details on how to review these results and iterate on the process.

Enable Write Mode and Rerun The Script

Once you’re satisfied with the dry run, enable write mode and rerun the script to apply the changes.

Note: write mode is not yet implemented. It will be available in the next update.

Open the config file and set:

config.py

WRITE_PDF_CHANGES = True

# Optionally disable for better performance and saved disk space
SAVE_DIAGNOSTIC_FILES = False

Rerun the script; no arguments are necessary.

uv run src/main.py

You should see terminal output like this:

$ uv run src/main.py

******************
1	Total PDFs to process

~30 seconds and $0.001 per PDF with default settings...

******************
Validating: 100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 28.00it/s]
Validating:   0%|                                                                                              | 0/1 [00:00<?, ?it/s]
1	Valid
0	Invalid

Processing valid files...
******************

        Current: grey_systems_analysis.pdf
        Proposed: Grey Systems Analysis, Methods, Models and Applications, (Sifeng Liu), Springer, (2nd Ed.), (2025).pdf

Processing: 100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:27<00:00, 27.23s/it]

Summary:
	Valid: 1
	Invalid: 0
	Successful: 1
	Failed: 0
	With Annotations: 0
	Files with API Timeouts: 0

Total:
	0m 27s
	Tokens: Input: 3909, Output: 745
	$0.001033

Average per PDF:
	0m 27s
	Tokens: Input: 3909.00, Output: 745.00
	$0.001033

The result would rename the file:

grey_systems_analysis.pdf -> Grey Systems Analysis, Methods, Models and Applications, (Sifeng Liu), Springer, (2nd Ed.), (2025).pdf
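
To illustrate the difference between a dry run and write mode, here is a hypothetical sketch of a rename step gated by a boolean such as `WRITE_PDF_CHANGES`. The function `apply_rename` is an assumption made for this example and does not describe the project's implementation.

```python
from pathlib import Path


def apply_rename(pdf_path: Path, proposed_name: str, write_changes: bool = False) -> Path:
    """Return the target path; rename on disk only when write_changes is True."""
    target = pdf_path.with_name(proposed_name)
    if not write_changes:
        # Dry run: report the proposal and leave the file alone.
        print(f"[dry run] {pdf_path.name} -> {target.name}")
        return target
    if target.exists():
        raise FileExistsError(f"Refusing to overwrite {target}")
    pdf_path.rename(target)
    return target
```

Defaulting `write_changes` to `False` keeps the safe behavior as the default, so a file is only renamed when the caller opts in explicitly.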

Configuring the filename format is planned for the next update.

Celebrate!

🚀 🎉 🎶 ✨

Next Steps

Now that you’ve got the basics working, explore these guides to get the most out of PDF Toolbox:

Evaluate The Results

Learn how to analyze the output and iterate on the process

Matter Pages

Learn about front, body and back matter pages

Key Concepts

Understand the project’s core ideas and technologies

Vision Language Models

Add an additional layer of understanding

Need Help?

If you need help or have any questions, start a Discussion on GitHub. We’ll be glad to help with your specific use case.
