Follow the steps below to set up your environment, configure essential settings, and run the script.

Install Dependencies

  • Python - For running the script
  • Tesseract - The OCR engine for extracting text from PDF page images

Python and uv

We recommend uv to install and manage Python, project dependencies and environments. It’s significantly faster than pip, and does everything in one tool.

Homebrew is the easiest way to install uv on macOS. For other installation methods, see the uv documentation.

brew install uv

Tesseract OCR engine

Tesseract must be installed separately from Python. Homebrew is the easiest way to install it on macOS. For other installation methods, see the Tesseract documentation.

brew install tesseract

Confirm Tesseract is correctly installed by running tesseract --version in your terminal.
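The same check can be scripted for both command-line dependencies. This is a minimal sketch, not part of the project: `tool_version` is a hypothetical helper that looks each tool up on your PATH and reads the first line of its `--version` output.

```python
import shutil
import subprocess
from typing import Optional


def tool_version(name: str) -> Optional[str]:
    """Return the first line of `<name> --version`, or None if the tool is not on PATH."""
    path = shutil.which(name)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some tools print their version to stderr, so fall back to it.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


for tool in ("uv", "tesseract"):
    version = tool_version(tool)
    print(f"{tool}: {version if version is not None else 'NOT FOUND'}")
```

If either tool is reported as NOT FOUND, reinstall it and open a new terminal so the updated PATH is picked up.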

Set Up The Project

# Get the project source on your local machine
git clone https://github.com/lifeinchords/pdf-toolbox.git
cd pdf-toolbox

# Create an isolated environment in `.venv`, similar to `python -m venv .venv`
uv venv

# Activate the virtual environment
source .venv/bin/activate

# Install Python dependencies, listed in `pyproject.toml`
# Similar to `pip install -r requirements.txt`
uv sync

Set Up Your OpenAI API Key

Access to OpenAI’s API is required to run the script. Follow these steps to create an API key and add it to the project:

  1. Create your environment file by copying the example file:

# In project root
cp .env.example .env

  2. Create an account on the OpenAI Platform if you don’t have one

  3. Generate an API key and copy it

  4. Open the .env file and paste your key. Make sure there are no spaces, quotes, or extra characters.

OPENAI_API_KEY=your_key_here
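
A quick way to catch the formatting mistakes mentioned above is to check the line before running the script. The sketch below is illustrative only: `check_key_line` is a hypothetical helper, not part of the project, and it checks only the rules stated here (no spaces, no quotes, placeholder replaced).

```python
from typing import List

PREFIX = "OPENAI_API_KEY="


def check_key_line(line: str) -> List[str]:
    """Return a list of problems found in one `.env` line (empty list means OK)."""
    line = line.rstrip("\n")
    if not line.startswith(PREFIX):
        return ["line must start with OPENAI_API_KEY="]
    value = line[len(PREFIX):]
    problems = []
    if value == "":
        problems.append("key is empty")
    if any(ch.isspace() for ch in value):
        problems.append("key contains whitespace")
    if '"' in value or "'" in value:
        problems.append("key contains quotes")
    if value == "your_key_here":
        problems.append("placeholder was not replaced")
    return problems


# A quoted key is flagged; an unquoted one passes.
print(check_key_line('OPENAI_API_KEY="abc123"'))
print(check_key_line("OPENAI_API_KEY=abc123"))
```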

The default model is set to "gpt-4o-mini" for the best balance of speed, quality and low cost.

The script caps input and output tokens as a safeguard against runaway billing costs. Always review the actual costs in your OpenAI Platform account when running the script on a large batch of files.

Prep Your PDFs

A sample book released under a Creative Commons license is included in the data directory: Grey Systems Analysis, by Sifeng Liu.

You can add more source PDFs by dragging or copying them into the [project-root]/data directory, or by copying them in the terminal:

cp /path/to/your/pdfs/*.pdf [project-root]/data/

Why copy rather than move the files?

This project is still in early development and the script might not give the intended results. Keep your originals in a safe location to experiment freely without risk.
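
If you prefer to script the copy step, the following sketch copies PDFs into the data directory and leaves the originals untouched. It is an illustration under stated assumptions: `copy_pdfs` is a hypothetical helper, not part of the project, and both paths in the usage note are placeholders.

```python
import shutil
from pathlib import Path
from typing import List


def copy_pdfs(source_dir: Path, data_dir: Path) -> List[Path]:
    """Copy every *.pdf from source_dir into data_dir without overwriting anything."""
    data_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for pdf in sorted(source_dir.glob("*.pdf")):
        target = data_dir / pdf.name
        if target.exists():
            continue  # keep whatever is already in the data directory
        shutil.copy2(pdf, target)  # copy2 also preserves timestamps
        copied.append(target)
    return copied
```

For example, `copy_pdfs(Path("/path/to/your/pdfs"), Path("data"))` run from the project root returns the list of newly copied files.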

Run The Pipeline Script

The script takes two optional arguments:

  • --with-stats-export: Saves an easy-to-read text file of proposed filenames, in addition to the log file, which is always saved.
  • --verbose-term: Shows detailed progress in your terminal

# Recommended for your first run
uv run src/main.py --with-stats-export --verbose-term
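
Both arguments are on/off switches. For readers curious how such flags are commonly declared in Python, here is a minimal `argparse` sketch. It is an assumption about a typical approach, not the project's actual code; `build_parser` is a hypothetical name, and only the two flag names come from this guide.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the two optional switches described above (illustrative only)."""
    parser = argparse.ArgumentParser(description="Illustrative flag parsing sketch.")
    parser.add_argument(
        "--with-stats-export",
        action="store_true",
        help="also save a readable text file of proposed filenames",
    )
    parser.add_argument(
        "--verbose-term",
        action="store_true",
        help="show detailed progress in the terminal",
    )
    return parser


# Flags default to False and become True only when passed.
args = build_parser().parse_args(["--with-stats-export"])
print(args.with_stats_export, args.verbose_term)
```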

Review The Results

You should see terminal output like the following:

$ uv run src/main.py --with-stats-export --verbose-term

PyMuPDF version: 1.25.2
*************
1	Total PDFs to process

~30 seconds and $0.001 per PDF with default settings...

*************

.
.
.
[full output omitted]
.
.
.

Processing: 100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:26<00:00, 26.96s/it]

Summary:
	Valid: 1
	Invalid: 0
	Successful: 1
	Failed: 0
	With Annotations: 0
	Files with API Timeouts: 0

Total:
	0m 26s
	Tokens: Input: 3909, Output: 719
	$0.001018

Average per PDF:
	0m 26s
	Tokens: Input: 3909.00, Output: 719.00
	$0.001018
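
The dollar figure in the summary follows from the token counts. As a rough check, the sketch below multiplies the tokens by per-million-token prices of $0.15 for input and $0.60 for output. These rates are an assumption, not taken from the project, and `estimate_cost` is a hypothetical helper; check current OpenAI pricing before relying on it. With these rates, 3909 input and 719 output tokens work out to about $0.001018, consistent with the sample total.

```python
# Assumed prices in USD per one million tokens (not taken from the project).
INPUT_PRICE_PER_MILLION = 0.15
OUTPUT_PRICE_PER_MILLION = 0.60


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the API cost in USD for one run from its token counts."""
    input_cost = input_tokens * INPUT_PRICE_PER_MILLION / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE_PER_MILLION / 1_000_000
    return input_cost + output_cost


# Token counts from the sample summary above.
print(f"${estimate_cost(3909, 719):.6f}")
```

Multiplying the per-PDF estimate by the number of files gives a ballpark figure for a larger batch, although actual costs vary with the length of each document.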

Additional outputs to review are:

  • Log file
  • Page images with OCR text
  • stats.txt file

The Evaluation guide has more details on how to review these results and iterate on the process.

Enable Write Mode and Rerun The Script

Once you’re satisfied with the dry run, enable write mode and rerun the script to apply the changes.

Note: Write mode is not yet implemented. It will be available in the next update.

Open the config file and set:

config.py
WRITE_PDF_CHANGES = True

# Optionally disable for better performance and saved disk space
SAVE_DIAGNOSTIC_FILES = False

Rerun the script. No arguments are necessary.

uv run src/main.py

You should see terminal output like this:

$ uv run src/main.py

******************
1	Total PDFs to process

~30 seconds and $0.001 per PDF with default settings...

******************
Validating: 100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 28.00it/s]
Validating:   0%|                                                                                              | 0/1 [00:00<?, ?it/s]
1	Valid
0	Invalid

Processing valid files...
******************

        Current: grey_systems_analysis.pdf
        Proposed: Grey Systems Analysis, Methods, Models and Applications, (Sifeng Liu), Springer, (2nd Ed.), (2025).pdf

Processing: 100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:27<00:00, 27.23s/it]

Summary:
	Valid: 1
	Invalid: 0
	Successful: 1
	Failed: 0
	With Annotations: 0
	Files with API Timeouts: 0

Total:
	0m 27s
	Tokens: Input: 3909, Output: 745
	$0.001033

Average per PDF:
	0m 27s
	Tokens: Input: 3909.00, Output: 745.00
	$0.001033

With write mode enabled, the file would be renamed as follows:

grey_systems_analysis.pdf -> Grey Systems Analysis, Methods, Models and Applications, (Sifeng Liu), Springer, (2nd Ed.), (2025).pdf

Configuring the filename format is planned for the next update.

Celebrate!

🚀 🎉 🎶 ✨

Next Steps

Now that you’ve got the basics working, explore these guides to get the most out of PDF Toolbox:

Need Help?

If you need help or have any questions, start a Discussion on GitHub. We’ll be glad to help with your specific use case.