We recommend uv to install and manage Python, project dependencies and environments.
It’s significantly faster than pip and does everything in one tool.

Homebrew is the easiest way to install it on macOS.
Other installation methods are covered in the uv documentation.
# Get the project source on your local machine
cd pdf-toolbox

# Create an isolated environment, similar to `python -m venv .venv`
uv venv

# Activate the virtual environment
source .venv/bin/activate

# Install Python dependencies, listed in `pyproject.toml`
# Similar to `pip install -r requirements.txt`
uv sync
Open the .env file and paste your key. Make sure there are no spaces, quotes, or
extra characters.
OPENAI_API_KEY=your_key_here
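The formatting problems mentioned above (spaces, quotes, extra characters) are easy to check before running anything. The sketch below is only an illustration: `check_env_line` is a hypothetical helper, not part of this project, and it assumes the key should be a single unquoted token with no whitespace.

```python
def check_env_line(line: str) -> list[str]:
    """Return the problems found in one `.env` line (empty list if it looks fine).

    Hypothetical helper for illustration; not part of the project.
    """
    name, sep, value = line.rstrip("\n").partition("=")
    if not sep:
        return ["missing '=' between name and value"]

    problems = []
    # Spaces on either side of '=' or at the end of the line
    if name != name.strip() or value != value.strip():
        problems.append("spaces around the name or value")

    value = value.strip()
    if not value:
        problems.append("value is empty")
    elif value[0] in "\"'" or value[-1] in "\"'":
        problems.append("value is wrapped in quotes")

    # Whitespace left inside the value after trimming both ends
    if any(ch.isspace() for ch in value):
        problems.append("whitespace inside the value")
    return problems


print(check_env_line("OPENAI_API_KEY=your_key_here"))
print(check_env_line('OPENAI_API_KEY = "your_key_here"'))
```

An empty list means the line matches the expected `NAME=value` shape. It says nothing about whether the key itself is valid.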
The default model is set to "gpt-4o-mini" for the best balance of speed, quality and low
cost.
Our script caps input and output tokens as a safeguard against runaway billing costs.
Always review the actual costs on your OpenAI Platform account when running the script
for a large batch of files.
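Before a large batch, it can help to work out an upper bound on usage from the token caps. This is a rough sketch under stated assumptions: it assumes one request per file, and the cap values and per-million-token prices are placeholders you supply yourself. The script's actual caps and request pattern are not stated here, and current prices should be taken from your OpenAI Platform account.

```python
def worst_case_tokens(num_files: int, max_input_tokens: int, max_output_tokens: int) -> dict[str, int]:
    """Upper bound on tokens for a batch, assuming one request per file at both caps."""
    return {
        "input": num_files * max_input_tokens,
        "output": num_files * max_output_tokens,
    }


def worst_case_cost(tokens: dict[str, int],
                    input_price_per_million: float,
                    output_price_per_million: float) -> float:
    """Upper bound on cost, given prices per one million tokens (caller-supplied)."""
    total = (tokens["input"] * input_price_per_million
             + tokens["output"] * output_price_per_million)
    return total / 1_000_000


# Placeholder numbers for illustration only; substitute your own caps and prices.
budget = worst_case_tokens(num_files=100, max_input_tokens=2000, max_output_tokens=500)
print(budget)
print(worst_case_cost(budget, input_price_per_million=1.0, output_price_per_million=2.0))
```

The result is a ceiling, not a forecast: real usage is usually lower because most files will not hit both caps.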
A sample book under a Creative Commons license is included in the data directory:
Grey Systems Analysis, by Liu, Sifeng.

You can add more source PDFs by dragging or copying them into the
[project-root]/data directory, or by copying them in the terminal:
cp /path/to/your/pdfs/*.pdf [project-root]/data/
Why copy rather than move the files?

This project is still in early development and the script might not give the
intended results. Keep your originals in a safe location to experiment freely without risk.
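If you prefer to script the copy step, the same "copy, don't move" approach can be sketched in Python with the standard library. `copy_pdfs` is a hypothetical helper, not part of this project, and the paths in the usage line are placeholders.

```python
import shutil
from pathlib import Path


def copy_pdfs(source_dir: str, data_dir: str) -> list[Path]:
    """Copy every *.pdf from source_dir into data_dir, leaving the originals untouched."""
    target = Path(data_dir)
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    for pdf in sorted(Path(source_dir).glob("*.pdf")):
        # copy2 also preserves timestamps where the platform allows it
        copied.append(Path(shutil.copy2(pdf, target / pdf.name)))
    return copied


# Example call (placeholder paths):
# copy_pdfs("/path/to/your/pdfs", "data")
```

Note that `glob("*.pdf")` is case-sensitive on most Linux filesystems, so files ending in `.PDF` would be skipped by this sketch.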
Before processing, validate your PDFs to catch any structural issues early. This checks for
corrupt page trees, incremental save compatibility, and metadata accessibility.

Use the --validate-only flag to run validation without processing:
uv run src/main.py --validate-only
You can also specify a custom directory with --dir:
uv run src/main.py --validate-only --dir /path/to/your/pdfs
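For readers curious how flags like these are typically wired up, here is a minimal `argparse` sketch. It is not the project's actual `src/main.py`: the help texts and the `data` default for `--dir` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two flags shown above (illustrative only)."""
    parser = argparse.ArgumentParser(description="Validate and process PDFs.")
    parser.add_argument(
        "--validate-only",
        action="store_true",
        help="run validation and exit without processing",
    )
    parser.add_argument(
        "--dir",
        default="data",  # assumed default; the real script may differ
        help="directory containing the PDFs",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--validate-only", "--dir", "/path/to/your/pdfs"])
print(args.validate_only, args.dir)
```

`argparse` converts `--validate-only` to the attribute name `validate_only`, and `store_true` makes the flag default to `False` when it is omitted.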
The validation output shows how many files are valid or invalid:
$ uv run src/main.py --validate-only
PyMuPDF version: 1.25.2
*************
2 Total PDFs to validate
*************
Validating: 100%|██████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 28.00it/s]
2 Valid
0 Invalid
Validation complete. Exiting (--validate-only mode).
Files marked as invalid will be skipped during processing. Check the log file for details
about why files failed validation, including repair warnings and page tree corruption
errors.
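The "validate first, then skip invalid files" pattern can be sketched with the standard library alone. This is deliberately much weaker than the project's validation, which reports PyMuPDF in its output and checks page trees and metadata. The hypothetical `looks_like_pdf` below only confirms that a file starts with the `%PDF-` header and has an `%%EOF` marker near the end.

```python
from pathlib import Path


def looks_like_pdf(path: Path) -> bool:
    """Cheap sanity check: '%PDF-' header at the start, '%%EOF' in the last 1 KiB."""
    data = path.read_bytes()
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]


def partition_pdfs(directory: str) -> tuple[list[Path], list[Path]]:
    """Split the *.pdf files in a directory into (valid, invalid) lists."""
    valid, invalid = [], []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        (valid if looks_like_pdf(pdf) else invalid).append(pdf)
    return valid, invalid


# Example call (placeholder path); only the files in `valid` would be processed.
# valid, invalid = partition_pdfs("data")
# print(f"{len(valid)} Valid, {len(invalid)} Invalid")
```

A file can pass this check and still have a corrupt page tree, so treat it as a quick pre-filter rather than a substitute for `--validate-only`.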