Quickstart
Follow the steps below to set up your environment, configure essential settings, and run the script.
Install dependencies
- Python - For running the script
- Tesseract - The OCR engine for extracting text from PDF page images
Python and uv
We recommend uv to install and manage Python, project dependencies, and environments. It’s significantly faster than pip and does everything in one tool.
Homebrew is the easiest way to install it on macOS; other installation methods are listed in the uv documentation.
Tesseract OCR engine
Tesseract needs to be installed separately from Python. Homebrew is the easiest way to install it on macOS; other installation methods are covered in the Tesseract documentation.
Confirm Tesseract is correctly installed by running tesseract --version in your terminal.
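The same check can be done from Python, which is handy if you want to fail early before processing any PDFs. This is a minimal sketch, not part of the project: the function name tesseract_version is hypothetical, and it only assumes that the tesseract executable is expected on your PATH.

```python
import shutil
import subprocess


def tesseract_version(executable="tesseract"):
    """Return the first line of `<executable> --version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    # Version info may arrive on stdout or stderr depending on the release,
    # so capture both streams and combine them.
    result = subprocess.run(
        [path, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    output = (result.stdout + result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    version = tesseract_version()
    print(version or "Tesseract was not found on PATH")
```

If the function returns None, the install step above most likely did not put Tesseract on your PATH.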
Set Up The Project
Set Up Your OpenAI API Key
Access to OpenAI’s API is required to run the script. Follow these steps to create an API key and add it to the project:
- Create your environment file by copying the example file:
- Create an account on the OpenAI Platform if you don’t have one.
- Generate an API key and copy it.
- Open the .env file and paste your key. Make sure there are no spaces, quotes, or extra characters.
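The "no spaces, quotes, or extra characters" rule from the steps above can be checked mechanically. The sketch below is illustrative only: find_env_problems is a hypothetical helper, and the variable name OPENAI_API_KEY is an assumption (it is the name the OpenAI SDK reads by default, but this project's .env file may use a different one).

```python
def find_env_problems(env_text, key_name="OPENAI_API_KEY"):
    """Return a list of problems found on the key line of .env-style text.

    key_name is an assumption; pass the variable name your .env actually uses.
    """
    for line in env_text.splitlines():
        if line.strip().startswith("#") or "=" not in line:
            continue  # skip comments and lines that are not assignments
        name, value = line.split("=", 1)
        if name.strip() != key_name:
            continue
        problems = []
        if name != name.strip() or value != value.strip():
            problems.append("spaces around the name or value")
        stripped = value.strip()
        if any(quote in stripped for quote in "\"'"):
            problems.append("quotes in the value")
        if not stripped:
            problems.append("empty value")
        elif any(ch.isspace() for ch in stripped):
            problems.append("whitespace inside the value")
        return problems
    return [f"{key_name} not found"]


if __name__ == "__main__":
    # Placeholder value, not a real key.
    print(find_env_problems('OPENAI_API_KEY = "sk-example123"'))
```

An empty list means the line looks clean; it does not confirm that the key itself is valid.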
The default model is set to "gpt-4o-mini" for the best balance of speed, quality, and low cost.
Our script caps input and output tokens as a safeguard against runaway billing costs. Always review the actual costs on your OpenAI Platform account when running the script on a large batch of files.
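Token caps make the worst-case spend easy to bound before a large run. The sketch below shows the arithmetic only; the function is hypothetical, and the cap and price values in the example are placeholders, not the project's settings or OpenAI's actual prices. Substitute the caps from your configuration and the current prices from the OpenAI pricing page.

```python
def worst_case_cost(num_requests, max_input_tokens, max_output_tokens,
                    input_price_per_million, output_price_per_million):
    """Upper bound on spend if every request hits both token caps.

    Prices are expressed per one million tokens.
    """
    input_cost = max_input_tokens * input_price_per_million / 1_000_000
    output_cost = max_output_tokens * output_price_per_million / 1_000_000
    return num_requests * (input_cost + output_cost)


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    estimate = worst_case_cost(
        num_requests=100,
        max_input_tokens=2_000,
        max_output_tokens=500,
        input_price_per_million=1.0,
        output_price_per_million=2.0,
    )
    print(f"Worst-case spend: {estimate:.2f}")
```

This is an upper bound, so actual usage reported on your OpenAI Platform account should come in at or below it.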
Prep Your PDFs
A sample book under a Creative Commons license is included in the data directory:
Grey Systems Analysis, by Liu, Sifeng
You can add additional source PDFs by dragging or copying them into the [project-root]/data directory, or by copying them in the terminal:
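If you prefer to script the copy step, a minimal Python sketch could look like the following. The helper copy_pdfs is hypothetical and not part of the project; it copies rather than moves, so your originals stay untouched, and it skips any name that already exists in the target directory.

```python
import shutil
from pathlib import Path


def copy_pdfs(source_dir, data_dir):
    """Copy every PDF in source_dir into data_dir, leaving the originals in place."""
    source = Path(source_dir)
    target = Path(data_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for pdf in sorted(source.iterdir()):
        if pdf.is_file() and pdf.suffix.lower() == ".pdf":
            destination = target / pdf.name
            if destination.exists():
                continue  # never overwrite a file already in the data directory
            shutil.copy2(pdf, destination)
            copied.append(destination)
    return copied
```

For example, calling copy_pdfs with your downloads folder as the source and the project's data directory as the target returns the list of files that were copied.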
This project is still in early development and the script might not give the intended results. Keep your originals in a safe location to experiment freely without risk.
Run The Pipeline Script
The script takes two optional arguments:
- --with-stats-export: Saves an easy-to-read text file of proposed filenames. This is in addition to the log file, which is always saved.
- --verbose-term: Shows detailed progress in your terminal.
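Both arguments are simple on/off switches. As an illustration of how such flags behave, here is a sketch using Python's argparse. It is an assumption about the general pattern, not the project's actual argument parser; only the two flag names come from this guide.

```python
import argparse


def build_parser():
    """Build a parser with the two optional flags described above (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Hypothetical sketch of the pipeline script's flags"
    )
    parser.add_argument(
        "--with-stats-export",
        action="store_true",
        help="also save a text file of proposed filenames",
    )
    parser.add_argument(
        "--verbose-term",
        action="store_true",
        help="show detailed progress in the terminal",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    # argparse converts dashes to underscores in attribute names.
    print(args.with_stats_export, args.verbose_term)
```

With store_true flags, omitting an argument leaves it False, which matches the guide's description of both arguments as optional.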
Review The Results
You should see terminal output like the following:
Additional outputs to review are:
- Log file
- Page images with OCR text
- stats.txt file
The Evaluation guide has more details on how to review these results and iterate on the process.
Enable Write Mode and Rerun The Script
Once you’re satisfied with the dry run, enable write mode and rerun the script to apply the changes.
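The difference between a dry run and write mode can be sketched as follows. This is a simplified illustration, not the project's implementation: apply_rename and its write_mode parameter are hypothetical names, and the real setting lives in the project's config file.

```python
from pathlib import Path


def apply_rename(pdf_path, proposed_name, write_mode=False):
    """Return the target path; rename on disk only when write_mode is True."""
    source = Path(pdf_path)
    target = source.with_name(proposed_name)
    if not write_mode:
        # Dry run: report the proposal without touching the file.
        print(f"[dry run] {source.name} -> {target.name}")
        return target
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    source.rename(target)
    return target
```

In a dry run the original file is left in place and only the proposed name is reported, which is why it is safe to review results before enabling write mode.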
Open the config file and set:
Rerun the script. No arguments are necessary.
You should see terminal output like this:
As a result, the file is renamed:
Celebrate!
🚀 🎉 🎶 ✨
Next Steps
Now that you’ve got the basics working, explore these guides to get the most out of PDF Toolbox:
- Evaluate The Results: Learn how to analyze the output and iterate on the process
- Matter Pages: Learn about front, body and back matter pages
- Key Concepts: Understand the project’s core ideas and technologies
- Vision Language Models: Add an additional layer of understanding
Need Help?
If you need help or have any questions, start a Discussion on GitHub. We’ll be glad to help with your specific use case.