How It Works
A detailed look at the processing pipeline
Setup
A log file is created in the DIAGNOSTIC_FOLDER, where all pipeline steps will be logged, along with raw prompt requests and responses.
The pipeline begins by looking for PDFs in the data directory.
Each PDF is validated to confirm that its metadata can be accessed.
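The setup steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual code: the function name `setup_run`, the logger name, and the `pipeline.log` file name are assumptions. The metadata validation itself is left out, since it depends on the PDF library.

```python
import logging
from pathlib import Path


def setup_run(data_dir: Path, diagnostic_dir: Path) -> tuple[logging.Logger, list[Path]]:
    """Create the run log and return the PDFs found in data_dir."""
    diagnostic_dir.mkdir(parents=True, exist_ok=True)

    # Hypothetical logger and file names; the real pipeline may use others.
    logger = logging.getLogger("pipeline")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(diagnostic_dir / "pipeline.log", encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)

    # Match the extension case-insensitively so "scan.PDF" is also picked up.
    pdfs = sorted(
        p for p in data_dir.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"
    )
    logger.info("Found %d PDF(s) in %s", len(pdfs), data_dir)
    return logger, pdfs
```

Each later step would log through the same logger, so a single file in the diagnostic folder records the whole run.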
Preprocess pages
For each PDF file, the script uses the PyMuPDF Python library to identify the front and back matter pages. What are matter pages?
By default, the first 8 front matter pages are processed. Body matter is always ignored, and back matter can optionally be included through the MATTER_CONFIG.back settings.
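As a rough illustration of the page selection rule, the sketch below picks page indices from a page count. The function `select_pages` and its `back` parameter are hypothetical: the real shape of the MATTER_CONFIG.back settings is not shown here, so a simple page count stands in for it.

```python
def select_pages(page_count: int, front: int = 8, back: int = 0) -> list[int]:
    """Return zero-based indices of the front and back matter pages to process."""
    # Front matter: the first `front` pages, or fewer for short documents.
    front_pages = range(min(front, page_count))

    # Back matter: the last `back` pages, never overlapping the front matter.
    back_start = max(page_count - back, len(front_pages))
    back_pages = range(back_start, page_count)

    # Everything in between is body matter and is skipped.
    return [*front_pages, *back_pages]
```

For a 100-page document with `back=2`, this selects pages 0 through 7 plus pages 98 and 99. A 5-page document yields all five pages once, with no duplicates.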
Processed page images are then enhanced for OCR with the PIL (Pillow) Python library.
Original and enhanced images are written to the diagnostic folder when SAVE_DIAGNOSTIC_FILES=true.
Run traditional OCR
Each page is identified by type (text, image, or mixed). Tesseract OCR then runs, and the quality of its output is scored against the OCR_CONFIDENCE_THRESHOLD values.
This produces raw text content but doesn’t preserve document structure or hierarchy.
The extracted, unstructured text files are written to the diagnostic folder when SAVE_DIAGNOSTIC_FILES=true.
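The classification and scoring described above might look something like the sketch below. It makes several assumptions: the helper names are hypothetical, a single threshold on a 0 to 100 scale stands in for the OCR_CONFIDENCE_THRESHOLD values, and page type is decided from simple counts of embedded text characters and images. Tesseract's tabular output reports a confidence of -1 for rows that are not words, so those are excluded from the average.

```python
from statistics import mean

# Assumed value and scale (0-100); the real configuration may differ.
OCR_CONFIDENCE_THRESHOLD = 60.0


def classify_page(text_chars: int, image_count: int) -> str:
    """Label a page as 'text', 'image', or 'mixed' from its embedded content."""
    if text_chars > 0 and image_count > 0:
        return "mixed"
    # Pages with no embedded text are treated as image pages that need OCR.
    return "text" if text_chars > 0 else "image"


def ocr_quality(
    word_confidences: list[float],
    threshold: float = OCR_CONFIDENCE_THRESHOLD,
) -> tuple[float, bool]:
    """Return the mean word confidence and whether it meets the threshold."""
    # Negative values mark non-word rows and carry no confidence information.
    scored = [c for c in word_confidences if c >= 0]
    average = mean(scored) if scored else 0.0
    return average, average >= threshold
```

A page whose average falls below the threshold could then be flagged in the log as low-quality OCR, leaving later steps to weigh that text accordingly.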
Extract structured text
PyMuPDF4LLM extracts text to Markdown. Unlike Tesseract, it can differentiate between titles and subtitles by detecting font size and style nuances.
Extracted Markdown files are written to the diagnostic folder when SAVE_DIAGNOSTIC_FILES=true.
Analyze with a VLM
When ANALYZE_WITH_VISION=true, the page is analyzed by a Vision Language Model that holistically understands the page elements and their relationships.
We use Meta’s llama3.2-vision, served by Ollama, which can run locally, on your network, or in the cloud. See the setup and configuration instructions.
Analysis text files are written to the diagnostic folder when SAVE_DIAGNOSTIC_FILES=true.
This step is in active development, and VLM results are not yet integrated into the pipeline.
Compose metadata prompt
All of the extracted data is consolidated into a single prompt and sent to the OpenAI model set in MODEL_NAME.
The prompt includes detailed extraction rules, e.g., what to do when multiple editions are found.
Process metadata response
OpenAI returns the metadata fields with confidence scores and generation rationale.
This structured JSON response adheres to the schemas defined in schemas.py, which reduces hallucinations, alignment warnings, and pre/post-ambles.
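To illustrate what handling such a response could involve, here is a minimal sketch that parses a JSON payload into typed records. The actual contents of schemas.py are not shown in this document, so the `FieldResult` class, its field names, and the 0 to 1 confidence range are all assumptions.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldResult:
    """One metadata field as returned by the model (hypothetical shape)."""
    value: str
    confidence: float
    rationale: str


def parse_metadata_response(raw: str) -> dict[str, FieldResult]:
    """Parse a JSON response into FieldResult records, rejecting bad entries."""
    payload = json.loads(raw)
    results: dict[str, FieldResult] = {}
    for name, entry in payload.items():
        # A missing key raises KeyError, so malformed entries fail loudly.
        result = FieldResult(
            value=str(entry["value"]),
            confidence=float(entry["confidence"]),
            rationale=str(entry["rationale"]),
        )
        if not 0.0 <= result.confidence <= 1.0:
            raise ValueError(f"{name}: confidence out of range: {result.confidence}")
        results[name] = result
    return results
```

Keeping the confidence and rationale next to each value makes it straightforward to log why a field was chosen, or to skip low-confidence fields before writing.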
Compose filename prompt
The prompt for filename generation is sent to OpenAI, with detailed format constraints:
Process filename response
The generated filename in the OpenAI response is cleaned of unsafe characters and trimmed to MAX_FILENAME_LENGTH.
Finally, when WRITE_PDF_CHANGES=true, the generated metadata and filename are written to the file.
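The filename cleaning step can be sketched as below. This is an illustration only: the set of characters treated as unsafe, the value of MAX_FILENAME_LENGTH, the `untitled` fallback, and the function name `clean_filename` are assumptions, not the project's actual rules.

```python
import re

# Assumed limit; the real value comes from the pipeline configuration.
MAX_FILENAME_LENGTH = 120

# Characters rejected by common filesystems, plus ASCII control characters.
_UNSAFE = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def clean_filename(
    name: str,
    max_length: int = MAX_FILENAME_LENGTH,
    suffix: str = ".pdf",
) -> str:
    """Strip unsafe characters and trim the name so it fits max_length."""
    # Work on the stem so trimming never cuts off the extension.
    stem = name[: -len(suffix)] if name.lower().endswith(suffix) else name
    stem = _UNSAFE.sub("", stem)
    stem = re.sub(r"\s+", " ", stem).strip(" .")
    stem = stem[: max_length - len(suffix)].rstrip(" .")
    # Fall back to a placeholder if nothing usable remains.
    return (stem or "untitled") + suffix
```

Trimming the stem rather than the whole name keeps the `.pdf` extension intact even when the model returns an overly long title.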