To just get going, follow the Quickstart guide.
0
Setup
A log file is created in the DIAGNOSTIC_FOLDER, where all pipeline steps are logged along with raw prompt requests and responses. The pipeline begins by looking for PDFs in the data directory. PDFs are validated to ensure their metadata can be accessed.
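As a rough illustration of this setup step, the sketch below creates a log file in a diagnostic folder and collects PDFs from a data directory. The function names, the `pipeline.log` filename, and the header check are assumptions made for the example, not the pipeline's actual code. In particular, `looks_like_pdf` only checks the `%PDF-` file signature as a cheap stand-in and does not reproduce the pipeline's metadata validation.

```python
import logging
from pathlib import Path


def setup_logging(diagnostic_folder: Path) -> logging.Logger:
    """Create the diagnostic folder and log to a file inside it.

    The logger name and log filename are hypothetical.
    """
    diagnostic_folder.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger("pipeline")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(diagnostic_folder / "pipeline.log", encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def find_pdfs(data_dir: Path) -> list[Path]:
    """Return the PDF files directly inside data_dir, sorted by path."""
    return sorted(
        p for p in data_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def looks_like_pdf(path: Path) -> bool:
    """Stand-in validity check: a PDF file starts with the bytes %PDF-."""
    with path.open("rb") as fh:
        return fh.read(5) == b"%PDF-"
```

A caller could then log each discovered file and skip any for which `looks_like_pdf` returns `False` before moving on to preprocessing.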
1
Preprocess pages

The PyMuPDF Python library is used to identify the front and back matter pages. By default, the first 8 front matter pages are processed. Body matter is always ignored, and back matter inclusion can be optionally configured with MATTER_CONFIG.back settings. Processed page images are then enhanced for OCR with the PIL (Pillow) Python library. Original and enhanced images are written to the diagnostic folder when SAVE_DIAGNOSTIC_FILES=true.
2
Run traditional OCR

Content is classified by type (text, image, or mixed). Tesseract OCR runs, and quality is scored with OCR_CONFIDENCE_THRESHOLD values. This produces raw text content but doesn’t preserve document structure or hierarchy. The extracted, unstructured text files are written to the diagnostic folder when SAVE_DIAGNOSTIC_FILES=true.
3
Extract structured text

4
Analyze with a VLM
When ANALYZE_WITH_VISION=true,
the page is analyzed by a Vision Language Model (VLM) that holistically
understands the page elements and their relationships. We use Meta’s
llama3.2-vision, hosted by Ollama locally, on your network, or in the cloud.
See the setup and configuration instructions. Analysis text files are written to the diagnostic folder when SAVE_DIAGNOSTIC_FILES=true.
This step is in active development, and VLM results are not yet integrated into the pipeline.
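To make the VLM call more concrete, here is a minimal sketch of sending one page image to llama3.2-vision through Ollama's `/api/generate` endpoint. The function names, the default URL (Ollama's usual local address), and the timeout are assumptions for illustration. The pipeline's actual prompt and request code may look quite different.

```python
import base64
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_vision_request(image_bytes: bytes, prompt: str,
                         model: str = "llama3.2-vision") -> dict:
    """Build the JSON body for a non-streaming Ollama generate request.

    Ollama expects images as a list of base64-encoded strings.
    """
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def analyze_page(image_bytes: bytes, prompt: str,
                 url: str = OLLAMA_URL, timeout: float = 120.0) -> str:
    """POST the request to Ollama and return the model's text response."""
    body = json.dumps(build_vision_request(image_bytes, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

For a network-hosted or cloud-hosted Ollama instance, only the `url` argument would need to change in this sketch.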
5
Compose metadata prompt

6
Process metadata response

7
Compose filename prompt
The prompt for filename generation is sent to OpenAI,
with detailed format constraints:
8
Process filename response
