To get started quickly, follow the [Quickstart guide](/getting-started/quickstart). A log file is created in the DIAGNOSTIC\_FOLDER; it records every pipeline step, along with the raw prompt requests and responses. The pipeline begins by looking for PDFs in the data directory. Each PDF is validated to ensure its metadata is accessible.

Pipeline, Step 1: Load and Preprocess PDF. Shows the front, body, and back matter pages of a book.

For each PDF file, the script uses the `PyMuPDF` Python library to identify the front and back matter pages. [What are matter pages?](/key-concepts/matter-pages) By default, the first [8 front matter pages](/config/processing#front-matter) are processed. Body matter is always ignored, and back matter inclusion can be optionally configured with MATTER\_CONFIG.back settings. Processed page images are then enhanced for OCR with the `PIL` (Pillow) Python library. Original and enhanced images are written to the diagnostic folder when SAVE\_DIAG\_PER\_PAGE\_IMG=true.
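
The page selection described above can be sketched in a few lines. This is a minimal illustration and not the project's actual code: the function name `select_matter_pages` and its `front` and `back` parameters are hypothetical stand-ins for whatever MATTER\_CONFIG provides, and only the default of 8 front matter pages comes from the text.

```python
def select_matter_pages(page_count: int, front: int = 8, back: int = 0) -> list[int]:
    """Return 0-based indices of front and back matter pages, skipping the body.

    `front` and `back` are hypothetical parameters standing in for the
    pipeline's matter configuration.
    """
    if page_count < 0 or front < 0 or back < 0:
        raise ValueError("counts must be non-negative")

    # Front matter: the first `front` pages, or the whole file if it is shorter.
    front_pages = list(range(min(front, page_count)))

    # Back matter: the last `back` pages, never overlapping the front matter.
    back_start = max(page_count - back, len(front_pages))
    back_pages = list(range(back_start, page_count))

    return front_pages + back_pages
```

With a 300-page book and `back=2`, this yields pages 0 through 7 plus pages 298 and 299. For a file shorter than the front matter count, every page is returned exactly once.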

Pipeline, Step 2: Identify Page Types and Run Tesseract OCR.
Shows yellow text regions overlaid on the front and back matter pages, identifying the metadata fields

Each page is identified by type (`text`, `image`, or `mixed`). [Tesseract OCR](/config/ai-models/tesseract) runs and quality is scored with OCR\_CONFIDENCE\_THRESHOLD values. This produces raw text content but doesn't preserve document structure or hierarchy. The extracted, unstructured text files are written to the diagnostic folder when SAVE\_DIAG\_TXT\_PER\_PG=true.
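
As a rough sketch of how OCR quality scoring might work, the example below averages per-word confidence values and compares the result to a threshold. It assumes a single numeric threshold (the real OCR\_CONFIDENCE\_THRESHOLD settings may be structured differently), and the function name and the default of 60 are illustrative assumptions. Tesseract reports a confidence of -1 for rows that are not words, so those are excluded.

```python
from statistics import mean


def score_ocr_page(word_confidences: list[float], threshold: float = 60.0) -> dict:
    """Summarize per-word OCR confidences for one page.

    `threshold` is a hypothetical single cutoff on a 0-100 scale.
    """
    # Tesseract uses -1 for block, paragraph, and line rows; keep real words only.
    valid = [c for c in word_confidences if c >= 0]
    avg = mean(valid) if valid else 0.0
    return {
        "mean_confidence": avg,
        "word_count": len(valid),
        # A page with no recognized words never passes.
        "passed": bool(valid) and avg >= threshold,
    }
```

A page scoring below the threshold could then be flagged in the log or handled differently in later steps; how the pipeline actually reacts is not specified here.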

Pipeline, Step 3: Extract Structured Text with PyMuPDF4LLM. Shows the extracted text being transformed into a hierarchical structure with title, subtitle, and other metadata elements properly identified.

PyMuPDF4LLM extracts text to Markdown. Unlike Tesseract, it can differentiate between titles and subtitles by detecting differences in font size and style. Extracted Markdown files are written to the diagnostic folder when SAVE\_DIAG\_TXT\_PER\_PG=true.

When ANALYZE\_WITH\_VISION=true, each page is also analyzed by a Vision Language Model (VLM) that interprets the page elements and their relationships as a whole. We use Meta's llama3.2-vision, served by Ollama locally, on your network, or in the cloud; see the [setup and configuration](/config/ai-models/llama#how-to-run-a-vlm) instructions. Analysis *text files* are written to the diagnostic folder when SAVE\_DIAG\_TXT\_PER\_PG=true. This step is in active development, and VLM results are not yet integrated into the pipeline.

Pipeline, Step 4: Consolidate Data and Generate LLM Prompt. Shows all the arrows of the extracted text regions consolidating into one arrow, leading to the LLM prompt and rules.

All of the extracted data is consolidated into a single prompt and sent to the active provider defined by `SELECTED_MODEL_NAME`. The prompt includes detailed extraction rules, for example what to do when multiple editions are found.
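
A consolidation step like this one might look roughly as follows. The section headings, the instruction wording, and the function signature are all assumptions made for illustration; the real prompt and its rules are defined by the project.

```python
def build_metadata_prompt(
    ocr_text: str,
    markdown_text: str,
    pdf_metadata: dict[str, str],
    rules: list[str],
) -> str:
    """Combine all extracted sources into one prompt string (hypothetical layout)."""
    rule_lines = "\n".join(f"- {rule}" for rule in rules)
    # Sort keys so the prompt is deterministic for identical inputs.
    meta_lines = "\n".join(f"{key}: {value}" for key, value in sorted(pdf_metadata.items()))

    sections = [
        "Extract bibliographic metadata from the book pages below.",
        "## Rules\n" + rule_lines,
        "## Embedded PDF metadata\n" + meta_lines,
        "## Structured text (Markdown)\n" + markdown_text.strip(),
        "## Raw OCR text\n" + ocr_text.strip(),
    ]
    return "\n\n".join(sections)
```

Keeping the rules ahead of the page content is one reasonable ordering choice; a deterministic prompt also makes the logged requests easier to compare between runs.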

Pipeline, Step 5: Process LLM Response and Validate Metadata.
Shows the structured LLM response being processed and validated.

OpenAI returns the metadata fields with confidence scores and generation rationale. This structured JSON response adheres to our defined schema, reducing hallucinations, alignment warnings, and pre/post-ambles.

The prompt for filename generation is sent to OpenAI, with detailed format constraints:

```bash
[Title], [Subtitle], ([Author]), Publisher, ([Edition]), ([Year]).pdf
```
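
Returning to the structured JSON response from Step 5, the sketch below shows one way such a response could be validated before use. The field names, the `value`/`confidence` layout, and the 0 to 1 confidence scale are assumptions for illustration only; the actual schema is defined by the project.

```python
import json

# Hypothetical set of required fields; the real schema may differ.
REQUIRED_FIELDS = ("title", "author", "publisher", "year")


def parse_metadata_response(raw: str, min_confidence: float = 0.5) -> dict:
    """Parse a JSON response and keep fields at or above a confidence cutoff."""
    data = json.loads(raw)
    accepted = {}
    for name in REQUIRED_FIELDS:
        entry = data.get(name)
        if not isinstance(entry, dict):
            raise ValueError(f"missing or malformed field: {name}")
        confidence = entry.get("confidence")
        is_number = isinstance(confidence, (int, float)) and not isinstance(confidence, bool)
        if not is_number or not 0 <= confidence <= 1:
            raise ValueError(f"invalid confidence for field: {name}")
        # Low-confidence values are dropped rather than written to the file.
        if confidence >= min_confidence:
            accepted[name] = entry.get("value")
    return accepted
```

Rejecting malformed responses early keeps a bad generation from reaching the write step.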

Pipeline, Step 6: Generate Filename with OpenAI.
Shows the final LLM response of the generated filename. This will be written to the file.

The generated filename in the OpenAI response is cleaned of unsafe characters and trimmed to MAX\_FILENAME\_LENGTH. Finally, when WRITE\_PDF\_CHANGES=true, the generated metadata and filename are written to the file.
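
The cleaning and trimming described above could be approximated as follows. Which characters count as unsafe is an assumption here (the set reserved on Windows plus control characters), as are the function name and the default length of 255; the project's own rules may differ.

```python
import re

# Assumed unsafe set: Windows-reserved characters and ASCII control characters.
UNSAFE_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
EXTENSION = ".pdf"


def clean_filename(name: str, max_length: int = 255) -> str:
    """Remove unsafe characters and trim to `max_length`, keeping the .pdf extension."""
    if max_length <= len(EXTENSION):
        raise ValueError("max_length must leave room for the extension")

    stem = name[: -len(EXTENSION)] if name.lower().endswith(EXTENSION) else name
    stem = UNSAFE_CHARS.sub("", stem)
    # Collapse runs of whitespace, then drop leading/trailing spaces and dots.
    stem = re.sub(r"\s+", " ", stem).strip(" .")
    # Reserve room for the extension, and avoid ending on a dangling separator.
    stem = stem[: max_length - len(EXTENSION)].rstrip(" .,")
    return stem + EXTENSION
```

Trimming the stem rather than the whole string ensures the result always ends in `.pdf`, even for very long titles.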