Evaluation - PDF Toolbox

To get the best results, review the output of each run and adjust the configuration settings to fit your collection. The script produces four key outputs:

Terminal output - Real-time progress per page, ending with a summary
Log file - Includes every step of the process, including AI model requests and responses
Page images and extracted text - Original + enhanced images, and extracted text as plain text files
stats.txt - Easy-to-read summary of processed files with proposed filenames

Terminal output

When given the --verbose-output argument, the script displays detailed, real-time progress and a summary.


PyMuPDF version: 1.25.2
*************
1	Total PDFs to process

~30 seconds and $0.001 per PDF with default settings...

*************
Pre-warming OpenAI schema cache...
Pre-warming metadata extraction schema...
Pre-warming filename generation call...
Schema cache pre-warming complete
Testing schema caching behavior...
Testing metadata schema caching (3 identical calls):
  Call 1: 0.87 seconds
  Call 2: 1.74 seconds
  Call 3: 0.67 seconds
Testing filename generation caching (3 identical calls):
  Call 1: 0.37 seconds
  Call 2: 0.45 seconds
  Call 3: 1.40 seconds
Validating PDF files...
Validating: 100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 36.23it/s]
Validating:   0%|                                                                                              | 0/1 [00:00<?, ?it/s]
1	Valid
0	Invalid

Processing valid files...
*************
Starting: /Users/[your-username]/pdf-toolbox/data/grey_systems_analysis.pdf

Processing front matter pages 1 to 8
Page 1 analysis: text_length=98, image_ratio=0.29
Page 1 classified as mixed content
Page 1 determined as mixed
PDF Debug Information:
Number of fonts on page: 3
Successfully extracted text using dict method
Successfully extracted 15 words from page 1
Page 1 original OCR confidence: 37.65
Page 1 enhanced OCR confidence: 24.47
Using original image for page 1 (confidence: 37.65 > 24.47)
Saved extracted text to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_1_text.png.txt
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 1)
Successfully extracted structured markdown (87 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_1_structured.png.md
Detected hierarchy: Title='Grey Systems Analysis', Subtitle='Methods, Models and Applications'
Structured extraction for page 1: Title='Grey Systems Analysis', Subtitle='Methods, Models and Applications'
Page 2 analysis: text_length=332, image_ratio=0.00
Page 2 classified as text
Page 2 determined as text
PDF Debug Information:
Number of fonts on page: 2
Successfully extracted text using dict method
Successfully extracted 45 words from page 2
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 2)
Successfully extracted structured markdown (362 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_2_structured.png.md
Detected hierarchy: Title='Series Editors', Subtitle='Sifeng Liu, Institute of Grey Systems Studies, Nanjing University of Aeronautics'
Structured extraction for page 2: Title='Series Editors', Subtitle='Sifeng Liu, Institute of Grey Systems Studies, Nanjing University of Aeronautics'
Page 3 analysis: text_length=1315, image_ratio=0.00
Page 3 classified as text
Page 3 determined as text
PDF Debug Information:
Number of fonts on page: 2
Successfully extracted text using dict method
Successfully extracted 202 words from page 3
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 3)
Successfully extracted structured markdown (1359 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_3_structured.png.md
Detected hierarchy: Title='None', Subtitle='None'
Structured extraction for page 3: Title='None', Subtitle='None'
Page 4 analysis: text_length=77, image_ratio=0.00
Page 4 classified as text
Page 4 determined as text
PDF Debug Information:
Number of fonts on page: 1
Successfully extracted text using dict method
Successfully extracted 11 words from page 4
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 4)
Successfully extracted structured markdown (111 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_4_structured.png.md
Detected hierarchy: Title='None', Subtitle='None'
Structured extraction for page 4: Title='None', Subtitle='None'
Page 5 analysis: text_length=3471, image_ratio=0.00
Page 5 classified as text
Page 5 determined as text
PDF Debug Information:
Number of fonts on page: 2
Successfully extracted text using dict method
Successfully extracted 527 words from page 5
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 5)
Successfully extracted structured markdown (3589 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_5_structured.png.md
Detected hierarchy: Title='Open Access This book is licensed under the terms of the Creative Commons Attribution-', Subtitle='NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-ncnd/4.0/), which permits any noncommercial use, sharing, distribution and reproduction in any medium or'
Structured extraction for page 5: Title='Open Access This book is licensed under the terms of the Creative Commons Attribution-', Subtitle='NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-ncnd/4.0/), which permits any noncommercial use, sharing, distribution and reproduction in any medium or'
Page 6 analysis: text_length=2036, image_ratio=0.00
Page 6 classified as text
Page 6 determined as text
PDF Debug Information:
Number of fonts on page: 2
Successfully extracted text using dict method
Successfully extracted 329 words from page 6
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 6)
Successfully extracted structured markdown (2082 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_6_structured.png.md
Detected hierarchy: Title='None', Subtitle='None'
Structured extraction for page 6: Title='None', Subtitle='None'
Page 7 analysis: text_length=827, image_ratio=0.00
Page 7 classified as text
Page 7 determined as text
PDF Debug Information:
Number of fonts on page: 2
Successfully extracted text using dict method
Successfully extracted 133 words from page 7
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 7)
Successfully extracted structured markdown (897 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_7_structured.png.md
Detected hierarchy: Title='None', Subtitle='None'
Structured extraction for page 7: Title='None', Subtitle='None'
Page 8 analysis: text_length=2459, image_ratio=0.00
Page 8 classified as text
Page 8 determined as text
PDF Debug Information:
Number of fonts on page: 2
Successfully extracted text using dict method
Successfully extracted 390 words from page 8
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 8)
Successfully extracted structured markdown (2501 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_8_structured.png.md
Detected hierarchy: Title='None', Subtitle='None'
Structured extraction for page 8: Title='None', Subtitle='None'
Processing grey_systems_analysis.pdf: 100%|████████████████████████████████████████████████████████████████████| 8/8 [00:18<00:00,  2.33s/it]
Skipping back matter processing: mode=never
Combined structured data: Title='Grey Systems Analysis', Subtitle='Methods, Models and Applications'
Raw metadata from PDF:
[1]   creationDate: D:20241226162248+05'30'
[1]   format: PDF 1.4
[1]   modDate: D:20241226200022+05'30'
[1]   producer: Springer-i
Cleaned metadata before AI:
[1]   creationDate: D:20241226162248+05'30'
[1]   format: PDF 1.4
[1]   modDate: D:20241226200022+05'30'
[1]   producer: Springer-i
Extracted important identifiers from full OCR text: {'isbn': [{'value': 'ISBN 978-981-97-8726-5', 'context': 'ISSN 2731-4944 (electronic)\nSeries on Grey System\nISBN 978-981-97-8726-5\nISBN 978-981-97-8727-2 (eBook)\nhttps://doi.org/10'}, {'value': 'ISBN 978-981-97-8727-2', 'context': 'nic)\nSeries on Grey System\nISBN 978-981-97-8726-5\nISBN 978-981-97-8727-2 (eBook)\nhttps://doi.org/10.1007/978-981-97-8727-2'}, {'value': '978-981-97-8727-2', 'context': '978-981-97-8727-2 (eBook)\nhttps://doi.org/10.1007/978-981-97-8727-2\nThis work was made possible due to projects suppo'}], 'doi': [{'value': '10.1007/978-981-97-8727-2', 'context': '-5\nISBN 978-981-97-8727-2 (eBook)\nhttps://doi.org/10.1007/978-981-97-8727-2\nThis work was made possible due to projects suppo'}]}
Requesting metadata consolidation from AI
Cost calculation: 3909i + 719o tokens = $0.0010
Field 'title' value: type=<class 'str'>, value=Grey Systems Analysis
Field 'author' value: type=<class 'str'>, value=Sifeng Liu
Field 'publisher' value: type=<class 'str'>, value=Springer
Field 'year' value: type=<class 'str'>, value=2025
Field 'subtitle' value: type=<class 'str'>, value=Methods, Models and Applications
Field 'edition' value: type=<class 'str'>, value=2nd Ed.
Field 'doi' value: type=<class 'str'>, value=10.1007/978-981-97-8727-2
Field 'loc' value: type=<class 'str'>, value=null
About to process metadata response. parsed type=<class 'dict'>
Processing decisions: type=<class 'dict'>, value={'title': {'value': 'Grey Systems Analysis', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, 'author': {'value': 'Sifeng Liu', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, 'publisher': {'value': 'Springer', 'confidence': 'high', 'sources': ['original extracted metadata']}, 'year': {'value': '2025', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, 'subtitle': {'value': 'Methods, Models and Applications', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, 'edition': {'value': '2nd Ed.', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, 'isbn': [{'value': '978-981-97-8727-2', 'medium': 'eBook', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, {'value': '978-981-97-8726-5', 'medium': 'Hardcover', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}], 'doi': {'value': '10.1007/978-981-97-8727-2', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, 'loc': {'value': 'null', 'confidence': 'low', 'sources': []}}
Checking required field 'title'
Field 'title' data: type=<class 'dict'>, value={'value': 'Grey Systems Analysis', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}
Checking required field 'author'
Field 'author' data: type=<class 'dict'>, value={'value': 'Sifeng Liu', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}
Processing optional field 'publisher': type=<class 'dict'>
Processing optional field 'year': type=<class 'dict'>
Processing optional field 'subtitle': type=<class 'dict'>
Processing optional field 'edition': type=<class 'dict'>
Processing optional field 'isbn': type=<class 'list'>
Processing optional field 'doi': type=<class 'dict'>
Processing optional field 'loc': type=<class 'dict'>
About to clean metadata. processed_metadata type=<class 'dict'>
Metadata sources used:
  title: extracted text from traditional OCR methods (confidence: high)
  author: extracted text from traditional OCR methods (confidence: high)
  publisher: original extracted metadata (confidence: high)
  year: extracted text from traditional OCR methods (confidence: high)
  subtitle: extracted text from traditional OCR methods (confidence: high)
  edition: extracted text from traditional OCR methods (confidence: high)
  doi: extracted text from traditional OCR methods (confidence: high)
  loc:  (confidence: low)
Tokens: I: 3909 O: 719
Cost: $0.0010
Cleaning up filename
Has annotations: no

        Current: grey_systems_analysis.pdf
        Proposed: Grey Systems Analysis, Methods, Models and Applications, (Sifeng Liu), Springer, (2nd Ed.), (2025).pdf

  author: Sifeng Liu :: extracted text from traditional OCR methods
  doi: 10.1007/978-981-97-8727-2 :: extracted text from traditional OCR methods
  edition: 2nd Ed. :: extracted text from traditional OCR methods
  isbn: [{'value': 'grey_systems_analysis', 'medium': 'ebk', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, {'value': '9789819787265', 'medium': 'hbk', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}] :: extracted text from traditional OCR methods
  publisher: Springer :: original extracted metadata
  subtitle: Methods, Models and Applications :: extracted text from traditional OCR methods
  title: Grey Systems Analysis :: extracted text from traditional OCR methods
  year: 2025 :: extracted text from traditional OCR methods
8.3mb  First 8 of 419pg  {'text': 7, 'image': 0, 'mixed': 1, 'unknown': 0, 'error': 0}
Tokens: I: 3909 O: 719
$0.001018
0m 26s
Reasoning:
  title: The title was clearly identified in the extracted text, appearing prominently as 'Grey Systems Analysis'. No conflicts were found.
  author: The author was identified as 'Sifeng Liu' from the extracted text. No conflicts were present.
  publisher: The publisher 'Springer' was derived from the original extracted metadata. No conflicts were found.
  year: The year '2025' was identified from the extracted text, indicating the publication date of the second edition. No conflicts were present.
  subtitle: The subtitle 'Methods, Models and Applications' was clearly identified in the extracted text. No conflicts were found.
  edition: The edition '2nd Ed.' was identified in the extracted text, indicating it is the second edition. No conflicts were present.
  isbn: Two ISBNs were identified: '978-981-97-8727-2' for the eBook and '978-981-97-8726-5' for the hardcover edition. Both were confirmed from the extracted text.
  doi: The DOI '10.1007/978-981-97-8727-2' was found in the extracted text, confirming its validity.
  loc: No Library of Congress Control Number was found in the provided materials.
*************
Processing: 100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:26<00:00, 26.96s/it]

Summary:
	Valid: 1
	Invalid: 0
	Successful: 1
	Failed: 0
	With Annotations: 0
	Files with API Timeouts: 0

Total:
	0m 26s
	Tokens: Input: 3909, Output: 719
	$0.001018

Average per PDF:
	0m 26s
	Tokens: Input: 3909.00, Output: 719.00
	$0.001018
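
The cost figure in the summary can be sanity-checked from the token counts. The following is a minimal sketch, not the script's own calculation: the per-million-token rates are an assumption (the docs do not state the model or its pricing), chosen because they reproduce the totals logged for this run.

```python
# Assumed USD rates per one million tokens. These are NOT taken from the
# PDF Toolbox docs; they are illustrative values that happen to reproduce
# the logged totals for the sample run.
INPUT_RATE_PER_M = 0.15
OUTPUT_RATE_PER_M = 0.60


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a request from its token counts."""
    weighted = input_tokens * INPUT_RATE_PER_M + output_tokens * OUTPUT_RATE_PER_M
    return weighted / 1_000_000


# Token counts from the sample run: 3909 input, 719 output.
print(f"${estimate_cost(3909, 719):.6f}")
```

If your own totals drift far from an estimate like this, check whether the model or its pricing has changed before adjusting other settings.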

Log file, page images and extracted text

Each run writes a timestamped log file to [project-root]/logs/[your-run-timestamp]/, where logs is the directory you set for DIAGNOSTIC_FOLDER. Page images and extracted text for each PDF are saved in a [pdf-filename] subfolder next to the log. Review the per-page images and corresponding text files to get a sense of what the models are picking up and extracting.
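
One way to decide which pages deserve a closer look is to pull the per-page lines out of the log. This is a minimal sketch, assuming the log keeps the line shapes shown in the terminal output (for example "Page 1 analysis: text_length=98, image_ratio=0.29"). The function name summarize_pages and the confidence threshold of 50 are hypothetical choices for illustration, not part of the script.

```python
import re

# Patterns mirror the log lines shown in the terminal output; the real log
# format may differ between versions of the script.
ANALYSIS = re.compile(r"Page (\d+) analysis: text_length=(\d+), image_ratio=([\d.]+)")
CLASSIFIED = re.compile(r"Page (\d+) classified as (.+)")
OCR = re.compile(r"Page (\d+) (original|enhanced) OCR confidence: ([\d.]+)")


def summarize_pages(log_text: str) -> dict:
    """Collect text length, image ratio, page type and OCR confidence per page."""
    pages = {}
    for line in log_text.splitlines():
        m = ANALYSIS.search(line)
        if m:
            page = pages.setdefault(int(m.group(1)), {})
            page["text_length"] = int(m.group(2))
            page["image_ratio"] = float(m.group(3))
            continue
        m = CLASSIFIED.search(line)
        if m:
            pages.setdefault(int(m.group(1)), {})["kind"] = m.group(2).strip()
            continue
        m = OCR.search(line)
        if m:
            pages.setdefault(int(m.group(1)), {})["ocr_" + m.group(2)] = float(m.group(3))
    return pages


sample = """\
Page 1 analysis: text_length=98, image_ratio=0.29
Page 1 classified as mixed content
Page 1 original OCR confidence: 37.65
Page 1 enhanced OCR confidence: 24.47
Page 2 analysis: text_length=332, image_ratio=0.00
Page 2 classified as text
"""
pages = summarize_pages(sample)

# Hypothetical threshold: flag pages whose original OCR confidence is below 50.
low_confidence = [n for n, p in pages.items() if p.get("ocr_original", 100.0) < 50.0]
print(low_confidence)
```

Pages flagged this way are good candidates for opening the matching original and enhanced images side by side.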

Log file directory

([your-environment-name]) [14:04][your-username]@[your-machine]:~/pdf-toolbox/logs(dev)]$ tree .
.
└── 20250228_135149
    ├── 20250228_135149.log
    ├── grey_systems_analysis
    │   ├── grey_systems_analysis_page_1_enhanced.png
    │   ├── grey_systems_analysis_page_1_original.png
    │   ├── grey_systems_analysis_page_1_structured.png.md
    │   ├── grey_systems_analysis_page_1_text.png.txt
    │   ├── grey_systems_analysis_page_2_original.png
    │   ├── grey_systems_analysis_page_2_structured.png.md
    │   ├── grey_systems_analysis_page_2_text.png.txt
    │   ├── grey_systems_analysis_page_3_original.png
    │   ├── grey_systems_analysis_page_3_structured.png.md
    │   ├── grey_systems_analysis_page_3_text.png.txt
    │   ├── grey_systems_analysis_page_4_original.png
    │   ├── grey_systems_analysis_page_4_structured.png.md
    │   ├── grey_systems_analysis_page_4_text.png.txt
    │   ├── grey_systems_analysis_page_5_original.png
    │   ├── grey_systems_analysis_page_5_structured.png.md
    │   ├── grey_systems_analysis_page_5_text.png.txt
    │   ├── grey_systems_analysis_page_6_original.png
    │   ├── grey_systems_analysis_page_6_structured.png.md
    │   ├── grey_systems_analysis_page_6_text.png.txt
    │   ├── grey_systems_analysis_page_7_original.png
    │   ├── grey_systems_analysis_page_7_structured.png.md
    │   ├── grey_systems_analysis_page_7_text.png.txt
    │   ├── grey_systems_analysis_page_8_original.png
    │   ├── grey_systems_analysis_page_8_structured.png.md
    │   └── grey_systems_analysis_page_8_text.png.txt
    └── stats.txt

3 directories, 27 files
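
To review a run page by page, it can help to group the files in a PDF's folder by page number. The sketch below assumes the naming seen in the tree above, [pdf-filename]_page_[n]_[kind].[ext]. The helper names group_by_page and group_folder are hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path

# Naming inferred from the sample tree: <stem>_page_<n>_<kind>.<ext>
ARTIFACT = re.compile(r"_page_(\d+)_(original|enhanced|structured|text)\.")


def group_by_page(names):
    """Map page number -> sorted list of artifact kinds found for that page."""
    pages = defaultdict(list)
    for name in names:
        m = ARTIFACT.search(name)
        if m:
            pages[int(m.group(1))].append(m.group(2))
    return {page: sorted(kinds) for page, kinds in sorted(pages.items())}


def group_folder(folder):
    """Group the files found directly inside one per-PDF diagnostic folder."""
    return group_by_page(p.name for p in Path(folder).iterdir() if p.is_file())


names = [
    "grey_systems_analysis_page_1_enhanced.png",
    "grey_systems_analysis_page_1_original.png",
    "grey_systems_analysis_page_1_structured.png.md",
    "grey_systems_analysis_page_1_text.png.txt",
    "grey_systems_analysis_page_2_original.png",
    "grey_systems_analysis_page_2_structured.png.md",
    "grey_systems_analysis_page_2_text.png.txt",
]
print(group_by_page(names))
```

In the sample run only page 1, the one classified as mixed content, has an enhanced image, so a grouping like this quickly shows which pages went through enhancement.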

Read the sample output log Gist.

`stats.txt` Output

When given the --with-stats-export flag, the script writes a stats.txt file to the same directory as the log file.

stats.txt

Current:  grey_systems_analysis.pdf
Proposed: Grey Systems Analysis, Methods, Models and Applications, (Sifeng Liu), Springer, (2nd Ed.), (2025).pdf
Has Annotations: no
Tags: none
File CRC: 74140e38
Metadata: 70682d14

Summary:
	Valid: 1
	Invalid: 0
	Successful: 1
	Failed: 0
	With Annotations: 0
	Files with API Timeouts: 0

Total:
	0m 27s
	Tokens: Input: 3909, Output: 721
	$0.001019

Average per PDF:
	0m 27s
	Tokens: Input: 3909.00, Output: 721.00
	$0.001019
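
Because stats.txt lists each file's current and proposed name, it is easy to extract those pairs for review before renaming anything. A minimal sketch, assuming every entry has a "Current:" line followed by a "Proposed:" line as in the sample above. The function name parse_proposals is hypothetical.

```python
def parse_proposals(stats_text: str) -> list:
    """Return (current, proposed) filename pairs from stats.txt-style text."""
    pairs = []
    current = None
    for line in stats_text.splitlines():
        # Split on the first colon only, so colons inside values survive.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Current":
            current = value
        elif key == "Proposed" and current is not None:
            pairs.append((current, value))
            current = None
    return pairs


sample_stats = """\
Current:  grey_systems_analysis.pdf
Proposed: Grey Systems Analysis, Methods, Models and Applications, (Sifeng Liu), Springer, (2nd Ed.), (2025).pdf
Has Annotations: no
Tags: none

Summary:
\tValid: 1
\tInvalid: 0
"""
for old, new in parse_proposals(sample_stats):
    print(f"{old} -> {new}")
```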

Use `.runtime-ignore` folders to selectively run sets of files

@todo: Work in progress

Take the following [project-root]/data/ directory structure:

/data/
├── .runtime-ignore/
│   ├── character_sets/
│   │   ├── emoji_test/
│   │   │   └── Hello_World_🌍.pdf
│   │   └── non_latin/
│   │       ├── 量子コンピューティング入門_2024.pdf
│   │       └── AI_機械学習_Advanced_Topics__Tokyo_2024.pdf
│   │
│   ├── author_cases/
│   │   ├── multiple_authors/
│   │   │   ├── .runtime-ignore/
│   │   │   │   └── needs_review/
│   │   │   │       └── Three_Body_Problem_Cixin_Liu_Ken_Liu.pdf
│   │   │   └── A_Course_in_Combinatorics_and_Graphs_Simeon_Ball_Oriol_Serra.pdf
│   │   └── name_order/
│   │       └── Smith_John_vs_John_Smith.pdf
│   │
│   └── edition_cases/
│       └── multiple_editions/
│           ├── .runtime-ignore/
│           │   └── conflicting_editions/
│           │       ├── Linear_Algebra_Second_Edition.pdf
│           │       └── Linear_Algebra_2nd_Ed.pdf
│           ├── Linear_Algebra_2nd_ed.pdf
│           └── Linear_Algebra_3rd_ed.pdf
│
├── in_progress/
│   ├── .runtime-ignore/  # This entire subtree will be skipped
│   │   └── partial_ocr/
│   │       ├── page_1_to_50_done/
│   │       │   └── Large_Technical_Manual.pdf
│   │       └── needs_restart/
│   │           └── Failed_At_Page_127.pdf
│   └── ready_for_processing/
│       └── Next_Batch.pdf

.runtime-ignore folders provide a flexible way to organize and isolate problematic PDFs during testing: everything under a .runtime-ignore folder is skipped at run time. This example follows our Test Cases directory pattern, which makes it easy to move files between folders while troubleshooting without breaking the subtrees.
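
The skipping behavior can be pictured as a directory walk that never descends into a .runtime-ignore folder. This section is marked work in progress, so the following is only a sketch under that assumption, based on the "This entire subtree will be skipped" comment in the tree. The function name collect_pdfs is hypothetical and this is not the script's actual implementation.

```python
import os

IGNORE_DIR = ".runtime-ignore"


def collect_pdfs(root):
    """Return PDF paths under root, skipping every .runtime-ignore subtree."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d != IGNORE_DIR]
        for name in filenames:
            if name.lower().endswith(".pdf"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

With the example tree, a walk like this would return only in_progress/ready_for_processing/Next_Batch.pdf, since every other PDF sits below a .runtime-ignore folder.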
