Analyzing script accuracy and performance
stats.txt - Easy-to-read summary of processed files with proposed filenames
With the --verbose-output argument, the script displays detailed, real-time progress and a summary.
PyMuPDF version: 1.25.2
*************
1 Total PDFs to process
~30 seconds and $0.001 per PDF with default settings...
*************
Pre-warming OpenAI schema cache...
Pre-warming metadata extraction schema...
Pre-warming filename generation call...
Schema cache pre-warming complete
Testing schema caching behavior...
Testing metadata schema caching (3 identical calls):
Call 1: 0.87 seconds
Call 2: 1.74 seconds
Call 3: 0.67 seconds
Testing filename generation caching (3 identical calls):
Call 1: 0.37 seconds
Call 2: 0.45 seconds
Call 3: 1.40 seconds
Validating PDF files...
Validating: 100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 36.23it/s]
Validating: 0%| | 0/1 [00:00<?, ?it/s]
1 Valid
0 Invalid
Processing valid files...
*************
Starting: /Users/[your-username]/pdf-toolbox/data/grey_systems_analysis.pdf
Processing front matter pages 1 to 8
Page 1 analysis: text_length=98, image_ratio=0.29
Page 1 classified as mixed content
Page 1 determined as mixed
PDF Debug Information:
Number of fonts on page: 3
Successfully extracted text using dict method
Successfully extracted 15 words from page 1
Page 1 original OCR confidence: 37.65
Page 1 enhanced OCR confidence: 24.47
Using original image for page 1 (confidence: 37.65 > 24.47)
Saved extracted text to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_1_text.png.txt
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 1)
Successfully extracted structured markdown (87 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_1_structured.png.md
Detected hierarchy: Title='Grey Systems Analysis', Subtitle='Methods, Models and Applications'
Structured extraction for page 1: Title='Grey Systems Analysis', Subtitle='Methods, Models and Applications'
Page 2 analysis: text_length=332, image_ratio=0.00
Page 2 classified as text
Page 2 determined as text
PDF Debug Information:
Number of fonts on page: 2
Successfully extracted text using dict method
Successfully extracted 45 words from page 2
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 2)
Successfully extracted structured markdown (362 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_2_structured.png.md
Detected hierarchy: Title='Series Editors', Subtitle='Sifeng Liu, Institute of Grey Systems Studies, Nanjing University of Aeronautics'
Structured extraction for page 2: Title='Series Editors', Subtitle='Sifeng Liu, Institute of Grey Systems Studies, Nanjing University of Aeronautics'
Page 3 analysis: text_length=1315, image_ratio=0.00
Page 3 classified as text
Page 3 determined as text
PDF Debug Information:
Number of fonts on page: 2
Successfully extracted text using dict method
Successfully extracted 202 words from page 3
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 3)
Successfully extracted structured markdown (1359 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_3_structured.png.md
Detected hierarchy: Title='None', Subtitle='None'
Structured extraction for page 3: Title='None', Subtitle='None'
Page 4 analysis: text_length=77, image_ratio=0.00
Page 4 classified as text
Page 4 determined as text
PDF Debug Information:
Number of fonts on page: 1
Successfully extracted text using dict method
Successfully extracted 11 words from page 4
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 4)
Successfully extracted structured markdown (111 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_4_structured.png.md
Detected hierarchy: Title='None', Subtitle='None'
Structured extraction for page 4: Title='None', Subtitle='None'
Page 5 analysis: text_length=3471, image_ratio=0.00
Page 5 classified as text
Page 5 determined as text
PDF Debug Information:
Number of fonts on page: 2
Successfully extracted text using dict method
Successfully extracted 527 words from page 5
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 5)
Successfully extracted structured markdown (3589 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_5_structured.png.md
Detected hierarchy: Title='Open Access This book is licensed under the terms of the Creative Commons Attribution-', Subtitle='NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-ncnd/4.0/), which permits any noncommercial use, sharing, distribution and reproduction in any medium or'
Structured extraction for page 5: Title='Open Access This book is licensed under the terms of the Creative Commons Attribution-', Subtitle='NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-ncnd/4.0/), which permits any noncommercial use, sharing, distribution and reproduction in any medium or'
Page 6 analysis: text_length=2036, image_ratio=0.00
Page 6 classified as text
Page 6 determined as text
PDF Debug Information:
Number of fonts on page: 2
Successfully extracted text using dict method
Successfully extracted 329 words from page 6
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 6)
Successfully extracted structured markdown (2082 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_6_structured.png.md
Detected hierarchy: Title='None', Subtitle='None'
Structured extraction for page 6: Title='None', Subtitle='None'
Page 7 analysis: text_length=827, image_ratio=0.00
Page 7 classified as text
Page 7 determined as text
PDF Debug Information:
Number of fonts on page: 2
Successfully extracted text using dict method
Successfully extracted 133 words from page 7
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 7)
Successfully extracted structured markdown (897 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_7_structured.png.md
Detected hierarchy: Title='None', Subtitle='None'
Structured extraction for page 7: Title='None', Subtitle='None'
Page 8 analysis: text_length=2459, image_ratio=0.00
Page 8 classified as text
Page 8 determined as text
PDF Debug Information:
Number of fonts on page: 2
Successfully extracted text using dict method
Successfully extracted 390 words from page 8
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 8)
Successfully extracted structured markdown (2501 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_8_structured.png.md
Detected hierarchy: Title='None', Subtitle='None'
Structured extraction for page 8: Title='None', Subtitle='None'
Processing grey_systems_analysis.pdf: 100%|████████████████████████████████████████████████████████████████████| 8/8 [00:18<00:00, 2.33s/it]
Skipping back matter processing: mode=never
Combined structured data: Title='Grey Systems Analysis', Subtitle='Methods, Models and Applications'
Raw metadata from PDF:
[1] creationDate: D:20241226162248+05'30'
[1] format: PDF 1.4
[1] modDate: D:20241226200022+05'30'
[1] producer: Springer-i
Cleaned metadata before AI:
[1] creationDate: D:20241226162248+05'30'
[1] format: PDF 1.4
[1] modDate: D:20241226200022+05'30'
[1] producer: Springer-i
Extracted important identifiers from full OCR text: {'isbn': [{'value': 'ISBN 978-981-97-8726-5', 'context': 'ISSN 2731-4944 (electronic)\nSeries on Grey System\nISBN 978-981-97-8726-5\nISBN 978-981-97-8727-2 (eBook)\nhttps://doi.org/10'}, {'value': 'ISBN 978-981-97-8727-2', 'context': 'nic)\nSeries on Grey System\nISBN 978-981-97-8726-5\nISBN 978-981-97-8727-2 (eBook)\nhttps://doi.org/10.1007/978-981-97-8727-2'}, {'value': '978-981-97-8727-2', 'context': '978-981-97-8727-2 (eBook)\nhttps://doi.org/10.1007/978-981-97-8727-2\nThis work was made possible due to projects suppo'}], 'doi': [{'value': '10.1007/978-981-97-8727-2', 'context': '-5\nISBN 978-981-97-8727-2 (eBook)\nhttps://doi.org/10.1007/978-981-97-8727-2\nThis work was made possible due to projects suppo'}]}
Requesting metadata consolidation from AI
Cost calculation: 3909i + 719o tokens = $0.0010
Field 'title' value: type=<class 'str'>, value=Grey Systems Analysis
Field 'author' value: type=<class 'str'>, value=Sifeng Liu
Field 'publisher' value: type=<class 'str'>, value=Springer
Field 'year' value: type=<class 'str'>, value=2025
Field 'subtitle' value: type=<class 'str'>, value=Methods, Models and Applications
Field 'edition' value: type=<class 'str'>, value=2nd Ed.
Field 'doi' value: type=<class 'str'>, value=10.1007/978-981-97-8727-2
Field 'loc' value: type=<class 'str'>, value=null
About to process metadata response. parsed type=<class 'dict'>
Processing decisions: type=<class 'dict'>, value={'title': {'value': 'Grey Systems Analysis', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, 'author': {'value': 'Sifeng Liu', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, 'publisher': {'value': 'Springer', 'confidence': 'high', 'sources': ['original extracted metadata']}, 'year': {'value': '2025', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, 'subtitle': {'value': 'Methods, Models and Applications', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, 'edition': {'value': '2nd Ed.', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, 'isbn': [{'value': '978-981-97-8727-2', 'medium': 'eBook', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, {'value': '978-981-97-8726-5', 'medium': 'Hardcover', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}], 'doi': {'value': '10.1007/978-981-97-8727-2', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, 'loc': {'value': 'null', 'confidence': 'low', 'sources': []}}
Checking required field 'title'
Field 'title' data: type=<class 'dict'>, value={'value': 'Grey Systems Analysis', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}
Checking required field 'author'
Field 'author' data: type=<class 'dict'>, value={'value': 'Sifeng Liu', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}
Processing optional field 'publisher': type=<class 'dict'>
Processing optional field 'year': type=<class 'dict'>
Processing optional field 'subtitle': type=<class 'dict'>
Processing optional field 'edition': type=<class 'dict'>
Processing optional field 'isbn': type=<class 'list'>
Processing optional field 'doi': type=<class 'dict'>
Processing optional field 'loc': type=<class 'dict'>
About to clean metadata. processed_metadata type=<class 'dict'>
Metadata sources used:
title: extracted text from traditional OCR methods (confidence: high)
author: extracted text from traditional OCR methods (confidence: high)
publisher: original extracted metadata (confidence: high)
year: extracted text from traditional OCR methods (confidence: high)
subtitle: extracted text from traditional OCR methods (confidence: high)
edition: extracted text from traditional OCR methods (confidence: high)
doi: extracted text from traditional OCR methods (confidence: high)
loc: (confidence: low)
Tokens: I: 3909 O: 719
Cost: $0.0010
Cleaning up filename
Has annotations: no
Current: grey_systems_analysis.pdf
Proposed: Grey Systems Analysis, Methods, Models and Applications, (Sifeng Liu), Springer, (2nd Ed.), (2025).pdf
author: Sifeng Liu :: extracted text from traditional OCR methods
doi: 10.1007/978-981-97-8727-2 :: extracted text from traditional OCR methods
edition: 2nd Ed. :: extracted text from traditional OCR methods
isbn: [{'value': '9789819787272', 'medium': 'ebk', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, {'value': '9789819787265', 'medium': 'hbk', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}] :: extracted text from traditional OCR methods
publisher: Springer :: original extracted metadata
subtitle: Methods, Models and Applications :: extracted text from traditional OCR methods
title: Grey Systems Analysis :: extracted text from traditional OCR methods
year: 2025 :: extracted text from traditional OCR methods
8.3mb First 8 of 419pg {'text': 7, 'image': 0, 'mixed': 1, 'unknown': 0, 'error': 0}
Tokens: I: 3909 O: 719
$0.001018
0m 26s
Reasoning:
title: The title was clearly identified in the extracted text, appearing prominently as 'Grey Systems Analysis'. No conflicts were found.
author: The author was identified as 'Sifeng Liu' from the extracted text. No conflicts were present.
publisher: The publisher 'Springer' was derived from the original extracted metadata. No conflicts were found.
year: The year '2025' was identified from the extracted text, indicating the publication date of the second edition. No conflicts were present.
subtitle: The subtitle 'Methods, Models and Applications' was clearly identified in the extracted text. No conflicts were found.
edition: The edition '2nd Ed.' was identified in the extracted text, indicating it is the second edition. No conflicts were present.
isbn: Two ISBNs were identified: '978-981-97-8727-2' for the eBook and '978-981-97-8726-5' for the hardcover edition. Both were confirmed from the extracted text.
doi: The DOI '10.1007/978-981-97-8727-2' was found in the extracted text, confirming its validity.
loc: No Library of Congress Control Number was found in the provided materials.
*************
Processing: 100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:26<00:00, 26.96s/it]
Summary:
Valid: 1
Invalid: 0
Successful: 1
Failed: 0
With Annotations: 0
Files with API Timeouts: 0
Total:
0m 26s
Tokens: Input: 3909, Output: 719
$0.001018
Average per PDF:
0m 26s
Tokens: Input: 3909.00, Output: 719.00
$0.001018
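The cost figures in the log above can be checked by hand from the token counts. The sketch below assumes rates of $0.15 per million input tokens and $0.60 per million output tokens. These rates are an assumption, inferred only because they reproduce the logged total of $0.001018 for 3909 input and 719 output tokens; `estimate_cost` is a hypothetical helper, not the script's own function.

```python
# Assumed rates in USD per 1,000,000 tokens. They are inferred from the
# logged totals, not taken from the script's configuration.
INPUT_RATE_PER_M = 0.15
OUTPUT_RATE_PER_M = 0.60


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated API cost in USD for one request."""
    total = input_tokens * INPUT_RATE_PER_M + output_tokens * OUTPUT_RATE_PER_M
    return total / 1_000_000


if __name__ == "__main__":
    # Token counts taken from the verbose log: "3909i + 719o tokens".
    print(f"${estimate_cost(3909, 719):.6f}")
```

If the rates change, only the two constants need updating; the per-PDF average in the summary is then this value averaged over all processed files.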
Diagnostic files for each PDF are saved to [project-root]/logs/[your-run-timestamp]/[pdf-filename], where logs is the directory you set for DIAGNOSTIC_FOLDER.
Review the per-page images and corresponding text files to get a sense of what the models are picking up and extracting.
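As a rough illustration of that layout, the folder for one PDF could be derived as shown below. This is a sketch only: `run_folder` is a hypothetical helper, and the timestamp format is assumed from folder names such as 20250228_143143 that appear in the log output.

```python
from datetime import datetime
from pathlib import Path


def run_folder(diagnostic_folder: str, pdf_path: str, started: datetime) -> Path:
    """Build <diagnostic_folder>/<run-timestamp>/<pdf name without extension>."""
    stamp = started.strftime("%Y%m%d_%H%M%S")  # assumed format, e.g. 20250228_143143
    return Path(diagnostic_folder) / stamp / Path(pdf_path).stem


if __name__ == "__main__":
    folder = run_folder(
        "logs",
        "data/grey_systems_analysis.pdf",
        datetime(2025, 2, 28, 14, 31, 43),
    )
    print(folder)
```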
([your-environment-name]) [14:04][your-username]@[your-machine]:~/pdf-toolbox/logs(dev)]$ tree .
.
└── 20250228_135149
    ├── 20250228_135149.log
    ├── grey_systems_analysis
    │   ├── grey_systems_analysis_page_1_enhanced.png
    │   ├── grey_systems_analysis_page_1_original.png
    │   ├── grey_systems_analysis_page_1_structured.png.md
    │   ├── grey_systems_analysis_page_1_text.png.txt
    │   ├── grey_systems_analysis_page_2_original.png
    │   ├── grey_systems_analysis_page_2_structured.png.md
    │   ├── grey_systems_analysis_page_2_text.png.txt
    │   ├── grey_systems_analysis_page_3_original.png
    │   ├── grey_systems_analysis_page_3_structured.png.md
    │   ├── grey_systems_analysis_page_3_text.png.txt
    │   ├── grey_systems_analysis_page_4_original.png
    │   ├── grey_systems_analysis_page_4_structured.png.md
    │   ├── grey_systems_analysis_page_4_text.png.txt
    │   ├── grey_systems_analysis_page_5_original.png
    │   ├── grey_systems_analysis_page_5_structured.png.md
    │   ├── grey_systems_analysis_page_5_text.png.txt
    │   ├── grey_systems_analysis_page_6_original.png
    │   ├── grey_systems_analysis_page_6_structured.png.md
    │   ├── grey_systems_analysis_page_6_text.png.txt
    │   ├── grey_systems_analysis_page_7_original.png
    │   ├── grey_systems_analysis_page_7_structured.png.md
    │   ├── grey_systems_analysis_page_7_text.png.txt
    │   ├── grey_systems_analysis_page_8_original.png
    │   ├── grey_systems_analysis_page_8_structured.png.md
    │   └── grey_systems_analysis_page_8_text.png.txt
    └── stats.txt
3 directories, 27 files
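To review pages one at a time, the files in a per-PDF folder can be grouped by page number. The following is a minimal sketch, assuming the `<pdf-name>_page_<n>_<kind>` naming seen in the listing above; `group_by_page` is a hypothetical helper and not part of the script.

```python
import re
from collections import defaultdict

# Naming pattern assumed from the tree listing:
#   <pdf-name>_page_<n>_<kind>.<extension(s)>
PAGE_FILE = re.compile(r"^(?P<stem>.+)_page_(?P<page>\d+)_(?P<kind>[a-z]+)\.(?P<ext>.+)$")


def group_by_page(filenames):
    """Map page number -> {kind: filename}, ignoring files that do not match."""
    pages = defaultdict(dict)
    for name in filenames:
        match = PAGE_FILE.match(name)
        if match:
            pages[int(match["page"])][match["kind"]] = name
    # Sort by page number so pages are reviewed in reading order.
    return dict(sorted(pages.items()))


if __name__ == "__main__":
    sample = [
        "grey_systems_analysis_page_1_enhanced.png",
        "grey_systems_analysis_page_1_original.png",
        "grey_systems_analysis_page_1_structured.png.md",
        "grey_systems_analysis_page_1_text.png.txt",
    ]
    for page, files in group_by_page(sample).items():
        print(page, sorted(files))
```

For a real run, the same function can be fed `p.name for p in pathlib.Path(folder).iterdir()`, where `folder` is the per-PDF directory.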
stats.txt Output
With the --with-stats-export flag, the script writes a stats.txt file to the same directory as the log file.
Current: grey_systems_analysis.pdf
Proposed: Grey Systems Analysis, Methods, Models and Applications, (Sifeng Liu), Springer, (2nd Ed.), (2025).pdf
Has Annotations: no
Tags: none
File CRC: 74140e38
Metadata: 70682d14
Summary:
Valid: 1
Invalid: 0
Successful: 1
Failed: 0
With Annotations: 0
Files with API Timeouts: 0
Total:
0m 27s
Tokens: Input: 3909, Output: 721
$0.001019
Average per PDF:
0m 27s
Tokens: Input: 3909.00, Output: 721.00
$0.001019
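The Proposed name in the sample above follows a visible pattern: title, subtitle, (author), publisher, (edition), (year). The sketch below rebuilds that pattern from a metadata dictionary. The field order and the use of parentheses are inferred from this single example, and `build_filename` is a hypothetical helper, so the script's real rules (for instance, handling characters that are illegal in filenames) may differ.

```python
# (field name, wrap in parentheses) in the order seen in the sample output.
# This layout is an assumption drawn from one "Proposed" line.
LAYOUT = [
    ("title", False),
    ("subtitle", False),
    ("author", True),
    ("publisher", False),
    ("edition", True),
    ("year", True),
]


def build_filename(meta: dict) -> str:
    """Join the available metadata fields into a proposed PDF filename."""
    parts = []
    for field, parens in LAYOUT:
        value = meta.get(field)
        # Skip fields that are missing or reported as the string 'null'.
        if not value or value == "null":
            continue
        parts.append(f"({value})" if parens else value)
    return ", ".join(parts) + ".pdf"


if __name__ == "__main__":
    meta = {
        "title": "Grey Systems Analysis",
        "subtitle": "Methods, Models and Applications",
        "author": "Sifeng Liu",
        "publisher": "Springer",
        "edition": "2nd Ed.",
        "year": "2025",
        "loc": "null",
    }
    print(build_filename(meta))
```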
Use .runtime-ignore folders to selectively run sets of files
Example [project-root]/data/ directory structure:
/data/
├── .runtime-ignore/
│   ├── character_sets/
│   │   ├── emoji_test/
│   │   │   └── Hello_World_🌍.pdf
│   │   └── non_latin/
│   │       ├── 量子コンピューティング入門_2024.pdf
│   │       └── AI_機械学習_Advanced_Topics__Tokyo_2024.pdf
│   │
│   ├── author_cases/
│   │   ├── multiple_authors/
│   │   │   ├── .runtime-ignore/
│   │   │   │   └── needs_review/
│   │   │   │       └── Three_Body_Problem_Cixin_Liu_Ken_Liu.pdf
│   │   │   └── A_Course_in_Combinatorics_and_Graphs_Simeon_Ball_Oriol_Serra.pdf
│   │   └── name_order/
│   │       └── Smith_John_vs_John_Smith.pdf
│   │
│   └── edition_cases/
│       └── multiple_editions/
│           ├── .runtime-ignore/
│           │   └── conflicting_editions/
│           │       ├── Linear_Algebra_Second_Edition.pdf
│           │       └── Linear_Algebra_2nd_Ed.pdf
│           ├── Linear_Algebra_2nd_ed.pdf
│           └── Linear_Algebra_3rd_ed.pdf
│
├── in_progress/
│   ├── .runtime-ignore/    # This entire subtree will be skipped
│   │   └── partial_ocr/
│   │       ├── page_1_to_50_done/
│   │       │   └── Large_Technical_Manual.pdf
│   │       └── needs_restart/
│   │           └── Failed_At_Page_127.pdf
│   └── ready_for_processing/
│       └── Next_Batch.pdf