# Evaluation

> Analyzing script accuracy and performance

To get the best results, review the output of each run and adjust the
[configuration settings](/config/overview) to fit your specific collection.

The script produces four key outputs:

* **Terminal output** - Real-time progress per page, ending with a summary
* **Log file** - Records every step of the process, including AI model requests and
  responses
* **Page images and extracted text** - Original and enhanced images, plus extracted text as
  plain text files
* **`stats.txt`** - Easy-to-read summary of processed files with proposed filenames

### Terminal output

When given the `--verbose-output` argument, the script displays detailed, real-time
progress and a summary.

```bash Terminal output [expandable] theme={null}
PyMuPDF version: 1.25.2
*************
1	Total PDFs to process

~30 seconds and $0.001 per PDF with default settings...

*************
Pre-warming OpenAI schema cache...
Pre-warming metadata extraction schema...
Pre-warming filename generation call...
Schema cache pre-warming complete
Testing schema caching behavior...
Testing metadata schema caching (3 identical calls):
  Call 1: 0.87 seconds
  Call 2: 1.74 seconds
  Call 3: 0.67 seconds
Testing filename generation caching (3 identical calls):
  Call 1: 0.37 seconds
  Call 2: 0.45 seconds
  Call 3: 1.40 seconds
Validating PDF files...
Validating: 100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 36.23it/s]
Validating:   0%|                                                                                              | 0/1 [00:00<?, ?it/s]
1	Valid
0	Invalid

Processing valid files...
*************
Starting: /Users/[your-username]/pdf-toolbox/data/grey_systems_analysis.pdf

Processing front matter pages 1 to 8
Page 1 analysis: text_length=98, image_ratio=0.29
Page 1 classified as mixed content
Page 1 determined as mixed
PDF Debug Information:
Number of fonts on page: 3
Successfully extracted text using dict method
Successfully extracted 15 words from page 1
Page 1 original OCR confidence: 37.65
Page 1 enhanced OCR confidence: 24.47
Using original image for page 1 (confidence: 37.65 > 24.47)
Saved extracted text to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_1_text.png.txt
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 1)
Successfully extracted structured markdown (87 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_1_structured.png.md
Detected hierarchy: Title='Grey Systems Analysis', Subtitle='Methods, Models and Applications'
Structured extraction for page 1: Title='Grey Systems Analysis', Subtitle='Methods, Models and Applications'
Page 2 analysis: text_length=332, image_ratio=0.00
Page 2 classified as text
Page 2 determined as text
PDF Debug Information:
Number of fonts on page: 2
Successfully extracted text using dict method
Successfully extracted 45 words from page 2
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 2)
Successfully extracted structured markdown (362 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_2_structured.png.md
Detected hierarchy: Title='Series Editors', Subtitle='Sifeng Liu, Institute of Grey Systems Studies, Nanjing University of Aeronautics'
Structured extraction for page 2: Title='Series Editors', Subtitle='Sifeng Liu, Institute of Grey Systems Studies, Nanjing University of Aeronautics'
Page 3 analysis: text_length=1315, image_ratio=0.00
Page 3 classified as text
Page 3 determined as text
PDF Debug Information:
Number of fonts on page: 2
Successfully extracted text using dict method
Successfully extracted 202 words from page 3
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 3)
Successfully extracted structured markdown (1359 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_3_structured.png.md
Detected hierarchy: Title='None', Subtitle='None'
Structured extraction for page 3: Title='None', Subtitle='None'
Page 4 analysis: text_length=77, image_ratio=0.00
Page 4 classified as text
Page 4 determined as text
PDF Debug Information:
Number of fonts on page: 1
Successfully extracted text using dict method
Successfully extracted 11 words from page 4
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 4)
Successfully extracted structured markdown (111 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_4_structured.png.md
Detected hierarchy: Title='None', Subtitle='None'
Structured extraction for page 4: Title='None', Subtitle='None'
Page 5 analysis: text_length=3471, image_ratio=0.00
Page 5 classified as text
Page 5 determined as text
PDF Debug Information:
Number of fonts on page: 2
Successfully extracted text using dict method
Successfully extracted 527 words from page 5
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 5)
Successfully extracted structured markdown (3589 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_5_structured.png.md
Detected hierarchy: Title='Open Access This book is licensed under the terms of the Creative Commons Attribution-', Subtitle='NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-ncnd/4.0/), which permits any noncommercial use, sharing, distribution and reproduction in any medium or'
Structured extraction for page 5: Title='Open Access This book is licensed under the terms of the Creative Commons Attribution-', Subtitle='NonCommercial-NoDerivatives 4.0 International License (http://creativecommons.org/licenses/by-ncnd/4.0/), which permits any noncommercial use, sharing, distribution and reproduction in any medium or'
Page 6 analysis: text_length=2036, image_ratio=0.00
Page 6 classified as text
Page 6 determined as text
PDF Debug Information:
Number of fonts on page: 2
Successfully extracted text using dict method
Successfully extracted 329 words from page 6
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 6)
Successfully extracted structured markdown (2082 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_6_structured.png.md
Detected hierarchy: Title='None', Subtitle='None'
Structured extraction for page 6: Title='None', Subtitle='None'
Page 7 analysis: text_length=827, image_ratio=0.00
Page 7 classified as text
Page 7 determined as text
PDF Debug Information:
Number of fonts on page: 2
Successfully extracted text using dict method
Successfully extracted 133 words from page 7
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 7)
Successfully extracted structured markdown (897 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_7_structured.png.md
Detected hierarchy: Title='None', Subtitle='None'
Structured extraction for page 7: Title='None', Subtitle='None'
Page 8 analysis: text_length=2459, image_ratio=0.00
Page 8 classified as text
Page 8 determined as text
PDF Debug Information:
Number of fonts on page: 2
Successfully extracted text using dict method
Successfully extracted 390 words from page 8
Extracting structured text with PyMuPDF4LLM from grey_systems_analysis.pdf (page 8)
Successfully extracted structured markdown (2501 characters)
Saved structured markdown to: logs/20250228_143143/grey_systems_analysis/grey_systems_analysis_page_8_structured.png.md
Detected hierarchy: Title='None', Subtitle='None'
Structured extraction for page 8: Title='None', Subtitle='None'
Processing grey_systems_analysis.pdf: 100%|████████████████████████████████████████████████████████████████████| 8/8 [00:18<00:00,  2.33s/it]
Skipping back matter processing: mode=never
Combined structured data: Title='Grey Systems Analysis', Subtitle='Methods, Models and Applications'
Raw metadata from PDF:
[1]   creationDate: D:20241226162248+05'30'
[1]   format: PDF 1.4
[1]   modDate: D:20241226200022+05'30'
[1]   producer: Springer-i
Cleaned metadata before AI:
[1]   creationDate: D:20241226162248+05'30'
[1]   format: PDF 1.4
[1]   modDate: D:20241226200022+05'30'
[1]   producer: Springer-i
Extracted important identifiers from full OCR text: {'isbn': [{'value': 'ISBN 978-981-97-8726-5', 'context': 'ISSN 2731-4944 (electronic)\nSeries on Grey System\nISBN 978-981-97-8726-5\nISBN 978-981-97-8727-2 (eBook)\nhttps://doi.org/10'}, {'value': 'ISBN 978-981-97-8727-2', 'context': 'nic)\nSeries on Grey System\nISBN 978-981-97-8726-5\nISBN 978-981-97-8727-2 (eBook)\nhttps://doi.org/10.1007/978-981-97-8727-2'}, {'value': '978-981-97-8727-2', 'context': '978-981-97-8727-2 (eBook)\nhttps://doi.org/10.1007/978-981-97-8727-2\nThis work was made possible due to projects suppo'}], 'doi': [{'value': '10.1007/978-981-97-8727-2', 'context': '-5\nISBN 978-981-97-8727-2 (eBook)\nhttps://doi.org/10.1007/978-981-97-8727-2\nThis work was made possible due to projects suppo'}]}
Requesting metadata consolidation from AI
Cost calculation: 3909i + 719o tokens = $0.0010
Field 'title' value: type=<class 'str'>, value=Grey Systems Analysis
Field 'author' value: type=<class 'str'>, value=Sifeng Liu
Field 'publisher' value: type=<class 'str'>, value=Springer
Field 'year' value: type=<class 'str'>, value=2025
Field 'subtitle' value: type=<class 'str'>, value=Methods, Models and Applications
Field 'edition' value: type=<class 'str'>, value=2nd Ed.
Field 'doi' value: type=<class 'str'>, value=10.1007/978-981-97-8727-2
Field 'loc' value: type=<class 'str'>, value=null
About to process metadata response. parsed type=<class 'dict'>
Processing decisions: type=<class 'dict'>, value={'title': {'value': 'Grey Systems Analysis', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, 'author': {'value': 'Sifeng Liu', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, 'publisher': {'value': 'Springer', 'confidence': 'high', 'sources': ['original extracted metadata']}, 'year': {'value': '2025', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, 'subtitle': {'value': 'Methods, Models and Applications', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, 'edition': {'value': '2nd Ed.', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, 'isbn': [{'value': '978-981-97-8727-2', 'medium': 'eBook', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, {'value': '978-981-97-8726-5', 'medium': 'Hardcover', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}], 'doi': {'value': '10.1007/978-981-97-8727-2', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, 'loc': {'value': 'null', 'confidence': 'low', 'sources': []}}
Checking required field 'title'
Field 'title' data: type=<class 'dict'>, value={'value': 'Grey Systems Analysis', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}
Checking required field 'author'
Field 'author' data: type=<class 'dict'>, value={'value': 'Sifeng Liu', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}
Processing optional field 'publisher': type=<class 'dict'>
Processing optional field 'year': type=<class 'dict'>
Processing optional field 'subtitle': type=<class 'dict'>
Processing optional field 'edition': type=<class 'dict'>
Processing optional field 'isbn': type=<class 'list'>
Processing optional field 'doi': type=<class 'dict'>
Processing optional field 'loc': type=<class 'dict'>
About to clean metadata. processed_metadata type=<class 'dict'>
Metadata sources used:
  title: extracted text from traditional OCR methods (confidence: high)
  author: extracted text from traditional OCR methods (confidence: high)
  publisher: original extracted metadata (confidence: high)
  year: extracted text from traditional OCR methods (confidence: high)
  subtitle: extracted text from traditional OCR methods (confidence: high)
  edition: extracted text from traditional OCR methods (confidence: high)
  doi: extracted text from traditional OCR methods (confidence: high)
  loc:  (confidence: low)
Tokens: I: 3909 O: 719
Cost: $0.0010
Cleaning up filename
Has annotations: no

        Current: grey_systems_analysis.pdf
        Proposed: Grey Systems Analysis, Methods, Models and Applications, (Sifeng Liu), Springer, (2nd Ed.), (2025).pdf

  author: Sifeng Liu :: extracted text from traditional OCR methods
  doi: 10.1007/978-981-97-8727-2 :: extracted text from traditional OCR methods
  edition: 2nd Ed. :: extracted text from traditional OCR methods
  isbn: [{'value': '9789819787272', 'medium': 'ebk', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}, {'value': '9789819787265', 'medium': 'hbk', 'confidence': 'high', 'sources': ['extracted text from traditional OCR methods']}] :: extracted text from traditional OCR methods
  publisher: Springer :: original extracted metadata
  subtitle: Methods, Models and Applications :: extracted text from traditional OCR methods
  title: Grey Systems Analysis :: extracted text from traditional OCR methods
  year: 2025 :: extracted text from traditional OCR methods
8.3mb  First 8 of 419pg  {'text': 7, 'image': 0, 'mixed': 1, 'unknown': 0, 'error': 0}
Tokens: I: 3909 O: 719
$0.001018
0m 26s
Reasoning:
  title: The title was clearly identified in the extracted text, appearing prominently as 'Grey Systems Analysis'. No conflicts were found.
  author: The author was identified as 'Sifeng Liu' from the extracted text. No conflicts were present.
  publisher: The publisher 'Springer' was derived from the original extracted metadata. No conflicts were found.
  year: The year '2025' was identified from the extracted text, indicating the publication date of the second edition. No conflicts were present.
  subtitle: The subtitle 'Methods, Models and Applications' was clearly identified in the extracted text. No conflicts were found.
  edition: The edition '2nd Ed.' was identified in the extracted text, indicating it is the second edition. No conflicts were present.
  isbn: Two ISBNs were identified: '978-981-97-8727-2' for the eBook and '978-981-97-8726-5' for the hardcover edition. Both were confirmed from the extracted text.
  doi: The DOI '10.1007/978-981-97-8727-2' was found in the extracted text, confirming its validity.
  loc: No Library of Congress Control Number was found in the provided materials.
*************
Processing: 100%|██████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:26<00:00, 26.96s/it]

Summary:
	Valid: 1
	Invalid: 0
	Successful: 1
	Failed: 0
	With Annotations: 0
	Files with API Timeouts: 0

Total:
	0m 26s
	Tokens: Input: 3909, Output: 719
	$0.001018

Average per PDF:
	0m 26s
	Tokens: Input: 3909.00, Output: 719.00
	$0.001018
```
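The `Cost calculation` line in the output above can be checked by hand. The sketch below assumes rates of \$0.15 per million input tokens and \$0.60 per million output tokens. These rates are inferred from the sample figures, not read from the script, so substitute the pricing for the model you have configured. `estimate_cost` is a hypothetical helper name.

```python
# Assumed rates in USD per token, inferred from the sample run above.
# They are NOT read from the script's configuration.
INPUT_RATE = 0.15 / 1_000_000
OUTPUT_RATE = 0.60 / 1_000_000


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one metadata request."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE


if __name__ == "__main__":
    # Token counts from the sample run: 3909 input, 719 output.
    print(f"${estimate_cost(3909, 719):.6f}")
```

Under these assumed rates, the sample run's 3909 input and 719 output tokens come to roughly \$0.001018, which agrees with the figure in the summary.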

### Log file, page images and extracted text

You'll find a timestamped log file in `[project-root]/logs/[your-run-timestamp]/`, with page images and extracted text in a `[pdf-filename]` subfolder. Here, `logs` is the directory you set for <a href="/config/diagnostic#diagnostic-folder" class="param-field-link">DIAGNOSTIC\_FOLDER</a>.

Review the per-page images and corresponding text files to get a sense of what the models
are picking up and extracting.

```bash Log file directory theme={null}
([your-environment-name]) [14:04][your-username]@[your-machine]:~/pdf-toolbox/logs(dev)]$ tree .
.
└── 20250228_135149
    ├── 20250228_135149.log
    ├── grey_systems_analysis
    │   ├── grey_systems_analysis_page_1_enhanced.png
    │   ├── grey_systems_analysis_page_1_original.png
    │   ├── grey_systems_analysis_page_1_structured.png.md
    │   ├── grey_systems_analysis_page_1_text.png.txt
    │   ├── grey_systems_analysis_page_2_original.png
    │   ├── grey_systems_analysis_page_2_structured.png.md
    │   ├── grey_systems_analysis_page_2_text.png.txt
    │   ├── grey_systems_analysis_page_3_original.png
    │   ├── grey_systems_analysis_page_3_structured.png.md
    │   ├── grey_systems_analysis_page_3_text.png.txt
    │   ├── grey_systems_analysis_page_4_original.png
    │   ├── grey_systems_analysis_page_4_structured.png.md
    │   ├── grey_systems_analysis_page_4_text.png.txt
    │   ├── grey_systems_analysis_page_5_original.png
    │   ├── grey_systems_analysis_page_5_structured.png.md
    │   ├── grey_systems_analysis_page_5_text.png.txt
    │   ├── grey_systems_analysis_page_6_original.png
    │   ├── grey_systems_analysis_page_6_structured.png.md
    │   ├── grey_systems_analysis_page_6_text.png.txt
    │   ├── grey_systems_analysis_page_7_original.png
    │   ├── grey_systems_analysis_page_7_structured.png.md
    │   ├── grey_systems_analysis_page_7_text.png.txt
    │   ├── grey_systems_analysis_page_8_original.png
    │   ├── grey_systems_analysis_page_8_structured.png.md
    │   └── grey_systems_analysis_page_8_text.png.txt
    └── stats.txt

3 directories, 27 files
```

For a complete example, read the
[sample output log Gist](https://gist.github.com/lifeinchords/7f96e2728707c28d46f597ea65926a61).
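To speed up the review of per-page text files, a small helper can flag pages where very little text was extracted. This is a minimal sketch, assuming the `*_page_N_text.png.txt` naming shown in the directory listing. The function names and the 100-character threshold are illustrative choices, not part of the script.

```python
import re
from pathlib import Path

# Matches names such as "grey_systems_analysis_page_3_text.png.txt".
PAGE_TEXT_PATTERN = re.compile(r"_page_(\d+)_text\.png\.txt$")


def extracted_text_lengths(pdf_log_dir: Path) -> dict[int, int]:
    """Map page number to the character count of that page's extracted text."""
    lengths: dict[int, int] = {}
    for path in pdf_log_dir.iterdir():
        match = PAGE_TEXT_PATTERN.search(path.name)
        if match:
            text = path.read_text(encoding="utf-8", errors="replace")
            lengths[int(match.group(1))] = len(text.strip())
    return dict(sorted(lengths.items()))


def short_pages(pdf_log_dir: Path, min_chars: int = 100) -> list[int]:
    """Return page numbers whose extracted text is shorter than min_chars."""
    lengths = extracted_text_lengths(pdf_log_dir)
    return [page for page, count in lengths.items() if count < min_chars]
```

For example, `short_pages(Path("logs/20250228_135149/grey_systems_analysis"))` would list the pages worth comparing against their `_original.png` images first.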

### `stats.txt` output

When given the `--with-stats-export` flag, the script will write a `stats.txt` file to the
same directory as the log file.

```bash stats.txt theme={null}
Current:  grey_systems_analysis.pdf
Proposed: Grey Systems Analysis, Methods, Models and Applications, (Sifeng Liu), Springer, (2nd Ed.), (2025).pdf
Has Annotations: no
Tags: none
File CRC: 74140e38
Metadata: 70682d14

Summary:
	Valid: 1
	Invalid: 0
	Successful: 1
	Failed: 0
	With Annotations: 0
	Files with API Timeouts: 0

Total:
	0m 27s
	Tokens: Input: 3909, Output: 721
	$0.001019

Average per PDF:
	0m 27s
	Tokens: Input: 3909.00, Output: 721.00
	$0.001019

```
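If you want to review proposed filenames across many runs, the `Current:` / `Proposed:` lines can be pulled out programmatically. The sketch below assumes the layout shown in the sample above, with each `Current:` line followed by its `Proposed:` line. `parse_renames` is a hypothetical helper, not something the script provides.

```python
def parse_renames(stats_text: str) -> list[tuple[str, str]]:
    """Return (current, proposed) filename pairs found in stats.txt content."""
    pairs: list[tuple[str, str]] = []
    current = None
    for line in stats_text.splitlines():
        # Split on the first colon only, so filenames containing colons survive.
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        if key == "Current":
            current = value.strip()
        elif key == "Proposed" and current is not None:
            pairs.append((current, value.strip()))
            current = None
    return pairs


if __name__ == "__main__":
    sample = "Current:  grey_systems_analysis.pdf\nProposed: Grey Systems Analysis.pdf\n"
    for old, new in parse_renames(sample):
        print(f"{old} -> {new}")
```

Returning pairs rather than renaming files directly keeps the helper read-only, which suits a review step.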

### Use `.runtime-ignore` folders to selectively run sets of files

<Note>
  <b>@todo:</b> Work in progress
</Note>

Take the following `[project-root]/data/` directory structure:

```bash theme={null}
/data/
├── .runtime-ignore/
│   ├── character_sets/
│   │   ├── emoji_test/
│   │   │   └── Hello_World_🌍.pdf
│   │   └── non_latin/
│   │       ├── 量子コンピューティング入門_2024.pdf
│   │       └── AI_機械学習_Advanced_Topics__Tokyo_2024.pdf
│   │
│   ├── author_cases/
│   │   ├── multiple_authors/
│   │   │   ├── .runtime-ignore/
│   │   │   │   └── needs_review/
│   │   │   │       └── Three_Body_Problem_Cixin_Liu_Ken_Liu.pdf
│   │   │   └── A_Course_in_Combinatorics_and_Graphs_Simeon_Ball_Oriol_Serra.pdf
│   │   └── name_order/
│   │       └── Smith_John_vs_John_Smith.pdf
│   │
│   └── edition_cases/
│       └── multiple_editions/
│           ├── .runtime-ignore/
│           │   └── conflicting_editions/
│           │       ├── Linear_Algebra_Second_Edition.pdf
│           │       └── Linear_Algebra_2nd_Ed.pdf
│           ├── Linear_Algebra_2nd_ed.pdf
│           └── Linear_Algebra_3rd_ed.pdf
│
├── in_progress/
│   ├── .runtime-ignore/  # This entire subtree will be skipped
│   │   └── partial_ocr/
│   │       ├── page_1_to_50_done/
│   │       │   └── Large_Technical_Manual.pdf
│   │       └── needs_restart/
│   │           └── Failed_At_Page_127.pdf
│   └── ready_for_processing/
│       └── Next_Batch.pdf
```

<a href="/config/diagnostic#ignoring-directories" class="param-field-link">.runtime-ignore</a>
folders provide a flexible way to organize and isolate problematic PDFs during testing:
the script skips each one along with its entire subtree.
This example follows our [Test Cases](/analysis-and-iteration/test-cases) directory pattern,
so you can move files between folders while troubleshooting without breaking the subtrees.
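To preview which files a run would pick up, the skipping rule can be approximated with a short directory walk. This is a sketch under the assumption that any folder named exactly `.runtime-ignore` is skipped together with its subtree; the script's actual matching logic may differ. `find_pdfs` is a hypothetical helper.

```python
import os
from pathlib import Path

IGNORE_DIR_NAME = ".runtime-ignore"


def find_pdfs(data_dir: Path) -> list[Path]:
    """Collect PDFs under data_dir, skipping every .runtime-ignore subtree."""
    pdfs: list[Path] = []
    for root, dirs, files in os.walk(data_dir):
        # Pruning dirs in place stops os.walk from descending into them.
        dirs[:] = sorted(d for d in dirs if d != IGNORE_DIR_NAME)
        for name in sorted(files):
            if name.lower().endswith(".pdf"):
                pdfs.append(Path(root) / name)
    return pdfs
```

Applied to the example tree, this rule would leave only `in_progress/ready_for_processing/Next_Batch.pdf`, because every other PDF sits beneath a `.runtime-ignore` folder.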
