Roadmap

High

Docs: Run installation from git clone to validate correct behavior
Implement WRITE_PDF_CHANGES to save generated metadata and new filenames, with warning and interactive terminal prompt to prevent accidental run and destructive behavior
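One possible shape for the confirmation guard, as a minimal sketch. `confirm_destructive_run` and its `ask` parameter are hypothetical names, not part of the current codebase; only `WRITE_PDF_CHANGES` comes from the roadmap.

```python
def confirm_destructive_run(write_pdf_changes: bool, ask=input) -> bool:
    """Return True only when writes are enabled AND the user types 'yes'."""
    if not write_pdf_changes:
        # Dry run: nothing is written, so no prompt is needed.
        return False
    print("WARNING: WRITE_PDF_CHANGES is enabled.")
    print("Metadata and filenames of the source PDFs will be changed.")
    answer = ask("Type 'yes' to continue: ")
    # Anything other than an explicit 'yes' aborts the write.
    return answer.strip().lower() == "yes"
```

Passing `ask` as a parameter keeps the prompt testable without a real terminal.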
On every run:
- Write config.py settings to logfile
- Display config.py settings to terminal, when VERBOSE_TERM = True
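A rough sketch of the per-run settings report, assuming config.py exposes its settings as UPPER_CASE module-level names. The helper names and the logger name are hypothetical, and a `SimpleNamespace` stands in for the real module.

```python
import logging
import types

logger = logging.getLogger("config_report")  # hypothetical logger name


def public_settings(config_module) -> dict:
    """Collect the UPPER_CASE names defined on a config module."""
    return {
        name: getattr(config_module, name)
        for name in dir(config_module)
        if name.isupper()
    }


def report_settings(config_module) -> dict:
    """Log every setting; echo to the terminal only when VERBOSE_TERM is True."""
    settings = public_settings(config_module)
    for name, value in sorted(settings.items()):
        logger.info("config %s = %r", name, value)
        if settings.get("VERBOSE_TERM", False):
            print(f"{name} = {value!r}")
    return settings


# Stand-in for `import config`; the values here are illustrative only.
config = types.SimpleNamespace(VERBOSE_TERM=True, WRITE_PDF_CHANGES=False)
report_settings(config)
```

If config.py ever holds secrets such as API keys, those values would need masking before they reach the log.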
Pass through metadata fields that are not part of our extraction to the destination files unchanged. For example, Series, Language, Pages, Dimensions, etc.
Adding PyMuPDF4LLM greatly improved results, particularly with title:subtitle splitting, but nearly doubled average processing time per file from ~15 to ~30 seconds. Next step: selectively process only the important matter pages to optimize speed
Enable pre-extraction validation scan to verify PDF integrity
Restore datetime parsing from prototype, set timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]
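For reference, a small sketch of the intended timestamp: `strftime` is called with parentheses, and the trailing `[:-3]` slice trims the six-digit microseconds from `%f` down to milliseconds. The function name `log_timestamp` is hypothetical.

```python
from datetime import datetime


def log_timestamp(now=None):
    """Format a datetime as 'YYYY-MM-DD HH:MM:SS,mmm' (millisecond precision)."""
    now = now or datetime.now()
    # %f yields 6 digits of microseconds; dropping the last 3 leaves milliseconds.
    return now.strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]


fixed = datetime(2024, 1, 2, 3, 4, 5, 678901)
print(log_timestamp(fixed))  # 2024-01-02 03:04:05,678
```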
Move bugs to GitHub Issues

Docs: extract the .runtime-ignore folder example on Evaluation page into a dedicated guide
Unify regex constants
Half-title: greater weight than cover page, as it’s structured text, not OCR
Script currently discards 1st edition. Make a sensible default by wrapping this behavior in a DISCARD_FIRST_EDITION config setting, defaulting to False.
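A sketch of how the flag could gate the behavior. `normalize_edition` and the set of strings treated as "first edition" are assumptions; the real extraction code may represent editions differently.

```python
DISCARD_FIRST_EDITION = False  # proposed default from this roadmap item

# Hypothetical spellings treated as "first edition" for this sketch.
FIRST_EDITION_FORMS = {"1", "1st", "first", "1st edition", "first edition"}


def normalize_edition(edition, discard_first=DISCARD_FIRST_EDITION):
    """Return the cleaned edition string, or None when it should be dropped."""
    if edition is None:
        return None
    text = edition.strip()
    if not text:
        return None
    if discard_first and text.lower() in FIRST_EDITION_FORMS:
        return None  # old behavior, now opt-in
    return text
```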

Customizable metadata template, e.g.:

[ISBN]- [Title], [Subtitle], [Author]- [Edition]- [ISBN].pdf
vs
[Title], [Subtitle], ([Author]), [Publisher], ([Edition]), ([Year]).pdf
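The bracketed templates above could be rendered with a simple substitution, sketched below. `render_filename` is a hypothetical helper, and the field names are taken from the examples.

```python
import re

PLACEHOLDER = re.compile(r"\[(\w+)\]")


def render_filename(template: str, fields: dict) -> str:
    """Replace each [Name] placeholder with the matching metadata value."""
    def substitute(match):
        # Unknown or missing fields render as an empty string.
        return str(fields.get(match.group(1), ""))
    return PLACEHOLDER.sub(substitute, template)


book = {"Title": "Dune", "Subtitle": "A Novel", "Author": "Frank Herbert"}
print(render_filename("[Title], [Subtitle], ([Author]).pdf", book))
```

A missing field leaves its separators behind (for example `, ,` or `()`), so a real implementation would likely need a cleanup pass.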

Other techniques for preprocessing images: https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md
Create PDF files for all test cases
Test 255 character limit. Might need to leave 4 characters for file extensions.
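A sketch of one way to enforce the limit while reserving room for the extension. It assumes the limit is counted in UTF-8 bytes, which holds on many filesystems but should be verified per platform; `truncate_filename` is a hypothetical helper.

```python
import os


def truncate_filename(name: str, limit: int = 255) -> str:
    """Shorten the stem so that stem + extension fits within `limit` bytes."""
    stem, ext = os.path.splitext(name)
    budget = limit - len(ext.encode("utf-8"))
    # Cut on bytes, then drop any multibyte character split by the cut.
    trimmed = stem.encode("utf-8")[:budget].decode("utf-8", errors="ignore")
    return trimmed + ext
```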
Refactor: Convert configuration variables in config.py to a class for type hinting and safety
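One option is a frozen dataclass, sketched here. The setting names come from this roadmap; the defaults and the `DATA_DIR` field are assumptions for illustration.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Config:
    """Typed, immutable settings; defaults here are illustrative only."""
    WRITE_PDF_CHANGES: bool = False
    VERBOSE_TERM: bool = True
    DISCARD_FIRST_EDITION: bool = False
    DATA_DIR: Path = Path("data")  # hypothetical field


config = Config(VERBOSE_TERM=False)
```

Freezing the class means an accidental assignment such as `config.WRITE_PDF_CHANGES = True` raises an error at runtime instead of silently enabling writes.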
Docs: Update Quickstart with instructions for Windows, Linux, other Unix-like systems
Docs: changelog.md in this format
Validate config.py settings on script start
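A minimal startup check might look like the following. The expected-type table is an assumption covering only settings named in this roadmap, and `validate_settings` is a hypothetical helper.

```python
# Hypothetical schema: setting name -> required type.
EXPECTED_TYPES = {
    "WRITE_PDF_CHANGES": bool,
    "VERBOSE_TERM": bool,
    "DISCARD_FIRST_EDITION": bool,
}


def validate_settings(settings: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in settings:
            errors.append(f"{name} is missing")
        elif not isinstance(settings[name], expected):
            actual = type(settings[name]).__name__
            errors.append(f"{name} must be {expected.__name__}, got {actual}")
    return errors
```

Collecting all problems before exiting lets the user fix config.py in one pass rather than one error at a time.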
Look into more accurate token estimation method which applies caching discounts where applicable.
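As a starting point, the arithmetic could be sketched as below. Every parameter is a placeholder: the price and the cached-token multiplier must come from the provider's current pricing, and `estimate_input_cost` is a hypothetical helper.

```python
def estimate_input_cost(total_tokens, cached_tokens, price_per_million, cached_multiplier):
    """Estimate input cost when cached tokens are billed at a reduced rate.

    cached_multiplier is the fraction of the normal price charged for
    cached tokens (for example 0.5 would mean half price).
    """
    uncached_tokens = total_tokens - cached_tokens
    per_token = price_per_million / 1_000_000
    return uncached_tokens * per_token + cached_tokens * per_token * cached_multiplier
```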
Write script to restore exported annotation files for:
- exact CRC matches
- identical books but slightly mismatched metadata, like macOS tags
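For the exact-match case, a CRC32 can be computed incrementally with the standard library, as sketched here. The helper names are hypothetical; near-matches with differing metadata would need a different comparison, since any byte change alters the checksum.

```python
import zlib


def crc32_of_chunks(chunks) -> int:
    """Fold an iterable of byte chunks into a single CRC32 value."""
    crc = 0
    for chunk in chunks:
        crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def file_crc32(path, chunk_size=1 << 20) -> int:
    """CRC32 of a file, read in 1 MiB chunks to keep memory use flat."""
    with open(path, "rb") as handle:
        return crc32_of_chunks(iter(lambda: handle.read(chunk_size), b""))
```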
Configurable data directory, instead of hardcoded [project-root]/data
Add support for alternative model providers:
- Metadata and filename generation: Gemini, Claude, Grok
- Vision analysis: AWS Bedrock, Google Gemini, hosted OSS VLMs
Refactor: Replace PyPDF2 with PyMuPDF for metadata extraction. No need for 2 separate libraries
Benchmark performance with various open-source VLMs. See References in Meta and Ollama
Docs: Themes:
- Light and dark check
- Dark mode images
Investigate https://github.com/orasik/parsevision for inspecting OCR results, ensuring correct text is extracted

Store multiple publishers
Better way to set “entire document” processing, other than setting front.max_pages to an arbitrarily large value
Refactor: Split WRITE_PDF_CHANGES into two flags, WRITE_METADATA and RENAME_FILES, since running them independently is sometimes desired
Refactor: src/doc to a better name, doc is too vague
Docs: Mintlify, OpenAI FA icon
Refactor: Standardize ‘folder’ and ‘directory’ terminology

How do we handle Unicode characters in filenames?
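One piece of the answer may be normalization: the same visible name can arrive in composed or decomposed form depending on the source system, so comparing or deduplicating names is safer after normalizing. A sketch, assuming NFC is the chosen form (that choice is open) and with `normalize_filename` as a hypothetical helper:

```python
import unicodedata


def normalize_filename(name: str) -> str:
    """Return the NFC (composed) form of a filename."""
    return unicodedata.normalize("NFC", name)


decomposed = "Cafe\u0301.pdf"   # 'e' followed by a combining acute accent
composed = normalize_filename(decomposed)
print(composed == "Caf\u00e9.pdf")  # True
```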
Investigate ZOOM_FACTOR, unsure it’s having an effect
Consider merging Title and Subtitle, might be more trouble than it’s worth keeping them separate
What to do about missing images when reading mdx files on GitHub?
Docs: Dedicated troubleshooting page https://mintlify.com/docs/api-playground/troubleshooting, or inline FAQ on each page?
Docs: Token costs: pricing table like https://docs.perplexity.ai/guides/usage-tiers
Docs: should we name the model pages by
- type - what they do, OCR, VLM, LLM generation
- provider - OpenAI, Ollama, Tesseract
- model name - gpt-4o, llama-3.2, LSTM
How to ignore log files from Cursor indexing, but still draggable into chat?
How to keep runtime-ignore filename in .gitignore in sync with config.py?