High

  • Docs: Run installation from git clone to validate correct behavior

  • Implement WRITE_PDF_CHANGES to save generated metadata and new filenames, with warning and interactive terminal prompt to prevent accidental run and destructive behavior
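
The warning and prompt described above could look roughly like this. It is a minimal sketch: `confirm_write`, its parameters, and the exact wording are hypothetical, and only the `WRITE_PDF_CHANGES` name comes from this roadmap.

```python
import sys

WRITE_PDF_CHANGES = True  # assumed to mirror the flag in config.py


def confirm_write(file_count, ask=input, is_tty=None):
    """Return True only when a user at a terminal explicitly types 'yes'."""
    if is_tty is None:
        is_tty = sys.stdin.isatty()
    if not is_tty:
        # Nobody can answer the prompt (cron job, pipe), so refuse to write.
        return False
    print(
        "WARNING: WRITE_PDF_CHANGES is enabled. "
        f"{file_count} file(s) will be renamed and have their metadata overwritten."
    )
    answer = ask("Type 'yes' to continue: ")
    return answer.strip().lower() == "yes"
```

Requiring the full word `yes` rather than a bare Enter keeps an accidental keypress from triggering a destructive run.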

  • On every run:

    • Write config.py settings to logfile
    • Display config.py settings to terminal, when VERBOSE_TERM = True
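
One way to report the settings on every run, assuming config.py keeps its settings as module-level UPPER_CASE names. `config_items` and `report_config` are hypothetical helpers, not existing project functions.

```python
import logging
import types


def config_items(config):
    """Yield (name, value) for every UPPER_CASE attribute of a config module."""
    for name in sorted(vars(config)):
        if name.isupper():
            yield name, getattr(config, name)


def report_config(config, logger, verbose_term=False):
    """Write each setting to the logfile and echo it to the terminal on request."""
    for name, value in config_items(config):
        line = f"{name} = {value!r}"
        logger.info(line)
        if verbose_term:
            print(line)


# Stand-in for `import config`; a real run would pass the imported module.
example = types.SimpleNamespace(VERBOSE_TERM=True, WRITE_PDF_CHANGES=False, helper=1)
report_config(example, logging.getLogger("run"), verbose_term=example.VERBOSE_TERM)
```

If the logger already has a terminal handler attached, the `print` call would duplicate output and should be dropped.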
  • Pass through metadata fields that are not part of our extraction to the destination files, for example Series, Language, Pages, and Dimensions.
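
A sketch of the pass-through as a plain dictionary merge. `merge_metadata` is a hypothetical helper, and the field names in the usage note are illustrative.

```python
def merge_metadata(original, extracted):
    """Keep every original field, overriding only what extraction produced."""
    merged = dict(original)
    for key, value in extracted.items():
        # Empty extraction results must not erase an existing value.
        if value not in (None, ""):
            merged[key] = value
    return merged
```

With `original = {"Title": "old", "Series": "X"}` and `extracted = {"Title": "New"}`, the Series field survives while Title is replaced.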

  • Adding PyMuPDF4LLM greatly improved results, particularly for title:subtitle splitting, but nearly doubled the average processing time per file, from ~15 to ~30 seconds. We now need to selectively process only the important matter pages to optimize speed
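
Selective processing could start from a helper that picks which page indices to convert. This is a sketch: `matter_page_indices` and its defaults are hypothetical, and the resulting list is meant to be handed to whatever page-selection option the extraction call accepts.

```python
def matter_page_indices(page_count, front=10, back=0):
    """Return sorted 0-based indices of the first `front` and last `back` pages."""
    front_pages = range(min(front, page_count))
    back_pages = range(max(page_count - back, 0), page_count)
    # A set union avoids duplicates when a short document overlaps both ranges.
    return sorted(set(front_pages) | set(back_pages))
```

For a 300-page book, `matter_page_indices(300, front=5, back=2)` selects seven pages instead of 300.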

  • Enable pre-extraction validation scan to verify PDF integrity

  • Restore datetime parsing from prototype, set timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]
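
The intent of the timestamp expression is to trim microseconds down to milliseconds, which matches the default layout of Python's `logging` timestamps. A small sketch, where `format_timestamp` is a hypothetical wrapper that accepts a fixed time so it can be tested:

```python
from datetime import datetime


def format_timestamp(moment=None):
    """Format a datetime as 'YYYY-MM-DD HH:MM:SS,mmm' (millisecond precision)."""
    moment = moment or datetime.now()
    # %f yields six microsecond digits; dropping the last three leaves milliseconds.
    return moment.strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]
```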

  • Move bugs to GitHub Issues

Medium

  • Docs: extract the .runtime-ignore folder example on Evaluation page into a dedicated guide

  • Unify regex constants

  • Half-title page: give it greater weight than the cover page, since it’s structured text rather than OCR output

  • Script currently discards 1st edition. Make a sensible default by wrapping this behavior in a DISCARD_FIRST_EDITION config setting, defaulting to False.
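
A sketch of how the proposed flag might gate the current behavior. `normalize_edition` and the regex are hypothetical, and the pattern only covers a few common spellings.

```python
import re

DISCARD_FIRST_EDITION = False  # proposed default from the item above

# Matches "1st", "First", "1st ed.", "first edition", in any letter case.
_FIRST_EDITION = re.compile(r"^\s*(1st|first)(\s+ed(ition|\.)?)?\s*$", re.IGNORECASE)


def normalize_edition(edition, discard_first=DISCARD_FIRST_EDITION):
    """Return the cleaned edition string, or None when it should be omitted."""
    if not edition:
        return None
    if discard_first and _FIRST_EDITION.match(edition):
        return None
    return edition.strip()
```

With the default of False, a first edition is kept in the output, so nothing is dropped unless the user opts in.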

  • Customizable metadata template, e.g.:

    [ISBN]- [Title], [Subtitle], [Author]- [Edition]- [ISBN].pdf
    vs
    [Title], [Subtitle], ([Author]), [Publisher], ([Edition]), ([Year]).pdf
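
A rough sketch of rendering such a template, assuming `[Field]` placeholders are filled from a dictionary of extracted metadata. `render_filename` is hypothetical, and the cleanup only handles empty parentheses and doubled commas, not every separator a template might use.

```python
import re

_PLACEHOLDER = re.compile(r"\[(\w+)\]")


def render_filename(template, fields):
    """Fill [Field] placeholders and tidy the gaps left by missing fields."""
    def substitute(match):
        return str(fields.get(match.group(1)) or "")

    name = _PLACEHOLDER.sub(substitute, template)
    name = name.replace("()", "")                # drop emptied parentheses
    name = re.sub(r"(,\s*){2,}", ", ", name)     # collapse ", , " into ", "
    return name.strip()
```

For example, `render_filename("[Title], [Subtitle], ([Author]), ([Edition]), ([Year]).pdf", {"Title": "Dune", "Author": "Frank Herbert", "Year": 1965})` skips the missing Subtitle and Edition cleanly.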
    
  • Other techniques for preprocessing images: https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md

  • Create PDF files for all test cases

  • Test the 255-character filename limit. Might need to reserve 4 characters for the file extension.
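
A sketch of truncating a generated name while reserving room for the extension. It assumes the limit is counted in UTF-8 bytes, which holds on some filesystems but not all, so the tests should confirm the behavior per platform. `fit_filename` is a hypothetical helper.

```python
from pathlib import PurePath

MAX_NAME_BYTES = 255  # assumption: per-name limit, counted in UTF-8 bytes


def fit_filename(name, limit=MAX_NAME_BYTES):
    """Shorten the stem so that stem + extension fits within `limit` bytes."""
    suffix = PurePath(name).suffix                 # ".pdf" costs 4 bytes
    stem = name[: len(name) - len(suffix)]
    budget = limit - len(suffix.encode("utf-8"))
    encoded = stem.encode("utf-8")[:budget]
    # errors="ignore" drops a multi-byte character cut in half by the slice.
    return encoded.decode("utf-8", errors="ignore") + suffix
```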

  • Refactor: Convert configuration variables in config.py to a class for type hinting and safety
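
One possible shape for this refactor is a frozen dataclass. The field names below appear elsewhere in this roadmap, but the defaults and types are illustrative assumptions, not the project's actual values.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Config:
    """Typed settings; frozen so they cannot be changed by accident mid-run."""
    WRITE_PDF_CHANGES: bool = False
    VERBOSE_TERM: bool = True
    ZOOM_FACTOR: float = 2.0  # illustrative default only


config = Config(VERBOSE_TERM=False)
```

Type checkers and editors can then flag a misspelled setting name, and assigning to a field after construction raises `FrozenInstanceError`.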

  • Docs: Update Quickstart with instructions for Windows, Linux, other Unix-like systems

  • Docs: changelog.md in this format

  • Validate config.py settings on script start
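
Startup validation could be as small as a type check against a schema. This is a sketch: `validate_settings` and the schema contents are hypothetical, and a real check would probably also cover value ranges and paths.

```python
def validate_settings(settings, schema):
    """Return a list of problems; an empty list means the settings look valid."""
    problems = []
    for name, expected_type in schema.items():
        if name not in settings:
            problems.append(f"{name} is missing")
            continue
        value = settings[name]
        # bool is a subclass of int, so reject it explicitly for numeric settings.
        wrong_bool = isinstance(value, bool) and expected_type is not bool
        if wrong_bool or not isinstance(value, expected_type):
            problems.append(
                f"{name} should be {expected_type.__name__}, got {type(value).__name__}"
            )
    return problems
```

Passing `vars(config)` as `settings` would check a module-style config.py; the script can print the problems and exit before touching any PDF.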

  • Look into a more accurate token estimation method that applies caching discounts where applicable.
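
The arithmetic behind a cache-aware estimate can be sketched as below. The price and discount are parameters because actual rates vary by provider and model; the 0.5 default is a placeholder assumption, not a quoted rate, and `estimate_input_cost` is a hypothetical helper.

```python
def estimate_input_cost(prompt_tokens, cached_tokens, price_per_million, cached_discount=0.5):
    """Estimate input cost when `cached_tokens` of the prompt are billed at a discount."""
    uncached_tokens = prompt_tokens - cached_tokens
    # Cached tokens are charged at (1 - discount) of the normal rate.
    billable = uncached_tokens + cached_tokens * (1 - cached_discount)
    return billable * price_per_million / 1_000_000
```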

  • Write script to restore exported annotation files for:

    • exact CRC matches
    • identical books but slightly mismatched metadata, like macOS tags
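
For the exact-match case, a sketch assuming the checksum in question is CRC-32 over the file bytes and that exported annotations are indexed by that value. Both helpers are hypothetical.

```python
import zlib


def file_crc32(path, chunk_size=1 << 20):
    """Compute the CRC-32 of a file without loading it all into memory."""
    crc = 0
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc


def match_annotations(annotations_by_crc, pdf_paths):
    """Map each PDF path to its annotation file when the CRC matches exactly."""
    matches = {}
    for path in pdf_paths:
        annotation = annotations_by_crc.get(file_crc32(path))
        if annotation is not None:
            matches[path] = annotation
    return matches
```

The mismatched-metadata case needs a different key, since any change to embedded metadata alters the file's CRC.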
  • Configurable data directory, instead of hardcoded [project-root]/data

  • Add support for alternative model providers:

    • Metadata and filename generation: Gemini, Claude, Grok
    • Vision analysis: AWS Bedrock, Google Gemini, hosted OSS VLMs
  • Refactor: Replace PyPDF2 with PyMuPDF for metadata extraction. No need for 2 separate libraries

  • Benchmark performance with various open-source VLMs. See References in Meta and Ollama

  • Docs: Themes:

    • Light and dark check
    • Dark mode images
  • Investigate https://github.com/orasik/parsevision for inspecting OCR results, ensuring correct text is extracted

Low

  • Store multiple publishers

  • Better way to set “entire document” processing, other than setting front.max_pages to an arbitrarily large value

  • Refactor: Split WRITE_PDF_CHANGES into two flags, WRITE_METADATA and RENAME_FILES, since running them independently is sometimes desired

  • Refactor: Rename src/doc to something more descriptive; doc is too vague

  • Docs: Mintlify, OpenAI FA icon

  • Refactor: Standardize ‘folder’ and ‘directory’ terminology

Open Questions

  • How do we handle Unicode characters in filenames?
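
One option to evaluate is normalizing names to a single Unicode form before comparing or writing them, because the same visible name can be stored in composed or decomposed form. Choosing NFC here is an assumption, not a decision, and `normalize_filename` is a hypothetical helper.

```python
import unicodedata


def normalize_filename(name, form="NFC"):
    """Return the name in one canonical Unicode form so equal names compare equal."""
    return unicodedata.normalize(form, name)
```

For example, "e" followed by a combining acute accent and the single precomposed "é" render identically but compare unequal until normalized.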

  • Investigate ZOOM_FACTOR; unsure whether it’s having any effect

  • Consider merging Title and Subtitle; keeping them separate might be more trouble than it’s worth

  • What to do about missing images when reading mdx files on GitHub?

  • Docs: Dedicated troubleshooting page https://mintlify.com/docs/api-playground/troubleshooting, or inline FAQ on each page?

  • Docs: Token costs: pricing table like https://docs.perplexity.ai/guides/usage-tiers

  • Docs: should we name the model pages by

    • type - what they do, OCR, VLM, LLM generation
    • provider - OpenAI, Ollama, Tesseract
    • model name - gpt-4o, llama-3.2, LSTM
  • How to ignore log files from Cursor indexing, while keeping them draggable into chat?

  • How to keep runtime-ignore filename in .gitignore in sync with config.py?
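
One low-tech answer is a startup or CI check that fails when the two drift apart. This sketch only does exact-entry matching and does not interpret .gitignore glob patterns; `gitignore_covers` is a hypothetical helper.

```python
def gitignore_covers(gitignore_text, folder_name):
    """Return True if .gitignore has a literal entry for the given folder name."""
    wanted = folder_name.strip("/")
    for line in gitignore_text.splitlines():
        entry = line.strip()
        # Skip blanks and comments; compare entries without slashes at either end.
        if entry and not entry.startswith("#") and entry.strip("/") == wanted:
            return True
    return False
```

The folder name would be read from config.py and the text from the repository's .gitignore, so a rename in one place is caught immediately.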