High

  • Docs: Run installation from git clone to validate correct behavior
  • Implement WRITE_PDF_CHANGES to save generated metadata and new filenames, with warning and interactive terminal prompt to prevent accidental run and destructive behavior
  • On every run:
    • Write config.py settings to logfile
    • Display config.py settings to terminal, when VERBOSE_TERM = True
  • Pass metadata fields that are not part of our extraction through to the destination files. For example, Series, Language, Pages, Dimensions, etc.
  • Adding PyMuPDF4LLM greatly improved results, particularly with title:subtitle splitting, but nearly doubled average processing time per file from ~15 to ~30 seconds. Need to now selectively process only the important matter pages to optimize speed
  • Enable pre-extraction validation scan to verify PDF integrity
  • Restore datetime parsing from prototype, set timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]
  • Move bugs to GitHub Issues
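
The timestamp item above can be sketched as follows. This is a minimal illustration, assuming the goal is a millisecond-precision timestamp in the same layout that Python's logging module uses by default; the helper name `log_timestamp` is hypothetical and not taken from the prototype.

```python
from datetime import datetime


def log_timestamp(now=None):
    """Return a timestamp such as '2024-01-31 09:15:02,123' (hypothetical helper)."""
    if now is None:
        now = datetime.now()
    # %f yields six-digit microseconds; dropping the last three digits
    # leaves milliseconds.
    return now.strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]


print(log_timestamp())
```

Note that `strftime` is called with parentheses and the slice `[:-3]` is applied to the resulting string.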

Medium

  • Docs: extract the .runtime-ignore folder example on Evaluation page into a dedicated guide
  • Unify regex constants
  • Half-title: greater weight than cover page, as it’s structured text, not OCR
  • Script currently discards 1st edition. Wrap this behavior in a DISCARD_FIRST_EDITION config setting, defaulting to False.
  • Customizable metadata template, e.g.:
    [ISBN]- [Title], [Subtitle], [Author]- [Edition]- [ISBN].pdf
    vs
    [Title], [Subtitle], ([Author]), Publisher, ([Edition]), ([Year]).pdf
    
  • Other techniques for preprocessing images: https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md
  • Create PDF files for all test cases
  • Test 255 character limit. Might need to leave 4 characters for file extensions.
  • Refactor: Convert configuration variables in config.py to a class for type hinting and safety
  • Docs: Update Quickstart with instructions for Windows, Linux, other Unix-like systems
  • Docs: changelog.md in this format
  • Validate config.py settings on script start
  • Look into a more accurate token estimation method that applies caching discounts where applicable.
  • Write script to restore exported annotation files for:
    • exact CRC matches
    • identical books but slightly mismatched metadata, like macOS tags
  • Configurable data directory, instead of hardcoded [project-root]/data
  • Add support for alternative model providers:
    • Metadata and filename generation: Gemini, Claude, Grok
    • Vision analysis: AWS Bedrock, Google Gemini, hosted OSS VLMs
  • Refactor: Replace PyPDF2 with PyMuPDF for metadata extraction. No need for 2 separate libraries
  • Benchmark performance with various open-source VLMs. See References in Meta and Ollama
  • Docs: Themes:
    • Light and dark check
    • Dark mode images
  • Investigate https://github.com/orasik/parsevision for inspecting OCR results, ensuring correct text is extracted
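
The customizable filename template and the 255-character limit from the list above could be combined along these lines. This is a rough sketch only: the `render_filename` helper, the handling of missing fields, and the truncation strategy are assumptions. The limit is counted in UTF-8 bytes here, on the assumption that the target filesystem caps names at 255 bytes rather than 255 characters.

```python
import re

MAX_FILENAME_BYTES = 255  # assumed per-name limit


def render_filename(template, metadata, ext=".pdf"):
    """Fill [Field] placeholders from metadata and keep the name within the limit."""
    # Replace each [Field] with its value, or an empty string when it is missing.
    stem = re.sub(r"\[(\w+)\]", lambda m: str(metadata.get(m.group(1), "")), template)
    # Collapse whitespace runs left behind by empty fields.
    stem = re.sub(r"\s+", " ", stem).strip()
    # Reserve room for the extension, then trim without splitting a
    # multi-byte character (an incomplete trailing byte is dropped).
    budget = MAX_FILENAME_BYTES - len(ext.encode("utf-8"))
    stem = stem.encode("utf-8")[:budget].decode("utf-8", errors="ignore").rstrip()
    return stem + ext


metadata = {"Title": "Dune", "Author": "Frank Herbert", "Year": 1965}
print(render_filename("[Title] ([Author]) ([Year])", metadata))
# Dune (Frank Herbert) (1965).pdf
```

Missing fields leave stray punctuation such as empty parentheses in this sketch; cleaning that up is left out for brevity.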

Low

  • Store multiple publishers
  • Better way to set “entire document” processing, other than setting front.max_pages to an arbitrarily large value
  • Refactor: Split WRITE_PDF_CHANGES into two flags, WRITE_METADATA and RENAME_FILES, since running them independently is sometimes desired
  • Refactor: Rename src/doc to something more specific, doc is too vague
  • Docs: Mintlify, OpenAI FA icon
  • Refactor: Standardize ‘folder’ and ‘directory’ terminology
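
Splitting WRITE_PDF_CHANGES into independent flags might look like the sketch below. It assumes the configuration has moved into a class, as proposed under Medium. The `WriteSettings` class and the `from_legacy` helper are hypothetical names, and the False defaults are an assumption chosen to keep a default run non-destructive. Only WRITE_METADATA, RENAME_FILES, and WRITE_PDF_CHANGES come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteSettings:
    """Hypothetical container for the two proposed flags."""

    WRITE_METADATA: bool = False  # write generated metadata into the PDF
    RENAME_FILES: bool = False    # rename the PDF to the generated filename

    def __post_init__(self):
        # Reject non-boolean values early, e.g. the string "False", which is truthy.
        for name in ("WRITE_METADATA", "RENAME_FILES"):
            if not isinstance(getattr(self, name), bool):
                raise TypeError(f"{name} must be a bool")

    @classmethod
    def from_legacy(cls, write_pdf_changes):
        """Map the old single flag onto both new flags."""
        return cls(WRITE_METADATA=write_pdf_changes, RENAME_FILES=write_pdf_changes)


settings = WriteSettings(WRITE_METADATA=True)
print(settings)
```

Keeping a `from_legacy`-style mapping would let existing config files that still set WRITE_PDF_CHANGES keep working during the transition.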

Open Questions

  • How do we handle Unicode characters in filenames?
  • Investigate ZOOM_FACTOR, unsure it’s having an effect
  • Consider merging Title and Subtitle, might be more trouble than it’s worth keeping them separate
  • What to do about missing images when reading mdx files on GitHub?
  • Docs: Dedicated troubleshooting page https://mintlify.com/docs/api-playground/troubleshooting, or inline FAQ on each page?
  • Docs: Token costs: pricing table like https://docs.perplexity.ai/guides/usage-tiers
  • Docs: should we name the model pages by
    • type - what they do, OCR, VLM, LLM generation
    • provider - OpenAI, Ollama, Tesseract
    • model name - gpt-4o, llama-3.2, LSTM
  • How to ignore log files from Cursor indexing, but keep them draggable into chat?
  • How to keep runtime-ignore filename in .gitignore in sync with config.py?
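
On the Unicode filename question, one option worth evaluating is normalizing every generated name to a single Unicode form before writing or comparing it, because the same visible name can be stored in composed (NFC) or decomposed (NFD) form. A minimal sketch, assuming NFC is the chosen form; `normalize_filename` is a hypothetical helper.

```python
import unicodedata


def normalize_filename(name, form="NFC"):
    """Return name in a single Unicode normalization form (hypothetical helper)."""
    return unicodedata.normalize(form, name)


composed = "Caf\u00e9.pdf"     # 'é' as one code point
decomposed = "Cafe\u0301.pdf"  # 'e' followed by a combining acute accent

print(composed == decomposed)                                          # False
print(normalize_filename(composed) == normalize_filename(decomposed))  # True
```

This only addresses equivalent encodings of the same characters; whether to transliterate or strip non-ASCII characters remains a separate decision.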