High
- Docs: Run installation from `git clone` to validate correct behavior
- Implement `WRITE_PDF_CHANGES` to save generated metadata and new filenames, with a warning and interactive terminal prompt to prevent accidental runs and destructive behavior
- On every run:
  - Write `config.py` settings to logfile
  - Display `config.py` settings to terminal, when `VERBOSE_TERM = True`
  - Write
- Pass through metadata fields not in our extraction to the destination files. For example, `Series`, `Language`, `Pages`, `Dimensions`, etc.
- Adding PyMuPDF4LLM greatly improved results, particularly with title:subtitle splitting, but nearly doubled average processing time per file from ~15 to ~30 seconds. Need to now selectively process only the important matter pages to optimize speed
- Enable pre-extraction validation scan to verify PDF integrity
- Restore datetime parsing from prototype, set `timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]`
- Move bugs to GitHub Issues
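The `WRITE_PDF_CHANGES` safeguard listed above could look roughly like the sketch below. It is only an illustration: the function name `confirm_destructive_run`, its parameters, and the exact warning text are hypothetical, and the flag is hardcoded here rather than read from `config.py`.

```python
import sys
from typing import Callable, Optional

# Hypothetical stand-in; in the project this value would come from config.py.
WRITE_PDF_CHANGES = True


def confirm_destructive_run(
    write_changes: bool,
    ask: Callable[[str], str] = input,
    is_interactive: Optional[bool] = None,
) -> bool:
    """Return True only if writing is enabled and the user explicitly typed 'yes'."""
    if not write_changes:
        return False  # dry run: nothing to confirm, nothing gets written
    if is_interactive is None:
        is_interactive = sys.stdin.isatty()
    if not is_interactive:
        # Nobody can answer the prompt (cron, CI, piped input), so refuse to write.
        return False
    print("WARNING: WRITE_PDF_CHANGES is enabled. Metadata will be written and files renamed.")
    answer = ask("Type 'yes' to continue: ")
    return answer.strip().lower() == "yes"
```

Passing the prompt function in as `ask` keeps the guard testable without a real terminal; refusing to write in non-interactive sessions is one possible default, not a settled decision.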
Medium
- Docs: extract the `.runtime-ignore` folder example on Evaluation page into a dedicated guide
- Unify regex constants
- Half-title: greater weight than cover page, as it’s structured text, not OCR
- Script currently discards `1st edition`. Make a sensible default by wrapping this behavior in a `DISCARD_FIRST_EDITION` config setting, defaulting to `False`.
- Customizable metadata template, i.e.:
- Other techniques for preprocessing images: https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md
- Create PDF files for all test cases
- Test 255 character limit. Might need to leave 4 characters for file extensions.
- Refactor: Convert configuration variables in `config.py` to a class for type hinting and safety
- Docs: Update Quickstart with instructions for Windows, Linux, other Unix-like systems
- Docs: `changelog.md` in this format
- Validate `config.py` settings on script start
- Look into a more accurate token estimation method which applies caching discounts where applicable.
- Write script to restore exported annotation files for:
  - exact CRC matches
  - identical books but slightly mismatched metadata, like macOS tags
- Configurable data directory, instead of hardcoded `[project-root]/data`
- Add support for alternative model providers:
  - Metadata and filename generation: Gemini, Claude, Grok
  - Vision analysis: AWS Bedrock, Google Gemini, hosted OSS VLMs
- Refactor: Replace `PyPDF2` with `PyMuPDF` for metadata extraction. No need for 2 separate libraries
- Benchmark performance with various open-source VLMs. See References in Meta and Ollama
- Docs: Themes:
  - Light and dark check
  - Dark mode images
- Investigate https://github.com/orasik/parsevision for inspecting OCR results, ensuring correct text is extracted
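The two `config.py` items above (converting settings to a class, and validating them on script start) could be combined in a frozen dataclass, as in this minimal sketch. The field names loosely mirror settings mentioned in this roadmap, but the lowercase names, the defaults, and the `max_filename_length` field are assumptions, not the project's actual configuration.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Config:
    # Names and defaults are illustrative assumptions, not the real config.py.
    write_pdf_changes: bool = False
    verbose_term: bool = False
    discard_first_edition: bool = False
    data_dir: Path = Path("data")
    max_filename_length: int = 255

    def __post_init__(self) -> None:
        # Runs once at construction, so invalid settings fail at script start.
        for name in ("write_pdf_changes", "verbose_term", "discard_first_edition"):
            if not isinstance(getattr(self, name), bool):
                raise TypeError(f"{name} must be a bool")
        if not isinstance(self.data_dir, Path):
            raise TypeError("data_dir must be a pathlib.Path")
        if not 1 <= self.max_filename_length <= 255:
            raise ValueError("max_filename_length must be between 1 and 255")
```

Constructing `Config()` once at startup would give type hints for every setting and reject values such as `verbose_term="yes"` before any PDF is touched.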
Low
- Store multiple publishers
- Better way to set “entire document” processing, other than setting `front.max_pages` to an arbitrarily large value
- Refactor: Split `WRITE_PDF_CHANGES` into two flags, `WRITE_METADATA` and `RENAME_FILES`: running them independently is sometimes desired
- Refactor: Rename `src/doc` to a better name, `doc` is too vague
- Docs: Mintlify, OpenAI FA icon
- Refactor: Standardize ‘folder’ and ‘directory’ terminology
Open Questions
- How do we handle Unicode characters in filenames?
- Investigate `ZOOM_FACTOR`, unsure it’s having an effect
- Consider merging `Title` and `Subtitle`, might be more trouble than it’s worth keeping them separate
- What to do about missing images when reading `mdx` files on GitHub
- Docs: Dedicated troubleshooting page https://mintlify.com/docs/api-playground/troubleshooting, or inline FAQ on each page?
- Docs: Token costs: pricing table like https://docs.perplexity.ai/guides/usage-tiers
- Docs: should we name the model pages by
  - type - what they do, OCR, VLM, LLM generation
  - provider - OpenAI, Ollama, Tesseract
  - model name - gpt-4o, llama-3.2, LSTM
- How to exclude log files from Cursor indexing while keeping them draggable into chat?
- How to keep runtime-ignore filename in `.gitignore` in sync with `config.py`?
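For the Unicode filename question, one option to evaluate is normalizing names to NFC and trimming by encoded byte length, since many filesystems count the 255 limit in bytes rather than characters. The sketch below is an assumption about how that might work, not a decision; `safe_filename` and its parameters are hypothetical names.

```python
import unicodedata


def safe_filename(stem: str, ext: str = ".pdf", max_bytes: int = 255) -> str:
    """NFC-normalize a filename stem and trim it so stem + ext fits in max_bytes of UTF-8."""
    stem = unicodedata.normalize("NFC", stem)
    # Reserve room for the extension before trimming the stem.
    budget = max_bytes - len(ext.encode("utf-8"))
    if budget <= 0:
        raise ValueError("extension leaves no room for a filename")
    encoded = stem.encode("utf-8")[:budget]
    # Cutting bytes can split a multi-byte character; drop the partial one.
    trimmed = encoded.decode("utf-8", errors="ignore")
    return trimmed + ext
```

Reserving space for the extension covers the "leave 4 characters" note under Medium. Limits and normalization behavior differ between filesystems, so this would still need testing on each target platform.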