Roadmap
Pending tasks, improvements, and development plans
High
-
Docs: Run installation from
git clone
to validate correct behavior -
Implement
WRITE_PDF_CHANGES
to save generated metadata and new filenames, with a warning and an interactive terminal prompt to prevent accidental runs and destructive behavior -
On every run:
- Write
config.py
settings to logfile - Display
config.py
settings to terminal, when VERBOSE_TERM = True
-
Pass through metadata fields that our extraction does not cover to the destination files. For example,
Series
,Language
,Pages
,Dimensions
, etc. -
Adding PyMuPDF4LLM greatly improved results, particularly title:subtitle splitting, but nearly doubled the average processing time per file, from ~15 to ~30 seconds. Next step: selectively process only the important matter pages to optimize speed
-
Enable pre-extraction validation scan to verify PDF integrity
-
Restore datetime parsing from prototype, set
timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]
-
Move bugs to GitHub Issues
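The WRITE_PDF_CHANGES safeguard described above could be sketched roughly as follows. The function name confirm_write, the injectable ask parameter, and the prompt wording are all assumptions made for illustration; none of them exist in the project yet.

```python
from typing import Callable


def confirm_write(write_pdf_changes: bool, ask: Callable[[str], str] = input) -> bool:
    """Return True only if writing is enabled AND the user explicitly confirms.

    `ask` defaults to the built-in input(); it is injectable so the prompt
    can be exercised in tests without a real terminal (hypothetical design).
    """
    if not write_pdf_changes:
        # Dry run: nothing destructive will happen, so no prompt is needed.
        return False
    print("WARNING: WRITE_PDF_CHANGES is enabled. "
          "Metadata and filenames will be modified on disk.")
    answer = ask("Type 'yes' to continue: ")
    # Require the full word so a stray Enter or 'y' does not trigger a write.
    return answer.strip().lower() == "yes"
```

Requiring the full word "yes" is a deliberate choice in this sketch: an accidental keypress then falls back to the non-destructive path.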
Medium
-
Docs: extract the
.runtime-ignore
folder example on the Evaluation page into a dedicated guide -
Unify regex constants
-
Half-title: give it greater weight than the cover page, as it is structured text, not OCR output
-
Script currently discards
1st edition
. Make this configurable by wrapping the behavior in a DISCARD_FIRST_EDITION
config setting, defaulting to False
. -
Customizable metadata template, i.e.:
-
Other techniques for preprocessing images: https://github.com/tesseract-ocr/tessdoc/blob/main/ImproveQuality.md
-
Create PDF files for all test cases
-
Test the 255-character filename limit. Might need to reserve 4 characters for the file extension.
-
Refactor: Convert configuration variables in
config.py
to a class for type hinting and safety -
Docs: Update Quickstart with instructions for Windows, Linux, other Unix-like systems
-
Docs:
changelog.md
in this format -
Validate
config.py
settings on script start -
Look into a more accurate token estimation method that applies caching discounts where applicable.
-
Write script to restore exported annotation files for:
- exact CRC matches
- identical books with slightly mismatched metadata, like macOS tags
-
Configurable data directory, instead of hardcoded
[project-root]/data
-
Add support for alternative model providers:
- Metadata and filename generation: Gemini, Claude, Grok
- Vision analysis: AWS Bedrock, Google Gemini, hosted OSS VLMs
-
Refactor: Replace
PyPDF2
with PyMuPDF
for metadata extraction. No need for two separate libraries -
Benchmark performance with various open-source VLMs. See References in Meta and Ollama
-
Docs: Themes:
- Light and dark check
- Dark mode images
-
Investigate https://github.com/orasik/parsevision for inspecting OCR results, to ensure the correct text is extracted
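The config refactor and startup validation items above could take a shape like the following. This is a sketch under stated assumptions: the field names WRITE_PDF_CHANGES, VERBOSE_TERM, DISCARD_FIRST_EDITION, and ZOOM_FACTOR come from this roadmap, while DATA_DIR, every default value, and the describe helper are hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class Config:
    # Defaults are illustrative assumptions, not the project's real values.
    WRITE_PDF_CHANGES: bool = False
    VERBOSE_TERM: bool = True
    DISCARD_FIRST_EDITION: bool = False
    ZOOM_FACTOR: float = 2.0
    DATA_DIR: str = "data"  # hypothetical name for a configurable data directory

    def __post_init__(self) -> None:
        # Validate on construction so a bad setting fails at script start.
        for name in ("WRITE_PDF_CHANGES", "VERBOSE_TERM", "DISCARD_FIRST_EDITION"):
            if not isinstance(getattr(self, name), bool):
                raise TypeError(f"{name} must be a bool")
        # bool is a subclass of int, so reject it explicitly for numeric fields.
        if isinstance(self.ZOOM_FACTOR, bool) or not isinstance(self.ZOOM_FACTOR, (int, float)):
            raise TypeError("ZOOM_FACTOR must be a number")
        if self.ZOOM_FACTOR <= 0:
            raise ValueError("ZOOM_FACTOR must be positive")


def describe(cfg: Config) -> List[str]:
    """Render each setting as 'NAME = value', ready for the logfile or terminal."""
    return [f"{key} = {value!r}" for key, value in asdict(cfg).items()]
```

A frozen dataclass gives type hints and immutability with little ceremony, and the same describe output could serve both the logfile and the VERBOSE_TERM display.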
Low
-
Store multiple publishers
-
Better way to set “entire document” processing, other than setting
front.max_pages
to an arbitrarily large value -
Refactor: Split
WRITE_PDF_CHANGES
into two flags: WRITE_METADATA
and RENAME_FILES
. Running them independently is sometimes desired -
Refactor: Rename
src/doc
to a more descriptive name; doc is too vague -
Docs: Mintlify, OpenAI FA icon
-
Refactor: Standardize ‘folder’ and ‘directory’ terminology
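For the "entire document" item above, one option is to let an unset page limit mean "no limit" instead of relying on a very large number. The sketch below assumes a hypothetical helper pages_to_process and a max_pages parameter that accepts None; it is not how front.max_pages is handled today.

```python
from typing import Optional


def pages_to_process(total_pages: int, max_pages: Optional[int] = None) -> range:
    """Return the zero-based page indices to process.

    max_pages=None means "the entire document" (an assumed convention),
    so no arbitrarily large sentinel value is needed.
    """
    if max_pages is None:
        return range(total_pages)
    if max_pages < 0:
        raise ValueError("max_pages must be non-negative or None")
    # Never run past the end of a short document.
    return range(min(max_pages, total_pages))
```

With this convention, the whole-document case is explicit in the config and cannot silently truncate a very long PDF.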
Open Questions
-
How do we handle Unicode characters in filenames?
-
Investigate
ZOOM_FACTOR
, unclear whether it has any effect -
Consider merging
Title
and Subtitle
, keeping them separate might be more trouble than it is worth -
What to do about missing images when reading
mdx
files on GitHub? -
Docs: Dedicated troubleshooting page like https://mintlify.com/docs/api-playground/troubleshooting, or an inline FAQ on each page?
-
Docs: Token costs: pricing table like https://docs.perplexity.ai/guides/usage-tiers
-
Docs: Should we name the model pages by:
- type (what they do): OCR, VLM, LLM generation
- provider: OpenAI, Ollama, Tesseract
- model name: gpt-4o, llama-3.2, LSTM
-
How can we exclude log files from Cursor indexing while keeping them draggable into chat?
-
How do we keep the runtime-ignore filename in
.gitignore
in sync with config.py?
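As one possible answer to the Unicode filename question above, generated names could be normalized to NFC and trimmed by encoded byte length, since filesystems such as ext4 limit names to 255 bytes, not 255 characters. The helper name safe_filename and the MAX_NAME_BYTES constant are assumptions for illustration only.

```python
import unicodedata
from pathlib import Path

# 255 bytes is the per-name limit on ext4; other filesystems count differently.
MAX_NAME_BYTES = 255


def safe_filename(name: str, max_bytes: int = MAX_NAME_BYTES) -> str:
    """Normalize a filename to NFC and trim it to max_bytes of UTF-8.

    The extension is preserved; only the stem is shortened.
    """
    # NFC merges combining sequences such as 'e' + U+0301 into one code point.
    name = unicodedata.normalize("NFC", name)
    suffix = Path(name).suffix
    stem = name[: len(name) - len(suffix)] if suffix else name
    budget = max_bytes - len(suffix.encode("utf-8"))
    if budget <= 0:
        raise ValueError("extension alone exceeds the byte limit")
    encoded = stem.encode("utf-8")[:budget]
    # Drop any partial multi-byte character left at the cut point.
    stem = encoded.decode("utf-8", errors="ignore")
    return stem + suffix
```

Measuring in bytes matters because a name of 200 accented characters already exceeds 255 bytes in UTF-8, even though it is well under 255 characters.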