Non-deterministic - Each processing run may produce slightly different results due
to how LLMs generate output.
Field Stability - While core identifiers (ISBN, DOI, LOC) remain consistent,
interpretative fields like author names, subtitles, years, and publisher details can
vary between runs. See "What are the best use cases?" for guidance.
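
One way to see which fields are unstable for your own documents is to process the same file several times and compare the results. The sketch below assumes each run's output is available as a plain dictionary of field names to strings; `unstable_fields` and the sample values are hypothetical and not part of this tool.

```python
from collections import defaultdict


def unstable_fields(runs):
    """Return {field: distinct values} for fields that differ across runs."""
    seen = defaultdict(set)
    for result in runs:
        for field, value in result.items():
            seen[field].add(value)
    # A field counts as unstable if it took more than one value,
    # or if it was missing from at least one run.
    return {
        field: values
        for field, values in seen.items()
        if len(values) > 1 or any(field not in r for r in runs)
    }


# Made-up sample data for two runs over the same PDF.
runs = [
    {"isbn": "9780131103627", "author": "Brian W. Kernighan", "year": "1988"},
    {"isbn": "9780131103627", "author": "B. Kernighan", "year": "1988"},
]
print(sorted(unstable_fields(runs)))
```

Fields that show up in this report are the ones to avoid relying on in filename templates.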
PDF Writing - May produce unexpected metadata changes or file corruption. See
Evaluation for testing strategies to avoid unexpected results.
Filename Length - Limited to 255 characters to maintain compatibility across
different operating systems. Currently tested on macOS only; Windows and Linux support
is experimental. Please report any issues on GitHub.
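
As an illustration of the 255-character limit, here is a minimal sketch of shortening a generated name while keeping its extension. `truncate_filename` is a hypothetical helper, not the tool's actual implementation, and it counts characters as described above (some filesystems count bytes instead, which this sketch does not handle).

```python
from pathlib import Path

MAX_FILENAME_LENGTH = 255  # the limit stated above


def truncate_filename(name: str, limit: int = MAX_FILENAME_LENGTH) -> str:
    """Shorten the stem so the whole name fits in `limit` characters."""
    if len(name) <= limit:
        return name
    suffix = Path(name).suffix  # e.g. ".pdf"; empty string if none
    stem = name[: len(name) - len(suffix)]
    keep = limit - len(suffix)
    if keep <= 0:
        raise ValueError("extension alone exceeds the limit")
    # Cut the stem, drop any trailing whitespace left by the cut,
    # then re-attach the extension.
    return stem[:keep].rstrip() + suffix


long_name = "a" * 300 + ".pdf"
print(len(truncate_filename(long_name)))
```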
Unicode Support - Special characters in filenames can trigger issues on certain
operating systems.
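
If special characters cause trouble on your platform, one possible workaround is to normalize and sanitize names after processing. The sketch below is an assumption about what such a step could look like, not something this tool does: it applies Unicode NFC normalization and replaces the characters Windows reserves in filenames. `safe_filename` is a hypothetical name.

```python
import re
import unicodedata

# Characters Windows rejects in filenames, plus ASCII control characters.
_UNSAFE = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(name: str, replacement: str = "_") -> str:
    """Normalize to NFC and replace characters that are unsafe on Windows."""
    # NFC composes sequences such as "e" + combining acute into a single "é",
    # so the same title does not yield two different-looking names.
    normalized = unicodedata.normalize("NFC", name)
    cleaned = _UNSAFE.sub(replacement, normalized)
    # Windows also rejects names that end in a space or a dot.
    return cleaned.rstrip(" .")


print(safe_filename("Cafe\u0301: a/b?.pdf"))
```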
File Validation - Source PDF corruption scanning is not yet implemented.
Language Support - Primary language support is English. Limited support for
non-Latin character sets and right-to-left languages. Arabic, Chinese, Japanese and
Korean text may produce inconsistent results.
Confidence Floor - Text segments with OCR confidence below 30% are automatically
discarded. This is a subjective threshold and you’ll need to experiment with your own
documents to find the best balance.
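
To make the threshold concrete, the following sketch filters OCR segments by confidence and shows how the number of surviving segments changes as the floor moves. The `Segment` type and `filter_segments` function are hypothetical stand-ins for whatever your OCR step returns; only the 30% default comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    confidence: float  # OCR confidence as a percentage, 0-100


def filter_segments(segments, floor=30.0):
    """Keep segments at or above the floor; discard those below it."""
    return [s for s in segments if s.confidence >= floor]


# Made-up sample segments for illustration.
segments = [
    Segment("The C Programming Language", 96.0),
    Segment("Sec0nd Ed1tion", 41.5),
    Segment("~~|l1", 12.0),
]

# Try a few floors to see how many segments survive each one.
for floor in (10.0, 30.0, 50.0):
    kept = filter_segments(segments, floor)
    print(floor, len(kept))
```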
Limited Coverage - Vision analysis is selective, only processing the cover page,
high-image-content pages (>90%), early pages with poor OCR, and mixed-content layouts.
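
The selection rules above can be read as a simple per-page predicate. The sketch below is only an interpretation: `PageInfo`, `needs_vision`, the `EARLY_PAGES` cutoff, and the `POOR_OCR` threshold are all assumptions, since the text does not define "early" or "poor", and the mixed-layout flag is assumed to come from a separate detector. Only the 90% image-content rule is taken from the description.

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    index: int             # 0-based page number; 0 is the cover
    image_ratio: float     # fraction of the page covered by images, 0.0-1.0
    ocr_confidence: float  # mean OCR confidence for the page, 0-100
    mixed_layout: bool     # output of some layout detector (assumed)


EARLY_PAGES = 5   # assumption: what counts as an "early" page
POOR_OCR = 50.0   # assumption: what counts as "poor" OCR


def needs_vision(page: PageInfo) -> bool:
    """Return True if the page matches any of the four selection rules."""
    if page.index == 0:                      # cover page
        return True
    if page.image_ratio > 0.90:              # high image content (>90%)
        return True
    if page.index < EARLY_PAGES and page.ocr_confidence < POOR_OCR:
        return True                          # early page with poor OCR
    return page.mixed_layout                 # mixed-content layout


print(needs_vision(PageInfo(index=40, image_ratio=0.2, ocr_confidence=95.0, mixed_layout=False)))
```

Pages that match none of the rules, like the one in the usage line, are left to OCR text alone.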
Edition Filtering - First editions are automatically discarded from naming. This may
not be appropriate for all use cases. A configurable setting is planned for the next
release.
Metadata Transfer - Carrying over existing PDF metadata to the new file is not yet
implemented.
Test Coverage - Limited test suite focusing mainly on happy paths. Edge cases and
error conditions need more coverage. See Test Cases for current scenarios.
Real Test Files - Actual PDF files still need to be created for each test case, for
example books with multiple ISBN formats and various edition formats. If you have
specific documents, please submit a discussion. Setting up a benchmark would be
helpful. Feedback and ideas for this are greatly appreciated!
See the Roadmap for planned improvements to these limitations.