Duplicate files are a common problem in large PDF collections. You might have copies of the same document scattered across:

  • cloud storage services (iCloud, Google Drive, Dropbox, OneDrive)
  • local file systems (internal and external hard drives, USB drives, etc.)
  • social media downloads

Running the pipeline with the --with-stats-export argument creates a plain text file named stats.txt in the DIAGNOSTIC_FOLDER. It contains an entry for each processed file: the current and proposed filenames, annotation status, tags, and CRC32 hashes of both the file and its metadata.
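The File CRC field is presumably a CRC32 checksum of the file's raw bytes, so two byte-identical PDFs produce the same value regardless of filename. As a minimal sketch of how such a checksum can be computed, using Python's standard zlib module (the pipeline's own implementation may differ):

import zlib

def file_crc32(path: str, chunk_size: int = 65536) -> str:
    """Compute the CRC32 of a file's contents, reading in chunks."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return f"{crc & 0xFFFFFFFF:08x}"

print(file_crc32("grey_systems_analysis.pdf"))  # prints a hash like 74140e38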

Output from the Quickstart guide example:

stats.txt
Current:  grey_systems_analysis.pdf
Proposed: Grey Systems Analysis, Methods, Models and Applications, (Sifeng Liu), Springer, (2nd Ed.), (2025).pdf
Has Annotations: no
Tags: none
File CRC: 74140e38
Metadata: 70682d14

Summary:
	Valid: 1
	Invalid: 0
	Successful: 1
	Failed: 0
	With Annotations: 0
	Files with API Timeouts: 0

Total:
	0m 27s
	Tokens: Input: 3909, Output: 721
	$0.001019

Average per PDF:
	0m 27s
	Tokens: Input: 3909.00, Output: 721.00
	$0.001019

This makes it easy to run the pipeline over many source directories, then use git diff on the exported stats files to identify identical files hiding behind different filenames or metadata.

The script takes no action to reconcile these differences itself, but its output is a great starting point for the deduplication process; one way to automate the grouping step is sketched below.
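Since the export is plain text with a fixed record layout, you can also group records by their File CRC programmatically. A minimal Python sketch, assuming the field layout shown above; the two stats.txt paths are placeholders:

import re
from collections import defaultdict

def find_duplicates(stats_paths):
    """Group 'Current' filenames by File CRC across several stats.txt exports."""
    by_crc = defaultdict(list)
    for path in stats_paths:
        with open(path, encoding="utf-8") as f:
            text = f.read()
        # Pair each 'Current:' filename with the 'File CRC:' that follows it.
        for name, crc in re.findall(r"Current:\s+(.+?)\n.*?File CRC:\s+(\S+)", text, re.S):
            by_crc[crc].append((path, name))
    # Keep only CRCs that appear more than once: likely duplicates.
    return {crc: hits for crc, hits in by_crc.items() if len(hits) > 1}

for crc, hits in find_duplicates(["drive_a/stats.txt", "drive_b/stats.txt"]).items():
    print(crc, hits)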

macOS Tags

macOS file tags and colors are also included:

Current:  research.pdf
Tags: Red, Important, Research
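To inspect these tags outside the pipeline, one option is the third-party xattr package, which exposes the extended attribute Finder stores them in. A minimal sketch (the package choice is an assumption; the pipeline may read tags differently):

import plistlib
import xattr  # third-party package: pip install xattr

def finder_tags(path: str) -> list[str]:
    """Read macOS Finder tags from a file's extended attributes."""
    try:
        raw = xattr.getxattr(path, "com.apple.metadata:_kMDItemUserTags")
    except OSError:
        return []  # attribute missing: the file has no tags
    # The attribute is a binary plist of strings like 'Red\n6' (tag name + color code).
    return [entry.split("\n")[0] for entry in plistlib.loads(raw)]

print(finder_tags("research.pdf"))  # e.g. ['Red', 'Important', 'Research']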