Key Concepts
Duplicate Detection
Duplicate files are a common issue with large PDF collections. You might have several copies across:
- cloud file systems (iCloud, Google Drive, Dropbox, OneCloud)
- local file systems (USB drives, hard drives, etc.)
- external hard drives
- social media downloads
Running the pipeline with the --with-stats-export argument creates a plain text file named stats.txt in the DIAGNOSTIC_FOLDER. It lists the processed files, along with their filename and metadata CRC32 hash.
Output from the Quickstart guide example:
stats.txt
This allows you to easily process many source directories, then use git diff to identify identical files with different filename/metadata.
This script will not take any action to reconcile differences, but is a great starting point for the deduplication process.
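As a rough illustration of the idea behind CRC32-based duplicate detection, the sketch below groups PDF files by the CRC32 of their contents. It is not the pipeline's own implementation and it does not parse stats.txt, whose exact layout is not shown here. The function names crc32_of_file and find_duplicates are hypothetical, and hashing file contents (rather than metadata) is an assumption made for the example.

```python
import zlib
from collections import defaultdict
from pathlib import Path


def crc32_of_file(path, chunk_size=65536):
    """Return the CRC32 of a file's contents as an 8-digit hex string.

    Hypothetical helper: reads in chunks so large PDFs are not loaded whole.
    """
    crc = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return f"{crc & 0xFFFFFFFF:08x}"


def find_duplicates(root):
    """Map each CRC32 to the .pdf paths under root that share it.

    Only checksums shared by two or more files are returned.
    Assumption: files are matched by the "*.pdf" pattern, which is
    case-sensitive on some platforms.
    """
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*.pdf")):
        if path.is_file():
            groups[crc32_of_file(path)].append(path)
    return {crc: paths for crc, paths in groups.items() if len(paths) > 1}
```

A matching CRC32 is a strong hint rather than proof that two files are identical, since different contents can share a checksum. Compare the candidates byte for byte before deleting anything.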
macOS Tags
macOS file tags and colors are also included: