# Duplicate Detection

Duplicate files are a common issue with large PDF collections. You might have several
copies across:

* cloud file systems (iCloud, Google Drive, Dropbox, OneDrive)
* local file systems (USB drives, hard drives, etc.)
* external hard drives
* social media downloads

Running the pipeline with the `--with-stats-export` argument creates a plain text file named `stats.txt` in the
<a href="/config/diagnostic#diagnostic-folder" class="param-field-link">DIAGNOSTIC\_FOLDER</a>.
For each processed file, it records the current and proposed filename, the file CRC,
and the metadata CRC32 hash.
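
If you want to check a file-level checksum independently, the sketch below computes a CRC-32 over a file's raw bytes using Python's standard `zlib` module. It assumes the exported value is a standard CRC-32 of the whole file, rendered as eight lowercase hex digits; the page does not confirm this, and `file_crc32` is a hypothetical helper, not part of the pipeline.

```python
import zlib
from pathlib import Path


def file_crc32(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the CRC-32 of a file's bytes as 8 lowercase hex digits.

    Assumption: the pipeline hashes the raw file bytes with standard CRC-32.
    """
    crc = 0
    with path.open("rb") as fh:
        # Read in chunks so large PDFs are not loaded into memory at once.
        while chunk := fh.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return f"{crc & 0xFFFFFFFF:08x}"
```

If the result does not match the exported value for a known file, the pipeline likely hashes something other than the raw bytes, and this helper should not be used for comparison.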

Output from the [Quickstart guide](/getting-started/quickstart#review-the-results)
example:

```bash stats.txt {5-6} theme={null}
Current:  grey_systems_analysis.pdf
Proposed: Grey Systems Analysis, Methods, Models and Applications, (Sifeng Liu), Springer, (2nd Ed.), (2025).pdf
Has Annotations: no
Tags: none
File CRC: 74140e38
Metadata: 70682d14

Summary:
	Valid: 1
	Invalid: 0
	Successful: 1
	Failed: 0
	With Annotations: 0
	Files with API Timeouts: 0

Total:
	0m 27s
	Tokens: Input: 3909, Output: 721
	$0.001019

Average per PDF:
	0m 27s
	Tokens: Input: 3909.00, Output: 721.00
	$0.001019
```

The export lets you process many source directories, then compare the resulting `stats.txt`
files with `git diff` to identify identical files whose filenames or metadata differ.

The pipeline does not take any action to reconcile these differences, but the export is a
good starting point for the deduplication process.
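
As an alternative to reading diffs by eye, the records in one or more `stats.txt` files can be grouped by their `File CRC` value. The following is a minimal sketch that assumes every record follows the `Key: value` layout of the sample output and begins with a `Current:` line. `parse_entries` and `find_duplicates` are hypothetical helper names, not part of the pipeline. Matching CRC values indicate likely identical files, since CRC32 collisions are possible.

```python
from collections import defaultdict

# Field names taken from the sample stats.txt output.
FIELDS = {"Current", "Proposed", "Has Annotations", "Tags", "File CRC", "Metadata"}


def parse_entries(text: str) -> list[dict[str, str]]:
    """Parse per-file records from stats.txt-style text."""
    entries: list[dict[str, str]] = []
    entry = None
    for line in text.splitlines():
        # Split on the first colon only, so colons inside filenames survive.
        key, sep, value = line.partition(":")
        key = key.strip()
        if not sep or key not in FIELDS:
            continue  # skips blank lines and the summary/totals sections
        if key == "Current":
            entry = {}  # a 'Current:' line starts a new record
            entries.append(entry)
        if entry is not None:
            entry[key] = value.strip()
    return entries


def find_duplicates(entries: list[dict[str, str]]) -> dict[str, list[dict[str, str]]]:
    """Group records by File CRC and keep groups with more than one file."""
    groups: dict[str, list[dict[str, str]]] = defaultdict(list)
    for e in entries:
        crc = e.get("File CRC")
        if crc:
            groups[crc].append(e)
    return {crc: es for crc, es in groups.items() if len(es) > 1}
```

To use it, read each export with `Path(...).read_text()`, concatenate the parsed entries, and pass them to `find_duplicates`. Within each group, comparing the `Proposed` and `Metadata` fields shows which copies differ.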

### macOS Tags

macOS file tags and colors are also included:

```bash theme={null}
Current:  research.pdf
Tags: Red, Important, Research
```
