Overview

These are the main pipeline settings. Experiment to find the best values for your use case.

MATTER_CONFIG

The core pipeline object. It determines which matter pages are processed. What are matter pages?

Default configuration

MATTER_CONFIG = {
    "front": {
        "max_pages": 8
    },
    "back": {
        "mode": "never",
        "max_pages": 5,
        "fields": {
            "publisher": False,
            "year": False,
            "edition": False,
            "isbn": False,
            "doi": False,
            "loc": False
        }
    }
}

Why are title and author not in the fields section?

Title and at least one author are always required and are implicitly included. In other words, we need some baseline fields to name a file.

This will be made configurable in the next update.

Front Matter

front
object
required

Front Matter Examples

MATTER_CONFIG = {
    "front": {
        # Safe default for most documents
        "max_pages": 8,
    }
}

Back matter

Back matter generally contains less of the metadata we need. It is processed primarily as a fallback when the title and at least one author name were not found in the front matter.

There are certain situations where you may want to process back matter differently.

back
object
required

Back Matter Examples

# Never process back matter, regardless of what's found in front matter.
# Note that even though ISBN is set to `True`, it won't trigger a
# back matter search because back matter processing is disabled.
MATTER_CONFIG = {
    "back": {
        "mode": "never",
        "max_pages": 5,  # no effect

        # no effect
        "fields": {
            "publisher": False,
            "year": False,
            "edition": False,
            "isbn": True,
            "doi": False,
            "loc": False
        }
    }
}
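
The fallback behavior described above can be sketched as a small decision function. This is a minimal illustration, not the project's implementation: the function name `should_process_back_matter`, the shape of the `front_metadata` dict (including the `title` and `authors` keys), and the treatment of any mode other than "never" are all assumptions made for the example.

```python
def should_process_back_matter(front_metadata, matter_config):
    """Decide whether back matter needs scanning (hypothetical helper)."""
    back = matter_config["back"]

    # "never" disables back matter entirely; max_pages and fields are ignored.
    if back["mode"] == "never":
        return False

    # Assumption: for any other mode, back matter is a fallback used when
    # the front matter did not yield a title and at least one author.
    has_title = bool(front_metadata.get("title"))
    has_author = bool(front_metadata.get("authors"))
    if not (has_title and has_author):
        return True

    # Assumption: a field set to True triggers a back matter search
    # when that field is still missing after front matter processing.
    return any(
        wanted and not front_metadata.get(name)
        for name, wanted in back.get("fields", {}).items()
    )
```

With the "never" configuration shown above, the function returns False no matter what the front matter produced, which is why the `isbn` flag has no effect there.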

Counting Pages

PAGE_NUM_OFFSET
number
default:"1"
required

Sets how pages are numbered when naming diagnostic files. 1-based numbering makes it easier to cross-reference diagnostic files against the page numbers of the source PDF.

PAGE_NUM_OFFSET = 1
MATTER_CONFIG = {
    "front": {
        "max_pages": 8,
    }
}
Pages will be numbered 1-8. Front cover is page 1.
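
The numbering above can be reproduced with a short sketch. The helper name `diagnostic_page_numbers` is hypothetical and the way the pipeline actually builds diagnostic filenames is not shown here. The sketch only illustrates how an offset turns zero-based page indexes into the numbers described.

```python
PAGE_NUM_OFFSET = 1
MATTER_CONFIG = {"front": {"max_pages": 8}}


def diagnostic_page_numbers(max_pages, offset=PAGE_NUM_OFFSET):
    """Map zero-based page indexes to the numbers used in diagnostics."""
    return [index + offset for index in range(max_pages)]


front_pages = diagnostic_page_numbers(MATTER_CONFIG["front"]["max_pages"])
print(front_pages)  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Setting the offset to 0 in this sketch would number the same pages 0-7, which no longer lines up with the page numbers shown by most PDF viewers.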

Writing Changes

Critical setting. Controls whether generated metadata and filenames are written to PDFs in [project-root]/data/ that are not inside a RUNTIME_IGNORE_DIR_NAME directory.

Only set this to True once you have run the script and confirmed suggestions are acceptable. Review Evaluation for a full guide.

Back up your files before setting this to True. Many edge cases may cause unexpected results. See Test Cases for cases we test for, and Known Issues for those we know about.

WRITE_PDF_CHANGES
boolean
default:"false"
required
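
A dry-run guard of the kind this setting implies might look like the sketch below. It is an assumption-laden illustration rather than the pipeline's code: the function name `apply_suggestion` and the message format are hypothetical, and only the filename change is modeled, not the metadata write.

```python
from pathlib import Path

WRITE_PDF_CHANGES = False


def apply_suggestion(pdf_path, suggested_name, write_changes=WRITE_PDF_CHANGES):
    """Rename a PDF only when writing is enabled (hypothetical helper)."""
    pdf_path = Path(pdf_path)
    target = pdf_path.with_name(suggested_name)

    # With writing disabled, report the suggestion and leave the file alone.
    if not write_changes:
        return f"DRY RUN: would rename {pdf_path.name} -> {target.name}"

    pdf_path.rename(target)
    return f"renamed {pdf_path.name} -> {target.name}"
```

Keeping the default at False means a first run only reports what would change, which matches the advice to review suggestions before enabling writes.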

Caps the filename length, in characters, for cross-platform compatibility. This has only been tested on a macOS APFS volume.

MAX_FILENAME_LENGTH
number
default:"255"
required
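
One way to cap a filename while keeping its extension is sketched below. The helper name `cap_filename` is hypothetical, and the sketch counts Python string characters. Some filesystems limit names in bytes instead, so treat this as an illustration of the idea, not a guarantee of portability.

```python
import os

MAX_FILENAME_LENGTH = 255


def cap_filename(filename, max_length=MAX_FILENAME_LENGTH):
    """Shorten a filename to max_length characters, preserving the extension."""
    if len(filename) <= max_length:
        return filename

    stem, suffix = os.path.splitext(filename)
    # Reserve room for the extension, then trim the stem to fit.
    keep = max(max_length - len(suffix), 0)
    # The final slice guards against an extension longer than max_length.
    return (stem[:keep] + suffix)[:max_length]
```

For example, a 300-character title followed by `.pdf` is trimmed to 251 characters of title plus the extension, for a total of 255.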

Data folder

Drop your PDFs in the [project-root]/data folder. Read about ignoring directories, a helpful feature for troubleshooting.

The [project-root]/data folder location is currently hardcoded. While this makes it easy to ignore with git, it could be made configurable.

This is planned for a future update. For now, you can create a symlink from your preferred location to [project-root]/data.
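
The symlink workaround can be scripted as in the sketch below. It assumes that [project-root]/data should become a link pointing at your preferred folder, so that the pipeline still reads from the hardcoded path. The function name `link_data_folder` is hypothetical. On Windows, creating symlinks may require elevated privileges.

```python
from pathlib import Path


def link_data_folder(project_root, preferred_location):
    """Make <project_root>/data a symlink to preferred_location."""
    data_link = Path(project_root) / "data"
    target = Path(preferred_location).resolve()

    # Refuse to overwrite an existing data folder or link.
    if data_link.exists() or data_link.is_symlink():
        raise FileExistsError(f"{data_link} already exists; move or remove it first")

    data_link.symlink_to(target, target_is_directory=True)
    return data_link
```

If a real data folder already exists, move its contents to the preferred location first. The sketch deliberately stops instead of replacing it.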