Overview

These are the main pipeline settings. Experiment to find the best values for your use case.

MATTER_CONFIG

The core pipeline object that determines which matter pages to process. What are matter pages?

Default configuration

MATTER_CONFIG = {
    "front": {
        "max_pages": 8
    },
    "back": {
        "mode": "never",
        "max_pages": 5,
        "fields": {
            "publisher": False,
            "year": False,
            "edition": False,
            "isbn": False,
            "doi": False,
            "loc": False
        }
    }
}

Why are title and author not in the fields section?

Title and at least one author are always required and are implicitly included, because the pipeline needs these baseline fields to name a file.

This will be made configurable in the next update.

Front Matter

front
object
required
max_pages
number
default:"8"
required

Number of pages to process from the start of the document. Counting begins at PAGE_NUM_OFFSET.

Front Matter Examples

MATTER_CONFIG = {
    "front": {
        # Safe default for most documents
        "max_pages": 8,
    }
}

Back matter

Back matter generally contains less of the metadata we need. It is designed primarily as a fallback for when the title and at least one author name were not found in the front matter.

There are certain situations where you may want to process back matter differently.

back
object
required
mode
string
default:"never"

When to process back matter pages:

  • never: Skip back matter entirely, regardless of what’s found in front matter. Fastest and most cost-effective option. Best for fiction and short stories.

  • always: Process back matter regardless of what fields are found in front matter. Best for documents where back matter contains detailed appendices, like reports, technical papers, and research papers.

  • fallback: Only process back matter if front matter is missing a title, at least one author name, or any of the fields set to True in MATTER_CONFIG.back.fields.

    Best balance of speed and completeness. Will check back matter only if key metadata is missing from front matter. This is the recommended setting for most document collections with mixed formats and structures.
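The three modes above can be summarized as a small decision function. This is a minimal sketch, not the pipeline's actual code: the function name `should_process_back_matter`, the `found` argument (a set of field names already extracted from front matter), and the key names `title` and `author` are assumptions made for illustration.

```python
# Hypothetical sketch of the mode logic described above.
# Title and author are always required, per the docs; the key names are assumed.
REQUIRED_FIELDS = ("title", "author")


def should_process_back_matter(mode, fields_config, found):
    """Return True if back matter pages should be processed.

    mode          -- "never", "always", or "fallback"
    fields_config -- dict shaped like MATTER_CONFIG["back"]["fields"]
    found         -- set of field names already found in front matter
    """
    if mode == "never":
        return False
    if mode == "always":
        return True
    if mode == "fallback":
        # Wanted = the always-required fields plus every field switched on.
        wanted = set(REQUIRED_FIELDS)
        wanted.update(name for name, enabled in fields_config.items() if enabled)
        # Fall back only when something wanted is still missing.
        return not wanted.issubset(found)
    raise ValueError(f"unknown back matter mode: {mode!r}")
```

Under these assumptions, a fallback configuration with only `isbn` enabled would trigger a back matter pass when the front matter yields a title and an author but no ISBN.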

max_pages
number
default:"5"
required

Number of pages to process, counted back from the end of the document.
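To make the two max_pages settings concrete, the sketch below computes which pages each one would select. It assumes 0-based page indices and simply clamps to the document length for short documents; both function names are hypothetical, and how the pipeline treats an overlap between the front and back ranges is not covered here.

```python
def front_page_indices(total_pages, max_pages):
    """0-based indices of the first `max_pages` pages (fewer if the document is short)."""
    return list(range(min(max_pages, total_pages)))


def back_page_indices(total_pages, max_pages):
    """0-based indices of the last `max_pages` pages (fewer if the document is short)."""
    start = max(total_pages - max_pages, 0)
    return list(range(start, total_pages))


# With the defaults (front: 8, back: 5) on a 100-page document:
front = front_page_indices(100, 8)  # indices 0 through 7
back = back_page_indices(100, 5)    # indices 95 through 99
```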

fields
object
required

Which metadata fields to look for in back matter if they are not found in front matter. Applies when back matter processing is enabled by setting MATTER_CONFIG.back.mode to fallback or always.

These settings are ignored if MATTER_CONFIG.back.mode is never.

Author and Title are the minimum required fields for metadata validation and filename generation, and they are not configurable, because a filename needs at least these to identify the document.

This API shape is a work in progress and a bit unclear. It will be refactored to be more intuitive in a future update.

Look at the examples below for clarification.

publisher
boolean
default:"false"
required
year
boolean
default:"false"
required

Publication year

edition
boolean
default:"false"
required
isbn
boolean
default:"false"
required

ISBN identifier

doi
boolean
default:"false"
required

Digital Object Identifier

loc
boolean
default:"false"
required

Library of Congress number

Back Matter Examples

# Never process back matter, regardless of what's found in front matter.
# Note that even though `isbn` is set to `True`, it won't trigger a
# back matter search because back matter processing is disabled.
MATTER_CONFIG = {
    "back": {
        "mode": "never",
        "max_pages": 5,  # no effect

        # no effect
        "fields": {
            "publisher": False,
            "year": False,
            "edition": False,
            "isbn": True,
            "doi": False,
            "loc": False
        }
    }
}
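For contrast, here is what a fallback configuration might look like. It is an illustrative sketch built only from the option descriptions above, not a tested recommendation: with `isbn` and `year` enabled, back matter would be searched only when the front matter is missing a title, an author, an ISBN, or a publication year.

```python
# Process back matter only when something wanted is missing from front matter.
MATTER_CONFIG = {
    "front": {
        "max_pages": 8,
    },
    "back": {
        "mode": "fallback",
        "max_pages": 5,  # search the last 5 pages when a fallback is triggered

        # Fields set to True count as "wanted" in addition to title and author.
        "fields": {
            "publisher": False,
            "year": True,
            "edition": False,
            "isbn": True,
            "doi": False,
            "loc": False,
        },
    },
}
```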

Counting Pages

PAGE_NUM_OFFSET
number
default:"1"
required

Determines the number given to the first page when naming diagnostic files. The default of 1 (1-based numbering) makes it easier to cross-reference diagnostic files against the page numbers of the source PDF.

PAGE_NUM_OFFSET = 1
MATTER_CONFIG = {
    "front": {
        "max_pages": 8, 
    }
}
Pages will be numbered 1-8. Front cover is page 1.
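The relationship between PAGE_NUM_OFFSET and the resulting page numbers can be sketched as follows. The helper `front_page_labels` is hypothetical and only illustrates the arithmetic; it is not part of the pipeline.

```python
PAGE_NUM_OFFSET = 1
MATTER_CONFIG = {
    "front": {
        "max_pages": 8,
    }
}


def front_page_labels(offset, max_pages):
    """Page numbers assigned to the first `max_pages` pages, starting at `offset`."""
    return [offset + i for i in range(max_pages)]


labels = front_page_labels(PAGE_NUM_OFFSET, MATTER_CONFIG["front"]["max_pages"])
print(labels[0], labels[-1])  # prints "1 8"
```

With an offset of 0, the same eight pages would be numbered 0 through 7 instead.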

Writing Changes

WRITE_PDF_CHANGES is a critical setting. It controls whether generated metadata and filenames are written to the PDFs in [project-root]/data/ that are not inside a RUNTIME_IGNORE_DIR_NAME directory.

Only set this to True once you have run the script and confirmed suggestions are acceptable. Review Evaluation for a full guide.

Back up your files before setting this to True. Many edge cases may cause unexpected results. See Test Cases for cases we test for, and Known Issues for those we know about.

WRITE_PDF_CHANGES
boolean
default:"false"
required

Cap the filename character length for cross-platform compatibility. This has only been tested on a macOS APFS volume.

MAX_FILENAME_LENGTH
number
default:"255"
required
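One way such a cap could be applied is to shorten the stem while keeping the file extension intact. The sketch below is an assumption about the approach, not the pipeline's implementation, and `cap_filename` is a hypothetical helper. It counts characters, as the setting describes; some filesystems limit bytes rather than characters, so non-ASCII names may need a stricter cap.

```python
from pathlib import Path

MAX_FILENAME_LENGTH = 255


def cap_filename(name, max_length=MAX_FILENAME_LENGTH):
    """Shorten `name` to at most `max_length` characters, preserving the extension."""
    if len(name) <= max_length:
        return name
    suffix = Path(name).suffix  # e.g. ".pdf"; empty string if there is none
    stem = name[: len(name) - len(suffix)] if suffix else name
    keep = max_length - len(suffix)
    if keep <= 0:
        # The extension alone exceeds the cap; fall back to a plain cut.
        return name[:max_length]
    # Trim the stem, dropping any trailing whitespace left by the cut.
    return stem[:keep].rstrip() + suffix
```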

Data folder

Drop your PDFs in the [project-root]/data folder. Read about ignoring directories, a helpful feature for troubleshooting.

The [project-root]/data folder location is currently hardcoded. While this makes it easy to ignore with git, it could be made configurable.

This is planned for a future update. For now, you can make [project-root]/data a symlink that points to your preferred location.
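The symlink workaround could be scripted roughly as below. This sketch assumes the intent is for [project-root]/data to be a link pointing at the folder where your PDFs actually live; `link_data_dir` and both of its parameters are hypothetical names, and the function refuses to overwrite an existing data folder.

```python
from pathlib import Path


def link_data_dir(project_root, pdf_dir):
    """Create [project-root]/data as a symlink to `pdf_dir` and return the link path."""
    link = Path(project_root) / "data"
    target = Path(pdf_dir).resolve()
    # Never clobber an existing folder or link; let the user decide what to do.
    if link.exists() or link.is_symlink():
        raise FileExistsError(f"{link} already exists; move or remove it first")
    link.symlink_to(target, target_is_directory=True)
    return link
```

On Windows, creating symlinks may require elevated privileges or developer mode.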