Metadata

Fields

The pipeline extracts, processes and generates these fields:

Required

Title
Author[s] - First and last names are concatenated. Multiple authors are separated by commas and listed in citation order.
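
The author rule can be sketched as below. This is a minimal illustration, not the pipeline's actual code: the function name `format_authors` is hypothetical, the single space between first and last name is an assumption taken from the "Sarah Stats" example, and the second author in the usage line is invented for illustration.

```python
from typing import List, Tuple


def format_authors(authors: List[Tuple[str, str]]) -> str:
    """Join (first, last) pairs into one string, keeping citation order.

    Hypothetical helper: assumes names are joined with a single space
    and authors are separated by ", ".
    """
    full_names = []
    for first, last in authors:
        # Skip empty parts so a missing first name does not leave a stray space.
        parts = [part.strip() for part in (first, last) if part.strip()]
        full_names.append(" ".join(parts))
    return ", ".join(full_names)


print(format_authors([("Sarah", "Stats"), ("Max", "Model")]))
# prints: Sarah Stats, Max Model
```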

Optional

Subtitle - Usually split from the title by a colon or dash
Publisher - Filler words such as “Co.” are removed
Edition - Converted to a number with an ordinal suffix (e.g. 2nd, 3rd). This field is not generated for a 1st edition.
Year - The latest year is chosen when multiple years are found
ISBN[s] - International Standard Book Number; may be 10 or 13 digits long. Several may be present, one for each format (e.g. hardcover, paperback).
DOI - Digital Object Identifier, a unique identifier for an electronic document such as a journal article or book
LOC - Library of Congress Control Number, a unique identifier for books in the Library of Congress catalog
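
A few of the optional-field rules can be sketched in code. The helpers below are hypothetical and only illustrate the rules as described above; the accepted year range (1500 to 2099) and the stripping of hyphens and spaces from ISBNs are assumptions, and the ISBN helper checks length and shape only, not the check digit.

```python
import re
from typing import Optional


def edition_to_ordinal(edition: int) -> Optional[str]:
    """Return '2nd', '3rd', ...; None for a first edition (field not generated)."""
    if edition <= 1:
        return None
    if 10 <= edition % 100 <= 20:
        suffix = "th"  # 11th, 12th, 13th are irregular
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(edition % 10, "th")
    return f"{edition}{suffix}"


def latest_year(text: str) -> Optional[str]:
    """Pick the latest four-digit year found in the text (assumed range 1500-2099)."""
    years = re.findall(r"\b(?:1[5-9]|20)\d{2}\b", text)
    # All matches have four digits, so string comparison orders them correctly.
    return max(years) if years else None


def normalize_isbn(raw: str) -> Optional[str]:
    """Strip separators and accept only 10- or 13-character ISBNs.

    An ISBN-10 may end in 'X'. The check digit is NOT verified here.
    """
    compact = re.sub(r"[\s-]", "", raw).upper()
    if re.fullmatch(r"\d{9}[\dX]|\d{13}", compact):
        return compact
    return None
```

For example, `edition_to_ordinal(1)` returns `None`, which corresponds to the field being left out for a first edition.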

Field customization is planned; see the Roadmap.

What Happens to the Original Metadata?

All original DC (Dublin Core) fields are preserved in XMP (Extensible Metadata Platform) under the new, custom “dco:” namespace
Script-generated (new) metadata is added to a new XMP “metadata:” namespace
Any existing XMP fields in other namespaces are preserved
The DC section is emptied in the final state

There is no transformation, merging or mapping done to the original metadata.
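
The reorganization described above can be sketched on plain dictionaries. This is an illustrative sketch, not the tool's implementation: the function name `reorganize` is hypothetical, the input shapes follow the JSON used in the Example section, and dropping the "dc:" prefix from keys is inferred from that example. It does not handle a document that already contains a "dco" or "metadata" namespace.

```python
import copy


def reorganize(original: dict, generated: dict) -> dict:
    """Build the final state from the original metadata and the script output."""
    # Existing XMP namespaces are carried over unchanged.
    xmp = copy.deepcopy(original.get("XMP", {}))

    # Original DC fields move under "dco", with the "dc:" prefix dropped.
    xmp["dco"] = {
        key.split(":", 1)[-1]: copy.deepcopy(value)
        for key, value in original.get("DC", {}).items()
    }

    # Script-generated fields go under their own "metadata" namespace.
    xmp["metadata"] = copy.deepcopy(generated["metadata"])

    # The DC section itself is emptied.
    return {"DC": {}, "XMP": xmp}
```

Deep copies are used so that the original structures stay untouched, which matches the rule that nothing is transformed, merged, or mapped.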

DC Fields

The 15 Dublin Core metadata elements are:

Title - Name of the resource
Creator - Primary creator/author of the resource
Subject - Topic of the resource content
Description - Account of the resource content
Publisher - Entity responsible for making the resource available
Contributor - Entity that made contributions to the resource
Date - Point or period of time associated with the resource
Type - Nature or genre of the resource
Format - File format, physical medium, or dimensions
Identifier - Unambiguous reference to the resource (e.g. ISBN)
Source - Related resource from which this one is derived
Language - Language of the resource
Relation - Related resource
Coverage - Spatial or temporal topic of the resource
Rights - Information about rights held in/over the resource

Example

Original Metadata:

{
    "DC": {
        "dc:title": "Data Science",
        "dc:creator": "Sarah Stats",
        "dc:publisher": "Data Press",
        "dc:date": "2023",
        "dc:identifier": [
            "isbn:9780123456789",
            "doi:10.1234/example.2023",
            "urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66"
        ],
        "dc:type": "text",
        "dc:language": "en"
    },
    "XMP": {
        "pdf": {
            "pageCount": 342
        },
        "xmp": {
            "CreatorTool": "OCR Processing Engine v2.3"
        }
    }
}

Script Output:

{
    "metadata": {
        "title": "Data Science",
        "author": "Sarah Stats",
        "publisher": "Data Press",
        "year": "2023",
        "isbn": [
            { "medium": "print", "number": "9780123456789" },
            { "medium": "ebook", "number": "9780123456790" }
        ],
        "doi": "10.1234/example.2023"
    }
}

Final State:

{
    "DC": {},
    "XMP": {
        "dco": {
            "title": "Data Science",
            "creator": "Sarah Stats",
            "publisher": "Data Press",
            "date": "2023",
            "identifier": [
                "isbn:9780123456789",
                "doi:10.1234/example.2023",
                "urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66"
            ],
            "type": "text",
            "language": "en"
        },
        "metadata": {
            "title": "Data Science",
            "author": "Sarah Stats",
            "publisher": "Data Press",
            "year": "2023",
            "isbn": [
                { "medium": "print", "number": "9780123456789" },
                { "medium": "ebook", "number": "9780123456790" }
            ],
            "doi": "10.1234/example.2023"
        },
        "pdf": {
            "pageCount": 342
        },
        "xmp": {
            "CreatorTool": "OCR Processing Engine v2.3"
        }
    }
}

FAQ

What are the different types of PDF metadata?

PDF supports multiple metadata standards:

Dublin Core (DC): A basic, widely used standard for core document properties
XMP (Extensible Metadata Platform): A more flexible format that can handle complex data types and custom fields
PDF Info Dictionary: Legacy metadata storage used by older PDF readers

How does the tool write metadata to each PDF?

Instead of complex metadata merging and mapping, we use a simple approach:

Original DC fields are preserved under the dco: namespace in XMP
The DC section is cleared, and whichever of the 15 Dublin Core fields are present are written to the PDF with the new metadata
New metadata is written under our custom metadata: namespace
Other XMP fields are preserved as-is

No metadata is lost during this process - it’s just reorganized. This approach:

Preserves information about different editions
Maintains a structured format
Keeps clear separation between old and new data
Ensures existing XMP fields remain untouched
Establishes clear precedence (new metadata over old)
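
The claim that nothing is lost can be checked mechanically. The sketch below is a hypothetical helper (`missing_after_reorganize` is not part of the tool) that compares an original metadata dictionary with a final state, using the same shapes as the Example section, and reports anything it cannot find.

```python
from typing import List


def missing_after_reorganize(original: dict, final: dict) -> List[str]:
    """Return the original DC keys and XMP namespaces not found in the final state."""
    missing = []
    final_xmp = final.get("XMP", {})
    dco = final_xmp.get("dco", {})

    # Every original DC field should reappear, unchanged, under "dco".
    for key, value in original.get("DC", {}).items():
        name = key.split(":", 1)[-1]  # "dc:title" -> "title"
        if name not in dco or dco[name] != value:
            missing.append(key)

    # Every pre-existing XMP namespace should be preserved as-is.
    for namespace, fields in original.get("XMP", {}).items():
        if namespace not in final_xmp or final_xmp[namespace] != fields:
            missing.append(namespace)

    return missing
```

An empty list means every original value is still present in the final state.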

How are PDF changes saved?

We use two save strategies:

Incremental Save (preferred)
- Appends changes to the PDF
- Preserves digital signatures
- Faster and safer
Full Save (fallback)
- Creates a temporary file
- Writes complete PDF content
- Replaces original file
- Used when incremental save isn’t possible
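
The two strategies can be sketched with the standard library alone. This is a sketch under stated assumptions, not the tool's code: `save_pdf`, `incremental_save`, and `render_full` are hypothetical names, the two callables stand in for whatever PDF library performs the real work, and catching a broad `Exception` is a simplification where real code would catch the library's specific error.

```python
import os
import tempfile
from typing import Callable


def save_pdf(
    path: str,
    incremental_save: Callable[[str], None],
    render_full: Callable[[], bytes],
) -> str:
    """Try an incremental save; fall back to a full save through a temporary file.

    Returns the name of the strategy that was used.
    """
    try:
        incremental_save(path)  # hypothetical: appends changes in place
        return "incremental"
    except Exception:
        pass  # incremental save not possible, fall through to full save

    # Write the temporary file next to the original so the final rename
    # stays on the same filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(render_full())  # hypothetical: complete PDF content
        os.replace(tmp_path, path)  # swap in the new file in one step
    except BaseException:
        # Leave the original untouched and clean up the partial file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return "full"
```

Because the original is only replaced after the temporary file has been written completely, a failure during the full save leaves the original file as it was.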

What safety measures are in place?

The tool implements several safeguards:

Save Strategy
- Attempts incremental save first (preserves PDF structure)
- Falls back to full save via temporary file if needed
- Ensures original file is not corrupted during updates
Validation
- Requires title and author fields
- Validates XML structure before writing
- Preserves original metadata as backup
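
The validation checks can be sketched as follows. The function name `validate` and its inputs are assumptions for illustration: `metadata` is the script output as a dictionary, and `xmp_packet` is the XMP to be written, as an XML string. The XML check only confirms that the string is well-formed; it does not validate it against any schema.

```python
import xml.etree.ElementTree as ET
from typing import List


def validate(metadata: dict, xmp_packet: str) -> List[str]:
    """Return a list of validation errors; an empty list means it is safe to write."""
    errors = []

    # Title and author are the two required fields.
    for field in ("title", "author"):
        if not str(metadata.get(field) or "").strip():
            errors.append(f"missing required field: {field}")

    # Parse the XML before writing so a malformed packet never reaches the PDF.
    try:
        ET.fromstring(xmp_packet)
    except ET.ParseError as exc:
        errors.append(f"XMP is not well-formed XML: {exc}")

    return errors
```

A caller would skip the write whenever the returned list is non-empty, leaving the file and its original metadata unchanged.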

What about PDF reader compatibility?

Different PDF readers handle metadata differently:

Basic Readers
- See only PDF info dictionary
- Display basic title/author
- Limited metadata support
Advanced Readers
- Access full XMP metadata
- Show all metadata fields
- Support custom namespaces
Library Systems
- Use DC metadata for catalogs
- Extract DOI/ISBN for linking
- Index full metadata content

This is why we maintain both basic PDF info and rich XMP metadata.
