Fields

The pipeline extracts, processes and generates these fields:

Required

  • Title
  • Author[s] - First and last names are concatenated. Multiple authors are separated by commas and listed in citation order.

Optional

  • Subtitle - Usually split from the title with a colon or dash
  • Publisher - Filler words, like “Co.”, are removed
  • Edition - Converted to an ordinal (e.g. 2nd, 3rd). For a 1st edition, this field is not generated.
  • Year - The latest year is chosen when multiple years are found
  • ISBN[s] - International Standard Book Number; may be 10 or 13 digits long. Multiple ISBNs may be present for different formats (e.g. hardcover, paperback).
  • DOI - Digital Object Identifier, a unique identifier for an electronic document such as a journal article or book
  • LOC - Library of Congress Control Number, a unique identifier for books in the Library of Congress catalog

Field customization is planned. See Roadmap.
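
The normalization rules for Edition, Year, and Publisher can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the year range in the regular expression, and every filler word other than "Co." are assumptions.

```python
import re
from typing import Optional

# Assumed filler list: the field description names only "Co." as an example.
PUBLISHER_FILLERS = {"co", "inc", "ltd"}


def ordinal(n: int) -> str:
    """Return n with its English ordinal suffix (1st, 2nd, 3rd, 4th, 11th, ...)."""
    if 11 <= n % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def edition_field(n: int) -> Optional[str]:
    """First editions produce no Edition field; later editions become ordinals."""
    return None if n == 1 else ordinal(n)


def latest_year(text: str) -> Optional[int]:
    """Pick the most recent four-digit year found in the text (1500-2099 assumed)."""
    years = [int(y) for y in re.findall(r"\b(?:1[5-9]|20)\d{2}\b", text)]
    return max(years) if years else None


def clean_publisher(name: str) -> str:
    """Drop filler words such as "Co." from a publisher name."""
    kept = [w for w in name.split() if w.strip(".,").lower() not in PUBLISHER_FILLERS]
    return " ".join(kept).strip(" ,")
```

For example, `edition_field(1)` yields no value, while `clean_publisher("Data Press Co.")` keeps only the publisher's own name.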

What Happens to the Original Metadata?

  1. All original DC (Dublin Core) fields are preserved in XMP (Extensible Metadata Platform) under the new, custom “dco:” namespace
  2. Script-generated (new) metadata is added under a new XMP “metadata:” namespace
  3. Any existing XMP fields in other namespaces are preserved
  4. The DC section is emptied in the final state

No transformation, merging, or mapping is applied to the original metadata.

DC Fields

The 15 Dublin Core metadata elements are:

  1. Title - Name of the resource
  2. Creator - Primary creator/author of the resource
  3. Subject - Topic of the resource content
  4. Description - Account of the resource content
  5. Publisher - Entity responsible for making the resource available
  6. Contributor - Entity that made contributions to the resource
  7. Date - Point or period of time associated with the resource
  8. Type - Nature or genre of the resource
  9. Format - File format, physical medium, or dimensions
  10. Identifier - Unambiguous reference to the resource (e.g. ISBN)
  11. Source - Related resource from which this one is derived
  12. Language - Language of the resource
  13. Relation - Related resource
  14. Coverage - Spatial or temporal topic of the resource
  15. Rights - Information about rights held in/over the resource

Example

Original Metadata:

{
    "DC": {
        "dc:title": "Data Science",
        "dc:creator": "Sarah Stats",
        "dc:publisher": "Data Press",
        "dc:date": "2023",
        "dc:identifier": [
            "isbn:9780123456789",
            "doi:10.1234/example.2023",
            "urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66"
        ],
        "dc:type": "text",
        "dc:language": "en"
    },
    "XMP": {
        "pdf": {
            "pageCount": 342
        },
        "xmp": {
            "CreatorTool": "OCR Processing Engine v2.3"
        }
    }
}

Script Output:

{
    "metadata": {
        "title": "Data Science",
        "author": "Sarah Stats",
        "publisher": "Data Press",
        "year": "2023",
        "isbn": [
            { "medium": "print", "number": "9780123456789" },
            { "medium": "ebook", "number": "9780123456790" }
        ],
        "doi": "10.1234/example.2023"
    }
}

Final State:

{
    "DC": {},
    "XMP": {
        "dco": {
            "title": "Data Science",
            "creator": "Sarah Stats",
            "publisher": "Data Press",
            "date": "2023",
            "identifier": [
                "isbn:9780123456789",
                "doi:10.1234/example.2023",
                "urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66"
            ],
            "type": "text",
            "language": "en"
        },
        "metadata": {
            "title": "Data Science",
            "author": "Sarah Stats",
            "publisher": "Data Press",
            "year": "2023",
            "isbn": [
                { "medium": "print", "number": "9780123456789" },
                { "medium": "ebook", "number": "9780123456790" }
            ],
            "doi": "10.1234/example.2023"
        },
        "pdf": {
            "pageCount": 342
        },
        "xmp": {
            "CreatorTool": "OCR Processing Engine v2.3"
        }
    }
}
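
The example above can be reproduced with a short sketch, assuming the metadata has already been parsed into Python dictionaries shaped like the JSON shown. The function name `reorganize` and the dictionary layout are illustrative assumptions, not the tool's actual API.

```python
import copy


def reorganize(original: dict, generated: dict) -> dict:
    """Build the final state: DC emptied, originals under "dco", new data under "metadata"."""
    # Existing XMP namespaces (e.g. "pdf", "xmp") are carried over unchanged.
    xmp = copy.deepcopy(original.get("XMP", {}))

    # Original DC fields are copied as-is, with only the "dc:" prefix dropped.
    xmp["dco"] = {
        (key[3:] if key.startswith("dc:") else key): copy.deepcopy(value)
        for key, value in original.get("DC", {}).items()
    }

    # Script-generated fields go under their own namespace.
    xmp["metadata"] = copy.deepcopy(generated["metadata"])

    return {"DC": {}, "XMP": xmp}


original = {
    "DC": {"dc:title": "Data Science", "dc:creator": "Sarah Stats"},
    "XMP": {"pdf": {"pageCount": 342}},
}
generated = {"metadata": {"title": "Data Science", "author": "Sarah Stats"}}
final_state = reorganize(original, generated)
```

Deep copies are used so the input dictionaries are left untouched, which keeps the original metadata available if a later step fails.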

FAQ

What are the different types of PDF metadata?

PDF supports multiple metadata standards:

  1. Dublin Core (DC): A basic, widely used standard for core document properties
  2. XMP (Extensible Metadata Platform): A more flexible format that can handle complex data types and custom fields
  3. PDF Info Dictionary: Legacy metadata storage used by older PDF readers

How does the tool write metadata to each PDF?

Instead of complex metadata merging and mapping, we use a simple approach:

  1. Original DC fields are preserved under the dco: namespace in XMP
  2. The DC section is cleared; whichever of the 15 Dublin Core fields are present are written to the PDF together with the new metadata
  3. New metadata is written under our custom metadata: namespace
  4. Other XMP fields are preserved as-is

No metadata is lost during this process - it’s just reorganized. This approach:

  • Preserves information about different editions
  • Maintains a structured format
  • Keeps clear separation between old and new data
  • Ensures existing XMP fields remain untouched
  • Establishes clear precedence (new metadata over old)
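
One way a consumer of the final XMP could apply the "new metadata over old" precedence is sketched below. The helper name `lookup` and the name correspondence between generated and original fields (author/creator, year/date) are assumptions made for illustration; the tool itself performs no such mapping.

```python
# Assumed correspondence between generated field names and original DC names.
FALLBACK_NAMES = {"author": "creator", "year": "date"}


def lookup(xmp: dict, field: str):
    """Return the generated value if present, otherwise the preserved original."""
    new = xmp.get("metadata", {})
    if field in new:
        return new[field]
    old = xmp.get("dco", {})
    return old.get(FALLBACK_NAMES.get(field, field))
```

Because the two namespaces stay separate, a reader can always tell whether a value came from the script or from the original file.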

How are PDF changes saved?

We use two save strategies:

  1. Incremental Save (preferred)

    • Appends changes to the PDF
    • Preserves digital signatures
    • Faster and safer
  2. Full Save (fallback)

    • Creates a temporary file
    • Writes complete PDF content
    • Replaces original file
    • Used when incremental save isn’t possible
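
The two strategies can be sketched as a try-then-fall-back routine. This is an outline under stated assumptions: `write_incremental` and `write_full` are hypothetical callables standing in for whatever PDF library performs the actual writes, and only the file handling uses real standard-library calls.

```python
import os
import tempfile
from typing import Callable


def save_pdf(
    path: str,
    write_incremental: Callable[[str], None],
    write_full: Callable[[str], None],
) -> str:
    """Try an incremental save; if it fails, write a full copy and swap it in."""
    try:
        write_incremental(path)  # appends changes to the existing file
        return "incremental"
    except Exception:
        pass  # fall through to the full-save path

    # Create the temporary file next to the target so the final rename
    # stays on the same filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(suffix=".pdf", dir=directory)
    os.close(fd)
    try:
        write_full(tmp_path)        # complete PDF content goes to the temp file
        os.replace(tmp_path, path)  # original is replaced only after a successful write
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return "full"
```

Writing to a temporary file first means a failed full save leaves the original PDF as it was.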

What safety measures are in place?

The tool implements several safeguards:

  1. Save Strategy

    • Attempts incremental save first (preserves PDF structure)
    • Falls back to full save via temporary file if needed
    • Ensures original file is not corrupted during updates
  2. Validation

    • Requires title and author fields
    • Validates XML structure before writing
    • Preserves original metadata as backup
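
The validation step might look like the sketch below, which checks the two required fields and confirms that the XMP packet is well-formed XML before anything is written. The function name `validate` and the shape of its return value are assumptions; the well-formedness check uses the standard library's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET
from typing import List

REQUIRED_FIELDS = ("title", "author")


def validate(metadata: dict, xmp_xml: str) -> List[str]:
    """Return a list of problems; an empty list means it is safe to write."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if not str(metadata.get(name) or "").strip()
    ]
    try:
        ET.fromstring(xmp_xml)  # raises ParseError if the XML is not well-formed
    except ET.ParseError as exc:
        problems.append(f"XMP is not well-formed XML: {exc}")
    return problems
```

Collecting all problems instead of stopping at the first one lets a batch run report everything wrong with a file in a single pass.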

What about PDF reader compatibility?

Different PDF readers handle metadata differently:

  1. Basic Readers

    • See only the PDF Info Dictionary
    • Display basic title/author
    • Limited metadata support
  2. Advanced Readers

    • Access full XMP metadata
    • Show all metadata fields
    • Support custom namespaces
  3. Library Systems

    • Use DC metadata for catalogs
    • Extract DOI/ISBN for linking
    • Index full metadata content

This is why we maintain both basic PDF info and rich XMP metadata.