Key Concepts
Metadata
Fields
The pipeline extracts, processes and generates these fields:
Required
- Title
- Author[s] - First and last names are concatenated. Multiple authors are separated by commas and listed in citation order.
Optional
- Subtitle - Usually split from the title with a colon or dash
- Publisher - Filler words, like “Co.”, are removed
- Edition - Converted to a number with an ordinal suffix (e.g. 2nd, 3rd). This field is not generated for a 1st edition.
- Year - The latest year is chosen when multiple years are found
- ISBN[s], International Standard Book Number, may be 10 or 13 digits long. Multiple ISBNs may be present, one for each format (e.g. hardcover, paperback)
- DOI, Digital Object Identifier, a unique identifier for an electronic document such as a journal article or book
- LOC, Library of Congress Control Number, a unique identifier for books in the Library of Congress catalog
Field customization is planned. See Roadmap.
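The field rules above can be sketched in a few small helpers. This is a minimal illustration, not the pipeline's actual code: every function name is hypothetical, the filler-word list is an assumption (the rules only name “Co.”), and joining first and last names with a single space is also an assumption.

```python
import re
from typing import List, Optional, Tuple

# Assumed filler words; only "Co." is named in the field rules.
PUBLISHER_FILLERS = {"co", "inc", "ltd"}


def split_title(raw: str) -> Tuple[str, Optional[str]]:
    """Split 'Title: Subtitle' or 'Title - Subtitle' at the first separator."""
    parts = re.split(r"\s*:\s+|\s+-\s+", raw.strip(), maxsplit=1)
    subtitle = parts[1] if len(parts) > 1 and parts[1] else None
    return parts[0], subtitle


def join_authors(authors: List[Tuple[str, str]]) -> str:
    """Concatenate (first, last) pairs, keeping the given citation order."""
    return ", ".join(f"{first} {last}" for first, last in authors)


def clean_publisher(raw: str) -> str:
    """Drop filler words such as 'Co.' from a publisher name."""
    words = [w for w in raw.split() if w.lower().strip(".,") not in PUBLISHER_FILLERS]
    return " ".join(words).rstrip(" ,&")


def format_edition(number: int) -> Optional[str]:
    """Return '2nd', '3rd', ... or None for a first edition."""
    if number <= 1:
        return None
    if 11 <= number % 100 <= 13:
        suffix = "th"  # 11th, 12th, 13th are irregular
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(number % 10, "th")
    return f"{number}{suffix}"


def pick_year(candidates: List[int]) -> Optional[int]:
    """Choose the latest year when several are found."""
    return max(candidates) if candidates else None
```

For example, `format_edition(2)` gives `"2nd"`, and `pick_year([1999, 2004, 2001])` gives `2004`.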
What Happens to The Original Metadata?
- All original DC (Dublin Core) fields are preserved in XMP (Extensible Metadata Platform) under the new, custom “dco:” namespace
- Script-generated (new) metadata is added to a new XMP “metadata:” namespace
- Any existing XMP fields in other namespaces are preserved
- The DC section is emptied in the final state
There is no transformation, merging or mapping done to the original metadata.
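The reorganization described above can be illustrated on a simplified model. The sketch below treats the XMP packet as a flat dictionary keyed by `prefix:name`, which is an assumption made for readability (real XMP is RDF/XML), and `reorganize` is a hypothetical name rather than the tool's API.

```python
from typing import Dict


def reorganize(xmp: Dict[str, str], generated: Dict[str, str]) -> Dict[str, str]:
    """Move dc:* to dco:*, add metadata:*, and keep everything else unchanged."""
    result: Dict[str, str] = {}
    for key, value in xmp.items():
        prefix, _, name = key.partition(":")
        if prefix == "dc":
            # Original DC value is copied verbatim; no dc:* key is carried over.
            result[f"dco:{name}"] = value
        else:
            # Fields in other namespaces are preserved as they are.
            result[key] = value
    for name, value in generated.items():
        # Script-generated fields live in their own namespace.
        result[f"metadata:{name}"] = value
    return result


original = {"dc:title": "old title", "dc:creator": "A. Writer", "pdf:Producer": "SomeTool"}
final = reorganize(original, {"title": "New Title"})
```

In this model, `final` holds `dco:title`, `dco:creator`, `pdf:Producer` and `metadata:title`, and no key starts with `dc:`.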
DC Fields
The 15 Dublin Core metadata elements are:
- Title - Name of the resource
- Creator - Primary creator/author of the resource
- Subject - Topic of the resource content
- Description - Account of the resource content
- Publisher - Entity responsible for making the resource available
- Contributor - Entity that made contributions to the resource
- Date - Point or period of time associated with the resource
- Type - Nature or genre of the resource
- Format - File format, physical medium, or dimensions
- Identifier - Unambiguous reference to the resource (e.g. ISBN)
- Source - Related resource from which this one is derived
- Language - Language of the resource
- Relation - Related resource
- Coverage - Spatial or temporal topic of the resource
- Rights - Information about rights held in/over the resource
Example
Original Metadata:
Script Output:
Final State:
FAQ
What are the different types of PDF metadata?
PDF supports multiple metadata standards:
- Dublin Core (DC): A basic, widely-used standard for core document properties
- XMP (Extensible Metadata Platform): A more flexible format that can handle complex data types and custom fields
- PDF Info Dictionary: Legacy metadata storage used by older PDF readers
How does the tool write metadata to each PDF?
Instead of complex metadata merging and mapping, we use a simple approach:
- Original DC fields are preserved under the “dco:” namespace in XMP
- The DC section is cleared, and whichever of the 15 Dublin Core fields are present are written to the PDF with the new metadata
- New metadata is written under our custom “metadata:” namespace
- Other XMP fields are preserved as-is
No metadata is lost during this process - it’s just reorganized. This approach:
- Preserves information about different editions
- Maintains a structured format
- Keeps clear separation between old and new data
- Ensures existing XMP fields remain untouched
- Establishes clear precedence (new metadata over old)
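One way a consumer could apply the "new over old" precedence is to check the “metadata:” namespace first and fall back to “dco:”. This is only a sketch: it models the XMP packet as a flat dictionary keyed by `prefix:name`, assumes both namespaces use the same field name, and `resolve` is a hypothetical helper.

```python
from typing import Dict, Optional


def resolve(xmp: Dict[str, str], field: str) -> Optional[str]:
    """Return the new value if present, otherwise the preserved original."""
    for prefix in ("metadata", "dco"):  # order encodes the precedence
        value = xmp.get(f"{prefix}:{field}")
        if value:
            return value
    return None


packet = {"dco:title": "old title", "metadata:title": "New Title", "dco:rights": "CC BY"}
```

Here `resolve(packet, "title")` returns the new value, while `resolve(packet, "rights")` falls back to the preserved original.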
How are PDF changes saved?
We use two save strategies:
- Incremental Save (preferred)
  - Appends changes to the PDF
  - Preserves digital signatures
  - Faster and safer
- Full Save (fallback)
  - Creates a temporary file
  - Writes complete PDF content
  - Replaces original file
  - Used when incremental save isn’t possible
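The two strategies can be combined as "try incremental, otherwise write a temporary file and swap it in". The sketch below is library-agnostic: `incremental_save` and `full_save` are hypothetical callables standing in for whatever the PDF library provides, and catching a broad `Exception` is a simplification; real code would catch the library's specific error.

```python
import os
import tempfile
from typing import Callable


def save_with_fallback(
    path: str,
    incremental_save: Callable[[str], None],
    full_save: Callable[[str], None],
) -> str:
    """Try an incremental save; on failure, do a full save via a temp file."""
    try:
        incremental_save(path)
        return "incremental"
    except Exception:
        pass  # fall through to the full-save path

    # Create the temp file next to the target so os.replace stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(suffix=".pdf", dir=directory)
    os.close(fd)
    try:
        full_save(tmp_path)
        os.replace(tmp_path, path)  # original is swapped only after a complete write
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return "full"
```

Writing to a temporary file first means a failed full save leaves the original PDF untouched.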
What safety measures are in place?
The tool implements several safeguards:
- Save Strategy
  - Attempts incremental save first (preserves PDF structure)
  - Falls back to full save via temporary file if needed
  - Ensures original file is not corrupted during updates
- Validation
  - Requires title and author fields
  - Validates XML structure before writing
  - Preserves original metadata as backup
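The validation checks could look roughly like the sketch below, which uses Python's standard `xml.etree.ElementTree` parser for the well-formedness test. The `validate` function, the lowercase field keys and the message wording are assumptions for illustration, not the tool's actual interface.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

REQUIRED_FIELDS = ("title", "author")


def validate(metadata: Dict[str, str], xmp_packet: str) -> List[str]:
    """Return a list of problems; an empty list means it is safe to write."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if not (metadata.get(name) or "").strip()
    ]
    try:
        ET.fromstring(xmp_packet)  # checks well-formedness only, not the XMP schema
    except ET.ParseError as exc:
        problems.append(f"malformed XML: {exc}")
    return problems
```

A caller would skip the write whenever the returned list is non-empty.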
What about PDF reader compatibility?
Different PDF readers handle metadata differently:
- Basic Readers
  - See only PDF info dictionary
  - Display basic title/author
  - Limited metadata support
- Advanced Readers
  - Access full XMP metadata
  - Show all metadata fields
  - Support custom namespaces
- Library Systems
  - Use DC metadata for catalogs
  - Extract DOI/ISBN for linking
  - Index full metadata content
This is why we maintain both basic PDF info and rich XMP metadata.