# Metadata

## Fields

The pipeline extracts, processes and generates these fields:

### Required

* **Title**
* **Author\[s]**, first and last names are concatenated. Multiple authors are separated by
  commas and listed in citation order.

### Optional

* **Subtitle** - Usually split from the title with a colon or dash
* **Publisher** - Filler words, like "Co.", are removed
* **Edition** - Converted to an ordinal (e.g. 2nd, 3rd). For a 1st edition,
  this field is not generated.
* **Year** - The latest year is chosen when multiple years are found
* **ISBN\[s]**, [International Standard Book Number](https://www.isbn.org), may be 10 or 13 digits long.
  Multiple ISBNs may be present, one per format (e.g. hardcover, paperback).
* **DOI**, [Digital Object Identifier](https://www.doi.org/the-identifier/what-is-a-doi),
  a unique identifier for an electronic document like a journal article or book
* **LOC**, Library of Congress Control Number, a unique identifier for books in
  the Library of Congress catalog

<Note>Field customization is planned. See [Roadmap](/project/roadmap)</Note>
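
The normalization rules in the field list can be illustrated with a minimal Python sketch. This is not the pipeline's actual code: the function names and the `PUBLISHER_FILLERS` list are assumptions made for illustration, and the year pattern (1500 to 2099) is an arbitrary choice.

```python
import re

# Hypothetical filler list; the pipeline's real list is not documented here.
PUBLISHER_FILLERS = {"co", "inc", "ltd", "llc"}


def join_authors(authors):
    """Join (first, last) name pairs into one comma-separated string, keeping order."""
    return ", ".join(f"{first} {last}".strip() for first, last in authors)


def clean_publisher(name):
    """Drop filler words such as "Co." from a publisher name."""
    kept = [w for w in name.split() if w.strip(".,").lower() not in PUBLISHER_FILLERS]
    return " ".join(kept).rstrip(",")


def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 2 -> "2nd"."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def normalize_edition(n):
    """Return the ordinal edition, or None for a first edition (field omitted)."""
    return None if n == 1 else ordinal(n)


def latest_year(text):
    """Return the latest four-digit year found in text, or None."""
    years = re.findall(r"\b(?:1[5-9]|20)\d{2}\b", text)
    return max(years) if years else None
```

For example, `clean_publisher("Data Press Co.")` returns `"Data Press"`, `normalize_edition(2)` returns `"2nd"`, and `latest_year("Copyright 2019, 2021, 2023")` returns `"2023"`.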

## What Happens to the Original Metadata?

1. All original DC (Dublin Core) fields are preserved in XMP (Extensible Metadata Platform) under the
   new custom "dco:" namespace
2. Script-generated (new) metadata is added to a new XMP "metadata:" namespace
3. Any existing XMP fields in other namespaces are preserved
4. The DC section is emptied in the final state

There is **no transformation, merging or mapping** done to the original metadata.

### DC Fields

The 15 Dublin Core metadata elements are:

1. Title - Name of the resource
2. Creator - Primary creator/author of the resource
3. Subject - Topic of the resource content
4. Description - Account of the resource content
5. Publisher - Entity responsible for making the resource available
6. Contributor - Entity that made contributions to the resource
7. Date - Point or period of time associated with the resource
8. Type - Nature or genre of the resource
9. Format - File format, physical medium, or dimensions
10. Identifier - Unambiguous reference to the resource (e.g. ISBN)
11. Source - Related resource from which this one is derived
12. Language - Language of the resource
13. Relation - Related resource
14. Coverage - Spatial or temporal topic of the resource
15. Rights - Information about rights held in/over the resource

## Example

**Original Metadata:**

```json theme={null}
{
    "DC": {
        "dc:title": "Data Science",
        "dc:creator": "Sarah Stats",
        "dc:publisher": "Data Press",
        "dc:date": "2023",
        "dc:identifier": [
            "isbn:9780123456789",
            "doi:10.1234/example.2023",
            "urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66"
        ],
        "dc:type": "text",
        "dc:language": "en"
    },
    "XMP": {
        "pdf": {
            "pageCount": 342
        },
        "xmp": {
            "CreatorTool": "OCR Processing Engine v2.3"
        }
    }
}
```

**Script Output:**

```json theme={null}
{
    "metadata": {
        "title": "Data Science",
        "author": "Sarah Stats",
        "publisher": "Data Press",
        "year": "2023",
        "isbn": [
            { "medium": "print", "number": "9780123456789" },
            { "medium": "ebook", "number": "9780123456790" }
        ],
        "doi": "10.1234/example.2023"
    }
}
```

**Final State:**

```json theme={null}
{
    "DC": {},
    "XMP": {
        "dco": {
            "title": "Data Science",
            "creator": "Sarah Stats",
            "publisher": "Data Press",
            "date": "2023",
            "identifier": [
                "isbn:9780123456789",
                "doi:10.1234/example.2023",
                "urn:uuid:6e8bc430-9c3a-11d9-9669-0800200c9a66"
            ],
            "type": "text",
            "language": "en"
        },
        "metadata": {
            "title": "Data Science",
            "author": "Sarah Stats",
            "publisher": "Data Press",
            "year": "2023",
            "isbn": [
                { "medium": "print", "number": "9780123456789" },
                { "medium": "ebook", "number": "9780123456790" }
            ],
            "doi": "10.1234/example.2023"
        },
        "pdf": {
            "pageCount": 342
        },
        "xmp": {
            "CreatorTool": "OCR Processing Engine v2.3"
        }
    }
}
```
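
The step from the original metadata and the script output to the final state can be sketched in a few lines of Python. This is an illustration that works on plain dictionaries shaped like the JSON above, not the tool's implementation. The function name `reorganize` is hypothetical, and stripping the `dc:` prefix from keys is inferred from the example.

```python
import copy


def reorganize(original, new_metadata):
    """Build the final metadata state from the original state and script output."""
    # Existing XMP namespaces (e.g. "pdf", "xmp") are carried over untouched.
    xmp = copy.deepcopy(original.get("XMP", {}))
    # Move every DC field under "dco", dropping the "dc:" prefix from its key.
    xmp["dco"] = {
        key.split(":", 1)[-1]: copy.deepcopy(value)
        for key, value in original.get("DC", {}).items()
    }
    # Script-generated fields go under their own "metadata" namespace.
    xmp["metadata"] = copy.deepcopy(new_metadata)
    # The DC section is left empty in the final state.
    return {"DC": {}, "XMP": xmp}
```

The values are copied verbatim: nothing is merged or mapped between `dco` and `metadata`.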

## FAQ

### What are the different types of PDF metadata?

PDF supports multiple metadata standards:

1. **Dublin Core (DC)**: A basic, widely-used standard for core document properties
2. **XMP (Extensible Metadata Platform)**: A more flexible format that can handle complex
   data types and custom fields
3. **PDF Info Dictionary**: Legacy metadata storage used by older PDF readers

### How does the tool write metadata to each PDF?

Instead of complex metadata merging and mapping, we use a simple approach:

1. **Original DC fields** are preserved under the `dco:` namespace in XMP
2. **DC section** is cleared, and whichever of the 15 Dublin Core fields are present
   are written to the PDF with the new metadata
3. **New metadata** is written under our custom `metadata:` namespace
4. **Other XMP fields** are preserved as-is

No metadata is lost during this process - it's just reorganized. This approach:

* Preserves information about different editions
* Maintains a structured format
* Keeps clear separation between old and new data
* Ensures existing XMP fields remain untouched
* Establishes clear precedence (new metadata over old)
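
One way a consumer could apply the "new metadata over old" precedence when reading a processed file is sketched below. The `lookup` helper and the `DC_ALIASES` table are hypothetical and not part of the tool. They only illustrate checking the `metadata` namespace first and falling back to the preserved `dco` fields.

```python
# Hypothetical aliases from new field names to their Dublin Core counterparts.
DC_ALIASES = {"author": "creator", "year": "date"}


def lookup(state, field):
    """Return a field from "metadata" if present, else from the preserved "dco"."""
    xmp = state.get("XMP", {})
    new = xmp.get("metadata", {})
    if field in new:
        return new[field]
    # Fall back to the original Dublin Core value, if one was preserved.
    return xmp.get("dco", {}).get(DC_ALIASES.get(field, field))
```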

### How are PDF changes saved?

We use two save strategies:

1. **Incremental Save** (preferred)

   * Appends changes to the PDF
   * Preserves digital signatures
   * Faster and safer

2. **Full Save** (fallback)
   * Creates a temporary file
   * Writes complete PDF content
   * Replaces original file
   * Used when incremental save isn't possible
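
The two strategies can be combined as in the following library-agnostic Python sketch. It assumes two hypothetical callbacks, `save_incremental(path)` and `save_full(tmp_path)`, supplied by whichever PDF library is in use. The tool's real save code is not shown on this page.

```python
import os
import tempfile


def save_pdf(path, save_incremental, save_full):
    """Try an incremental save; fall back to a full save through a temp file.

    Both callbacks are hypothetical stand-ins for PDF library calls.
    Returns "incremental" or "full" to report which strategy was used.
    """
    try:
        save_incremental(path)
        return "incremental"
    except Exception:
        pass  # incremental save not possible; fall through to a full save

    # Write the complete PDF next to the original, then swap it in.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(suffix=".pdf", dir=directory)
    os.close(fd)
    try:
        save_full(tmp_path)
        os.replace(tmp_path, path)  # replace the original in a single step
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # leave the original untouched on failure
        raise
    return "full"
```

Creating the temporary file in the same directory as the original keeps the final `os.replace` on one filesystem, so the original is either fully replaced or left as it was.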

### What safety measures are in place?

The tool implements several safeguards:

1. **Save Strategy**

   * Attempts incremental save first (preserves PDF structure)
   * Falls back to full save via temporary file if needed
   * Ensures original file is not corrupted during updates

2. **Validation**

   * Requires title and author fields
   * Validates XML structure before writing
   * Preserves original metadata as backup
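
The validation checks could look roughly like this in Python. The helper names are hypothetical, and the XML check uses the standard library's `xml.etree.ElementTree` to test well-formedness only. It does not check the packet against any XMP schema.

```python
import xml.etree.ElementTree as ET

REQUIRED_FIELDS = ("title", "author")


def missing_required(metadata):
    """Return the required fields that are absent or blank."""
    return [f for f in REQUIRED_FIELDS if not str(metadata.get(f) or "").strip()]


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

A caller would skip the write whenever `missing_required(...)` returns a non-empty list or `is_well_formed(...)` returns `False`.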

### What about PDF reader compatibility?

Different PDF readers handle metadata differently:

1. **Basic Readers**

   * See only PDF info dictionary
   * Display basic title/author
   * Limited metadata support

2. **Advanced Readers**

   * Access full XMP metadata
   * Show all metadata fields
   * Support custom namespaces

3. **Library Systems**
   * Use DC metadata for catalogs
   * Extract DOI/ISBN for linking
   * Index full metadata content

This is why we maintain both basic PDF info and rich XMP metadata.
