Example matter pages from Machine Learning System Design

What are these pages?

Front matter and back matter are the first and last pages of a document.

These sections hold the key raw material we’ll feed to the LLM as grounded context for an accurate filename and metadata prediction.

We skip the body matter, the middle chunk of the document, as it rarely has the metadata we need.
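
A minimal sketch of this selection step is shown here. The function name select_matter_pages and its parameters are hypothetical, not the project's actual API. It only illustrates taking pages from both ends of a document, skipping the body, and avoiding overlap when the document is short.

```python
from __future__ import annotations


def select_matter_pages(
    total_pages: int, front_max: int, back_max: int
) -> tuple[list[int], list[int]]:
    """Return 0-based page indices for front and back matter.

    Hypothetical helper: the body (everything between the two
    ranges) is skipped, and a page is never returned twice.
    """
    if total_pages < 0 or front_max < 0 or back_max < 0:
        raise ValueError("page counts must be non-negative")

    # Take at most front_max pages from the start of the document.
    front_count = min(front_max, total_pages)
    # Take back pages only from what the front range left over,
    # so short documents do not yield duplicate pages.
    back_count = min(back_max, total_pages - front_count)

    front = list(range(front_count))
    back = list(range(total_pages - back_count, total_pages))
    return front, back


if __name__ == "__main__":
    front, back = select_matter_pages(total_pages=300, front_max=10, back_max=5)
    print("front:", front)
    print("back:", back)
```

Giving the front range priority on short documents is an assumption made here because the title and copyright pages usually sit at the start.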

Front Matter Pages

The first pages of a document commonly contain publication details and introductory content:

  • Cover - Title, author[s], publisher logo
  • Half-title - Title and subtitle only
  • Recommended - Other/similar books by the same author[s]/publisher
  • Title - Primary source for title and subtitle
  • Copyright - Publication year, edition, DOI, LOC, ISBN[s]
  • Letter from the author[s]
  • Acknowledgements
  • Preface
  • Table of contents - Document structure and scope

The number of front matter pages to read is configurable with the MATTER_CONFIG.front.max_pages setting.

Back Matter Pages

The last pages of a document commonly contain supplementary information:

  • Bibliography - References and citations
  • Glossary - Key terms and definitions
  • Appendices - Additional material and data
  • Index/End notes - Subject coverage and annotations
  • Author[s] bios - Detailed author[s] information
  • Back cover - Marketing copy and additional metadata

The number of back matter pages to read is configurable with the MATTER_CONFIG.back.max_pages setting.
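
As a rough illustration, a dotted setting such as MATTER_CONFIG.front.max_pages could be backed by a structure like the one sketched here. The class names and the default values (10 and 5) are assumptions chosen for the example. Only the MATTER_CONFIG.front.max_pages and MATTER_CONFIG.back.max_pages paths come from the text above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SectionConfig:
    # Upper bound on how many pages to read from this end of the document.
    max_pages: int


@dataclass(frozen=True)
class MatterConfig:
    # Default values are placeholders for illustration, not real defaults.
    front: SectionConfig = field(default_factory=lambda: SectionConfig(max_pages=10))
    back: SectionConfig = field(default_factory=lambda: SectionConfig(max_pages=5))


MATTER_CONFIG = MatterConfig()

if __name__ == "__main__":
    print(MATTER_CONFIG.front.max_pages)
    print(MATTER_CONFIG.back.max_pages)
```

Frozen dataclasses keep the settings read-only after construction. To use different limits, build a new instance, for example MatterConfig(front=SectionConfig(max_pages=15)).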

Matter Processing Flow