
PDF Marked Content: BDC & EMC Explained

Before you can build an accessible heading hierarchy or a layered architectural blueprint, you need a way to wrap raw drawing operators in a labeled group. Marked Content operators are the brackets written into the page content stream that let higher-level features (structure, layers, artifacts) refer to specific pieces of page content.

Quick Answer

Think of HTML tags like <span>...</span>. PDF Marked Content plays a similar role, but it uses operators in the content stream: BDC (begin a marked-content sequence with a property list) opens the bracket, and EMC (end marked-content sequence) closes it. A simpler opener, BMC, takes a tag but no property list. Everything drawn between the opener and its EMC belongs to that group. If the property list carries an MCID number, the document's structure tree can point at that exact content, which is what lets a screen reader find and read the text inside.

The Core Use Cases

Marked Content is the underlying mechanism for three distinct PDF features:

  • Logical Structure (Accessibility): A structure element under the StructTreeRoot (say, a heading) records a page and an MCID (Marked Content ID). A reader looks in that page's content stream for the `BDC` block carrying the same MCID, which identifies exactly which text on the page makes up that heading.
  • Optional Content (Layers): An architect puts all the blue pipes on a layer. In the file, the pipe vectors are drawn inside an `/OC` `BDC` bracket whose property list names an optional content group through the page's Properties resources. If the "Plumbing Layer" group is set to hidden, the renderer skips everything up to the matching `EMC`.
  • Artifact Designation: A repeating "Page 4" footer should not be read aloud by a screen reader. A `BMC` or `BDC` block tagged /Artifact tells accessibility tools to ignore everything up to the matching `EMC`.
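The Logical Structure lookup can be sketched in a few lines of Python. This is a minimal illustration, not a real PDF parser: it assumes the content stream is already decompressed text, that property lists are flat inline dictionaries, and that text is shown only with `Tj`. The names `text_by_mcid`, `resolve`, and the dictionary standing in for a structure element are hypothetical.

```python
import re

# Simplified tokenizer. Assumptions: no inline images, and literal strings
# contain no unescaped nested parentheses.
TOKEN_RE = re.compile(
    r"""
      %[^\r\n]*                # comment
    | \((?:\\.|[^\\()])*\)     # literal string
    | <<|>>                    # dictionary delimiters
    | <[0-9A-Fa-f\s]*>         # hex string
    | \[|\]                    # array delimiters
    | /[^\s/<>\[\]()%]*        # name
    | [^\s/<>\[\]()%]+         # number or operator
    """,
    re.VERBOSE,
)


def text_by_mcid(stream):
    """Map each MCID in one page's content stream to the text shown inside it."""
    tokens = [t for t in TOKEN_RE.findall(stream) if not t.startswith("%")]
    stack = []   # one entry per open BMC/BDC: its MCID, or None
    result = {}
    for i, tok in enumerate(tokens):
        if tok == "BMC":
            stack.append(None)
        elif tok == "BDC":
            mcid = None
            if tokens[i - 1] == ">>":          # flat inline property list
                j = i - 1
                while tokens[j] != "<<":
                    if tokens[j] == "/MCID":
                        mcid = int(tokens[j + 1])
                    j -= 1
            stack.append(mcid)
        elif tok == "EMC":
            if stack:
                stack.pop()
        elif tok == "Tj" and tokens[i - 1].startswith("("):
            # Credit the text to the innermost enclosing MCID, if any.
            owner = next((m for m in reversed(stack) if m is not None), None)
            if owner is not None:
                result[owner] = result.get(owner, "") + tokens[i - 1][1:-1]
    return result


def resolve(element, pages):
    """Return the page text a (simplified) structure element points at."""
    return text_by_mcid(pages[element["Pg"]]).get(element["K"])


PAGE0 = """
/H1 << /MCID 0 >> BDC
  BT /F1 18 Tf (Quarterly Report) Tj ET
EMC
/P << /MCID 1 >> BDC
  BT /F1 12 Tf (Revenue rose.) Tj ET
EMC
BT (Unmarked footer) Tj ET
"""

# Hypothetical stand-in for a structure element: page index plus MCID.
heading = {"S": "H1", "Pg": 0, "K": 0}
print(resolve(heading, [PAGE0]))  # Quarterly Report
```

The unmarked footer never appears in the result, which mirrors why unmarked text is invisible to the structure tree.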

The Three Operators

| Operator | Name | Usage & Syntax |
| --- | --- | --- |
| BMC | Begin Marked Content | `/Tag_Name BMC`. The basic form: it takes only a tag name and no property list. Suited to simple grouping, such as a bare `/Artifact` marker. |
| BDC | Begin Marked Content with Property List | `/P << /MCID 0 >> BDC`. The form used for tagging and layers. It takes a tag plus a property list, given either as an inline dictionary or as a name that refers to the page's Properties resources, so you can attach MCIDs, artifact types, and layer membership to the content. |
| EMC | End Marked Content | `EMC`. Required to close whichever `BMC` or `BDC` was most recently opened. If it is missing, the content stream is unbalanced and readers may mis-handle or reject the tagging. |
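To make the operand layout of the three operators concrete, here is a minimal Python sketch that lists every marked-content opener in a stream with its tag, property list, and nesting depth. It assumes an already-decompressed content stream and uses a deliberately simplified tokenizer; `marked_content_spans` is a hypothetical helper name, not part of any library.

```python
import re

# Simplified tokenizer. Assumptions: no inline images, and literal strings
# contain no unescaped nested parentheses.
TOKEN_RE = re.compile(
    r"""
      %[^\r\n]*                # comment
    | \((?:\\.|[^\\()])*\)     # literal string
    | <<|>>                    # dictionary delimiters
    | <[0-9A-Fa-f\s]*>         # hex string
    | \[|\]                    # array delimiters
    | /[^\s/<>\[\]()%]*        # name
    | [^\s/<>\[\]()%]+         # number or operator
    """,
    re.VERBOSE,
)


def marked_content_spans(stream):
    """Return (tag, properties, depth) for each BMC/BDC, in stream order."""
    tokens = [t for t in TOKEN_RE.findall(stream) if not t.startswith("%")]
    spans, depth = [], 0
    for i, tok in enumerate(tokens):
        if tok == "BMC":                      # operands: tag
            spans.append((tokens[i - 1], None, depth))
            depth += 1
        elif tok == "BDC":                    # operands: tag, property list
            j = i - 1
            if tokens[j] == ">>":             # inline dict: rewind to its "<<"
                nesting = 0
                while True:
                    if tokens[j] == ">>":
                        nesting += 1
                    elif tokens[j] == "<<":
                        nesting -= 1
                    if nesting == 0:
                        break
                    j -= 1
            spans.append((tokens[j - 1], " ".join(tokens[j:i]), depth))
            depth += 1
        elif tok == "EMC":                    # closes the most recent opener
            depth -= 1
    return spans


SAMPLE = """
/P << /MCID 2 >> BDC
  BT /F1 12 Tf (Hello) Tj ET
EMC
/OC /OC1 BDC
  /Artifact BMC
    BT (Page 42) Tj ET
  EMC
EMC
"""

for span in marked_content_spans(SAMPLE):
    print(span)
# ('/P', '<< /MCID 2 >>', 0)
# ('/OC', '/OC1', 0)
# ('/Artifact', None, 1)
```

The third entry has depth 1 because the `/Artifact` sequence sits inside the `/OC` sequence; each `EMC` closes only the innermost open bracket.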

Real-World Scenarios

♿ Web Content Accessibility

The "Empty" PDF

An HR department uses a low-cost third-party library to generate IRS W-4 tax PDFs. The pages look immaculate, but when a blind user opens the file, their JAWS reader says "Document Empty." The issue? The library drew the text (`(John Doe) Tj`) but never wrapped it in a /P << /MCID ### >> BDC ... EMC sequence. Because the text was never marked, the accessibility tree had nothing to connect to the ink on the page.

🔍 Data Extraction

Beating the Watermark

An AI developer writes a Python script to scrape article text from a scientific journal PDF. Unfortunately, the publisher stamped a giant diagonal "DRAFT DO NOT DISTRIBUTE" watermark across every page, and the watermark letters come out interleaved with the article text: "T h D R A i s F To c". On inspecting the content stream, the developer notices that the watermark is wrapped in an /Artifact BDC block. The script is rewritten to skip any text inside an Artifact-marked sequence, which leaves only the clean paragraph text.
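A minimal sketch of that rewritten script is shown here, under stated assumptions: the content stream has already been decompressed to text, strings are shown with `Tj` only, and no font decoding is needed. The helper names `opening_tag` and `extract_text` are hypothetical; a production script would rely on a PDF library for parsing and decoding.

```python
import re

# Simplified tokenizer. Assumptions: no inline images, and literal strings
# contain no unescaped nested parentheses.
TOKEN_RE = re.compile(
    r"""
      %[^\r\n]*                # comment
    | \((?:\\.|[^\\()])*\)     # literal string
    | <<|>>                    # dictionary delimiters
    | <[0-9A-Fa-f\s]*>         # hex string
    | \[|\]                    # array delimiters
    | /[^\s/<>\[\]()%]*        # name
    | [^\s/<>\[\]()%]+         # number or operator
    """,
    re.VERBOSE,
)


def opening_tag(tokens, i):
    """Return the tag operand of the BMC/BDC operator at index i."""
    if tokens[i] == "BMC":
        return tokens[i - 1]
    j = i - 1                          # last token of the property list
    if tokens[j] == ">>":              # inline dictionary: rewind to its "<<"
        depth = 0
        while True:
            depth += {">>": 1, "<<": -1}.get(tokens[j], 0)
            if depth == 0:
                break
            j -= 1
    return tokens[j - 1]


def extract_text(stream, skip_tags=("/Artifact",)):
    """Collect Tj strings, ignoring anything nested inside a skipped tag."""
    tokens = [t for t in TOKEN_RE.findall(stream) if not t.startswith("%")]
    stack, pieces = [], []
    for i, tok in enumerate(tokens):
        if tok in ("BMC", "BDC"):
            stack.append(opening_tag(tokens, i))
        elif tok == "EMC":
            if stack:
                stack.pop()
        elif tok == "Tj" and tokens[i - 1].startswith("("):
            if not any(tag in skip_tags for tag in stack):
                pieces.append(tokens[i - 1][1:-1])
    return " ".join(pieces)


PAGE = """
/Artifact << /Type /Pagination >> BDC
  BT /F1 72 Tf (DRAFT DO NOT DISTRIBUTE) Tj ET
EMC
/P << /MCID 0 >> BDC
  BT /F1 11 Tf (This is the article text.) Tj ET
EMC
"""

print(extract_text(PAGE))  # This is the article text.
```

Tracking a stack of open tags, rather than a single flag, keeps the filter correct when an artifact contains nested marked content of its own.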

Syntax in the Page Content Stream

PDF PAGE STREAM: BDC, BMC, and EMC Examples
% Unmarked content. This is a purely visual black box. 
% Screen readers don't care about it.
0 0 0 rg
0 0 100 100 re f

% Example 1: Accessibility Tagging
% We define this as a Paragraph (/P), assign it MCID 2 so the Structural Tree can find it.
/P << /MCID 2 >> BDC
  BT 
  /F1 12 Tf 
  (This is a sentence that screen readers will announce.) Tj 
  ET
EMC % Close the Paragraph

% Example 2: Optional Content (Layers)
% /OC1 names an optional content group (layer) in the page's /Properties resources. If that layer is toggled off, this text is not drawn.
/OC /OC1 BDC
  BT 
  0.8 0 0 rg % Red Text
  (CONFIDENTIAL DRAFT WATERMARK) Tj 
  ET
EMC % Close the Layer

% Example 3: Artifacting
% We mark this page number as an artifact so screen readers (and artifact-aware extractors) skip it.
/Artifact BMC
  BT 
  (Page 42) Tj 
  ET
EMC % Close the Artifact Block

Common Tagging Failures

  • Intersecting Tags. Marked Content sequences may be nested, but they cannot overlap: every `EMC` closes the most recently opened `BMC` or `BDC`, so the first paragraph must fire its `EMC` before an independent second paragraph opens. Each pair must also sit entirely inside one text object (`BT` ... `ET`) or entirely outside it. A missing or stray `EMC` leaves the stream unbalanced, and the mapping from tags to content becomes unreliable.
  • Redundant MCIDs. The Marked Content ID (`/MCID 4`) integer must be unique *for that specific page*. If a script accidentally copies and pastes a paragraph, creating two blocks that both hold `MCID 4`, the structure tree's reference becomes ambiguous: a reader cannot tell which text block belongs to the Heading.
  • Over-Tagging. Wrapping every individual character in its own BDC/EMC block is technically legal, but it inflates the file size and slows down parsers and assistive technology. Bundle whole paragraphs or unified lines into a single bracket instead.
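The first two failures are easy to detect mechanically. The following Python sketch checks one page's content stream for unbalanced brackets and duplicate MCIDs. It is a rough lint under simplifying assumptions (decompressed stream text, inline property lists only, simplified tokenizer), and `check_marked_content` is a hypothetical name rather than a real validator API.

```python
import re

# Simplified tokenizer. Assumptions: no inline images, and literal strings
# contain no unescaped nested parentheses.
TOKEN_RE = re.compile(
    r"""
      %[^\r\n]*                # comment
    | \((?:\\.|[^\\()])*\)     # literal string
    | <<|>>                    # dictionary delimiters
    | <[0-9A-Fa-f\s]*>         # hex string
    | \[|\]                    # array delimiters
    | /[^\s/<>\[\]()%]*        # name
    | [^\s/<>\[\]()%]+         # number or operator
    """,
    re.VERBOSE,
)


def check_marked_content(stream):
    """Return a list of problem descriptions for one page's content stream."""
    tokens = [t for t in TOKEN_RE.findall(stream) if not t.startswith("%")]
    problems = []
    open_count = 0
    seen_mcids = set()
    for i, tok in enumerate(tokens):
        if tok in ("BMC", "BDC"):
            open_count += 1
        elif tok == "EMC":
            if open_count == 0:
                problems.append("EMC with no open BMC/BDC")
            else:
                open_count -= 1
        elif tok == "/MCID" and i + 1 < len(tokens) and tokens[i + 1].isdigit():
            mcid = int(tokens[i + 1])
            if mcid in seen_mcids:
                problems.append(f"duplicate MCID {mcid}")
            seen_mcids.add(mcid)
    if open_count:
        problems.append(f"{open_count} marked-content sequence(s) never closed")
    return problems


GOOD = "/P << /MCID 0 >> BDC BT (a) Tj ET EMC /P << /MCID 1 >> BDC BT (b) Tj ET EMC"
BAD = "/P << /MCID 4 >> BDC BT (a) Tj ET EMC /P << /MCID 4 >> BDC BT (b) Tj ET"

print(check_marked_content(GOOD))  # []
print(check_marked_content(BAD))
# ['duplicate MCID 4', '1 marked-content sequence(s) never closed']
```

Over-tagging is a judgment call rather than a hard error, so it is left out of the check; comparing the number of openers with the amount of text would be one way to flag it.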

Frequently Asked Questions

  • Marked Content does not change how a page looks. `BMC`, `BDC`, and `EMC` are purely metadata operators: they draw no ink on the page and only assign a logical property to the ink drawn inside their boundaries.

  • To inspect the tags, open the 'Tags' panel in Acrobat Pro, which shows the translated structural tree. To see the actual `BDC/EMC` operators, however, you must use a low-level PDF diagnostic tool like "Acrobat Preflight > Explore Page Content" or open the uncompressed PDF in a text or hex editor.

  • An MCID is needed only for Logical Structure (Accessibility). If you are using a BDC simply to create a toggled "Layer" (Optional Content), or to create an `Artifact`, it does not need an MCID integer because nothing under the StructTreeRoot has to reference it.

  • Marking decorative content as an Artifact is a matter of compliance. PDF/UA expects every piece of page content to be either tagged in the structure tree or marked as an artifact, so a completely untagged vector line causes validation failures. Wrapping that line in an `/Artifact` block formally declares to the validator: "I am aware this exists, and I am telling you to ignore it."

  • A `BDC/EMC` bracket cannot span pages. It must open and close within the boundary of a single Page Object's content stream. To link a paragraph that flows across a page break, the Logical Structure Tree maps two separate MCIDs (one on each page) under a single parent `<P>` node.
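The cross-page case can be illustrated with a small Python sketch. The data structures are hypothetical simplifications: in a real file the kids of a structure element are marked-content reference dictionaries whose page entry is an object reference, not a list index, and the per-page text tables would come from parsing each content stream.

```python
# Hypothetical, simplified model of one <P> structure element whose kids
# reference marked content on two different pages.
paragraph = {
    "S": "P",
    "K": [
        {"Pg": 0, "MCID": 7},   # last block on the first page
        {"Pg": 1, "MCID": 0},   # first block on the next page
    ],
}

# Text already extracted per page, keyed by MCID (assumed input).
page_text = [
    {7: "The committee agreed that the"},
    {0: "proposal should go ahead."},
]


def element_text(element, pages):
    """Join the text of every marked-content kid, in reading order."""
    return " ".join(pages[kid["Pg"]][kid["MCID"]] for kid in element["K"])


print(element_text(paragraph, page_text))
# The committee agreed that the proposal should go ahead.
```

Each page keeps its own balanced bracket and its own MCID numbering; only the structure element knows that the two blocks form one paragraph.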
