What do BDC and EMC actually stand for?

`BDC` is usually expanded as 'Begin Dictionary Content'; the PDF specification describes it as beginning a marked-content sequence with an associated property list. It opens a bracket that requires a properties dictionary. `EMC` stands for 'End Marked Content' and closes the bracket. Anything drawn between the two is treated as one logical group.

What is the difference between BMC and BDC?

`BMC` (Begin Marked Content) is a simple tag that takes a single name (e.g., `/Header BMC`) and nothing else. `BDC` is much more powerful because it also takes a properties dictionary, allowing you to attach an MCID integer, assign the content to an Optional Content group, or describe it as an Artifact.

Can I nest Marked Content tags inside each other?

Yes, heavily. A standard architectural PDF might have an overarching `BDC` tag assigning everything to a 'Plumbing' Layer. Inside that bracket, there might be smaller nested `BDC` tags assigning specific pipe labels to a structural accessibility tag.

The Marked Content Identifier (MCID) is an integer assigned inside a `BDC` properties dictionary. The Logical Structure Tree (an XML-like hierarchy stored separately from the page content) uses this integer, together with the page, to point to the marked text on the page. If the MCID is missing, the structure tree has no way to reference that text, so a screen reader that relies on tags will not reach it.

What happens if an EMC tag is missing?

The content stream is invalid. Marked-content operators must be balanced: every `BMC` or `BDC` needs a matching `EMC`. Lenient viewers such as Acrobat will usually still guess and draw the page, but stricter parsers, validators, and accessibility tools may report an error or mis-associate the content that follows.
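The balance rule can be checked mechanically. The following is a minimal Python sketch, assuming you already have the decompressed bytes of one page's content stream; the function name `check_marked_content_balance` and the simplified tokenizer are illustrative, not part of any PDF library. The tokenizer skips comments and strings so that the letters "EMC" inside text are not counted, but it does not handle nested parentheses inside strings or inline images.

```python
import re

# Simplified content-stream tokenizer (an assumption of this sketch).
TOKEN = re.compile(rb"""
    \((?:\\.|[^\\()])*\)      # literal string, no nested parentheses
  | <[0-9A-Fa-f\s]*>          # hex string
  | %[^\r\n]*                 # comment
  | <<|>>|[\[\]{}]            # dictionary, array and brace delimiters
  | /[^\s()<>\[\]{}/%]*       # name object such as /P or /MCID
  | [^\s()<>\[\]{}/%]+        # number or operator
""", re.VERBOSE | re.DOTALL)


def check_marked_content_balance(stream: bytes) -> list[str]:
    """Return a list of problems; an empty list means BMC/BDC and EMC balance."""
    problems = []
    depth = 0  # number of marked-content sequences currently open
    for match in TOKEN.finditer(stream):
        token = match.group()
        if token in (b"BMC", b"BDC"):
            depth += 1
        elif token == b"EMC":
            if depth == 0:
                problems.append(f"EMC at byte {match.start()} has no opening BMC/BDC")
            else:
                depth -= 1
    if depth:
        problems.append(f"{depth} marked-content sequence(s) never closed by EMC")
    return problems


if __name__ == "__main__":
    unclosed = b"/P << /MCID 0 >> BDC BT (Hello) Tj ET"
    print(check_marked_content_balance(unclosed))
```

Because the check only counts depth, it reports that something is unbalanced but not which tag was left open; a stack of tag names would be needed for that.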

Do these tags change how text looks?

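No. Marked Content operators are invisible metadata. They do not change fonts, colors, or positioning. The one visible effect is indirect: content inside an Optional Content bracket is not drawn while its layer is hidden. Otherwise they exist purely to provide structural hooks for assistive technology and other software.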

PDF Marked Content (BDC/EMC) Explained

Quick Answer

Think of HTML tags like <span>...</span>. PDF Marked Content works similarly, but it uses the operator BDC (Begin Dictionary Content) to open the bracket and EMC (End Marked Content) to close it. Anything drawn inside the bracket is treated as one logical group. If that bracket carries an MCID number, the structure tree can point to it, which lets a screen reader target and read the text inside.

The Core Use Cases

Marked Content is the underlying mechanical foundation for three massive, distinct PDF systems:

Logical Structure (Accessibility): A structure element under the StructTreeRoot says "I am a heading" or "I am a paragraph" and lists an MCID (Marked Content ID). A reader looks in the page stream for the `BDC` block with the matching MCID, which identifies exactly which text on the page belongs to that element.
Optional Content (Layers): An architect puts all the blue pipes on a layer. In the file, those vectors are drawn inside an `/OC` `BDC` bracket whose property name resolves, through the page's resources, to an optional content group. The group's visibility is configured in the Document Catalog. If the "Plumbing Layer" is hidden, the rendering engine skips everything up to the matching `EMC`.
Artifact Designation: A repeating "Page 4" footer should not be read aloud by a screen reader. A marked-content block tagged `/Artifact` tells accessibility parsers to skip everything between the opening operator and its `EMC`.

The Three Operators

Operator	Name	Usage & Syntax
BMC	Begin Marked Content	`/Tag_Name BMC`. Very basic: it takes only a tag name. Used for simple or legacy grouping where no properties are needed.
BDC	Begin Dictionary Content	`/P << /MCID 0 >> BDC`. The standard choice. It takes a tag plus a properties dictionary (written inline, or named in the page's /Properties resource), allowing you to attach an MCID, artifact details, or a layer reference to the content.
EMC	End Marked Content	`EMC`. Required to close whichever `BMC` or `BDC` was most recently opened. If it is missing, the content stream is invalid.

Real-World Scenarios

♿ Web Content Accessibility

The "Empty" Missing PDF

An HR department uses a cheap 3rd-party library to generate IRS W-4 tax PDFs. The PDF visuals look immaculate. But a blind user opens the file, and their JAWS reader says "Document Empty." The issue? The cheap library physically drew the text (`(John Doe) Tj`), but completely failed to wrap the text in a /P << /MCID ### >> BDC block. Because the text wasn't "Marked," the accessibility tree couldn't find the ink.
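To make the missing step concrete, here is a small Python sketch of what the page-side wrapping could look like in a generator. The helper names `pdf_literal_string` and `tagged_paragraph` are hypothetical, the font resource `/F1` is assumed to exist on the page, and the text is assumed to be plain ASCII that the font can encode. This covers only the content-stream half: a usable tagged PDF also needs a structure element that references the same MCID.

```python
def pdf_literal_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def tagged_paragraph(text: str, mcid: int, font: str = "F1", size: int = 12) -> str:
    """Build a content-stream fragment with the text wrapped in /P ... BDC and EMC."""
    return "\n".join([
        f"/P << /MCID {mcid} >> BDC",        # open the marked-content sequence
        "BT",
        f"/{font} {size} Tf",
        f"{pdf_literal_string(text)} Tj",    # the visible text
        "ET",
        "EMC",                               # close the sequence
    ])


if __name__ == "__main__":
    print(tagged_paragraph("John Doe", mcid=0))
```

The generator is responsible for handing out a fresh MCID for each block on a page and for recording that number in the structure tree.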

🔍 Data Extraction

Beating the Watermark

An AI developer writes a Python script hoping to scrape article text from a scientific journal PDF. Unfortunately, the publisher stamped a giant diagonal "DRAFT DO NOT DISTRIBUTE" watermark across every page, so the scraped text looks like "T h D R A i s F To c". However, an advanced developer notices the watermark was wrapped in an `/Artifact` marked-content block. The script is rewritten to ignore any text sitting inside an Artifact-marked block, successfully extracting the clean paragraph text.
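A rough Python sketch of that filtering idea follows. It assumes the page's content stream is already decompressed, handles only the `Tj` operator (not `TJ` arrays, `'` or `"`), does not unescape string contents, and decodes bytes as Latin-1, which real fonts often do not match. The function name `extract_text_skipping_artifacts` and the tokenizer are illustrative only.

```python
import re

# Simplified content-stream tokenizer (an assumption of this sketch).
TOKEN = re.compile(rb"""
    \((?:\\.|[^\\()])*\)      # literal string, no nested parentheses
  | <[0-9A-Fa-f\s]*>          # hex string
  | %[^\r\n]*                 # comment
  | <<|>>|[\[\]{}]            # delimiters
  | /[^\s()<>\[\]{}/%]*       # name object
  | [^\s()<>\[\]{}/%]+        # number, keyword or operator
""", re.VERBOSE | re.DOTALL)

KEYWORDS = (b"true", b"false", b"null")


def is_operator(token: bytes) -> bool:
    """Treat alphabetic tokens (plus ' and ") as operators, except keywords."""
    return (token[:1].isalpha() and token not in KEYWORDS) or token in (b"'", b'"')


def extract_text_skipping_artifacts(stream: bytes) -> list[str]:
    open_sequences = []  # one flag per open BMC/BDC: True if tagged /Artifact
    operands = []        # operand tokens seen since the previous operator
    texts = []
    for match in TOKEN.finditer(stream):
        token = match.group()
        if token.startswith(b"%"):
            continue  # comments carry no content
        if token in (b"BMC", b"BDC"):
            # The tag is always the first operand of BMC and BDC.
            open_sequences.append(bool(operands) and operands[0] == b"/Artifact")
            operands.clear()
        elif token == b"EMC":
            if open_sequences:
                open_sequences.pop()
            operands.clear()
        elif token == b"Tj":
            inside_artifact = any(open_sequences)
            if operands and operands[-1].startswith(b"(") and not inside_artifact:
                texts.append(operands[-1][1:-1].decode("latin-1"))
            operands.clear()
        elif is_operator(token):
            operands.clear()
        else:
            operands.append(token)
    return texts
```

Keeping a stack rather than a single flag means text nested inside any enclosing Artifact bracket is skipped, even if an inner bracket carries a different tag.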

Syntax in the Page Content Stream

PDF PAGE STREAM: Marked Content Examples

% Unmarked content. This is a purely visual black box. 
% Screen readers don't care about it.
0 0 0 rg
0 0 100 100 re f

% Example 1: Accessibility Tagging
% Tag this as a Paragraph (/P) and give it MCID 2 so the structure tree can find it.
/P << /MCID 2 >> BDC
  BT 
  /F1 12 Tf 
  (This is a sentence that screen readers will announce.) Tj 
  ET
EMC % Close the Paragraph

% Example 2: Optional Content (Layers)
% /OC1 is a name in the page's /Properties resource that refers to a layer (optional content group). If that layer is toggled off, this text is not drawn.
/OC /OC1 BDC
  BT 
  0.8 0 0 rg % Red Text
  (CONFIDENTIAL DRAFT WATERMARK) Tj 
  ET
EMC % Close the Layer

% Example 3: Artifacting
% Mark the page number at the bottom as an artifact so screen readers and extractors can skip it.
/Artifact BMC
  BT 
  (Page 42) Tj 
  ET
EMC % Close the Artifact Block

Common Tagging Failures

Overlapping Tags. Marked-content brackets may nest, but they can never overlap: `EMC` always closes the most recently opened `BMC` or `BDC`. The same rule applies to text objects, so `BDC BT ... ET EMC` and `BT BDC ... EMC ET` are valid while `BDC BT ... EMC ET` is not. In addition, a block that carries an MCID should be closed before the next MCID block opens. Starting the next paragraph before the first paragraph reaches `EMC` produces an invalid tagged page.
Duplicate MCIDs. The Marked Content ID (`/MCID 4`) integer must be unique *for that specific page*. If a script accidentally copies and pastes a paragraph, creating two blocks that both hold `MCID 4`, the structure tree can no longer tell which block of text belongs to which structure element.
Over-Tagging. Wrapping every individual character in a separate BDC/EMC block is technically legal, but it inflates the file and slows down parsers and assistive technology. Entire paragraphs or unified lines should be bundled into a single bracket.
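Duplicate MCIDs are easy to detect once a page's content stream is decompressed. The Python sketch below is a simple regex scan; the function name `duplicate_mcids` is illustrative. It assumes the MCIDs are written inline in the stream, so it misses property lists that are referenced by name through the /Properties resource, and it would also count the text "/MCID 4" if it appeared inside a string or comment.

```python
import re
from collections import Counter

# Matches an inline MCID entry such as "/MCID 4".
MCID_PATTERN = re.compile(rb"/MCID\s+(\d+)")


def duplicate_mcids(stream: bytes) -> list[int]:
    """Return the MCID values that occur more than once in one page's stream."""
    counts = Counter(int(m.group(1)) for m in MCID_PATTERN.finditer(stream))
    return sorted(mcid for mcid, n in counts.items() if n > 1)


if __name__ == "__main__":
    page = (
        b"/P << /MCID 4 >> BDC BT (First) Tj ET EMC\n"
        b"/P << /MCID 5 >> BDC BT (Second) Tj ET EMC\n"
        b"/P << /MCID 4 >> BDC BT (Pasted copy) Tj ET EMC\n"
    )
    print(duplicate_mcids(page))
```

The scan has to be run per page, since the same MCID value may legitimately appear on different pages.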

Frequently Asked Questions

Marked Content does not draw anything. It relies purely on metadata operators: `BDC` puts no ink on the page; it only assigns a logical property to the ink drawn inside its boundaries.
To inspect tags in Acrobat Pro, open the 'Tags' panel to see the translated structure tree. To see the actual `BDC`/`EMC` operators, you must use a low-level PDF diagnostic tool like "Acrobat Preflight > Explore Page Content" or open the uncompressed PDF in a hex editor.
An MCID is required only for Logical Structure (Accessibility). If you are using a BDC simply to create a toggled "Layer" (Optional Content), or to create an `Artifact`, it does not need an MCID integer because it doesn't need to communicate with the master StructTreeRoot.
Artifacts matter for compliance. Under PDF/UA, every piece of content must be either tagged or marked as an artifact, so a completely untagged vector line causes validation failures. Wrapping that line in an `/Artifact` block formally declares to the validator: "I am aware this exists, and I am telling you to ignore it."
A marked-content bracket cannot span pages. The `BDC`/`EMC` pair must open and close within the boundary of a single Page Object's content stream. To link a paragraph that flows across a page break, the Logical Structure Tree must map two separate MCIDs (one on each page) under a single parent `<P>` node.
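The cross-page mapping can be pictured with a small Python model. This is only a conceptual sketch: the class names `MarkedContentRef` and `StructElem`, the page indexes, the MCID values, and the sample text are all made up for illustration. In a real file the children are marked-content references (or bare MCID integers) stored in the structure element's kids.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class MarkedContentRef:
    """Points at one marked-content block: a page plus an MCID on that page."""
    page_index: int
    mcid: int


@dataclass
class StructElem:
    """A structure tree node such as <P>, with its content references in order."""
    tag: str
    kids: list[MarkedContentRef] = field(default_factory=list)


def read_element(elem: StructElem, text_by_ref: dict[tuple[int, int], str]) -> str:
    """Join the text of every block the element points to, in reading order."""
    return "".join(text_by_ref[(kid.page_index, kid.mcid)] for kid in elem.kids)


# Hypothetical text already extracted per (page index, MCID).
page_text = {
    (0, 7): "This paragraph starts on page one and ",
    (1, 0): "finishes on page two.",
}

# One <P> node with two children, one on each page.
paragraph = StructElem("P", [MarkedContentRef(0, 7), MarkedContentRef(1, 0)])

if __name__ == "__main__":
    print(read_element(paragraph, page_text))
```

Note that MCID 0 on the second page does not clash with anything on the first page, because MCIDs only need to be unique within their own page.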
