Think of HTML tags like <span>...</span>. PDF Marked Content works in a similar way, but it uses content stream operators: BDC (Begin Marked Content with a property list) opens the bracket, and EMC (End Marked Content) closes it. Anything drawn inside the bracket is treated as one logical group. If the bracket carries an MCID number, the document's structure tree can point at it, which lets a screen reader locate and read the text inside.
The Core Use Cases
Marked Content is the mechanical foundation for three distinct PDF features:
- Logical Structure (Accessibility): The `StructTreeRoot` states "I have a Heading Paragraph". The structure element then references a `BDC` block in the page stream by its `MCID` (Marked Content ID), identifying exactly which string of text on the page constitutes that Heading.
- Optional Content (Layers): An architect puts all the blue pipes on a layer. In reality, the PDF engine draws the vectors inside a `BDC` bracket. The bracket names an optional content group whose visibility is configured through the Document Catalog. If the "Plumbing Layer" is set to "Invisible," the rendering engine skips all the vectors until it reaches the matching `EMC`.
- Artifact Designation: A repeating "Page 4" footer must not be read aloud by a screen reader. A block tagged simply as `/Artifact` tells accessibility parsers to ignore everything drawn before the closing `EMC`.
The Three Operators
| Operator | Name | Usage & Syntax |
|---|---|---|
| BMC | Begin Marked Content | `/Tag_Name BMC`. The basic form: it takes only a name tag. Suited to simple grouping where no properties are needed, because it cannot carry a property dictionary. |
| BDC | Begin Marked Content with Property List | `/P << /MCID 0 >> BDC`. The form used for tagged content. It takes a tag plus a property dictionary (written inline, or as a name that refers to one in the page resources), which lets you attach metadata, MCIDs, and layer membership directly to the visual ink. |
| EMC | End Marked Content | `EMC`. Required to close whichever `BMC` or `BDC` was most recently opened. If it is missing, the content stream is unbalanced and therefore invalid. |
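The pairing rule in the table can be checked mechanically. The following is a minimal Python sketch, assuming the content stream has already been decompressed into text; `TOKEN_RE`, `tokens`, and `check_balance` are hypothetical helper names, and the tokenizer is deliberately simplified (it ignores hex strings, nested parentheses, and inline images).

```python
import re

# Simplified tokenizer (an assumption, not a full PDF lexer): comments,
# literal strings without nested parentheses, delimiters, names, and
# everything else (numbers and operators).
TOKEN_RE = re.compile(
    r"%[^\r\n]*|\((?:\\.|[^\\()])*\)|<<|>>|\[|\]|/[^\s()<>\[\]/%]*|[^\s()<>\[\]/%]+"
)


def tokens(stream: str) -> list[str]:
    """Split a content stream into tokens, dropping comments."""
    return [t for t in TOKEN_RE.findall(stream) if not t.startswith("%")]


def check_balance(stream: str) -> list[str]:
    """Report BMC/BDC openers and EMC closers that do not pair up."""
    problems = []
    depth = 0
    for tok in tokens(stream):
        if tok in ("BMC", "BDC"):
            depth += 1
        elif tok == "EMC":
            if depth == 0:
                problems.append("EMC without an open BMC/BDC")
            else:
                depth -= 1
    if depth:
        problems.append(f"{depth} marked-content sequence(s) never closed")
    return problems


if __name__ == "__main__":
    print(check_balance("/P << /MCID 0 >> BDC BT (Hi) Tj ET EMC"))
    print(check_balance("/Artifact BMC BT (Page 42) Tj ET"))
```

An empty list means every opener found its `EMC`. A string such as `(BDC)` is treated as text, not as an operator, because the tokenizer keeps literal strings whole.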
Real-World Scenarios
The "Empty" Missing PDF
An HR department uses a cheap 3rd-party library to generate IRS W-4 tax PDFs. The PDFs look immaculate, but when a blind user opens one, their JAWS screen reader says "Document Empty." The issue? The library drew the text (`(John Doe) Tj`) but never wrapped it in a `/P << /MCID ### >> BDC` block. Because the text wasn't "Marked," the structure tree had nothing to point at, and the screen reader couldn't find the ink.
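A quick way to spot this failure is to look for text-showing operators that sit outside every marked-content bracket. The sketch below assumes a decompressed content stream and uses hypothetical helper names (`tokens`, `is_operand`, `unmarked_text`); the tokenizer is simplified and does not handle hex strings or nested parentheses.

```python
import re

# Simplified tokenizer (an assumption, not a full PDF lexer).
TOKEN_RE = re.compile(
    r"%[^\r\n]*|\((?:\\.|[^\\()])*\)|<<|>>|\[|\]|/[^\s()<>\[\]/%]*|[^\s()<>\[\]/%]+"
)
NUMBER_RE = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)$")
TEXT_OPS = {"Tj", "TJ", "'", '"'}  # operators that paint text


def tokens(stream: str) -> list[str]:
    return [t for t in TOKEN_RE.findall(stream) if not t.startswith("%")]


def is_operand(tok: str) -> bool:
    """Names, arrays, dictionaries, and numbers are operands, not operators."""
    return tok[0] in "/[]" or tok in ("<<", ">>") or NUMBER_RE.match(tok) is not None


def unmarked_text(stream: str) -> list[str]:
    """Return literal strings painted while no BMC/BDC bracket is open."""
    depth = 0
    pending: list[str] = []   # string operands waiting for their operator
    found: list[str] = []
    for tok in tokens(stream):
        if tok.startswith("("):
            pending.append(tok[1:-1])
        elif is_operand(tok):
            continue
        else:
            if tok in ("BMC", "BDC"):
                depth += 1
            elif tok == "EMC":
                depth = max(0, depth - 1)
            elif tok in TEXT_OPS and depth == 0:
                found.extend(pending)
            pending = []      # any operator consumes its operands
    return found


if __name__ == "__main__":
    print(unmarked_text("BT /F1 12 Tf (John Doe) Tj ET"))
```

Any string this returns is ink that no structure element can reference. A clean result does not prove the file is accessible, since the structure tree itself must also exist and point at the right MCIDs.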
Beating the Watermark
An AI developer writes a Python script hoping to scrape article text from a scientific journal PDF. Unfortunately, the publisher stamped a giant diagonal "DRAFT DO NOT DISTRIBUTE" watermark across every page. The scraped text looks like "T h D R A i s F To c". However, an advanced developer notices the watermark was wrapped in a /Artifact BDC block. The script is rewritten to automatically ignore any text sitting inside an Artifact-Marked block, successfully extracting the clean paragraph text.
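The rewritten scraper could look roughly like the following Python sketch. It assumes the page content stream is already available as decompressed text and that strings are simple literals; `extract_text`, `tokens`, and `is_operand` are hypothetical names, not functions from any PDF library.

```python
import re

# Simplified tokenizer (an assumption, not a full PDF lexer).
TOKEN_RE = re.compile(
    r"%[^\r\n]*|\((?:\\.|[^\\()])*\)|<<|>>|\[|\]|/[^\s()<>\[\]/%]*|[^\s()<>\[\]/%]+"
)
NUMBER_RE = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)$")
TEXT_OPS = {"Tj", "TJ", "'", '"'}


def tokens(stream: str) -> list[str]:
    return [t for t in TOKEN_RE.findall(stream) if not t.startswith("%")]


def is_operand(tok: str) -> bool:
    return tok[0] in "(/[]" or tok in ("<<", ">>") or NUMBER_RE.match(tok) is not None


def extract_text(stream: str, skip_tags=frozenset({"/Artifact"})) -> list[str]:
    """Collect painted strings, skipping any bracket whose tag is in skip_tags."""
    stack: list[str] = []      # tags of the currently open brackets
    operands: list[str] = []   # operands seen since the last operator
    out: list[str] = []
    for tok in tokens(stream):
        if is_operand(tok):
            operands.append(tok)
            continue
        if tok in ("BMC", "BDC"):
            # The tag is always the first operand of BMC and BDC.
            stack.append(operands[0] if operands else "")
        elif tok == "EMC":
            if stack:
                stack.pop()
        elif tok in TEXT_OPS and not skip_tags.intersection(stack):
            out.extend(t[1:-1] for t in operands if t.startswith("("))
        operands = []
    return out


if __name__ == "__main__":
    page = """
    /Artifact << /Type /Pagination /Subtype /Watermark >> BDC
      BT /F1 48 Tf (DRAFT DO NOT DISTRIBUTE) Tj ET
    EMC
    /P << /MCID 0 >> BDC
      BT /F1 12 Tf (This is the article text.) Tj ET
    EMC
    """
    print(extract_text(page))
```

Keeping a stack of open tags, instead of a single flag, means text nested anywhere inside an Artifact bracket is skipped as well. This only helps when the producer actually marked the watermark as an Artifact.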
Syntax in the Page Content Stream
    % Unmarked content. This is a purely visual black box.
    % Screen readers don't care about it.
    0 0 0 rg
    0 0 100 100 re
    f

    % Example 1: Accessibility Tagging
    % We define this as a Paragraph (/P), assign it MCID 2 so the Structural Tree can find it.
    /P << /MCID 2 >> BDC
      BT
        /F1 12 Tf
        (This is a sentence that screen readers will announce.) Tj
      ET
    EMC % Close the Paragraph

    % Example 2: Optional Content (Layers)
    % This block points to a Layer Dictionary (OC1). If OC1 is toggled off, this text vanishes.
    /OC /OC1 BDC
      BT
        0.8 0 0 rg % Red Text
        (CONFIDENTIAL DRAFT WATERMARK) Tj
      ET
    EMC % Close the Layer

    % Example 3: Artifacting
    % We tell all scraping AI to completely ignore this page number at the bottom.
    /Artifact BMC
      BT
        (Page 42) Tj
      ET
    EMC % Close the Artifact Block
Common Tagging Failures
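- Intersecting Tags. A rigid PDF rule: Marked Content brackets may nest, but they may never partially overlap. Each `EMC` closes the most recently opened block, and a bracket must open and close on the same side of a text object (`BT`/`ET`) boundary. Opening the next paragraph before the first paragraph fires `EMC` turns the second paragraph into a child of the first instead of a sibling, and blocks that carry an MCID should not be nested inside one another, so the resulting structure is invalid.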
- Redundant MCIDs. The Marked Content ID (`/MCID 4`) integer must be unique *within that specific page's content stream*. If a script accidentally copies and pastes a paragraph, creating two blocks that both hold `MCID 4`, the structure tree's reference becomes ambiguous: a reader cannot tell which physical text block belongs to the Heading.
- Over-Tagging. Wrapping every single individual character in a separate BDC/EMC block. While technically legal, this inflates the file size and slows down parsing considerably. Entire paragraphs or unified lines should be bundled into a single bracket.
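The duplicate-MCID failure is easy to test for. The following Python sketch scans one page's decompressed content stream and lists any MCID that appears on more than one `BDC`; it assumes the property dictionary is written inline (a dictionary referenced by name from the page resources would need a lookup this sketch does not do), and all function names are hypothetical.

```python
import re
from collections import Counter

# Simplified tokenizer (an assumption, not a full PDF lexer).
TOKEN_RE = re.compile(
    r"%[^\r\n]*|\((?:\\.|[^\\()])*\)|<<|>>|\[|\]|/[^\s()<>\[\]/%]*|[^\s()<>\[\]/%]+"
)
NUMBER_RE = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)$")


def tokens(stream: str) -> list[str]:
    return [t for t in TOKEN_RE.findall(stream) if not t.startswith("%")]


def is_operand(tok: str) -> bool:
    return tok[0] in "(/[]" or tok in ("<<", ">>") or NUMBER_RE.match(tok) is not None


def duplicate_mcids(stream: str) -> list[int]:
    """Return MCIDs used by more than one BDC in a single content stream."""
    seen: Counter = Counter()
    operands: list[str] = []
    for tok in tokens(stream):
        if is_operand(tok):
            operands.append(tok)
            continue
        if tok == "BDC":
            # Look for the key /MCID followed directly by an integer value.
            for key, value in zip(operands, operands[1:]):
                if key == "/MCID" and value.isdigit():
                    seen[int(value)] += 1
        operands = []
    return sorted(mcid for mcid, count in seen.items() if count > 1)


if __name__ == "__main__":
    page = """
    /H1 << /MCID 4 >> BDC BT (Heading) Tj ET EMC
    /P  << /MCID 5 >> BDC BT (Body) Tj ET EMC
    /P  << /MCID 4 >> BDC BT (Pasted copy) Tj ET EMC
    """
    print(duplicate_mcids(page))
```

The check runs per content stream on purpose: the same MCID value on two different pages is not a conflict.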
Frequently Asked Questions
Does Marked Content change how the page looks?
No. Marked Content relies purely on metadata operators. `BDC` does not draw ink on the page; it just assigns a logical property to the ink drawn inside its boundaries.
How can I see the Marked Content in a PDF?
In Acrobat Pro, you can open the 'Tags' panel to see the translated structural tree. However, to see the actual `BDC/EMC` operators, you must use a low-level PDF diagnostic tool like "Acrobat Preflight > Explore Page Content" or open the uncompressed PDF in a hex editor.
Does every BDC block need an MCID?
Only for Logical Structure (Accessibility). If you are using a BDC simply to create a toggled "Layer" (Optional Content), or to create an `Artifact`, it does not need an MCID integer because it doesn't need to communicate with the master StructTreeRoot.
Why mark decorative content as an Artifact instead of leaving it untagged?
Compliance. A completely untagged vector line violates the PDF/UA standard (ISO 14289) and causes validation failures. Wrapping that line explicitly in an `/Artifact` block formally declares to the validator: "I am aware this exists, and I am telling you to ignore it."
Can a single Marked Content block span multiple pages?
No. The `BDC/EMC` bracket must open and strictly close within the boundary of a single Page Object's content stream. To link a paragraph that flows across a page break, the Logical Structure Tree must map two separate MCIDs (one on each page) under a single parent `<P>` node.
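The cross-page case can be pictured with a small data model. This Python sketch is only an illustration: `StructElem`, `resolve_text`, and the `(page index, MCID)` tuples are hypothetical stand-ins, whereas a real PDF expresses each link as a marked-content reference carrying a page object and an MCID.

```python
from dataclasses import dataclass, field


@dataclass
class StructElem:
    """A structure element such as <P>, pointing at marked content by (page index, MCID)."""
    tag: str
    kids: list[tuple[int, int]] = field(default_factory=list)


def resolve_text(elem: StructElem, marked_text: dict[tuple[int, int], str]) -> str:
    """Join the text of every marked-content block the element references, in order."""
    return " ".join(marked_text[key] for key in elem.kids)


# Text found inside each BDC/EMC bracket, keyed by (page index, MCID).
# MCID 0 appears on both pages, which is fine: uniqueness is per page.
marked_text = {
    (0, 0): "Report title",
    (0, 7): "This paragraph starts on page one",
    (1, 0): "and finishes on page two.",
}

# One parent <P> node with two kids, one bracket on each page.
paragraph = StructElem("P", kids=[(0, 7), (1, 0)])

if __name__ == "__main__":
    print(resolve_text(paragraph, marked_text))
```

Neither bracket crosses the page boundary; the paragraph is reunited only in the structure tree.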