WP_HTML_Processor{}└─ WP_HTML_Tag_Processor
Class used for parsing and modifying an HTML document.
Parses HTML code and allows modifications without breaking the markup. Uses its own HTML5 parser and works only with a "safe" set of tags and rules. If it encounters an unknown element or a complex case, it simply stops to avoid damaging the original code.
The class can:
- Read and change attributes and classes.
- Read and change the content inside a tag.
- Navigate through breadcrumbs.
- Set bookmarks and return to them.
- Move up and sideways in the tree (indirectly, through bookmarks and breadcrumb queries rather than DOM-style traversal).
As of WordPress 6.8 the class is still limited: it offers the same operations as WP_HTML_Tag_Processor plus navigation through breadcrumbs.
Important notes:
- Often it's better to use WP_HTML_Tag_Processor: it is much faster. For the difference, see the example below.
- You can create a maximum of 100 bookmarks; if you create more, the parser will throw an error.
- Works only with input strings in UTF-8 encoding.
- Stops parsing if it fails to parse HTML: upon encountering <table>, SVG/MathML, or a complex case where browsers need to "rearrange" nodes (fostering/adoption). This is done so that the class never "damages" the document.
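Because only UTF-8 input is accepted, a string in another encoding has to be converted before it reaches the processor. A minimal sketch, assuming the mbstring extension is available and the real source encoding is known; example_to_utf8() is a hypothetical helper, not part of WordPress:

```php
/**
 * Hypothetical helper: returns $html as UTF-8.
 * Assumes the caller knows the real source encoding.
 */
function example_to_utf8( string $html, string $source_encoding ): string {
	// Already valid UTF-8: nothing to convert.
	if ( mb_check_encoding( $html, 'UTF-8' ) ) {
		return $html;
	}

	return mb_convert_encoding( $html, 'UTF-8', $source_encoding );
}

// "caf\xE9" is "café" in ISO-8859-1 and is not valid UTF-8.
$utf8_html = example_to_utf8( "<p>caf\xE9</p>", 'ISO-8859-1' );
```

Note that a byte sequence which happens to be valid in both encodings is returned unchanged, so this check is a heuristic rather than real encoding detection.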
Difference from WP_HTML_Tag_Processor:
- WP_HTML_Processor: knows the nesting of the DOM, so it can find and modify tags by their position in the tree (add/remove classes and attributes). Stops if it encounters SVG, complex tables, or other unsupported elements to avoid damaging the markup. Heavier (roughly 10 times slower).
- WP_HTML_Tag_Processor: works as a streaming scanner. It sees only the current tag and its attributes, does not store the tree, and does not know the nesting. Ideal for quick checks, filtering, and reading or changing attributes and classes "on the fly" with minimal memory and CPU usage (roughly 10 times faster). It simply skips unclear areas. Structural changes such as removing a node are not available, and any change that depends on nesting requires tracking the nesting manually.
Usage
The constructor is not meant to be called directly; doing so triggers a _doing_it_wrong() notice. Create the object with the static method WP_HTML_Processor::create_fragment().
$processor = WP_HTML_Processor::create_fragment( $html, $context, $encoding );
if ( $processor && $processor->next_tag() ) {
	// do what is needed
	$html = $processor->get_updated_html();
}
- $html (string) (required)
  HTML fragment (or the entire document) to work with. It must be valid HTML5.
- $context (string)
  The name of the root tag within which the fragment is notionally parsed. Needed for checking the validity of nesting. Must be a valid top-level element; usually <body>.
  Default: '<body>'
- $encoding (string)
  The encoding of the input string. Only UTF-8 is supported, so text in any other encoding should be converted in advance.
  Default: 'UTF-8'
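Since create_fragment() can return null, it helps to wrap the call so that the original markup is returned whenever processing is not possible. A minimal sketch, assuming WordPress 6.4+ is loaded; example_add_class_at() is a hypothetical helper name:

```php
/**
 * Hypothetical helper: adds a class to the first tag matching a breadcrumb
 * path and falls back to the untouched input when no processor is created.
 */
function example_add_class_at( string $html, array $breadcrumbs, string $class_name ): string {
	$processor = WP_HTML_Processor::create_fragment( $html );

	// No processor: leave the markup exactly as it was.
	if ( null === $processor ) {
		return $html;
	}

	if ( $processor->next_tag( array( 'breadcrumbs' => $breadcrumbs ) ) ) {
		$processor->add_class( $class_name );
	}

	return $processor->get_updated_html();
}

$result = example_add_class_at(
	'<figure><img src="pic.jpg"></figure>',
	array( 'FIGURE', 'IMG' ),
	'responsive'
);
echo $result; // <figure><img class="responsive" src="pic.jpg"></figure>
```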
You can also create the object using the static method WP_HTML_Processor::create_full_parser(). The difference is:
- WP_HTML_Processor::create_fragment(): takes only the necessary fragment in the context of <body>. Ignores <!DOCTYPE>, <html>, <head>. Starts faster and uses less memory.
- WP_HTML_Processor::create_full_parser(): parses the entire page from <!DOCTYPE> to </html>. Both <head> and the root <html> are accessible. More expensive in terms of resources, but needed when editing meta tags, scripts, or sanitizing someone else's HTML before caching.
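To see what the full parser exposes, the sketch below walks a complete document and collects the breadcrumb path of every tag, including those in <head>. It assumes a WordPress version in which create_full_parser() is available (6.7 or later, to the best of my knowledge); the sample document is made up for illustration:

```php
// Assumption: WordPress with WP_HTML_Processor::create_full_parser() is loaded.
$html = '<!DOCTYPE html><html><head><title>Demo</title></head><body><p>Hi</p></body></html>';

$processor = WP_HTML_Processor::create_full_parser( $html );

$paths = array();
while ( $processor && $processor->next_tag() ) {
	// One line per opening tag, written as a "parent > child" chain.
	$paths[] = implode( ' > ', $processor->get_breadcrumbs() );
}

print_r( $paths );
```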
Breadcrumbs
Breadcrumbs are the list of all tags from the root of the document to the location of the processor's cursor.
The breadcrumbs reflect the nesting, so they can be represented as a CSS selector with > (direct nesting): HTML > BODY > DIV > FIGURE > IMG.
Why they are needed:
- Allow you to know "where we are" in the tree.
- Allow you to specify an exact path when searching: the processor will find the tag only if the entire chain of parents matches.
- Provide guarantees that the code will not "jump out" of the required container.
To get the current chain of tags where the cursor is now, use:
$path = $processor->get_breadcrumbs(); // will return an array ['HTML','BODY', …]
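A short sketch of how the breadcrumbs change while the cursor moves, assuming WordPress 6.4+ is loaded. Each path is joined with " > " to mirror the selector notation used above:

```php
$processor = WP_HTML_Processor::create_fragment( '<figure><img src="pic.jpg"></figure>' );

$paths = array();
while ( $processor && $processor->next_tag() ) {
	$paths[] = implode( ' > ', $processor->get_breadcrumbs() );
}

print_r( $paths );
/*
Array
(
    [0] => HTML > BODY > FIGURE
    [1] => HTML > BODY > FIGURE > IMG
)
*/
```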
Implicit tags
If you create a processor through create_fragment(), the fragment is automatically wrapped in virtual HTML and BODY elements: <html><body>…</body></html>. Therefore, the path of any tag begins with HTML and BODY.
At the same time, this does not interfere with short queries: you can specify only the necessary "tail" part of the chain.
Short and full chains
- ['IMG']: will match any <img> in the document.
- ['P','IMG']: will find an <img> that is directly in <p>.
- ['HTML','BODY','MAIN','UL','LI','A']: exact path, will exclude partial matches.
Examples:
- An image directly in figure:
// <figure><img src="pic.jpg"></figure>
$processor->next_tag( [ 'breadcrumbs' => [ 'FIGURE', 'IMG' ] ] ); // cursor will be on <img>
- Searching for <em> inside <figcaption>:
// <figure><figcaption>A <em>nice</em> day</figcaption></figure>
$processor->next_tag( [ 'breadcrumbs' => [ 'FIGURE', 'FIGCAPTION', 'EM' ] ] ); // cursor will be on <em>
- Searching for images that are directly in <body>:
// <div><img></div><img>
$processor->next_tag( [ 'breadcrumbs' => [ 'BODY', 'IMG' ] ] ); // will match only the second <img>
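The same path check can be made on the current tag while iterating over everything, using matches_breadcrumbs() from the method list. A minimal sketch, assuming WordPress 6.4+ is loaded; the class name in-figure is an arbitrary example:

```php
$html      = '<div><img src="a.jpg"></div><figure><img src="b.jpg"></figure>';
$processor = WP_HTML_Processor::create_fragment( $html );

while ( $processor && $processor->next_tag() ) {
	if ( 'IMG' !== $processor->get_tag() ) {
		continue;
	}

	// True only when the current <img> is a direct child of <figure>.
	if ( $processor->matches_breadcrumbs( array( 'FIGURE', 'IMG' ) ) ) {
		$processor->add_class( 'in-figure' );
	}
}

$updated = $processor ? $processor->get_updated_html() : $html;
echo $updated;
// <div><img src="a.jpg"></div><figure><img class="in-figure" src="b.jpg"></figure>
```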
Parser specifics
This class implements only part of the HTML5 specification. If an unsupported element is encountered in the HTML, the HTML Processor completely stops working. This strict measure ensures that the Processor does not break HTML that it does not fully understand.
Limitations:
- Does not output error messages
  If an unsupported construct is encountered, the processor stops silently: next_tag() returns false (and the creation method returns null if no processor could be created), with no errors, exceptions, or logs. The reason can be read through the method WP_HTML_Processor::get_last_error().
- Does not "merge" attributes from duplicate <html> / <body>
  In a regular browser parser, an extra <body> may pass its attributes on to the first one. This is not the case here: additional tags are simply ignored, and their attributes are lost.
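Because the processor stops silently, it is worth checking get_last_error() after the loop and discarding the edits when parsing did not reach the end. A minimal sketch, assuming WordPress 6.4+ is loaded; example_process_or_bail() and the 'not-created' label are hypothetical names used only here:

```php
/**
 * Hypothetical helper: lazy-loads images, but returns the original markup
 * together with the error code when the processor had to stop early.
 */
function example_process_or_bail( string $html ): array {
	$processor = WP_HTML_Processor::create_fragment( $html );
	if ( null === $processor ) {
		return array( 'html' => $html, 'error' => 'not-created' );
	}

	while ( $processor->next_tag() ) {
		if ( 'IMG' === $processor->get_tag() ) {
			$processor->set_attribute( 'loading', 'lazy' );
		}
	}

	// null: the whole input was parsed; a string names the reason for stopping.
	$error = $processor->get_last_error();
	if ( null !== $error ) {
		return array( 'html' => $html, 'error' => $error );
	}

	return array( 'html' => $processor->get_updated_html(), 'error' => null );
}

$result = example_process_or_bail( '<p><img src="pic.jpg"></p>' );
```

Returning the untouched input on error keeps a partially processed document from being saved.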
The following are NOT subject to processing:
- Any tags inside <table>.
- Elements of "foreign" content — SVG, MathML, etc.
- Nodes that are parsed outside of in body mode (<!DOCTYPE>, <meta>, <link>, etc.).
- If node relocation is required for correct structure (fostering/adoption rules). For example, if a <div> accidentally ended up inside a <table>, the browser would move it, but the processor will stop working.
The parser still understands:
- Omitted optional tags: <p>one<p>two.
- Unexpected closing tags: <p>one </span> more</p>.
- Non-void elements with the self-closing flag: <div/>text</div>.
- Headings closed by a heading tag of another level: <h1>Title </h2>.
- Text resembling tags inside elements: <title>The <img> is plaintext</title>.
- Code in <script> / <style> containing pseudo-HTML: <script>document.write('<p>Hi</p>');</script>.
- Escaped scripts: <script><!-- document.write('<script>console.log("hi")</script>') --></script>
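The first case in the list can be observed through the breadcrumbs: the second <p> implicitly closes the first one instead of nesting inside it. A minimal sketch, assuming WordPress 6.4+ is loaded:

```php
$processor = WP_HTML_Processor::create_fragment( '<p>one<p>two' );

$paths = array();
while ( $processor && $processor->next_tag() ) {
	$paths[] = implode( ' > ', $processor->get_breadcrumbs() );
}

print_r( $paths );
/*
Array
(
    [0] => HTML > BODY > P
    [1] => HTML > BODY > P
)
*/
```

A scanner without nesting rules would have no way to tell that the two paragraphs are siblings.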
Examples
#1 Adding a class to an image inside a figure
The task is to add a class to <img> inside <figure>.
$html = '
<img src="pic2.jpg">
<figure><img src="pic.jpg"></figure>
';
$pr = WP_HTML_Processor::create_fragment( $html );
if ( $pr && $pr->next_tag( [ 'breadcrumbs' => [ 'FIGURE', 'IMG' ] ] ) ) {
	$pr->add_class( 'responsive' );
}
// create_fragment() may return null, so fall back to the original markup
echo $pr ? $pr->get_updated_html() : $html;
/*
<img src="pic2.jpg">
<figure><img class="responsive" src="pic.jpg"></figure>
*/
This task can also be solved using WP_HTML_Tag_Processor, but in this case, you will have to manually check that IMG is indeed inside FIGURE.
$html = '
<img src="pic2.jpg">
<figure><img src="pic.jpg"></figure>
';
$pr = new WP_HTML_Tag_Processor( $html );
$inside_figure = false;
while ( $pr->next_tag( [ 'tag_closers' => 'visit' ] ) ) {
	// Entered <figure>
	if ( ! $pr->is_tag_closer() && 'FIGURE' === $pr->get_tag() ) {
		$inside_figure = true;
	}
	// Exited </figure>
	if ( $pr->is_tag_closer() && 'FIGURE' === $pr->get_tag() ) {
		$inside_figure = false;
	}
	// If we are still inside FIGURE and this is <img>, add class
	if ( $inside_figure && ! $pr->is_tag_closer() && 'IMG' === $pr->get_tag() ) {
		$pr->add_class( 'responsive' );
	}
}
echo $pr->get_updated_html();
/*
<img src="pic2.jpg">
<figure><img class="responsive" src="pic.jpg"></figure>
*/
An example of what will happen if you do not check the nesting:
$html = '
<img src="pic2.jpg">
<figure><img src="pic.jpg"></figure>
';
$pr = new WP_HTML_Tag_Processor( $html );
while ( $pr->next_tag( [ 'tag_name' => 'IMG' ] ) ) {
	$pr->add_class( 'responsive' );
}
echo $pr->get_updated_html();
/*
<img class="responsive" src="pic2.jpg">
<figure><img class="responsive" src="pic.jpg"></figure>
*/
#2 Removing the style attribute from all links
$html = '
<a href="https://example.com" style="color:red;">Example</a>
<p>
	<a href="#top" style="text-decoration:none;">Back to top</a>
</p>
';
$pr = WP_HTML_Processor::create_fragment( $html );
while ( $pr && $pr->next_tag( [ 'tag_name' => 'A' ] ) ) {
	$pr->remove_attribute( 'style' );
}
// create_fragment() may return null, so fall back to the original markup
echo $pr ? $pr->get_updated_html() : $html;
/*
<a href="https://example.com" >Example</a>
<p>
	<a href="#top" >Back to top</a>
</p>
*/
This is just an example. In practice, such a simple task is better solved using WP_HTML_Tag_Processor, which is roughly 10 times faster. To do this, you just need to replace the processor:
// replace
$pr = WP_HTML_Processor::create_fragment( $html );
// with
$pr = new WP_HTML_Tag_Processor( $html );
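The speed difference depends on the markup and the PHP version, so it is better measured than assumed. A rough timing sketch, assuming WordPress 6.4+ and PHP 7.3+ (for hrtime()) are available; example_time_ns() and the generated test markup are illustrative only:

```php
/**
 * Hypothetical helper: average run time of $task in nanoseconds.
 */
function example_time_ns( callable $task, int $runs = 100 ): int {
	$start = hrtime( true );
	for ( $i = 0; $i < $runs; $i++ ) {
		$task();
	}
	return intdiv( hrtime( true ) - $start, $runs );
}

// 200 repeated links, purely synthetic input.
$html = str_repeat( '<p><a href="#top" style="color:red;">Top</a></p>', 200 );

$tag_processor_ns = example_time_ns(
	function () use ( $html ) {
		$pr = new WP_HTML_Tag_Processor( $html );
		while ( $pr->next_tag( array( 'tag_name' => 'A' ) ) ) {
			$pr->remove_attribute( 'style' );
		}
		$pr->get_updated_html();
	}
);

$html_processor_ns = example_time_ns(
	function () use ( $html ) {
		$pr = WP_HTML_Processor::create_fragment( $html );
		while ( $pr && $pr->next_tag() ) {
			if ( 'A' === $pr->get_tag() ) {
				$pr->remove_attribute( 'style' );
			}
		}
		if ( $pr ) {
			$pr->get_updated_html();
		}
	}
);

printf( "Tag processor:  %d ns\nHTML processor: %d ns\n", $tag_processor_ns, $html_processor_ns );
```

The numbers are only comparable within one environment; they are not a benchmark of either class in general.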
#3 Checking that the tag matches the chain "DIV > P > IMG"
$pr = WP_HTML_Processor::create_fragment( $html );
if ( $pr && $pr->next_tag( [ 'breadcrumbs' => [ 'DIV', 'P', 'IMG' ] ] ) ) {
	// found an image inside a paragraph in DIV
}
#4 Example of navigating through a tree
This example demonstrates navigation through the tree up, down, and sideways.
WP_HTML_Processor can "jump" through the tree (down, sideways, and back), but it does this not like a classic DOM walker; it relies on two techniques:
- Breadcrumbs filter (next_tag()): allows you to go down and sideways.
- Bookmarks (set_bookmark() + seek()): allow you to go up, that is, "remember a point" and return to it.
Methods for navigating through the tree:
- next_tag(…) — iterates through tags forward; you can limit the path through breadcrumbs, thereby "going" deeper or staying at the current level.
- set_bookmark( $name ) — sets a bookmark at the current point.
- seek( $name ) — returns the cursor to a previously set bookmark.
Example — down, sideways, and back up:
$html = '
<figure>
	<img src="pic.jpg">
	<figcaption>A <em>nice</em> caption</figcaption>
</figure>
';
$p = WP_HTML_Processor::create_fragment( $html );

// 1. Set a bookmark on FIGURE (return point "up")
if ( $p && $p->next_tag( 'figure' ) && $p->set_bookmark( 'figure' ) ) {

	// 2. Go DOWN and look for FIGCAPTION
	if ( $p->next_tag( [ 'breadcrumbs' => [ 'FIGURE', 'FIGCAPTION' ] ] ) ) {

		// Go one level deeper: find the <em> inside the caption
		if ( $p->next_tag( [ 'breadcrumbs' => [ 'FIGURE', 'FIGCAPTION', 'EM' ] ] ) ) {
			$p->add_class( 'highlight' );
		}

		// 3. Jump UP to the original <figure>
		if ( $p->seek( 'figure' ) ) {
			$p->add_class( 'has-caption' );
		}
	}

	echo $p->get_updated_html();
}
#5 Closing tags (</…>) are skipped by default
next_tag() by default stops only at opening tags.
To have the cursor also stop at </…>, you need to specify the option 'tag_closers' => 'visit' (the value 'skip' or an empty value means "skip").
$html = '<div><span>Text</span></div>';
$p = WP_HTML_Processor::create_fragment( $html );
/*
 * Traverse the document, STOPPING at closing tags as well.
 * The argument 'tag_closers' => 'visit' explicitly states this.
 */
while ( $p && $p->next_tag( [ 'tag_closers' => 'visit' ] ) ) {
	printf( "%s%s\n", $p->get_tag(), $p->is_tag_closer() ? ' (closer)' : '' );
}
/*
Result:
DIV
SPAN
SPAN (closer)
DIV (closer)
*/
Why WordPress has its own parser
- Safe server-side editing.
  Gutenberg blocks and plugins increasingly require inserting <picture>, wrapping nodes, etc.; regex breaks on malformed HTML.
- Zero external dependencies.
  Pure PHP, works even on minimal hosting without libxml, tidy, and others.
- "Do no harm" principle.
  Only a safe subset of HTML5 is supported; in controversial constructs, the parser stops instead of damaging the document.
- Unified HTML API.
  After the fast WP_HTML_Tag_Processor (6.2), a tool was needed that could navigate the tree and edit content: this is WP_HTML_Processor (6.4).
Alternatives
Why the following alternatives did not work:
- Regex and string functions
  Pros: zero dependencies.
  Not suitable: break on nested tags, scripts, and comments; code is hard to maintain.
- PHP DOMDocument (libxml extension)
  Pros: full DOM, XPath, part of the standard PHP extension.
  Not suitable: requires an extension on the server, consumes a lot of memory, poorly handles broken HTML, and does not know HTML5 rules.
- PHP tidy (tidy extension)
  Pros: automatically fixes markup.
  Not suitable: rarely installed on hosting, LGPL license.
- Masterminds HTML5-PHP
  Pros: full modern HTML5 parser.
  Not suitable: large library (~450 KB), slower; external dependency complicates security and Core updates.
- Symfony DomCrawler / QueryPath / DiDom
  Pros: convenient API with CSS selectors.
  Not suitable: carry a full DOM and a number of external packages; for one function, it would require connecting half a framework.
Methods (commonly used)
Creating a processor:
- create_fragment() — parses an HTML fragment inside <body>.
- create_full_parser() — parses the full document from <!DOCTYPE> to </html>.
Tree navigation:
- next_tag() — move to the next tag (with filters).
- set_bookmark() — set a bookmark.
- seek() — return to a bookmark.
- get_breadcrumbs() — get the parent chain of the current node.
Reading information about the current node:
- get_tag() — the name of the tag under the cursor.
- is_tag_closer() — whether the current tag is a closing tag.
- get_attribute() — the value of a specific attribute.
- has_class() — check for the presence of a class.
- class_list() — get the list of classes.
Modifying a node:
- add_class() — add a CSS class.
- remove_class() — remove a CSS class.
- set_attribute() — change an attribute.
- remove_attribute() — remove an attribute.
Diagnostics:
- get_last_error() — find out why the parser stopped or returned null.
Methods (all)
- public static create_fragment( $html, $context = '<body>', $encoding = 'UTF-8' )
- public static create_full_parser( $html, $known_definite_encoding = 'UTF-8' )
- public __construct( $html, $use_the_static_create_methods_instead = null )
- public get_last_error()
- public get_unsupported_exception()
- public next_tag( $query = null )
- public next_token()
- public is_tag_closer()
- public matches_breadcrumbs( $breadcrumbs )
- public expects_closer( ?WP_HTML_Token $node = null )
- public step( $node_to_process = self::PROCESS_NEXT_NODE )
- public get_breadcrumbs()
- public get_current_depth()
- public static normalize( string $html )
- public serialize()
- public get_namespace()
- public get_tag()
- public has_self_closing_flag()
- public get_token_name()
- public get_token_type()
- public get_attribute( $name )
- public set_attribute( $name, $value )
- public remove_attribute( $name )
- public get_attribute_names_with_prefix( $prefix )
- public add_class( $class_name )
- public remove_class( $class_name )
- public has_class( $wanted_class )
- public class_list()
- public get_modifiable_text()
- public get_comment_type()
- public release_bookmark( $bookmark_name )
- public seek( $bookmark_name )
- public set_bookmark( $bookmark_name )
- public has_bookmark( $bookmark_name )
- public static is_special( $tag_name )
- public static is_void( $tag_name )
- private create_fragment_at_current_node( string $html )
- private bail( string $message )
- private next_visitable_token()
- private is_virtual()
- protected serialize_token()
- private step_initial()
- private step_before_html()
- private step_before_head()
- private step_in_head()
- private step_in_head_noscript()
- private step_after_head()
- private step_in_body()
- private step_in_table()
- private step_in_table_text()
- private step_in_caption()
- private step_in_column_group()
- private step_in_table_body()
- private step_in_row()
- private step_in_cell()
- private step_in_select()
- private step_in_select_in_table()
- private step_in_template()
- private step_after_body()
- private step_in_frameset()
- private step_after_frameset()
- private step_after_after_body()
- private step_after_after_frameset()
- private step_in_foreign_content()
- private bookmark_token()
- private close_a_p_element()
- private generate_implied_end_tags( ?string $except_for_this_element = null )
- private generate_implied_end_tags_thoroughly()
- private get_adjusted_current_node()
- private reconstruct_active_formatting_elements()
- private reset_insertion_mode_appropriately()
- private run_adoption_agency_algorithm()
- private close_cell()
- private insert_html_element( WP_HTML_Token $token )
- private insert_foreign_element( WP_HTML_Token $token, bool $only_add_to_element_stack )
- private insert_virtual_node( $token_name, $bookmark_name = null )
- private is_mathml_integration_point()
- private is_html_integration_point()
- protected static get_encoding( string $label )
Changelog
Since 6.4.0 | Introduced. |