WP_HTML_Tag_Processor::get_modifiable_text() public WP 6.5.0

Returns the modifiable text for a matched token, or an empty string.

Modifiable text is text content that may be read and changed without changing the HTML structure of the document around it. This includes the contents of #text nodes in the HTML as well as the inner contents of HTML comments, Processing Instructions, and others, even though these nodes aren't part of a parsed DOM tree. It also includes the contents of SCRIPT and STYLE tags, of TEXTAREA tags, and of any other section in an HTML document which cannot contain HTML markup (DATA).

If a token has no modifiable text, an empty string is returned to avoid needless crashing or type errors. An empty string therefore does not indicate whether a token has modifiable text: a token without modifiable text returns one, and so may a token with modifiable text (e.g. a comment with no contents).
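To make the description above concrete, here is a minimal sketch that walks a document token by token and records each token's name alongside its modifiable text. It assumes WordPress 6.5 or later is loaded; the helper name `sketch_collect_modifiable_text()` is hypothetical and not part of WordPress.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): lists each token's name
 * together with its modifiable text. Assumes WordPress 6.5+ is loaded.
 *
 * @param string $html Input HTML.
 * @return array[] List of array( token name, modifiable text ) pairs.
 */
function sketch_collect_modifiable_text( string $html ): array {
	$processor = new WP_HTML_Tag_Processor( $html );
	$tokens    = array();

	// next_token() visits tags, text nodes, comments, and other tokens.
	while ( $processor->next_token() ) {
		// Tokens without modifiable text, such as an opening P tag, yield ''.
		$tokens[] = array(
			$processor->get_token_name(),
			$processor->get_modifiable_text(),
		);
	}

	return $tokens;
}
```

Because an empty string is returned both for tokens without modifiable text and for tokens whose text is empty, code that needs to tell the two apart should probably check the token name or type as well as the returned string.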

Limitations:

  • This function will not strip the leading newline appropriately
    after seeking into a LISTING or PRE element. To ensure that the
    newline is treated properly, seek to the LISTING or PRE opening
    tag instead of to the first text node inside the element.
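The workaround described in this limitation could look roughly like the following sketch: bookmark the PRE opening tag, and when returning later, seek to that bookmark and then advance to the text node. It assumes WordPress 6.5 or later is loaded; the helper name and the bookmark name `pre_open` are hypothetical.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): returns the text of the
 * first token inside the first PRE element when it is a text node,
 * after seeking back to the PRE opening tag via a bookmark.
 */
function sketch_pre_text_via_bookmark( string $html ): ?string {
	$processor = new WP_HTML_Tag_Processor( $html );

	// Bookmark the opening tag itself, not the text node inside it.
	if ( ! $processor->next_tag( 'PRE' ) || ! $processor->set_bookmark( 'pre_open' ) ) {
		return null;
	}

	// ... the processor may be moved elsewhere in the document here ...

	// Seek to the opening tag so the processor re-reads it, then step
	// forward to the first token inside the element.
	if ( ! $processor->seek( 'pre_open' ) || ! $processor->next_token() ) {
		return null;
	}

	return '#text' === $processor->get_token_name()
		? $processor->get_modifiable_text()
		: null;
}
```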

Method of the class: WP_HTML_Tag_Processor{}

No Hooks.

Return

String. The modifiable text of the matched token, or an empty string.

Usage

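$processor = new WP_HTML_Tag_Processor( '<p>Some text</p>' );
$processor->next_token(); // Matches the P opening tag.
$processor->next_token(); // Matches the text node inside it.
$text = $processor->get_modifiable_text(); // Returns a string.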

Changelog

Since 6.5.0 Introduced.
Since 6.7.0 Replaces NULL bytes (U+0000) and newlines appropriately.

WP_HTML_Tag_Processor::get_modifiable_text() code WP 6.7.1

public function get_modifiable_text(): string {
	$has_enqueued_update = isset( $this->lexical_updates['modifiable text'] );

	if ( ! $has_enqueued_update && ( null === $this->text_starts_at || 0 === $this->text_length ) ) {
		return '';
	}

	$text = $has_enqueued_update
		? $this->lexical_updates['modifiable text']->text
		: substr( $this->html, $this->text_starts_at, $this->text_length );

	/*
	 * Pre-processing the input stream would normally happen before
	 * any parsing is done, but deferring it means it's possible to
	 * skip in most cases. When getting the modifiable text, however
	 * it's important to apply the pre-processing steps, which is
	 * normalizing newlines.
	 *
	 * @see https://html.spec.whatwg.org/#preprocessing-the-input-stream
	 * @see https://infra.spec.whatwg.org/#normalize-newlines
	 */
	$text = str_replace( "\r\n", "\n", $text );
	$text = str_replace( "\r", "\n", $text );

	// Comment data is not decoded.
	if (
		self::STATE_CDATA_NODE === $this->parser_state ||
		self::STATE_COMMENT === $this->parser_state ||
		self::STATE_DOCTYPE === $this->parser_state ||
		self::STATE_FUNKY_COMMENT === $this->parser_state
	) {
		return str_replace( "\x00", "\u{FFFD}", $text );
	}

	$tag_name = $this->get_token_name();
	if (
		// Script data is not decoded.
		'SCRIPT' === $tag_name ||

		// RAWTEXT data is not decoded.
		'IFRAME' === $tag_name ||
		'NOEMBED' === $tag_name ||
		'NOFRAMES' === $tag_name ||
		'STYLE' === $tag_name ||
		'XMP' === $tag_name
	) {
		return str_replace( "\x00", "\u{FFFD}", $text );
	}

	$decoded = WP_HTML_Decoder::decode_text_node( $text );

	/*
	 * Skip the first line feed after LISTING, PRE, and TEXTAREA opening tags.
	 *
	 * Note that this first newline may come in the form of a character
	 * reference, such as `&#x0a;`, and so it's important to perform
	 * this transformation only after decoding the raw text content.
	 */
	if (
		( "\n" === ( $decoded[0] ?? '' ) ) &&
		( ( $this->skip_newline_at === $this->token_starts_at && '#text' === $tag_name ) || 'TEXTAREA' === $tag_name )
	) {
		$decoded = substr( $decoded, 1 );
	}

	/*
	 * Only in normative text nodes does the NULL byte (U+0000) get removed.
	 * In all other contexts it's replaced by the replacement character (U+FFFD)
	 * for security reasons (to avoid joining together strings that were safe
	 * when separated, but not when joined).
	 *
	 * @todo Inside HTML integration points and MathML integration points, the
	 *       text is processed according to the insertion mode, not according
	 *       to the foreign content rules. This should strip the NULL bytes.
	 */
	return ( '#text' === $tag_name && 'html' === $this->get_namespace() )
		? str_replace( "\x00", '', $decoded )
		: str_replace( "\x00", "\u{FFFD}", $decoded );
}
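
The two post-processing steps in the method above, newline normalization and NULL byte handling, can be illustrated in isolation. The following is a minimal standalone sketch in plain PHP that does not depend on WordPress; the function names are hypothetical, and the `$is_html_text_node` flag stands in for the method's check of the token name and namespace.

```php
<?php
/**
 * Hypothetical helper: normalizes newlines the way the method above does.
 * CRLF pairs are replaced first, so each pair collapses to a single LF
 * instead of becoming two.
 */
function sketch_normalize_newlines( string $text ): string {
	$text = str_replace( "\r\n", "\n", $text );
	return str_replace( "\r", "\n", $text );
}

/**
 * Hypothetical helper: removes NULL bytes in HTML-namespace text nodes
 * and replaces them with U+FFFD in every other context.
 */
function sketch_handle_null_bytes( string $text, bool $is_html_text_node ): string {
	return $is_html_text_node
		? str_replace( "\x00", '', $text )
		: str_replace( "\x00", "\u{FFFD}", $text );
}
```

The order of the two `str_replace()` calls in the first helper matters: replacing lone `\r` characters first would turn every `\r\n` pair into two line feeds.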