wp_check_invalid_utf8()
Checks a string for invalid UTF-8 byte sequences.
The check only runs when the site charset, get_option( 'blog_charset' ), is UTF-8 (written as utf8, utf-8, UTF8, or UTF-8). For any other charset the text is returned unchanged.
For testing, there is a special text with invalid UTF-8 characters. It contains examples of many UTF-8 violations, including single leading bytes, missing continuation bytes, excessively long sequences, etc.
Uses: is_utf8_charset()
Used By: sanitize_text_field(), esc_html()
No Hooks.
Returns
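String. The text unchanged if it is valid UTF-8 or the site charset is not UTF-8. An empty string if the text is invalid and $strip is false. The text with invalid byte sequences replaced if it is invalid and $strip is true.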
Usage
wp_check_invalid_utf8( $text, $strip );
- $text(string) (required)
- Text to be checked.
- $strip(true/false)
- Whether to attempt to remove invalid UTF-8 characters. Since 6.9.0, invalid byte sequences are replaced with U+FFFD instead.
Default: false
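The three possible outcomes (unchanged, empty string, scrubbed) can be sketched outside WordPress. This is a minimal stand-in, not the WordPress implementation: demo_check_invalid_utf8() is a hypothetical name, it skips the site charset check, and it assumes the mbstring extension is loaded. The exact number of replacement characters it produces may differ from what WordPress returns.

```php
<?php
// Hypothetical stand-in for illustration only; NOT the WordPress implementation.
// Assumes the mbstring extension is available and ignores the site charset.
function demo_check_invalid_utf8( string $text, bool $strip = false ): string {
	if ( '' === $text || mb_check_encoding( $text, 'UTF-8' ) ) {
		return $text; // Empty or valid input comes back unchanged.
	}

	if ( ! $strip ) {
		return ''; // Invalid input without stripping yields an empty string.
	}

	// Replace invalid bytes with U+FFFD, then restore the previous substitute character.
	$previous = mb_substitute_character();
	mb_substitute_character( 0xFFFD );
	$scrubbed = mb_scrub( $text, 'UTF-8' );
	mb_substitute_character( $previous );

	return $scrubbed;
}

$valid   = "caf\xc3\xa9"; // "café"
$invalid = "caf\xc3\x28"; // broken 2-octet sequence

var_dump( demo_check_invalid_utf8( $valid ) === $valid );  // bool(true)
var_dump( demo_check_invalid_utf8( $invalid ) );           // string(0) ""
var_dump( mb_check_encoding( demo_check_invalid_utf8( $invalid, true ), 'UTF-8' ) ); // bool(true)
```

Passing true for the second argument keeps the readable part of the text, which is usually preferable when the result is shown to a user rather than rejected outright.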
Examples
#1 Checking the function
$examples = [
	'Valid ASCII' => "a",
	'Valid 2 Octet Sequence' => "\xc3\xb1",
	'Valid 3 Octet Sequence' => "\xe2\x82\xa1",
	'Valid 4 Octet Sequence' => "\xf0\x90\x8c\xbc",
	'Valid 5 Octet Sequence (but not Unicode!)' => "\xf8\xa1\xa1\xa1\xa1",
	'Valid 6 Octet Sequence (but not Unicode!)' => "\xfc\xa1\xa1\xa1\xa1\xa1",
	'Invalid 2 Octet Sequence' => "\xc3\x28",
	'Invalid Sequence Identifier' => "\xa0\xa1",
	'Invalid 3 Octet Sequence (in 2nd Octet)' => "\xe2\x28\xa1",
	'Invalid 3 Octet Sequence (in 3rd Octet)' => "\xe2\x82\x28",
	'Invalid 4 Octet Sequence (in 2nd Octet)' => "\xf0\x28\x8c\xbc",
	'Invalid 4 Octet Sequence (in 3rd Octet)' => "\xf0\x90\x28\xbc",
	'Invalid 4 Octet Sequence (in 4th Octet)' => "\xf0\x90\x8c\x28",
];
$result = [];
foreach ( $examples as $key => $value ) {
	// $strip is left at its default (false), so invalid input becomes ''.
	$result[ $key ] = wp_check_invalid_utf8( $value );
}
var_dump( $result );
Result:
array(13) {
["Valid ASCII"]=> string(1) "a"
["Valid 2 Octet Sequence"]=> string(2) "ñ"
["Valid 3 Octet Sequence"]=> string(3) "₡"
["Valid 4 Octet Sequence"]=> string(4) "𐌼"
["Valid 5 Octet Sequence (but not Unicode!)"]=> string(0) ""
["Valid 6 Octet Sequence (but not Unicode!)"]=> string(0) ""
["Invalid 2 Octet Sequence"]=> string(0) ""
["Invalid Sequence Identifier"]=> string(0) ""
["Invalid 3 Octet Sequence (in 2nd Octet)"]=> string(0) ""
["Invalid 3 Octet Sequence (in 3rd Octet)"]=> string(0) ""
["Invalid 4 Octet Sequence (in 2nd Octet)"]=> string(0) ""
["Invalid 4 Octet Sequence (in 3rd Octet)"]=> string(0) ""
["Invalid 4 Octet Sequence (in 4th Octet)"]=> string(0) ""
}
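The 5 and 6 octet samples come back empty because current UTF-8 (RFC 3629) stops at four bytes, so 0xF8 and 0xFC can never start a sequence. The sketch below classifies a lead byte only. demo_utf8_sequence_length() is a hypothetical helper written for this page, not a WordPress function, and a full validator must also check every continuation byte.

```php
<?php
// Hypothetical helper: expected sequence length for the first byte of $bytes,
// or 0 when that byte can never begin a valid UTF-8 sequence.
function demo_utf8_sequence_length( string $bytes ): int {
	if ( '' === $bytes ) {
		return 0;
	}

	$lead = ord( $bytes ); // ord() reads the first byte only.

	if ( $lead <= 0x7F ) {
		return 1; // ASCII
	}
	if ( $lead >= 0xC2 && $lead <= 0xDF ) {
		return 2;
	}
	if ( $lead >= 0xE0 && $lead <= 0xEF ) {
		return 3;
	}
	if ( $lead >= 0xF0 && $lead <= 0xF4 ) {
		return 4;
	}

	// 0x80-0xBF are continuation bytes, 0xC0-0xC1 only encode overlong forms,
	// and 0xF5-0xFF fall outside the Unicode range.
	return 0;
}

$samples = [ "a", "\xc3\xb1", "\xe2\x82\xa1", "\xf0\x90\x8c\xbc", "\xf8\xa1\xa1\xa1\xa1", "\xa0\xa1" ];

foreach ( $samples as $sample ) {
	printf( "%s => %d\n", bin2hex( $sample ), demo_utf8_sequence_length( $sample ) );
}
// 61 => 1
// c3b1 => 2
// e282a1 => 3
// f0908cbc => 4
// f8a1a1a1a1 => 0
// a0a1 => 0
```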
Changelog
| Since 2.8.0 | Introduced. |
| Since 6.9.0 | Stripping replaces invalid byte sequences with the Unicode replacement character U+FFFD (�). |
wp_check_invalid_utf8() code WP 6.9
function wp_check_invalid_utf8( $text, $strip = false ) {
	$text = (string) $text;
	if ( 0 === strlen( $text ) ) {
		return '';
	}
	// Store the site charset as a static to avoid multiple calls to get_option().
	static $is_utf8 = null;
	if ( ! isset( $is_utf8 ) ) {
		$is_utf8 = is_utf8_charset();
	}
	if ( ! $is_utf8 || wp_is_valid_utf8( $text ) ) {
		return $text;
	}
	return $strip
		? wp_scrub_utf8( $text )
		: '';
}
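The static variable in the function above means the charset lookup runs once per request, and later calls reuse the stored answer. A minimal standalone sketch of that pattern follows. Demo_Charset_Lookup and demo_is_utf8_site() are hypothetical names standing in for the option lookup, and the charset value is hard-coded for the demonstration.

```php
<?php
// Hypothetical stand-in for an expensive lookup such as reading a site option.
final class Demo_Charset_Lookup {
	public static int $calls = 0;

	public static function get(): string {
		self::$calls++; // Count how often the "expensive" lookup really runs.
		return 'UTF-8';
	}
}

function demo_is_utf8_site(): bool {
	// Initialised once; the stored value survives between calls.
	static $is_utf8 = null;

	if ( ! isset( $is_utf8 ) ) {
		$is_utf8 = 0 === strcasecmp( 'UTF-8', Demo_Charset_Lookup::get() );
	}

	return $is_utf8;
}

demo_is_utf8_site();
demo_is_utf8_site();
demo_is_utf8_site();

echo Demo_Charset_Lookup::$calls, "\n"; // 1
```

One consequence of this design is that a charset changed in the middle of a request would not be picked up until the next request.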