Introduce Token Map: An optimized static translation class.

This patch introduces a new class: `WP_Token_Map`, designed for efficient
lookup and translation of static mappings between string keys or tokens, and
string replacements (for example, HTML character references).

The Token Map imposes certain restrictions on the byte length of the lookup
tokens and their replacements, but is a highly-optimized data structure for
mappings with a very high number of tokens.
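
The byte-length restriction follows from the packing used for grouped tokens, where each word and each replacement is prefixed by a single length byte. The standalone sketch below is not part of the patch; `pack_entry()` and `unpack_entries()` are hypothetical helper names, used only to illustrate why a one-byte prefix caps each segment at 255 bytes.

```php
<?php
/**
 * Illustrative only: packs one (rest-of-key, value) pair the way a group
 * string is described in this patch, using one unsigned byte per length.
 */
function pack_entry( string $rest_of_key, string $value ): string {
	if ( strlen( $rest_of_key ) > 255 || strlen( $value ) > 255 ) {
		throw new InvalidArgumentException( 'A single length byte cannot represent more than 255.' );
	}

	return pack( 'C', strlen( $rest_of_key ) ) . $rest_of_key
		. pack( 'C', strlen( $value ) ) . $value;
}

/**
 * Illustrative only: walks a packed group string and returns its pairs.
 */
function unpack_entries( string $group ): array {
	$entries = array();
	$at      = 0;
	$end     = strlen( $group );

	while ( $at < $end ) {
		// One byte says how long the rest of the key is.
		$key_length = ord( $group[ $at++ ] );
		$key        = substr( $group, $at, $key_length );
		$at        += $key_length;

		// One byte says how long the replacement value is.
		$value_length    = ord( $group[ $at++ ] );
		$entries[ $key ] = substr( $group, $at, $value_length );
		$at             += $value_length;
	}

	return $entries;
}

// Group "Ce" holding `CenterDot;` and `Cedilla;`, minus the shared prefix.
$group = pack_entry( 'nterDot;', '·' ) . pack_entry( 'dilla;', '¸' );
var_dump( unpack_entries( $group ) === array( 'nterDot;' => '·', 'dilla;' => '¸' ) );
```

Storing lengths in one byte keeps the walk over a group cheap, at the cost of rejecting any token or replacement of 256 bytes or more.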

Developed in https://github.com/WordPress/wordpress-develop/pull/5373
Discussed in https://core.trac.wordpress.org/ticket/60698

Fixes #60698.
Props: dmsnell, gziolo, jonsurrell, jorbin.


git-svn-id: https://develop.svn.wordpress.org/trunk@58188 602fd350-edb4-49c9-b593-d223f7449a82
This commit is contained in:
Dennis Snell
2024-05-23 19:54:17 +00:00
parent 932895b827
commit bcd25b14ec
7 changed files with 4891 additions and 0 deletions


@@ -0,0 +1,818 @@
<?php
/**
* Class for efficiently looking up and mapping string keys to string values, with limits.
*
* @package WordPress
* @since 6.6.0
*/
/**
* WP_Token_Map class.
*
* Use this class in specific circumstances with a static set of lookup keys which map to
* a static set of transformed values. For example, this class is used to map HTML named
* character references to their equivalent UTF-8 values.
*
* This class works differently than code calling `in_array()` and other methods. It
* internalizes lookup logic and provides helper interfaces to optimize lookup and
* transformation. It provides a method for precomputing the lookup tables and storing
* them as PHP source code.
*
* All tokens and substitutions must be shorter than 256 bytes.
*
* Example:
*
* $smilies = WP_Token_Map::from_array( array(
* '8O' => '😯',
* ':(' => '🙁',
* ':)' => '🙂',
* ':?' => '😕',
* ) );
*
* true === $smilies->contains( ':)' );
* false === $smilies->contains( 'simile' );
*
* '😕' === $smilies->read_token( 'Not sure :?.', 9, $length_of_smily_syntax );
* 2 === $length_of_smily_syntax;
*
* ## Precomputing the Token Map.
*
* Creating the class involves some work sorting and organizing the tokens and their
* replacement values. In order to skip this, it's possible for the class to export
* its state and be used as actual PHP source code.
*
* Example:
*
* // Export with four spaces as the indent, only for the sake of this docblock.
* // The default indent is a tab character.
* $indent = ' ';
* echo $smilies->precomputed_php_source_table( $indent );
*
* // Output, to be pasted into a PHP source file:
* WP_Token_Map::from_precomputed_table(
* array(
* "storage_version" => "6.6.0-trunk",
* "key_length" => 2,
* "groups" => "",
* "large_words" => array(),
* "small_words" => "8O\x00:(\x00:)\x00:?\x00",
* "small_mappings" => array( "😯", "🙁", "🙂", "😕" )
* )
* );
*
* ## Large vs. small words.
*
* This class uses a short prefix called the "key" to optimize lookup of its tokens.
* This means that some tokens may be shorter than or equal in length to that key.
* Those words that are longer than the key are called "large" while those shorter
* than or equal to the key length are called "small."
*
* This separation of large and small words is incidental to the way this class
* optimizes lookup, and should be considered an internal implementation detail
* of the class. It may still be important to be aware of it, however.
*
* ## Determining Key Length.
*
* The choice of the size of the key length should be based on the data being stored in
* the token map. It should divide the data as evenly as possible, but should not create
* so many groups that a large fraction of the groups only contain a single token.
*
* For the HTML5 named character references, a key length of 2 was found to provide a
* sufficient spread and should be a good default for relatively large sets of tokens.
*
* However, for some data sets this might be too long. For example, a list of smilies
* may be too small for a key length of 2. Perhaps 1 would be more appropriate. It's
* best to experiment and determine empirically which values are appropriate.
*
* ## Generate Pre-Computed Source Code.
*
* Since the `WP_Token_Map` is designed for relatively static lookups, it can be
* advantageous to precompute the values and instantiate a table that has already
* sorted and grouped the tokens and built the lookup strings.
*
* This can be done with `WP_Token_Map::precomputed_php_source_table()`.
*
* Note that if there is a leading character that all tokens need, such as `&` for
* HTML named character references, it can be beneficial to exclude this from the
* token map. Instead, find occurrences of the leading character and then use the
* token map to see if the following characters complete the token.
*
* Example:
*
* $map = WP_Token_Map::from_array( array( 'simple_smile:' => '🙂', 'sob:' => '😭', 'soba:' => '🍜' ) );
* echo $map->precomputed_php_source_table();
* // Output
* WP_Token_Map::from_precomputed_table(
* array(
* "storage_version" => "6.6.0-trunk",
* "key_length" => 2,
* "groups" => "si\x00so\x00",
* "large_words" => array(
* // simple_smile:[🙂].
* "\x0bmple_smile:\x04🙂",
* // soba:[🍜] sob:[😭].
* "\x03ba:\x04🍜\x02b:\x04😭",
* ),
* "small_words" => "",
* "small_mappings" => array()
* )
* );
*
* This precomputed value can be stored directly in source code and will skip the
* startup cost of generating the lookup strings. See `$html5_named_character_references`.
*
* Note that any updates to the precomputed format should update the storage version
* constant. It would also be best to provide an update function to take older known
* versions and upgrade them in place when loading into `from_precomputed_table()`.
*
* ## Future Direction.
*
* It may be viable to dynamically increase the length limits such that there's no need to impose them.
* The limit appears because of the packing structure, which indicates how many bytes each segment of
* text in the lookup tables spans. If, however, care were taken to track the longest word length, then
* the packing structure could change its representation to allow for that. Each additional byte storing
* length, however, increases the memory overhead and lookup runtime.
*
* An alternative approach could be to borrow the UTF-8 variable-length encoding and store lengths of up
* to 127 as a single byte with the high bit unset, storing longer lengths as the combination of
* continuation bytes.
*
* Since it has not been shown during the development of this class that longer strings are required, this
* update is deferred until such a need is clear.
*
* @since 6.6.0
*/
class WP_Token_Map {
/**
* Denotes the version of the code which produces pre-computed source tables.
*
* This version will be used not only to verify pre-computed data, but also
* to upgrade pre-computed data from older versions. Choosing a name that
* corresponds to the WordPress release will help people identify where an
* old copy of data came from.
*/
const STORAGE_VERSION = '6.6.0-trunk';
/**
* Exclusive upper bound on the length of each key and each transformed value in the table (in bytes).
*
* @since 6.6.0
*/
const MAX_LENGTH = 256;
/**
* How many bytes of each key are used to form a group key for lookup.
* This also determines whether a word is considered short or long.
*
* @since 6.6.0
*
* @var int
*/
private $key_length = 2;
/**
* Stores an optimized form of the word set, where words are grouped
* by a prefix of the `$key_length` and then collapsed into a string.
*
* In each group, the keys and lookups form a packed data structure.
* The keys in the string are stripped of their "group key," which is
* the prefix of length `$this->key_length` shared by all of the items
* in the group. Each word in the string is prefixed by a single byte
* whose raw unsigned integer value represents how many bytes follow.
*
* ┌────────────────┬───────────────┬─────────────────┬────────┐
* │ Length of rest │ Rest of key │ Length of value │ Value │
* │ of key (bytes) │ │ (bytes) │ │
* ├────────────────┼───────────────┼─────────────────┼────────┤
* │ 0x08 │ nterDot; │ 0x02 │ · │
* └────────────────┴───────────────┴─────────────────┴────────┘
*
* In this example, the key `CenterDot;` has a group key `Ce`, leaving
* eight bytes for the rest of the key, `nterDot;`, and two bytes for
* the transformed value `·` (or U+B7 or "\xC2\xB7").
*
* Example:
*
* // Stores array( 'CenterDot;' => '·', 'Cedilla;' => '¸' ).
* $groups = "Ce\x00";
* $large_words = array( "\x08nterDot;\x02·\x06dilla;\x02¸" );
*
* The prefixes appear in the `$groups` string, each followed by a null
* byte. This makes for quick lookup of where in the group string the key
* is found, and then a simple division converts that offset into the index
* in the `$large_words` array where the group string is to be found.
*
* This lookup data structure is designed to optimize cache locality and
* minimize indirect memory reads when matching strings in the set.
*
* @since 6.6.0
*
* @var array
*/
private $large_words = array();
/**
* Stores the group keys for sequential string lookup.
*
* The offset into this string where the group key appears corresponds with the index
* into the group array where the rest of the group string appears. This is an optimization
* to improve cache locality while searching and minimize indirect memory accesses.
*
* @since 6.6.0
*
* @var string
*/
private $groups = '';
/**
* Stores an optimized row of small words, where every entry is
* `$this->key_length + 1` bytes long and zero-extended.
*
* This packing allows for direct lookup of a short word followed
* by the null byte, if extended to `$this->key_length + 1`.
*
* Example:
*
* // Stores array( 'GT', 'LT', 'gt', 'lt' ).
* "GT\x00LT\x00gt\x00lt\x00"
*
* @since 6.6.0
*
* @var string
*/
private $small_words = '';
/**
* Replacements for the small words, in the same order they appear.
*
* With the position of a small word it's possible to index the translation
* directly, as its position in the `$small_words` string corresponds to
* the index of the replacement in the `$small_mappings` array.
*
* Example:
*
* array( '>', '<', '>', '<' )
*
* @since 6.6.0
*
* @var string[]
*/
private $small_mappings = array();
/**
* Create a token map using an associative array of key/value pairs as the input.
*
* Example:
*
* $smilies = WP_Token_Map::from_array( array(
* '8O' => '😯',
* ':(' => '🙁',
* ':)' => '🙂',
* ':?' => '😕',
* ) );
*
* @since 6.6.0
*
* @param array $mappings The keys transform into the values; both are strings.
* @param int $key_length Determines the group key length. Leave at the default value
* of 2 unless there's an empirical reason to change it.
*
* @return WP_Token_Map|null Token map, unless unable to create it.
*/
public static function from_array( $mappings, $key_length = 2 ) {
$map = new WP_Token_Map();
$map->key_length = $key_length;
// Start by grouping words.
$groups = array();
$shorts = array();
foreach ( $mappings as $word => $mapping ) {
if (
self::MAX_LENGTH <= strlen( $word ) ||
self::MAX_LENGTH <= strlen( $mapping )
) {
_doing_it_wrong(
__METHOD__,
sprintf(
/* translators: 1: maximum byte length (a count) */
__( 'Token Map tokens and substitutions must all be shorter than %1$d bytes.' ),
self::MAX_LENGTH
),
'6.6.0'
);
return null;
}
$length = strlen( $word );
if ( $key_length >= $length ) {
$shorts[] = $word;
} else {
$group = substr( $word, 0, $key_length );
if ( ! isset( $groups[ $group ] ) ) {
$groups[ $group ] = array();
}
$groups[ $group ][] = array( substr( $word, $key_length ), $mapping );
}
}
/*
* Sort the words to ensure that no smaller substring of a match masks the full match.
* For example, `Cap` should not match before `CapitalDifferentialD`.
*/
usort( $shorts, 'WP_Token_Map::longest_first_then_alphabetical' );
foreach ( $groups as $group_key => $group ) {
usort(
$groups[ $group_key ],
static function ( $a, $b ) {
return self::longest_first_then_alphabetical( $a[0], $b[0] );
}
);
}
// Finally construct the optimized lookups.
foreach ( $shorts as $word ) {
$map->small_words .= str_pad( $word, $key_length + 1, "\x00", STR_PAD_RIGHT );
$map->small_mappings[] = $mappings[ $word ];
}
$group_keys = array_keys( $groups );
sort( $group_keys );
foreach ( $group_keys as $group ) {
$map->groups .= "{$group}\x00";
$group_string = '';
foreach ( $groups[ $group ] as $group_word ) {
list( $word, $mapping ) = $group_word;
$word_length = pack( 'C', strlen( $word ) );
$mapping_length = pack( 'C', strlen( $mapping ) );
$group_string .= "{$word_length}{$word}{$mapping_length}{$mapping}";
}
$map->large_words[] = $group_string;
}
return $map;
}
/**
* Creates a token map from a pre-computed table.
* This skips the initialization cost of generating the table.
*
* This function should only be used to load data created with
* WP_Token_Map::precomputed_php_source_table().
*
* @since 6.6.0
*
* @param array $state {
* Stores pre-computed state for directly loading into a Token Map.
*
* @type string $storage_version Which version of the code produced this state.
* @type int $key_length Group key length.
* @type string $groups Group lookup index.
* @type array $large_words Large word groups and packed strings.
* @type string $small_words Small words packed string.
* @type array $small_mappings Small word mappings.
* }
*
* @return WP_Token_Map|null Map with precomputed data loaded, or `null` if the state is incomplete or from an incompatible version.
*/
public static function from_precomputed_table( $state ) {
$has_necessary_state = isset(
$state['storage_version'],
$state['key_length'],
$state['groups'],
$state['large_words'],
$state['small_words'],
$state['small_mappings']
);
if ( ! $has_necessary_state ) {
_doing_it_wrong(
__METHOD__,
__( 'Missing required inputs to pre-computed WP_Token_Map.' ),
'6.6.0'
);
return null;
}
if ( self::STORAGE_VERSION !== $state['storage_version'] ) {
_doing_it_wrong(
__METHOD__,
/* translators: 1: version string, 2: version string. */
sprintf( __( 'Loaded version \'%1$s\' incompatible with expected version \'%2$s\'.' ), $state['storage_version'], self::STORAGE_VERSION ),
'6.6.0'
);
return null;
}
$map = new WP_Token_Map();
$map->key_length = $state['key_length'];
$map->groups = $state['groups'];
$map->large_words = $state['large_words'];
$map->small_words = $state['small_words'];
$map->small_mappings = $state['small_mappings'];
return $map;
}
/**
* Indicates if a given word is a lookup key in the map.
*
* Example:
*
* true === $smilies->contains( ':)' );
* false === $smilies->contains( 'simile' );
*
* @since 6.6.0
*
* @param string $word Determine if this word is a lookup key in the map.
* @param ?string $case_sensitivity 'ascii-case-insensitive' to ignore ASCII case or default of 'case-sensitive'.
* @return bool Whether there's an entry for the given word in the map.
*/
public function contains( $word, $case_sensitivity = 'case-sensitive' ) {
$ignore_case = 'ascii-case-insensitive' === $case_sensitivity;
if ( $this->key_length >= strlen( $word ) ) {
if ( 0 === strlen( $this->small_words ) ) {
return false;
}
$term = str_pad( $word, $this->key_length + 1, "\x00", STR_PAD_RIGHT );
$word_at = $ignore_case ? stripos( $this->small_words, $term ) : strpos( $this->small_words, $term );
if ( false === $word_at ) {
return false;
}
return true;
}
$group_key = substr( $word, 0, $this->key_length );
$group_at = $ignore_case ? stripos( $this->groups, $group_key ) : strpos( $this->groups, $group_key );
if ( false === $group_at ) {
return false;
}
$group = $this->large_words[ $group_at / ( $this->key_length + 1 ) ];
$group_length = strlen( $group );
$slug = substr( $word, $this->key_length );
$length = strlen( $slug );
$at = 0;
while ( $at < $group_length ) {
$token_length = unpack( 'C', $group[ $at++ ] )[1];
$token_at = $at;
$at += $token_length;
$mapping_length = unpack( 'C', $group[ $at++ ] )[1];
$mapping_at = $at;
if ( $token_length === $length && 0 === substr_compare( $group, $slug, $token_at, $token_length, $ignore_case ) ) {
return true;
}
$at = $mapping_at + $mapping_length;
}
return false;
}
/**
* If the text starting at a given offset is a lookup key in the map,
* return the corresponding transformation from the map, else `false`.
*
* This function returns the translated string, but accepts an optional
* parameter `$matched_token_byte_length`, which communicates how many
* bytes long the lookup key was, if it found one. This can be used to
* advance a cursor in calling code if a lookup key was found.
*
* Example:
*
* false === $smilies->read_token( 'Not sure :?.', 0, $token_byte_length );
* '😕' === $smilies->read_token( 'Not sure :?.', 9, $token_byte_length );
* 2 === $token_byte_length;
*
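* Example:
*
* $output = '';
* $at = 0;
* while ( $at < strlen( $input ) ) {
* $next_at = strpos( $input, ':', $at );
* if ( false === $next_at ) {
* break;
* }
*
* $smily = $smilies->read_token( $input, $next_at, $token_byte_length );
* if ( false === $smily ) {
* // No smily starts here: keep the text through the colon and move on.
* $output .= substr( $input, $at, $next_at + 1 - $at );
* $at = $next_at + 1;
* continue;
* }
*
* $prefix = substr( $input, $at, $next_at - $at );
* $at = $next_at + $token_byte_length;
* $output .= "{$prefix}{$smily}";
* }
* $output .= substr( $input, $at );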
*
* @since 6.6.0
*
* @param string $text String in which to search for a lookup key.
* @param ?int $offset How many bytes into the string where the lookup key ought to start.
* @param ?int &$matched_token_byte_length Holds byte-length of found token matched, otherwise not set.
* @param ?string $case_sensitivity 'ascii-case-insensitive' to ignore ASCII case or default of 'case-sensitive'.
* @return string|false Mapped value of lookup key if found, otherwise `false`.
*/
public function read_token( $text, $offset = 0, &$matched_token_byte_length = null, $case_sensitivity = 'case-sensitive' ) {
$ignore_case = 'ascii-case-insensitive' === $case_sensitivity;
$text_length = strlen( $text );
// Search for a long word first, if the text is long enough, and if that fails, a short one.
if ( $text_length - $offset > $this->key_length ) {
$group_key = substr( $text, $offset, $this->key_length );
$group_at = $ignore_case ? stripos( $this->groups, $group_key ) : strpos( $this->groups, $group_key );
if ( false === $group_at ) {
// Perhaps a short word then.
return strlen( $this->small_words ) > 0
? $this->read_small_token( $text, $offset, $matched_token_byte_length, $case_sensitivity )
: false;
}
$group = $this->large_words[ $group_at / ( $this->key_length + 1 ) ];
$group_length = strlen( $group );
$at = 0;
while ( $at < $group_length ) {
$token_length = unpack( 'C', $group[ $at++ ] )[1];
$token = substr( $group, $at, $token_length );
$at += $token_length;
$mapping_length = unpack( 'C', $group[ $at++ ] )[1];
$mapping_at = $at;
if ( 0 === substr_compare( $text, $token, $offset + $this->key_length, $token_length, $ignore_case ) ) {
$matched_token_byte_length = $this->key_length + $token_length;
return substr( $group, $mapping_at, $mapping_length );
}
$at = $mapping_at + $mapping_length;
}
}
// Perhaps a short word then.
return strlen( $this->small_words ) > 0
? $this->read_small_token( $text, $offset, $matched_token_byte_length, $case_sensitivity )
: false;
}
/**
* Finds a match for a short word at the index.
*
* @since 6.6.0
*
* @param string $text String in which to search for a lookup key.
* @param ?int $offset How many bytes into the string where the lookup key ought to start.
* @param ?int &$matched_token_byte_length Holds byte-length of found lookup key if matched, otherwise not set.
* @param ?string $case_sensitivity 'ascii-case-insensitive' to ignore ASCII case or default of 'case-sensitive'.
* @return string|false Mapped value of lookup key if found, otherwise `false`.
*/
private function read_small_token( $text, $offset, &$matched_token_byte_length, $case_sensitivity = 'case-sensitive' ) {
$ignore_case = 'ascii-case-insensitive' === $case_sensitivity;
$small_length = strlen( $this->small_words );
$search_text = substr( $text, $offset, $this->key_length );
if ( $ignore_case ) {
$search_text = strtoupper( $search_text );
}
$starting_char = $search_text[0];
$at = 0;
while ( $at < $small_length ) {
if (
$starting_char !== $this->small_words[ $at ] &&
( ! $ignore_case || strtoupper( $this->small_words[ $at ] ) !== $starting_char )
) {
$at += $this->key_length + 1;
continue;
}
for ( $adjust = 1; $adjust < $this->key_length; $adjust++ ) {
if ( "\x00" === $this->small_words[ $at + $adjust ] ) {
$matched_token_byte_length = $adjust;
return $this->small_mappings[ $at / ( $this->key_length + 1 ) ];
}
if (
$search_text[ $adjust ] !== $this->small_words[ $at + $adjust ] &&
( ! $ignore_case || strtoupper( $this->small_words[ $at + $adjust ] ) !== $search_text[ $adjust ] )
) {
$at += $this->key_length + 1;
continue 2;
}
}
$matched_token_byte_length = $adjust;
return $this->small_mappings[ $at / ( $this->key_length + 1 ) ];
}
return false;
}
/**
* Exports the token map into an associative array of key/value pairs.
*
* Example:
*
* $smilies->to_array() === array(
* '8O' => '😯',
* ':(' => '🙁',
* ':)' => '🙂',
* ':?' => '😕',
* );
*
* @since 6.6.0
*
* @return array The lookup key/substitution values as an associative array.
*/
public function to_array() {
$tokens = array();
$at = 0;
$small_mapping = 0;
$small_length = strlen( $this->small_words );
while ( $at < $small_length ) {
$key = rtrim( substr( $this->small_words, $at, $this->key_length + 1 ), "\x00" );
$value = $this->small_mappings[ $small_mapping++ ];
$tokens[ $key ] = $value;
$at += $this->key_length + 1;
}
foreach ( $this->large_words as $index => $group ) {
$prefix = substr( $this->groups, $index * ( $this->key_length + 1 ), $this->key_length );
$group_length = strlen( $group );
$at = 0;
while ( $at < $group_length ) {
$length = unpack( 'C', $group[ $at++ ] )[1];
$key = $prefix . substr( $group, $at, $length );
$at += $length;
$length = unpack( 'C', $group[ $at++ ] )[1];
$value = substr( $group, $at, $length );
$tokens[ $key ] = $value;
$at += $length;
}
}
return $tokens;
}
/**
* Export the token map for quick loading in PHP source code.
*
* This function has a specific purpose, to make loading of static token maps fast.
* It's used to ensure that the HTML character reference lookups add a minimal cost
* to initializing the PHP process.
*
* Example:
*
* echo $smilies->precomputed_php_source_table();
*
* // Output.
* WP_Token_Map::from_precomputed_table(
* array(
* "storage_version" => "6.6.0-trunk",
* "key_length" => 2,
* "groups" => "",
* "large_words" => array(),
* "small_words" => "8O\x00:(\x00:)\x00:?\x00",
* "small_mappings" => array( "😯", "🙁", "🙂", "😕" )
* )
* );
*
* @since 6.6.0
*
* @param ?string $indent Use this string for indentation, or rely on the default horizontal tab character.
* @return string Value which can be pasted into a PHP source file for quick loading of table.
*/
public function precomputed_php_source_table( $indent = "\t" ) {
$i1 = $indent;
$i2 = $i1 . $indent;
$i3 = $i2 . $indent;
$class_version = self::STORAGE_VERSION;
$output = self::class . "::from_precomputed_table(\n";
$output .= "{$i1}array(\n";
$output .= "{$i2}\"storage_version\" => \"{$class_version}\",\n";
$output .= "{$i2}\"key_length\" => {$this->key_length},\n";
$group_line = str_replace( "\x00", "\\x00", $this->groups );
$output .= "{$i2}\"groups\" => \"{$group_line}\",\n";
$output .= "{$i2}\"large_words\" => array(\n";
$prefixes = explode( "\x00", $this->groups );
foreach ( $prefixes as $index => $prefix ) {
if ( '' === $prefix ) {
break;
}
$group = $this->large_words[ $index ];
$group_length = strlen( $group );
$comment_line = "{$i3}//";
$data_line = "{$i3}\"";
$at = 0;
while ( $at < $group_length ) {
$token_length = unpack( 'C', $group[ $at++ ] )[1];
$token = substr( $group, $at, $token_length );
$at += $token_length;
$mapping_length = unpack( 'C', $group[ $at++ ] )[1];
$mapping = substr( $group, $at, $mapping_length );
$at += $mapping_length;
$token_digits = str_pad( dechex( $token_length ), 2, '0', STR_PAD_LEFT );
$mapping_digits = str_pad( dechex( $mapping_length ), 2, '0', STR_PAD_LEFT );
$mapping = preg_replace_callback(
"~[\\x00-\\x1f\\x22\\x5c]~",
static function ( $match_result ) {
switch ( $match_result[0] ) {
case '"':
return '\\"';
case '\\':
return '\\\\';
default:
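// Always emit two hex digits so a following hex-like character is not absorbed into the escape.
$hex = str_pad( dechex( ord( $match_result[0] ) ), 2, '0', STR_PAD_LEFT );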
return "\\x{$hex}";
}
},
$mapping
);
$comment_line .= " {$prefix}{$token}[{$mapping}]";
$data_line .= "\\x{$token_digits}{$token}\\x{$mapping_digits}{$mapping}";
}
$comment_line .= ".\n";
$data_line .= "\",\n";
$output .= $comment_line;
$output .= $data_line;
}
$output .= "{$i2}),\n";
$small_words = array();
$small_length = strlen( $this->small_words );
$at = 0;
while ( $at < $small_length ) {
$small_words[] = substr( $this->small_words, $at, $this->key_length + 1 );
$at += $this->key_length + 1;
}
$small_text = str_replace( "\x00", '\x00', implode( '', $small_words ) );
$output .= "{$i2}\"small_words\" => \"{$small_text}\",\n";
$output .= "{$i2}\"small_mappings\" => array(\n";
foreach ( $this->small_mappings as $mapping ) {
$output .= "{$i3}\"{$mapping}\",\n";
}
$output .= "{$i2})\n";
$output .= "{$i1})\n";
$output .= ')';
return $output;
}
/**
* Compares two strings so that longer strings sort first, and
* strings of the same length sort alphabetically.
*
* This is an important sort when building the token map because
* it should not form a match on a substring of a longer potential
* match. For example, it should not detect `Cap` when matching
* against the string `CapitalDifferentialD`.
*
* @since 6.6.0
*
* @param string $a First string to compare.
* @param string $b Second string to compare.
* @return int -1 or lower if `$a` is less than `$b`; 1 or greater if `$a` is greater than `$b`, and 0 if they are equal.
*/
private static function longest_first_then_alphabetical( $a, $b ) {
if ( $a === $b ) {
return 0;
}
$length_a = strlen( $a );
$length_b = strlen( $b );
// Longer strings are less-than for comparison's sake.
if ( $length_a !== $length_b ) {
return $length_b - $length_a;
}
return strcmp( $a, $b );
}
}

File diff suppressed because it is too large.


@@ -107,6 +107,7 @@ wp_set_lang_dir();
// Load early WordPress files.
require ABSPATH . WPINC . '/class-wp-list-util.php';
require ABSPATH . WPINC . '/class-wp-token-map.php';
require ABSPATH . WPINC . '/formatting.php';
require ABSPATH . WPINC . '/meta.php';
require ABSPATH . WPINC . '/functions.php';
@@ -248,6 +249,7 @@ require ABSPATH . WPINC . '/class-wp-oembed.php';
require ABSPATH . WPINC . '/class-wp-oembed-controller.php';
require ABSPATH . WPINC . '/media.php';
require ABSPATH . WPINC . '/http.php';
require ABSPATH . WPINC . '/html-api/html5-named-character-references.php';
require ABSPATH . WPINC . '/html-api/class-wp-html-attribute-token.php';
require ABSPATH . WPINC . '/html-api/class-wp-html-span.php';
require ABSPATH . WPINC . '/html-api/class-wp-html-text-replacement.php';


@@ -0,0 +1,25 @@
# HTML5 Entities
This directory contains the listing of HTML5 named character references and a script that can be used
to create or update the optimized form for use in the HTML API.
The HTML5 specification asserts:
> This list is static and will not be expanded or changed in the future.
> - https://html.spec.whatwg.org/#named-character-references
The authoritative [`entities.json`](https://html.spec.whatwg.org/entities.json) file comes from the WHATWG server, and
is cached here in the test directory so that it doesn't need to be constantly re-downloaded.
## Updating the optimized lookup class.
The [`html5-named-character-references.php`][1] file contains an optimized lookup map for the entities in `entities.json`.
Run the [`generate-html5-named-character-references.php`][2] file to update the auto-generated Core module.
```bash
~$ php tests/phpunit/data/html5-entities/generate-html5-named-character-references.php
OK: Successfully generated optimized lookup class.
```
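As a rough sketch of what the generator does before handing data to `WP_Token_Map::from_array()`, the snippet below flattens `entities.json`-style entries into a name-to-characters array with the leading `&` removed. The inline JSON is a two-entry sample rather than the real file, and `flatten_entities()` is a hypothetical helper name, not something defined in this directory.

```php
<?php
/**
 * Illustrative only: maps each reference name (without its leading `&`)
 * to the UTF-8 characters it represents.
 */
function flatten_entities( array $entities ): array {
	$character_references = array();

	foreach ( $entities as $reference => $metadata ) {
		// Drop the `&` that every named character reference starts with.
		$character_references[ substr( $reference, 1 ) ] = $metadata['characters'];
	}

	return $character_references;
}

// Two-entry sample in the shape of entities.json; the real file is much larger.
$sample_json = '{"&copy;":{"codepoints":[169],"characters":"\u00A9"},"&gt;":{"codepoints":[62],"characters":">"}}';
$entities    = json_decode( $sample_json, true );

print_r( flatten_entities( $entities ) );
```

Leaving the `&` out of the keys matches the advice in the `WP_Token_Map` documentation: callers find the ampersand first, then ask the map whether the following bytes complete a reference.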
[1]: ../../../../src/wp-includes/html-api/html5-named-character-references.php
[2]: ./generate-html5-named-character-references.php

File diff suppressed because it is too large.


@@ -0,0 +1,101 @@
<?php
require_once __DIR__ . '/../../../../src/wp-includes/class-wp-token-map.php';
/**
* Stores a mapping from HTML5 named character reference to its transformation metadata.
*
* Example:
*
* $entities['&copy;'] === array(
* 'codepoints' => array( 0xA9 ),
* 'characters' => '©',
* );
*
* @see https://html.spec.whatwg.org/entities.json
*
* @var array
*/
$entities = json_decode(
file_get_contents( __DIR__ . '/entities.json' ),
true // Decode JSON objects as associative arrays.
);
/**
* Direct mapping from character reference name to UTF-8 string.
*
* Example:
*
* $character_references['&copy;'] === '©';
*
* @var array
*/
$character_references = array();
foreach ( $entities as $reference => $metadata ) {
$reference_without_ampersand_prefix = substr( $reference, 1 );
$character_references[ $reference_without_ampersand_prefix ] = $metadata['characters'];
}
$html5_map = WP_Token_Map::from_array( $character_references );
/**
* Contains the new contents for the auto-generated module.
*
* Note that in this template, the `$` is escaped with `\$` so that it
* comes through as a `$` in the output. Without escaping, PHP will look
* for a variable of the given name to interpolate into the template.
*
* @var string
*/
$module_contents = <<<EOF
<?php
/**
* Auto-generated class for looking up HTML named character references.
*
* ⚠️ !!! THIS ENTIRE FILE IS AUTOMATICALLY GENERATED !!! ⚠️
* Do not modify this file directly.
*
* To regenerate, run the generation script directly.
*
* Example:
*
* php tests/phpunit/data/html5-entities/generate-html5-named-character-references.php
*
* @package WordPress
* @since 6.6.0
*/
// phpcs:disable
global \$html5_named_character_references;
/**
* Set of named character references in the HTML5 specification.
*
* This list will never change, according to the spec. Each named
* character reference is case-sensitive and the presence or absence
* of the semicolon is significant. Without the semicolon, the rules
* for an ambiguous ampersand govern whether the following text is
* to be interpreted as a character reference or not.
*
* The list of entities is sourced directly from the WHATWG server
* and cached in the test directory to avoid needing to download it
* every time this file is updated.
*
* @link https://html.spec.whatwg.org/entities.json
*/
\$html5_named_character_references = {$html5_map->precomputed_php_source_table()};
EOF;
file_put_contents(
__DIR__ . '/../../../../src/wp-includes/html-api/html5-named-character-references.php',
$module_contents
);
if ( posix_isatty( STDOUT ) ) {
echo "\e[1;32mOK\e[0;90m: \e[mSuccessfully generated optimized lookup class.\n";
} else {
echo "OK: Successfully generated optimized lookup class.\n";
}


@@ -0,0 +1,399 @@
<?php
/**
* Unit tests covering WP_Token_Map functionality.
*
* @package WordPress
*
* @since 6.6.0
* @group html-api-token-map
*
* @coversDefaultClass WP_Token_Map
*/
class Tests_WpTokenMap extends WP_UnitTestCase {
/**
* Number of unique HTML5 named character references, including
* variations of a given name that don't require the trailing semicolon.
*
* The set of names is fixed by the specification,
* and can be found at the following link.
*
* @link https://html.spec.whatwg.org/entities.json
*/
const KNOWN_COUNT_OF_ALL_HTML5_NAMED_CHARACTER_REFERENCES = 2231;
/**
* Small test array matching names to Emoji.
*
* @var array
*/
const ANIMAL_EMOJI = array(
'cat' => '🐈',
'dog' => '🐶',
'fish' => '🐟',
'mammoth' => '🦣',
'seal' => '🦭',
);
/**
* Returns an associative array whose keys are tokens to replace and
* whose values are the replacement strings for those tokens.
*
* This function is here to help avoid bloating this specific test file.
* For example, the HTML5 dataset is very large and best served as a
* separate file.
*
* The HTML5 named character reference list is pulled directly from the
* WHATWG spec and stored in the tests directory so it doesn't need to
* be downloaded on every test run. By specification, it cannot change
* and will not be updated.
*
* @param string $dataset_name Which dataset to return.
* @return array The dataset as an associative array.
*/
private static function get_test_input_array( $dataset_name ) {
static $html5_character_references = null;
switch ( $dataset_name ) {
case 'ANIMALS':
return self::ANIMAL_EMOJI;
case 'HTML5':
if ( ! isset( $html5_character_references ) ) {
$dataset = wp_json_file_decode(
__DIR__ . '/../../data/html5-entities/entities.json',
array( 'associative' => true )
);
$html5_character_references = array();
foreach ( $dataset as $name => $value ) {
$html5_character_references[ $name ] = $value['characters'];
}
}
return $html5_character_references;
}
}
/**
* Data provider.
*
* @return array[].
*/
public static function data_input_arrays() {
$dataset_names = array(
'ANIMALS',
'HTML5',
);
foreach ( $dataset_names as $dataset_name ) {
yield $dataset_name => array( self::get_test_input_array( $dataset_name ) );
}
}
/**
* Ensure the basic creation of a Token Map from an associative array.
*
* @ticket 60698
*
* @dataProvider data_input_arrays
*
* @param array $dataset Dataset to test.
*/
public function test_creates_map_from_array_containing_proper_values( $dataset ) {
$map = WP_Token_Map::from_array( $dataset );
foreach ( $dataset as $token => $replacement ) {
$this->assertTrue(
$map->contains( $token ),
"Map should have contained '{$token}' but didn't."
);
$skip_bytes = 0;
$response = $map->read_token( $token, 0, $skip_bytes );
$this->assertSame(
$replacement,
$response,
"Returned the wrong replacement value for '{$token}'."
);
$token_length = strlen( $token );
$this->assertSame(
$token_length,
$skip_bytes,
'Reported the wrong byte-length of the found token.'
);
}
}
/**
* Ensure that keys that are too long prevent the creation of a Token Map.
*
* If tokens or replacements are stored whose length is more than can be
* represented by a single byte, then the encoding scheme in the Token Map
* will fail and lead to corruption.
*
* @ticket 60698
*
* @expectedIncorrectUsage WP_Token_Map::from_array
*/
public function test_rejects_words_which_are_too_long() {
$normal_length = str_pad( '', 255, '.' );
$too_long_word = "{$normal_length}.";
$this->assertInstanceOf(
WP_Token_Map::class,
WP_Token_Map::from_array( array( $normal_length => 'just fine' ) ),
'Should have built Token Map containing long, but acceptable token length.'
);
$this->assertNull(
WP_Token_Map::from_array( array( $too_long_word => 'not good' ) ),
'Should have refused to build Token Map with key exceeding design limit.'
);
$this->assertInstanceOf(
WP_Token_Map::class,
WP_Token_Map::from_array( array( 'key' => $normal_length ) ),
'Should have built Token Map containing long, but acceptable replacement.'
);
$this->assertNull(
WP_Token_Map::from_array( array( 'key' => $too_long_word ) ),
'Should have refused to build Token Map with replacement exceeding design limit.'
);
}
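/**
* Illustrative sketch, not part of the original commit: shows why 255 is
* the largest length that a single byte can record.
*
* This assumes the Token Map stores each length in one byte, as the comment
* on the previous test describes. It exercises only PHP built-ins and does
* not inspect the Token Map's internal format.
*/
public function test_sketch_single_byte_cannot_record_length_above_255() {
// A length of 255 survives a round trip through a single byte.
$this->assertSame( 255, ord( chr( 255 ) ) );
// Keeping only the low eight bits of 256 leaves 0, so a 256-byte
// token would be indistinguishable from an empty one.
$this->assertSame( 0, 256 & 0xFF );
}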
/**
* Ensure isomorphic creation and export of a Token Map and associative arrays.
*
* @ticket 60698
*
* @dataProvider data_input_arrays
*
* @param array $dataset Dataset to test.
*/
public function test_round_trips_through_associative_array( $dataset ) {
$map = WP_Token_Map::from_array( $dataset );
$this->assertEqualsCanonicalizing(
$dataset,
$map->to_array(),
'Should have produced an identical array on output as was given on input.'
);
}
/**
* Ensure the basic creation of a Token Map from a precomputed source table.
*
* @ticket 60698
*
* @dataProvider data_input_arrays
*
* @param array $dataset Dataset to test.
*/
public function test_round_trips_through_precomputed_source_table( $dataset ) {
$seed = WP_Token_Map::from_array( $dataset );
$source_table = $seed->precomputed_php_source_table();
$map = eval( "return {$source_table};" ); // phpcs:ignore.
foreach ( $dataset as $token => $replacement ) {
$this->assertTrue(
$map->contains( $token ),
"Map should have contained '{$token}' but didn't."
);
$skip_bytes = 0;
$response = $map->read_token( $token, 0, $skip_bytes );
$this->assertSame(
$replacement,
$response,
"Returned the wrong replacement value for '{$token}'."
);
$token_length = strlen( $token );
$this->assertSame(
$token_length,
$skip_bytes,
'Reported the wrong byte-length of the found token.'
);
}
}
/**
* Ensures that when two or more keys share a prefix, the longest
* is matched first, to prevent tokens masking each other.
*
* @ticket 60698
*/
public function test_finds_longest_match_first() {
$map = WP_Token_Map::from_array(
array(
'cat' => '1',
'caterpillar' => '2',
'caterpillar machines' => '3',
)
);
$skip_bytes = 0;
$text = 'cats like to meow';
$this->assertSame(
'1',
$map->read_token( $text, 0, $skip_bytes ),
"Should have matched 'cat' but matched '" . substr( $text, 0, $skip_bytes ) . "' instead."
);
$skip_bytes = 0;
$text = 'caterpillars turn into butterflies';
$this->assertSame(
'2',
$map->read_token( $text, 0, $skip_bytes ),
"Should have matched 'caterpillar' but matched '" . substr( $text, 0, $skip_bytes ) . "' instead."
);
$skip_bytes = 0;
$text = 'caterpillar machines are heavy duty equipment';
$this->assertSame(
'3',
$map->read_token( $text, 0, $skip_bytes ),
"Should have matched 'caterpillar machines' but matched '" . substr( $text, 0, $skip_bytes ) . "' instead."
);
}
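/**
* Illustrative sketch, not part of the original commit: scans a document
* from left to right and substitutes every token it finds.
*
* The loop relies only on `read_token()` reporting the matched byte length
* through its third argument, as the tests above do. The `is_string()`
* check is an assumption chosen so the sketch does not depend on the exact
* value returned when nothing matches.
*/
public function test_sketch_replaces_tokens_while_scanning_document() {
$map = WP_Token_Map::from_array( self::ANIMAL_EMOJI );
$text = 'cat and dog';
$output = '';
$at = 0;
$end = strlen( $text );
while ( $at < $end ) {
$skip_bytes = 0;
$replacement = $map->read_token( $text, $at, $skip_bytes );
if ( is_string( $replacement ) ) {
// Found a token: emit its replacement and jump past it.
$output .= $replacement;
$at += $skip_bytes;
} else {
// No token starts here: copy one byte and move on.
$output .= $text[ $at ];
++$at;
}
}
$this->assertSame( '🐈 and 🐶', $output );
}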
/**
* Ensures that tokens shorter than the group key length are found.
*
* @ticket 60698
*
* @dataProvider data_short_substring_matches_of_each_other
*
* @param WP_Token_Map $map Token map containing appropriate mapping for test.
* @param string $search_document Document containing expected token at start of string.
* @param string $expected_token Which token should be found at start of search document.
*/
public function test_finds_short_matches_shorter_than_group_key_length( $map, $search_document, $expected_token ) {
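$skip_bytes = 0;
// Use the values supplied by the data provider instead of a hard-coded case.
$map->read_token( $search_document, 0, $skip_bytes );
$this->assertSame(
$expected_token,
substr( $search_document, 0, $skip_bytes ),
"Should have matched '{$expected_token}' but matched '" . substr( $search_document, 0, $skip_bytes ) . "' instead."
);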
}
/**
* Data provider.
*
* @return array[].
*/
public static function data_short_substring_matches_of_each_other() {
$map = WP_Token_Map::from_array(
array(
'a' => 'article',
'aa' => 'defensive weapon',
'ar' => 'country code',
'arizona' => 'state name',
)
);
return array(
'single character' => array( $map, 'antarctica is a continent', 'a' ),
'duplicate character' => array( $map, 'aaaaahhhh, he exclaimed', 'aa' ),
'different character' => array( $map, 'argentina is a country', 'ar' ),
'full word' => array( $map, 'arizona was full of copper', 'arizona' ),
);
}
/**
* Ensures that Token Map searches at appropriate starting offset.
*
* @ticket 60698
*
* @dataProvider data_html5_test_dataset
*
* @param string $token Token to find.
* @param string $replacement Replacement string for token.
*/
public function test_reads_token_at_given_offset( $token, $replacement ) {
$document = "& another {$token} & then some";
$map = self::get_html5_token_map();
$skip_bytes = 0;
$this->assertFalse(
$map->read_token( $document, 0, $skip_bytes ),
"Shouldn't have found token at start of document."
);
$response = $map->read_token( $document, 10, $skip_bytes );
$this->assertSame(
strlen( $token ),
$skip_bytes,
"Found the wrong length for token '{$token}'."
);
$this->assertSame(
$replacement,
$response,
'Found the wrong replacement value for the token.'
);
}
/**
* Ensures that all given tokens exist inside a constructed Token Map.
*
* @ticket 60698
*
* @dataProvider data_html5_test_dataset
*
* @param string $token Token to find.
* @param string $replacement Not used in this test.
*/
public function test_detects_all_tokens( $token, $replacement ) {
$map = self::get_html5_token_map();
$this->assertTrue(
$map->contains( $token ),
"Should have found '{$token}' inside the Token Map, but didn't."
);
$double_escaped_token = str_replace( '&', '&amp;', $token );
$this->assertFalse(
$map->contains( $double_escaped_token ),
"Should not have found '{$double_escaped_token}' in Token Map, but did."
);
}
/**
* Data provider.
*
* @return array[].
*/
public function data_html5_test_dataset() {
$html5 = self::get_test_input_array( 'HTML5' );
$this->assertSame(
self::KNOWN_COUNT_OF_ALL_HTML5_NAMED_CHARACTER_REFERENCES,
count( $html5 ),
'Found the wrong number of HTML5 named character references: confirm the entities.json file.'
);
foreach ( $html5 as $token => $replacement ) {
yield $token => array( $token, $replacement );
}
}
/**
* Returns a static copy of the Token Map for HTML5.
* This is a test performance optimization.
*
* @return WP_Token_Map
*/
private static function get_html5_token_map() {
static $html5_token_map = null;
if ( ! isset( $html5_token_map ) ) {
$html5_token_map = WP_Token_Map::from_array( self::get_test_input_array( 'HTML5' ) );
}
return $html5_token_map;
}
}