-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Token Map: Introduce an efficient lookup and translation class for string mappings. #5373
Conversation
c82d002
to
77b3a5b
Compare
The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the Core Committers: Use this line as a base for the props when committing in SVN:
To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook. |
45c9bd9
to
c749413
Compare
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
c749413
to
502e4f5
Compare
Rebased from e98508f10d17b606cf3f8761c82bb0f8750b263d to 502e4f5 |
90ab28d
to
aa78d9b
Compare
This patch introduces a new class: `WP_Token_Map`, which is designed for efficient lookup and translation of static mappings between string keys or tokens, and string replacements (for example, HTML character references). The token map incorporates certain limitations on the string length of the tokens, but takes advantage of multiple optimizations to efficiently
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some high level pass on the PR and left my feedback. I was looking at the code from the HTML API encoding/decoding angle. I don't have better insights to share than @aaronjorbin in the Trac ticket.
Overall, the code is very complex and very specific to the requirements that HTML API has so I'm not the best person to draw any conclusions on whether this class is general enough. However, based on the explanation provided by @dmsnell, it definitely is important part of the implementation in the terms of performance for the work on HTML entities encoding and decoding.
tests/phpunit/tests/html-api/generate-html5-named-character-references.php
Outdated
Show resolved
Hide resolved
aa78d9b
to
388c50d
Compare
Some CI jobs report that some static test helpers are private and it can’t access them. |
tests/phpunit/tests/html-api/generate-html5-named-character-references.php
Outdated
Show resolved
Hide resolved
|
||
// phpcs:disable | ||
|
||
global \$html5_named_character_references; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note for other reviewers:
It took me a while to see that \$var
is escaping the $
in $var
. I kept seeing this as the namespace global space. In the generated file is just global $html5_named_character_references;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a note in a comment above the template. Hope that helps
$this->assertSame( | ||
self::KNOWN_COUNT_OF_ALL_HTML5_NAMED_CHARACTER_REFERENCES, | ||
count( $html5 ), | ||
'Found the wrong number of HTML5 named character references: confirm the entities.json file."' | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems out of place to have an assertion in the data provider. Maybe another simple test could not annotate the data provider, but call it make an assertion about its length?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷♂️ maybe, thought then the test isn't testing the unit under test but the test runner itself. it's weird in the data provider, but does it warrant its own separate test?
I thought about throw
ing in here too but then figured an assertion could be clearer.
I don't have strong feelings about this; the assertion inside the data provider seemed a fitting tool but I'm happy to reconsider.
tests/phpunit/data/html5-entities/generate-html5-named-character-references.php
Show resolved
Hide resolved
Co-authored-by: Jon Surrell <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great, but there's a lot in here. I'm digging into the implementation more now and I'll finish a complete review tomorrow.
private $key_length = 2; | ||
|
||
/** | ||
* Stores an optimized form of the word set, where words are grouped |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to see the explicit rule for large vs. small words in their docs:
…where large words are strings whose strlen greater than is key_length
(that's almost certainly wrong, just an example)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added class-level description in 31c57b0
private $groups = ''; | ||
|
||
/** | ||
* Stores an optimized row of small words, where every entry is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to see the explicit rule for large vs. small words in their docs:
…where small words are strings whose strlen is less than or equal to key_length
(that's almost certainly wrong, just an example)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a class-level description in 31c57b0
I think this is fine. I was searching for words like "mb" or "multibyte", but it seems like ASCII is correct 👍 I did notice a changelog entry on the PHP documentation page that makes me wonder if we might need to support PHP versions that have varying behavior on non-ASCII treatment. I wasn't able to trigger any differing behavior myself, but do you know what this is about?
That seems like exactly what we want, but I've looked around PHP source, the changelog, etc. and I haven't managed to find exactly what changed or examples of the different. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been over this in detail and left a lot of feedback. Thanks for the patience @dmsnell. I've left what I believe are my final questions, comments, and suggestions, but I feel good about this change in general.
if ( $ignore_case ) { | ||
$search_text = strtoupper( $search_text ); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was comparing the performance of strtoupper
and strcasecmp
here. I thought that using strcasecmp
would outperform strtoupper
, but the results weren't conclusive. I suspect it's likely highly dependent on inputs.
EDIT: To be perfectly clear this is not a change request. I'm documenting that I explored alternatives in the implementation.
patch
diff --git before/src/wp-includes/class-wp-token-map.php after/src/wp-includes/class-wp-token-map.php
index 40c0520659..75beabe074 100644
--- before/src/wp-includes/class-wp-token-map.php
+++ after/src/wp-includes/class-wp-token-map.php
@@ -575,19 +575,16 @@ class WP_Token_Map {
* @return string|false Mapped value of lookup key if found, otherwise `false`.
*/
private function read_small_token( $text, $offset, &$skip_bytes, $case_sensitivity = 'case-sensitive' ) {
- $ignore_case = 'case-insensitive' === $case_sensitivity;
- $small_length = strlen( $this->small_words );
- $search_text = substr( $text, $offset, $this->key_length );
- if ( $ignore_case ) {
- $search_text = strtoupper( $search_text );
- }
+ $ignore_case = 'case-insensitive' === $case_sensitivity;
+ $small_length = strlen( $this->small_words );
+ $search_text = substr( $text, $offset, $this->key_length );
$starting_char = $search_text[0];
$at = 0;
while ( $at < $small_length ) {
if (
$starting_char !== $this->small_words[ $at ] &&
- ( ! $ignore_case || strtoupper( $this->small_words[ $at ] ) !== $starting_char )
+ ( ! $ignore_case || 0 !== strcasecmp( $this->small_words[ $at ], $starting_char ) )
) {
$at += $this->key_length + 1;
continue;
@@ -601,7 +598,7 @@ class WP_Token_Map {
if (
$search_text[ $adjust ] !== $this->small_words[ $at + $adjust ] &&
- ( ! $ignore_case || strtoupper( $this->small_words[ $at + $adjust ] !== $search_text[ $adjust ] ) )
+ ( ! $ignore_case || 0 !== strcasecmp( $this->small_words[ $at + $adjust ], $search_text[ $adjust ] ) )
) {
$at += $this->key_length + 1;
continue 2;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the alternative exploration. I am not sure it would be possible to use strcasecmp
like this here because we aren't looking for full matches, necessarily, in the small_words
array. we're really looking for a match in the $text
of a string in the small words array, but the words in the small words array are zero-extended.
for example, if we have ab
in $text
at the $offset
, and "a\x00\x00"
in the small words array, we will not match 0 === strcasecmp()
, but there is indeed a match.
so we could add a length to the comparison, but then if it didn't match, we'd need to backtrack and try matching with length - 1
and so on until length === 1
and there are no matches.
by crawling forward only in the document we can ensure that any prefix of the input text will match as expected on a word shorter than the longest small word.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is still to do character by character comparison, not full string comparison, but instead of calling strtoupper
on the needle and on the characters, strcasecmp
does the work of ASCII case-insensitive matching.
* @param string $text String in which to search for a lookup key. | ||
* @param ?int $offset How many bytes into the string where the lookup key ought to start. | ||
* @param ?int &$skip_bytes Holds byte-length of found lookup key if matched, otherwise not set. | ||
* @param ?string $case_sensitivity 'case-insensitive' to ignore ASCII case or default of 'case-sensitive'. | ||
* @return string|false Mapped value of lookup key if found, otherwise `false`. | ||
*/ | ||
public function read_token( $text, $offset = 0, &$skip_bytes = null, $case_sensitivity = 'case-sensitive' ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(this is a reply to other comments - GitHub doesn't really thread well when it's part of a review)
For the record, it seems you can pass an uninitialized variable name to the function and it's not even a warning in this case:
// first place $skip_bytes appears
$map->read_token( 'x', 0, $skip_bytes, 'case-insensitive' );
var_dump($skip_bytes); // int(0)
* | ||
* @param string $text String in which to search for a lookup key. | ||
* @param ?int $offset How many bytes into the string where the lookup key ought to start. | ||
* @param ?int &$skip_bytes Holds byte-length of found lookup key if matched, otherwise not set. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This parameter is very interesting. It seems very important, it's a bit like $matches
in preg_match
in that it's an output of the function and not an input.
I think the name skip_bytes
can be improved. That's very focused on what calling code would likelye do with the value and less what the value is. Maybe something like $matched_token_bytelength
or would be a descriptive name of what it is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please. It was nearly impossible for me to understand the exact purpose of this $skip_bytes
param by looking at code examples. I'm still not sure why it would be useful for developers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is that you're working with an input like this and using a token map to detect (and perform) replacements:
true ¬ false
We'd like to decode this by replacing entities, so we start by checking matches (this would be in a loop):
$m = WP_Token_Map::from_array([ '¬' => '¬' ]);
$s = 'true ¬ false';
// check here
// |
// v
// "true ¬ false"
$replacement = $m->read_token( $s, 0, $skip_bytes );
// $replacement = false
// $skip_bytes = null
// check here
// |
// v
// "true ¬ false"
$replacement = $m->read_token( $s, 1, $skip_bytes );
// $replacement = false
// $skip_bytes = null
// …
// check here
// |
// v
// "true ¬ false"
$replacement = $m->read_token( $s, 5, $skip_bytes );
// $replacement = "¬"
// $skip_bytes = 5
At this point, we've hit a match and would perform replacement, but the function just returned "¬"
which means "at index 5 there's a token matching ¬
" but we don't know what the token is so it's unclear how we should perform replacement. We need to know where it starts (we know the offset, we passed it in as an argument) what it matches (¬
was returned) but also where the token ends. That's where $skip_bytes
is involved, it lets us know the byte length of the matched token so we can replace (or resume looking after the end of the token).
// 5 here is the index in the string where we found the token, we know that from our
// original iteration
substr( $s, 0, 5 ) . $replacement . substr( $s, 5 + $skip_bytes ); // "true ¬ false"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so it’s the length of the token matched that we will ignore while constructing the replaced version. The coincidence that the offset and skip bytes were both equal to 5 made this puzzle more challenging, but I hope I got it right. I was mostly making the case that it was the only difficult part to reason in the PHPDoc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds like I should pull in some of the wording from the dev note and move it into the docblocks…
https://make.wordpress.org/core/?p=113042&preview=1&_ppp=452181011b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it’s the length of the token matched
Yep! In this case it's strlen( "¬" )
: 5.
the coincidence that the offset and skip bytes were both equal to 5 made this puzzle more challenging.
🤦 yes, poor execution in my example 😓
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've renamed the argument to $matched_token_byte_length
. It's verbose, but more explanatory.
* false === $smilies->read_token( 'Not sure :?.', 0, $bytes_skipped ); | ||
* '😕' === $smilies->read_token( 'Not sure :?.', 9, $bytes_skipped ); | ||
* 2 === $bytes_skipped; | ||
* | ||
* Example: | ||
* | ||
* while ( $at < strlen( $input ) ) { | ||
* $next_at = strpos( $input, ':', $at ); | ||
* if ( false === $next_at ) { | ||
* break; | ||
* } | ||
* | ||
* $smily = $smilies->read_token( $input, $next_at, $bytes_skipped ); | ||
* if ( false === $next_at ) { | ||
* ++$at; | ||
* continue; | ||
* } | ||
* | ||
* $prefix = substr( $input, $at, $next_at - $at ); | ||
* $at += $bytes_skipped; | ||
* $output .= "{$prefix}{$smily}"; | ||
* } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this example uses bytes_skipped
instead of skip_bytes
, which is slightly confusing. It would be good to use the parameter name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they were different because of perspective. including matched_
in the name I think helps the wording on the internal side, but I think that calling code still has a valid reason to use a different name. maybe I can further clarify with something like $handle_length
I've struggled to test this with the HTML API. in effect, some system locales could cause, I believe, a few characters to be case-folded differently, but this is such a broken system I'm not of the opinion that it's a big risk here. I'm willing to let this sit as a hypothetical bug until we have a clearer understanding of it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good. Thanks!
One suggestion that can be done in a follow up, I think automating pulling https://html.spec.whatwg.org/entities.json and checking to see if the autogenerated file has been changed would be beneficial. Can likely be done on a weekly schedule and only in trunk.
) { | ||
_doing_it_wrong( | ||
__METHOD__, | ||
__( 'Token Map tokens and substitutions must all be shorter than 256 bytes.' ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Rather than hardcoding 256, I think this should use self::MAX_LENGTH
on the chance it changes in the future.
__( 'Token Map tokens and substitutions must all be shorter than 256 bytes.' ), | |
/* translators: maximum byte length */ | |
sprintf( __( 'Token Map tokens and substitutions must all be shorter than %i bytes.' ), self::MAX_LENGTH ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch! incorporated in 0e7ab4c
* @package WordPress | ||
* | ||
* @since 6.6.0 | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can add a new token-map
group. It also might be worth putting in the html-api
group since this is where the bigger effort lies
* be downloaded on every test run. By specification, it cannot change | ||
* and will not be updated. | ||
* | ||
* @ticket 60698 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this is a helper rather than a test, the ticket number isn't necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed in 0e7ab4c
$this->assertSame( | ||
$replacement, | ||
$response, | ||
'Returned the wrong replacement value' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Make the error a bit more descriptive
'Returned the wrong replacement value' | |
'Returned the wrong replacement value for ' . $token |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
included the token in the message in 0e7ab4c
this is a good suggestion, @aaronjorbin - and thanks for the review! (I'll be addressing the other feedback inline where you left it). my one main reluctance to automate the download is that the specification requires that the list never change, and automating it only adds a point of failure to the process, well, a point of failure plus some latency. do you think that it's worth adding this additional complexity in order to change what is defined to not change? I believe that a wide cascade of failures would occur around the internet at large if we found new character references in HTML. I'm happy to add this if there are strong feelings about it, though if that's the case, maybe we could consider it a follow-up task so as not to delay this functionality? a separate task where we can discuss the merits of the automation? |
@aaronjorbin following the html5lib test suite, I added the |
@subgroup isn't an annotation that is a part of PHPUnit. I think this group makes sense, thanks! |
Upon reading https://github.com/whatwg/html/blob/main/FAQ.md#html-should-add-more-named-character-references I withdrawal this suggestion. I do think it might make sense to link that and mention that the entities json should never change in the readme for others that come across this with a similar thinking as me. |
Thanks again @aaronjorbin! It's already mentioned at the top of the README, in the character reference class (and as well in the autogenerator script where it comes from), the PR description, and the Trac ticket description. Is that good enough or was there somewhere else you wanted it? Maybe the way I wrote it wasn't as direct or clear as it should be? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good to me. Thanks for answering all my questions @dmsnell!
This patch introduces a new class: `WP_Token_Map`, designed for efficient lookup and translation of static mappings between string keys or tokens, and string replacements (for example, HTML character references). The Token Map imposes certain restrictions on the byte length of the lookup tokens and their replacements, but is a highly-optimized data structure for mappings with a very high number of tokens. Developed in #5373 Discussed in https://core.trac.wordpress.org/ticket/60698 Fixes #60698. Props: dmsnell, gziolo, jonsurrell, jorbin. git-svn-id: https://develop.svn.wordpress.org/trunk@58188 602fd350-edb4-49c9-b593-d223f7449a82
This patch introduces a new class: `WP_Token_Map`, designed for efficient lookup and translation of static mappings between string keys or tokens, and string replacements (for example, HTML character references). The Token Map imposes certain restrictions on the byte length of the lookup tokens and their replacements, but is a highly-optimized data structure for mappings with a very high number of tokens. Developed in WordPress/wordpress-develop#5373 Discussed in https://core.trac.wordpress.org/ticket/60698 Fixes #60698. Props: dmsnell, gziolo, jonsurrell, jorbin. Built from https://develop.svn.wordpress.org/trunk@58188 git-svn-id: http://core.svn.wordpress.org/trunk@57651 1a063a9b-81f0-0310-95a4-ce76da25c4cd
This patch introduces a new class: `WP_Token_Map`, designed for efficient lookup and translation of static mappings between string keys or tokens, and string replacements (for example, HTML character references). The Token Map imposes certain restrictions on the byte length of the lookup tokens and their replacements, but is a highly-optimized data structure for mappings with a very high number of tokens. Developed in WordPress/wordpress-develop#5373 Discussed in https://core.trac.wordpress.org/ticket/60698 Fixes #60698. Props: dmsnell, gziolo, jonsurrell, jorbin. Built from https://develop.svn.wordpress.org/trunk@58188 git-svn-id: https://core.svn.wordpress.org/trunk@57651 1a063a9b-81f0-0310-95a4-ce76da25c4cd
|
||
// Longer strings are less-than for comparison's sake. | ||
if ( $length_a !== $length_b ) { | ||
return $length_b - $length_a; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a perfect place for the spaceship operator comparison to return -1, 0, or 1 return $length_b <=> $length_a;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we sure about this?
we need to make sure that longer strings sort before shorter ones, so for example, ba
appears before a
-1 === longest_first_then_alphabetical( 'ba', 'a' );
1 === 'ba' <=> 'a';
1 === strcmp( 'ba', 'a' );
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think this is correct. I meant specifically the length comparison could be return $length_b <=> $length_a;
. The result is basically the same, but it returns -1
, 0
, or 1
instead of <0
, 0
, or >0
.
demo
<?php
function longest_first_then_alphabetical( $a, $b ) {
if ( $a === $b ) {
return 0;
}
$length_a = strlen( $a );
$length_b = strlen( $b );
// Longer strings are less-than for comparison's sake.
if ( $length_a !== $length_b ) {
return $length_b - $length_a;
}
return strcmp( $a, $b );
}
function longest_first_then_alphabetical_spaceship( $a, $b ) {
if ( $a === $b ) {
return 0;
}
$length_a = strlen( $a );
$length_b = strlen( $b );
// Longer strings are less-than for comparison's sake.
if ( $length_a !== $length_b ) {
// 👇👇👇 This is the only different line 👇👇👇
return $length_b <=> $length_a;
}
return strcmp( $a, $b );
}
var_dump(
longest_first_then_alphabetical( 'baa', 'a'),
longest_first_then_alphabetical_spaceship( 'baa', 'a'),
"\n",
longest_first_then_alphabetical( 'abb', 'a'),
longest_first_then_alphabetical_spaceship( 'abb', 'a'),
"\n",
longest_first_then_alphabetical( 'a', 'abb'),
longest_first_then_alphabetical_spaceship( 'a', 'abb'),
"\n",
longest_first_then_alphabetical( 'a', 'abb'),
longest_first_then_alphabetical_spaceship( 'a', 'abb'),
"\n",
longest_first_then_alphabetical( 'a', 'baa'),
longest_first_then_alphabetical_spaceship( 'a', 'baa'),
"\n",
longest_first_then_alphabetical( 'a', 'baa'),
longest_first_then_alphabetical_spaceship( 'a', 'baa'),
"\n",
longest_first_then_alphabetical( 'a', 'a'),
longest_first_then_alphabetical_spaceship( 'a', 'a'),
"\n",
longest_first_then_alphabetical( 'b', 'a'),
longest_first_then_alphabetical_spaceship( 'b', 'a'),
"\n",
longest_first_then_alphabetical( 'a', 'b'),
longest_first_then_alphabetical_spaceship( 'a', 'b'),
"\n",
);
int(-2)
int(-1)
string(1) "
"
int(-2)
int(-1)
string(1) "
"
int(2)
int(1)
string(1) "
"
int(2)
int(1)
string(1) "
"
int(2)
int(1)
string(1) "
"
int(2)
int(1)
string(1) "
"
int(0)
int(0)
string(1) "
"
int(1)
int(1)
string(1) "
"
int(-1)
int(-1)
string(1) "
"
</details>
Trac ticket: Core-60698
(previously Core-60841)
Motivated by the need to properly transform HTML named character references (like &) I found the need for a new semantic class which can efficiently perform search and replacement of a set of static tokens. Existing patterns in the codebase are not sufficient for the HTML need, and I suspect there are other use-cases where this class would help.
In #6387 I have built a spec-compliant HTML5 text decoder utilizing this token map. The performance of the new decoder is approximately 20% slower than calling
html_entity_decode()
directly, except that it properly decodes what PHP can't. In fact, the performance bottleneck in that PR comes from converting into UTF-8 the sequence of digits in numeric character references, not in looking up named character references.This proposal is adding a new class
WP_Token_Map
providing at least two methods for normal use:contains( $token )
returns whether the passed string is in the set.read_token( $text, $offset = 0, $skip_bytes )
indicates if the character sequence starting at the given offset in the passed string forms a token in the set, and if so, returns the replacement for that token. It also sets &$skip_bytes to the length of the token so that calling code .It also provides utility functions for pre-computing these classes, as they are designed for relatively-static cases where the actual code is intended to be generated dynamically, but stay static over time. For example, HTML5 defines the set of named character references and indicates that the list shall not change or be expanded. HTML5 spec. Precomputing can save on the startup-up cost of building the optimized lookup tables.
WP_Token_Map::from_array( array $mappings )
generates a new token map from the given array of whose keys are tokens and whose values are the replacements.to_array()
dumps the set of mapping into an array suitable for passing back intofrom_array()
.WP_Token_Map::from_precomputed_table( ...$table )
instantiates a token set from a precomputed table, skipping the computation for building the table and sorting the tokens.precomputed_php_source_table()
generates PHP source code which can be loaded with the previous static method for maintenance of the core static token sets.Other potential uses