Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Token Map: Introduce an efficient lookup and translation class for string mappings. #5373

Closed
wants to merge 29 commits into from

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented Oct 3, 2023

Trac ticket: Core-60698
(previously Core-60841)

Motivated by the need to properly transform HTML named character references (like &) I found the need for a new semantic class which can efficiently perform search and replacement of a set of static tokens. Existing patterns in the codebase are not sufficient for the HTML need, and I suspect there are other use-cases where this class would help.


In #6387 I have built a spec-compliant HTML5 text decoder utilizing this token map. The performance of the new decoder is approximately 20% slower than calling html_entity_decode() directly, except that it properly decodes what PHP can't. In fact, the performance bottleneck in that PR comes from converting into UTF-8 the sequence of digits in numeric character references, not in looking up named character references.


This proposal is adding a new class WP_Token_Map providing at least two methods for normal use:

  • contains( $token ) returns whether the passed string is in the set.
  • read_token( $text, $offset = 0, $skip_bytes ) indicates if the character sequence starting at the given offset in the passed string forms a token in the set, and if so, returns the replacement for that token. It also sets &$skip_bytes to the length of the token so that calling code .

It also provides utility functions for pre-computing these classes, as they are designed for relatively-static cases where the actual code is intended to be generated dynamically, but stay static over time. For example, HTML5 defines the set of named character references and indicates that the list shall not change or be expanded. HTML5 spec. Precomputing can save on the startup-up cost of building the optimized lookup tables.

  • WP_Token_Map::from_array( array $mappings ) generates a new token map from the given array of whose keys are tokens and whose values are the replacements.
  • to_array() dumps the set of mapping into an array suitable for passing back into from_array().
  • WP_Token_Map::from_precomputed_table( ...$table ) instantiates a token set from a precomputed table, skipping the computation for building the table and sorting the tokens.
  • precomputed_php_source_table() generates PHP source code which can be loaded with the previous static method for maintenance of the core static token sets.

Other potential uses

  • Converting smilies like :)
  • Converting emoji sequences like :happy:
  • Determining if a given verb/action is allowed in an API call.

@dmsnell dmsnell force-pushed the experiment/add-wp-token-set branch from c82d002 to 77b3a5b Compare October 3, 2023 18:01
@dmsnell dmsnell marked this pull request as ready for review March 25, 2024 16:16
Copy link

github-actions bot commented Mar 25, 2024

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, jonsurrell, gziolo, jorbin.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@dmsnell dmsnell force-pushed the experiment/add-wp-token-set branch from 45c9bd9 to c749413 Compare March 25, 2024 17:22
Copy link

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance,
    it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

@dmsnell dmsnell force-pushed the experiment/add-wp-token-set branch from c749413 to 502e4f5 Compare May 8, 2024 23:17
@dmsnell
Copy link
Member Author

dmsnell commented May 8, 2024

Rebased from e98508f10d17b606cf3f8761c82bb0f8750b263d to 502e4f5

@dmsnell dmsnell changed the title Experiment: Add optimized set lookup class Token Map: Introduce an efficient lookup and translation class for string mappings. May 13, 2024
@dmsnell dmsnell force-pushed the experiment/add-wp-token-set branch 3 times, most recently from 90ab28d to aa78d9b Compare May 14, 2024 17:25
dmsnell added 2 commits May 14, 2024 20:21
This patch introduces a new class: `WP_Token_Map`, which is designed for
efficient lookup and translation of static mappings between string keys
or tokens, and string replacements (for example, HTML character
references).

The token map incorporates certain limitations on the string length of
the tokens, but takes advantage of multiple optimizations to efficiently
Copy link
Member

@gziolo gziolo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some high level pass on the PR and left my feedback. I was looking at the code from the HTML API encoding/decoding angle. I don't have better insights to share than @aaronjorbin in the Trac ticket.

Overall, the code is very complex and very specific to the requirements that HTML API has so I'm not the best person to draw any conclusions on whether this class is general enough. However, based on the explanation provided by @dmsnell, it definitely is important part of the implementation in the terms of performance for the work on HTML entities encoding and decoding.

tests/phpunit/tests/wp-token-map/wpTokenMap.php Outdated Show resolved Hide resolved
tests/phpunit/tests/wp-token-map/wpTokenMap.php Outdated Show resolved Hide resolved
tests/phpunit/tests/wp-token-map/wpTokenMap.php Outdated Show resolved Hide resolved
src/wp-includes/class-wp-token-map.php Outdated Show resolved Hide resolved
src/wp-includes/class-wp-token-map.php Outdated Show resolved Hide resolved
@dmsnell dmsnell force-pushed the experiment/add-wp-token-set branch from aa78d9b to 388c50d Compare May 16, 2024 16:14
@gziolo
Copy link
Member

gziolo commented May 16, 2024

Some CI jobs report that some static test helpers are private and it can’t access them.


// phpcs:disable

global \$html5_named_character_references;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for other reviewers:

It took me a while to see that \$var is escaping the $ in $var. I kept seeing this as the namespace global space. In the generated file is just global $html5_named_character_references;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note in a comment above the template. Hope that helps

Comment on lines +376 to +380
$this->assertSame(
self::KNOWN_COUNT_OF_ALL_HTML5_NAMED_CHARACTER_REFERENCES,
count( $html5 ),
'Found the wrong number of HTML5 named character references: confirm the entities.json file."'
);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems out of place to have an assertion in the data provider. Maybe another simple test could not annotate the data provider, but call it make an assertion about its length?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷‍♂️ maybe, thought then the test isn't testing the unit under test but the test runner itself. it's weird in the data provider, but does it warrant its own separate test?

I thought about throwing in here too but then figured an assertion could be clearer.

I don't have strong feelings about this; the assertion inside the data provider seemed a fitting tool but I'm happy to reconsider.

tests/phpunit/tests/wp-token-map/wpTokenMap.php Outdated Show resolved Hide resolved
Copy link
Member

@sirreal sirreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, but there's a lot in here. I'm digging into the implementation more now and I'll finish a complete review tomorrow.

private $key_length = 2;

/**
* Stores an optimized form of the word set, where words are grouped
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see the explicit rule for large vs. small words in their docs:

…where large words are strings whose strlen greater than is key_length

(that's almost certainly wrong, just an example)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added class-level description in 31c57b0

private $groups = '';

/**
* Stores an optimized row of small words, where every entry is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see the explicit rule for large vs. small words in their docs:

…where small words are strings whose strlen is less than or equal to key_length

(that's almost certainly wrong, just an example)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a class-level description in 31c57b0

@dmsnell
Copy link
Member Author

dmsnell commented May 17, 2024

It seems fine to me if we don't support case insensitive non-ascii token matching as long as we're clear about it.

Ha! I went to add these and I already noted this in the function docblock.

'case-insensitive' to ignore ASCII case or default of 'case-sensitive'

Screenshot 2024-05-17 at 11 28 38 AM

Text issues are apparently always on my mind. Let me know if you still think this is unclear. I'm worried that if we try and add more documentation when it's already there right at the cursor when calling the function, then it won't be any more helpful, just more noise.

@sirreal
Copy link
Member

sirreal commented May 20, 2024

I already noted this in the function docblock … Let me know if you still think this is unclear.

I think this is fine. I was searching for words like "mb" or "multibyte", but it seems like ASCII is correct 👍

I did notice a changelog entry on the PHP documentation page that makes me wonder if we might need to support PHP versions that have varying behavior on non-ASCII treatment. I wasn't able to trigger any differing behavior myself, but do you know what this is about?

8.2.0 Case conversion no longer depends on the locale set with setlocale(). Only ASCII characters will be converted.

That seems like exactly what we want, but I've looked around PHP source, the changelog, etc. and I haven't managed to find exactly what changed or examples of the different.

Copy link
Member

@sirreal sirreal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been over this in detail and left a lot of feedback. Thanks for the patience @dmsnell. I've left what I believe are my final questions, comments, and suggestions, but I feel good about this change in general.

src/wp-includes/class-wp-token-map.php Outdated Show resolved Hide resolved
Comment on lines +581 to +583
if ( $ignore_case ) {
$search_text = strtoupper( $search_text );
}
Copy link
Member

@sirreal sirreal May 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was comparing the performance of strtoupper and strcasecmp here. I thought that using strcasecmp would outperform strtoupper, but the results weren't conclusive. I suspect it's likely highly dependent on inputs.

EDIT: To be perfectly clear this is not a change request. I'm documenting that I explored alternatives in the implementation.

patch
diff --git before/src/wp-includes/class-wp-token-map.php after/src/wp-includes/class-wp-token-map.php
index 40c0520659..75beabe074 100644
--- before/src/wp-includes/class-wp-token-map.php
+++ after/src/wp-includes/class-wp-token-map.php
@@ -575,19 +575,16 @@ class WP_Token_Map {
 	 * @return string|false Mapped value of lookup key if found, otherwise `false`.
 	 */
 	private function read_small_token( $text, $offset, &$skip_bytes, $case_sensitivity = 'case-sensitive' ) {
-		$ignore_case  = 'case-insensitive' === $case_sensitivity;
-		$small_length = strlen( $this->small_words );
-		$search_text  = substr( $text, $offset, $this->key_length );
-		if ( $ignore_case ) {
-			$search_text = strtoupper( $search_text );
-		}
+		$ignore_case   = 'case-insensitive' === $case_sensitivity;
+		$small_length  = strlen( $this->small_words );
+		$search_text   = substr( $text, $offset, $this->key_length );
 		$starting_char = $search_text[0];
 
 		$at = 0;
 		while ( $at < $small_length ) {
 			if (
 				$starting_char !== $this->small_words[ $at ] &&
-				( ! $ignore_case || strtoupper( $this->small_words[ $at ] ) !== $starting_char )
+				( ! $ignore_case || 0 !== strcasecmp( $this->small_words[ $at ], $starting_char ) )
 			) {
 				$at += $this->key_length + 1;
 				continue;
@@ -601,7 +598,7 @@ class WP_Token_Map {
 
 				if (
 					$search_text[ $adjust ] !== $this->small_words[ $at + $adjust ] &&
-					( ! $ignore_case || strtoupper( $this->small_words[ $at + $adjust ] !== $search_text[ $adjust ] ) )
+					( ! $ignore_case || 0 !== strcasecmp( $this->small_words[ $at + $adjust ], $search_text[ $adjust ] ) )
 				) {
 					$at += $this->key_length + 1;
 					continue 2;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the alternative exploration. I am not sure it would be possible to use strcasecmp like this here because we aren't looking for full matches, necessarily, in the small_words array. we're really looking for a match in the $text of a string in the small words array, but the words in the small words array are zero-extended.

for example, if we have ab in $text at the $offset, and "a\x00\x00" in the small words array, we will not match 0 === strcasecmp(), but there is indeed a match.

so we could add a length to the comparison, but then if it didn't match, we'd need to backtrack and try matching with length - 1 and so on until length === 1 and there are no matches.

by crawling forward only in the document we can ensure that any prefix of the input text will match as expected on a word shorter than the longest small word.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is still to do character by character comparison, not full string comparison, but instead of calling strtoupper on the needle and on the characters, strcasecmp does the work of ASCII case-insensitive matching.

Comment on lines 466 to 472
* @param string $text String in which to search for a lookup key.
* @param ?int $offset How many bytes into the string where the lookup key ought to start.
* @param ?int &$skip_bytes Holds byte-length of found lookup key if matched, otherwise not set.
* @param ?string $case_sensitivity 'case-insensitive' to ignore ASCII case or default of 'case-sensitive'.
* @return string|false Mapped value of lookup key if found, otherwise `false`.
*/
public function read_token( $text, $offset = 0, &$skip_bytes = null, $case_sensitivity = 'case-sensitive' ) {
Copy link
Member

@sirreal sirreal May 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this is a reply to other comments - GitHub doesn't really thread well when it's part of a review)

For the record, it seems you can pass an uninitialized variable name to the function and it's not even a warning in this case:

// first place $skip_bytes appears
$map->read_token( 'x', 0, $skip_bytes, 'case-insensitive' );
var_dump($skip_bytes); // int(0)

*
* @param string $text String in which to search for a lookup key.
* @param ?int $offset How many bytes into the string where the lookup key ought to start.
* @param ?int &$skip_bytes Holds byte-length of found lookup key if matched, otherwise not set.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parameter is very interesting. It seems very important, it's a bit like $matches in preg_match in that it's an output of the function and not an input.

I think the name skip_bytes can be improved. That's very focused on what calling code would likelye do with the value and less what the value is. Maybe something like $matched_token_bytelength or would be a descriptive name of what it is?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, please. It was nearly impossible for me to understand the exact purpose of this $skip_bytes param by looking at code examples. I'm still not sure why it would be useful for developers.

Copy link
Member

@sirreal sirreal May 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that you're working with an input like this and using a token map to detect (and perform) replacements:

true &not; false

We'd like to decode this by replacing entities, so we start by checking matches (this would be in a loop):

$m = WP_Token_Map::from_array([ '&not;' => '¬' ]);
$s = 'true &not; false';

//  check here
//  |
//  v
// "true &not; false"
$replacement = $m->read_token( $s, 0, $skip_bytes );
// $replacement = false
// $skip_bytes = null

//   check here
//   |
//   v
// "true &not; false"
$replacement = $m->read_token( $s, 1, $skip_bytes );
// $replacement = false
// $skip_bytes = null

// …

//       check here
//       |
//       v
// "true &not; false"
$replacement = $m->read_token( $s, 5, $skip_bytes );
// $replacement = "¬"
// $skip_bytes = 5

At this point, we've hit a match and would perform replacement, but the function just returned "¬" which means "at index 5 there's a token matching ¬" but we don't know what the token is so it's unclear how we should perform replacement. We need to know where it starts (we know the offset, we passed it in as an argument) what it matches (¬ was returned) but also where the token ends. That's where $skip_bytes is involved, it lets us know the byte length of the matched token so we can replace (or resume looking after the end of the token).

// 5 here is the index in the string where we found the token, we know that from our
// original iteration
substr( $s, 0, 5 ) . $replacement . substr( $s, 5 + $skip_bytes ); // "true ¬ false"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so it’s the length of the token matched that we will ignore while constructing the replaced version. The coincidence that the offset and skip bytes were both equal to 5 made this puzzle more challenging, but I hope I got it right. I was mostly making the case that it was the only difficult part to reason in the PHPDoc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds like I should pull in some of the wording from the dev note and move it into the docblocks…

https://make.wordpress.org/core/?p=113042&preview=1&_ppp=452181011b

Copy link
Member

@sirreal sirreal May 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it’s the length of the token matched

Yep! In this case it's strlen( "&not;" ): 5.

the coincidence that the offset and skip bytes were both equal to 5 made this puzzle more challenging.

🤦 yes, poor execution in my example 😓

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've renamed the argument to $matched_token_byte_length. It's verbose, but more explanatory.

Comment on lines 494 to 515
* false === $smilies->read_token( 'Not sure :?.', 0, $bytes_skipped );
* '😕' === $smilies->read_token( 'Not sure :?.', 9, $bytes_skipped );
* 2 === $bytes_skipped;
*
* Example:
*
* while ( $at < strlen( $input ) ) {
* $next_at = strpos( $input, ':', $at );
* if ( false === $next_at ) {
* break;
* }
*
* $smily = $smilies->read_token( $input, $next_at, $bytes_skipped );
* if ( false === $next_at ) {
* ++$at;
* continue;
* }
*
* $prefix = substr( $input, $at, $next_at - $at );
* $at += $bytes_skipped;
* $output .= "{$prefix}{$smily}";
* }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this example uses bytes_skipped instead of skip_bytes, which is slightly confusing. It would be good to use the parameter name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they were different because of perspective. including matched_ in the name I think helps the wording on the internal side, but I think that calling code still has a valid reason to use a different name. maybe I can further clarify with something like $handle_length

@dmsnell
Copy link
Member Author

dmsnell commented May 20, 2024

I did notice a changelog entry on the PHP documentation page that makes me wonder if we might need to support PHP versions that have varying behavior on non-ASCII treatment. I wasn't able to trigger any differing behavior myself, but do you know what this is about?

I've struggled to test this with the HTML API. in effect, some system locales could cause, I believe, a few characters to be case-folded differently, but this is such a broken system I'm not of the opinion that it's a big risk here. I'm willing to let this sit as a hypothetical bug until we have a clearer understanding of it.

Copy link
Member

@aaronjorbin aaronjorbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good. Thanks!

One suggestion that can be done in a follow up, I think automating pulling https://html.spec.whatwg.org/entities.json and checking to see if the autogenerated file has been changed would be beneficial. Can likely be done on a weekly schedule and only in trunk.

) {
_doing_it_wrong(
__METHOD__,
__( 'Token Map tokens and substitutions must all be shorter than 256 bytes.' ),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Rather than hardcoding 256, I think this should use self::MAX_LENGTH on the chance it changes in the future.

Suggested change
__( 'Token Map tokens and substitutions must all be shorter than 256 bytes.' ),
/* translators: maximum byte length */
sprintf( __( 'Token Map tokens and substitutions must all be shorter than %i bytes.' ), self::MAX_LENGTH ),

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch! incorporated in 0e7ab4c

* @package WordPress
*
* @since 6.6.0
*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can add a new token-map group. It also might be worth putting in the html-api group since this is where the bigger effort lies

* be downloaded on every test run. By specification, it cannot change
* and will not be updated.
*
* @ticket 60698
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this is a helper rather than a test, the ticket number isn't necessary

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed in 0e7ab4c

$this->assertSame(
$replacement,
$response,
'Returned the wrong replacement value'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Make the error a bit more descriptive

Suggested change
'Returned the wrong replacement value'
'Returned the wrong replacement value for ' . $token

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

included the token in the message in 0e7ab4c

@dmsnell
Copy link
Member Author

dmsnell commented May 21, 2024

One suggestion that can be done in a follow up, I think automating pulling https://html.spec.whatwg.org/entities.json and checking to see if the autogenerated file has been changed would be beneficial. Can likely be done on a weekly schedule and only in trunk.

this is a good suggestion, @aaronjorbin - and thanks for the review! (I'll be addressing the other feedback inline where you left it).

my one main reluctance to automate the download is that the specification requires that the list never change, and automating it only adds a point of failure to the process, well, a point of failure plus some latency.

do you think that it's worth adding this additional complexity in order to change what is defined to not change? I believe that a wide cascade of failures would occur around the internet at large if we found new character references in HTML.

I'm happy to add this if there are strong feelings about it, though if that's the case, maybe we could consider it a follow-up task so as not to delay this functionality? a separate task where we can discuss the merits of the automation?

@dmsnell
Copy link
Member Author

dmsnell commented May 21, 2024

I think we can add a new token-map group. It also might be worth putting in the html-api group since this is where the bigger effort lies

@aaronjorbin following the html5lib test suite, I added the wpTokenMap test module to the @group html-api-token-map group. happy to change if this isn't idea; I was looking for more examples and didn't see. it can also be set as a @package and @subpackage but I was looking for @subgroup and didn't find any

@aaronjorbin
Copy link
Member

@aaronjorbin following the html5lib test suite, I added the wpTokenMap test module to the @group html-api-token-map group. happy to change if this isn't idea; I was looking for more examples and didn't see. it can also be set as a @package and @subpackage but I was looking for @subgroup and didn't find any

@subgroup isn't an annotation that is a part of PHPUnit. I think this group makes sense, thanks!

@aaronjorbin
Copy link
Member

One suggestion that can be done in a follow up, I think automating pulling https://html.spec.whatwg.org/entities.json and checking to see if the autogenerated file has been changed would be beneficial. Can likely be done on a weekly schedule and only in trunk.

this is a good suggestion, @aaronjorbin - and thanks for the review! (I'll be addressing the other feedback inline where you left it).

my one main reluctance to automate the download is that the specification requires that the list never change, and automating it only adds a point of failure to the process, well, a point of failure plus some latency.

do you think that it's worth adding this additional complexity in order to change what is defined to not change? I believe that a wide cascade of failures would occur around the internet at large if we found new character references in HTML.

I'm happy to add this if there are strong feelings about it, though if that's the case, maybe we could consider it a follow-up task so as not to delay this functionality? a separate task where we can discuss the merits of the automation?

Upon reading https://github.com/whatwg/html/blob/main/FAQ.md#html-should-add-more-named-character-references I withdrawal this suggestion. I do think it might make sense to link that and mention that the entities json should never change in the readme for others that come across this with a similar thinking as me.

@dmsnell
Copy link
Member Author

dmsnell commented May 22, 2024

I do think it might make sense to link that and mention that the entities json should never change in the readme for others that come across this with a similar thinking as me.

Thanks again @aaronjorbin!

It's already mentioned at the top of the README, in the character reference class (and as well in the autogenerator script where it comes from), the PR description, and the Trac ticket description.

Is that good enough or was there somewhere else you wanted it? Maybe the way I wrote it wasn't as direct or clear as it should be?

Copy link
Member

@aaronjorbin aaronjorbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good to me. Thanks for answering all my questions @dmsnell!

pento pushed a commit that referenced this pull request May 23, 2024
This patch introduces a new class: `WP_Token_Map`, designed for efficient
lookup and translation of static mappings between string keys or tokens, and
string replacements (for example, HTML character references).

The Token Map imposes certain restrictions on the byte length of the lookup
tokens and their replacements, but is a highly-optimized data structure for
mappings with a very high number of tokens.

Developed in #5373
Discussed in https://core.trac.wordpress.org/ticket/60698

Fixes #60698.
Props: dmsnell, gziolo, jonsurrell, jorbin.


git-svn-id: https://develop.svn.wordpress.org/trunk@58188 602fd350-edb4-49c9-b593-d223f7449a82
markjaquith pushed a commit to markjaquith/WordPress that referenced this pull request May 23, 2024
This patch introduces a new class: `WP_Token_Map`, designed for efficient
lookup and translation of static mappings between string keys or tokens, and
string replacements (for example, HTML character references).

The Token Map imposes certain restrictions on the byte length of the lookup
tokens and their replacements, but is a highly-optimized data structure for
mappings with a very high number of tokens.

Developed in WordPress/wordpress-develop#5373
Discussed in https://core.trac.wordpress.org/ticket/60698

Fixes #60698.
Props: dmsnell, gziolo, jonsurrell, jorbin.

Built from https://develop.svn.wordpress.org/trunk@58188


git-svn-id: http://core.svn.wordpress.org/trunk@57651 1a063a9b-81f0-0310-95a4-ce76da25c4cd
github-actions bot pushed a commit to platformsh/wordpress-performance that referenced this pull request May 23, 2024
This patch introduces a new class: `WP_Token_Map`, designed for efficient
lookup and translation of static mappings between string keys or tokens, and
string replacements (for example, HTML character references).

The Token Map imposes certain restrictions on the byte length of the lookup
tokens and their replacements, but is a highly-optimized data structure for
mappings with a very high number of tokens.

Developed in WordPress/wordpress-develop#5373
Discussed in https://core.trac.wordpress.org/ticket/60698

Fixes #60698.
Props: dmsnell, gziolo, jonsurrell, jorbin.

Built from https://develop.svn.wordpress.org/trunk@58188


git-svn-id: https://core.svn.wordpress.org/trunk@57651 1a063a9b-81f0-0310-95a4-ce76da25c4cd
@dmsnell
Copy link
Member Author

dmsnell commented May 23, 2024

Merged in [58188]
bcd25b1

@dmsnell dmsnell closed this May 23, 2024
@dmsnell dmsnell deleted the experiment/add-wp-token-set branch May 23, 2024 20:04

// Longer strings are less-than for comparison's sake.
if ( $length_a !== $length_b ) {
return $length_b - $length_a;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a perfect place for the spaceship operator comparison to return -1, 0, or 1 return $length_b <=> $length_a;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we sure about this?

we need to make sure that longer strings sort before shorter ones, so for example, ba appears before a

-1 === longest_first_then_alphabetical( 'ba', 'a' );
 1 === 'ba' <=> 'a';
 1 === strcmp( 'ba', 'a' );

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think this is correct. I meant specifically the length comparison could be return $length_b <=> $length_a;. The result is basically the same, but it returns -1, 0, or 1 instead of <0, 0, or >0.

demo
<?php

function longest_first_then_alphabetical( $a, $b ) {
	if ( $a === $b ) {
		return 0;
	}


	$length_a = strlen( $a );
	$length_b = strlen( $b );


	// Longer strings are less-than for comparison's sake.
	if ( $length_a !== $length_b ) {
		return $length_b - $length_a;
	}


	return strcmp( $a, $b );
}
	
function longest_first_then_alphabetical_spaceship( $a, $b ) {
	if ( $a === $b ) {
		return 0;
	}


	$length_a = strlen( $a );
	$length_b = strlen( $b );


	// Longer strings are less-than for comparison's sake.
	if ( $length_a !== $length_b ) {
		// 👇👇👇 This is the only different line 👇👇👇
		return $length_b <=> $length_a;
	}


	return strcmp( $a, $b );
}
	
	
var_dump(
    longest_first_then_alphabetical( 'baa', 'a'),
    longest_first_then_alphabetical_spaceship( 'baa', 'a'),
    "\n",

    longest_first_then_alphabetical( 'abb', 'a'),
    longest_first_then_alphabetical_spaceship( 'abb', 'a'),
    "\n",

    longest_first_then_alphabetical( 'a', 'abb'),
    longest_first_then_alphabetical_spaceship( 'a', 'abb'),
    "\n",

    longest_first_then_alphabetical( 'a', 'abb'),
    longest_first_then_alphabetical_spaceship( 'a', 'abb'),
    "\n",

    longest_first_then_alphabetical( 'a', 'baa'),
    longest_first_then_alphabetical_spaceship( 'a', 'baa'),
    "\n",

    longest_first_then_alphabetical( 'a', 'baa'),
    longest_first_then_alphabetical_spaceship( 'a', 'baa'),
    "\n",

    longest_first_then_alphabetical( 'a', 'a'),
    longest_first_then_alphabetical_spaceship( 'a', 'a'),
    "\n",

    longest_first_then_alphabetical( 'b', 'a'),
    longest_first_then_alphabetical_spaceship( 'b', 'a'),
    "\n",

    longest_first_then_alphabetical( 'a', 'b'),
    longest_first_then_alphabetical_spaceship( 'a', 'b'),
    "\n",

);

int(-2)
int(-1)
string(1) "
"
int(-2)
int(-1)
string(1) "
"
int(2)
int(1)
string(1) "
"
int(2)
int(1)
string(1) "
"
int(2)
int(1)
string(1) "
"
int(2)
int(1)
string(1) "
"
int(0)
int(0)
string(1) "
"
int(1)
int(1)
string(1) "
"
int(-1)
int(-1)
string(1) "
"


</details>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants