-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HTML-API: Introduce minimal HTML Processor. #4519
Conversation
To get a list of all tags mentioned in rules starting with function extractQuotedStrings(inputString) {
const regex = /"([^"]*)"/g;
let match;
let matches = [];
while ((match = regex.exec(inputString)) !== null) {
matches.push(match[1]);
}
return matches;
}
let startTagRules = Array.from(document.querySelectorAll('dt'))
.map(e=>e.innerText)
.filter(t=>t.includes('A start tag whose tag name is'))
.flatMap(extractQuotedStrings)
startTagRules = Array.from(new Set(startTagRules))
startTagRules.map(t=>t.toUpperCase()).sort().join(", ") It yields: A, ADDRESS, APPLET, AREA, ARTICLE, ASIDE, B, BASE, BASEFONT, BGSOUND, BIG, BLOCKQUOTE, BODY, BR, BUTTON, CAPTION, CENTER, CODE, COL, COLGROUP, COLOR, DD, DETAILS, DIALOG, DIR, DIV, DL, DT, EM, EMBED, FACE, FIELDSET, FIGCAPTION, FIGURE, FONT, FOOTER, FORM, FRAME, FRAMESET, H1, H2, H3, H4, H5, H6, HEAD, HEADER, HGROUP, HR, HTML, I, IFRAME, IMAGE, IMG, INPUT, KEYGEN, LI, LINK, LISTING, MAIN, MARQUEE, MATH, MENU, META, NAV, NOBR, NOEMBED, NOFRAMES, NOSCRIPT, OBJECT, OL, OPTGROUP, OPTION, P, PARAM, PLAINTEXT, PRE, RB, RP, RT, RTC, RUBY, S, SCRIPT, SEARCH, SECTION, SELECT, SIZE, SMALL, SOURCE, SPAN, STRIKE, STRONG, STYLE, SUB, SUMMARY, SUP, SVG, TABLE, TBODY, TD, TEMPLATE, TEXTAREA, TFOOT, TH, THEAD, TITLE, TR, TRACK, TT, U, UL, VAR, WBR, XMP I also did end tags for good measure: A, ADDRESS, APPLET, ARTICLE, ASIDE, B, BIG, BLOCKQUOTE, BODY, BR, BUTTON, CAPTION, CENTER, CODE, COL, COLGROUP, DD, DETAILS, DIALOG, DIR, DIV, DL, DT, EM, FIELDSET, FIGCAPTION, FIGURE, FONT, FOOTER, FORM, FRAMESET, H1, H2, H3, H4, H5, H6, HEAD, HEADER, HGROUP, HTML, I, LI, LISTING, MAIN, MARQUEE, MENU, NAV, NOBR, NOSCRIPT, OBJECT, OL, OPTGROUP, OPTION, P, PRE, S, SARCASM, SCRIPT, SEARCH, SECTION, SELECT, SMALL, STRIKE, STRONG, SUMMARY, TABLE, TBODY, TD, TEMPLATE, TFOOT, TH, THEAD, TR, TT, U, UL |
924961a
to
216bcfc
Compare
I've now explored public function set_inner_html( $html ) {
parent::set_bookmark( 'start' );
$end = $this->step_while_token_open( $this->current_token );
$this->lexical_updates[] = new WP_HTML_Text_Replacement(
$this->bookmarks['start']->end + ( $this->bookmarks['start']->start === $this->bookmarks['start']->end ? 0 : 1),
$this->bookmarks[ $end ]->start,
$html
);
}
public function step_while_token_open( $token ) {
if ( self::is_void( $token->node_name ) ) {
return $this->bookmark_tag();
}
$after_pop = $this->state->stack_of_open_elements->after_pop;
$removed_token = false;
$this->state->stack_of_open_elements->after_pop = function ( $bookmark_name ) use ( $after_pop, &$removed_token, $token ) {
$this->arc_dec( $bookmark_name );
if ( $bookmark_name === $token->bookmark_name ) {
$removed_token = true;
$this->state->stack_of_open_elements->after_pop = $after_pop;
}
};
while ( $this->step() ) {
if ( $removed_token ) {
if ( $this->get_tag() !== $token->node_name || ! $this->is_tag_closer() ) {
$name = $this->bookmark_tag();
$this->bookmarks[ $name ]->end = $this->bookmarks[ $name ]->start;
return $name;
}
return $this->bookmark_tag();
}
}
return null;
} After exploring this, however, I want to defer this functionality, primarily because of the way it interact with bookmarks and seeking. We might actually be fine with how it is, or we might want to impose some additional arbitrary limits to ensure safety, and so for now I'd rather just not have to focus on that. |
fd23c97
to
7d781a8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @dmsnell! I've left mostly documentation feedback in this review, as well as some I18N feedback, and a question regarding assertions not called directly within test methods.
src/wp-includes/html-api/class-wp-html-active-formatting-elements.php
Outdated
Show resolved
Hide resolved
src/wp-includes/html-api/class-wp-html-active-formatting-elements.php
Outdated
Show resolved
Hide resolved
* | ||
* @param WP_HTML_Token $token Look for this node in the stack. | ||
* @return bool Whether the referenced node is in the stack of open elements. | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* |
} | ||
|
||
/** | ||
* @ticket 58517 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Needs a test description 🔢 Applies elsewhere in the PR
- Needs an
@covers
annotation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same question from above: given that that test name itself forms the test description, and that the vast majority of existing tests rely on this, do we really need a duplicated description in every test?
private function ensure_support_is_added_everywhere( $tag_name ) { | ||
$p = WP_HTML_Processor::createFragment( "<$tag_name>" ); | ||
|
||
$this->assertFalse( $p->step(), "Must support terminating elements in specific scope check before adding support for the {$tag_name} element." ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hellofromtonya Correct me if wrong, but should assertions only be performed inside the test methods / closures within test methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this would be a good place to examine, as this is a fairly novel kind of test file. I have explored a few situations. the point here is to actively warn someone who wants to add additional support to the HTML parser class.
given that we have hundreds of tests that all do the same thing, I figured a minor abstraction here would be valuable to have, despite generally agreeing that test abstractions are things to be wary of.
if there's a way to ensure this won't be called outside of a test method that works; this was my intent on not naming it with the test_
prefix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alternatively we could create a new assertion, which is essentially what this does, so I will look into that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the more I look into this the more I'd rather special-case this from the linting rules than create a new test assertion. creating a new assertion is going to pull in more complexity and make things more convoluted than they already are to debug and understand. having this single function, while different than in other cases, at least is self-contained and straightforward.
so I'd rather we do whatever we need to do to make sure this doesn't get abstracted into something much greater in scope than is required and allow this unique class to be unique.
more important than enforcing this styling rule, I'd love some feedback on automating the support rules, to ensure that nobody unintentionally partially adds support for HTML tags that require additional support. for example, we don't want someone to simply add a new BUTTON
clause to the big switch statement in the html processor. if they do that, they will subvert other rules that have to be added to the adoption agency algorithm.
tests/phpunit/tests/html-api/wpHtmlSupportRequiredOpenElements.php
Outdated
Show resolved
Hide resolved
*/ | ||
public function push( $token ) { | ||
/* | ||
* > If there are already three elements in the list of active formatting elements after the last marker, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noticed the * >
format through a lot of the inline comments in the PR. This commenting style only appears to exist within the HTML API, and should likely be removed for consistency with the rest of Core.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are quotes from the HTML specification and are important for associating the logic with the spec document, for being able to quickly find where in the spec this particular code is providing its functionality.
How should docblocks represent quotes? I will update as long as we can retain the clarity that they are quotes and as long as we don't reword them, since that would break the ability to link the spec with the code.
Of note, not every part of the spec we reference has an associated link, unfortunately, so these are incredibly valuable for quickly cross-referencing between the two sides.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I understand now. In that case, I'd suggest we either keep the >
, or remove them and start the quote with As stated by the HTML specification:
. I don't have a preference there, so if you want to keep it as-is, that works for me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm open to continue exploring ways to present these, but I think the >
is succinct and helpful. Given how often we want to quote the spec I think As stated by the HTML specification:
could get a bit distracting.
Where possible we have a link to the spec. Unfortunately in some of these places there are large sections that we split up in the logic, where each subsection doesn't have a unique URL.
Gutenberg mini-documents instead of Markdown docblocks? Could eliminate all the hassle over ordering and styles.
public function remove_node( $token ) { | ||
foreach ( $this->walk_up() as $position => $item ) { | ||
if ( $token->bookmark_name === $item->bookmark_name ) { | ||
array_splice( $this->stack, $this->count() - $position - 1, 1 ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Recurring) nit -- could do negative offset instead of subtraction:
array_splice( $this->stack, $this->count() - $position - 1, 1 ); | |
array_splice( $this->stack, - $position - 1, 1 ); |
Not sure it's that much nicer 🤷♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm. maybe if others chime in. I'm confused more by negative positions, particularly when they also include math. I with the function would reject negative offsets so that subtraction and negative offsets couldn't overlap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a negative offset with - $position - 1
is harder to read. What about 1 - $position
? It would avoid the call to $this->count()
, and while it's personally easier for me to read, I'm not sure if it would be as clear to others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 - $position
is even less clear to me. I'm not worried about $this->count()
as a call. although in this case it's a function call, we could decide if we measure that it's a problem to switch it to count( $this->stack )
which is essentially the same as a normal variable lookup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've rearranged this a bit to try and help. the loop index is now called $position_from_end
to better indicate that it's not the normal array index. we're also using a helper variable $position_from_start
to give a better name to the computation.
public function has_element_in_specific_scope( $tag_name, $termination_list ) { | ||
foreach ( $this->walk_up() as $node ) { | ||
if ( $node->node_name === $tag_name ) { | ||
return true; | ||
} | ||
} | ||
|
||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the $termination_list
arg for? Are we missing some code (or a TODO note)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might need to copy a comment from another function. it's unsupported by in the spec, and will eventually be filled out. thanks for noticing.
/** | ||
* Returns whether a particular element is in BUTTON scope. | ||
* | ||
* @since 6.4.0 | ||
* | ||
* @see https://html.spec.whatwg.org/#has-an-element-in-button-scope | ||
* | ||
* @param string $tag_name Name of tag to check. | ||
* @return bool Whether given element is in scope. | ||
*/ | ||
public function has_element_in_button_scope( $tag_name ) { | ||
return $this->has_element_in_specific_scope( | ||
$tag_name, | ||
array( | ||
// No termination elements are currently supported. | ||
) | ||
); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I've totally grasped this, but shouldn't this be something like
/** | |
* Returns whether a particular element is in BUTTON scope. | |
* | |
* @since 6.4.0 | |
* | |
* @see https://html.spec.whatwg.org/#has-an-element-in-button-scope | |
* | |
* @param string $tag_name Name of tag to check. | |
* @return bool Whether given element is in scope. | |
*/ | |
public function has_element_in_button_scope( $tag_name ) { | |
return $this->has_element_in_specific_scope( | |
$tag_name, | |
array( | |
// No termination elements are currently supported. | |
) | |
); | |
} | |
/** | |
* Returns whether a particular element is in BUTTON scope. | |
* | |
* @since 6.4.0 | |
* | |
* @see https://html.spec.whatwg.org/#has-an-element-in-button-scope | |
* | |
* @return bool Whether given element is in BUTTON scope. | |
*/ | |
public function has_element_in_button_scope() { | |
return $this->has_element_in_specific_scope( | |
'BUTTON', | |
array( | |
// No termination elements are currently supported. | |
) | |
); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good guess; it's confusing. We want to check if a given element, for example, a P
, is in BUTTON
scope. The BUTTON
part is implied in the linked spec to define the set of termination elements, so BUTTON
would actually end up in the second parameter, the array.
As it stands, it's not possible for the processor to encounter this because if it encounters a BUTTON
it currently bails.
This odd tension between wanting something that will eventually support the full spec and something that can be reviewed leads to these oddities, and to things like the tests that ensure nobody adds BUTTON
support without adjusting the corresponding functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think it makes sense to add a PHPDoc @TODO
or something of that sort, or would that also become unwieldy quickly due to the nature of these classes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is one of the things I've been trying to explore here. people don't seem to like @TODO
(I personally find them useful in cases like this), I have tried ### HTML Support
segments in docblocks, but those also don't seem great. this is just a small note attempting to indicate why the thing isn't wrong even though it appears wrong.
as you are pointing out, the wording is lacking, but that's a different issue IMO, something I will clean up
8be04d3
to
8e418d3
Compare
@costdev thanks for your thorough code style scrutiny. I appreciate it! given the state of this PR right now, I would also be extremely grateful for higher-level feedback about the design and implementation and overall structure given that lots of code might change between now and merge. I've done my best to try and get ahead of having a hundred comments about code style, but obviously I missed a lot of cases. still, all the comments about small styling differences can make it harder to review the parts that matter with respect to function, performance, security, design. would it help next time if I add something in the description to say this isn't ready for meticulous styling scrutiny? what would be a good way to communicate this? |
@dmsnell No problem and I totally understand! When folks are already weighing in on the design/implementation side of things in a PR, I tend to look for the rest so that the PR gets good coverage in reviews, but I get that when there's a lot of iteration anticipated, this can make things hard to follow. As we don't have three stages of PR (such as "Draft", "Ready for technical review" and "Ready for additional review"), it may be worth keeping a PR in "Draft" if a lot of iteration is anticipated prior to merge. That way, design discussions and such can take priority until the PR is marked "Ready for Review", then final reviews (including code style) can be carried out when it's marked "Ready for Review", so nothing gets missed before commit. What do you think? |
Maybe something to iterate on. I don't anticipate a lot of iteration because I have been careful and we went through a lot of internal review already, but if someone were to point out something at this time it could change things is more what I meant. I've added a note at the top. Thanks again for your thoroughness; it was actually that attention to detail with the Tag Processor that led most of the functions here to have the appropriate docblock ordering already. |
a03bd17
to
5ad776e
Compare
4c43bc2
to
a7a058e
Compare
a7a058e
to
b44869f
Compare
return false; | ||
} | ||
|
||
if ( ! parent::set_bookmark( ++$this->bookmark_counter ) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably doesn't matter much, but to aid debugging, we could use a naming convention that includes the tag name? Something like
$tag_name = $this->get_tag;
$i = 0;
do {
$bookmark_name = $tag_name . '-' . $i++;
} while ( $this->has_bookmark( $bookmark_name ) );
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we could add the tag name, but there's a tradeoff: when I've been debugging the number name is clear and terse while longer names take up more space, and if you have a slide of 20 DIVs the tag name isn't that helpful.
having a number was surprisingly helpful because the index corresponds to the tag in the sequence they appear.
what is your "oops" comment and link? is it the recommendation against programmatic names? that's not to warn against creating tag names in the bookmark name, but to warn against the very thing we're doing anyway with numbers.
I've taken the stance here that in this very controlled environment there's not a practical way to avoid doing that if we want to parse HTML without creating a second bookmarking system (which was explored). it's more like, "avoid doing this because it carries these risks. if you are aware of the risks and know what you are doing, fine."
case 'IMAGE': | ||
/* | ||
* > A start tag whose tag name is "image" | ||
* > Change the token's tag name to "img" and reprocess it. (Don't ask.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤐
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A thing to keep in mind when coming across these is that they are direct quotes from the HTML spec. There are some good ones (look for An end tag whose tag name is "sarcasm")
$this->current_token = new WP_HTML_Token( | ||
$this->bookmark_tag(), | ||
$this->get_tag(), | ||
$this->is_tag_closer(), | ||
$this->release_internal_bookmark | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$this->current_token
seems to be internal state that's only ever set here (in step()
) for subsequent usage in step_in_body()
(which is private
, and only called from here); plus it is read in seek()
to determine direction.
Any chance we can avoid this? Getting all the bookkeeping right for $lexical_updates
and $classname_updates
in the Tag Processor was surprisingly complex (even though the concepts seemed deceptively simple), so I'd love if we could avoid introducing this piece of internal state here (and make more explicit who is going to be the consumer of current_token
). For step_in_body()
(and its prospective friends for other contexts), it seems easy enough to pass $current_token
as an arg. Can we also find some sort of workaround for seek()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a tough one, and I'm not sure I entirely follow the association with $lexical_updates
and $classname_updates
I've always had a version of the current token because so much of the parsing algorithm depends on properties of the current element/node/token. it has made sense to make it globally available within the processor.
of note, it only gets set in the one place that all scanning takes place (everything flows through $this->step()
) and ultimately it's just a bookmark.
I'd like to better understand the ways this could get out of sync. maybe it does when seeking backwards? I will try and think up a test to confirm that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in f67e051 I added a new test which I actually though would break, but it seems to be fine. I don't have solid confidence that it's not still possible to break though.
$this->assertTrue( $p->get_attribute( 'supported' ), 'Did not find required supported element.' ); | ||
$this->assertFalse( $p->step(), "Didn't properly reject unsupported markup: {$description}" ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Smart 👍
'Simple IMG tag' => array( '<img target>', array( 'HTML', 'BODY', 'IMG' ), 1 ), | ||
'Two sibling IMG tags' => array( '<img><img target>', array( 'HTML', 'BODY', 'IMG' ), 2 ), | ||
'Three sibling IMG tags, an IMAGE in last place' => array( '<img><img><image target>', array( 'HTML', 'BODY', 'IMG' ), 3 ), | ||
'IMG inside a DIV' => array( '<div><img target></div>', array( 'HTML', 'BODY', 'DIV', 'IMG' ), 1 ), | ||
'DIV inside a DIV' => array( '<div><div target></div>', array( 'HTML', 'BODY', 'DIV', 'DIV' ), 1 ), | ||
'IMG inside many DIVS' => array( '<div><div><div><div><img target></div></div></div></div>', array( 'HTML', 'BODY', 'DIV', 'DIV', 'DIV', 'DIV', 'IMG' ), 1 ), | ||
'DIV inside DIV after IMG' => array( '<div><img><div target></div></div>', array( 'HTML', 'BODY', 'DIV', 'DIV' ), 1 ), | ||
'IMG after DIV' => array( '<div></div><img target>', array( 'HTML', 'BODY', 'IMG' ), 1 ), | ||
'IMG after two DIVs' => array( '<div></div><div></div><img target>', array( 'HTML', 'BODY', 'IMG' ), 1 ), | ||
'IMG after two DIVs with nesting' => array( '<div><div><img></div></div><div></div><img target>', array( 'HTML', 'BODY', 'IMG' ), 1 ), | ||
'IMG after invalid DIV closer' => array( '</div><img target>', array( 'HTML', 'BODY', 'IMG' ), 1 ), | ||
'EM inside DIV' => array( '<div>The weather is <em target>beautiful</em>.</div>', array( 'HTML', 'BODY', 'DIV', 'EM' ), 1 ), | ||
'EM after closed EM' => array( '<em></em><em target></em>', array( 'HTML', 'BODY', 'EM' ), 2 ), | ||
'EM after closed EMs' => array( '<em></em><em><em></em></em><em></em><em></em><em target></em>', array( 'HTML', 'BODY', 'EM' ), 6 ), | ||
'EM after unclosed EM' => array( '<em><em target></em>', array( 'HTML', 'BODY', 'EM', 'EM' ), 1 ), | ||
'EM after unclosed EM after DIV' => array( '<em><div><em target>', array( 'HTML', 'BODY', 'EM', 'DIV', 'EM' ), 1 ), | ||
// This should work for all formatting elements, but if two work, the others probably do too. | ||
'CODE after unclosed CODE after DIV' => array( '<code><div><code target>', array( 'HTML', 'BODY', 'CODE', 'DIV', 'CODE' ), 1 ), | ||
'P after unclosed P' => array( '<p><p target>', array( 'HTML', 'BODY', 'P' ), 2 ), | ||
'Unclosed EM inside P after unclosed P' => array( '<em><p><p><em target>', array( 'HTML', 'BODY', 'EM', 'P', 'EM' ), 1 ), | ||
'P after closed P' => array( '<p><i>something</i></p><p target>This one</p>', array( 'HTML', 'BODY', 'P' ), 2 ), | ||
'A after unclosed A' => array( '<a><a target>', array( 'HTML', 'BODY', 'A' ), 2 ), | ||
'A after unclosed A, after a P' => array( '<p><a><a target>', array( 'HTML', 'BODY', 'P', 'A' ), 2 ), | ||
'Large HTML document with deep P' => array( | ||
'<div><div><div><div><div><div><div><div><p></p><p></p><p><div><strong><em><code></code></em></strong></div></p></div></div></div></div></div></div></div></div><div><div><div><div><div><div><div><div><p></p><p></p><p><div><strong><em><code target></code></em></strong></div></p></div></div></div></div></div></div></div></div>', | ||
array( 'HTML', 'BODY', 'DIV', 'DIV', 'DIV', 'DIV', 'DIV', 'DIV', 'DIV', 'DIV', 'DIV', 'STRONG', 'EM', 'CODE' ), | ||
2, | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤯
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this deserves a comment. I intentionally want to push in the tests situations that are deeper than specified in some rules relating to adoption and reconstruction, even if they don't relate specifically to each tag in the test.
with all the tests otherwise being small I figured we might miss certain things on a real document if we don't push the stack depth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a comment in 639a01f
tests/phpunit/tests/html-api/wpHtmlSupportRequiredOpenElements.php
Outdated
Show resolved
Hide resolved
tests/phpunit/tests/html-api/wpHtmlSupportRequiredOpenElements.php
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left a number of comments, but nothing major.
This is an impressive feat. The inline documentation and comments help a great deal, but I will say that it's still not trivial to grasp the underlying logic always -- which is probably natural due to the inherent complexity of the spec (that I'm still not that intimately familiar with). Also nice job on the naming of helper methods (reconstruct_active_formatting_elements
, run_adoption_agency_algorithm
, etc), and on the code structure of step_in_body
which uses all those and is quite straight-forward to read.
Short of attempting to fully understand algorithms such as Adoption Agency and how it's implemented, it's the impressive test coverage that gives me the confidence to give this my 👍 .
As is reflected in some of my earlier comments, I've struggled a bit with understanding some of the methods that are still in stub state -- possibly since I started by reviewing the helper classes first -- notably e.g. has_element_in_button_scope
, whose implementation currently makes no reference to button
at all 😅
I'm happy to give the benefit of the doubt that they will make more sense once the handling of the relevant element is implemented, and there's plenty of documentation in the HTML Processor class and test, but I wonder if a cursory reader of e.g. the Open Elements class might feel similarly puzzled as I did. I'm not sure I have a perfect solution to this -- I'd maybe just suggest even more inline docs/comments/@todo
s/@see
s.
Anyway: LGTM 👍
fe81fae
to
63ea501
Compare
@ockham I think your comments are all valid. How about we get this in Thanks for all your help with this. |
63ea501
to
b6a517b
Compare
Committed to Core |
Hey, feel free to ignore it, and I'm sure this is not the bottleneck, but wouldn't it be preferable from a performance point of view to order the tags by how common they are? return (
'ADDRESS' === $tag_name ||
'APPLET' === $tag_name ||
'AREA' === $tag_name ||
'ARTICLE' === $tag_name ||
// ... switch ( $op ) {
case '+BLOCKQUOTE':
case '+DIV':
case '+FIGCAPTION':
case '+FIGURE':
case '+P':
// ... Maybe also store those tag name strings in variables? (I'm not so sure if that one makes any difference). |
I don't know. Once this is spec-complete I'd like to run some benchmarks to better understand. A
PHP interns string literals so there's no benefit to storing them in variables. Originally I had a more performant solution here, which involved a I'm still not sure how fast or slow this is because it's hard to create a meaningful test document that's fully supported. My guess is that this is marginally slower than the Tag Processor. |
Trac ticket: 58517-trac
Prior work:
Html api/add html processor dmsnell/wordpress-develop#6
Wp html processor text based adamziel/wordpress-develop#2
WIP: Introduce class for sourcing block attributes from HTML gutenberg#46345
For review, please focus on significant matters about the design and implementation of this new subsystem. Many things could change between now and merge. Here are some things that could use your insightful feedback:
Status
private
because the tag processor constructor isn't private_doing_it_wrong()
to warn peopleset_inner_html()
orset_outer_html()
for release because it introduces new problems (mostly around seeking and bookmarking)breadcrumbs
to find a tag, basically an extension to the Tag Processor's query language.Could we finally create an internal bookmark class, that when it disappears from scope can release the bookmark?If we do this, I think we would need to mirror the bookmarking system, but it would result in natural garbage collection, or would we have to hook in to the bookmarking so that the outer class calls into the bookmarks.WP_HTML_Token
class and release the bookmark in the destructor.after_push
andafter_pop
, as they will be valuable for turning nested iteration into a "free" flaglast_error
property to determine why a failed run failed.Lines of code counts as of 76c4ab8: