WIP: Introduce class for sourcing block attributes from HTML #46345

dmsnell · 2022-12-07T04:30:58Z

What?

Part of #44410.

Makes it possible to source some block attributes from a chunk of HTML given the attribute definitions (as passed in registerBlock())

Why?

Who doesn't want sourced block attributes from the PHP side?

How?

Uses the WP_HTML_Tag_Processor and some basic assumptions about HTML structure to source attributes iff:

it can confidently parse the CSS selector in the attribute definition
it can construct a query to get that information out of the html

adamziel · 2022-12-07T10:36:43Z

lib/experimental/html/class-wp-html-attribute-sourcer.php

+	}
+
+	public static function parse_selector( $s, $at = 0 ) {
+		$selectors = explode( ',', $s );


I'd treat the separator more carefully in case this gets applied to a selector like a[href*=","]

Also, the explode will be executed every time this function runs, including the call in line 254 below where we already expect to deal with a single selector. only.

yep, thanks. still very rough, and in fact I want to move a lot of code around here, but I wanted to get some basic structure together to see how to build this

I've moved support for this into the main parser function and am aborting on unsupported attribute queries, so we shouldn't have this bug anymore

adamziel · 2022-12-07T10:43:35Z

lib/experimental/html/class-wp-html-attribute-sourcer.php

+		}
+
+		// @TODO: Hashes don't have to start with `nmstart` so this might reject valid hash names.
+		$identifier = self::parse_css_identifier( $s, $at );


Looks like it only consumes the first css identifier at the moment, like figure in figure video.

looks right 😉

though to be fair if you look up at line 275 above it, it's acknowledged that we don't support the descendant selector yet

adamziel · 2022-12-07T10:54:56Z

lib/experimental/html/class-wp-html-attribute-sourcer.php

+			while ( $tags->next_tag() ) {
+				foreach ( $selector as $s ) {
+					if ( 'element' === $s['type'] && $tags->get_tag() === strtoupper( $s['identifier'] ) ) {
+						return $tags;
+					}
+
+					// @TODO: $tags->has_class() would be _really_ handy here.
+					if ( 'class' === $s['type'] && preg_match( "~\b{$s['identifier']}\b~", $tags->get_attribute( 'class' ) ) ) {
+						return $tags;
+					}
+
+					if ( 'hash' === $s['type'] && $s['identifier'] === $tags->get_attribute( 'id' ) ) {
+						return $tags;
+					}
+				}
+			}


This effectively reimplements $tags->next_tag. I'm tempted to say it could be included in the next_tag API as follows:

$criteria = []; foreach ( $selectors as $s ) { $criterion = []; if ( 'element' === $s['type'] ) { $criteria[] = [ 'tag_name' => strtoupper( $s['identifier'] ) ]; } else if ( 'class' === $s['type'] ) { $criteria[] = [ 'class_name' => $s['identifier'] ]; } else if ( 'hash' === $s['type'] ) { $criteria[] = [ 'id' => $s['identifier'] ]; } } $tags->next_tag( [ 'or' => $criteria ] );

However, I can see how this may need more CSS-specific matchers that don't make sense in the core API.

At the same time, this method has duplicate logic for a special case when only one selector is received. I'd do this instead:

public static function select( $selector, $html ) { $tags = new WP_HTML_Naive_Processor( $html ); if ( ! array_is_list( $selector ) ) { $selector = [ $selector ]; } while ( $tags->next_tag() ) { foreach ( $selector as $s ) { // ... } } return null; }

on one note, it's not duplicate logic, as in the case of the individual selector we can tell the tag processor to jump to the next tag you specify.

on another note, if you squint…

there's something very potentially powerful here that could be moved into the tag processor. spent some time today reading the CSS selector spec…

for another reason I've actually collapsed it to your suggestion; getting has_attribute support (such as with a[download]) was going to start getting cumbersome, so now it's just one big loop.

adamziel · 2022-12-14T09:59:20Z

lib/experimental/html/class-wp-html-attribute-sourcer.php

+					}
+					list( $inner_selector, $next_at ) = $inner;
+					$inner_selector['combinator'] = $combinator;
+					$selector['then'] = $inner_selector;


Is the output like a linked list of selectors?

that's right. we need to represent selectors in some way that captures the recursive or monadic behavior of the combinators. we have to use the first segment, or simple selector sequence, to find a given location in the document, and then from that starting point and in whatever relation to it is appropriate given the combinator, find the next one.

we could probably do this as a list, but I already had set the top level as a list to represent the comma-separated selectors_group, which is special because the OR operation here only exists at the top level.

recursing seemed easiest since I could presumably pass a single object to the query generator and it would contain the next simple selector sequence in the chain, or nothing at all.

haven't built that part of the query generator yet though.

I was thinking about maybe a list of lists, as in

[ [$segment1, $segment2, $segment3], [$segment1, $segment2, $segment3] ]

We'll still need to use nesting, though, for things like :not(span, a), so maybe better to embrace it right away.

Not yet sold on supporting :not() at this stage.
The list works, but the nesting is convenient because we don't have to track where in the list we are.

What do you see as the benefit of the flat list?

adamziel · 2022-12-14T10:00:38Z

lib/experimental/html/class-wp-html-attribute-sourcer.php

+	 *     nonascii     [\240-\377]
+	 *     escape       {unicode}|\\[^\r\n\f0-9a-f]
+	 *     unicode      \\{h}{1,6}(\r\n|[ \t\r\n\f])?
+	 *     h            [0-9a-f]


h -> hex? For readability – I got confused for a moment

this is just a direct copy from the spec. yes, h is there for a hex digit

adamziel · 2022-12-14T10:02:53Z

lib/experimental/html/class-wp-html-processor.php

+ * not, that `<p>` itself is a special element.
+ */
+
+class WP_HTML_Processor extends WP_HTML_Tag_Processor {


This also works with tags, it's even in the method names – I'd be slightly confused to see two classes with such similar names doing similar things. If there isn't anything Unsafe here, should we just bring these methods to the parent class?

it's not unsafe right now because the functions have been renamed to imply they are making assumptions (and those names might change)

but there's still a difference between this and the tag processor that I wrestle with. this is "safe" in that it will preserve the HTML syntax by relying on the tag processor. this is "unsafe" in that it allows one to break the semantic of the HTML by performing operations that would require multiple steps in a DOM, such as removing a range of HTML tags or tag openers or tag closers. it's also "unsafe" in that even though it preserves the syntax, it may not do what you expect it to, even in the renamed functions.

<p>Before<p>Inside</p>

For this input there's nothing that will trap the first paragraph and any balanced tag operation will fail after reaching the end of input. Hopefully that will change in time, and we'd properly find <p>Before as one outer content and <p>Inside</p> as another, but even then our performance domain has shifted greatly from the tag processor.

I'm thinking that as this class develops people may not generally call the tag processor and it will remain as a low-level library. That's why I renamed it to WP_HTML_Processor, because I think this could end up as the "entry point" interface for working with HTML. Sistering this is the as-of-yet-still-un-renamed-to WP_PHPQ (for lack of a better name) which might go inside here or still might live separately as a class for turning a CSS selector into an HTML query. The HTML class could expose that query instead of our next_tag() interface.

$p = new WP_HTML_Processor( $html ); while ( $p->find( 'section + h2 > a[href^="#"]' ) ) { $title = $p->get_inner_text(); }

It's still not a DOM, we can't call ->parent_node() or walk backwards (even though we can manually do this with bookmarking), but we can jump to specific places in the document, mangle the attributes, and mangle the content inside it.

it's also "unsafe" in that even though it preserves the syntax, it may not do what you expect it to, even in the renamed functions.

And then it allows you to find an element by a class name, even if it's table, and then insert a div inside it – which the browsers can handle, but is likely not what you want.

That's why I renamed it to WP_HTML_Processor, because I think this could end up as the "entry point" interface for working with HTML.

Oh I see, with this insight I'd probably rename the Tag_Processor to something else just eo avoid the confusion – but at this point I don't think it's worth it. 🤷

The HTML class could expose that query instead of our next_tag() interface.

I like that!

I'd probably rename the Tag_Processor

I don't know about this. I think based on how careful we were with naming that it will retain the meaning we gave it.

What do you find contradictory about that class, if we introduce an HTML processor?

It's not as much that I see a contradiction, as that I worry about confusing everyone with two very similarly named classes (in a similar way as folks get confused about slice vs splice in JS)

let's keep thinking about it then. I want to keep the classes separate because of the different performance implications, but I don't know how to address what you're suggesting (apart from Unsafe which I'm currently skeptical of using)

ockham · 2022-12-14T17:04:15Z

👋 I'm now using get_content_inside_balanced_tags over at WordPress/block-interactivity-experiments#118 (comment), and it's working great!

One thing that caught me a bit by surprise is that get_content_inside_balanced_tags moved the ... pointer thingy? past the closing tag. Could we maybe consider resetting it to where it started? I'm currently doing that manually, but it's a bit verbose -- especially given that get_content_inside_balanced_tags keeps track of the start and end internally (so it should be fairly easy to reset after getting the content).

ockham · 2022-12-14T17:12:12Z

Given that WP_HTML_Processor's get_content_inside_balanced_tags is pretty handy in its own right, whereas the CSS selector logic might need a bit more work -- would it make sense to file a separate PR just for WP_HTML_Processor (and its tests)?

I thought that this might also unblock work on a set_content_inside_balanced_tags method, which we'll need for the Block Interactivity SSR stuff (see WordPress/block-interactivity-experiments#118) 😊

(I can file that PR if you're okay with this plan @dmsnell @adamziel)

dmsnell · 2022-12-14T19:32:14Z

(I can file that PR if you're okay with this plan @dmsnell @adamziel)

@ockham consider everything in this PR tentative at best and more-than-likely to change. I don't even know if we'll end up with `get_content_inside_balanced_tags() or have something else instead.

Feel free to write your own PRs and explore different approaches and share here. This PR is more like my own notes as I work through the problem and try to understand where things belong and what the different parts of the system are.

ockham · 2022-12-15T08:47:21Z

lib/experimental/html/class-wp-html-processor.php

@@ -87,6 +87,7 @@ public function get_content_inside_balanced_tags() {
 		}

 		$content = $this->content_inside_bookmarks( $start_name, $end_name );
+		$this->seek( $start_name );


adamziel · 2022-12-16T15:40:08Z

lib/experimental/html/class-wp-html-attribute-sourcer.php

+			$tags->seek( 'start' );
+			$max = 100;
+			while ( --$max > 0 ) {
+				$next = $s;


Here's an idea – take a selector like #main > .wrapper a span.

This class is now going top to bottom and matching #main, then the immediate descendant .wrapper, then a descendant a and then a descendant span.

What if it went bottom to top? So we start by looking for the first span in the document and then investigate the parents stack to see if any matches a. Now we don't need to jump around the document to decide if it's a match.

There's still the issue of ~ and + combinators, but we could keep track of a tree of relevant siblings. For example in a HTML like

<div id="main"> <div class="wrapper"> <a> <div></div> </a> <a> <span></span> <span></span> <--- the cursor is here </a> </div> </div>

We'd only store the following data structure (without the children of the initial a)

div#main div.wrapper a a span span

If we want to support matching arbitrary attributes beyond ids and classes, we could keep a bookmark for each parent and backtrack if needed.

That's what the browsers do, by the way.

Yeah this seems worth a try. If we aren't going to fully parse the HTML then I don't see the same value add because quickly eliminating potential matches would happen before diving deep into the structure (and maybe that can still happen). If on the other hand, we're going to do a full document parse (and carry the weight of all that implies) then this method seems approachable.

The one major difference I see at this point is that if we're looking for the inner-most element first then we have to run $tags->matches() on every node and run a full check on every node. In the approach I've started taking we prune out most nodes out the door and are able to iterate with less careful checking, because at that point we're mostly just checking tag name and whether it's an opener or closer. In practice it may not matter, but I did want to point out something different from our HTML-based approach vs. what might be happening in the DOM land in the browser.

… selector

…ted selector

Tests are broken because the attribute sourcer needs rework given this different approach for navigating tags.

For sibling navigation we have to continue processing outside of balanced_next().

github-actions · 2023-01-06T21:54:43Z

Flaky tests detected in 6103139.
Some tests passed with failed attempts. The failures may not be related to this commit but are still reported for visibility. See the documentation for more information.

🔍 Workflow run URL: https://github.com/WordPress/gutenberg/actions/runs/3858581592
📝 Reported issues:

[Flaky Test] results in two remaining blocks and positions the caret at the end of the second block #39805 in specs/editor/various/block-deletion.test.js

dmsnell · 2023-01-09T21:34:19Z

All the test cases that I thought up are passing as of 79c24bd, but if you pressed me to explain how or why I wouldn't be able to say 🙃

Current thought is to continue exploring cases and see how this now interacts with other tag processor operations and modifications, see how deeply we can nest. See about how this does when given example Core block output.

Please for all who review this consider it a draft that's likely to be unlike a final version. Would like to convert the iteration into recursion and remove the goto statements, would like to extract the WP_HTML_Processor() outside the select() function and allow for passing it around through recursion but also for prepping it on an input document.

sirreal · 2024-12-10T14:39:32Z

See WordPress/wordpress-develop#7857 for selector interfaces for the HTML API.

adamziel reviewed Dec 7, 2022

View reviewed changes

dmsnell force-pushed the add/attribute-sourcer branch from 14a5626 to d81f912 Compare December 9, 2022 22:22

zteil mentioned this pull request Dec 10, 2022

Site Editor & Templates Roadmap #41241

Closed

43 tasks

ockham mentioned this pull request Dec 13, 2022

Experiment with WP_HTML_Tag_Processor WordPress/block-interactivity-experiments#118

Closed

dmsnell force-pushed the add/attribute-sourcer branch from ab3591a to 89a3b32 Compare December 13, 2022 23:45

adamziel reviewed Dec 14, 2022

View reviewed changes

ockham reviewed Dec 15, 2022

View reviewed changes

adamziel reviewed Dec 16, 2022

View reviewed changes

dmsnell force-pushed the add/attribute-sourcer branch from 9e752e5 to 01d31ac Compare December 21, 2022 22:44

ockham mentioned this pull request Dec 22, 2022

SSR: Experiment with WP_HTML_Tag_Processor WordPress/block-interactivity-experiments#125

Closed

dmsnell added 13 commits January 6, 2023 13:17

WIP: Introduce class for sourcing block attributes from HTML

ee934c3

Rearrange some functions and add support for presence of attribute in…

5396d3b

… selector

Add support for child and descendant selection within the HTML processor

ac30fa4

Add additional test for child selection

fdbb658

Add basic comment header to file

97629ed

Add more complete CSS selector parsing, add failing test case for nes…

ebe3afa

…ted selector

Restore tag processor pointer when getting content

3c63e24

Make things worse, but lead towards making them better; GOTO

3f7c419

Scribble notes

02e3e5c

Revise HTML navigation to use balanced_next with a depth tracker

ac7861d

Tests are broken because the attribute sourcer needs rework given this different approach for navigating tags.

Remove bail depth since we can't track a negative-sized stack

4050282

For sibling navigation we have to continue processing outside of balanced_next().

Exit the enclosing tag in balanced_next, add tests

c51a66e

Getting closer

89bb635

dmsnell added 5 commits January 6, 2023 13:17

more tests

e5cfd30

Closer still

eb1367d

Passing tests!

30172ab

Remove first draft select function

f415bdb

More test cases, a couple new failures

6103139

dmsnell force-pushed the add/attribute-sourcer branch from 679dce7 to 6103139 Compare January 6, 2023 21:17

Only traverse children of non-closing or non-void elements.

79c24bd

ockham mentioned this pull request Jan 10, 2023

WP_HTML_Processor: Add set_content_inside_balanced_tags #47036

Closed

2 tasks

This was referenced Jan 26, 2023

SSR: Add wp-show attribute directive processor WordPress/block-interactivity-experiments#141

Closed

Tracking Issue: Server-side rendering of directives WordPress/block-interactivity-experiments#120

Closed

ockham added a commit that referenced this pull request Jan 30, 2023

HTML Processor: Add from #46345

e6dc26d

ockham mentioned this pull request Jan 30, 2023

Add WP_HTML_Processor #47573

Closed

4 tasks

gziolo added the [Feature] Block API API that allows to express the block paradigm. label Jan 31, 2023

This was referenced Feb 10, 2023

WP_HTML_Tag_Processor overview issue #44410

Closed

Block API #41236

Open

gziolo mentioned this pull request Mar 29, 2023

Blocks: Optional support that optimizes handling blocks for usage as structured data #49437

Open

gziolo added [Feature] HTML API An API for updating HTML attributes in markup [Type] Technical Prototype Offers a technical exploration into an idea as an example of what's possible labels Apr 18, 2023

dmsnell mentioned this pull request Jun 17, 2023

HTML-API: Introduce minimal HTML Processor. WordPress/wordpress-develop#4519

Closed

7 tasks

dmsnell mentioned this pull request Apr 5, 2024

HTML API: Return elements pushed and popped rather than tags read. WordPress/wordpress-develop#6348

Closed

2 tasks

dmsnell mentioned this pull request Apr 13, 2024

HTML API: Add attribute sourcing WordPress/wordpress-develop#6388

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: Introduce class for sourcing block attributes from HTML #46345

WIP: Introduce class for sourcing block attributes from HTML #46345

dmsnell commented Dec 7, 2022 •

edited

Loading

adamziel Dec 7, 2022

adamziel Dec 7, 2022

dmsnell Dec 7, 2022

dmsnell Dec 10, 2022

adamziel Dec 7, 2022

dmsnell Dec 7, 2022

adamziel Dec 7, 2022

dmsnell Dec 7, 2022

dmsnell Dec 10, 2022

adamziel Dec 14, 2022

dmsnell Dec 14, 2022

adamziel Dec 16, 2022 •

edited

Loading

dmsnell Dec 16, 2022

adamziel Dec 14, 2022

dmsnell Dec 14, 2022

adamziel Dec 14, 2022 •

edited

Loading

dmsnell Dec 14, 2022

adamziel Dec 16, 2022

dmsnell Dec 16, 2022

adamziel Dec 19, 2022

dmsnell Dec 22, 2022

ockham commented Dec 14, 2022

ockham commented Dec 14, 2022

dmsnell commented Dec 14, 2022

ockham Dec 15, 2022

adamziel Dec 16, 2022 •

edited

Loading

adamziel Dec 16, 2022

dmsnell Dec 16, 2022

github-actions bot commented Jan 6, 2023

dmsnell commented Jan 9, 2023

sirreal commented Dec 10, 2024

WIP: Introduce class for sourcing block attributes from HTML #46345

Are you sure you want to change the base?

WIP: Introduce class for sourcing block attributes from HTML #46345

Conversation

dmsnell commented Dec 7, 2022 • edited Loading

What?

Why?

How?

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adamziel Dec 16, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

adamziel Dec 14, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ockham commented Dec 14, 2022

ockham commented Dec 14, 2022

dmsnell commented Dec 14, 2022

Choose a reason for hiding this comment

adamziel Dec 16, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented Jan 6, 2023

dmsnell commented Jan 9, 2023

sirreal commented Dec 10, 2024

dmsnell commented Dec 7, 2022 •

edited

Loading

adamziel Dec 16, 2022 •

edited

Loading

adamziel Dec 14, 2022 •

edited

Loading

adamziel Dec 16, 2022 •

edited

Loading