Dennis' list of broad and interesting things. #62437
Comments
I hope it's OK to comment here, I'm particularly interested in this section: Rewriting Core to take advantage of the HTML API.
There's a nexus of issues involving the current stack of content filters that are applied to the HTML before and after blocks are rendered. I've been bitten by this as a block author and previously with shortcodes, where the output is corrupted unexpectedly - for example, HTML
Here are the places where similar sets of content filters are applied, including

add_filter( 'the_content', 'do_blocks', 9 );
add_filter( 'the_content', 'wptexturize' );
add_filter( 'the_content', 'convert_smilies', 20 );
add_filter( 'the_content', 'wpautop' );
add_filter( 'the_content', 'shortcode_unautop' );
add_filter( 'the_content', 'prepend_attachment' );
add_filter( 'the_content', 'wp_replace_insecure_home_url' );
add_filter( 'the_content', 'do_shortcode', 11 ); // AFTER wpautop().
add_filter( 'the_content', 'wp_filter_content_tags', 12 ); // Runs after do_shortcode().

$content = $wp_embed->run_shortcode( $_wp_current_template_content );
$content = $wp_embed->autoembed( $content );
$content = shortcode_unautop( $content );
$content = do_shortcode( $content );
...
$content = do_blocks( $content );
$content = wptexturize( $content );
$content = convert_smilies( $content );
$content = wp_filter_content_tags( $content, 'template' );
$content = str_replace( ']]>', ']]&gt;', $content );

// Run through the actions that are typically taken on the_content.
$content = shortcode_unautop( $content );
$content = do_shortcode( $content );
$seen_ids[ $template_part_id ] = true;
$content = do_blocks( $content );
unset( $seen_ids[ $template_part_id ] );
$content = wptexturize( $content );
$content = convert_smilies( $content );
$content = wp_filter_content_tags( $content, "template_part_{$area}" );
...
$content = $wp_embed->autoembed( $content );

$content = apply_filters( 'the_content', str_replace( ']]>', ']]&gt;', $content ) );

Here are some of the issues related to these content filters.
They lead back to the same root cause, which can be solved with an HTML processing pipeline. Instead of treating content as "blobs" and searching and replacing strings with regular expressions, we would be working with a well-defined data structure, down to the tags, attributes, text nodes, and tokens. To quote from a prior comment:
It's encouraging to see where the HTML API is headed, as a streaming parser and single-pass processor that could unite and replace existing content filters with something that understands the semantic structure of the HTML document.
This could be like a sword that cuts through the knot of existing complected content filters into a consistent and extensible HTML processor.
Interesting how shortcodes appear in these threads, since they're a parallel mini-language embedded in HTML. I'm curious to see how "bits" (dynamic tokens) evolve to become a better shortcode system. (Was thinking "blit" might be a possible naming, like blocks and blits, since bit already has a common meaning in programming. Well I guess blit does too: "blit - bit block transfer: A logical operation in which a block of data is rapidly moved or copied in memory, most commonly used to animate two-dimensional graphics.")
On a tangent, I wonder about processing HTML not only in a single pass but also walking the tree in a single flat loop without recursive function calls. I'm guessing this is how a streaming parser is meant to be used. In pseudo-code, a processor with nested calls might look like:

function processHtmlNodesRecursively(nodes: HtmlNode[]): string {
  let content = ''
  for (const node of nodes) {
    // ..Process node
    const { type, tag, attributes, children = [] } = node
    if (type === 'tag') {
      // content += openTag + attributes
      content += processHtmlNodesRecursively(children) // :(
      // content += closeTag
      continue
    }
    // text, comment..
  }
  return content
}

Instead a flat loop with iteration:

function processHtmlNodesIteratively(nodes: HtmlNode[]): string {
  // Work list of nodes still to process, in document order
  const stack = [...nodes]
  let content = ''
  let node
  while ((node = stack.shift())) {
    // ..Process node
    const { type, tag, attributes, children = [] } = node
    if (type === 'tag') {
      // content += openTag + attributes
      // Queue the children, followed by a marker for the closing tag
      stack.unshift(...children, { type: 'close', tag }) // :)
      continue
    }
    // text, comment, close marker..
  }
  return content
}

Mm that isn't streaming but the basic idea applies, I think, if it used a parser instance to pull each parsed node, and process the nested tree of nodes iteratively or recursively.
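A runnable TypeScript sketch of the flat loop might look like the following. The HtmlNode and WorkItem shapes and the renderIteratively name are assumptions made for illustration; attribute values are not escaped and void elements are not handled, so this only demonstrates the traversal, not a real renderer.

```typescript
// Hypothetical node shape, for illustration only
type HtmlNode =
  | { type: 'text'; value: string }
  | { type: 'tag'; tag: string; attributes?: Record<string, string>; children?: HtmlNode[] }

// A work item is either a node still to render or a pending closing tag
type WorkItem = HtmlNode | { type: 'close'; tag: string }

function renderIteratively(nodes: HtmlNode[]): string {
  // Explicit stack: the last element is the next item to process
  const stack: WorkItem[] = nodes.slice().reverse()
  let output = ''
  while (stack.length > 0) {
    const item = stack.pop() as WorkItem
    if (item.type === 'text') {
      output += item.value
    } else if (item.type === 'close') {
      output += `</${item.tag}>`
    } else {
      const attributes = item.attributes ?? {}
      const rendered = Object.keys(attributes)
        .map((key) => ` ${key}="${attributes[key]}"`) // no escaping: sketch only
        .join('')
      output += `<${item.tag}${rendered}>`
      // Push the closing marker first, then the children in reverse,
      // so the children pop in document order before the closing tag
      stack.push({ type: 'close', tag: item.tag })
      for (const child of (item.children ?? []).slice().reverse()) {
        stack.push(child)
      }
    }
  }
  return output
}
```

Because the pending work lives in an array rather than on the call stack, nesting depth is limited by memory and not by the recursion limit.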
What I'm picturing is a streaming processor and renderer, which evaluates the parsed nodes dynamically back into a string (valid HTML document/fragment). It reminds me of a Lisp evaluator with tail-call elimination. (From: The Make-A-Lisp Process - Step 5: Tail call optimization)
This topic is on my mind because I once ran into a stack overflow (or an error about the maximum recursion depth) while parsing a large HTML document with deeply nested nodes, due to how the parse function recursively called itself. Since then I think about this
Related:
I like the idea of a language-agnostic spec and schema, for example implemented in PHP and TypeScript. Then the same data structures can be shared and processed by WordPress backend and Block Editor.
Thanks for the collection of links @eliot-akira!
The good news is that I think the HTML API already does what you're asking. It never even builds the DOM tree; instead, it walks through the document textually and tracks the location of the currently-processed node in the document structure. That is, while you may not have a link to some tree object for
The HTML API doesn't call itself recursively like how you've indicated in your comments, though there is a minimal recursive part that’s not particularly relevant. It's built around the concept of “the processing loop” which visits the start and end of each element in order.

$processor = WP_HTML_Processor::create_fragment( $html );
while ( $processor->next_token() ) {
    echo $processor->get_token_name() . "\n";
}

You can see the official documentation for more basic usage instructions.
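To make the shape of such a processing loop concrete outside PHP, here is a toy TypeScript sketch of a pull-style processor that tracks its position in the document structure without building a tree. It is not the HTML API: ToyProcessor and its methods are hypothetical names, and the scanner assumes well-formed input with no comments, no void elements, and no ">" inside attribute values.

```typescript
// Toy pull-style processor: only the loop shape, NOT the WordPress HTML API
class ToyProcessor {
  private html: string
  private offset = 0
  private tokenName = ''
  private closer = false
  private breadcrumbs: string[] = [] // names of the currently open elements

  constructor(html: string) {
    this.html = html
  }

  // Advance to the next token; returns false at the end of the input
  nextToken(): boolean {
    if (this.offset >= this.html.length) return false
    if (this.html.charAt(this.offset) === '<') {
      const end = this.html.indexOf('>', this.offset)
      if (end === -1) return false
      const inner = this.html.slice(this.offset + 1, end)
      this.closer = inner.charAt(0) === '/'
      const rest = this.closer ? inner.slice(1) : inner
      const space = rest.search(/\s/)
      this.tokenName = (space === -1 ? rest : rest.slice(0, space)).toUpperCase()
      // Track structure with a list of open element names, not a tree
      if (this.closer) this.breadcrumbs.pop()
      else this.breadcrumbs.push(this.tokenName)
      this.offset = end + 1
    } else {
      const next = this.html.indexOf('<', this.offset)
      this.offset = next === -1 ? this.html.length : next
      this.tokenName = '#text'
      this.closer = false
    }
    return true
  }

  getTokenName(): string { return this.tokenName }
  isCloser(): boolean { return this.closer }
  // Number of elements open after the current token has been visited
  getDepth(): number { return this.breadcrumbs.length }
}

// The processing loop: one flat loop, no recursion
const processor = new ToyProcessor('<div><p>Hello</p></div>')
const seen: string[] = []
while (processor.nextToken()) {
  seen.push(`${processor.isCloser() ? '/' : ''}${processor.getTokenName()} @ depth ${processor.getDepth()}`)
}
```

The only structural state is the list of open element names, which is why memory use follows nesting depth and not document size.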
I see, thank you for pointing to the docs, I'll dig deeper and start using the HTML processor more to understand how it works. The streaming design of the parser is so well thought-out. I have an existing implementation of HTML templates using another parser (that builds the whole tree in memory) which I plan to replace with this better one in Core.
Hopefully I didn't derail the issue thread with the topic of
I'm curious to see if there will be a TypeScript implementation of the HTML processor to run in the browser. Not sure what specific purpose it would serve, but it seems like the Block Editor could make use of parsing and rendering content and templates, for example to replace dynamic inline tokens with fresh values from the server. Anyway, will be following its progress with interest!
sounds great. feel free to reach out in WordPress.org Slack in the #core-html-api channel if you want to discuss it, or in individual PRs.
Not at all. It's a big goal of mine to (a) replace existing naive parsers in Core and (b) to see what we can do to provide a unified single pass for lots of transformations that don't require more context. For example,
I'm on some long-term vacation at the moment but I was hoping to get a browser version runnable too. From an earlier exploration I found the Tag Processor to be roughly four times faster than creating a new DOM node and setting
@ellatrix has discussed its possible use with RichText where it brings a number of benefits, including that the scanning nature lines up naturally with converting HTML to attributed text (plaintext with an array of ranges where formats appear). It can be used to quickly determine if a given tag is in a string (e.g. “are there any footnotes in this text?”) without making allocations or spending the time to fully parse the content.
Overall values and goals.
Performance guidelines:
If it's not measured, it's neither faster nor slower.
Modern CPUs are incredible machines. Take advantage of every abstraction leak. PHP does not run the way it looks like it should.
Defer where possible.
step() or next_thing() functions which communicate where they find their match and how long the match is. These functions can appear inside a loop to do a full parse, but they can also be used for finding the first of a thing in a document, or analyze a document with low overhead.
array(). This carries the added benefit that it's possible to add semantics and avoid pushing out internal details to all of the call sites for a given thing. For example, WP_HTML_Decoder::attribute_starts_with() is much more efficient than str_starts_with( WP_HTML_Decoder::decode( ) ) because it stops parsing as soon as it finds the given prefix or asserts that it cannot be there. This can save processing and allocating megabytes of data when applied on data URLs which are the src of images pasted from other applications.
Static structures are much faster than array(), and they provide inline documentation too!
Block Parser
Replace the default everything-at-once block parser with a lazy low-overhead parser.
next_delimiter() as a low-level utility. [#6760]
The current block parser has served WordPress well, but it demands that it parses the entire document into a block tree in memory all at once, and it's not particularly efficient. In one damaged post that was 3 MB in size, it took 14 GB to fully parse the document. This should not happen.
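As a rough illustration of the lazy approach, here is a minimal TypeScript sketch of a scanner that reports where the next block delimiter starts and how long it is, without building a block tree. The Delimiter shape, the nextDelimiter signature, and the parsing rules are assumptions made for illustration and are not the implementation in #6760; the sketch also assumes that "-->" never appears inside a delimiter's JSON attributes.

```typescript
// Hypothetical result shape: where the match is and how long it is
interface Delimiter {
  at: number     // offset where the delimiter starts
  length: number // length of the delimiter in UTF-16 code units
  kind: 'opener' | 'closer' | 'void'
  blockName: string
}

// Find the next block comment delimiter at or after `offset`, or null
function nextDelimiter(text: string, offset: number): Delimiter | null {
  let at = text.indexOf('<!--', offset)
  while (at !== -1) {
    const end = text.indexOf('-->', at + 4)
    if (end === -1) return null
    const inner = text.slice(at + 4, end).trim()
    const isCloser = inner.indexOf('/wp:') === 0
    if (isCloser || inner.indexOf('wp:') === 0) {
      const rest = inner.slice(isCloser ? 4 : 3)
      const space = rest.search(/\s/)
      const blockName = space === -1 ? rest : rest.slice(0, space)
      const isVoid = !isCloser && inner.charAt(inner.length - 1) === '/'
      return {
        at,
        length: end + 3 - at,
        kind: isCloser ? 'closer' : isVoid ? 'void' : 'opener',
        blockName,
      }
    }
    // An ordinary HTML comment: skip it and keep scanning
    at = text.indexOf('<!--', end + 3)
  }
  return null
}

// Usage: walk the delimiters one at a time, keeping only a cursor in memory
const post = '<!-- wp:paragraph --><p>Hi</p><!-- /wp:paragraph -->'
const visited: string[] = []
let cursor = 0
let current = nextDelimiter(post, cursor)
while (current !== null) {
  visited.push(`${current.kind}:${current.blockName}`)
  cursor = current.at + current.length
  current = nextDelimiter(post, cursor)
}
```

The same function serves a full parse inside a loop or a cheap question such as "does this post contain any blocks?", which only needs the first call.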
Core needs to be able to view blocks in isolation and only store in memory as much as it needs to properly render and process blocks. The need for less block structure has been highlighted by projects and needs such as:
Block API
block.json file.
Block Hooks
HTML API
Overall Roadmap for the HTML API
There is no end in sight to the development of the HTML API, but development work largely falls into two categories: developing the API itself; and rewriting Core to take advantage of what the HTML API offers.
Further developing the HTML API.
New features and functionality.
Introduce safe-by-default HTML templating. [#5949]
Properly parse and normalize URLs. [#6666]
Introduce Bits, for server-replacement of dynamic tokens. [Make, Discussion]
Encoding and Decoding of Text Spans
There is so much in Core that would benefit from clarifying all of these boundaries, or of creating a clear point of demarcation between encoded and decoded content.
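One way to picture such a boundary is a prefix check that decodes an attribute value only as far as it needs to. The TypeScript below is a toy sketch under stated assumptions: decodeStep and attributeStartsWith are hypothetical helpers written for this example, and the reference table covers only five named character references, whereas real HTML decoding has a much larger table and more rules.

```typescript
// Tiny subset of named character references (assumption for illustration)
const NAMED_REFERENCES: Array<[string, string]> = [
  ['amp;', '&'], ['lt;', '<'], ['gt;', '>'], ['quot;', '"'], ['colon;', ':'],
]

// Decode one character or character reference starting at `at`, reporting
// the decoded text and how many raw characters it consumed
function decodeStep(raw: string, at: number): { text: string; length: number } {
  if (raw.charAt(at) === '&') {
    for (const [ref, text] of NAMED_REFERENCES) {
      if (raw.slice(at + 1, at + 1 + ref.length) === ref) {
        return { text, length: ref.length + 1 }
      }
    }
  }
  return { text: raw.charAt(at), length: 1 }
}

// Does the decoded value of a raw (encoded) attribute start with `prefix`?
// Stops at the first mismatch or once the prefix is fully matched, so the
// rest of a multi-megabyte value is never decoded or copied.
function attributeStartsWith(raw: string, prefix: string): boolean {
  let rawAt = 0
  let matched = 0
  while (matched < prefix.length) {
    if (rawAt >= raw.length) return false
    const step = decodeStep(raw, rawAt)
    if (prefix.slice(matched, matched + step.text.length) !== step.text) return false
    matched += step.text.length
    rawAt += step.length
  }
  return true
}

// Usage: the raw src is encoded, the prefix is given in the decoded domain
const rawSrc = 'data&colon;image/png;base64,iVBORw0KGgo'
const isDataUrl = attributeStartsWith(rawSrc, 'data:')
```

The caller states its question in the decoded domain and never handles the raw value, which keeps the encoded and decoded sides from mixing at the call site.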
attribute_starts_with() which is akin to str_starts_with() but only for attributes.
Decoding GET and POST args.
There is almost no consistency in how code decodes the values from $_GET and $_POST. Yet, there is and can be incredible confusion over some basic transformations that occur:
Prior art
The HTML API can help here in coordination with other changes in core. Notably:
FORM elements add the accept-charset="utf-8" attribute, which overrides a user-preferred charset for a webpage (meaning that this is still necessary even if the <meta charset=utf-8> tag is present).
With these new specifications, the HTML API can ensure that whatever is decoded from $_GET and $_POST is what was intended to be communicated from a browser or other HTTP request. In addition, they can provide helpers not present with existing WordPress idioms, like default values.
Rewriting Core to take advantage of the HTML API.
Big Picture Changes
Create a final pass over the fully-rendered HTML for global filtering and processing. [#5662]
Mandate HTML5 and UTF-8 output everywhere. [#6536]
<meta charset="…"> that besides UTF-8. All escaping and encoding should occur as needed for HTML5. XML parsing, encoding, and decoding must take a completely different path. [See the section on the XML API].
Create a new fundamental Search infrastructure for WordPress.
Confusion of encoded and decoded text.
There's a dual nature to encoded text in HTML. WordPress itself frequently conflates the encoded domain and the decoded domain.
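The two domains can be made concrete with a small TypeScript sketch. The decodeOnce helper and its five-entry reference table are assumptions for illustration only; they stand in for a real HTML decoder.

```typescript
// Tiny decoder covering only a handful of character references
// (illustration only; real HTML decoding has a far larger table)
const REFERENCES: Record<string, string> = {
  amp: '&', lt: '<', gt: '>', quot: '"', nbsp: '\u00A0',
}

function decodeOnce(encoded: string): string {
  return encoded.replace(
    /&(amp|lt|gt|quot|nbsp);/g,
    (match: string, ref: string) => REFERENCES[ref] ?? match
  )
}

// Encoded domain: what sits in the HTML source
const encodedText = 'AT&amp;T docs: type &amp;nbsp; for a non-breaking space'
// Decoded domain: what the reader sees
const decodedText = decodeOnce(encodedText)

// Searching the encoded text for a decoded term silently fails
const foundInEncoded = encodedText.indexOf('AT&T') !== -1
const foundInDecoded = decodedText.indexOf('AT&T') !== -1

// Treating already-decoded text as encoded corrupts it: the literal
// "&nbsp;" the author wrote about becomes an actual non-breaking space
const doubleDecoded = decodeOnce(decodedText)
```

A function that does not know which domain its input is in has to guess, and either guess is wrong for some callers.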
Consider, for example, wp_space_regexp(), which by default returns the following pattern: [\r\n\t ]|\xC2\xA0|&nbsp;. There are multiple things about this pattern that reflect the legacy of conflation:
If the text is encoded we may find either \xC2\xA0 or &nbsp;, but if the text is decoded then this pattern will erroneously match on &nbsp; which presumably started as &amp;nbsp; and might have been someone trying to write about the non-breaking space.
Parsing and performance.
In addition to confused and corrupted content, Core also stands to make significant performance improvements by adopting the values of the HTML API and the streaming parser interfaces. Some functions are themselves extremely susceptible to catastrophic backtracking or memory bloat.
convert_smilies(). [#6762]
force_balance_tags(). [#5562]
normalize() method for constructing fully-normative HTML. But even this may not be necessary given the fact that the HTML Processor can properly navigate through a document structurally.
wp_html_split(). [#6651]
wp_kses_hair() and friends. [#6572]
wp_replace_in_html_tags(). [#6651]
wp_strip_tags().
wp_strip_all_tags(). [#6196]
wp_targeted_link_rel(). [#5590]
wp_kses_hair(), and passes around PCRE results.
Database
mysql_real_escape_string()? This calls a C function inside of MySQL that examines the currently set character sets for the connection/session/table/database. If we could reliably escape content from PHP then we could eliminate a database round-trip per placeholder in prepared statements.
Sync Protocol
WordPress needs the ability to reliably synchronize data with other WordPresses and internal services. This depends on having two things:
While this works to synchronize resources between WordPresses, it also serves interesting purposes within a single WordPress, for any number of processes that rely on invalidating data or caches:
XML API
Overall Roadmap for the XML API
While less prominent than the HTML API, WordPress also needs to reliably read, modify, and write XML. XML handling appears in a number of places: