-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Gutenberg still parsing HTML with regular expressions #5967
Comments
Hi @gschoppe, Thanks for the issue.
Can you share where you originally gained this impression?
Can you share more detail on your concerns and your suggestions? |
The goal of the PEG parser in Gutenberg is to provide a formal post grammar that can be parsed as a headless tree. This is what it does in JavaScript. As such, falling back to using regular expressions (which cannot parse HTML as a formal grammar, due to the two formats being at different chomsky levels) You can see the discussion by @dmsnell and @nylen here about the need to have the PHP and JS parsers match in functionality, and you can see the comment by @dmsnell here where he states that as a project, Gutenberg chose to not parse via regex. As for the issues, this regex carries over the issues of shortcode parsing. A simple example is attempting to parse a block that contains another copy of itself as nested content. |
One important clarification: the PEG parser still exists in PHP and is exposed to be used when full-tree consumption is needed. The decision to not use it for typical rendering was because we were unnecessary parsing every block into a tree only to reassemble again, while the only needed operation by default is to transform dynamic blocks (those that have render callbacks assigned) at render time. There's a discussion on how to accomplish this still using the same grammar spec by introducing another token for dynamic blocks in #4591 That's pending finding some time to work through the requirements. (We also really need a collection of posts to test the various heuristics.) |
Thanks @gschoppe for paying attention to the code! I just wanted to chime in and provide some extra context. While I too like to roll my eyes when I see a regular expression trying to parse HTML, what I'm really trying to express is my disapproval of trying to parse non-regular languages with a regular grammar (in the Chomsky sense: recursion, context-free rules, etc…). It's true we're using the In some unrelated conversation today we tossed around some code which uses the existing grammar to create the same kind of parse the referenced function already does. That is, the specification will remain formal but the implementation may not (as long as it properly implements the spec). Hope that helps! If you look at any well-established formal tested performant robust parser system then they will likely allow for the use of regular expressions to tokenize, which is the gist of what this function is doing. |
@mtias & @dmsnell I understand the speed implications of not parsing the full tree at run time, but at the moment, the dynamic blocks are never passed through the PEG parser at all. All the parsing, not just tokenization, is being done based on a regular expression. This may be somewhat safe as long as dynamic blocks are not allowed to have content or nested content, but I believe those are artificial constraints on the API that are going to cause problems eventually. I also believe that assuming that only dynamic blocks will ever need to be parsed at run-time is an artificial limitation on the capabilities of the API. I can think of many reasons that themes would want to parse specific static blocks, before rendering (such as switching a columns block to use bootstrap-compatible html) There was talk in #4591 of a light parser that could be run on every block, just to get the name and identify its position in the post, then could check to see if any hooks had been registered on that block. if no hooks and no render function, just output the HTML content. If a short-circuit filter exists, parse the block and pass the name, content, and attributes to the filter. If a render_callback exists, parse and run the render function with attributes, content, and name. There was some argument in that ticket as to whether the full parser could just be run by plugins as needed, but that introduces problems of its own. It means that for each plugin that needs to fully parse a block at runtime, an additional complete run of the parser is added. This means that a function that was considered too slow to run on standard page loads may get run an unknown number of times on each page load, depending on the combinations of themes and plugins in use. In general, there is an ideological issue with structured data being handled as unstructured data on page load, and parsing with a regular expression is a symptom thereof. If the data is in a PEG-defined format, it needs to be parsed through that system on load, so that plugins can make use of the structure. |
Good thoughts!
The performance was not the main driver for avoiding a parsing step on the server. The reason is more that we should not add layers of potential performance degradations for speculative cases (given a default state without a plugin wanting access to a given block in the tree-shape format). That should be something that happens when necessary, not something that affects all cases irrespective of the context even if it's performant. The nested issue that prompted the alternative implementation was merely a sign that the server doesn't know how to put together a post from a full-tree structure, and it highlights that unless you are operating on the whole set of blocks (example, moving blocks around, merging blocks, etc), you don't need to generate the full-tree structure for the entire post, only to operate within a node. That is just wasted effort and added complexity for the server at the moment. I do think there are some valid cases for getting a portion of the post in tree format — just the block sub-tree you care about. This might be something we can add to the The problem with a full tree representation is that it might affect block nodes that you don't know how to put together because you lack the block specific implementation that's part of the editor interface. Also, it's important we solidify the grammar as a canonical description of how posts are constructed (maybe as a document) so that different and potentially better parsers are written against the same set of tests to determine the same output. The tokenization (which is really what the regex in that other PR is doing for dynamic blocks) is the most trivial part of the process since our syntax is regular. |
The problem with this is that if there is not a guaranteed common parse of the data to tree format, any plugin that does need that format will need to add an additional run of the parser, which will create a situation where each plugin that needs to parse the post has a geometric impact on performance. Three plugins that need to deal with the tree state would mean three full parses of the post on load.
This could be solved by limiting wrapping blocks from altering the implementation of the blocks they contain. For example, with columns, if the columns were defined as separate divs that contain blocks, rather than a modification of the contained blocks themselves, then each block's implementation could be atomic. That way, you would simply pass the inner HTML of a block to the render function, along with everything else, and if there is no render callback or filter on the block, if would just return the HTML. In the case of wrapping blocks, you would treat the HTML that wraps each "child-block area" as an array of HTML strings, that will be interleaved with the set of rendered contained blocks, when that block is rendered, unless a callback or filter steps in. The issue of rendering from tree-view on server side needs to be solved in some way, whether that full parse happens by default or not, or else the ability for plugins to parse to tree-view becomes academic. What good is their ability to run the PEG parser, if the result is a lossy version of the data, that cannot be turned back into a post?
This only holds true so long as the grammar is defined in such a way that nesting blocks cannot contain an instance of themselves. Before that pull request modified the definition of the render callback, there was no such limitation on blocks, and the PEG parser itself places no such limitation. I posted a couple examples of block structures that would benefit from self-nesting dynamic blocks. I really don't believe they should be culled from the capabilities of the language, just to make a regex viable. |
You are saying to use the render_callback as a heuristic for whether to return tree-shaped content with inner blocks or just return the HTML when no callback is present? I think that could work, and is in the spirit of what I wrote before — that you should get access to the tree if you know what you are doing. The larger question is whether, by default, WordPress should be assembling the post from a tree when the majority of the content doesn't have that need.
Yes, I agree this needs to happen. The question is more whether it needs to happen by default. The level of flexibility in nested groups also means that the representation of inner areas is not always trivial. We have discussed having a |
To expand on the latter, given the shape of: Array [
Object {
"attrs": Object {},
"blockName": "core/block",
"innerBlocks": Array [],
"innerHTML": "",
}
] We could represent "innerHTML": Array [ "html", innerBlocksPlacement, "html" ], And the assumption would be that when there is a single element in the array, there are no block children. This follows some of the original discussion when defining support for nested groups and the importance of not duplicating nested content in the HTML representation. It still stands that we don't want to do this concatenation freely unless there is a need for the structure. Thanks for the conversation! |
Personally, I believe there is a need to digest structured data if you send structured data, just to provide consistency to the behavior of the PHP and JS APIs, and to provide locations for filters and actions to be added. After all, there are a lot of developers that need to adopt blocks and start using them, and if they see the default case to be "eh, that just has to do with the editor's UI", they are less likely to learn the system deeply, than if it is a uniform fully-integrated process on both ends. If it is a deal-breaking issue, and a full parse just can't happen on page load, by default, then there should be a flag to trigger a single canonical parse, via a short-circuit filter. That way, at least, if 5 plugins each need to perform a full parse, they would each only be setting a flag, rather than running their own independent parses that don't integrate with each other.
That looks like an excellent way to handle parsing atomic blocks, but it would require making that distinction. The current implementation of the columns block is purposefully non-atomic (it sets a column class designator directly on each of the child blocks), so it could not work with this system. Personally I have no idea how that could work well with placing a dynamic block in a column, since its html is rendered in php, so I think making blocks canonically atomic is a good idea, but there will be dissenting voices regarding additional semantic markup for column containers. |
I was thinking about this more, and considered saying that the token placeholder is unnecessary, since inner blocks will always be separated by blocks of HTML, so you would only need to have an array of the boundary HTML strings, and you could blindly interleave them (i.e. if you have an array of HTML strings that is 4 elements long, you know it represents HTML | Blocks | HTML | Blocks | HTML | Blocks | HTML ). However, it struck me that you could define nested block areas as a sort of container pseudo-block (only a pseudo-block as it wouldn't have its own render function, instead being rendered by the callback of the parent element), thereby allowing each nested area to contain its own attributes. For complex column logic, or complex conditional logic, or repeaters like accordions and tabs that separation of data might make parsing attributes a lot simpler. To demonstrate in code, it might look like:
For a static nested block like columns, it would look more like this:
As an example of a more common situation that tree-parsing could assist in, I think parsing the image block on load could have some serious benefits, with regard to handling image assets that have been deleted or renamed, bringing It would also open the door to updating image tags as new features gain better support. For example, the webp format in a I think if the possibility to parse these static blocks is provided by default, a lot of common situations like that will be found organically, as it encourages developers to consider the post as a set of blocks, even when not in the context of the editor. |
Has this issue been resolved? I haven't seen any merges that would show the PEG parser being used in |
Sorry didn't add more context, the main change has been all the work gone into #8083 which given the drastic performance improvements should allow these operations to go through it. |
gutenberg/lib/blocks.php
Line 144 in ea9416b
I was under the impression that one of the reasons that HTML comments were considered an acceptable structure, was that they would be parsed with a formal grammar (peg). In PHP, blocks are still being parsed with regular expressions, which are insufficient for processing html safely.
The text was updated successfully, but these errors were encountered: