Parser: Start tokenizing HTML within the blocks #2286

dmsnell · 2017-08-08T10:57:16Z

This change adds a behavior which was present in the original parser but
removed for some reason when it was pulled directly into Gutenberg. That
behavior is producing a tree of tokenized HTML inside the blocks so that
we can work with the simpler data structure than having to repeatedly
parse and validate HTML strings from within the components and
throughout the different paths in the editor.

This change adds a new field to the parse output: children, which
contains the tree as a list of objects, each child being able to reflect
the structure of its parent with the same tree data structure. HTML tags
appear as type: HTML_Tag with name equal to the tag name. Otherwise
the HTML tags mimick the structure of blocks at the outermost level.

Future work will replicate this pattern to add in nested blocks (also in
the original parser but removed when it was brought into Gutenberg).
Nested blocks would appear in the children array and contain full
block structure, nested intuitively and without having an exceptional
structure or indication.

It's my hope that components, queries, and validation can take place in the
easier/faster tree data structure instead of via the parsed HTML strings.
This could mean that attribute access is achieved with expressions as
basic as block.children[ 0 ].attrs.href instead of writing the equivalent
query. Further, I hope that one day we can add children not by rendering
HTML strings, but by working with the data structure itself and serializing
back to the HTML string only when we need to. This would be a very good
step if we wanted to open up the ability to store on non-stringy backends
(store elsewhere than post_content via a plugin: Simperium, JSON-in-post_meta,
a database table directly, some custom external service…). Should also pave
the way for other editors conforming to the Gutenberg spec which may not
have as easy of a time parsing HTML as the browser does.

cc: @jleandroperez

Testing

Since so many of our tests are testing for identical equality of output against hand-written fixtures (instead of testing for the relevant properties of some behavior under test) then this PR is breaking a lot of tests. At the moment I don't really have the will or the time to fix them 🙃. Can review next week.

You can open the editor and work as usual. Since we're just adding a new children property to the parser output there should be no breakage once the data structure is in Gutenberg.

youknowriad · 2017-08-08T11:12:10Z

Having some trouble testing this branch because of the PHP parser. Do we need to regenerate it or something?

This change adds a behavior which was present in the original parser but removed for some reason when it was pulled directly into Gutenberg. That behavior is producing a tree of tokenized HTML inside the blocks so that we can work with the simpler data structure than having to repeatedly parse and validate HTML strings from within the components and throughout the different paths in the editor. This change adds a new field to the parse output: `children`, which contains the tree as a list of objects, each child being able to reflect the structure of its parent with the same tree data structure. HTML tags appear as `type: HTML_Tag` with `name` equal to the tag name. Otherwise the HTML tags mimick the structure of blocks at the outermost level. Future work will replicate this pattern to add in nested blocks (also in the original parser but removed when it was brought into Gutenberg). Nested blocks would appear in the `children` array and contain full block structure, nested intuitively and without having an exceptional structure or indication.

dmsnell · 2017-08-08T12:38:56Z

Having some trouble testing this branch because of the PHP parser. Do we need to regenerate it or something?

It seems like the PHP parser isn't working with text() - @nylen is there any insight you can provide here? Is php-pegjs just lagging behind the pegjs project or am I missing something big here?

Fatal error: Call to undefined function text() in /var/www/html/wp-content/plugins/gutenberg/lib/parser.php on line 304

nylen · 2017-08-08T20:59:58Z

phpegjs ~~does not support text() yet~~ has a $this->text() function but it is probably broken.

You can regenerate all the test fixtures by npm run fixtures:regenerate. Part of the task of this PR should be to inspect the changes and verify that they are as expected.

dmsnell · 2017-08-09T12:40:14Z

php-pegjs does not support text() yet. Also it should probably be implemented as $this->text().

Thanks for letting me know! In case it's not clear, the entire purpose is to return the parsed data structure as well as the source text which led to the parse.

npm run fixtures:regenerate

Thanks again!

nylen · 2017-08-10T19:28:15Z

This is something I've wanted to see for a while... the need is a bit less immediate due to #1929 and follow-up changes, but I expect this can pave the way forward for parsing nested blocks including shortcodes in a more well-defined manner.

I worry that with 10+ years of history in post_content it will get complicated quickly (it already kind of is...) and that it may not perform as well as the existing parser. Have you run any performance testing? (RUN_SLOW_TESTS=y SHOW_PERFORMANCE_INFO=y phpunit --filter large_post for PHP)

We'll need a new release of phpegjs to make text() work properly - see nylen/phpegjs#2.

dmsnell · 2017-08-10T21:29:16Z

Have you run any performance testing?

no and I tried naively to run the command you provided but it failed on account of something in /tmp/wordpress-tests-lib/ being unavailable. is that from the official WordPress unit test suite?

nylen · 2017-08-10T22:22:17Z

That should be covered under https://github.com/WordPress/gutenberg/blob/master/CONTRIBUTING.md#php-testing

dmsnell · 2017-09-21T05:45:54Z

Closing in favor of #2743 which is simpler, more reliable, and implements the ideas I discussed with @mtias and @aduth for the nesting API

dmsnell requested review from nylen, mtias and aduth August 8, 2017 10:57

dmsnell force-pushed the parser/tokenize-html branch 2 times, most recently from b92c159 to 942cfa3 Compare August 8, 2017 10:58

dmsnell force-pushed the parser/tokenize-html branch 2 times, most recently from 3570d9a to cee22dc Compare August 8, 2017 12:24

dmsnell force-pushed the parser/tokenize-html branch from cee22dc to e2db809 Compare August 8, 2017 12:37

WordPress deleted a comment from youknowriad Aug 8, 2017

Use $this->text() instead of test() and update fixtures

c14e05b

dmsnell closed this Sep 21, 2017

dmsnell deleted the parser/tokenize-html branch September 21, 2017 05:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parser: Start tokenizing HTML within the blocks #2286

Parser: Start tokenizing HTML within the blocks #2286

dmsnell commented Aug 8, 2017

youknowriad commented Aug 8, 2017

dmsnell commented Aug 8, 2017

nylen commented Aug 8, 2017 •

edited

Loading

dmsnell commented Aug 9, 2017

nylen commented Aug 10, 2017

dmsnell commented Aug 10, 2017

nylen commented Aug 10, 2017

dmsnell commented Sep 21, 2017

Parser: Start tokenizing HTML within the blocks #2286

Parser: Start tokenizing HTML within the blocks #2286

Conversation

dmsnell commented Aug 8, 2017

youknowriad commented Aug 8, 2017

dmsnell commented Aug 8, 2017

nylen commented Aug 8, 2017 • edited Loading

dmsnell commented Aug 9, 2017

nylen commented Aug 10, 2017

dmsnell commented Aug 10, 2017

nylen commented Aug 10, 2017

dmsnell commented Sep 21, 2017

nylen commented Aug 8, 2017 •

edited

Loading