Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parser: Start tokenizing HTML within the blocks #2286

Closed
wants to merge 2 commits into from

Conversation

dmsnell
Copy link
Member

@dmsnell dmsnell commented Aug 8, 2017

related: @see #2210

This change adds a behavior which was present in the original parser but
removed for some reason when it was pulled directly into Gutenberg. That
behavior is producing a tree of tokenized HTML inside the blocks so that
we can work with the simpler data structure than having to repeatedly
parse and validate HTML strings from within the components and
throughout the different paths in the editor.

This change adds a new field to the parse output: children, which
contains the tree as a list of objects, each child being able to reflect
the structure of its parent with the same tree data structure. HTML tags
appear as type: HTML_Tag with name equal to the tag name. Otherwise
the HTML tags mimick the structure of blocks at the outermost level.

Future work will replicate this pattern to add in nested blocks (also in
the original parser but removed when it was brought into Gutenberg).
Nested blocks would appear in the children array and contain full
block structure, nested intuitively and without having an exceptional
structure or indication.

It's my hope that components, queries, and validation can take place in the
easier/faster tree data structure instead of via the parsed HTML strings.
This could mean that attribute access is achieved with expressions as
basic as block.children[ 0 ].attrs.href instead of writing the equivalent
query. Further, I hope that one day we can add children not by rendering
HTML strings, but by working with the data structure itself and serializing
back to the HTML string only when we need to. This would be a very good
step if we wanted to open up the ability to store on non-stringy backends
(store elsewhere than post_content via a plugin: Simperium, JSON-in-post_meta,
a database table directly, some custom external service…). Should also pave
the way for other editors conforming to the Gutenberg spec which may not
have as easy of a time parsing HTML as the browser does.

cc: @jleandroperez

Testing

Since so many of our tests are testing for identical equality of output against hand-written fixtures (instead of testing for the relevant properties of some behavior under test) then this PR is breaking a lot of tests. At the moment I don't really have the will or the time to fix them 🙃. Can review next week.

You can open the editor and work as usual. Since we're just adding a new children property to the parser output there should be no breakage once the data structure is in Gutenberg.

@dmsnell dmsnell requested review from nylen, mtias and aduth August 8, 2017 10:57
@dmsnell dmsnell force-pushed the parser/tokenize-html branch 2 times, most recently from b92c159 to 942cfa3 Compare August 8, 2017 10:58
@youknowriad
Copy link
Contributor

Having some trouble testing this branch because of the PHP parser. Do we need to regenerate it or something?

@dmsnell dmsnell force-pushed the parser/tokenize-html branch 2 times, most recently from 3570d9a to cee22dc Compare August 8, 2017 12:24
This change adds a behavior which was present in the original parser but
removed for some reason when it was pulled directly into Gutenberg. That
behavior is producing a tree of tokenized HTML inside the blocks so that
we can work with the simpler data structure than having to repeatedly
parse and validate HTML strings from within the components and
throughout the different paths in the editor.

This change adds a new field to the parse output: `children`, which
contains the tree as a list of objects, each child being able to reflect
the structure of its parent with the same tree data structure. HTML tags
appear as `type: HTML_Tag` with `name` equal to the tag name. Otherwise
the HTML tags mimick the structure of blocks at the outermost level.

Future work will replicate this pattern to add in nested blocks (also in
the original parser but removed when it was brought into Gutenberg).
Nested blocks would appear in the `children` array and contain full
block structure, nested intuitively and without having an exceptional
structure or indication.
@dmsnell dmsnell force-pushed the parser/tokenize-html branch from cee22dc to e2db809 Compare August 8, 2017 12:37
@WordPress WordPress deleted a comment from youknowriad Aug 8, 2017
@WordPress WordPress deleted a comment from youknowriad Aug 8, 2017
@dmsnell
Copy link
Member Author

dmsnell commented Aug 8, 2017

Having some trouble testing this branch because of the PHP parser. Do we need to regenerate it or something?

It seems like the PHP parser isn't working with text() - @nylen is there any insight you can provide here? Is php-pegjs just lagging behind the pegjs project or am I missing something big here?

Fatal error: Call to undefined function text() in /var/www/html/wp-content/plugins/gutenberg/lib/parser.php on line 304

@nylen
Copy link
Member

nylen commented Aug 8, 2017

phpegjs does not support text() yet has a $this->text() function but it is probably broken.

You can regenerate all the test fixtures by npm run fixtures:regenerate. Part of the task of this PR should be to inspect the changes and verify that they are as expected.

@dmsnell
Copy link
Member Author

dmsnell commented Aug 9, 2017

php-pegjs does not support text() yet. Also it should probably be implemented as $this->text().

Thanks for letting me know! In case it's not clear, the entire purpose is to return the parsed data structure as well as the source text which led to the parse.

npm run fixtures:regenerate

Thanks again!

@nylen
Copy link
Member

nylen commented Aug 10, 2017

This is something I've wanted to see for a while... the need is a bit less immediate due to #1929 and follow-up changes, but I expect this can pave the way forward for parsing nested blocks including shortcodes in a more well-defined manner.

I worry that with 10+ years of history in post_content it will get complicated quickly (it already kind of is...) and that it may not perform as well as the existing parser. Have you run any performance testing? (RUN_SLOW_TESTS=y SHOW_PERFORMANCE_INFO=y phpunit --filter large_post for PHP)

We'll need a new release of phpegjs to make text() work properly - see nylen/phpegjs#2.

@dmsnell
Copy link
Member Author

dmsnell commented Aug 10, 2017

Have you run any performance testing?

no and I tried naively to run the command you provided but it failed on account of something in /tmp/wordpress-tests-lib/ being unavailable. is that from the official WordPress unit test suite?

@nylen
Copy link
Member

nylen commented Aug 10, 2017

@dmsnell
Copy link
Member Author

dmsnell commented Sep 21, 2017

Closing in favor of #2743 which is simpler, more reliable, and implements the ideas I discussed with @mtias and @aduth for the nesting API

@dmsnell dmsnell closed this Sep 21, 2017
@dmsnell dmsnell deleted the parser/tokenize-html branch September 21, 2017 05:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants