Block parsing and serialization/deserialization #391

nylen · 2017-04-10T17:07:09Z

I think we need an improved design for block parsing and serialization/deserialization. This can be thought of as a follow-up / companion issue to #390 which discusses data storage.

I think this is an important set of technical decisions to get right before people start building plugins on top of our new structure. Here is the way I would like to see this work:

Blocks declare the kinds of semantically meaningful markup they can understand (including where they expect key data to be stored) when they are registered. For example, for a gallery block, this would look like a set of figure and figcaption elements that point to WP images, probably using data- attributes to describe things like image IDs.
There is a centralized or at least "recommended" parsing process that deserializes block data from this schema and reports any errors, downgrading to the "generic html" block if key information is missing. This process should be very robust and very easy for plugins to hook into.
Splitting the post content into blocks and parsing out the data for each block should happen at the same time, in a single parsing step. This will become very important later as we add the ability to transparently upgrade shortcodes and chunks of "recognized" HTML markup to blocks.
The same structure described above (specified during the block registration) is used to serialize block information when changes are made.
If nothing about a block is changed (the user hasn't modified a block in the UI), its markup shouldn't be changed either, even if it's currently stored as a "legacy" structure like a shortcode.
Lots of test cases for all of the above.

There is a lot to flesh out here, and a lot to be done, but I think that in the long run we will be much better served by providing standard and robust ways for blocks to parse their content now.

The text was updated successfully, but these errors were encountered:

youknowriad · 2017-04-11T14:43:48Z

It's a good idea to compare these "requirements" with the current parsing/serialization approach.

Blocks declare the kinds of semantically meaningful markup they can understand

We have the attributes property used to extract data from the markup. The parsed block is composed from these extracted attributes and the ones stored in the block comment.

There is a centralized or at least "recommended" parsing process that deserializes block data from this schema and reports any errors, downgrading to the "generic html" block if key information is missing.

For now, there's not fallback to "freeform" block if the extracted attributes are empty (which probably means the markup is invalid). This could be done by adding another method to the block API (validate for example) and we'll fallback to the freeform block if it returns false

Splitting the post content into blocks and parsing out the data for each block should happen at the same time, in a single parsing step.

Wether the internal parsing is done in a single or two steps is not an issue for me. What we really need is a parse( string ) function that reports any error in the same way no matter the internal step we're in. For me, the number internal steps to achieve this is implementation detail.

Now we have this parse( string ) function, we should probably enhance the error handling. It's composed of two steps, the first using the generic grammar and the second using the attributes property of each block.

The same structure described above (specified during the block registration) is used to serialize block information when changes are made.

For now, the serialization happens in a separate method called save

If nothing about a block is changed (the user hasn't modified a block in the UI), its markup shouldn't be changed either, even if it's currently stored as a "legacy" structure like a shortcode.

Why is this an issue, this could be a good way to update the "legacy" posts iteratively but anyway, doesn't seem too hard to do.

I see where you're trying to go with this. Instead of having our current save, attributes and possibly validate properties while defining a block, you're thinking of a unique schema property to rule them all. It seems appealing but very hard to achieve. (Think of use cases where depending on its state, a block can have two different markups etc...).

Anyway, I'd be curious if you have any technical idea on how to do this.

nylen · 2017-04-11T15:13:46Z

Whether the internal parsing is done in a single or two steps is not an issue for me. What we really need is a parse( string ) function that reports any error in the same way no matter the internal step we're in. For me, the number internal steps to achieve this is implementation detail.

I've been looking into some options over the past few days... I am starting to agree with you that we may have multiple steps internally (one to split the post content into blocks, and another to parse/validate the blocks' content). However I still think it's pretty important to present a one-step API to parse a list of blocks from a post_content string.

I see where you're trying to go with this. Instead of having our current save, attributes and possibly validate properties while defining a block, you're thinking of a unique schema property to rule them all. It seems appealing but very hard to achieve.

Yes, basically. The largest issue I see with our current structure is the attributes block. I think it's far too flexible in some ways, and far too strict in other ways.

For example, an image block can contain basically whatever kind of markup you want, as long as it still has an <img src=...> tag somewhere in it, it will be recognized as a valid block. Then, when it is saved, the previous structure will be overwritten. Therefore simply querying for the src attribute is too flexible.

Also, the text block currently requires a wrapping p tag. If this isn't present, content will not be successfully parsed, again leading to the previous content being overwritten on save. Therefore simply querying for a p tag is too strict.

We could address these kinds of issues separately, but I think a much better solution is adding some kind of schema for acceptable markup.

I'd be curious if you have any technical idea on how to do this.

This is still pretty fuzzy to me - it needs to be something with a pretty simple interface, that plugin developers can understand and use quickly. I think pegjs is too complicated for this. Anyway, next I'll try to translate this into code and see how it goes.

aduth · 2017-04-18T20:11:35Z

Migrating discussion from #426 (comment) to here, the JSX proposal was intriguing to me, particularly in how it might eliminate the need for save altogether. As I expressed there, there's a delicate balance of strictness, flexibility, and approachability. I've spent the afternoon exploring some prior art, with a particular interest toward RELAX NG in some of its similarities with @nylen's ideas in #426. It's obviously more tailored to validation than it is to data mapping, but I toyed with some rough ideas for how it could be applied for existing blocks:

https://gist.github.com/aduth/6f039e7da2a312a1689c4ace2a2d062c

With heavy reference toward patterns prescribed by RELAX NG: http://books.xmlschemata.org/relaxng/relax-CHP-3-SECT-2.html

nylen · 2017-04-19T10:18:01Z

I quite like this proposal, overall it's a much cleaner iteration on my initial thoughts.

The main drawback I see is that we no longer have the <p> tags that directly map to the HTML. Perhaps this could be used as simple shorthand for <Element name="p">? (Edit: though I quite like the proposed <Attribute> syntax, and including shorthand would either complicate this or make it unavailable when using the shorthand.)

<Element>
	<Choice as="tagName" required>
		<Name>h1</Name>
		<Name>h2</Name>
		<Name>h3</Name>
		<Name>h4</Name>
	</Choice>
	<Attribute name="style">
		<Match
			pattern="\btext-align:\s*([^;]+)"
			index="1"
			as="align"
			default="left" />
	</Attribute>
	<Text as="content" />
</Element>

This Choice structure seems tricky to parse: I would expect that an Element requires a name, but in this case we'd have to go looking for it, and find it provided by the descendant Choice element. What do you think about <Element name="h1|h2|h3|h4"> here instead? What other places do you anticipate us needing to provide a Choice?

One other thing I didn't see addressed was how to grab an entire tag and its descendant tree as HTML/object content. What about <Element name="p" as="content">?

aduth · 2017-04-19T14:19:16Z

The book (which is a fantastic resource) has another example of the choice pattern:

http://books.xmlschemata.org/relaxng/relax-CHP-6-SECT-3.html

Applied to our case, this might be the way to handle an optional figure wrapper of the image block?

<Choice>
	<Element name="img">
		<Attribute name="src" as="url" required />
		<Attribute name="alt" as="alt" />
	</Element>
	<Element name="figure">
		<Element name="img">
			<Attribute name="src" as="url" required />
			<Attribute name="alt" as="alt" />
		</Element>
		<Optional>
			<Element name="figcaption">
				<Text as="caption" />
			</Element>
		</Optional>
	</Element>
</Choice>

(Optionally considering a parallel to define for modularizing repeated patterns; here the img description)

aduth · 2017-04-19T14:25:58Z

Related project: https://github.com/mangalam-research/salve

jasmussen · 2017-05-16T05:35:33Z

Which milestone would you assign this and the related ticket to? Beta?

nylen · 2017-05-16T14:39:05Z

That works for now, though this ticket will touch on a lot of things and we might have to move it later.

aduth · 2017-06-01T18:34:53Z

Brain-dump, specifically on the topic of extracting values from markup:

At this time there's a bit of a fork in the road with #892 and #914.

I find myself looking back at #892 and trying to feel my way through source matching once more. Issues with our current iteration are enumerated in #865. The ideas explored in #892 as an alternative may not really put us in much better a position except in clarifying intent and separating us further from the DOM. Verbose selectors like getEditableContent feel more specific than they ought to be, but do at least surface how we expect those values to be used.

ESLint recently released an AST selectors feature, which I find to be interesting, specifically in how they apply CSS-like selectors to an AST. In their case it's the parsed JavaScript AST, but I think we'd be equally in a position to apply it to one of parsed blocks, like the structure explored in @dmsnell's wp-post-grammar. I think there's an appeal in how it could allow very concise selection while remaining unbound to a DOM, e.g. url: 'figure > img[src]'.

There's still the distinction between source matching and save rendering though. What really appeals to me about #914 is not so much the validation (though it's great), it's that: if we're writing JSX for save anyways, why can't we use nearly the same thing to map back from the HTML it's generating to the block's object attribute form? We should be able to describe the structure of a block to serve both purposes. We're still early on in what this could look like, and I do think expressing schemas in forms like that in my previous comment are more burdensome than they really help. To be compelling, I think it needs more of the expressivity of JSX, which appears what @nylen's aiming for with phs shorthand elements.

It might come down to deciding just how much flexibility we want to allow in save logic. Forked logic paths which necessitate RELAX patterns like Choice could end up being sufficiently harmful to the ease-of-use of a schema-like approach. Personally, I'd rather stick with separate sourcing and saving logic if we'd need to support this level of flexibility. If we can limit to some predictability in how block content is to be structured, a phs-like approach could be appealing.

Some use-cases like the one described in #914 (comment) could surface some unexpected consequences of adopting an expressive structure; unexpected in the sense that it appears foreign to see descendants of an img, even when acknowledging that this is neither HTML nor a true React element. Distinguishing HTML from React elements has proven challenging enough already, and attempting to introduce yet another variation could exasperate this further.

nylen · 2017-06-01T23:06:48Z

url: 'figure > img[src]'

Something like this would go part of the way towards solving my concern that we are not doing enough to validate the markup that appears inside of blocks. (To me this is still the most appealing aspect of #914).

It looks like this wouldn't handle extra tags entirely though (p tags before/after the figure for example).

It might come down to deciding just how much flexibility we want to allow in save logic. Forked logic paths which necessitate RELAX patterns like Choice could end up being sufficiently harmful to the ease-of-use of a schema-like approach.

There are other ways around this. For example: allow specifying multiple schemas, one of which is the "canonical" version that is always used to save a block after it is modified.

Avoiding something like Choice would both simplify the schema validation code significantly and limit what it is capable of doing (perhaps not so useful as a general-purpose tool outside of this project).

nylen · 2017-08-01T17:40:13Z

Closing; this discussion has served its purpose and further tasks should be split out into separate, smaller issues.

nylen added [Feature] Block API API that allows to express the block paradigm. [Feature] Blocks Overall functionality of blocks labels Apr 10, 2017

This was referenced Apr 11, 2017

More forgiving tag names inikulin/parse5#33

Closed

Propose block APIs for backwards compatibility #413

Closed

Editable: Enter splits the current paragraph into two blocks #409

Merged

Use TinyMCE utilities to parse post content #426

Merged

aduth mentioned this issue Apr 19, 2017

Adds two core blocks beyond generic blocks Automattic/wp-post-grammar#18

Closed

nylen mentioned this issue May 1, 2017

Define API method for handling legacy content data #590

Closed

nylen added this to the Beta milestone May 16, 2017

ellatrix assigned nylen May 22, 2017

This was referenced May 22, 2017

Discussion: Block attributes initialization #865

Closed

Try: Alternative attributes initialization #892

Closed

nylen mentioned this issue May 26, 2017

Add HTML schema for separator block #914

Closed

5 tasks

jasmussen modified the milestones: May Week 4: Beta, May 29 to Jun 2 May 30, 2017

aduth mentioned this issue Jun 1, 2017

Plugin: PHP wrapper to add new blocks #886

Closed

This was referenced Jun 1, 2017

Surface alignment toolbar as block control #987

Merged

Change attribute serialization to explicit opt-in #988

Closed

jasmussen modified the milestones: May 29 to Jun 2, Beta Jun 5, 2017

aduth mentioned this issue Jun 12, 2017

Gallery: Add simple columns slider #1135

Merged

jasmussen modified the milestones: Beta, Beta 2 Jun 21, 2017

mtias removed this from the Beta 2 milestone Jun 22, 2017

This was referenced Jul 14, 2017

Enhance attributes to specify types and defaults #1905

Merged

Validate and protect corrupted block content #1929

Merged

nylen closed this as completed Aug 1, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Block parsing and serialization/deserialization #391

Block parsing and serialization/deserialization #391

nylen commented Apr 10, 2017

youknowriad commented Apr 11, 2017 •

edited

Loading

nylen commented Apr 11, 2017

aduth commented Apr 18, 2017 •

edited

Loading

nylen commented Apr 19, 2017 •

edited

Loading

aduth commented Apr 19, 2017

aduth commented Apr 19, 2017

jasmussen commented May 16, 2017

nylen commented May 16, 2017

aduth commented Jun 1, 2017

nylen commented Jun 1, 2017

nylen commented Aug 1, 2017

Block parsing and serialization/deserialization #391

Block parsing and serialization/deserialization #391

Comments

nylen commented Apr 10, 2017

youknowriad commented Apr 11, 2017 • edited Loading

nylen commented Apr 11, 2017

aduth commented Apr 18, 2017 • edited Loading

nylen commented Apr 19, 2017 • edited Loading

aduth commented Apr 19, 2017

aduth commented Apr 19, 2017

jasmussen commented May 16, 2017

nylen commented May 16, 2017

aduth commented Jun 1, 2017

nylen commented Jun 1, 2017

nylen commented Aug 1, 2017

youknowriad commented Apr 11, 2017 •

edited

Loading

aduth commented Apr 18, 2017 •

edited

Loading

nylen commented Apr 19, 2017 •

edited

Loading