RFC: Spec for HTML Inlines vs. HTML Blocks #102

dillonkearns (Maintainer) · 2021-08-22T16:51:37Z

Background

Some aspects of the specification for HTML renderers need clarifying, along with related details about how HTML tags are parsed.

The goal is to make HTML tags in elm-markdown predictable, simple, and explicit, and to give you the tools to accomplish what you need.

Related issues: #50 and #70.

In #100, I have an implementation that changes HTML Inlines to run only the inline parser (instead of the block parser). This also simplifies the Inline and Block type definitions, because the Inline type can no longer refer back up to a Block.
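To make the type simplification concrete, here is a minimal sketch of the shape it implies. The types below are simplified and hypothetical (the variant names and the reduced `Html` type are assumptions for illustration, not the library's actual definitions); the point is only that `HtmlInline` is parameterized by `Inline`, so nothing in `Inline` mentions `Block`.

```elm
module AstSketch exposing (Block(..), Html(..), Inline(..))

-- Simplified, hypothetical AST for illustration only.


type Html children
    = HtmlElement String (List children)


type Inline
    = Text String
    | Emphasis (List Inline)
      -- Children of an inline HTML tag are inlines, never blocks.
    | HtmlInline (Html Inline)


type Block
    = Paragraph (List Inline)
      -- Children of a block HTML tag are blocks.
    | HtmlBlock (Html Block)
```

With this shape, a function that walks inlines never needs a case for blocks, which is where the simplification comes from.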

Wrapping Paragraphs

Should there be a way to render without the paragraph wrapper? In some cases, you may want to directly get the list of rendered inlines and choose how to wrap them yourself (rather than having them implicitly wrapped in a Paragraph). In other cases, having the list of rendered children represent two different things (rendered blocks vs. rendered inlines) could lead to strange visual bugs like rendering horizontal elements vertically or vice versa.

For example,

x<sup>3</sup>

If that were wrapped implicitly in a Paragraph, it would render the sup as a block and there wouldn't be a way to opt out of that display. If you're given a List representing the rendered inlines, then you could render them with an inline display and show them correctly.

Should there be a way to get the literal text within a tag?

In the case of the <sup>, maybe you don't want to deal with rendered markdown children at all and instead just want to get the text inside of the HTML tag and use that directly. Is this a good idea, or does it make rendering more confusing and harder to predict? For example, should someone writing x<sup>**3**</sup> reasonably expect to be using markdown within the HTML tag? Is it worth the extra mental overhead of having two different modes here? There are some other use cases worth considering, like rendering different formats such as LaTeX. In cases like that, you wouldn't want markdown parsing to interfere with the raw format, so having access to the unparsed text would open up those use cases.

The types would need to change to contain the raw data. I'm confident that parsing should happen independently of HTML renderers, so you should be able to take a parsed AST and pass it to any renderer. The AST would therefore need to include both the parsed markdown children and the raw String in order to handle both.

It would also be possible to defer parsing the inner body, but I prefer to have a fully parsed structure so the data structure can be traversed without doing multiple calls to the parser, and also for performance reasons for tools like elm-pages that want to fully parse the AST at build time and then serialize that data to avoid running the parser in the browser.

The type would need to change to something like this:

type Html children
-   = HtmlElement String (List HtmlAttribute) (List children)
+   = HtmlElement String (List HtmlAttribute) (List children) (Maybe String)
    | HtmlComment String
    | ProcessingInstruction String
    | HtmlDeclaration String String
    | Cdata String

The Maybe String would be the unparsed string inside the HTML tag (or Nothing if it is a self-closing tag).
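As a rough sketch of how a renderer might use that extra field, the module below restates the proposed type and adds a small helper. The `HtmlAttribute` record shape and the `rawBody` and `rawBodyOrEmpty` names are assumptions made up for this example, not part of any existing API.

```elm
module RawBodySketch exposing (Html(..), HtmlAttribute, rawBody, rawBodyOrEmpty)

-- Assumed attribute shape, for illustration only.


type alias HtmlAttribute =
    { name : String, value : String }


type Html children
    = HtmlElement String (List HtmlAttribute) (List children) (Maybe String)
    | HtmlComment String
    | ProcessingInstruction String
    | HtmlDeclaration String String
    | Cdata String


{-| The unparsed text inside an element, or Nothing for self-closing
tags and for every non-element variant.
-}
rawBody : Html children -> Maybe String
rawBody html =
    case html of
        HtmlElement _ _ _ raw ->
            raw

        _ ->
            Nothing


{-| Hypothetical convenience for renderers that only want the raw text,
such as a LaTeX or <sup> handler.
-}
rawBodyOrEmpty : Html children -> String
rawBodyOrEmpty html =
    rawBody html |> Maybe.withDefault ""
```

A renderer that wants rendered markdown would keep using the `List children` argument; one that wants the literal text would read the `Maybe String`, so both modes are served by the same parsed AST.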

Defining Inline vs. Block Renderers

Should you be able to define an HTML renderer to only handle Inline HTML or only Block HTML? For example, if you have an HTML handler for a <Youtube id="..." /> tag, it may be designed to render properly as a block, but could look strange in the middle of a Paragraph of text.

Validation Semantics

It could potentially fit into the mental model of HTML renderers as just another validation. Just like you can have a validation error if there is a required HTML attribute, or an unhandled HTML tag, you could also give an error if you try to render a <Youtube id="..." /> embed as an inline.

Disallowing block render could be confusing

If you could give an error when an HTML tag is used as a Block, and only allow it to be used as an Inline HTML tag, that could be confusing, because a tag can change from Inline HTML to Block HTML simply by moving from the middle of a line to the beginning. This seems likely to cause issues: allowing a tag to be rendered only as an Inline would disallow using it as the first item in a paragraph.

This could be a sign that either:

It's not a good idea to have a feature to disallow Block or Inline HTML independently, or
It should only be possible to disallow Inline HTML rendering for a type of tag, but it should not be possible to disallow Block HTML rendering
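The second option could be modeled as one more validation that returns a `Result`. This is only a sketch: `Context`, `Placement`, and `validate` are hypothetical names, and the absence of an "inline only" variant is deliberate, mirroring the idea that Block rendering can never be disallowed.

```elm
module PlacementSketch exposing (Context(..), Placement(..), validate)

-- Where the parser found the tag.


type Context
    = AsBlock
    | AsInline



-- What a handler declares it supports. There is intentionally no
-- InlineOnly variant.


type Placement
    = BlockOrInline
    | BlockOnly


validate : String -> Placement -> Context -> Result String ()
validate tagName placement context =
    case ( placement, context ) of
        ( BlockOnly, AsInline ) ->
            Err ("<" ++ tagName ++ "> can only be used as Block HTML, but it was used inline.")

        _ ->
            Ok ()
```

Under these assumptions, `validate "Youtube" BlockOnly AsInline` yields an `Err`, while every Block usage passes, so moving a tag to the start of a line can never introduce a new error.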

Rendering

Or if you have an HTML handler for <DictionaryDefinition word="equestrian">, and the HTML handler is designed to display an annotated word and render it like an inline (similar to bold or italic inlines), it may render unexpectedly if it displays as a block element, pushing the remaining text to the next line.

For example, you wouldn't want this Block HTML to push the rest of the paragraph to a new line because the HTML tag is at the beginning of the line (making it Block HTML, not Inline HTML).

<DictionaryDefinition word="equestrian" /> is a word that dates back to the 1600s.

Since the DictionaryDefinition example relies on an HTML attribute for input, not on the rendered markdown children within the HTML tag, it could simply render with inline rather than block styling (using a span tag or CSS, for example). However, if we were relying on the rendered children, the parsing itself changes between block parsing and inline parsing.

<FunFacts title="Piano Facts">
- A standard grand piano has 88 keys.
- A Bösendorfer Imperial has 97 keys
</FunFacts>

This would parse as an UnorderedList as expected. If we use FunFacts as Inline HTML, should it wrap the children in a Paragraph, or should the renderedChildren be the list of rendered inlines?

Learn more about the violin <FunFacts title="Violin Facts">There are *two* violins in a string quartet.</FunFacts>

Should the renderedChildren in this case be:

A list with a single rendered Paragraph tag
A list with 3 rendered inlines (plain text, italic text, plain text)

And should there be a way to render differently based on whether it is Block HTML or Inline HTML?
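One way to picture the second answer, together with rendering differently per context, is to tag the rendered children with their kind. This is a sketch under assumptions: `RenderedChildren` and `funFacts` are invented names, and the chosen HTML elements are arbitrary.

```elm
module ChildrenSketch exposing (RenderedChildren(..), funFacts)

import Html exposing (Html)
import Html.Attributes as Attr

-- Hypothetical wrapper that tells a handler what its children are.


type RenderedChildren view
    = RenderedBlocks (List view)
    | RenderedInlines (List view)


funFacts : String -> RenderedChildren (Html msg) -> Html msg
funFacts title children =
    case children of
        RenderedBlocks blocks ->
            -- Block HTML: a titled section containing the rendered blocks.
            Html.section [] (Html.h3 [] [ Html.text title ] :: blocks)

        RenderedInlines inlines ->
            -- Inline HTML: stay within the surrounding paragraph.
            Html.span [ Attr.title title ] inlines
```

With a type like this the handler cannot mix up the two cases, which addresses the concern about one list representing two different things.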

Is Multi-Line Inline HTML valid?

If there is a newline before the closing tag of an HTML Inline, what should happen?

A) Don't parse it as an HTML tag, parse it as plain text
B) Parse it as HTML until the closing tag?

matheus23 · 2021-08-23T10:17:22Z

> It would also be possible to defer parsing the inner body, but I prefer to have a fully parsed structure so the data structure can be traversed without doing multiple calls to the parser, and also for performance reasons for tools like elm-pages that want to fully parse the AST at build time and then serialize that data to avoid running the parser in the browser.

Suppose there were an API that didn't "preparse" the content of HTML elements but instead provided it as raw strings (assuming this is possible?), and a user wrote an HTML renderer that calls the inline parser manually to transform that raw string back into inlines. Wouldn't it still be possible for elm-pages to distill this as a data source?
I assume the problem is that the user would then have to define their own intermediate format: something like a markdown AST, but with their custom HTML renderers defunctionalised in between, I guess.


matheus23 · 2021-08-23T10:22:07Z

> Is Multi-Line Inline HTML valid?

I think this is indeed a weird situation, but B would be least surprising for me personally.


matheus23 · 2021-08-23T10:36:22Z

> For example, you wouldn't want this Block HTML to push the rest of the paragraph to a new line because the HTML tag is at the beginning of the line (making it Block HTML, not Inline HTML).

Hmm yeah, this example illustrates something unexpected. I assume the spec says this HTML element should be a block and wrapped in paragraphs? (Does the spec specify something like this?)

I think the least surprising thing for me would be if the rules for when an HTML element is a block were the same as the rules for when a string of text becomes its own paragraph vs. when it becomes part of a paragraph.

Or, to try to specify it differently: For each HTML element, (as a thought experiment) replace all its characters (including the tags) with X. If these characters form a paragraph of their own, then the HTML element should be considered a block.
If these characters get added to another paragraph, then consider the element an inline.

Examples:

EDIT: Turns out the github markdown renderer renders the ordered list here weirdly for me :) Markdown is hard, huh?

<DictionaryDefinition word="equestrian" /> is a word that dates back to the 1600s.

becomes

XXXXXXXXXXXXXXXXXXXXX XXXXXXXXXXXXXXXXX XX is a word that dates back to the 1600s.

Here, the Xes would be part of the bigger paragraph => DictionaryDefinition should be considered an inline.

# Pianos   

<FunFacts title="Piano Facts">
- A standard grand piano has 88 keys.
- A Bösendorfer Imperial has 97 keys
</FunFacts>

Fun, isn't it?

becomes

# Pianos   

XXXXXXXXX XXXXXXXXXXXX XXXXXXX
X X XXXXXXXX XXXXX XXXXX XXX XX XXXXX
X X XXXXXXXXXXX XXXXXXXX XXX XX XXXX
XXXXXXXXXXX

Fun, isn't it?

The Xes become a paragraph of their own -> the HTML element should be a block.

Learn more about the violin <FunFacts title="Violin Facts">There are *two* violins in a string quartet.</FunFacts>

becomes

Learn more about the violin XXXXXXXXX XXXXXXXXXXXXX XXXXXXXXXXXX XXX XXXXX XXXXXXX XX X XXXXXX XXXXXXXXXXXXXXXXXXX

Thus, again, an inline.

Of course I'm not suggesting this gets implemented like this (replacing an HTML element's non-whitespace characters with some gibberish and checking what that gibberish becomes), but I think it's a good mental model for what a user might expect, right?
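Purely to illustrate that mental model, here is a small sketch. It assumes a deliberately naive notion of "paragraph" in which only blank lines (`"\n\n"`) separate paragraphs, so headings, lists, and other block starts are ignored. The names `mask`, `classify`, and `Kind` are invented for this example.

```elm
module MaskSketch exposing (Kind(..), classify, mask)


type Kind
    = Block
    | Inline


isWhitespace : Char -> Bool
isWhitespace char =
    char == ' ' || char == '\n' || char == '\t' || char == '\r'


{-| Replace every non-whitespace character with X, as in the thought
experiment.
-}
mask : String -> String
mask =
    String.map
        (\char ->
            if isWhitespace char then
                char

            else
                'X'
        )


{-| Once masked, the element is just text, so the question reduces to:
does any other text share its paragraph? `before` and `after` are the
document text on either side of the element.
-}
classify : { before : String, after : String } -> Kind
classify { before, after } =
    let
        sameParagraphBefore =
            before
                |> String.split "\n\n"
                |> List.reverse
                |> List.head
                |> Maybe.withDefault ""

        sameParagraphAfter =
            after
                |> String.split "\n\n"
                |> List.head
                |> Maybe.withDefault ""
    in
    if String.isEmpty (String.trim (sameParagraphBefore ++ sameParagraphAfter)) then
        Block

    else
        Inline
```

Under those assumptions, the DictionaryDefinition and violin examples classify as `Inline` because other text shares their paragraph, while the piano example, surrounded by blank lines, classifies as `Block`.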

1 reply

dillonkearns Aug 23, 2021
Maintainer Author

I love this idea! This does seem to capture the semantics for my intuition.

It seems like a pretty elegant solution to interpret an HTML tag at the start of a block of text as an inline HTML tag within a Paragraph rather than as a block. It also seems intuitive to me to interpret the text after HTML at the start of a line as part of the same paragraph. With the current parsing behavior, where the HTML tag is interpreted as a Block because it is at the beginning of the line, the text after that HTML tag is interpreted as a new paragraph, which is definitely not what I would expect as someone writing that out. There's also no clear way to fix that problem, so it feels like I'm stuck if it's interpreted as an HTML Block here:

dillonkearns (Maintainer, Author) · 2021-08-23T19:59:31Z

For reference, and to contrast with a different set of semantics, here is what MDXJS does. It feels counterintuitive to me.

In this example:

In the middle of a paragraph (as an inline), the Button renders the markdown emphasis
If it gets moved to the beginning of the paragraph, then it is rendered as raw *'s
A few leading spaces are still interpreted as a Block HTML tag with raw *'s
Adding text after the Button, like <Button>*Here is a button*</Button> for you, is still treated as Block HTML

What I find unintuitive about that implementation is that markdown parsing turns on or off based on the context. This seems likely to cause confusion.

To "turn markdown parsing on" for the Button when it's interpreted as a Block, you need to surround it with newlines.

<Button>

*Here is a button*
</Button>

See the MDX playground to try it out.

Turning markdown on/off based on newlines is somewhat of a separate issue. I think that if you write an HTML renderer that receives rendered views (not a String), then it should always receive rendered markdown, so there should be no state where markdown parsing within the HTML tag is turned off based on the surrounding context.

Perhaps this design is attempting to be as close as possible to how you can render markdown within HTML tags in vanilla markdown? Like in this example here: https://spec.commonmark.org/0.30/#example-152

<DIV CLASS="foo">

*Markdown*

</DIV>

Which renders to

<DIV CLASS="foo">
<p><em>Markdown</em></p>
</DIV>

Whereas this example: https://spec.commonmark.org/0.30/#example-162

<a href="foo">
*bar*
</a>

does not parse the *'s as emphasis but leaves them as raw asterisks, similar to the MDX example:

<a href="foo">
*bar*
</a>

Given that HTML is more of a first-class citizen in elm-markdown, I think the spec for this should behave in a more predictable way. We should essentially ignore the way markdown can be parsed within HTML tags in vanilla markdown, because that seems like more of a hack than a first-class feature.
