RFC: Explicit Specification for elm-markdown's HTML Parsing Markdown Extension #93

dillonkearns (Maintainer) · 2021-08-20T16:57:05Z

Background

The goal of elm-markdown is to provide extensibility not by adding new markdown syntax, but by leveraging the HTML syntax within markdown, with an API for customizing how that HTML is rendered.

This is in contrast to other approaches like MDX, which add new syntax like import, export, and JSX interpolation. elm-markdown instead aims to have a syntax that is not specific to any language or framework, so the same markdown content could be rendered by differing implementations since it is not coupled to any language-specific syntax. For example, you could use Web Components to render the custom HTML tags. Or you could implement an API similar to elm-markdown for hooking in custom HTML handlers to render custom HTML tags. The elm-markdown approach is sort of flipped upside down compared to MDX - whereas MDX directly imports React components, in elm-markdown you write HTML tags, and it is a rendering concern how to interpret those HTML tags. It's similar to dependency injection - you're injecting the HTML handlers from outside of the markdown file (in the renderer code), rather than reaching out to import Components from within the markdown.

So the goal is to be as similar to the markdown specification as possible. However, there are some important places where we diverge from the markdown specification for the specific purpose of supporting these custom HTML rendering use cases. For cases that aren't related to HTML and don't interfere with HTML parsing in any way, complete markdown spec compliance is the goal.

In this thread I hope to make explicit all of the places where we diverge for the specific purpose of supporting the custom HTML renderers.

Areas of Divergence

- Closing HTML tags?
- Unclosed HTML tags
- Can the HTML parsing ever fail if there are unclosed tags? If not, what should the fallback behavior be? Should it be extensible, or always the same?
  - Could treat as literal strings
  - Could ignore
  - Which cases should be resilient against invalid HTML markup (similar to browsers, filling in implicit closing tags, etc.), and where should it be strict?
- GFM has an extension that filters out a fixed set of raw HTML tags (Disallowed Raw HTML). It's implemented differently, but should elm-markdown have a similar feature?

HTML Parsing Errors

Ideally, the elm-markdown library should have parsing that is guaranteed to succeed. The markdown spec is designed to always parse to something. For example, even unclosed link tags like

[This should have a closing paren](http://example.com

will be parsed as valid markdown, not as a link but as a literal string: https://babelmark.github.io/?text=%5BThis+should+have+a+closing+paren%5D(http%3A%2F%2Fexample.com

I'm pretty certain that we should share this goal even with the HTML parsing. Some reasons that resilient parsing is valuable:

- Markdown input can sometimes come from users, or other systems, so it's not guaranteed to be valid, but we do want to guarantee that we can render something
- If we're typing in a system with hot reloading, it can be a nice developer experience to see incomplete sections as raw text

On the other hand,

- In the Elm community, we highly value robust error feedback and quality error messages
- It's easy to make a mistake with HTML tags, and when we build sites with tools like elm-pages, we want a build error if anything is wrong so that we can be confident that everything is okay

This example in the markdown spec shows how an invalid HTML tag is interpreted as just a plain string.

A few possibilities:

- Have different parsing options to give errors or use fallbacks (Markdown.Parser.parse : String -> List Block vs. parseWithErrors : String -> Result Errors (List Block))
- Parse into a special warnings type, and optionally choose to ignore any warnings, or turn them into errors?
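
To make the two possibilities above concrete, here is a minimal, language-agnostic sketch in Python (not elm-markdown's actual API). The names `ParseWarning`, `find_unclosed_tags`, `parse_lenient`, and `parse_strict` are hypothetical, and the "parser" only scans for unbalanced HTML tags and returns the source text unchanged. It illustrates how a strict entry point could be layered on top of a lenient one that collects warnings.

```python
import re
from dataclasses import dataclass

# Assumption: a simplified tag pattern; attribute values containing '<' or '>' are not handled.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9-]*)[^<>]*?(/?)>")


@dataclass
class ParseWarning:
    offset: int
    message: str


def find_unclosed_tags(source):
    """Report opening tags without a matching close, and stray closing tags."""
    stack = []  # (tag name, offset) for tags that are still open
    warnings = []
    for match in TAG.finditer(source):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue  # <tag/> never needs a closing tag
        if not closing:
            stack.append((name, match.start()))
        elif stack and stack[-1][0] == name:
            stack.pop()
        else:
            warnings.append(ParseWarning(match.start(), f"unexpected closing tag </{name}>"))
    warnings.extend(ParseWarning(offset, f"unclosed tag <{name}>") for name, offset in stack)
    return sorted(warnings, key=lambda warning: warning.offset)


def parse_lenient(source):
    """Always succeeds: returns the (here unparsed) content plus any warnings."""
    return source, find_unclosed_tags(source)


def parse_strict(source):
    """Same parse, but any warning becomes an error."""
    content, warnings = parse_lenient(source)
    if warnings:
        raise ValueError("; ".join(warning.message for warning in warnings))
    return content
```

With this shape, a live-preview editor could call the lenient function and show the raw text, while a static site build could call the strict one and fail on the same input.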

Nested markdown parsing within HTML tags

Background

The markdown spec has a somewhat hacky and obscure way to wrap an HTML tag around rendered markdown:

<div>

# Heading

Renders into

<div>
  <h1>
    Heading
  </h1>
</div>

If you add an explicit closing </div> tag, it then renders as an unparsed markdown string # Heading inside of a div.

This is very counterintuitive, and the version that renders parsed markdown looks broken, whereas the one that renders unparsed markdown looks correct. So this is a place in the spec that we explicitly choose to diverge from the markdown spec (CommonMark and GFM).

Proposed Specification

To be explicit about HTML parsing, it should feel more like the semantics of HTML tags that we're familiar with. If you have an opening HTML tag, the parser should search until it finds a closing tag.

However, even with that simple rule, there is a challenging corner case: inline vs. block HTML tags. In markdown, there is a concept of blocks and inlines.

Block parsing always takes higher precedence than inline parsing.

For example, this will parse as an inline emphasis:

*Is this all
emphasized?*

But we can interrupt it by adding a list marker, because the list marker is a block-level marker and therefore takes precedence:

*Is this all
- emphasized?*

However, this works differently for HTML tags. In this example, the HTML tag is still parsed even though the next line starts a block element (markdown list).

Inline tag: <div>
- *asdf*

I think it would be most predictable if a closing HTML tag takes the highest precedence above all other parsing. So if there is a matching closing tag, no matter where it is in the document, it should be matched up as the closing tag.
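
As a rough illustration of "the closing tag wins over everything else", the following Python sketch scans the whole document for the closing tag that balances an opening tag, ignoring any markdown block structure in between. The function name `find_matching_close` and the simplified tag pattern are assumptions made for this example, not part of elm-markdown.

```python
import re


def find_matching_close(source, name, start):
    """Return the offset of the closing tag that matches a <name> tag whose
    opening tag ends at `start`, or -1 if the tag is never closed."""
    # Assumption: simplified tag syntax; attribute values may not contain '<' or '>'.
    pattern = re.compile(r"<(/?)" + re.escape(name) + r"(?=[\s/>])[^<>]*>")
    depth = 1  # the tag we are trying to close is already open
    for match in pattern.finditer(source, start):
        if match.group(0).endswith("/>"):
            continue  # self-closing tags do not change the nesting depth
        depth += -1 if match.group(1) else 1
        if depth == 0:
            return match.start()
    return -1
```

For the example above, if a `</div>` appears on any later line, the search finds it even though the list marker on the second line would normally start a new block. A return value of -1 is where the implicit-closing questions below come in.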

Implicit Closing

If there is no closing tag found anywhere, should it be implicitly closed?
If so, by what, any block level element or newline?
Should block HTML tags be handled differently with implicit closing than inline HTML tags?

Block Vs. Inline HTML

Most people won't be thinking about block vs. inline when they write markdown. But intuitively, you wouldn't expect a list, which is a block element, to work within an inline HTML tag:

Here is my <agenda>- [ ] Big Goal for Today</agenda>

This, on the other hand, is a block HTML tag, so intuitively you should be able to use block elements like lists:

Here is my
<agenda>
- [ ] Big Goal for Today
</agenda>

So I think elm-markdown should run the markdown inline parser within Inline HTML tags, and the block parser within Block HTML tags.
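
One possible way to decide which parser to run is sketched below in Python. It assumes that a tag counts as "block" when nothing but whitespace precedes it on its line, and as "inline" otherwise. That rule and the name `html_tag_context` are assumptions for illustration only; the real rule is exactly what this RFC is trying to pin down.

```python
def html_tag_context(source, tag_offset):
    """Classify the HTML tag starting at `tag_offset` as "block" or "inline".

    Assumption: a tag is block-level when only whitespace precedes it on its line.
    """
    line_start = source.rfind("\n", 0, tag_offset) + 1  # 0 when the tag is on the first line
    preceding = source[line_start:tag_offset]
    return "block" if preceding.strip() == "" else "inline"
```

Applied to the two `<agenda>` examples above, the first is classified as inline (so its contents would go through the inline parser and the list marker stays literal text), and the second as block (so its contents would go through the block parser and become a task list).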

However, for the rendered markdown, should the rendered markdown inlines be wrapped in a paragraph block? Or should they just be rendered inlines directly?

Either approach can lead to some unintuitive cases. For example, if you display inlines horizontally, but you display block items vertically, you could end up changing the rendering from horizontal to vertical if you change something from a block HTML tag to an inline tag, or vice versa.

Renders as block (renderer could render markdown children horizontally):

<bio name="Dillon Kearns" photo="https://avatars2.githubusercontent.com/u/1384166" twitter="dillontkearns" github="dillonkearns">**Hello** world</bio>

Renders as inline (renderer could render markdown children vertically):

Inline: <bio name="Dillon Kearns" photo="https://avatars2.githubusercontent.com/u/1384166" twitter="dillontkearns" github="dillonkearns">**Hello** world</bio>

A few directions it could go:

A) Let the HTML handlers know whether it's inline or block

Seems a bit strange because they could render totally different things contextually. Also, you could imagine surprising rendering that feels like a "bug" (that is, just unexpected) because they don't realize it's rendering differently because they never tested it as inline. Or maybe they never intend to use it as inline. Just seems a little off.

B) Automatically wrap it in a Paragraph

This is likely to render somewhat reasonably, because it will just use their rendering logic to render paragraphs as expected. And it doesn't have the issues of letting the user render different things based on context like (A) does. However, there is less control, and if the user wants to skip the paragraph rendering, that could be problematic.

C) Maybe just a special inline renderer?

In addition to paragraph rendering, there could be an inlines renderer??? 🤔 Not sure if this makes any sense or not. And it's still all or nothing - doesn't give the user a chance to have inline HTML tags render like a paragraph for some HTML element types and like regular inlines in other cases.

D) Have a special inline renderer, like (C), and let the user specify whether to use the paragraph or regular inline wrapper for that type of HTML element

E) Have a special inline renderer, and let the user specify an inline wrapper function (List view -> view) for their HTML element
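
Option (E) could look roughly like the following Python sketch, where each HTML handler carries its own wrapper function for inline children. Everything here is hypothetical (`HtmlHandler`, `as_paragraph`, `as_bare_inlines`, `render_inline_tag`, and using strings as the view type); it only illustrates the `List view -> view` idea, not elm-markdown's renderer API.

```python
from dataclasses import dataclass
from typing import Callable, List

View = str  # stand-in for whatever view type the renderer produces


def as_paragraph(children: List[View]) -> View:
    """Wrap inline children the way a paragraph block would be rendered."""
    return "<p>" + "".join(children) + "</p>"


def as_bare_inlines(children: List[View]) -> View:
    """Render inline children directly, with no wrapper."""
    return "".join(children)


@dataclass
class HtmlHandler:
    render: Callable[[View], View]              # renders the tag around its body
    wrap_inlines: Callable[[List[View]], View]  # the `List view -> view` wrapper


def render_inline_tag(handler: HtmlHandler, children: List[View]) -> View:
    """Render a tag that was used inline: wrap the children, then render the tag."""
    return handler.render(handler.wrap_inlines(children))


# Two hypothetical handlers that make different choices for the same situation.
bio = HtmlHandler(lambda body: f'<div class="bio">{body}</div>', as_paragraph)
badge = HtmlHandler(lambda body: f'<span class="badge">{body}</span>', as_bare_inlines)
```

Option (D) would be the special case where the only two wrappers a user can pick are the paragraph renderer and the bare inline renderer.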

Indentation

Often the contents of HTML tags are indented by a level. How should we deal with these cases?

1. Markdown content is not indented whatsoever (seems like it should be okay and just parsed as normal - this is the current behavior).
2. Markdown content in HTML tag is indented - should we interpret a certain number of indentation characters as not being part of the markdown?

For example, with (2), should we treat this as an indented code block, or plain text?

<MyHtmlTag>
    Is this a code block, or is it just a regular paragraph that is indented because it's in an HTML tag?
</MyHtmlTag>
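
One candidate answer, shown here only as a sketch, is to strip the indentation shared by every line of the tag's body before handing it to the markdown parser, so that only indentation beyond the common margin can form a code block. The Python below uses the standard library's `textwrap.dedent`; the function name `dedent_tag_body` is hypothetical, and this is not the current elm-markdown behavior.

```python
import textwrap


def dedent_tag_body(body):
    """Remove the leading whitespace common to all non-blank lines of an HTML tag's body."""
    return textwrap.dedent(body)
```

Under this rule the example above would parse as a regular paragraph, while a body such as `"    text\n\n        code\n"` would become `"text\n\n    code\n"`, which still contains an indented code block.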

Next Steps

Parse Inline HTML tags closed on new line without errors

Right now, inline HTML tags give a parsing error if they aren't closed on the same line.

This is inline <bio name="Dillon Kearns" photo="https://avatars2.githubusercontent.com/u/1384166" twitter="dillontkearns" github="dillonkearns">

</bio>

Parses with error

Problem at row 4
Expecting at least 1 tag name character

Find other error cases and provide an API for parsing without errors

#83 adds a fuzz tester to catch any cases where the parser can fail. We should get a more robust fuzzer to catch more cases, and get all of those cases passing with whichever strategy we choose for handling errors.
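
The general shape of such a fuzzer can be sketched as follows, in Python rather than with Elm's test tooling. The fragment list, `random_document`, and `find_failures` are all hypothetical names chosen for this example. The idea is to bias random input towards markdown and HTML delimiters and record every input on which the parser raises.

```python
import random

# Assumption: a hand-picked set of fragments likely to hit HTML and markdown edge cases.
FRAGMENTS = ["<bio>", "</bio>", "<bio ", "*", "- ", "\n", "[link](", ")", "text", "<", ">", "    "]


def random_document(rng, max_fragments=12):
    """Concatenate a random number of randomly chosen fragments."""
    count = rng.randint(0, max_fragments)
    return "".join(rng.choice(FRAGMENTS) for _ in range(count))


def find_failures(parse, runs=1000, seed=0):
    """Run `parse` on random documents and collect the inputs that make it raise."""
    rng = random.Random(seed)  # fixed seed so that failures are reproducible
    failures = []
    for _ in range(runs):
        document = random_document(rng)
        try:
            parse(document)
        except Exception as error:
            failures.append((document, error))
    return failures
```

A parser that meets the "always parses to something" goal would make `find_failures` return an empty list.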

dillonkearns (Maintainer, Author) · 2021-08-22T16:52:12Z

I created a separate RFC to discuss Block HTML vs. Inline HTML: #102.
