RFC: Explicit Specification for elm-markdown's HTML Parsing Markdown Extension #93
dillonkearns
started this conversation in
Ideas
Replies: 1 comment
I created a separate RFC to discuss Block HTML vs. Inline HTML: #102.
Background
The goal of `elm-markdown` is to provide extensibility without adding new markdown syntax, instead leveraging the HTML syntax within markdown, with an API for customizing how HTML is rendered.

This is in contrast to other approaches like MDX, which add new syntax like `import`, `export`, and JSX interpolation. `elm-markdown` instead aims to have a syntax that is not specific to any language or framework, so the same markdown content could be rendered by differing implementations since it is not coupled to any language-specific syntax. For example, you could use Web Components to render the custom HTML tags. Or you could implement an API similar to `elm-markdown` for hooking in custom HTML handlers to render custom HTML tags. The `elm-markdown` approach is sort of flipped upside down compared to MDX - whereas MDX directly `import`s React components, in `elm-markdown` you write HTML tags, and it is a rendering concern how to interpret those HTML tags. It's similar to dependency injection - you're injecting the HTML handlers from outside of the markdown file (in the renderer code), rather than reaching out to `import` components from within the markdown.

So the goal is to be as similar to the markdown specification as possible. However, there are some important places where we diverge from the markdown specification for the specific purpose of supporting these custom HTML rendering use cases. For cases that aren't related to HTML and don't interfere with HTML parsing in any way, complete markdown spec compliance is the goal.
In this thread I hope to make all of the places where we diverge for the specific purpose of the custom HTML renderers explicit.
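To make the dependency-injection idea from the background concrete, here is a rough sketch of wiring up a handler for a custom tag with the library's `Markdown.Html` API. The `<bio>` tag, its `name` attribute, and `bioView` are made up for illustration, and the exact API may differ between `elm-markdown` versions, so treat this as a sketch rather than a reference.

```elm
module Example exposing (render)

import Html exposing (Html)
import Html.Attributes as Attr
import Markdown.Html
import Markdown.Parser
import Markdown.Renderer


-- Handler for a hypothetical <bio name="..."> tag (illustration only).
-- It receives the attribute value and the already-rendered children.
bioView : String -> List (Html msg) -> Html msg
bioView name children =
    Html.div [ Attr.class "bio" ]
        (Html.h2 [] [ Html.text name ] :: children)


-- The handlers are injected here, in the renderer, not in the markdown file.
renderer : Markdown.Renderer.Renderer (Html msg)
renderer =
    let
        default =
            Markdown.Renderer.defaultHtmlRenderer
    in
    { default
        | html =
            Markdown.Html.oneOf
                [ Markdown.Html.tag "bio" bioView
                    |> Markdown.Html.withAttribute "name"
                ]
    }


render : String -> Result String (List (Html msg))
render markdown =
    markdown
        |> Markdown.Parser.parse
        |> Result.mapError (\_ -> "Markdown parsing failed")
        |> Result.andThen (Markdown.Renderer.render renderer)
```

The markdown file itself only contains `<bio name="...">...</bio>`; a different implementation (Web Components, for example) could render the same content without any changes to it.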
Areas of Divergence
HTML Parsing Errors
Ideally, the `elm-markdown` library should have parsing that is guaranteed to succeed. The markdown spec is designed to always parse to something. For example, even unclosed link tags like `[This should have a closing paren](http://example.com` will be parsed as valid markdown: https://babelmark.github.io/?text=%5BThis+should+have+a+closing+paren%5D(http%3A%2F%2Fexample.com - just not as a link tag, but as a literal string.
I'm pretty certain that we should share this goal even with the markdown parsing. Some reasons that resilient markdown parsing is valuable:
On the other hand, a tool like `elm-pages` can give us build errors if anything has an error.

This example in the markdown spec shows how an invalid HTML tag is interpreted as just a plain string.
A few possibilities:
`Markdown.Parser.parse : String -> List Block` vs. `parseWithErrors : String -> Result Errors (List Block)`
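As a minimal sketch of how those two signatures could coexist, the resilient `parse` can be defined on top of the strict `parseWithErrors` by falling back to literal text. The `Block` and `Errors` types here are simplified stand-ins, and the "unclosed tag" check is a toy rule, not the library's actual parser.

```elm
module ParsingSketch exposing (Block(..), Errors, parse, parseWithErrors)

-- Hypothetical, simplified stand-ins for the real AST and error types.


type Block
    = Paragraph String
    | HtmlBlock String


type alias Errors =
    List String


-- Strict variant: reports problems instead of guessing.
-- Useful when a build tool should fail on malformed HTML.
parseWithErrors : String -> Result Errors (List Block)
parseWithErrors input =
    if String.startsWith "<" input && not (String.contains ">" input) then
        Err [ "Unclosed HTML tag: " ++ input ]

    else if String.startsWith "<" input then
        Ok [ HtmlBlock input ]

    else
        Ok [ Paragraph input ]


-- Resilient variant: always succeeds by falling back to literal text,
-- mirroring how the markdown spec treats invalid constructs.
parse : String -> List Block
parse input =
    case parseWithErrors input of
        Ok blocks ->
            blocks

        Err _ ->
            [ Paragraph input ]
```

With this shape, callers that want guaranteed output use `parse`, while tools that want build errors use `parseWithErrors`.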
Nested markdown parsing within HTML tags
Background
The markdown spec has a somewhat hacky and obscure way to wrap an HTML tag around rendered markdown:
Renders into
If you add an explicit closing `</div>` tag, it then renders as an unparsed markdown string `# Heading` inside of a div.

This is very counterintuitive, and the version that renders parsed markdown looks broken, whereas the one that renders unparsed markdown looks correct. So this is a place where we explicitly choose to diverge from the markdown spec (CommonMark and GFM).
Proposed Specification
To be explicit about HTML parsing, it should feel more like the semantics of HTML tags that we're familiar with. If you have an opening HTML tag, the parser should search until it finds a closing tag.
However, even with that simple rule, there is a challenging corner case: inline vs. block HTML tags. In markdown, there is a concept of blocks and inlines.
Block parsing always takes higher precedence than inline parsing.
For example, this will parse as an inline emphasis:
But we can interrupt it by adding a list marker, because the list marker is a block-level marker and therefore takes precedence:
However, this works differently for HTML tags. In this example, the HTML tag is still parsed even though the next line starts a block element (markdown list).
I think it would be most predictable if a closing HTML tag takes the highest precedence above all other parsing. So if there is a matching closing tag, no matter where it is in the document, it should be matched up as the closing tag.
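A rough sketch of the "closing tag wins" rule: given the text after an opening tag, look for the closing tag first, before any block or inline parsing gets a chance to interrupt. `splitAtClosingTag` is a hypothetical helper, and it naively takes the first closing tag, so it ignores nested tags with the same name.

```elm
module ClosingTagSketch exposing (splitAtClosingTag)

-- Hypothetical helper: takes the tag name and the text that follows the
-- opening tag. Returns the tag body and the remaining text, or Nothing
-- when no closing tag exists anywhere in the rest of the document.
-- Simplification: the first closing tag is used, so nested tags of the
-- same name are not handled.


splitAtClosingTag : String -> String -> Maybe ( String, String )
splitAtClosingTag tagName afterOpeningTag =
    let
        closingTag =
            "</" ++ tagName ++ ">"
    in
    case String.indexes closingTag afterOpeningTag of
        index :: _ ->
            Just
                ( String.left index afterOpeningTag
                , String.dropLeft (index + String.length closingTag) afterOpeningTag
                )

        [] ->
            Nothing
```

For example, `splitAtClosingTag "span" "one\n- two</span> rest"` gives `Just ( "one\n- two", " rest" )`: the list marker on the second line does not end the tag, because the search for the closing tag happens before block parsing of the body.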
Implicit Closing
Block Vs. Inline HTML
Most people won't be thinking about block vs. inline when they write markdown. But intuitively, you wouldn't expect a heading, which is a block element, to work within an inline HTML tag:
This, on the other hand, is a block HTML tag, so intuitively you should be able to use block elements like lists:
So I think `elm-markdown` should run the markdown inline parser within inline HTML tags, and the block parser within block HTML tags.

However, for the rendered markdown, should the rendered inlines be wrapped in a paragraph block? Or should they just be rendered as inlines directly?
Either approach can lead to some unintuitive cases. For example, if you display inlines horizontally, but you display block items vertically, you could end up changing the rendering from horizontal to vertical if you change something from a block HTML tag to an inline tag, or vice versa.
A few directions it could go:
A) Let the HTML handlers know whether it's inline or block
Seems a bit strange because handlers could render totally different things contextually. Also, you could imagine surprising rendering that feels like a "bug" (that is, just unexpected) because the handler's author doesn't realize it renders differently as inline, having never tested it that way. Or maybe they never intend to use it as inline. Just seems a little off.
B) Automatically wrap it in a Paragraph
This is likely to render somewhat reasonably, because it will just use their rendering logic to render paragraphs as expected. And it doesn't have the issues of letting the user render different things based on context like (A) does. However, there is less control, and if the user wants to skip the paragraph rendering, that could be problematic.
C) Maybe just a special inline renderer?
In addition to paragraph rendering, there could be an inlines renderer??? 🤔 Not sure if this makes any sense or not. And it's still all or nothing - doesn't give the user a chance to have inline HTML tags render like a paragraph for some HTML element types and like regular inlines in other cases.
D) Have a special inline renderer, like (C), and let the user specify whether to use the paragraph or regular inline wrapper for that type of HTML element
E) Have a special inline renderer, and let the user specify an inline wrapper function (`List view -> view`) for their HTML element
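To get a feel for option (E), here is a minimal sketch where each HTML handler carries its own `List view -> view` wrapper. The `HtmlHandler` record, its field names, and the `badge` example are all hypothetical, and `String` stands in for the view type to keep it self-contained.

```elm
module InlineWrapperSketch exposing (HtmlHandler, badge, renderInlineTag)

-- Hypothetical shape for option (E): the handler decides how the rendered
-- inlines inside its tag are grouped, instead of always getting a paragraph.


type alias HtmlHandler view =
    { tagName : String
    , wrapInlines : List view -> view
    , render : view -> view
    }


-- Wrap the rendered inlines with the handler's own wrapper, then hand the
-- result to the handler's rendering function.
renderInlineTag : HtmlHandler view -> List view -> view
renderInlineTag handler renderedInlines =
    renderedInlines
        |> handler.wrapInlines
        |> handler.render


-- Example handler using String as the view type (illustration only).
badge : HtmlHandler String
badge =
    { tagName = "badge"
    , wrapInlines = String.join " "
    , render = \content -> "[" ++ content ++ "]"
    }
```

Here `renderInlineTag badge [ "a", "b" ]` evaluates to `"[a b]"`. A handler that prefers paragraph-style rendering would pass its paragraph renderer as `wrapInlines` instead, which is roughly what (D) offers as a fixed choice.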
Indentation
Often the contents of HTML tags are indented by a level. How should we deal with these cases?
For example, with (2), should we treat this as an indented code block, or plain text?
Next Steps
Parse Inline HTML tags closed on new line without errors
Right now, inline HTML tags give a parsing error if they aren't closed on the same line.
Parses with error
Find other error cases and provide an API for parsing without errors
#83 adds a fuzz tester to catch any cases where the parser can fail. We should get a more robust fuzzer to catch more cases, and get all of those cases passing with whichever strategy we choose for handling errors.
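For reference, a fuzz test for "the parser never fails" could look roughly like the following with `elm-explorations/test`. This is a sketch of the idea, not necessarily how #83 is written, and `Fuzz.string` alone is unlikely to generate many interesting HTML cases, which is why a more targeted fuzzer would help.

```elm
module ParserFuzzTest exposing (suite)

import Expect
import Fuzz
import Markdown.Parser
import Test exposing (Test, fuzz)


-- Property: whatever string we feed in, parsing should return Ok.
suite : Test
suite =
    fuzz Fuzz.string "parsing arbitrary input never fails" <|
        \input ->
            Markdown.Parser.parse input
                |> Expect.ok
```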