diff --git a/grammar.js b/grammar.js
index 4231f54..9be6a71 100644
--- a/grammar.js
+++ b/grammar.js
@@ -1,57 +1,59 @@
-// Parsing markdown is hard, here's how this parser works:
-// (This is meant to implement the Github flavored markdown spec https://github.github.com/gfm/)
+// This is meant to implement the CommonMark spec https://spec.commonmark.org/.
+// To understand how this parser works it is useful to at least skim the specification. I will
+// sometimes refer to the GitHub Flavored Markdown spec (https://github.github.com/gfm/) instead,
+// which is just an extension of the CommonMark spec. It is also important to understand
+// tree-sitter's "conflicts".
 //
-// A markdown has a double tree structure. On the top level there are blocks, like
-// lists, block quotes, paragraphs, ..., then some of these blocks can contain inline
-// elements like emphasis, links, backslash escapes, ...
+// All code for this parser can be found in this file and in src/scanner.cc.
 //
-// Markdown can not be parsed by a traditional tree-sitter parser, but tree-sitter
-// offers to use an external "scanner" (or lexer, see src/scanner.cc) to inject some
-// hand-written C code into the parser. This parser tries to use this as little as
-// possible, in practice this means using the external scanner to parse:
+// There are 2 types of elements to parse: inline elements and block elements. Block elements
+// can contain other blocks and inline elements. Inline elements can contain other inline
+// elements.
 //
-// * All container blocks (besides list because they are just multiple list items)
-// This is needed because at the start of each line we need to match all open blocks
-// so we need to be able to look back on the parse stack arbitrarily far. Traditional
-// tree-sitter parsers are not able to do this.
+// Each block element always spans a range of lines. A block can only end at the end of a line.
+// Block structure can also always be determined by just the beginning of the line. To do this,
+// first all open blocks get "matched", meaning that any tokens needed to keep a block open get
+// parsed, e.g. the ">" for block quotes. If matching fails, that does not automatically mean
+// that the block closes on this line. It could also be a lazy continuation. After matching, new
+// blocks can be opened. More documentation about the matching process can be found in the
+// external scanner.
 //
-// * Some leaf blocks
-// The design of this parser has actually changed so that most leaf blocks _could_
-// be parsed by traditional rules, but in the initial version they were not and it
-// would take a lot of time to implement this
+// A lazy continuation can happen after any newline inside a paragraph when the following line
+// can be interpreted as part of the paragraph. E.g.
 //
-// * Code spans
-// Code span delimiters have an arbitrary ammount of backticks ('`'), which must
-// match between opening and closing delimiters. Maybe this would be possible to
-// do as an traditional tree-sitter rule, but it would be VERY ugly.
+// > foo
+// bar
 //
-// * Emphasis delimiters
-// Parsing of emphasis delimiters depends on the character before and after a run
-// of '*'s or '_'s, so we need more context than tree-sitter rules offer.
+// is just one paragraph inside a block quote. In essence this means that to check if a newline
+// can start a lazy continuation we need to check if the next line starts with a token that can
+// open a new block. If yes, then it cannot be a lazy continuation, as in
 //
-// Matching is done in 2 stages: First we try to match all open blocks, if we don't
-// manage to do so and cannot emit a lazy continuation (since we are not in a paragraph)
-// we close all unmatched blocks.
-// If we can emit a lazy continuation we still need to split the parser state to check
-// that the line does not start with a new block.
+// > foo
+// # bar
 //
-// A lot of inlines work like this: If we match an opening token, like a '`' or '