Skip to content

Latest commit

 

History

History
214 lines (185 loc) · 10.2 KB

lossless-ast-parsing.md

File metadata and controls

214 lines (185 loc) · 10.2 KB

Notes on a full fidelity parser

Nickel does not have a formatter, which it should ideally have. The current parser and ast are not suited for this purpose since they perform desugaring during decoding and drop all information on whitespace and comments. For example, the following program:

{
  // This bar is very important
  foo.bar = "bar",
  foo.baz = "baz",
}

Parses to:

{
  foo = { bar = "bar"} & { baz = "baz" },
}

This note explores the options we have within the rust ecosystem to construct a full fidelity (or lossless) syntax tree.

Glossary

  • Concrete Syntax Tree (CST)/Parse Tree: A tree that results from parsing, and represents the syntactic structure of a string/file. May be lossless/full fidelity.
  • Abstract Syntax Tree (AST): A tree that represents the abstract syntactic structure of a string/file. Used for further processing of the file. Not lossless/full fidelity.
  • Lossless/full fidelity: A lossless CST is a tree that contains all information to recreate the original string/file exactly.
  • Typed: A Typed Syntax Tree is a tree where the type of the parent implies the type of the children. Every node has a specific type, with specific fields, nodes are not interchangeable. Most ASTs are typed.
  • Untyped: An Untyped Syntax Tree is a tree where the type/kind of a parent node does not force the type/kind of the children nodes. Lossless syntax trees are typically untyped.

Keeping the existing lalrpop grammar

The current lexing/parsing stack of Nickel is based on a lexer generated using logos, a parser generated by lalrpop, and a native (abstract) syntax tree.

Our initial idea was to keep the existing lalrpop grammar in place, but replacing the syntax tree with a rowan based alternative. Unfortunately, while it is possible to integrate logos, rowan, and lalrpop, comments and whitespace would still be lost. This is an inherent problem with lalrpop in which grammar rules match directly on the token input.

The problem with comments/whitespace

Above, it was stated that the way lalrpop consumes the token input stream is not well suited for a formatter (or other programming tools). This is a general problem with parsers that attempt to construct a typed syntax tree directly (there is no place for comments in a typed tree). There are ways around this, that fall broadly into three categories:

  1. Don't have a typed syntax tree. For these solutions, an untyped syntax tree is used instead, that is optionally converted to a typed tree when needed.
  2. Lose some information by "intelligently" assigning every comment to a node in the typed tree. The loss in information here means, for instance, that we have to assign a comment surrounded by empty lines to a expression somewhere near it.
  3. Lex the whitespace and comments, but don't pass this information to the lalrpop parser. From here, have lalrpop create either an untyped tree, or a stream of event representing an untyped ast and reinject the comments and whitespace into said tree or stream.

Options

Custom

Like rustc, nickel could employ a custom lexer and parser.

Advantages

  • Full control

Disadvantages

  • By far the most work
  • Rapid development is harder when implementing a parser from scratch

Rowan + Ungrammar (optional) + Logos

Rowan and Ungrammar are two libraries created for the rust-analyzer project.

Of course, rust-analyzer is greatly benefited by a lossless concrete syntax tree. To this extent they created two libraries specifically for this purpose. rowan and ungrammar.

Rowan is a library specifically created to represent (lossless) concrete syntax trees. It requires a parser to be written. For a higher level view, one can write functions that convert these nodes in the CST to AST nodes. This is done using the AstNode trait. In rust-analyzer, this layer is actually generated using a tool called Ungrammar. An overview of Rowan can be found in the rust-analyzer repository.

Ungrammar is "a ... formalism for describing concrete syntax trees". Concretely, ungrammar takes a file written in its grammar and produces a file that contains functions used to interact with the GreenNodes/RedNodes/lossess syntax tree (typically produced by rowan). Ungrammar has seen limited/no use outside of rust-analyzer. The main advantage of ungrammar is that it allows the syntax of a language to be changed easily without requiring manual changes to the AstNode implementations. This can be a great benefit in the early stages of development when changes to the CST are frequent. More on ungrammar can be read in a relevant blogpost.

Rowan is a library that has been adopted in several language servers and other programming tools. Additionally, there exists at least one language (slint) that (aside from a language server and formatter) also uses rowan for their compiler. Hence we can safely say rowan is likely suitable for nickel.

Since we can no longer use lalrpop, the parser would have to be written by hand. Parser combinators are not very suitable to this approach, as many are based around a parser succeeding or failing, negating rowan's advantage of being able to produce a CST even when there is an error.

There is of course nothing fundamentally impossible about using some parser combinator library.

Advantages

  • Rowan has seen use for compilers, language servers, and formatters.
  • Logos and Rowan can work nicely together, because the created parser links them together.

Disadvantages

  • We would have to manually write an additional layer for the AstNode implementations.
  • We might need two AstNode implementations, one used for the formatter, one used to construct the current AST. Alternatively, it might be possible to build a formatter directly on the CST.

Comment map/assigning comments

Another approach is used by ucg and nixfmt that both associate a comment with nodes in the typed syntax tree. Either by looking them up during pretty printing (thus saving having them in the tree for other purposes), or having every node in the tree have an optional associated comment.

This method is obviously not lossless in the sense that whitespace might still removed, but it would suit our purpose (formatter). There are some challenges related to how to format associated comments, especially comments that were not intended to be associated with anything.

Advantages

  • This is a safe option, ucg is a language much like nickel
  • Allows us to use any parser combinator library

Disadvantages

  • Not lossless

Reinjecting whitespace/comments

Xavier proposed the idea of keeping lalrpop, but having it generate a stream/list of what he called "events". These events represent an untyped ast in a flat manner. From here, the comments and whitespace can be reinjected into this event list before or during the construction of the lossless untyped ast. Of course, this method isn't restricted to lalrpop, any parser library can be used in this approach. Rust-analyzer employs this approach as is discussed in event.rs.

Advantages

  • Is a method that is agnostic to the libraries used. We could keep lalrpop in place for instance.

Disadvantages

  • Perceived complexity. This adds another layer to the parser. Xavier has already applied this approach to a toy language of themselves, so it is at least a valid approach.
  • Doesn't reduce the number of grammars.

Tree-sitter?

In issue #656, Yann remarked that there are multiple grammar definitions for the nickel language. It would be nice to share a grammar for editors and nickel itself. Tree-sitter's concept is not too far removed from rowan (an untyped tree that is easy to update) with similar purposes (to ease the creation of development tools). However, tree-sitter does not provide the nice interface that rowan provides. For instances, there are no SyntaxNodes/RedNodes and no included trait to convert the CST to a suitable AST. Aside from toy-examples there are, no existing projects that utilize tree-sitter as a parser for a rts/compiler/etc. This is due to the fact that tree-sitter's error nodes do not provide substantial information about the error that occurred. Some information can be derived from the surrounding nodes, but this is obviously suboptimal.

Don't use a uniform parser

While the concept of using the same parser/grammar for the interpreter and formatter/other tools is nice, we ought to also consider the fact that these tools have distinctly varying purposes. In short, a lossless parser just doesn't serve any purpose to the nickel binary. Based on current experience with the lalrpop parser, it might actually introduce more overhead if we try to use the same parser for both.

Proposed approach

I can't recommend any solution without input from those who have worked with the current lalrpop grammar. Depending on how hard it was to implement that, we should determine if integrating a lossless ast with the nickel binary is desirable. If so, I propose we contribute to the existing tree-sitter parser . With that experience, we can attempt to write a simple formatter, which could give us the insight required to determine if tree-sitter would be sufficient for use in the nickel binary.