From 53f7284d2207b66fba1fe3cd936463612e750604 Mon Sep 17 00:00:00 2001 From: jonmeow Date: Thu, 22 Aug 2024 16:02:31 -0700 Subject: [PATCH 01/16] Move toolchain architecture to markdown --- toolchain/README.md | 4 +- toolchain/docs/README.md | 164 ++++++ toolchain/docs/adding_features.md | 413 ++++++++++++++ toolchain/docs/check.md | 608 ++++++++++++++++++++ toolchain/docs/check.svg | 1 + toolchain/docs/diagnostics.md | 222 ++++++++ toolchain/docs/idioms.md | 424 ++++++++++++++ toolchain/docs/parse.md | 912 ++++++++++++++++++++++++++++++ toolchain/docs/parse.svg | 1 + website/prebuild.py | 4 +- 10 files changed, 2749 insertions(+), 4 deletions(-) create mode 100644 toolchain/docs/README.md create mode 100644 toolchain/docs/adding_features.md create mode 100644 toolchain/docs/check.md create mode 100644 toolchain/docs/check.svg create mode 100644 toolchain/docs/diagnostics.md create mode 100644 toolchain/docs/idioms.md create mode 100644 toolchain/docs/parse.md create mode 100644 toolchain/docs/parse.svg diff --git a/toolchain/README.md b/toolchain/README.md index 33c36780d7631..43150b9641188 100644 --- a/toolchain/README.md +++ b/toolchain/README.md @@ -6,6 +6,4 @@ Exceptions. See /LICENSE for license information. SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception --> -A design is currently maintained in -[Google Drive](https://docs.google.com/document/d/1RRYMm42osyqhI2LyjrjockYCutQ5dOf8Abu50kTrkX0/edit?resourcekey=0-kHyqOESbOHmzZphUbtLrTw). -It'll be migrated to markdown once we are confident in its stability. +See [docs](docs/). 
diff --git a/toolchain/docs/README.md b/toolchain/docs/README.md new file mode 100644 index 0000000000000..4425e5e30c835 --- /dev/null +++ b/toolchain/docs/README.md @@ -0,0 +1,164 @@ +# Toolchain architecture + + + + + +## Table of contents + +- [Goals](#goals) +- [High-level architecture](#high-level-architecture) + - [Design patterns](#design-patterns) +- [Main components](#main-components) + - [Driver](#driver) + - [Diagnostics](#diagnostics) + - [Lex](#lex) + - [Bracket matching](#bracket-matching) + - [Parse](#parse) + - [Check](#check) +- [Adding features](#adding-features) +- [Alternatives considered](#alternatives-considered) + - [Bracket matching in parser](#bracket-matching-in-parser) + - [Using a traditional AST representation](#using-a-traditional-ast-representation) + + + +## Goals + +The toolchain represents the production portion of Carbon. At a high level, the +toolchain's top priorities are: + +- Correctness. +- Quality of generated code, including performance. +- Compilation performance. +- Quality of diagnostics for incorrect or questionable code. + +TODO: Add an expanded document that details the goals and priorities and link to +it here. + +## High-level architecture + +The default compilation flow is: + +1. Load the file into a [SourceBuffer](/toolchain/source/source_buffer.h). +2. Lex a SourceBuffer into a + [Lex::TokenizedBuffer](/toolchain/lex/tokenized_buffer.h). +3. Parse a TokenizedBuffer into a [Parse::Tree](/toolchain/parse/tree.h). +4. Check a Tree to produce [SemIR::File](/toolchain/sem_ir/file.h). +5. Lower the SemIR to an + [LLVM Module](https://llvm.org/doxygen/classllvm_1_1Module.html). +6. CodeGen turns the LLVM Module into an Object File. + +### Design patterns + +A few common design patterns are: + +- Distinct steps: Each step of processing produces an output structure, + avoiding callbacks passing data between structures. 
+ + - For example, the parser takes a `Lex::TokenizedBuffer` as input and + produces a `Parse::Tree` as output. + + - Performance: It should yield better locality versus a callback approach. + + - Understandability: Each step has a clear input and output, versus + callbacks which obscure the flow of data. + +- Vectorized storage: Data is stored in vectors and flyweights are passed + around, avoiding more typical heap allocation with pointers. + + - For example, the parse tree is stored as a + `llvm::SmallVector` indexed by `Parse::Node` + which wraps an `int32_t`. + + - Performance: Vectorization both minimizes memory allocation overhead and + enables better read caching because adjacent entries will be cached + together. + +- Iterative processing: We rely on state stacks and iterative loops for + parsing, avoiding recursive function calls. + + - For example, the parser has a `Parse::State` enum tracked in + `state_stack_`, and loops in `Parse::Tree::Parse`. + + - Scalability: Complex code must not cause recursion issues. We have + experience in Clang seeing stack frame recursion limits being hit in + unexpected ways, and non-recursive approaches largely avoid that risk. + +See also [Idioms](idioms.md) for abbreviations and more implementation +techniques. + +## Main components + +### Driver + +The driver provides commands and ties together the toolchain's flow. Running a +command such as `carbon compile --phase=lower ` will run through the flow +and print output. Several dump flags, such as `--dump-parse-tree`, print output +in YAML format for easier parsing. + +### Diagnostics + +The diagnostic code is used by the toolchain to produce output. + +See [Diagnostics](diagnostics.md) for details. + +### Lex + +Lexing converts input source code into tokenized output. Literals, such as +string literals, have their value parsed and form a single token at this stage. + +#### Bracket matching + +The lexer handles matching for `()`, `[]`, and `{}`. 
When a bracket lacks a +match, it will insert a "recovery" token to produce a match. As a consequence, +the lexer's output should always have matched brackets, even with invalid code. + +While bracket matching could use hints such as contextual clues from +indentation, that is not yet implemented. + +### Parse + +Parsing uses tokens to produce a parse tree that faithfully represents the tree +structure of the source program, interpreted according to the Carbon grammar. No +semantics are associated with the tree structure at this level, and no name +lookup is performed. + +See [Parse](parse.md) for details. + +### Check + +Check takes the parse tree and generates a semantic intermediate representation, +or SemIR. This will look closer to a series of instructions, in preparation for +transformation to LLVM IR. Semantic analysis and type checking occur during the +production of SemIR. It also does any validation that requires context. + +See [Check](check.md) for details. + +## Adding features + +We have a [walkthrough for adding features](adding_features.md). + +## Alternatives considered + +### Bracket matching in parser + +Bracket matching could have also been implemented in the parser, with some +awareness of parse state. However, that would shift some of the complexity of +recovery in other error situations, such as where the parser searches for the +next comma in a list. That needs to skip over bracketed ranges. We don't think +the trade-offs would yield a net benefit, so any change in this direction would +need to show concrete improvement, for example better diagnostics for common +issues. + +### Using a traditional AST representation + +Clang creates an AST as part of compilation. In Carbon, it's something we could +do as a step between parsing and checking, possibly replacing the SemIR. It's +likely that doing so would be simpler, amongst other possible trade-offs. 
+However, we think the SemIR approach is going to yield higher performance, +enough so that it's the chosen approach. diff --git a/toolchain/docs/adding_features.md b/toolchain/docs/adding_features.md new file mode 100644 index 0000000000000..a297127ffd7a4 --- /dev/null +++ b/toolchain/docs/adding_features.md @@ -0,0 +1,413 @@ +# Adding features + + + + + +## Table of contents + +- [Lex](#lex) +- [Parse](#parse) +- [Typed parse node metadata implementation](#typed-parse-node-metadata-implementation) +- [Check](#check) +- [SemIR typed instruction metadata implementation](#semir-typed-instruction-metadata-implementation) +- [Lower](#lower) +- [Tests and debugging](#tests-and-debugging) +- [Running tests](#running-tests) +- [Updating tests](#updating-tests) + - [Reviewing test deltas](#reviewing-test-deltas) +- [Verbose output](#verbose-output) +- [Stack traces](#stack-traces) + + + +## Lex + +New lexed tokens must be added to +[token_kind.def](/toolchain/lex/token_kind.def). `CARBON_SYMBOL_TOKEN` and +`CARBON_KEYWORD_TOKEN` both provide some built-in lexing logic, while +`CARBON_TOKEN` requires custom lexing support. + +[TokenizedBuffer::Lex](/toolchain/lex/tokenized_buffer.h) is the main dispatch +for lexing, and calls that need to do custom lexing will be dispatched there. + +## Parse + +A parser feature will have state transitions that produce new parse nodes. + +The resulting parse nodes are in +[parse/node_kind.def](/toolchain/parse/node_kind.def) and +[typed_nodes.h](/toolchain/parse/typed_nodes.h). When choosing node structure, +consider how semantics will process it in post-order; this will rule out some +designs. Adding a parse node kind will also require a handler in the `Check` +step. + +The state transitions are in [parse/state.def](/toolchain/parse/state.def). Each +`CARBON_PARSER_STATE` defines a distinct state and has comments for state +transitions. If several states should share handling, name them +`FeatureAsVariant`. 
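The state-stack approach described above can be illustrated with a minimal sketch. This is not the toolchain's parser: the `State` enumerators, `IsBracket`, and `ParseGroups` are hypothetical names for a toy grammar of nested `()` and `{}` groups, chosen only to show states being pushed and popped in a loop instead of using recursive calls.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

namespace parse_sketch {

// Hypothetical states for a toy grammar; the real parser states are listed in
// parse/state.def and are unrelated to these.
enum class State { Sequence, ParenFinish, BraceFinish };

inline auto IsBracket(char c) -> bool {
  return c == '(' || c == ')' || c == '{' || c == '}';
}

// Returns true if `input` consists of properly nested `()` and `{}` groups.
// Nesting is tracked on an explicit state stack rather than the call stack.
inline auto ParseGroups(std::string_view input) -> bool {
  std::vector<State> state_stack = {State::Sequence};
  std::size_t pos = 0;
  while (!state_stack.empty()) {
    // Each handler pops its own state before doing any other processing.
    State state = state_stack.back();
    state_stack.pop_back();
    switch (state) {
      case State::Sequence: {
        // Skip plain characters until a bracket or the end of input.
        while (pos < input.size() && !IsBracket(input[pos])) {
          ++pos;
        }
        if (pos < input.size() && (input[pos] == '(' || input[pos] == '{')) {
          State finish =
              input[pos] == '(' ? State::ParenFinish : State::BraceFinish;
          ++pos;
          // States run in reverse push order: group contents, then the
          // closing bracket, then the rest of the enclosing sequence.
          state_stack.push_back(State::Sequence);
          state_stack.push_back(finish);
          state_stack.push_back(State::Sequence);
        }
        // A closing bracket is left for the enclosing *Finish state.
        break;
      }
      case State::ParenFinish:
      case State::BraceFinish: {
        char expected = state == State::ParenFinish ? ')' : '}';
        if (pos >= input.size() || input[pos] != expected) {
          return false;
        }
        ++pos;
        break;
      }
    }
  }
  // Anything left over is an unmatched closing bracket.
  return pos == input.size();
}

}  // namespace parse_sketch
```

Because nesting depth only grows the heap-allocated `state_stack`, deeply nested input does not deepen the C++ call stack, which is the scalability property the iterative design is aiming for.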
+ +Adding a state requires adding a `Handle` function in an appropriate +`parse/handle_*.cpp` file, possibly a new file. The macros are used to generate +declarations in the header, so only extra helper functions should be added +there. Every state handler pops the state from the stack before any other +processing. + +## Typed parse node metadata implementation + +As of [#3534](https://github.com/carbon-language/carbon-lang/pull/3534): + +![parse](parse.svg) + +> TODO: Convert this chart to Mermaid. + +- [common/enum_base.h](/common/enum_base.h) defines the `EnumBase` + [CRTP](idioms.md#crtp-or-curiously-recurring-template-pattern) class + extending `Printable` from [common/ostream.h](/common/ostream.h), along with + `CARBON_ENUM` macros for making enumerations + +- [parse/node_kind.h](/toolchain/parse/node_kind.h) includes + [common/enum_base.h](/common/enum_base.h) and defines an enumeration + `NodeKind`, along with bitmask enum `NodeCategory`. + + - The `NodeKind` enumeration is populated with the list of all parse node + kinds using [parse/node_kind.def](/toolchain/parse/node_kind.def) (using + [the .def file idiom](idioms.md#def-files)) _declared_ in this file + using a macro from [common/enum_base.h](/common/enum_base.h) + + - `NodeKind` has a member type `NodeKind::Definition` that extends + `NodeKind` and adds a `NodeCategory` field (and others in the future). + + - `NodeKind` has a method `Define` for creating a `NodeKind::Definition` + with the same enumerant value, plus values for the other fields. + + - `HasKindMember` at the bottom of + [parse/node_kind.h](/toolchain/parse/node_kind.h) uses + [field detection](idioms.md#field-detection) to determine if the type + `T` has a `NodeKind::Definition Kind` static constant member. + + - Note: both the type and name of these fields must match exactly. + + - Note that additional information is needed to define the `category()` + method (and other methods in the future) of `NodeKind`. 
This information + comes from the typed parse node definitions in + [parse/typed_nodes.h](/toolchain/parse/typed_nodes.h) (described below). + +- [parse/node_ids.h](/toolchain/parse/node_ids.h) defines a number of types + that store a _node id_ that identifies a node in the parse tree + + - `NodeId` stores a node id with no restrictions + + - `NodeIdForKind` inherits from `NodeId` and stores the id of a node + that must have the specified `NodeKind` "`Kind`". Note that this is not + used directly, instead aliases `FooId` for + `NodeIdForKind` are defined for every node kind using + [parse/node_kind.def](/toolchain/parse/node_kind.def) (using + [the .def file idiom](idioms.md#def-files)). + + - `NodeIdInCategory` inherits from `NodeId` and stores the id of + a node that must overlap the specified `NodeCategory` "`Category`". Note + that this is not typically used directly, instead this file defines + aliases `AnyDeclId`, `AnyExprId`, ..., `AnyStatementId`. + + - Similarly `NodeIdOneOf` and `NodeIdNot` inherit from `NodeId` + and store the id of a node restricted to either matching `T::Kind` or + `U::Kind` or not matching `V::Kind`. + - In addition to the node id type definitions above, the struct + `NodeForId` is declared but not defined. + +- [parse/typed_nodes.h](/toolchain/parse/typed_nodes.h) defines a typed parse + node struct type for each kind of parse node. + + - Each one defines a static constant named `Kind` that is set using a call + to `Define()` on the corresponding enumerant member of `NodeKind` from + [parse/node_kind.h](/toolchain/parse/node_kind.h) (which is included by + this file). + - The fields of these types specify the children of the parse node using + the types from [parse/node_ids.h](/toolchain/parse/node_ids.h). + + - The struct `NodeForId` that is declared in + [parse/node_ids.h](/toolchain/parse/node_ids.h) is defined in this file + such that `NodeForId::TypedNode` is the `Foo` typed parse node + struct type. 
+ + - This file will fail to compile unless every kind of parse node kind + defined in [parse/node_kind.def](/toolchain/parse/node_kind.def) has a + corresponding struct type in this file. + +- [parse/node_kind.cpp](/toolchain/parse/node_kind.cpp) includes both + [parse/node_kind.h](/toolchain/parse/node_kind.h) and + [parse/typed_nodes.h](/toolchain/parse/typed_nodes.h) + + - Uses the macro from [common/enum_base.h](/common/enum_base.h), the + enumerants of `NodeKind` are _defined_ using the list of parse node + kinds from [parse/node_kind.def](/toolchain/parse/node_kind.def) (using + [the .def file idiom](idioms.md#def-files)). + + - `NodeKind::definition()` is defined. It has a static table of + `const NodeKind::Definition*` indexed by the enum value, populated by + taking the address of the `Kind` member of each typed parse node struct + type, using the list from + [parse/node_kind.def](/toolchain/parse/node_kind.def). + + - `NodeKind::category()` is defined using `NodeKind::definition()`. + + - Tested assumption: the tables built in this file are indexed by the enum + values. We rely on the fact that we get the parse node kinds in the same + order by consistently using + [parse/node_kind.def](/toolchain/parse/node_kind.def). + +- [parse/tree.h](/toolchain/parse/tree.h) includes + [parse/node_ids.h](/toolchain/parse/node_ids.h). It does not depend on + [parse/typed_nodes.h](/toolchain/parse/typed_nodes.h) to reduce compilation + time in those files that don't use the typed parse node struct types. + + - Defines `Tree::Extract`... functions that take a node id and return a + typed parse node struct type from + [parse/typed_nodes.h](/toolchain/parse/typed_nodes.h). + + - Uses `HasKindMember` to restrict calling `ExtractAs` except on typed + nodes defined in [parse/typed_nodes.h](/toolchain/parse/typed_nodes.h). 
+ + - `Tree::Extract` uses `NodeForId` to get the corresponding typed parse + node struct type for a `FooId` type defined in + [parse/node_ids.h](/toolchain/parse/node_ids.h). + + - Note that this is done without a dependency on the typed parse node + struct types by using the forward declaration of `NodeForId` from + [parse/node_ids.h](/toolchain/parse/node_ids.h). + + - The `Tree::Extract`... functions ultimately call + `Tree::TryExtractNodeFromChildren`, which is a templated function + only declared in this file. Its definition is in + [parse/extract.cpp](/toolchain/parse/extract.cpp). + +- [parse/extract.cpp](/toolchain/parse/extract.cpp) includes + [parse/tree.h](/toolchain/parse/tree.h) and + [parse/typed_nodes.h](/toolchain/parse/typed_nodes.h) + + - Defines struct `Extractable` that defines how to extract a field of + type `T` from a `Tree::SiblingIterator` pointing at the corresponding + child node. + + - `Extractable` is defined for the node id types defined in + [parse/node_ids.h](/toolchain/parse/node_ids.h). + + - In addition, `Extractable` is defined for standard types + `std::optional` and `llvm::SmallVector`, to support optional and + repeated children. + + - Uses [struct reflection](idioms.md#struct-reflection) to support + aggregate struct types containing extractable fields. This is used to + support typed parse node struct types as well as struct fields that they + contain. + + - Uses `HasKindMember` to detect accidental uses of a parse node type + directly as fields of typed parse node struct types -- in those places + `FooId` should be used instead. + + - Defines `Tree::TryExtractNodeFromChildren` and explicitly + instantiates it for every typed parse node struct type defined in + [parse/typed_nodes.h](/toolchain/parse/typed_nodes.h) using + [parse/node_kind.def](/toolchain/parse/node_kind.def) (using + [the .def file idiom](idioms.md#def-files)). 
By explicitly instantiating + this function only in this file, we avoid redundant compilation work, + which reduces build times, and allow us to keep all the extraction + machinery as a private implementation detail of this file. + +- [parse/typed_nodes_test.cpp](/toolchain/parse/typed_nodes_test.cpp) + validates that each typed parse node struct type has a static `Kind` member + that defines the correct corresponding `NodeKind`, and that the `category()` + function agrees between the `NodeKind` and `NodeKind::Definition`. + +Note: this is broadly similar to +[SemIR typed instruction metadata implementation](#semir-typed-instruction-metadata-implementation). + +## Check + +Each parse node kind requires adding a `Handle` function in a +`check/handle_*.cpp` file. + +If the resulting SemIR needs a new instruction: + +- add a new kind to [sem_ir/inst_kind.def](/toolchain/sem_ir/inst_kind.def) + - Add a `CARBON_SEM_IR_INST_KIND(NewInstKindName)` line in alphabetical + order +- a new struct definition to + [sem_ir/typed_insts.h](/toolchain/sem_ir/typed_insts.h), with (italics + highlight what changes): + +Adding an instruction will also require a handler in the Lower step. + +Most new instructions will automatically be formatted reasonably by the SemIR +formatter. + +If the resulting SemIR needs a new built-in, add it to +[builtin_inst_kind.def](/toolchain/sem_ir/builtin_inst_kind.def). + +## SemIR typed instruction metadata implementation + +How does this work? As of +[#3310](https://github.com/carbon-language/carbon-lang/pull/3310): + +![check](check.svg) + +> TODO: Convert this chart to Mermaid. 
+ +- [common/enum_base.h](/common/enum_base.h) defines the `EnumBase` + [CRTP](idioms.md#crtp-or-curiously-recurring-template-pattern) class + extending `Printable` from [common/ostream.h](/common/ostream.h), along with + `CARBON_ENUM` macros for making enumerations + +- [sem_ir/inst_kind.h](/toolchain/sem_ir/inst_kind.h) includes + [common/enum_base.h](/common/enum_base.h) and defines an enumeration + `InstKind`, along with `InstValueKind` and `TerminatorKind`. + + - The `InstKind` enumeration is populated with the list of all instruction + kinds using [sem_ir/inst_kind.def](/toolchain/sem_ir/inst_kind.def) + (using [the .def file idiom](idioms.md#def-files)) _declared_ in this + file using a macro from [common/enum_base.h](/common/enum_base.h) + + - `InstKind` has a member type `InstKind::Definition` that extends + `InstKind` and adds the `ir_name` string field, and a `TerminatorKind` + field. + + - `InstKind` has a method `Define` for creating a `InstKind::Definition` + with the same enumerant value, plus values for the other fields. + +- Note that additional information is needed to define the `ir_name()`, + `value_kind()`, and `terminator_kind()` methods of `InstKind`. This + information comes from the typed instruction definitions in + [sem_ir/typed_insts.h](/toolchain/sem_ir/typed_insts.h). + +- [sem_ir/typed_insts.h](/toolchain/sem_ir/typed_insts.h) defines a typed + instruction struct type for each kind of SemIR instruction, as described + above. + + - Each one defines a static constant named `Kind` that is set using a call + to `Define()` on the corresponding enumerant member of `InstKind` from + [sem_ir/inst_kind.h](/toolchain/sem_ir/inst_kind.h) (which is included + by this file). + +- `HasParseNodeMember` and `HasTypeIdMember` at the + bottom of [sem_ir/typed_insts.h](/toolchain/sem_ir/typed_insts.h) use + [field detection](idioms.md#field-detection) to determine if `TypedInst` has + a `Parse::Node parse_node` or a `TypeId type_id` field respectively. 
+ + - Note: both the type and name of these fields must match exactly. + +- [sem_ir/inst_kind.cpp](/toolchain/sem_ir/inst_kind.cpp) includes both + [sem_ir/inst_kind.h](/toolchain/sem_ir/inst_kind.h) and + [sem_ir/typed_insts.h](/toolchain/sem_ir/typed_insts.h) + + - Uses the macro from [common/enum_base.h](/common/enum_base.h), the + enumerants of `InstKind` are _defined_ using the list of instruction + kinds from [sem_ir/inst_kind.def](/toolchain/sem_ir/inst_kind.def) + (using [the .def file idiom](idioms.md#def-files)) + + - `InstKind::value_kind()` is defined. It has a static table of + `InstValueKind` values indexed by the enum value, populated by applying + `HasTypeIdMember` from + [sem_ir/typed_insts.h](/toolchain/sem_ir/typed_insts.h) to every + instruction kind by using the list from + [sem_ir/inst_kind.def](/toolchain/sem_ir/inst_kind.def). + - `InstKind::definition()` is defined. It has a static table of + `const InstKind::Definition*` indexed by the enum value, populated by + taking the address of the `Kind` member of each `TypedInst`, using the + list from [sem_ir/inst_kind.def](/toolchain/sem_ir/inst_kind.def). + + - `InstKind::ir_name()` and `InstKind::terminator_kind()` are defined + using `InstKind::definition()`. + - Tested assumption: the tables built in this file are indexed by the enum + values. We rely on the fact that we get the instruction kinds in the + same order by consistently using + [sem_ir/inst_kind.def](/toolchain/sem_ir/inst_kind.def). + + - This file will fail to compile unless every kind of SemIR instruction + defined in [sem_ir/inst_kind.def](/toolchain/sem_ir/inst_kind.def) has a + corresponding struct type in + [sem_ir/typed_insts.h](/toolchain/sem_ir/typed_insts.h). + +- `TypedInstArgsInfo` defined in + [sem_ir/inst.h](/toolchain/sem_ir/inst.h) uses + [struct reflection](idioms.md#struct-reflection) to determine the other + fields from `TypedInst`. 
It skips the `parse_node` and `type_id` fields + using `HasParseNodeMember` and `HasTypeIdMember`. + + - Tested assumption: the `parse_node` and `type_id` are the first fields + in `TypedInst`, and there are at most two more fields. + +- [sem_ir/inst.h](/toolchain/sem_ir/inst.h) defines templated conversions + between `Inst` and each of the typed instruction structs: + + - Uses `TypedInstArgsInfo`, `HasParseNodeMember`, + and `HasTypeIdMember`, and + [local lambda](idioms.md#local-lambdas-to-reduce-duplicate-code). + + - Defines a templated `ToRaw` function that converts the various id field + types to an `int32_t`. + - Defines a templated `FromRaw` function that converts an `int32_t` to + `T` to perform the opposite conversion. + - Tested assumption: The `parse_node` field is first, when present, and + the `type_id` is next, when present, in each `TypedInst` struct type. + +- The "tested assumptions" above are all tested by + [sem_ir/typed_insts_test.cpp](/toolchain/sem_ir/typed_insts_test.cpp) + +## Lower + +Each SemIR instruction requires adding a `Handle` function in a +`lower/handle_*.cpp` file. + +## Tests and debugging + +## Running tests + +Tests are run in bulk as `bazel test //toolchain/...`. Many tests are using the +file_test infrastructure; see +[testing/file_test/README.md](/testing/file_test/README.md) for information. + +There are several supported ways to run Carbon on a given test file. For +example, with `toolchain/parse/testdata/basics/empty.carbon`: + +- `bazel test //toolchain/testing:file_test --test_arg=--file_tests=toolchain/parse/testdata/basics/empty.carbon` + - Executes an individual test. +- `bazel run //toolchain/parse:testdata/basics/empty.carbon.run` + - Runs `carbon` on the file with standard arguments, printing output to + console. + - This form will often be most useful when iterating over a specific test. 
+- `bazel run //toolchain/parse:testdata/basics/empty.carbon.verbose` + - Similar to the previous command, but with the `-v` flag implied. +- `bazel run //toolchain/driver:carbon -- compile --phase=parse --dump-parse-tree toolchain/parse/testdata/basics/empty.carbon` + - Explicitly runs `carbon` with the provided arguments. +- `bazel-bin/toolchain/driver/carbon compile --phase=parse --dump-parse-tree toolchain/parse/testdata/basics/empty.carbon` + - Similar to the previous command, but without using `bazel`. + +## Updating tests + +The `toolchain/autoupdate_testdata.py` script can be used to update output. It +invokes the `file_test` autoupdate support. See +[testing/file_test/README.md](/testing/file_test/README.md) for file syntax. + +### Reviewing test deltas + +Using `autoupdate_testdata.py` can be useful to produce deltas during the +development process because it allows `git status` and `git diff` to be used to +examine what changed. + +## Verbose output + +The `-v` flag can be passed to trace state, and should be specified before the +subcommand name: `carbon -v compile ...`. `CARBON_VLOG` is used to print output +in this mode. There is currently no control over the degree of verbosity. + +## Stack traces + +While the iterative processing pattern means function stack traces will have +minimal context for how the current function is reached, we use LLVM's +`PrettyStackTrace` to include details about the state stack. The state stack +will be above the function stack in crash output. 
diff --git a/toolchain/docs/check.md b/toolchain/docs/check.md new file mode 100644 index 0000000000000..8bb292510c816 --- /dev/null +++ b/toolchain/docs/check.md @@ -0,0 +1,608 @@ +# Check + + + + + +## Table of contents + +- [Postorder processing](#postorder-processing) +- [Key IR concepts](#key-ir-concepts) +- [Parameters and arguments](#parameters-and-arguments) +- [SemIR textual format](#semir-textual-format) +- [Raw form](#raw-form) +- [Formatted IR](#formatted-ir) + - [Instructions](#instructions) + - [Top-level entities](#top-level-entities) +- [Core loop](#core-loop) +- [Node stack](#node-stack) +- [Delayed evaluation (not yet implemented)](#delayed-evaluation-not-yet-implemented) +- [Templates (not yet implemented)](#templates-not-yet-implemented) +- [Rewrites](#rewrites) +- [Types](#types) +- [Type printing (not yet implemented)](#type-printing-not-yet-implemented) +- [Expression categories](#expression-categories) + - [ExprCategory::NotExpression](#exprcategorynotexpression) + - [ExprCategory::Value](#exprcategoryvalue) + - [ExprCategory::DurableReference and ExprCategory::EphemeralReference](#exprcategorydurablereference-and-exprcategoryephemeralreference) + - [ExprCategory::Initializing](#exprcategoryinitializing) + - [ExprCategory::Mixed](#exprcategorymixed) +- [Value bindings](#value-bindings) +- [Handling Parse::Tree errors (not yet implemented)](#handling-parsetree-errors-not-yet-implemented) + + + +## Postorder processing + +The checking step is oriented on postorder processing on the `Parse::Tree` to +iterate through the `Parse::NodeImpl` vectorized storage once, in order, as much +as possible. 
This is primarily for performance, but also relies on the +[information accumulation principle](/docs/project/principles/information_accumulation.md): +that is, when that principle applies, we should be able to generate IR +immediately because we can rely on the principle that when a line is processed, +the information necessary to semantically check that line is already available. + +Indirectly, what this really means is that we should be able to go from a +Parse::Tree (which cannot be used for name lookups) to a SemIR with name lookups +completed in a single pass. The SemIR should not need to be re-processed to add +more information outside of templates. By doing this, we avoid an additional +processing pass with associated storage needs. + +This single-pass approach also means that the checking step does not make use of +the tree structure of the `Parse::Tree`. In cases where the actions performed +for a parse tree node depend on the context in which that node appears, a node +that is visited earlier in the postorder traversal, such as a bracketing node, +needs to establish the necessary context. In this respect, the sequence of +`Parse::Node`s can be thought of as a byte code input that the check step +interprets to build the `SemIR`. + +## Key IR concepts + +A `SemIR::Inst` is the basic building block that represents a simple +instruction, such as an operator or declaring a literal. For each kind of +instruction, a typedef for that specific kind of instruction is provided in the +`SemIR` namespace. For example, `SemIR::Assign` represents an assignment +instruction, and `SemIR::PointerType` represents a pointer type instruction. 
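As a rough illustration of a per-kind typed view over a single generic instruction representation, here is a simplified sketch. Everything in it is an assumption made for illustration: the `inst_sketch` namespace, the two-argument `Inst` layout, `MakeInst`, `TryAs`, and the field names of `PointerType` are hypothetical, and the real `SemIR::Inst` carries more information than this.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

namespace inst_sketch {

enum class InstKind : uint8_t { Assign, PointerType };

// Thin wrappers around an index, standing in for the real id types.
struct InstId {
  int32_t index;
};
struct TypeId {
  int32_t index;
};

// Typed instruction structs: each names its kind and its two arguments.
struct Assign {
  static constexpr InstKind Kind = InstKind::Assign;
  InstId lhs_id;
  InstId rhs_id;
};

struct PointerType {
  static constexpr InstKind Kind = InstKind::PointerType;
  TypeId type_id;
  TypeId pointee_id;
};

// Generic instruction: a kind tag plus two raw argument slots.
struct Inst {
  InstKind kind;
  int32_t arg0;
  int32_t arg1;

  // Returns the typed form if the kind matches, and nullopt otherwise.
  template <typename TypedInst>
  auto TryAs() const -> std::optional<TypedInst> {
    if (kind != TypedInst::Kind) {
      return std::nullopt;
    }
    // Aggregate initialization wraps each raw value back into its id type.
    return TypedInst{{arg0}, {arg1}};
  }
};

// Erases a typed instruction into the generic form. The structured binding
// relies on every typed struct having exactly two non-static fields.
template <typename TypedInst>
auto MakeInst(const TypedInst& typed) -> Inst {
  const auto& [first, second] = typed;
  return Inst{TypedInst::Kind, first.index, second.index};
}

}  // namespace inst_sketch
```

With this shape, code that knows which kind it is handling can work with named fields such as `lhs_id`, while storage only ever deals with one fixed-size struct.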
+ +Each instruction class has up to four public data members describing the +instruction, as described in +[sem_ir/typed_insts.h](/toolchain/sem_ir/typed_insts.h) (also see +[adding features for Check](adding_features.md#check)): + +- A `Parse::Node parse_node;` member that tracks its location is present on + almost all instructions, except instructions like `SemIR::Builtin` that + don't have an associated location. + +- A `SemIR::TypeId type_id;` member that describes the type of the instruction + is present on all instructions that produce a value. This includes namespace + instructions, which are modeled as producing a value of "namespace" type, + even though they can't be used as a first-class value in Carbon expressions. + +- Up to two additional, kind-specific members. For example `SemIR::Assign` has + members `InstId lhs_id` and `InstId rhs_id`. + +Instructions are stored as type-erased `SemIR::Inst` objects, which store the +instruction kind and the (up to) four fields described above. This balances the +size of `SemIR::Inst` against the overhead of indirection. + +A `SemIR::InstBlock` can represent a code block. However, it can also be created +when a series of instructions needs to be closely associated, such as a +parameter list. + +A `SemIR::Builtin` represents a language built-in, such as the unconstrained +facet type `type`. We will also have built-in functions which would need to form +the implementation of some library types, such as `i32`. Built-ins are in a +stable index across `SemIR` instances. + +## Parameters and arguments + +Parameters and arguments will be stored as two `SemIR::InstBlock`s each. The +first will contain the full IR, while the second will contain references to the +last instruction for each parameter or argument. The references block will have +a size equal to the number of parameters or arguments, allowing for quick size +comparisons and indexed access. + +## SemIR textual format + +There are two textual ways to view `SemIR`. 
+ +## Raw form + +The raw form of SemIR shows the details of the representation, such as numeric +instruction and block IDs. The representation is intended to very closely match +the `SemIR::File` and `SemIR::Inst` representations. This can be useful when +debugging low-level issues with the `SemIR` representation. + +The driver will print this when passed `--dump-raw-sem-ir`. + +## Formatted IR + +In addition to the raw form, there is a higher-level formatted IR that aims to +be human readable. This is used in most `check` tests to validate the output, +and also expected to be used regularly by toolchain developers to inspect the +result of checking the parse tree. + +The driver will print this when passed `--dump-sem-ir`. + +Unlike the raw form, certain representational choices in the `SemIR` data may +not be visible in this form. However, it is intended to be possible to parse the +`SemIR` output and form an equivalent – but not necessarily identical – `SemIR` +representation, although no such parser currently exists. 
+ +As an example, given the program: + +```cpp +fn Cond() -> bool; +fn Run() -> i32 { return if Cond() then 1 else 2; } +``` + +The formatted IR is currently: + +``` +constants { + %.1: i32 = int_literal 1 [template] + %.2: i32 = int_literal 2 [template] +} + +file { + package: = namespace [template] { + .Cond = %Cond + .Run = %Run + } + %Cond: = fn_decl @Cond [template] { + %return.var.loc1: ref bool = var + } + %Run: = fn_decl @Run [template] { + %return.var.loc2: ref i32 = var + } +} + +fn @Cond() -> bool; + +fn @Run() -> i32 { +!entry: + %Cond.ref: = name_ref Cond, file.%Cond [template = file.%Cond] + %.loc2_33.1: init bool = call %Cond.ref() + %.loc2_26.1: bool = value_of_initializer %.loc2_33.1 + %.loc2_33.2: bool = converted %.loc2_33.1, %.loc2_26.1 + if %.loc2_33.2 br !if.expr.then else br !if.expr.else + +!if.expr.then: + %.loc2_41: i32 = int_literal 1 [template = constants.%.1] + br !if.expr.result(%.loc2_41) + +!if.expr.else: + %.loc2_48: i32 = int_literal 2 [template = constants.%.2] + br !if.expr.result(%.loc2_48) + +!if.expr.result: + %.loc2_26.2: i32 = block_arg !if.expr.result + return %.loc2_26.2 +} +``` + +There are three kinds of names in formatted IR, which are distinguished by their +leading sigils: + +- `%name` denotes a value produced by an instruction. These names are + introduced by a line of the form `%name: = `, + and are scoped to the enclosing top-level entity. `` describes the + [expression category](#expression-categories), which is `init` for an + initializing expression, `ref` for a reference expression, or omitted for a + value expression. Typically, values can only be referenced by instructions + that their introduction + [dominates](), but + some kinds of instruction might have other rules. Names in the `file` block + can be referenced as `file.%`. + +- `!name` denotes a label, and `!name:` appears as a prefix of each + `InstBlock` in a `Function`. 
These names are scoped to their enclosing + function, and can be referenced anywhere in that function, but not outside. + +- `@name` denotes a top-level entity, such as a function, class, or interface. + The SemIR view of these entities is flattened, so member functions are + treated as top-level entities. + +Names in formatted IR are all invented by the formatter, and generally are of +the form `[.loc[_[.]]]` where `` and +`` describe the location of the instruction, and `` is used as a +disambiguator if multiple instructions appear at the same location. Trailing +name components are only included if they are necessary to disambiguate the +name. `` is a guessed good name for the instruction, often derived +from source-level identifiers, and is empty if no guess was made. + +### Instructions + +There is usually one line in a `InstBlock` for each `Inst`. You can find the +documentation for the different kinds of instructions in +[toolchain/sem_ir/typed_insts.h](/toolchain/sem_ir/typed_insts.h). For example, +given a formatted SemIR line like: + +``` +%N: i32 = assoc_const_decl N [template] +``` + +you would look for a `struct` definition that uses `"assoc_const_decl"` as its +`ir_name`. In this case, this is the `AssociatedConstantDecl` instruction: + +```cpp +// An associated constant declaration in an interface, such as `let T:! type;`. +struct AssociatedConstantDecl { + static constexpr auto Kind = + InstKind::AssociatedConstantDecl.Define( + {.ir_name = "assoc_const_decl", .is_lowered = false}); + + TypeId type_id; + NameId name_id; +}; +``` + +Since this instruction produces a value, it has a `TypeId type_id` field, which +corresponds to the type written between the `:` and the `=`. In the example +above, that type is `i32`. The other arguments to the instruction are written +after the `ir_name` -- in this example the `name_id` is `N`. From this we find +that the instruction corresponds to an associated constant declaration in an +interface like `let N:! i32;`. 
+ +Instructions producing a constant value, like `assoc_const_decl` above, are +followed by their phase, either `[symbolic]` or `[template]`, and then `=` the +value if it is the value of a different instruction. + +Instructions that do not produce a value, such as the `br` and `return` +instructions above, omit the leading `%name: ... =` prefix, as they cannot be +named by other instructions. These instructions do not have a `TypeId type_id` +field, like the `AdaptDecl` instruction: + +```cpp +// An adapted type declaration in a class, of the form `adapt T;`. +struct AdaptDecl { + static constexpr auto Kind = InstKind::AdaptDecl.Define( + {.ir_name = "adapt_decl", .is_lowered = false}); + + // No type_id; this is not a value. + TypeId adapted_type_id; +}; +``` + +An `adapt SomeClass;` declaration would have the corresponding SemIR formatted +as: + +``` +adapt_decl %SomeClass +``` + +Some instructions have special argument handling. For example, some invalid +arguments will be omitted. Or an `InstBlockId` argument will be rendered inline, +commonly enclosed in braces `{`...`}` or parens `(`...`)`. In other cases, the +formatter will combine instructions together to make the IR more readable: + +- A terminator sequence in a block, comprising a sequence of `BranchIf` + instructions followed by a `Branch` or `BranchWithArg` instruction, is + collapsed into a single + `if %cond br !label1 else if ... else br !labelN(%arg)` line. +- A struct type, formed by a sequence of `StructTypeField` instructions + followed by a `StructType` instruction, is collapsed into a single + `struct_type{.field1: %value1, ..., .fieldN: %valueN}` line. + +These exceptions may be found in +[toolchain/sem_ir/formatter.cpp](/toolchain/sem_ir/formatter.cpp). + +### Top-level entities + +**Question:** Are these too in flux to document at this time? + +- `constants`: TODO +- `imports`: TODO +- `file`: TODO +- entities + - TODO: may be preceded by `extern`. + - TODO: may be preceded by `generic`. 
+ - These may have an optional `!definition:` section containing the + generic's `definition_block_id`. + - `fn`: TODO; followed by `= "`...`"` for builtins + - `class`: TODO + - `interface`: TODO + - `impl`: TODO +- `specific`: TODO + - body in braces `{`...`}` has a bunch of + `` => ` assignment lines + - The first lines of the body describe the declaration + - If there is a valid definition, there are additional definition + assignments after a `!definition:` line. + +## Core loop + +The core loop is `Check::CheckParseTree`. This loops through the `Parse::Tree` +and calls a `Handle`... function corresponding to the `NodeKind` of each node. +Communication between these functions for different nodes working together is +through the `Context` object defined in +[check/context.h](/toolchain/check/context.h), which stores things in a +collection of stacks. The common pattern is that the children of a node are +processed first. They produce information that is then consumed when processing +the parent node. + +One example of this pattern is expressions. Each subexpression outputs SemIR +instructions to compute the value of that subexpression to the current +instruction block, added to the top of the + +`InstBlockStack` stored in the `Context` object. It leaves an instruction id on +the top of the [node stack](#node-stack) pointing to the instruction that +produces the value of that subexpression. Those are consumed by parent +operations, like an [RPN](https://en.wikipedia.org/wiki/Reverse_Polish_notation) +calculator. 
For example, the expression `1 * 2 + 3` corresponds to this parse +tree: + +```yaml + {kind: 'IntegerLiteral', text: '1'}, + {kind: 'IntegerLiteral', text: '2'}, + {kind: 'InfixOperator', text: '*', subtree_size: 3}, + {kind: 'IntegerLiteral', text: '3'}, +{kind: 'InfixOperator', text: '+', subtree_size: 5}, +``` + +This parse tree is processed by one call to a `Handle` function per node: + +- The first node is an integer literal, so the core loop calls + `HandleIntegerLiteral`. + + - It calls `context::AddInstAndPush` to output a `SemIR::IntegerLiteral` + instruction to the current instruction block, and pushes the parse node + along with the instruction id to the [node stack](#node-stack). + +- The second node is also an integer literal, which outputs a second + instruction and pushes another entry onto the node stack. + +- `HandleInfixOperator` pops the two entries off of the node stack, outputs + any conversion instructions that are needed, and uses + `context::AddInstAndPush` to create and push the instruction id representing + the output of a multiplication instruction. That multiplication instruction + takes the instruction ids it popped off the stack at the beginning as + arguments. + +- Another integer literal instruction is created for `3` and pushed onto the + stack. + +- `HandleInfixOperator` is called again. It pops the two instruction ids off + the stack (the multiplication result and the literal `3`) to use as the + arguments to the addition instruction it creates and pushes. + +In this way, the handle functions coordinate producing their output using the +instruction block stack and node block stack from the context. + +A similar pattern uses bracketing nodes to support parent nodes that can have a +variable number of children.
For example, a `return` statement can produce parse +trees following a few different patterns: + +- `return;` + + ```yaml + {kind: 'ReturnStatementStart', text: 'return'}, + {kind: 'ReturnStatement', text: ';', subtree_size: 2}, + ``` + +- `return x;` + + ```yaml + {kind: 'ReturnStatementStart', text: 'return'}, + {kind: 'NameExpr', text: 'x'}, + {kind: 'ReturnStatement', text: ';', subtree_size: 3}, + ``` + +- `return var;` + + ```yaml + {kind: 'ReturnStatementStart', text: 'return'}, + {kind: 'ReturnVarModifier', text: 'var'}, + {kind: 'ReturnStatement', text: ';', subtree_size: 3}, + ``` + +In all three cases, the introducer node `ReturnStatementStart` pushes an entry +on the [node stack](#node-stack) with just the parse node and no id, called a +_solo parse node_. The handler for the parent `ReturnStatement` node can pop and +process entries from the node stack until it finds that solo parse node from +`ReturnStatementStart` that indicates it is done. + +Another pattern that arises is state is set up by an introducer node, updated by +its siblings, and then consumed by the bracketing parent node. FIXME: example + +## Node stack + +The node stack, defined in [check/node_stack.h](/toolchain/check/node_stack.h), +stores pairs of a `Parse::Node` and an id. The type of the id is determined by +the `NodeKind` of the parse node. It is the default, general-purpose stack used +by `Handle`... functions in the check stage. Using a single stack is beneficial +since it improves locality of reference and reduces allocations. However, +additional stacks are used to ensure we never need to search through the stack +to find data -- we always want to be operating on the top of the stack (or a +fixed offset). + +The node stack contains any state pushed by siblings of the current +`Parse::Node` at the top, and state pushed by siblings of ancestors below. 
The +boundaries between what is a sibling of the current `Parse::Node` versus what is +a sibling of an ancestor are not explicitly determined. Instead, the handler for +the parent node knows how many nodes it must pop from the stack based either on +knowing the fixed number of children for that node kind or popping nodes until +it reaches a bracketing node. The arity or bracketing node kind for each parent +node is documented in [parse/node_kind.def](/toolchain/parse/node_kind.def). + +When each `Parse::Node` is evaluated, the SemIR for it is typically immediately +generated as `SemIR::Inst`s. To help generate the IR to an appropriate context, +scopes have separate `SemIR::InstBlock`s. + +## Delayed evaluation (not yet implemented) + +Sometimes, nodes will need to have delayed evaluation; for example, an inline +definition of a class member function needs to be evaluated after the class is +fully declared. The `SemIR::Inst`s cannot be immediately generated because they +may include name references to the class. We're likely to store a reference to +the relevant `Parse::Node` for each definition for re-evaluation after the class +scope completes. This means that nodes in a definition would be traversed twice, +once while determining that they're inline and without full checking or IR +generation, then again with full checking and IR generation. + +## Templates (not yet implemented) + +Templates need to have partial semantic checking when declared, but can't be +fully implemented before they're instantiated against a specific type. + +We are likely to generate a partial IR for templates, allowing for checking with +the incomplete information in the IR. Instantiation will likely use that IR and +fill in the missing information, but it could also reevaluate the original +`Parse::Node`s with the known template state. 
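
Returning to the core loop and node stack sections above, the RPN-style coordination between handlers can be illustrated with a small self-contained sketch. Everything here (`ToyNode`, `ToyInst`, `CheckToyTree`) is hypothetical and far simpler than the real `Check::Context`, node stack, and `Handle`... functions; it only shows children pushing instruction ids that the parent later pops.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for parse nodes and SemIR instructions.
enum class ToyNodeKind { IntegerLiteral, InfixOperator };

struct ToyNode {
  ToyNodeKind kind;
  std::string text;
};

struct ToyInst {
  std::string op;       // "int_literal", "mul", or "add".
  int32_t lhs_id = -1;  // Instruction id of the left operand, if any.
  int32_t rhs_id = -1;  // Instruction id of the right operand, if any.
  int32_t value = 0;    // Literal value for "int_literal".
};

// Walks the nodes in postorder, like an RPN calculator: literals push an
// instruction id, and operators pop two ids and push the id of a new
// instruction that uses them as arguments.
inline auto CheckToyTree(const std::vector<ToyNode>& nodes)
    -> std::vector<ToyInst> {
  std::vector<ToyInst> inst_block;
  std::vector<int32_t> node_stack;

  // Toy analogue of "add the instruction and push its id".
  auto add_inst_and_push = [&](const ToyInst& inst) {
    inst_block.push_back(inst);
    node_stack.push_back(static_cast<int32_t>(inst_block.size()) - 1);
  };

  for (const ToyNode& node : nodes) {
    switch (node.kind) {
      case ToyNodeKind::IntegerLiteral: {
        ToyInst inst;
        inst.op = "int_literal";
        inst.value = std::stoi(node.text);
        add_inst_and_push(inst);
        break;
      }
      case ToyNodeKind::InfixOperator: {
        assert(node_stack.size() >= 2 && "operator needs two operands");
        // The right operand was pushed last, so it is popped first.
        ToyInst inst;
        inst.rhs_id = node_stack.back();
        node_stack.pop_back();
        inst.lhs_id = node_stack.back();
        node_stack.pop_back();
        inst.op = node.text == "*" ? "mul" : "add";
        add_inst_and_push(inst);
        break;
      }
    }
  }
  return inst_block;
}
```

For the node sequence `1`, `2`, `*`, `3`, `+`, this sketch produces five instructions, with the `add` instruction referring to the `mul` instruction and the literal `3` by id. The handler never searches the stack; it only touches the top entries.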
+ +## Rewrites + +Carbon relies on rewrites of code, such as rewriting the destination of an +initializer to a specific target object once that object is known. + +We have two ways to achieve this. One is to track the IR location of a +placeholder instruction and, if it needs updating, replace it with a "rewrite" +`SemIR::Inst` that points to a new `SemIR::InstBlock` containing the required IR +and specifying which value is the result of that rewrite. This is expressed in +SemIR as a `splice_block` instruction. Another is to track the list of +instructions to be created separately from the node block stack, and merge those +instructions into the current block once we have decided on their contents. + +## Types + +Type expressions are treated like any other expression, and are modeled as +`SemIR::Inst`s. The types computed by type expressions are deduplicated, +resulting in a canonical `SemIR::TypeId` for each distinct type. + +## Type printing (not yet implemented) + +The `TypeId` preserves only the identity of the type, not its spelling, and so +printing it will produce a fully-resolved type name, which isn't a great user +experience as it doesn't reflect how the type was written in the source code. + +Instead, when printing a type name for use in a diagnostic, we will start with +one of two `InstId`s: + +- A `InstId` for a type expression that describes the way the type was + computed. +- A `InstId` for an expression that has the given type. + +In the former case, the type is pretty-printed by walking the type expression +and printing it. In the latter case, the type of the expression is reconstructed +based on the form of the expression: for example, to print the type of `&x`, we +print the type of `x` and append a `*`, being careful to take potential +precedence issues into account. 
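
The second case above, reconstructing a type's spelling from the form of an expression, might look roughly like the sketch below. Since this feature is not yet implemented, `ToyExpr` and `PrintTypeOf` are hypothetical names, only names and address-of are modeled, and the parenthesization rule is a deliberately crude assumption rather than Carbon's actual precedence rules.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical expression forms; a real implementation would walk
// `SemIR::Inst`s instead of this toy tree.
enum class ToyExprKind { Name, AddressOf };

struct ToyExpr {
  ToyExprKind kind;
  // For `Name`: the type as it was spelled in the declaration.
  std::string declared_type;
  // For `AddressOf`: the operand expression.
  std::shared_ptr<ToyExpr> operand;
};

// Rebuilds the type spelling of `expr` from the form of the expression.
inline auto PrintTypeOf(const ToyExpr& expr) -> std::string {
  switch (expr.kind) {
    case ToyExprKind::Name:
      return expr.declared_type;
    case ToyExprKind::AddressOf: {
      assert(expr.operand && "address-of needs an operand");
      std::string inner = PrintTypeOf(*expr.operand);
      // Toy precedence rule (an assumption): any spelling containing a space
      // is treated as a compound expression and parenthesized before `*`.
      if (inner.find(' ') != std::string::npos) {
        inner = "(" + inner + ")";
      }
      return inner + "*";
    }
  }
  return "";  // Not reached for valid kinds.
}
```

Because the spelling is derived on demand from the expression, nothing sugared needs to be stored per instruction, which matches the motivation given for this approach.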
+ +TODO: This requires being able to print the type of, for example, +`x.foo[0].bar`, by printing only the desired portion of the type of `x`, and +similarly may require handling the case where the type of an expression involves +generic parameters whose arguments are specified by that expression. In effect, +the type computation performed when checking an operation is duplicated into the +type printing logic, but is simpler because errors don't need to be detected. + +This approach means we don't need to preserve a fully-sugared type for each +expression instruction. Instead, we compute that type when we need to print it. + +## Expression categories + +Each `SemIR::Inst` that has an associated type also has an expression category, +which describes how it produces a value of that type. These +`SemIR::ExprCategory` values correspond to the Carbon expression categories +defined in proposal +[#2006](https://github.com/carbon-language/carbon-lang/pull/2006): + +### ExprCategory::NotExpression + +This instruction is not an expression instruction, and doesn't have an +expression category. This is used for namespaces, control flow instructions, and +other constructs that represent some non-expression-level semantics. + +### ExprCategory::Value + +This instruction produces a value using the type's value representation. +Lowering the instruction will produce an LLVM value using that value +representation. + +### ExprCategory::DurableReference and ExprCategory::EphemeralReference + +This instruction produces a reference to an object. Lowering will produce a +pointer to an object representation. + +### ExprCategory::Initializing + +This instruction represents the initialization of an object. Depending on the +initializing representation for the type, the initializing expression +instruction will do one of the following: + +- For an in-place initializing representation, the instruction will store a + value to the target of the initialization. 
+ +- For a by-copy initializing representation, the instruction will produce an + object representation by value that can be stored into the target. This is + currently only used in cases where the object representation and the value + representation are the same. + +- For a type with no initializing representation, such as an empty struct or + tuple, it does neither of the above things. + +Regardless of the initializing representation, an initializing expression should +be consumed by another instruction that finishes the initialization. For a +by-copy initialization, this final instruction represents the store into the +target, whereas in the other cases it is only used to track in SemIR how the +initialization was used. When an in-place initializer uses a by-copy initializer +as a subexpression, an `initialize_from` instruction is inserted to perform this +final store. + +### ExprCategory::Mixed + +This instruction represents a language construct that doesn't have a single +expression category. This is used for struct and tuple literals, where the +elements of the literal can have different expression categories. Instructions +with a mixed expression category are treated as a special case in conversion, +which recurses into the elements of those instructions before performing +conversions. + +## Value bindings + +A value binding represents a conversion from a reference expression to the value +stored in that expression. There are three important cases here: + +- For types with a by-copy value representation, such as `i32`, a value + binding represents a load from the address indicated by the reference + expression. + +- For types with a by-pointer value representation, such as arrays and large + structs and tuples, a value binding implicitly takes the address of the + reference expression. 
+ +- For structs and tuples, the value representation is a struct or tuple of the + elements' value representations, which is not necessarily the same as a + struct or tuple of the elements' object representations. In the case where + the value representation is not a copy of, or pointer to, the object + representation, `value_binding` instructions are not used, and a + `tuple_value` or `struct_value` instruction is used to construct a value + representation instead. `value_binding` should still be used in the case + where the value and object representation are the same, but this is not yet + implemented. + +## Handling Parse::Tree errors (not yet implemented) + +`Parse::Tree` errors will typically indicate that checking would error for a +given context. We'll want to be careful about how this is handled, but we'll +likely want to generate diagnostics for valid child nodes, then reduce +diagnostics once invalid nodes are encountered. We should be able to reasonably +abandon generated IR of the valid children when we encounter an invalid parent, +without severe effects on surrounding checks. + +For example, an invalid line of code in a function might generate some +incomplete IR in the function's `SemIR::InstBlock`, but that IR won't negatively +interfere with checking later valid lines in the same function. + +# Lower + +Lowering takes the SemIR and produces LLVM IR. At present, this is done in a +single pass, although it's possible we may need to do a second pass so that we +can first generate type information for function arguments. + +Lowering is done per `SemIR::InstBlock`. This minimizes changes to the +`IRBuilder` insertion point, something that is both expensive and potentially +fragile. 
diff --git a/toolchain/docs/check.svg b/toolchain/docs/check.svg new file mode 100644 index 0000000000000..22d9ef3da9fad --- /dev/null +++ b/toolchain/docs/check.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/toolchain/docs/diagnostics.md b/toolchain/docs/diagnostics.md new file mode 100644 index 0000000000000..e83bb70c23db1 --- /dev/null +++ b/toolchain/docs/diagnostics.md @@ -0,0 +1,222 @@ +# Diagnostics + + + + + +## Table of contents + +- [DiagnosticEmitter](#diagnosticemitter) +- [DiagnosticConsumers](#diagnosticconsumers) +- [Producing diagnostics](#producing-diagnostics) +- [Diagnostic registry](#diagnostic-registry) +- [CARBON_DIAGNOSTIC placement](#carbon_diagnostic-placement) +- [Diagnostic context](#diagnostic-context) +- [Diagnostic parameter types](#diagnostic-parameter-types) +- [Diagnostic message style guide](#diagnostic-message-style-guide) + + + +## DiagnosticEmitter + +[DiagnosticEmitters](/toolchain/diagnostics/diagnostic_emitter.h) handle the +main formatting of a message. It's parameterized on a location type, for which a +DiagnosticLocationTranslator must be provided that can translate the location +type into a standardized DiagnosticLocation of file, line, and column. + +When emitting, the resulting formatted message is passed to a +DiagnosticConsumer. + +## DiagnosticConsumers + +DiagnosticConsumers handle output of diagnostic messages after they've been +formatted by an Emitter. Important consumers are: + +- [ConsoleDiagnosticConsumer](/toolchain/diagnostics/diagnostic_emitter.h): + prints diagnostics to console. + +- [ErrorTrackingDiagnosticConsumer](/toolchain/diagnostics/diagnostic_emitter.h): + counts the number of errors produced, particularly so that it can be + determined whether any errors were encountered. 
+ +- [SortingDiagnosticConsumer](/toolchain/diagnostics/sorting_diagnostic_consumer.h): + sorts diagnostics by line so that diagnostics are seen in terminal based on + their order in the file rather than the order they were produced. + +- [NullDiagnosticConsumer](/toolchain/diagnostics/null_diagnostics.h): + suppresses diagnostics, particularly for tests. + +Note that `SortingDiagnosticConsumer` is used by default by `carbon compile`. In +cases where one error leads to another error at an earlier location, for example +if an error in a function call argument leads to an error in the function call, +this can result in confusing diagnostic output where a consequence of the error +is reported before the cause. Usually this should be handled by tracking that an +error occurred and suppressing the follow-on diagnostic. During toolchain +development, it can be useful to disable the sorting so that the diagnostic +order matches the order in which the file was processed. This can be done using +`carbon compile --stream-errors`. + +## Producing diagnostics + +Diagnostics are used to surface issues from compilation. A simple diagnostic +looks like: + +```cpp +CARBON_DIAGNOSTIC(InvalidCode, Error, "Code is invalid"); +emitter.Emit(location, InvalidCode); +``` + +Here, `CARBON_DIAGNOSTIC` defines a static instance of a diagnostic named +`InvalidCode` with the associated severity (`Error` or `Warning`). + +The `Emit` call produces a single instance of the diagnostic. When emitted, +`"Code is invalid"` will be the message used. The type of `location` depends on +the `DiagnosticEmitter`. + +A diagnostic with an argument looks like: + +```cpp +CARBON_DIAGNOSTIC(InvalidCharacter, Error, "Invalid character {0}.", char); +emitter.Emit(location, InvalidCharacter, invalid_char); +``` + +Here, the additional `char` argument to `CARBON_DIAGNOSTIC` specifies the type +of an argument to expect for message formatting. The `invalid_char` argument to +`Emit` provides the matching value.
It's then passed along with the diagnostic +message format to `llvm::formatv` to produce the final diagnostic message. + +## Diagnostic registry + +There is a [registry](/toolchain/diagnostics/diagnostic_kind.def) which all +diagnostics must be added to. Each diagnostic has a line like: + +```cpp +CARBON_DIAGNOSTIC_KIND(InvalidCode) +``` + +This produces a central enumeration of all diagnostics. The eventual intent is +to require tests for every diagnostic that can be produced, but that isn't +currently implemented. + +## CARBON_DIAGNOSTIC placement + +Idiomatically, `CARBON_DIAGNOSTIC` will be adjacent to the `Emit` call. However, +this is only because many diagnostics can only be produced in one code location. +If they can be produced in multiple locations, they will be at a higher scope so +that multiple `Emit` calls can reference them. When in a function, +`CARBON_DIAGNOSTIC` should be placed as close as possible to the usage so that +it's easier to see the associated output. + +## Diagnostic context + +Diagnostics can provide additional context for errors by attaching notes, which +have their own location information. A diagnostic with a note looks like: + +```cpp +CARBON_DIAGNOSTIC(CallArgCountMismatch, Error, + "{0} argument(s) passed to function expecting " + "{1} argument(s).", + int, int); +CARBON_DIAGNOSTIC(InCallToFunction, Note, + "Calling function declared here."); +context.emitter() + .Build(call_parse_node, CallArgCountMismatch, arg_refs.size(), + param_refs.size()) + .Note(param_parse_node, InCallToFunction) + .Emit(); +``` + +The error and the note are registered as two separate diagnostics, but a single +overall diagnostic object is built and emitted, so that the error and the note +can be treated as a single unit. + +Diagnostic context information can also be registered in a scope, so that all +diagnostics produced in that scope attach a specific note. 
For example: + +```cpp +DiagnosticAnnotationScope annotate_diagnostics( + &context.emitter(), [&](auto& builder) { + CARBON_DIAGNOSTIC( + InCallToFunctionParam, Note, + "Initializing parameter {0} of function declared here.", int); + builder.Note(param_parse_node, InCallToFunctionParam, + diag_param_index + 1); + }); +``` + +This is useful when delegating to another part of Check that may produce many +different kinds of diagnostic. + +## Diagnostic parameter types + +Here are some types you might consider for the parameters to a diagnostic: + +- `llvm::StringLiteral`. Note that we don't use `llvm::StringRef` to avoid + lifetime issues. +- `std::string` +- Carbon types `T` that implement `llvm::format_provider` like: + - `Lex::TokenKind` + - `Lex::NumericLiteral::Radix` + - `Parse::RelativeLocation` +- integer types: `int`, `uint64_t`, `int64_t`, `size_t` +- `char` +- Other + [types supported by llvm::formatv](https://llvm.org/doxygen/FormatVariadic_8h_source.html) + +## Diagnostic message style guide + +In order to provide a consistent experience, Carbon diagnostics should be +written in the following style: + +- Start diagnostics with a capital letter or quoted code, and end them with a + period. + +- Quoted code should be enclosed in backticks, for example: `"{0} is bad."` + +- Phrase diagnostics as bullet points rather than full sentences. Leave out + articles unless they're necessary for clarity. + +- Diagnostics should describe the situation the toolchain observed and the + language rule that was violated, although either can be omitted if it's + clear from the other. For example: + + - "Redeclaration of X." describes the situation and implies that + redeclarations are not permitted. + + - "`self` can only be declared in an implicit parameter list." describes + the language rule and implies that you declared `self` somewhere else. 
+ + - It's OK for a diagnostic to guess at the developer's intent and provide + a hint after explaining the situation and the rule, but not as a + substitute for that. For example, "Add an `as String` cast to format + this integer as a string." is not sufficient as an error message, but + "Cannot add i32 to String. Add an `as String` cast to format this + integer as a string." could be acceptable. + +- TODO: Should diagnostics be atemporal and non-sequential ("multiple + declarations of X", "additional declaration here"), present tense but + sequential ("redeclaration of X", "previous declaration is here"), or + temporal ("redeclaration of X", "previous declaration was here")? We could + try to sidestep difference between the latter two by avoiding verbs with + tense ("previously declared here", "Y declared here", with no is/was). + +- TODO: Word choices: + + - For disallowed constructs, do we say they're not permitted / not allowed + / not valid / not legal / illegal / ill-formed / disallowed? Do we say + "X cannot be Y" or "X may not be Y" or "X must not be Y" or "X shall not + be Y"? + +- TODO: Is structuring diagnostics such that inputs can be parsed without + string parsing important? that is, when is passing strings in as part of the + message templating okay? + +- TODO: When do we put identifiers or expressions in diagnostics, versus + requiring notes pointing at relevant code? Is it only avoided for values, or + only allowed for types? + +- TODO: Lots more things to decide, give examples. 
diff --git a/toolchain/docs/idioms.md b/toolchain/docs/idioms.md new file mode 100644 index 0000000000000..a8d6fb9bbe4b9 --- /dev/null +++ b/toolchain/docs/idioms.md @@ -0,0 +1,424 @@ +# Idioms + + + + + +## Table of contents + +- [Overview](#overview) +- [C++ dialect](#c-dialect) +- [Abbreviations used in the code (AKA Carbon abbreviation decoder ring)](#abbreviations-used-in-the-code-aka-carbon-abbreviation-decoder-ring) +- [`.def` files](#def-files) + - [EnumBase types](#enumbase-types) +- [Index types](#index-types) +- [ValueStore](#valuestore) +- [Template metaprogramming](#template-metaprogramming) + - [Struct reflection](#struct-reflection) + - [Field detection](#field-detection) +- [Local lambdas to reduce duplicate code](#local-lambdas-to-reduce-duplicate-code) +- [Immediately invoked function expressions (IIFE)](#immediately-invoked-function-expressions-iife) +- [Declarations in conditions](#declarations-in-conditions) +- [CRTP or "Curiously recurring template pattern"](#crtp-or-curiously-recurring-template-pattern) +- [Multiple inheritance](#multiple-inheritance) +- [Defining constants usable in constexpr contexts](#defining-constants-usable-in-constexpr-contexts) + + + +## Overview + +The toolchain implementation uses some implementation techniques that may not be +commonly found in typical C++ code. + +## C++ dialect + +The toolchain implementation does not use some C++ features, following +[Google's C++ style guide](https://google.github.io/styleguide/cppguide.html): + +- [Exceptions](https://google.github.io/styleguide/cppguide.html#Exceptions) +- [Virtual base classes](https://google.github.io/styleguide/cppguide.html#Inheritance) +- [RTTI](https://google.github.io/styleguide/cppguide.html#Run-Time_Type_Information__RTTI_) + +## Abbreviations used in the code (AKA Carbon abbreviation decoder ring) + +Note that abbreviations are typically only used in code, not comments (except +when referring to an entity from the code). 
+ +- **Addr**: "address" +- **Arg**: "argument" +- **Decl**: "declaration" +- **Expr**: "expression" + - **SubExpr**: "subexpression" +- **Float**: "floating point" +- **Init**: "initialization" +- **Inst**: "instruction" +- **Int**: "integer" +- **Loc**: "location" +- **Param**: "parameter" +- **Paren**: "parenthesis" +- **Ref**: "reference" + - **Deref**: "dereference" +- **Subst**: "substitute" + +Phrase abbreviations (where we have an abbreviation for a phrase, where we +wouldn't perform all of the abbreviations of those words individually): + +- **InitRepr**: "initializing representation" +- **ObjectRepr**: "object representation" +- **SemIR**: "semantics intermediate representation" +- **ValueRepr**: "value representation" + +## `.def` files + +The Carbon toolchain uses a technique related to +[X-macros](https://en.wikipedia.org/wiki/X_macro) to generate code that operates +over a collection of types, enumerators, or another similar list of names. This +works as follows: + +- A `.def` file is provided that is intended to be repeatedly included by way + of `#include`. +- The user of the `.def` defines a macro, with a name and a form specified by + the `.def` file, for example + `#define CARBON_EACH_WIDGET(Name) Scope::Name,`. +- A `#include` of the `.def` file expands to `CARBON_EACH_WIDGET(Name1)`, + `CARBON_EACH_WIDGET(Name2)`, ... for each widget name, and then `#undef`s + the `CARBON_EACH_WIDGET` macro. + +For example: + +```cpp +enum Widgets { +#define CARBON_EACH_WIDGET(Name) Name, +#include "widgets.def" +}; +``` + +... would expand to an enumeration definition with one enumerator per widget +name. + +### EnumBase types + +Most `.def` files will have a corresponding [EnumBase](/common/enum_base.h) +child class (if `widgets.def` has X-macros, `widgets.h` and `widgets.cpp` have +the `EnumBase` child class). These work similarly to an `enum class`, with the +addition of a `name()` function and `<<` stream operator support.
Many also have +further utility functions for information related to the enum value. + +In code, these types and values can be used directly in a `switch`. They will +convert to an internal _actual_ `enum class` for the `switch`, and receive +corresponding compiler safety checks that all enum values are handled. + +## Index types + +Carbon makes frequent use of +[IndexBase and IdBase](/toolchain/base/index_base.h). The `IndexBase` and +`IdBase` types are small wrappers around `int32_t` to provide a measure of +type-checking when passing around indices to vector-like storage types. The only +difference is that `IndexBase` supports all comparison operators, whereas +`IdBase` only supports equality comparison. + +Variable names will often end in `_id` to indicate that the variable corresponds +to an `IdBase`. This may include the full type, as in `operand_inst_id` being an +`InstId` for an operand. + +A block is an array of ids. These will be indicated with either a `_block` +suffix or pluralization (for example, `param_refs` pluralizing `refs`). + +The `ref` concept in a name means that there is an underlying instruction block, +but only a subset of instructions are present in the `refs` block. For example, +function parameters have a sequence, and also have a `refs` block with one entry +per parameter. The `refs` block allows parameters to be counted and accessed +directly, rather than through vector iteration. + +## ValueStore + +Many of Carbon's data types are stored in a +[ValueStore](/toolchain/base/value_store.h) or related type with similar +semantics (`sem_ir` has [several such classes](/toolchain/base/value_store.h)). +`ValueStore` links an indexing type to a value type with vector-like storage. +The indices typically use `IdBase`. + +`ValueStore`'s APIs follow the shape of simple array access and mutation: + +- `Add` which takes a value and returns the index. +- `Set` which takes a value and index to modify.
+- `Get` takes an index and returns a reference to the value (possibly a + constant reference). +- Other vector-like functionality, including `size` or `Reserve` + +ValueStores should be named after the type they contain. The index type used on +the value store should have a `using ValueType...` which indicates the stored +type. When taking a return of one of these functions, it's common to use `auto` +and rely on the name of the storage type to imply the returned type. + +Some name mirroring examples are: + +- `ints` is a `ValueStore`, which has an index type of `IntId` and a + value type of `llvm::APInt`. + +- `functions` is a `ValueStore`, which has an index type of + `SemIR::FunctionId` and a value type of `SemIR::` `Function`. + +- `strings` is a `ValueStore`, which has an index type of + `StringId`, but for copy-related reasons, uses `llvm::StringRef` for values. + +A fairly complete list of `ValueStore` uses should be available on +[checking's Context class](https://github.com/search?q=repository%3Acarbon-language%2Fcarbon-lang%20path%3Acheck%2Fcontext.h%20symbol%3Aidentifiers&type=code). + +## Template metaprogramming + +FIXME: show example patterns + +- TypedInstArgsInfo from toolchain/sem_ir/inst.h +- templated using +- std::declval +- decltype +- static_assert +- if constexpr +- template specialization, for example `Inst::FromRaw` (maybe also type + traits?) + +### Struct reflection + +The toolchain uses a primitive form of struct reflection to operate generically +over the fields in a typed `SemIR` instruction. This is implemented in +`common/struct_reflection.h`, and the interface to the functionality is +`StructReflection::AsTuple(your_struct)`, which converts the given struct into a +`std::tuple` containing the same fields in the same order. 
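The struct-to-tuple idea can be sketched with structured bindings. This is a minimal illustration rather than the code in `common/struct_reflection.h`: `ExampleInst` and `ThreeFieldsAsTuple` are hypothetical names, and the sketch hard-codes three fields, whereas a general facility has to work for any field count.

```cpp
#include <string>
#include <tuple>

// Hypothetical stand-in for a typed instruction; not a real SemIR type.
struct ExampleInst {
  int type_id;
  int operand_id;
  std::string label;
};

// Hypothetical helper, for illustration only: copies the fields of a
// three-field aggregate into a tuple, preserving their declaration order.
template <typename T>
auto ThreeFieldsAsTuple(const T& value) {
  // Structured bindings name each field of the aggregate in order.
  const auto& [first, second, third] = value;
  return std::make_tuple(first, second, third);
}
```

Once the fields are in a `std::tuple`, generic code can iterate over them with standard tools such as `std::apply` or `std::get`.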
+ +### Field detection + +The presence of specific fields in a struct with a specified type is detected +using the following idiom: + +```cpp +template <typename T, typename U = FieldType> +constexpr bool HasField = false; +template <typename T> +constexpr bool HasField<T, decltype(T::field)> = true; +``` + +This is intended to check the same property as the following concept, which we +can't use because we currently need to compile in C++17 mode: + +```cpp +template <typename T> concept HasField = requires (T x) { + { x.field } -> std::same_as<FieldType>; +}; +``` + +To detect a field with a specific name with a type derived from a specified base +type, use this idiom: + +```cpp +// HasField<T> is true if T has a `U field` field, +// where `U` extends `BaseClass`. +template <typename T, bool Enabled = true> +inline constexpr bool HasField = false; +template <typename T> +inline constexpr bool HasField< + T, bool(std::is_base_of_v<BaseClass, decltype(T::field)>)> = true; +``` + +The equivalent concept is: + +```cpp +template <typename T> concept HasField = requires (T x) { + { x.field } -> std::derived_from<BaseClass>; +}; +``` + +## Local lambdas to reduce duplicate code + +Sometimes code that would be repeated in a function is factored into a local +variable containing a +[lambda](https://en.cppreference.com/w/cpp/language/lambda): + +```cpp +auto common_code = [&](AType param1, AnotherType param2) { + // code that would otherwise be repeated + ... +}; +if (something) { + common_code(...); +} +if (something_else) { + common_code(...); +} +``` + +Compared to defining a new function, this has the advantage of being able to be +declared in context and access the local variables of the enclosing function. + +## Immediately invoked function expressions (IIFE) + +Instead of creating a separate function with its own name that will be called +once to produce the initial value for a variable, the function can be declared +inline and then immediately called.
+ +This can be used for complex initialization, as in: + +```cpp +// variable declaration +static const llvm::ArrayRef<std::byte> entropy_bytes = +// initializer starts with a lambda + []() -> llvm::ArrayRef<std::byte> { + static llvm::SmallVector<std::byte> bytes; + + // a bunch of code + + // return the value to initialize the variable with + return bytes; + +// finish defining the lambda, and then immediately invoke it +}(); +``` + +It can also be used inside a `CARBON_DCHECK` so that computation needed only +for the check is skipped outside debug builds: + +```cpp +CARBON_DCHECK([&] { + // a bunch of code + + // condition that will be tested by CARBON_DCHECK + return complicated && multiple_parts; + +// finish defining the lambda, and then immediately invoke it +}()) << "Complicated things went wrong"; +``` + +See a description of this technique on +[Wikipedia](https://en.wikipedia.org/wiki/Immediately_invoked_function_expression). + +## Declarations in conditions + +The condition part of an `if` statement may contain a declaration with an +initializer followed by a semicolon (`;`) and then the proper boolean condition +expression, as in: + +```cpp +if (auto verify = tree.Verify(); !verify.ok()) { +``` + +The condition can be replaced by a declaration entirely, as in: + +```cpp +if (auto equals = context.ConsumeIf(Lex::TokenKind::Equal)) { +// Equivalent to: +if (auto equals = context.ConsumeIf(Lex::TokenKind::Equal); equals) { +``` + +or + +```cpp +if (auto literal = bound_inst.TryAs<SemIR::IntLiteral>()) { +// Equivalent to: +if (auto literal = bound_inst.TryAs<SemIR::IntLiteral>(); literal) { +``` + +This is a common way of handling a function that returns an optional value.
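Both forms can be seen in a small self-contained sketch using only the standard library. `FindAge` and `AgeOrDefault` are hypothetical helpers invented for this illustration and are not part of the toolchain.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical lookup helper: returns the stored value, or nothing.
auto FindAge(const std::unordered_map<std::string, int>& ages,
             const std::string& name) -> std::optional<int> {
  // Declaration, then a separate boolean condition after the semicolon.
  if (auto it = ages.find(name); it != ages.end()) {
    return it->second;
  }
  return std::nullopt;
}

// Hypothetical caller: the declaration itself is the condition, because
// `std::optional` converts to `bool` in this context.
auto AgeOrDefault(const std::unordered_map<std::string, int>& ages,
                  const std::string& name) -> int {
  if (auto age = FindAge(ages, name)) {
    return *age;
  }
  return -1;
}
```

In both functions the declared variable is scoped to the `if` statement, so it cannot be used accidentally after the check.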
+ +See +[https://en.cppreference.com/w/cpp/language/if](https://en.cppreference.com/w/cpp/language/if) + +## CRTP or "Curiously recurring template pattern" + +[Curiously Recurring Template Pattern - cppreference.com](https://en.cppreference.com/w/cpp/language/crtp) + +[Curiously recurring template pattern - Wikipedia](https://en.wikipedia.org/wiki/Curiously_recurring_template_pattern) + +[Google search](https://www.google.com/search?q=crtp+c%2B%2B) + +Examples: + +- `template ` in [enum_base.h](/common/enum_base.h) +- `template ` in [ostream.h](/common/ostream.h) + +## Multiple inheritance + +We use multiple inheritance to support uses of +[CRTP](#crtp-or-curiously-recurring-template-pattern). + +Example: + +```cpp +struct NameScopeId : public IndexBase, public Printable { +``` + +## Defining constants usable in constexpr contexts + +To declare a constant usable at compile time in `constexpr` contexts as a static +class member, we use this pattern: + +Declaration: + +```cpp +class Foo { + // ... + static const std::array MyTable; + static constexpr auto ComputeMyTable() + -> std::array { ... } +}; +``` + +Definition: + +```cpp +constexpr std::array + Foo::MyTable = Foo::ComputeMyTable(); +``` + +Note the `const` on the declaration does not match the `constexpr` on +definition, and that the definition is outside of the class body. This allows +the initializer to depend on the definition of the class. + +Further note that this only works with static members of classes, not static +variables in functions. + +Due to [a Clang bug](https://github.com/llvm/llvm-project/issues/85461), this +technique does not work in a class template. The following pattern can be used +instead: + +```cpp +template +class Foo { + // ... + template + static constexpr auto MyValueImpl = Self(); + static constexpr const Foo& MyValue = MyValueImpl<>; + // ... 
+}; +``` + +The parameters of the variable template can be chosen to allow reuse of the same +variable template for multiple static data members. + +Examples: + +- `NodeStack::IdKindTable` in + [check/node_stack.h](/toolchain/check/node_stack.h) +- `BuiltinKind::ValidCount` in + [sem_ir/builtin_inst_kind.h](/toolchain/sem_ir/builtin_inst_kind.h) + +A global constant may use a single definition without a separate declaration: + +```cpp +static constexpr std::array IsIdStartByteTable = [] { + std::array table = {}; + // ... + return table; +}(); +``` + +Note this example is using an +[immediately invoked function expression](#immediately-invoked-function-expressions-iife) +to compute the initial value, which is common. + +Examples: + +- [lex/lex.cpp](/toolchain/lex/lex.cpp) diff --git a/toolchain/docs/parse.md b/toolchain/docs/parse.md new file mode 100644 index 0000000000000..beffaea615c53 --- /dev/null +++ b/toolchain/docs/parse.md @@ -0,0 +1,912 @@ +# Parse + + + + + +## Table of contents + +- [Overview](#overview) +- [Parse stack](#parse-stack) +- [Postorder tree](#postorder-tree) +- [Bracketing inside the tree](#bracketing-inside-the-tree) +- [Visual example](#visual-example) +- [Handling invalid parses](#handling-invalid-parses) +- [How is this accomplished?](#how-is-this-accomplished) +- [Introducer](#introducer) +- [Optional modifiers before an introducer](#optional-modifiers-before-an-introducer) +- [Something required in context](#something-required-in-context) +- [Optional clauses](#optional-clauses) + - [Case 1: introducer to optional clause is used as parent node](#case-1-introducer-to-optional-clause-is-used-as-parent-node) + - [Case 2: parent node is required token after optional clause, with different parent node kinds for different options](#case-2-parent-node-is-required-token-after-optional-clause-with-different-parent-node-kinds-for-different-options) + - [Case 3: optional sibling](#case-3-optional-sibling) +- [Operators](#operators) + + + +## 
Overview + +Parsing uses tokens to produce a parse tree that faithfully represents the tree +structure of the source program, interpreted according to the Carbon grammar. No +semantics are associated with the tree structure at this level, and no name +lookup is performed. + +The parse tree's structure corresponds to the grammar of the Carbon language. On +valid input, there will be a 1:1 correspondence between parse tree nodes and +tokens. + +A parse tree is considered _structurally valid_ if all nodes have the number of +children that their node kind requires. On invalid input, nodes may be added +that don't correspond to a token to maintain a structurally valid parse tree. +When a parse tree node is marked as having an error, it will still be +structurally valid, but its children may not match a valid grammar. Code trying +to handle children of erroneous nodes must be prepared to handle atypical +structures, but it may still be helpful for tools such as syntax highlighters or +refactoring tools. + +In general, we favor doing the checking for whether something is allowed _in a +particular context_ in [the check stage](check.md) instead of the parse stage, +unless the context is very local. This is for a few reasons: + +- We anticipate that the parse stage will be used to operate on invalid code + while still preserving as much of the intent of the author as possible, for + example in an IDE or a code formatter. +- To keep as much code out of the parse stage as possible, so it is simple and + fast. +- We are building all the infrastructure to keep track of context in the check + stage. + +These reasons explain what local context is okay: where we already have the +contextual information at hand so there is no performance cost, and we can +output a parse tree that still captures faithfully what the user wrote. +Examples: + +- All declaration modifiers are allowed in any order on any declaration in the + parse stage. 
Diagnosing duplicated modifiers, modifiers that conflict with + other modifiers, or modifiers that can't be used on a particular declaration + is postponed until the check stage. +- Rejecting a keyword after `fn` where a name is expected is done at the parse + stage. + +## Parse stack + +The core parser loop is `Parse::Tree::Parse`. In the loop, it pops the next +state off the stack, and dispatches to the appropriate `Handle` function. + +A typical handler function pops the state first, leaving the stack ready for the +next state. It may add nodes to the parse tree, based on the current code. If it +needs to trigger other states, it will push them onto the stack; because it's a +stack, the _next_ state is always pushed _last_. + +Operator expressions store information about current operator precedence in the +stack as well. While this isn't necessary for most parser states, and could be +stored separately, it's currently together because it has no impact on the size +of a stack entry and is thus more efficient to store in one place. + +## Postorder tree + +The parse tree's storage layout is in postorder. For example, given the code: + +```cpp +fn foo() -> f64 { + return 42; +} +``` + +The node order is (with indentation to indicate nesting): + +```yaml +[ + { kind: 'FileStart', text: '' }, + { kind: 'FunctionIntroducer', text: 'fn' }, + { kind: 'Name', text: 'foo' }, + { kind: 'ParamListStart', text: '(' }, + { kind: 'ParamList', text: ')', subtree_size: 2 }, + { kind: 'Literal', text: 'f64' }, + { kind: 'ReturnType', text: '->', subtree_size: 2 }, + { kind: 'FunctionDefinitionStart', text: '{', subtree_size: 7 }, + { kind: 'ReturnStatementStart', text: 'return' }, + { kind: 'Literal', text: '42' }, + { kind: 'ReturnStatement', text: ';', subtree_size: 3 }, + { kind: 'FunctionDefinition', text: '}', subtree_size: 11 }, + { kind: 'FileEnd', text: '' }, +] +``` + +In this example, `FileStart`, `FunctionDefinition`, and `FileEnd` are "root" +nodes for the tree. 
Function components are children of `FunctionDefinition`. + +It's produced in this way because it's an efficient layout to produce with +vectorized storage, requiring little context to be maintained during parsing. +Because it's stored in postorder, it's also most efficient to process the parsed +output in postorder; this affects checking. + +The parse tree is printed in postorder by default because it matches how the +parse tree is expected to be processed within the toolchain, and so can make it +easier to reason about. However, the `--preorder` flag may be used in contexts +where a preorder representation would be easier to handle. + +## Bracketing inside the tree + +The parse tree is designed to be walked in postorder by checking, allowing +checking to be more efficient. To support this, checking sometimes requires +context on the meaning of a node when it is encountered. + +Each `ParseNodeKind` has either a bracketing node, or a specific child count. +This helps document and enforce the expected tree structure. + +When a bracketing node is indicated, it is the opening bracket: it will always +be the first child of the parent, and that will be the only time it occurs in +the parent's children (it may still occur in children of children). When +checking encounters the opening bracket, this means it can make contextual +decisions for the later children of the node. + +Nodes can also have a specific child count; for example, infix operators always +have two children: the lhs and rhs expressions. Many nodes have a child count of +0; this just means they're leaf nodes, and will never have children. + +Because the tree structure is always valid, these are treated as contracts. Some +nodes exist only to be used to construct valid tree structures for invalid +input, such as `StructFieldUnknown`.
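As a rough illustration of how these two contracts let a consumer recover parent/child structure from a flat postorder list, consider the following sketch. It is not the toolchain's code: `ToyContract`, `ToyContractFor`, `CountDirectChildren`, and the node kind names used here are all assumptions made for the example.

```cpp
#include <string>
#include <vector>

// Hypothetical contract for a node kind: either it is bracketed by another
// kind, or it has a fixed child count.
struct ToyContract {
  std::string bracketed_by;  // Empty when the kind uses a fixed child count.
  int child_count = 0;
};

// Hypothetical contract table covering only the kinds in this example.
auto ToyContractFor(const std::string& kind) -> ToyContract {
  if (kind == "VariableDecl") {
    return {"VariableIntroducer", 0};
  }
  if (kind == "BindingPattern" || kind == "InfixOperatorPlus") {
    return {"", 2};
  }
  return {"", 0};  // Leaf node.
}

// Walks node kinds in postorder and returns, for each node, how many direct
// children it claimed. `pending` holds roots of completed subtrees that have
// not yet been claimed by a parent.
auto CountDirectChildren(const std::vector<std::string>& postorder)
    -> std::vector<int> {
  std::vector<int> counts;
  std::vector<std::string> pending;
  for (const std::string& kind : postorder) {
    ToyContract contract = ToyContractFor(kind);
    int children = 0;
    if (!contract.bracketed_by.empty()) {
      // Bracketed: claim subtrees back to, and including, the opening bracket.
      while (!pending.empty()) {
        bool is_bracket = pending.back() == contract.bracketed_by;
        pending.pop_back();
        ++children;
        if (is_bracket) {
          break;
        }
      }
    } else {
      // Fixed count: claim exactly that many completed subtrees.
      for (; children < contract.child_count && !pending.empty(); ++children) {
        pending.pop_back();
      }
    }
    counts.push_back(children);
    pending.push_back(kind);
  }
  return counts;
}
```

Because each completed subtree collapses to a single pending entry, a bracket kind nested inside a child subtree is never mistaken for the parent's own opening bracket.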
+ +Although each subtree's size is also tracked as part of the node, we're +currently trying to avoid relying on it and may eliminate it if it turns out to +be unnecessary and a meaningful cost for the compiler. + +## Visual example + +To try to explain the transition from code to Parse Tree, consider the +statement: + +```carbon +var x: i32 = y + 1; +``` + +Lexing creates distinct tokens for each syntactic element, which will form the +basis of the parse tree: + + + +```mermaid +flowchart BT + subgraph tokens["Tokens"] + token1[var] + token2[x] + token3[:] + token4[i32] + token5[=] + token6[y] + token7[+] + token8[1] + token9[;] + end +``` + +First the `var` keyword is used as a "bracketing" node (VariableIntroducer). +When this is seen in a postorder traversal, it tells us to expect the basics of +a variable declaration structure. + +```mermaid +flowchart BT + subgraph tokens["Remaining tokens"] + token1[var]:::used + token2[x] + token3[:] + token4[i32] + token5[=] + token6[y] + token7[+] + token8[1] + token9[;] + end + + classDef used visibility:hidden +``` + +```mermaid +flowchart BT + root:::hidden + subgraph nodes["Parsed nodes"] + direction BT + node1[var]:::moved + end + + classDef hidden visibility:hidden,display:none + classDef moved fill:#0F0,color:#000 + + node1 ~~~ root +``` + +Next, we can consider the pattern binding. Here, `x` is the identifier and `i32` +is the type expression. The `:` provides a parent node that must always contain +two children, the name and type expression. Because it always has two direct +children, it doesn't need to be bracketed. 
+ +```mermaid +flowchart BT + subgraph tokens["Remaining tokens"] + token1[var]:::used + token2[x]:::used + token3[:]:::used + token4[i32]:::used + token5[=] + token6[y] + token7[+] + token8[1] + token9[;] + end + + classDef used visibility:hidden +``` + +```mermaid +flowchart BT + root:::hidden + subgraph nodes["Parsed nodes"] + direction BT + node1[var] + node2[x]:::moved + node3[:]:::moved + node4[i32]:::moved + end + + classDef hidden visibility:hidden,display:none + classDef moved fill:#0F0,color:#000 + + node1 ~~~~ root + node3 --- node2 & node4 + node3 ~~~ root +``` + +We use the `=` as a separator (instead of a node with children like `:`) to help +indicate the transition from binding to assignment expression, which is +important for expression parsing during checking. + +```mermaid +flowchart BT + subgraph tokens["Remaining tokens"] + token1[var]:::used + token2[x]:::used + token3[:]:::used + token4[i32]:::used + token5[=]:::used + token6[y] + token7[+] + token8[1] + token9[;] + end + + classDef used visibility:hidden +``` + +```mermaid +flowchart BT + root:::hidden + subgraph nodes["Parsed nodes"] + direction BT + node1[var] + node2[x] + node3[:] + node4[i32] + node5[=]:::moved + end + + classDef hidden visibility:hidden,display:none + classDef moved fill:#0F0,color:#000 + + node1 ~~~~ root + node3 --- node2 & node4 + node3 ~~~ root + node5 ~~~~ root +``` + +The expression is a subtree with `+` as the parent, and the two operands as +child nodes. 
+ +```mermaid +flowchart BT + subgraph tokens["Remaining tokens"] + token1[var]:::used + token2[x]:::used + token3[:]:::used + token4[i32]:::used + token5[=]:::used + token6[y]:::used + token7[+]:::used + token8[1]:::used + token9[;] + end + + classDef used visibility:hidden +``` + +```mermaid +flowchart BT + root:::hidden + subgraph nodes["Parsed nodes"] + direction BT + node1[var] + node2[x] + node3[:] + node4[i32] + node5[=] + node6[y]:::moved + node7[+]:::moved + node8[1]:::moved + end + + classDef hidden visibility:hidden,display:none + classDef moved fill:#0F0,color:#000 + + node1 ~~~~ root + node3 --- node2 & node4 + node3 ~~~ root + node5 ~~~~ root + node7 --- node6 & node8 + node7 ~~~ root +``` + +Finally, the `;` is used as the "root" of the variable declaration. It's +explicitly tracked as the `;` for a variable declaration so that it's +unambiguously bracketed by `var`. + +```mermaid +flowchart BT + root:::hidden + subgraph nodes["Parsed nodes"] + direction BT + node1[var] + node2[x] + node3[:] + node4[i32] + node5[=] + node6[y] + node7[+] + node8[1] + node9[;]:::moved + end + + classDef hidden visibility:hidden,display:none + classDef moved fill:#0F0,color:#000 + + node1 ~~~~ root + node3 --- node2 & node4 + node3 ~~~ root + node5 ~~~~ root + node7 --- node6 & node8 + node7 ~~~ root + node9 --- node1 & node3 & node5 & node7 + node9 ~~~ root +``` + +Thus we have the parse tree: + +```mermaid +flowchart BT + root:::hidden + subgraph nodes["Parsed nodes"] + direction BT + node1[var] + node2[x] + node3[:] + node4[i32] + node5[=] + node6[y] + node7[+] + node8[1] + node9[;] + end + + classDef hidden visibility:hidden,display:none + + node1 ~~~~ root + node3 --- node2 & node4 + node3 ~~~ root + node5 ~~~~ root + node7 --- node6 & node8 + node7 ~~~ root + node9 --- node1 & node3 & node5 & node7 + node9 ~~~ root +``` + +In storage, this tree will be flat and in postorder. 
Because the order hasn't +changed much from the original code, we can do the reordering for postorder with +a minimal number of nodes being delayed for later output: it will be linear with +respect to the depth of the parse tree. + +```mermaid +flowchart BT + subgraph tokens["Tokens"] + token1[var] + token2[x] + token3[:] + token4[i32] + token5[=] + token6[y] + token7[+] + token8[1] + token9[;] + end +``` + +```mermaid +flowchart BT + root:::hidden + subgraph nodes["Parsed nodes"] + direction BT + node1[var] + node2[x] + node3[:] + node4[i32] + node5[=] + node6[y] + node7[+] + node8[1] + node9[;] + end + + classDef hidden visibility:hidden,display:none + + node1 ~~~~ root + node3 --- node2 & node4 + node3 ~~~ root + node5 ~~~~ root + node7 --- node6 & node8 + node7 ~~~ root + node9 --- node1 & node3 & node5 & node7 + node9 ~~~ root +``` + +```mermaid +flowchart BT + subgraph storage["Storage"] + storage1[var] + storage2[x] + storage4[i32]:::moved + storage3[:]:::moved + storage5[=] + storage6[y] + storage8[1]:::moved + storage7[+]:::moved + storage9[;] + end + + classDef moved fill:#0F0,color:#000 +``` + +The structural concepts of bracketing nodes (`var` and `;`) and parent nodes +with a known child count (`:` and `+` with 2 children, but also `=` with 0 +children) will allow checking to reconstruct the tree as it encounters nodes +during the postorder. + +There are other structures that could have been used here, such as `=` being +parent of the `var` and pattern nodes, and `;` being the parent of the `=` and +assignment expression nodes. In that example alternative, the storage order +would be the same; it would only change the tree representation. The current +structure is influenced by choices in checking. + +## Handling invalid parses + +On an invalid parse, the output tree should still try to mirror the intended +tree structure when possible. 
There's a balance here, and it's not expected to +try too hard to make things correct, but outputting nodes is preferred. There +are `InvalidParse` nodes which may be used to provide a node when the planned +node kind is too difficult to get correct child counts (bracketed subtrees may +not need an `InvalidParse` node). + +When marking a child node with `has_error=true`, parent nodes may also be marked +with `has_error=true`, but try to be conservative about this. As a rule of +thumb, if checking could continue on a parent node without needing the child +node to be fully checked (possibly with incomplete information), then the parent +node should not be marked as `has_error=true`. The goal remains providing +something similar to a well-formed parse tree. + +In general, a parent node must have the immediate children described in +[parse/typed_nodes.h](/toolchain/parse/typed_nodes.h), unless it is marked +`has_error=true`. If this is violated for a particular parse tree, an error will +be raised in `Tree::Verify`. Note that an `InvalidParse` node is allowed as a +declaration or expression, and an `InvalidParseSubtree` is allowed as a +declaration. These invalid nodes can be added to more node categories as needed. + +Child states may indicate an error to their parent using `ReturnErrorOnState`. +This is particularly intended for when a child state emits a diagnostic, to +prevent the parent state from emitting redundant diagnostics; for example, an +invalid expression might have more invalid tokens following it, and the parent +might skip those without emitting diagnostics. + +## How is this accomplished? + +The specific approach to producing the desired tree depends on the kind of +grammar rule being implemented, as well as the desired output tree structure. + +## Introducer + +**Example:** `if (c) { ... }` + +Here `if` is the introducer. 
Many other possible introducers could occur in that +position, such as `while` or `var`, and we want to dispatch based on which token +is present. See +[parse/handle_statement.cpp](/toolchain/parse/handle_statement.cpp). + +The first step is to identify the introducer token, typically using a `switch` +or `if` on the `Lex::TokenKind` at the current position: + +```cpp +switch (context.PositionKind()) { + case Lex::TokenKind::___: { + ... + break; + } + ... +} +``` + +There should be a `default:` (or `else`) case so every kind of token is handled. +This may be an error, in which case: + +- A [diagnostic](diagnostics.md) should be emitted. + +- An invalid parse node should be added, using something like: + + ```cpp + context.AddLeafNode(NodeKind::InvalidParse, context.Consume(), + /*has_error=*/true); + ``` + +- At least one node should be consumed, particularly if it will continue with + this state at this position, to avoid an infinite loop. + +The default case may also be delegated to another state. For example, in the +state where a statement is expected, if no keyword introducer is recognized, it +switches to the expression-statement state. + +Depending on the introducer, different actions can be taken. The most common +case is to: + +- Call `context.PushState(State::___);` to mark the beginning of the statement + or declaration and indicate the state that will handle the tokens after the + introducer. + +- Call `context.AddLeafNode(NodeKind::___, context.Consume());` to output a + bracketing node for this introducer. + +The next state can then add sibling nodes until it gets to the end of the +declaration or statement. The last token, often a semicolon `;`, is used as a +parent node to match the bracketing node of the introducer. + +If the introducer token won't be used as a bracketing node, it can be +temporarily skipped after `context.PushState` by calling +`context.ConsumeAndDiscard()` instead of `context.AddLeafNode`. 
It must be added +to the output tree as a node by some later state, unless an error occurs. For +example, a `for` statement uses the `for` token as the root of the tree -- it +doesn't need a bracketing node since it has a fixed child count. Note that the +token was saved when the state was pushed, and can be retrieved when adding a +node as in this example: + +```cpp +auto state = context.PopState(); +context.AddNode(NodeKind::ForStatement, state.token, state.subtree_start, + state.has_error); +``` + +If this state is for an element of a scope like the statements in a code block, +most introducer tokens indicate that the current state should be repeated, to +handle the next statement, but some other token, like a close curly brace (`}`) +means that the state should be exited. + +## Optional modifiers before an introducer + +**Example:** `virtual fn Foo();` + +Here `fn` is the introducer, and `virtual` is an optional modifier that appears +before. See +[parse/handle_decl_scope_loop.cpp](/toolchain/parse/handle_decl_scope_loop.cpp). + +Use this pattern when the goal is to produce a subtree that starts with the +introducer as a bracketing node, as in the previous case, followed by nodes for +any modifiers. Note that bracketing is needed here, since the optional modifier +nodes mean that there is not a fixed child count for the parent node. This means +shuffling the introducer node before an unknown number of modifier nodes. This +is accomplished by emitting a placeholder node for the introducer, processing +all the modifiers until reaching the introducer, filling in the placeholder with +the information about the introducer, and then finishing the rest of the +declaration or statement. + +- **Step 1**: Save the current value of `context.tree().size()`. 
This could be + accomplished by calling `context.PushState()`, which saves that value in the + `subtree_start` field of `Context::StateStackEntry`; or by constructing a + `Context::StateStackEntry` value directly, as is done in + [parse/handle_decl_scope_loop.cpp](/toolchain/parse/handle_decl_scope_loop.cpp). + This marks the position of the placeholder node we are going to replace, as + well as the beginning of the subtree we are eventually going to emit for + this declaration or statement. + +- **Step 2**: Emit the placeholder node using + `context.AddLeafNode(NodeKind::Placeholder, *context.position());`. The + `NodeKind` and `Lex::TokenIndex` values will be overwritten later. + +- **Step 3**: Process tokens until we hit the introducer. All of the nodes we + emit at this point will appear as siblings after the introducer token in the + output tree. + +- **Step 4 - success**: If an introducer token is found, replace the + placeholder node using something like: + + ```cpp + context.ReplacePlaceholderNode(state.subtree_start, introducer_kind, + context.Consume()); + ``` + + - `state.subtree_start` is the value of `context.tree().size()` saved in + step 1, which marks the position of the placeholder node in the output + parse tree. + + - `introducer_kind` is the `NodeKind` for the introducer of this + declaration or statement, a leaf node that will act as a bracketing node + at the beginning of the subtree for this declaration or statement + +- **Step 4 - error**: If we run into something other than a modifier or + introducer before finding an introducer, we need to do error handling: + + ```cpp + context.ReplacePlaceholderNode(subtree_start, NodeKind::InvalidParseStart, + *context.position(), /*has_error=*/true); + ``` + + - Emit a [diagnostic](diagnostics.md). + + - Replace the placeholder node (similar to step 4) with an + `InvalidParseStart` node. It will be associated with the unexpected + token that triggered this error. 
+ + - Consume input tokens up to the likely end of the current + statement or declaration. For example, we might consume up to a `;` or a + token at a lesser indent level using `context.SkipPastLikelyEnd(...)`. + It is important that we consume at least one token in the error case; + otherwise we could have an infinite loop of generating the same error on + the same token. + + - Emit an `InvalidParseSubtree` node. This will be the parent of any + emitted modifier nodes, and will be bracketed by the `InvalidParseStart` + node emitted above. It should be associated with the last token + consumed. + + ```cpp + // Set `iter` to the last token consumed, one before the current position. + auto iter = context.position(); + --iter; + context.AddNode(NodeKind::InvalidParseSubtree, *iter, subtree_start, + /*has_error=*/true); + ``` + +- **Step 5**: (If success at step 4) Push whatever states are to be used to + parse the rest of the declaration. The first state pushed (the last state to + be processed) will handle the end of this declaration. That pushed state + should have a `subtree_start` field set to the value of + `context.tree().size()` saved in step 1. + +- **Step 6**: When handling the state for the end of the declaration, emit the + root node of the subtree: + + ```cpp + state = context.PopState(); + context.AddNode(NodeKind::___, context.Consume(), + state.subtree_start, state.has_error); + ``` + + - This `state.subtree_start` will mark everything since the bracketing + introducer node as the children of this node.
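The steps above can be sketched end to end with toy types. This is a simplified model under stated assumptions, not the toolchain's implementation: `ToyNode`, `ParseToyDecl`, the pre-split string tokens, and the two recognized modifiers are all hypothetical, and there is no state stack or diagnostics.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a parse node; all names in this sketch are hypothetical.
struct ToyNode {
  std::string kind;
  std::string text;
  int subtree_size = 1;
};

// Parses `modifier* fn name ;` from pre-split tokens into a postorder list.
auto ParseToyDecl(const std::vector<std::string>& tokens)
    -> std::vector<ToyNode> {
  std::vector<ToyNode> tree;
  std::size_t pos = 0;

  // Step 1: remember where this declaration's subtree starts.
  const std::size_t subtree_start = tree.size();
  // Step 2: emit a placeholder that will be overwritten later.
  tree.push_back({"Placeholder", ""});
  // Step 3: modifiers become siblings that follow the placeholder.
  while (pos < tokens.size() &&
         (tokens[pos] == "virtual" || tokens[pos] == "private")) {
    tree.push_back({"Modifier", tokens[pos]});
    ++pos;
  }

  if (pos >= tokens.size() || tokens[pos] != "fn") {
    // Step 4, error: the placeholder becomes the bracket of an invalid subtree.
    std::string last = pos < tokens.size() ? tokens[pos] : std::string();
    tree[subtree_start] = {"InvalidParseStart", last};
    // Skip past the likely end, consuming at least one token when available.
    while (pos < tokens.size()) {
      last = tokens[pos];
      ++pos;
      if (last == ";") {
        break;
      }
    }
    tree.push_back({"InvalidParseSubtree", last,
                    static_cast<int>(tree.size() - subtree_start) + 1});
    return tree;
  }

  // Step 4, success: fill in the placeholder with the introducer.
  tree[subtree_start] = {"FunctionIntroducer", tokens[pos]};
  ++pos;

  // Step 5: parse the rest of the declaration (here, just a name).
  if (pos < tokens.size() && tokens[pos] != ";") {
    tree.push_back({"Name", tokens[pos]});
    ++pos;
  }

  // Step 6: the final token becomes the parent of everything since
  // `subtree_start`.
  std::string end = pos < tokens.size() ? tokens[pos] : std::string();
  tree.push_back({"FunctionDecl", end,
                  static_cast<int>(tree.size() - subtree_start) + 1});
  return tree;
}
```

For the tokens `virtual`, `fn`, `Foo`, `;`, the introducer node ends up before the modifier node even though the modifier was seen first, which is the reordering the placeholder exists to achieve.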
+ +## Something required in context + +FIXME + +Example: name after introducer +[parse/handle_decl_name_and_params.cpp](/toolchain/parse/handle_decl_name_and_params.cpp) + +Example: `[]` after `impl forall` +[parse/handle_impl.cpp](/toolchain/parse/handle_impl.cpp) + +## Optional clauses + +### Case 1: introducer to optional clause is used as parent node + +**Example:** The optional `-> ` in a function signature +uses this pattern, so `fn foo() -> u32;` is transformed to: + +```yaml + {kind: 'FunctionIntroducer', text: 'fn'}, + {kind: 'IdentifierName', text: 'foo'}, + {kind: 'TuplePatternStart', text: '('}, + {kind: 'TuplePattern', text: ')', subtree_size: 2}, + {kind: 'UnsignedIntTypeLiteral', text: 'u32'}, + {kind: 'ReturnType', text: '->', subtree_size: 2}, +{kind: 'FunctionDecl', text: ';', subtree_size: 7}, +``` + +Note how the `->` token becomes a `ReturnType` node in the output tree, and is +moved after the `u32` type expression that becomes its child. Compare with the +parse tree output for `fn foo();` which has no `ReturnType` node: + +```yaml + {kind: 'FunctionIntroducer', text: 'fn'}, + {kind: 'IdentifierName', text: 'foo'}, + {kind: 'TuplePatternStart', text: '('}, + {kind: 'TuplePattern', text: ')', subtree_size: 2}, +{kind: 'FunctionDecl', text: ';', subtree_size: 5}, +``` + +Here is the code from +[parse/handle_function.cpp](/toolchain/parse/handle_function.cpp) that does +this: + +```cpp +auto HandleFunctionAfterParams(Context& context) -> void { + ... + // If there is a return type, parse the expression before adding the return + // type node. 
+ if (context.PositionIs(Lex::TokenKind::MinusGreater)) { + context.PushState(State::FunctionReturnTypeFinish); + context.ConsumeAndDiscard(); + context.PushStateForExpr(PrecedenceGroup::ForType()); + } +} + +auto HandleFunctionReturnTypeFinish(Context& context) -> void { + auto state = context.PopState(); + + context.AddNode(NodeKind::ReturnType, state.token, state.subtree_start, + state.has_error); +} +``` + +The `->` token is saved by `context.PushState(`...`)`, so it is available as +`state.token` when calling +`context.AddNode(NodeKind::ReturnType, state.token,`...`)` later in +`HandleFunctionReturnTypeFinish`. + +Also see how the optional initializer is handled on `var`, treating the `=` as +its introducer in `HandleVarAfterPattern` and `HandleVarInitializer` in +[parse/handle_var.cpp](/toolchain/parse/handle_var.cpp). + +### Case 2: parent node is required token after optional clause, with different parent node kinds for different options + +**Example:** The optional type expression before `as` in `impl as` is +represented by producing two different output parse nodes for `as`. It outputs a +`DefaultSelfImplAs` node with no children when the type expression is absent, +and otherwise a `TypeImplAs` parse node with the type expression as its child. 
+ +So `impl bool as Interface;` is transformed to: + +```yaml + {kind: 'ImplIntroducer', text: 'impl'}, + {kind: 'BoolTypeLiteral', text: 'bool'}, + {kind: 'TypeImplAs', text: 'as', subtree_size: 2}, + {kind: 'IdentifierNameExpr', text: 'Interface'}, +{kind: 'ImplDecl', text: ';', subtree_size: 5}, +``` + +while `impl as Interface;` is transformed to: + +```yaml + {kind: 'ImplIntroducer', text: 'impl'}, + {kind: 'DefaultSelfImplAs', text: 'as'}, + {kind: 'IdentifierNameExpr', text: 'Interface'}, +{kind: 'ImplDecl', text: ';', subtree_size: 4}, +``` + +This is handled by the `ExpectAsOrTypeExpression` code from +[parse/handle_impl.cpp](/toolchain/parse/handle_impl.cpp): + +```cpp +if (context.PositionIs(Lex::TokenKind::As)) { + // as ... + context.AddLeafNode(NodeKind::DefaultSelfImplAs, context.Consume()); + context.PushState(State::Expr); +} else { + // as ... + context.PushState(State::ImplBeforeAs); + context.PushStateForExpr(PrecedenceGroup::ForImplAs()); +} +``` + +and then `HandleImplBeforeAs` creates the parent node in the second case: + +```cpp +auto state = context.PopState(); +if (auto as = context.ConsumeIf(Lex::TokenKind::As)) { + context.AddNode(NodeKind::TypeImplAs, *as, state.subtree_start, + state.has_error); + context.PushState(State::Expr); +} else { + if (!state.has_error) { + CARBON_DIAGNOSTIC(ImplExpectedAs, Error, + "Expected `as` in `impl` declaration."); + context.emitter().Emit(*context.position(), ImplExpectedAs); + } + context.ReturnErrorOnState(); +} +``` + +Note (1) that the `state.subtree_start` value comes from the +`context.PushState(State::ImplBeforeAs);` before parsing the type expression, +and that is how that type expression ends up as the child of the created +`TypeImplAs` node. Unlike +[the previous case 1](#case-1-introducer-to-optional-clause-is-used-as-parent-node), +though, the parent node uses the token after the optional expression, rather +than an introducer token for the optional clause. 
+ + Note (2) how `HandleImplBeforeAs` handles three cases of errors: + + - `as` present but an error in the child type expression -> error on the + output `TypeImplAs` node, but not propagated to the parent. + - No `as` present but the type expression was okay -> emit a new + diagnostic. + - There was an error from the child type expression and no `as` present -> no new + diagnostic; we suppress errors once one is emitted until we can recover. + + If there is no `as` token, we don't output either a `TypeImplAs` or a + `DefaultSelfImplAs` node, as required by the parent node, so in those cases we + mark the parent as having an error. + + ### Case 3: optional sibling + + > TODO: This was changed by + > [#3678](https://github.com/carbon-language/carbon-lang/pull/3678) and needs to + > be updated. + + **Example:** The optional type expression before `as` in `impl as` is output as + an optional sibling subtree between the `ImplIntroducer` node for the `impl` + introducer and the `ImplAs` node for the required `as` keyword. + + `impl bool as Interface;` is transformed to: + + ```yaml + {kind: 'ImplIntroducer', text: 'impl'}, + {kind: 'BoolTypeLiteral', text: 'bool'}, + {kind: 'ImplAs', text: 'as'}, + {kind: 'IdentifierNameExpr', text: 'Interface'}, +{kind: 'ImplDecl', text: ';', subtree_size: 5}, + ``` + + while `impl as Interface;` is transformed to: + + ```yaml + {kind: 'ImplIntroducer', text: 'impl'}, + {kind: 'ImplAs', text: 'as'}, + {kind: 'IdentifierNameExpr', text: 'Interface'}, +{kind: 'ImplDecl', text: ';', subtree_size: 4}, + ``` + + This is handled by the `ExpectAsOrTypeExpression` code from + [parse/handle_impl.cpp](/toolchain/parse/handle_impl.cpp): + + ```cpp + if (context.PositionIs(Lex::TokenKind::As)) { + // as ... + context.AddLeafNode(NodeKind::ImplAs, context.Consume()); + context.PushState(State::Expr); + } else { + // as ... 
+ context.PushState(State::ImplBeforeAs); + context.PushStateForExpr(PrecedenceGroup::ForImplAs()); +} +``` + +and then `HandleImplBeforeAs` follows +[the "something required in context" pattern](#something-required-in-context) to +deal with the `as` that follows when the type expression is present. + +## Operators + +FIXME + +An independent description of our approach: +["Better operator precedence" on scattered-thoughts.net](https://www.scattered-thoughts.net/writing/better-operator-precedence/) diff --git a/toolchain/docs/parse.svg b/toolchain/docs/parse.svg new file mode 100644 index 0000000000000..6576b352f8f39 --- /dev/null +++ b/toolchain/docs/parse.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/website/prebuild.py b/website/prebuild.py index c0a3829ac145c..a7d7457863e19 100755 --- a/website/prebuild.py +++ b/website/prebuild.py @@ -189,7 +189,9 @@ def next(nav_order: list[int]) -> int: # Reset the order for the implementation children. nav_order[0] = 0 - label_subdir("toolchain", next(nav_order), parent_title="Implementation") + label_subdir( + "toolchain/docs", next(nav_order), parent_title="Implementation" + ) label_subdir("explorer", next(nav_order), parent_title="Implementation") label_subdir("testing", next(nav_order), parent_title="Implementation") From 85c0def65c6014d406741802398fe88044a36253 Mon Sep 17 00:00:00 2001 From: jonmeow Date: Fri, 23 Aug 2024 09:46:42 -0700 Subject: [PATCH 02/16] Fixes --- toolchain/docs/diagnostics.md | 3 ++- toolchain/docs/parse.md | 31 ++++++++++++++++++------------- 2 files changed, 20 insertions(+), 14 deletions(-) diff --git a/toolchain/docs/diagnostics.md b/toolchain/docs/diagnostics.md index e83bb70c23db1..9bd7f3a188e15 100644 --- a/toolchain/docs/diagnostics.md +++ b/toolchain/docs/diagnostics.md @@ -175,7 +175,8 @@ written in the following style: - Start diagnostics with a capital letter or quoted code, and end them with a period. 
-- Quoted code should be enclosed in backticks, for example: `"{0} is bad."` +- Quoted code should be enclosed in backticks, for example: + ``"`{0}` is bad."`` - Phrase diagnostics as bullet points rather than full sentences. Leave out articles unless they're necessary for clarity. diff --git a/toolchain/docs/parse.md b/toolchain/docs/parse.md index beffaea615c53..962410132c0fd 100644 --- a/toolchain/docs/parse.md +++ b/toolchain/docs/parse.md @@ -99,24 +99,29 @@ fn foo() -> f64 { The node order is (with indentation to indicate nesting): + + + ```yaml [ - { kind: 'FileStart', text: '' }, - { kind: 'FunctionIntroducer', text: 'fn' }, - { kind: 'Name', text: 'foo' }, - { kind: 'ParamListStart', text: '(' }, - { kind: 'ParamList', text: ')', subtree_size: 2 }, - { kind: 'Literal', text: 'f64' }, - { kind: 'ReturnType', text: '->', subtree_size: 2 }, - { kind: 'FunctionDefinitionStart', text: '{', subtree_size: 7 }, - { kind: 'ReturnStatementStart', text: 'return' }, - { kind: 'Literal', text: '42' }, - { kind: 'ReturnStatement', text: ';', subtree_size: 3 }, - { kind: 'FunctionDefinition', text: '}', subtree_size: 11 }, - { kind: 'FileEnd', text: '' }, + {kind: 'FileStart', text: ''}, + {kind: 'FunctionIntroducer', text: 'fn'}, + {kind: 'Name', text: 'foo'}, + {kind: 'ParamListStart', text: '('}, + {kind: 'ParamList', text: ')', subtree_size: 2}, + {kind: 'Literal', text: 'f64'}, + {kind: 'ReturnType', text: '->', subtree_size: 2}, + {kind: 'FunctionDefinitionStart', text: '{', subtree_size: 7}, + {kind: 'ReturnStatementStart', text: 'return'}, + {kind: 'Literal', text: '42'}, + {kind: 'ReturnStatement', text: ';', subtree_size: 3}, + {kind: 'FunctionDefinition', text: '}', subtree_size: 11}, + {kind: 'FileEnd', text: ''}, ] ``` + + In this example, `FileStart`, `FunctionDefinition`, and `FileEnd` are "root" nodes for the tree. Function components are children of `FunctionDefinition`. 
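As a rough sketch of how `subtree_size` encodes this nesting, the helper below walks backward from a parent node to list its direct children. `DirectChildren` and `ExampleSubtreeSizes` are hypothetical names, not part of the toolchain; the sizes are copied from the node list above, assuming that a node without an explicit `subtree_size` has size 1.

```cpp
#include <cassert>
#include <vector>

// Subtree sizes for the 13 nodes listed above, in postorder. Index 0 is
// `FileStart` and index 12 is `FileEnd`. Nodes with no explicit
// `subtree_size` are assumed to have size 1.
inline auto ExampleSubtreeSizes() -> std::vector<int> {
  return {1, 1, 1, 1, 2, 1, 2, 7, 1, 1, 3, 11, 1};
}

// Returns the indices of the direct children of `parent`, last child first.
// In postorder, a parent's last child sits immediately before it, and each
// earlier sibling is found by skipping back over the later sibling's subtree.
inline auto DirectChildren(const std::vector<int>& subtree_sizes, int parent)
    -> std::vector<int> {
  assert(parent >= 0 && parent < static_cast<int>(subtree_sizes.size()));
  std::vector<int> children;
  // Index of the first node that belongs to the parent's subtree.
  int first = parent - subtree_sizes[parent] + 1;
  int child = parent - 1;
  while (child >= first) {
    children.push_back(child);
    child -= subtree_sizes[child];
  }
  return children;
}
```

For the `FunctionDefinition` node at index 11, this yields indices 10 and 7: the `ReturnStatement` and `FunctionDefinitionStart` subtrees, last child first.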
From 6be91cf16b5562fa3596a35f1b5bcb306ee07fbe Mon Sep 17 00:00:00 2001 From: jonmeow Date: Fri, 23 Aug 2024 09:52:55 -0700 Subject: [PATCH 03/16] Backtick a few diagnostic examples --- toolchain/docs/diagnostics.md | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/toolchain/docs/diagnostics.md b/toolchain/docs/diagnostics.md index 9bd7f3a188e15..2b4ece3eeb682 100644 --- a/toolchain/docs/diagnostics.md +++ b/toolchain/docs/diagnostics.md @@ -185,18 +185,20 @@ written in the following style: language rule that was violated, although either can be omitted if it's clear from the other. For example: - - "Redeclaration of X." describes the situation and implies that + - `"Redeclaration of X."` describes the situation and implies that redeclarations are not permitted. - - "`self` can only be declared in an implicit parameter list." describes - the language rule and implies that you declared `self` somewhere else. + - ``"`self` can only be declared in an implicit parameter list."`` + describes the language rule and implies that you declared `self` + somewhere else. - It's OK for a diagnostic to guess at the developer's intent and provide a hint after explaining the situation and the rule, but not as a - substitute for that. For example, "Add an `as String` cast to format - this integer as a string." is not sufficient as an error message, but - "Cannot add i32 to String. Add an `as String` cast to format this - integer as a string." could be acceptable. + substitute for that. For example, + ``"Add an `as String` cast to format this integer as a string."`` is not + sufficient as an error message, but + ``"Cannot add i32 to String. Add an `as String` cast to format this integer as a string."`` + could be acceptable. 
- TODO: Should diagnostics be atemporal and non-sequential ("multiple declarations of X", "additional declaration here"), present tense but From 34722ff35916720a2a6da859fc5065a2cd484632 Mon Sep 17 00:00:00 2001 From: Jon Ross-Perkins Date: Thu, 29 Aug 2024 09:42:45 -0700 Subject: [PATCH 04/16] Apply suggestions from code review Co-authored-by: Geoff Romer --- toolchain/docs/adding_features.md | 14 +++++++------- toolchain/docs/parse.md | 16 ++++++++-------- 2 files changed, 15 insertions(+), 15 deletions(-) diff --git a/toolchain/docs/adding_features.md b/toolchain/docs/adding_features.md index a297127ffd7a4..825c9f804971a 100644 --- a/toolchain/docs/adding_features.md +++ b/toolchain/docs/adding_features.md @@ -57,7 +57,7 @@ declarations in the header, so only extra helper functions should be added there. Every state handler pops the state from the stack before any other processing. -## Typed parse node metadata implementation +### Typed parse node metadata implementation As of [#3534](https://github.com/carbon-language/carbon-lang/pull/3534): @@ -249,7 +249,7 @@ formatter. If the resulting SemIR needs a new built-in, add it to [builtin_inst_kind.def](/toolchain/sem_ir/builtin_inst_kind.def). -## SemIR typed instruction metadata implementation +### SemIR typed instruction metadata implementation How does this work? As of [#3310](https://github.com/carbon-language/carbon-lang/pull/3310): @@ -365,7 +365,7 @@ Each SemIR instruction requires adding a `Handle` function in a ## Tests and debugging -## Running tests +### Running tests Tests are run in bulk as `bazel test //toolchain/...`. Many tests are using the file_test infrastructure; see @@ -387,25 +387,25 @@ example, with `toolchain/parse/testdata/basics/empty.carbon`: - `bazel-bin/toolchain/driver/carbon compile --phase=parse --dump-parse-tree toolchain/parse/testdata/basics/empty.carbon` - Similar to the previous command, but without using `bazel`. 
-## Updating tests +### Updating tests The `toolchain/autoupdate_testdata.py` script can be used to update output. It invokes the `file_test` autoupdate support. See [testing/file_test/README.md](/testing/file_test/README.md) for file syntax. -### Reviewing test deltas +#### Reviewing test deltas Using `autoupdate_testdata.py` can be useful to produce deltas during the development process because it allows `git status` and `git diff` to be used to examine what changed. -## Verbose output +### Verbose output The `-v` flag can be passed to trace state, and should be specified before the subcommand name: `carbon -v compile ...`. `CARBON_VLOG` is used to print output in this mode. There is currently no control over the degree of verbosity. -## Stack traces +### Stack traces While the iterative processing pattern means function stack traces will have minimal context for how the current function is reached, we use LLVM's diff --git a/toolchain/docs/parse.md b/toolchain/docs/parse.md index 962410132c0fd..e2c46d534300e 100644 --- a/toolchain/docs/parse.md +++ b/toolchain/docs/parse.md @@ -531,7 +531,7 @@ might skip those without emitting diagnostics. The specific approach to producing the desired tree depends on the kind of grammar rule being implemented, as well as the desired output tree structure. -## Introducer +### Introducer **Example:** `if (c) { ... }` @@ -606,7 +606,7 @@ most introducer tokens indicate that the current state should be repeated, to handle the next statement, but some other token, like a close curly brace (`}`) means that the state should be exited. -## Optional modifiers before an introducer +### Optional modifiers before an introducer **Example:** `virtual fn Foo();` @@ -709,7 +709,7 @@ declaration or statement. - This `state.subtree_start` will mark everything since the bracketing introducer node as the children of this node. 
-## Something required in context +### Something required in context FIXME @@ -719,9 +719,9 @@ Example: name after introducer Example: `[]` after `impl forall` [parse/handle_impl.cpp](/toolchain/parse/handle_impl.cpp) -## Optional clauses +### Optional clauses -### Case 1: introducer to optional clause is used as parent node +#### Case 1: introducer to optional clause is used as parent node **Example:** The optional `-> ` in a function signature uses this pattern, so `fn foo() -> u32;` is transformed to: @@ -781,7 +781,7 @@ Also see how the optional initializer is handled on `var`, treating the `=` as its introducer in `HandleVarAfterPattern` and `HandleVarInitializer` in [parse/handle_var.cpp](/toolchain/parse/handle_var.cpp). -### Case 2: parent node is required token after optional clause, with different parent node kinds for different options +#### Case 2: parent node is required token after optional clause, with different parent node kinds for different options **Example:** The optional type expression before `as` in `impl as` is represented by producing two different output parse nodes for `as`. It outputs a @@ -861,7 +861,7 @@ If there is no `as` token, we don't output either a `TypeImplAs` or a `DefaultSelfImplAs` node, as required by the parent node, so in those cases we mark the parent as having an error. -### Case 3: optional sibling +#### Case 3: optional sibling > TODO: This was changed by > [#3678](https://github.com/carbon-language/carbon-lang/pull/3678) and needs to @@ -909,7 +909,7 @@ and then `HandleImplBeforeAs` follows [the "something required in context" pattern](#something-required-in-context) to deal with the `as` that follows when the type expression is present. 
-## Operators +### Operators FIXME From ce7014babf910cf69336104d82f6e2f477ee20cc Mon Sep 17 00:00:00 2001 From: Jon Ross-Perkins Date: Thu, 29 Aug 2024 09:48:50 -0700 Subject: [PATCH 05/16] Apply suggestions from code review Co-authored-by: Geoff Romer --- toolchain/docs/check.md | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/toolchain/docs/check.md b/toolchain/docs/check.md index 8bb292510c816..757acd35101ed 100644 --- a/toolchain/docs/check.md +++ b/toolchain/docs/check.md @@ -98,7 +98,7 @@ facet type `type`. We will also have built-in functions which would need to form the implementation of some library types, such as `i32`. Built-ins are in a stable index across `SemIR` instances. -## Parameters and arguments +### Parameters and arguments Parameters and arguments will be stored as two `SemIR::InstBlock`s each. The first will contain the full IR, while the second will contain references to the @@ -110,7 +110,7 @@ comparisons and indexed access. There are two textual ways to view `SemIR`. -## Raw form +### Raw form The raw form of SemIR shows the details of the representation, such as numeric instruction and block IDs. The representation is intended to very closely match @@ -119,7 +119,7 @@ debugging low-level issues with the `SemIR` representation. The driver will print this when passed `--dump-raw-sem-ir`. -## Formatted IR +### Formatted IR In addition to the raw form, there is a higher-level formatted IR that aims to be human readable. This is used in most `check` tests to validate the output, @@ -215,7 +215,7 @@ name components are only included if they are necessary to disambiguate the name. `` is a guessed good name for the instruction, often derived from source-level identifiers, and is empty if no guess was made. -### Instructions +#### Instructions There is usually one line in a `InstBlock` for each `Inst`. 
You can find the documentation for the different kinds of instructions in @@ -291,7 +291,7 @@ formatter will combine instructions together to make the IR more readable: These exceptions may be found in [toolchain/sem_ir/formatter.cpp](/toolchain/sem_ir/formatter.cpp). -### Top-level entities +#### Top-level entities **Question:** Are these too in flux to document at this time? @@ -409,7 +409,7 @@ process entries from the node stack until it finds that solo parse node from Another pattern that arises is state is set up by an introducer node, updated by its siblings, and then consumed by the bracketing parent node. FIXME: example -## Node stack +### Node stack The node stack, defined in [check/node_stack.h](/toolchain/check/node_stack.h), stores pairs of a `Parse::Node` and an id. The type of the id is determined by @@ -433,7 +433,7 @@ When each `Parse::Node` is evaluated, the SemIR for it is typically immediately generated as `SemIR::Inst`s. To help generate the IR to an appropriate context, scopes have separate `SemIR::InstBlock`s. -## Delayed evaluation (not yet implemented) +### Delayed evaluation (not yet implemented) Sometimes, nodes will need to have delayed evaluation; for example, an inline definition of a class member function needs to be evaluated after the class is @@ -444,7 +444,7 @@ scope completes. This means that nodes in a definition would be traversed twice, once while determining that they're inline and without full checking or IR generation, then again with full checking and IR generation. -## Templates (not yet implemented) +### Templates (not yet implemented) Templates need to have partial semantic checking when declared, but can't be fully implemented before they're instantiated against a specific type. @@ -454,7 +454,7 @@ the incomplete information in the IR. Instantiation will likely use that IR and fill in the missing information, but it could also reevaluate the original `Parse::Node`s with the known template state. 
-## Rewrites +### Rewrites Carbon relies on rewrites of code, such as rewriting the destination of an initializer to a specific target object once that object is known. @@ -473,7 +473,7 @@ Type expressions are treated like any other expression, and are modeled as `SemIR::Inst`s. The types computed by type expressions are deduplicated, resulting in a canonical `SemIR::TypeId` for each distinct type. -## Type printing (not yet implemented) +### Type printing (not yet implemented) The `TypeId` preserves only the identity of the type, not its spelling, and so printing it will produce a fully-resolved type name, which isn't a great user @@ -561,7 +561,7 @@ with a mixed expression category are treated as a special case in conversion, which recurses into the elements of those instructions before performing conversions. -## Value bindings +### Value bindings A value binding represents a conversion from a reference expression to the value stored in that expression. There are three important cases here: From 0dc5af02716519d4ddda3a0f89ae0370839a3887 Mon Sep 17 00:00:00 2001 From: jonmeow Date: Thu, 29 Aug 2024 10:49:12 -0700 Subject: [PATCH 06/16] Addressing comments --- toolchain/docs/README.md | 102 +++++------------------------- toolchain/docs/adding_features.md | 38 ++++++++--- toolchain/docs/check.md | 62 ++++++++++-------- toolchain/docs/diagnostics.md | 5 ++ toolchain/docs/driver.md | 22 +++++++ toolchain/docs/lex.md | 44 +++++++++++++ toolchain/docs/lower.md | 25 ++++++++ toolchain/docs/parse.md | 22 +++---- 8 files changed, 187 insertions(+), 133 deletions(-) create mode 100644 toolchain/docs/driver.md create mode 100644 toolchain/docs/lex.md create mode 100644 toolchain/docs/lower.md diff --git a/toolchain/docs/README.md b/toolchain/docs/README.md index 4425e5e30c835..fcb9426b69c9d 100644 --- a/toolchain/docs/README.md +++ b/toolchain/docs/README.md @@ -13,17 +13,7 @@ SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception - [Goals](#goals) - [High-level 
architecture](#high-level-architecture) - [Design patterns](#design-patterns) -- [Main components](#main-components) - - [Driver](#driver) - - [Diagnostics](#diagnostics) - - [Lex](#lex) - - [Bracket matching](#bracket-matching) - - [Parse](#parse) - - [Check](#check) - [Adding features](#adding-features) -- [Alternatives considered](#alternatives-considered) - - [Bracket matching in parser](#bracket-matching-in-parser) - - [Using a traditional AST representation](#using-a-traditional-ast-representation) @@ -42,16 +32,23 @@ it here. ## High-level architecture -The default compilation flow is: +The main components are: -1. Load the file into a [SourceBuffer](/toolchain/source/source_buffer.h). -2. Lex a SourceBuffer into a - [Lex::TokenizedBuffer](/toolchain/lex/tokenized_buffer.h). -3. Parse a TokenizedBuffer into a [Parse::Tree](/toolchain/parse/tree.h). -4. Check a Tree to produce [SemIR::File](/toolchain/sem_ir/file.h). -5. Lower the SemIR to an - [LLVM Module](https://llvm.org/doxygen/classllvm_1_1Module.html). -6. CodeGen turns the LLVM Module into an Object File. +- [Driver](driver.md): Provides commands and ties together compilation flow. +- [Diagnostics](diagnostics.md): Produces diagnostic output. +- Compilation flow: + + 1. Source: Load the file into a + [SourceBuffer](/toolchain/source/source_buffer.h). + 2. [Lex](lex.md): Transform a SourceBuffer into a + [Lex::TokenizedBuffer](/toolchain/lex/tokenized_buffer.h). + 3. [Parse](parse.md): Transform a TokenizedBuffer into a + [Parse::Tree](/toolchain/parse/tree.h). + 4. [Check](check.md): Transform a Tree to produce + [SemIR::File](/toolchain/sem_ir/file.h). + 5. [Lower](lower.md): Transform the SemIR to an + [LLVM Module](https://llvm.org/doxygen/classllvm_1_1Module.html). + 6. CodeGen: Transform the LLVM Module into an Object File. ### Design patterns @@ -92,73 +89,6 @@ A few common design patterns are: See also [Idioms](idioms.md) for abbreviations and more implementation techniques. 
-## Main components - -### Driver - -The driver provides commands and ties together the toolchain's flow. Running a -command such as `carbon compile --phase=lower ` will run through the flow -and print output. Several dump flags, such as `--dump-parse-tree`, print output -in YAML format for easier parsing. - -### Diagnostics - -The diagnostic code is used by the toolchain to produce output. - -See [Diagnostics](diagnostics.md) for details. - -### Lex - -Lexing converts input source code into tokenized output. Literals, such as -string literals, have their value parsed and form a single token at this stage. - -#### Bracket matching - -The lexer handles matching for `()`, `[]`, and `{}`. When a bracket lacks a -match, it will insert a "recovery" token to produce a match. As a consequence, -the lexer's output should always have matched brackets, even with invalid code. - -While bracket matching could use hints such as contextual clues from -indentation, that is not yet implemented. - -### Parse - -Parsing uses tokens to produce a parse tree that faithfully represents the tree -structure of the source program, interpreted according to the Carbon grammar. No -semantics are associated with the tree structure at this level, and no name -lookup is performed. - -See [Parse](parse.md) for details. - -### Check - -Check takes the parse tree and generates a semantic intermediate representation, -or SemIR. This will look closer to a series of instructions, in preparation for -transformation to LLVM IR. Semantic analysis and type checking occurs during the -production of SemIR. It also does any validation that requires context. - -See [Check](check.md) for details. - ## Adding features We have a [walkthrough for adding features](adding_features.md). - -## Alternatives considered - -### Bracket matching in parser - -Bracket matching could have also been implemented in the parser, with some -awareness of parse state. 
However, that would shift some of the complexity of -recovery in other error situations, such as where the parser searches for the -next comma in a list. That needs to skip over bracketed ranges. We don't think -the trade-offs would yield a net benefit, so any change in this direction would -need to show concrete improvement, for example better diagnostics for common -issues. - -### Using a traditional AST representation - -Clang creates an AST as part of compilation. In Carbon, it's something we could -do as a step between parsing and checking, possibly replacing the SemIR. It's -likely that doing so would be simpler, amongst other possible trade-offs. -However, we think the SemIR approach is going to yield higher performance, -enough so that it's the chosen approach. diff --git a/toolchain/docs/adding_features.md b/toolchain/docs/adding_features.md index 825c9f804971a..abe341316beb3 100644 --- a/toolchain/docs/adding_features.md +++ b/toolchain/docs/adding_features.md @@ -12,16 +12,16 @@ SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception - [Lex](#lex) - [Parse](#parse) -- [Typed parse node metadata implementation](#typed-parse-node-metadata-implementation) + - [Typed parse node metadata implementation](#typed-parse-node-metadata-implementation) - [Check](#check) -- [SemIR typed instruction metadata implementation](#semir-typed-instruction-metadata-implementation) + - [SemIR typed instruction metadata implementation](#semir-typed-instruction-metadata-implementation) - [Lower](#lower) - [Tests and debugging](#tests-and-debugging) -- [Running tests](#running-tests) -- [Updating tests](#updating-tests) - - [Reviewing test deltas](#reviewing-test-deltas) -- [Verbose output](#verbose-output) -- [Stack traces](#stack-traces) + - [Running tests](#running-tests) + - [Updating tests](#updating-tests) + - [Reviewing test deltas](#reviewing-test-deltas) + - [Verbose output](#verbose-output) + - [Stack traces](#stack-traces) @@ -238,8 +238,28 @@ If the resulting SemIR 
needs a new instruction: - Add a `CARBON_SEM_IR_INST_KIND(NewInstKindName)` line in alphabetical order - a new struct definition to - [sem_ir/typed_insts.h](/toolchain/sem_ir/typed_insts.h), with (italics - highlight what changes): + [sem_ir/typed_insts.h](/toolchain/sem_ir/typed_insts.h), such as: + + ```cpp + struct NewInstKindName { + static constexpr auto Kind = InstKind::NewInstKindName.Define( + // the name used in textual IR + "new_inst_kind_name" + // Optional: , TerminatorKind::KindOfTerminator + ); + + // Optional: omit if not associated with a parse node. + Parse::Node parse_node; + + // Optional: omit if this sem_ir instruction does not produce a value. + TypeId type_id; + + // 0-2 id fields, with types from sem_ir/ids.h or sem_ir/builtin_kind.h + // For example, fields would look like: + StringId name_id; + InstId value_id; + }; + ``` Adding an instruction will also require a handler in the Lower step. diff --git a/toolchain/docs/check.md b/toolchain/docs/check.md index 757acd35101ed..4eafd63644fa1 100644 --- a/toolchain/docs/check.md +++ b/toolchain/docs/check.md @@ -10,32 +10,42 @@ SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception ## Table of contents +- [Overview](#overview) - [Postorder processing](#postorder-processing) - [Key IR concepts](#key-ir-concepts) -- [Parameters and arguments](#parameters-and-arguments) + - [Parameters and arguments](#parameters-and-arguments) - [SemIR textual format](#semir-textual-format) -- [Raw form](#raw-form) -- [Formatted IR](#formatted-ir) - - [Instructions](#instructions) - - [Top-level entities](#top-level-entities) + - [Raw form](#raw-form) + - [Formatted IR](#formatted-ir) + - [Instructions](#instructions) + - [Top-level entities](#top-level-entities) - [Core loop](#core-loop) -- [Node stack](#node-stack) -- [Delayed evaluation (not yet implemented)](#delayed-evaluation-not-yet-implemented) -- [Templates (not yet implemented)](#templates-not-yet-implemented) -- [Rewrites](#rewrites) + - [Node 
stack](#node-stack) + - [Delayed evaluation (not yet implemented)](#delayed-evaluation-not-yet-implemented) + - [Templates (not yet implemented)](#templates-not-yet-implemented) + - [Rewrites](#rewrites) - [Types](#types) -- [Type printing (not yet implemented)](#type-printing-not-yet-implemented) + - [Type printing (not yet implemented)](#type-printing-not-yet-implemented) - [Expression categories](#expression-categories) - [ExprCategory::NotExpression](#exprcategorynotexpression) - [ExprCategory::Value](#exprcategoryvalue) - [ExprCategory::DurableReference and ExprCategory::EphemeralReference](#exprcategorydurablereference-and-exprcategoryephemeralreference) - [ExprCategory::Initializing](#exprcategoryinitializing) - [ExprCategory::Mixed](#exprcategorymixed) -- [Value bindings](#value-bindings) + - [Value bindings](#value-bindings) - [Handling Parse::Tree errors (not yet implemented)](#handling-parsetree-errors-not-yet-implemented) +- [Alternatives considered](#alternatives-considered) + - [Using a traditional AST representation](#using-a-traditional-ast-representation) +## Overview + +Check takes the parse tree and generates a semantic intermediate representation, +or SemIR. This will look closer to a series of instructions, in preparation for +transformation to LLVM IR. Semantic analysis and type checking occur during the +production of SemIR. It also does any validation that requires context. + ## Postorder processing The checking step is oriented on postorder processing on the `Parse::Tree` to @@ -135,7 +145,7 @@ representation, although no such parser currently exists. As an example, given the program: -```cpp +```carbon fn Cond() -> bool; fn Run() -> i32 { return if Cond() then 1 else 2; } ``` @@ -327,14 +337,12 @@ the parent node. One example of this pattern is expressions. 
Each subexpression outputs SemIR instructions to compute the value of that subexpression to the current -instruction block, added to the top of the - -`InstBlockStack` stored in the `Context` object. It leaves an instruction id on -the top of the [node stack](#node-stack) pointing to the instruction that -produces the value of that subexpression. Those are consumed by parent -operations, like an [RPN](https://en.wikipedia.org/wiki/Reverse_Polish_notation) -calculator. For example, the expression `1 * 2 + 3` corresponds to this parse -tree: +instruction block, added to the top of the `InstBlockStack` stored in the +`Context` object. It leaves an instruction id on the top of the +[node stack](#node-stack) pointing to the instruction that produces the value of +that subexpression. Those are consumed by parent operations, like an +[RPN](https://en.wikipedia.org/wiki/Reverse_Polish_notation) calculator. For +example, the expression `1 * 2 + 3` corresponds to this parse tree: ```yaml {kind: 'IntegerLiteral', text: '1'}, @@ -597,12 +605,12 @@ For example, an invalid line of code in a function might generate some incomplete IR in the function's `SemIR::InstBlock`, but that IR won't negatively interfere with checking later valid lines in the same function. -# Lower +## Alternatives considered -Lowering takes the SemIR and produces LLVM IR. At present, this is done in a -single pass, although it's possible we may need to do a second pass so that we -can first generate type information for function arguments. +### Using a traditional AST representation -Lowering is done per `SemIR::InstBlock`. This minimizes changes to the -`IRBuilder` insertion point, something that is both expensive and potentially -fragile. +Clang creates an AST as part of compilation. In Carbon, it's something we could +do as a step between parsing and checking, possibly replacing the SemIR. It's +likely that doing so would be simpler, amongst other possible trade-offs. 
+However, we think the SemIR approach is going to yield higher performance, +enough so that it's the chosen approach. diff --git a/toolchain/docs/diagnostics.md b/toolchain/docs/diagnostics.md index 2b4ece3eeb682..19fc5d9c8d05a 100644 --- a/toolchain/docs/diagnostics.md +++ b/toolchain/docs/diagnostics.md @@ -10,6 +10,7 @@ SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception ## Table of contents +- [Overview](#overview) - [DiagnosticEmitter](#diagnosticemitter) - [DiagnosticConsumers](#diagnosticconsumers) - [Producing diagnostics](#producing-diagnostics) @@ -21,6 +22,10 @@ SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception +## Overview + +The diagnostic code is used by the toolchain to produce output. + ## DiagnosticEmitter [DiagnosticEmitters](/toolchain/diagnostics/diagnostic_emitter.h) handle the diff --git a/toolchain/docs/driver.md b/toolchain/docs/driver.md new file mode 100644 index 0000000000000..01744b623b5d4 --- /dev/null +++ b/toolchain/docs/driver.md @@ -0,0 +1,22 @@ +# Driver + + + + + +## Table of contents + +- [Overview](#overview) + + + +## Overview + +The driver provides commands and ties together the toolchain's flow. Running a +command such as `carbon compile --phase=lower ` will run through the flow +and print output. Several dump flags, such as `--dump-parse-tree`, print output +in YAML format for easier parsing. diff --git a/toolchain/docs/lex.md b/toolchain/docs/lex.md new file mode 100644 index 0000000000000..32925aeee19eb --- /dev/null +++ b/toolchain/docs/lex.md @@ -0,0 +1,44 @@ +# Lex + + + + + +## Table of contents + +- [Overview](#overview) +- [Bracket matching](#bracket-matching) +- [Alternatives considered](#alternatives-considered) + - [Bracket matching in parser](#bracket-matching-in-parser) + + + +## Overview + +Lexing converts input source code into tokenized output. Literals, such as +string literals, have their value parsed and form a single token at this stage. 
+ +## Bracket matching + +The lexer handles matching for `()`, `[]`, and `{}`. When a bracket lacks a +match, it will insert a "recovery" token to produce a match. As a consequence, +the lexer's output should always have matched brackets, even with invalid code. + +While bracket matching could use hints such as contextual clues from +indentation, that is not yet implemented. + +## Alternatives considered + +### Bracket matching in parser + +Bracket matching could have also been implemented in the parser, with some +awareness of parse state. However, that would shift some of the complexity of +recovery in other error situations, such as where the parser searches for the +next comma in a list. That needs to skip over bracketed ranges. We don't think +the trade-offs would yield a net benefit, so any change in this direction would +need to show concrete improvement, for example better diagnostics for common +issues. diff --git a/toolchain/docs/lower.md b/toolchain/docs/lower.md new file mode 100644 index 0000000000000..4574952c7b080 --- /dev/null +++ b/toolchain/docs/lower.md @@ -0,0 +1,25 @@ +# Lower + + + + + +## Table of contents + +- [Overview](#overview) + + + +## Overview + +Lowering takes the SemIR and produces LLVM IR. At present, this is done in a +single pass, although it's possible we may need to do a second pass so that we +can first generate type information for function arguments. + +Lowering is done per `SemIR::InstBlock`. This minimizes changes to the +`IRBuilder` insertion point, something that is both expensive and potentially +fragile. 
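Reviewer note on the bracket matching described in `lex.md` above: a small sketch may help readers picture how recovery tokens keep the output balanced. The code below is illustrative only and is not the toolchain's lexer. `Token`, `MatchBrackets`, and the other names are hypothetical, and the recovery policy (close any inner brackets still open, and synthesize an opener for a stray closer) is an assumption chosen to keep the example short.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical token type for illustration; not the toolchain's real API.
struct Token {
  char text;
  bool is_recovery;  // True when the token was synthesized, not in the source.
};

// Returns the closing bracket for an opening bracket, or '\0' otherwise.
inline auto ClosingFor(char c) -> char {
  switch (c) {
    case '(': return ')';
    case '[': return ']';
    case '{': return '}';
    default: return '\0';
  }
}

// Returns the opening bracket for a closing bracket, or '\0' otherwise.
inline auto OpeningFor(char c) -> char {
  switch (c) {
    case ')': return '(';
    case ']': return '[';
    case '}': return '{';
    default: return '\0';
  }
}

// Emits only the bracket tokens of `source`, adding recovery tokens so that
// the result is always balanced. Non-bracket characters are skipped.
inline auto MatchBrackets(const std::string& source) -> std::vector<Token> {
  std::vector<Token> tokens;
  std::vector<char> expected_closers;  // Stack of closers still owed.
  for (char c : source) {
    char closer = ClosingFor(c);
    if (closer != '\0') {
      expected_closers.push_back(closer);
      tokens.push_back({c, false});
      continue;
    }
    if (OpeningFor(c) == '\0') {
      continue;  // Not a bracket.
    }
    bool has_open = std::find(expected_closers.begin(), expected_closers.end(),
                              c) != expected_closers.end();
    if (!has_open) {
      // Stray closer: synthesize an opener directly before it.
      tokens.push_back({OpeningFor(c), true});
      tokens.push_back({c, false});
      continue;
    }
    // Close inner brackets that were left open, then consume the real closer.
    while (expected_closers.back() != c) {
      tokens.push_back({expected_closers.back(), true});
      expected_closers.pop_back();
    }
    expected_closers.pop_back();
    tokens.push_back({c, false});
  }
  // Close anything still open at end of input.
  while (!expected_closers.empty()) {
    tokens.push_back({expected_closers.back(), true});
    expected_closers.pop_back();
  }
  return tokens;
}

// Joins token text for easy inspection.
inline auto ToString(const std::vector<Token>& tokens) -> std::string {
  std::string out;
  for (const Token& token : tokens) {
    out.push_back(token.text);
  }
  return out;
}
```

With this policy, every input yields balanced brackets, which is the property the parser can then rely on when it skips over bracketed ranges during its own error recovery.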
diff --git a/toolchain/docs/parse.md b/toolchain/docs/parse.md index e2c46d534300e..3a578cdedeff3 100644 --- a/toolchain/docs/parse.md +++ b/toolchain/docs/parse.md @@ -17,14 +17,14 @@ SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception - [Visual example](#visual-example) - [Handling invalid parses](#handling-invalid-parses) - [How is this accomplished?](#how-is-this-accomplished) -- [Introducer](#introducer) -- [Optional modifiers before an introducer](#optional-modifiers-before-an-introducer) -- [Something required in context](#something-required-in-context) -- [Optional clauses](#optional-clauses) - - [Case 1: introducer to optional clause is used as parent node](#case-1-introducer-to-optional-clause-is-used-as-parent-node) - - [Case 2: parent node is required token after optional clause, with different parent node kinds for different options](#case-2-parent-node-is-required-token-after-optional-clause-with-different-parent-node-kinds-for-different-options) - - [Case 3: optional sibling](#case-3-optional-sibling) -- [Operators](#operators) + - [Introducer](#introducer) + - [Optional modifiers before an introducer](#optional-modifiers-before-an-introducer) + - [Something required in context](#something-required-in-context) + - [Optional clauses](#optional-clauses) + - [Case 1: introducer to optional clause is used as parent node](#case-1-introducer-to-optional-clause-is-used-as-parent-node) + - [Case 2: parent node is required token after optional clause, with different parent node kinds for different options](#case-2-parent-node-is-required-token-after-optional-clause-with-different-parent-node-kinds-for-different-options) + - [Case 3: optional sibling](#case-3-optional-sibling) + - [Operators](#operators) @@ -91,7 +91,7 @@ of a stack entry and is thus more efficient to store in one place. The parse tree's storage layout is in postorder. 
For example, given the code: -```cpp +```carbon fn foo() -> f64 { return 42; } @@ -873,7 +873,7 @@ introducer and the `ImplAs` node for the required `as` keyword. `impl bool as Interface;` is transformed to: -```cpp +```yaml {kind: 'ImplIntroducer', text: 'impl'}, {kind: 'BoolTypeLiteral', text: 'bool'}, {kind: 'ImplAs', text: 'as'}, @@ -883,7 +883,7 @@ introducer and the `ImplAs` node for the required `as` keyword. while `impl as Interface;` is transformed to: -```cpp +```yaml {kind: 'ImplIntroducer', text: 'impl'}, {kind: 'ImplAs', text: 'as'}, {kind: 'IdentifierNameExpr', text: 'Interface'}, From a6b3e2ad797e2992610d2c7db55b989981d9ef2d Mon Sep 17 00:00:00 2001 From: Jon Ross-Perkins Date: Thu, 29 Aug 2024 12:20:12 -0700 Subject: [PATCH 07/16] Update toolchain/docs/parse.md Co-authored-by: Geoff Romer --- toolchain/docs/parse.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/toolchain/docs/parse.md b/toolchain/docs/parse.md index 3a578cdedeff3..5b3316d20f0b9 100644 --- a/toolchain/docs/parse.md +++ b/toolchain/docs/parse.md @@ -716,7 +716,7 @@ FIXME Example: name after introducer [parse/handle_decl_name_and_params.cpp](/toolchain/parse/handle_decl_name_and_params.cpp) -Example: `[]` after `impl forall` +Example: "`[` _implicit parameter list_ `]`" after `impl forall` [parse/handle_impl.cpp](/toolchain/parse/handle_impl.cpp) ### Optional clauses From 4b560db502cf720c25488506297970acbfa9d4a8 Mon Sep 17 00:00:00 2001 From: jonmeow Date: Thu, 29 Aug 2024 12:49:40 -0700 Subject: [PATCH 08/16] Layer the nodes more --- toolchain/docs/parse.md | 57 ++++++++++++++++++++++++++++++++++++----- 1 file changed, 51 insertions(+), 6 deletions(-) diff --git a/toolchain/docs/parse.md b/toolchain/docs/parse.md index 5b3316d20f0b9..1f97a0306afad 100644 --- a/toolchain/docs/parse.md +++ b/toolchain/docs/parse.md @@ -221,12 +221,28 @@ flowchart BT subgraph nodes["Parsed nodes"] direction BT node1[var]:::moved + node2[x]:::pending + node3[:]:::pending + 
node4[i32]:::pending + node5[=]:::pending + node6[y]:::pending + node7[+]:::pending + node8[1]:::pending + node9[;]:::pending end - classDef hidden visibility:hidden,display:none + classDef pending visibility:hidden classDef moved fill:#0F0,color:#000 + classDef hidden visibility:hidden,display:none - node1 ~~~ root + node1 ~~~~ root + node3 ~~~ node2 & node4 + node3 ~~~ root + node5 ~~~~ root + node7 ~~~ node6 & node8 + node7 ~~~ root + node9 ~~~ node1 & node3 & node5 & node7 + node9 ~~~ root ``` Next, we can consider the pattern binding. Here, `x` is the identifier and `i32` @@ -260,14 +276,25 @@ flowchart BT node2[x]:::moved node3[:]:::moved node4[i32]:::moved + node5[=]:::pending + node6[y]:::pending + node7[+]:::pending + node8[1]:::pending + node9[;]:::pending end - classDef hidden visibility:hidden,display:none + classDef pending visibility:hidden classDef moved fill:#0F0,color:#000 + classDef hidden visibility:hidden,display:none node1 ~~~~ root node3 --- node2 & node4 node3 ~~~ root + node5 ~~~~ root + node7 ~~~ node6 & node8 + node7 ~~~ root + node9 ~~~ node1 & node3 & node5 & node7 + node9 ~~~ root ``` We use the `=` as a separator (instead of a node with children like `:`) to help @@ -301,15 +328,24 @@ flowchart BT node3[:] node4[i32] node5[=]:::moved + node6[y]:::pending + node7[+]:::pending + node8[1]:::pending + node9[;]:::pending end - classDef hidden visibility:hidden,display:none + classDef pending visibility:hidden classDef moved fill:#0F0,color:#000 + classDef hidden visibility:hidden,display:none node1 ~~~~ root node3 --- node2 & node4 node3 ~~~ root node5 ~~~~ root + node7 ~~~ node6 & node8 + node7 ~~~ root + node9 ~~~ node1 & node3 & node5 & node7 + node9 ~~~ root ``` The expression is a subtree with `+` as the parent, and the two operands as @@ -345,10 +381,12 @@ flowchart BT node6[y]:::moved node7[+]:::moved node8[1]:::moved + node9[;]:::pending end - classDef hidden visibility:hidden,display:none + classDef pending visibility:hidden 
classDef moved fill:#0F0,color:#000 + classDef hidden visibility:hidden,display:none node1 ~~~~ root node3 --- node2 & node4 @@ -356,6 +394,8 @@ flowchart BT node5 ~~~~ root node7 --- node6 & node8 node7 ~~~ root + node9 ~~~ node1 & node3 & node5 & node7 + node9 ~~~ root ``` Finally, the `;` is used as the "root" of the variable declaration. It's @@ -378,8 +418,9 @@ flowchart BT node9[;]:::moved end - classDef hidden visibility:hidden,display:none + classDef pending visibility:hidden classDef moved fill:#0F0,color:#000 + classDef hidden visibility:hidden,display:none node1 ~~~~ root node3 --- node2 & node4 @@ -409,6 +450,8 @@ flowchart BT node9[;] end + classDef pending visibility:hidden + classDef moved fill:#0F0,color:#000 classDef hidden visibility:hidden,display:none node1 ~~~~ root @@ -457,6 +500,8 @@ flowchart BT node9[;] end + classDef used visibility:hidden + classDef moved fill:#0F0,color:#000 classDef hidden visibility:hidden,display:none node1 ~~~~ root From 8ff303168cf3d1bdb65dcac637e476819dcb2e93 Mon Sep 17 00:00:00 2001 From: jonmeow Date: Thu, 29 Aug 2024 14:04:03 -0700 Subject: [PATCH 09/16] One more approach --- toolchain/docs/parse.md | 23 +++++++++++------------ 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/toolchain/docs/parse.md b/toolchain/docs/parse.md index 1f97a0306afad..ef58b505a5c3e 100644 --- a/toolchain/docs/parse.md +++ b/toolchain/docs/parse.md @@ -212,12 +212,6 @@ flowchart BT token9[;] end - classDef used visibility:hidden -``` - -```mermaid -flowchart BT - root:::hidden subgraph nodes["Parsed nodes"] direction BT node1[var]:::moved @@ -231,18 +225,23 @@ flowchart BT node9[;]:::pending end + %% A token which has been used. + classDef used visibility:hidden + %% A node which will be used, but hasn't yet been. classDef pending visibility:hidden + %% A token or node which is actively being used. 
classDef moved fill:#0F0,color:#000 - classDef hidden visibility:hidden,display:none - node1 ~~~~ root + nodes ~~~ tokens + + node1 ~~~~ tokens node3 ~~~ node2 & node4 - node3 ~~~ root - node5 ~~~~ root + node3 ~~~ tokens + node5 ~~~~ tokens node7 ~~~ node6 & node8 - node7 ~~~ root + node7 ~~~ tokens node9 ~~~ node1 & node3 & node5 & node7 - node9 ~~~ root + node9 ~~~ tokens ``` Next, we can consider the pattern binding. Here, `x` is the identifier and `i32` From 9fb8c55db0d87af3449fb459d15dab209749bac6 Mon Sep 17 00:00:00 2001 From: jonmeow Date: Thu, 29 Aug 2024 14:18:21 -0700 Subject: [PATCH 10/16] Fix last approach --- toolchain/docs/parse.md | 157 +++++++++++++++++++++------------------- 1 file changed, 83 insertions(+), 74 deletions(-) diff --git a/toolchain/docs/parse.md b/toolchain/docs/parse.md index ef58b505a5c3e..cc8c2c9a17c35 100644 --- a/toolchain/docs/parse.md +++ b/toolchain/docs/parse.md @@ -174,11 +174,6 @@ var x: i32 = y + 1; Lexing creates distinct tokens for each syntactic element, which will form the basis of the parse tree: - - ```mermaid flowchart BT subgraph tokens["Tokens"] @@ -201,7 +196,7 @@ a variable declaration structure. ```mermaid flowchart BT subgraph tokens["Remaining tokens"] - token1[var]:::used + token1[var]:::moved token2[x] token3[:] token4[i32] @@ -253,9 +248,9 @@ children, it doesn't need to be bracketed. flowchart BT subgraph tokens["Remaining tokens"] token1[var]:::used - token2[x]:::used - token3[:]:::used - token4[i32]:::used + token2[x]:::moved + token3[:]:::moved + token4[i32]:::moved token5[=] token6[y] token7[+] @@ -263,12 +258,6 @@ flowchart BT token9[;] end - classDef used visibility:hidden -``` - -```mermaid -flowchart BT - root:::hidden subgraph nodes["Parsed nodes"] direction BT node1[var] @@ -282,18 +271,23 @@ flowchart BT node9[;]:::pending end + %% A token which has been used. + classDef used visibility:hidden + %% A node which will be used, but hasn't yet been. 
classDef pending visibility:hidden + %% A token or node which is actively being used. classDef moved fill:#0F0,color:#000 - classDef hidden visibility:hidden,display:none - node1 ~~~~ root + nodes ~~~ tokens + + node1 ~~~~ tokens node3 --- node2 & node4 - node3 ~~~ root - node5 ~~~~ root + node3 ~~~ tokens + node5 ~~~~ tokens node7 ~~~ node6 & node8 - node7 ~~~ root + node7 ~~~ tokens node9 ~~~ node1 & node3 & node5 & node7 - node9 ~~~ root + node9 ~~~ tokens ``` We use the `=` as a separator (instead of a node with children like `:`) to help @@ -307,19 +301,13 @@ flowchart BT token2[x]:::used token3[:]:::used token4[i32]:::used - token5[=]:::used + token5[=]:::moved token6[y] token7[+] token8[1] token9[;] end - classDef used visibility:hidden -``` - -```mermaid -flowchart BT - root:::hidden subgraph nodes["Parsed nodes"] direction BT node1[var] @@ -333,18 +321,23 @@ flowchart BT node9[;]:::pending end + %% A token which has been used. + classDef used visibility:hidden + %% A node which will be used, but hasn't yet been. classDef pending visibility:hidden + %% A token or node which is actively being used. 
classDef moved fill:#0F0,color:#000 - classDef hidden visibility:hidden,display:none - node1 ~~~~ root + nodes ~~~ tokens + + node1 ~~~~ tokens node3 --- node2 & node4 - node3 ~~~ root - node5 ~~~~ root + node3 ~~~ tokens + node5 ~~~~ tokens node7 ~~~ node6 & node8 - node7 ~~~ root + node7 ~~~ tokens node9 ~~~ node1 & node3 & node5 & node7 - node9 ~~~ root + node9 ~~~ tokens ``` The expression is a subtree with `+` as the parent, and the two operands as @@ -358,18 +351,12 @@ flowchart BT token3[:]:::used token4[i32]:::used token5[=]:::used - token6[y]:::used - token7[+]:::used - token8[1]:::used + token6[y]:::moved + token7[+]:::moved + token8[1]:::moved token9[;] end - classDef used visibility:hidden -``` - -```mermaid -flowchart BT - root:::hidden subgraph nodes["Parsed nodes"] direction BT node1[var] @@ -383,18 +370,23 @@ flowchart BT node9[;]:::pending end + %% A token which has been used. + classDef used visibility:hidden + %% A node which will be used, but hasn't yet been. classDef pending visibility:hidden + %% A token or node which is actively being used. classDef moved fill:#0F0,color:#000 - classDef hidden visibility:hidden,display:none - node1 ~~~~ root + nodes ~~~ tokens + + node1 ~~~~ tokens node3 --- node2 & node4 - node3 ~~~ root - node5 ~~~~ root + node3 ~~~ tokens + node5 ~~~~ tokens node7 --- node6 & node8 - node7 ~~~ root + node7 ~~~ tokens node9 ~~~ node1 & node3 & node5 & node7 - node9 ~~~ root + node9 ~~~ tokens ``` Finally, the `;` is used as the "root" of the variable declaration. It's @@ -403,7 +395,18 @@ unambiguously bracketed by `var`. ```mermaid flowchart BT - root:::hidden + subgraph tokens["Remaining tokens"] + token1[var]:::used + token2[x]:::used + token3[:]:::used + token4[i32]:::used + token5[=]:::used + token6[y]:::used + token7[+]:::used + token8[1]:::used + token9[;]:::moved + end + subgraph nodes["Parsed nodes"] direction BT node1[var] @@ -417,18 +420,23 @@ flowchart BT node9[;]:::moved end + %% A token which has been used. 
+ classDef used visibility:hidden + %% A node which will be used, but hasn't yet been. classDef pending visibility:hidden + %% A token or node which is actively being used. classDef moved fill:#0F0,color:#000 - classDef hidden visibility:hidden,display:none - node1 ~~~~ root + nodes ~~~ tokens + + node1 ~~~~ tokens node3 --- node2 & node4 - node3 ~~~ root - node5 ~~~~ root + node3 ~~~ tokens + node5 ~~~~ tokens node7 --- node6 & node8 - node7 ~~~ root + node7 ~~~ tokens node9 --- node1 & node3 & node5 & node7 - node9 ~~~ root + node9 ~~~ tokens ``` Thus we have the parse tree: @@ -473,44 +481,45 @@ flowchart BT subgraph tokens["Tokens"] token1[var] token2[x] - token3[:] - token4[i32] + token3[:]:::moved + token4[i32]:::moved token5[=] token6[y] - token7[+] - token8[1] + token7[+]:::moved + token8[1]:::moved token9[;] end -``` -```mermaid -flowchart BT - root:::hidden subgraph nodes["Parsed nodes"] direction BT node1[var] node2[x] - node3[:] - node4[i32] + node3[:]:::moved + node4[i32]:::moved node5[=] node6[y] - node7[+] - node8[1] + node7[+]:::moved + node8[1]:::moved node9[;] end + %% A token which has been used. classDef used visibility:hidden + %% A node which will be used, but hasn't yet been. + classDef pending visibility:hidden + %% A token or node which is actively being used. 
classDef moved fill:#0F0,color:#000 - classDef hidden visibility:hidden,display:none - node1 ~~~~ root + nodes ~~~ tokens + + node1 ~~~~ tokens node3 --- node2 & node4 - node3 ~~~ root - node5 ~~~~ root + node3 ~~~ tokens + node5 ~~~~ tokens node7 --- node6 & node8 - node7 ~~~ root + node7 ~~~ tokens node9 --- node1 & node3 & node5 & node7 - node9 ~~~ root + node9 ~~~ tokens ``` ```mermaid From 13dba6150da2e3ae94e7adf58bf69775442ae1d6 Mon Sep 17 00:00:00 2001 From: jonmeow Date: Thu, 29 Aug 2024 16:31:39 -0700 Subject: [PATCH 11/16] ascii --- toolchain/docs/parse.md | 434 ++++++++++------------------------------ 1 file changed, 111 insertions(+), 323 deletions(-) diff --git a/toolchain/docs/parse.md b/toolchain/docs/parse.md index cc8c2c9a17c35..df614347405c8 100644 --- a/toolchain/docs/parse.md +++ b/toolchain/docs/parse.md @@ -174,69 +174,26 @@ var x: i32 = y + 1; Lexing creates distinct tokens for each syntactic element, which will form the basis of the parse tree: -```mermaid -flowchart BT - subgraph tokens["Tokens"] - token1[var] - token2[x] - token3[:] - token4[i32] - token5[=] - token6[y] - token7[+] - token8[1] - token9[;] - end +```ascii ++-----+ +---+ +---+ +-----+ +---+ +---+ +---+ +---+ +---+ +| var | | x | | : | | i32 | | = | | y | | + | | 1 | | ; | ++-----+ +---+ +---+ +-----+ +---+ +---+ +---+ +---+ +---+ ``` First the `var` keyword is used as a "bracketing" node (VariableIntroducer). When this is seen in a postorder traversal, it tells us to expect the basics of a variable declaration structure. 
-```mermaid -flowchart BT - subgraph tokens["Remaining tokens"] - token1[var]:::moved - token2[x] - token3[:] - token4[i32] - token5[=] - token6[y] - token7[+] - token8[1] - token9[;] - end - - subgraph nodes["Parsed nodes"] - direction BT - node1[var]:::moved - node2[x]:::pending - node3[:]:::pending - node4[i32]:::pending - node5[=]:::pending - node6[y]:::pending - node7[+]:::pending - node8[1]:::pending - node9[;]:::pending - end - - %% A token which has been used. - classDef used visibility:hidden - %% A node which will be used, but hasn't yet been. - classDef pending visibility:hidden - %% A token or node which is actively being used. - classDef moved fill:#0F0,color:#000 - - nodes ~~~ tokens - - node1 ~~~~ tokens - node3 ~~~ node2 & node4 - node3 ~~~ tokens - node5 ~~~~ tokens - node7 ~~~ node6 & node8 - node7 ~~~ tokens - node9 ~~~ node1 & node3 & node5 & node7 - node9 ~~~ tokens +```ascii + +---+ +---+ +-----+ +---+ +---+ +---+ +---+ +---+ + | x | | : | | i32 | | = | | y | | + | | 1 | | ; | + +---+ +---+ +-----+ +---+ +---+ +---+ +---+ +---+ +``` + +```ascii ++-----+ +| var | ++-----+ ``` Next, we can consider the pattern binding. Here, `x` is the identifier and `i32` @@ -244,299 +201,130 @@ is the type expression. The `:` provides a parent node that must always contain two children, the name and type expression. Because it always has two direct children, it doesn't need to be bracketed. -```mermaid -flowchart BT - subgraph tokens["Remaining tokens"] - token1[var]:::used - token2[x]:::moved - token3[:]:::moved - token4[i32]:::moved - token5[=] - token6[y] - token7[+] - token8[1] - token9[;] - end - - subgraph nodes["Parsed nodes"] - direction BT - node1[var] - node2[x]:::moved - node3[:]:::moved - node4[i32]:::moved - node5[=]:::pending - node6[y]:::pending - node7[+]:::pending - node8[1]:::pending - node9[;]:::pending - end - - %% A token which has been used. - classDef used visibility:hidden - %% A node which will be used, but hasn't yet been. 
- classDef pending visibility:hidden - %% A token or node which is actively being used. - classDef moved fill:#0F0,color:#000 - - nodes ~~~ tokens - - node1 ~~~~ tokens - node3 --- node2 & node4 - node3 ~~~ tokens - node5 ~~~~ tokens - node7 ~~~ node6 & node8 - node7 ~~~ tokens - node9 ~~~ node1 & node3 & node5 & node7 - node9 ~~~ tokens +```ascii + +---+ +---+ +---+ +---+ +---+ + | = | | y | | + | | 1 | | ; | + +---+ +---+ +---+ +---+ +---+ +``` + +```ascii + +---+ +-----+ + | x | | i32 | + +---+ +-----+ + | | + +------+------+ + | ++-----+ +---+ +| var | | : | ++-----+ +---+ ``` We use the `=` as a separator (instead of a node with children like `:`) to help indicate the transition from binding to assignment expression, which is important for expression parsing during checking. -```mermaid -flowchart BT - subgraph tokens["Remaining tokens"] - token1[var]:::used - token2[x]:::used - token3[:]:::used - token4[i32]:::used - token5[=]:::moved - token6[y] - token7[+] - token8[1] - token9[;] - end - - subgraph nodes["Parsed nodes"] - direction BT - node1[var] - node2[x] - node3[:] - node4[i32] - node5[=]:::moved - node6[y]:::pending - node7[+]:::pending - node8[1]:::pending - node9[;]:::pending - end - - %% A token which has been used. - classDef used visibility:hidden - %% A node which will be used, but hasn't yet been. - classDef pending visibility:hidden - %% A token or node which is actively being used. 
- classDef moved fill:#0F0,color:#000 - - nodes ~~~ tokens - - node1 ~~~~ tokens - node3 --- node2 & node4 - node3 ~~~ tokens - node5 ~~~~ tokens - node7 ~~~ node6 & node8 - node7 ~~~ tokens - node9 ~~~ node1 & node3 & node5 & node7 - node9 ~~~ tokens +```ascii + +---+ +---+ +---+ +---+ + | y | | + | | 1 | | ; | + +---+ +---+ +---+ +---+ +``` + +```ascii + +---+ +-----+ + | x | | i32 | + +---+ +-----+ + | | + +------+------+ + | ++-----+ +---+ +---+ +| var | | : | | = | ++-----+ +---+ +---+ ``` The expression is a subtree with `+` as the parent, and the two operands as child nodes. -```mermaid -flowchart BT - subgraph tokens["Remaining tokens"] - token1[var]:::used - token2[x]:::used - token3[:]:::used - token4[i32]:::used - token5[=]:::used - token6[y]:::moved - token7[+]:::moved - token8[1]:::moved - token9[;] - end - - subgraph nodes["Parsed nodes"] - direction BT - node1[var] - node2[x] - node3[:] - node4[i32] - node5[=] - node6[y]:::moved - node7[+]:::moved - node8[1]:::moved - node9[;]:::pending - end - - %% A token which has been used. - classDef used visibility:hidden - %% A node which will be used, but hasn't yet been. - classDef pending visibility:hidden - %% A token or node which is actively being used. - classDef moved fill:#0F0,color:#000 - - nodes ~~~ tokens - - node1 ~~~~ tokens - node3 --- node2 & node4 - node3 ~~~ tokens - node5 ~~~~ tokens - node7 --- node6 & node8 - node7 ~~~ tokens - node9 ~~~ node1 & node3 & node5 & node7 - node9 ~~~ tokens +```ascii + +---+ + | ; | + +---+ +``` + +```ascii + +---+ +-----+ +---+ +---+ + | x | | i32 | | y | | 1 | + +---+ +-----+ +---+ +---+ + | | | | + +------+------+ +-----+-----+ + | | ++-----+ +---+ +---+ +---+ +| var | | : | | = | | + | ++-----+ +---+ +---+ +---+ ``` Finally, the `;` is used as the "root" of the variable declaration. It's explicitly tracked as the `;` for a variable declaration so that it's unambiguously bracketed by `var`. 
-```mermaid -flowchart BT - subgraph tokens["Remaining tokens"] - token1[var]:::used - token2[x]:::used - token3[:]:::used - token4[i32]:::used - token5[=]:::used - token6[y]:::used - token7[+]:::used - token8[1]:::used - token9[;]:::moved - end - - subgraph nodes["Parsed nodes"] - direction BT - node1[var] - node2[x] - node3[:] - node4[i32] - node5[=] - node6[y] - node7[+] - node8[1] - node9[;]:::moved - end - - %% A token which has been used. - classDef used visibility:hidden - %% A node which will be used, but hasn't yet been. - classDef pending visibility:hidden - %% A token or node which is actively being used. - classDef moved fill:#0F0,color:#000 - - nodes ~~~ tokens - - node1 ~~~~ tokens - node3 --- node2 & node4 - node3 ~~~ tokens - node5 ~~~~ tokens - node7 --- node6 & node8 - node7 ~~~ tokens - node9 --- node1 & node3 & node5 & node7 - node9 ~~~ tokens +```ascii + +---+ +-----+ +---+ +---+ + | x | | i32 | | y | | 1 | + +---+ +-----+ +---+ +---+ + | | | | + +------+------+ +-----+-----+ + | | ++-----+ +---+ +---+ +---+ +| var | | : | | = | | + | ++-----+ +---+ +---+ +---+ + | | | | + +--------------------+-----+-----------------+-----+ + | + +---+ + | ; | + +---+ ``` -Thus we have the parse tree: - -```mermaid -flowchart BT - root:::hidden - subgraph nodes["Parsed nodes"] - direction BT - node1[var] - node2[x] - node3[:] - node4[i32] - node5[=] - node6[y] - node7[+] - node8[1] - node9[;] - end - - classDef pending visibility:hidden - classDef moved fill:#0F0,color:#000 - classDef hidden visibility:hidden,display:none - - node1 ~~~~ root - node3 --- node2 & node4 - node3 ~~~ root - node5 ~~~~ root - node7 --- node6 & node8 - node7 ~~~ root - node9 --- node1 & node3 & node5 & node7 - node9 ~~~ root -``` +This is the completed parse tree. In storage, this tree will be flat and in postorder. 
Because the order hasn't changed much from the original code, we can do the reordering for postorder with a minimal number of nodes being delayed for later output: it will be linear with respect to the depth of the parse tree. -```mermaid -flowchart BT - subgraph tokens["Tokens"] - token1[var] - token2[x] - token3[:]:::moved - token4[i32]:::moved - token5[=] - token6[y] - token7[+]:::moved - token8[1]:::moved - token9[;] - end - - subgraph nodes["Parsed nodes"] - direction BT - node1[var] - node2[x] - node3[:]:::moved - node4[i32]:::moved - node5[=] - node6[y] - node7[+]:::moved - node8[1]:::moved - node9[;] - end - - %% A token which has been used. - classDef used visibility:hidden - %% A node which will be used, but hasn't yet been. - classDef pending visibility:hidden - %% A token or node which is actively being used. - classDef moved fill:#0F0,color:#000 - - nodes ~~~ tokens - - node1 ~~~~ tokens - node3 --- node2 & node4 - node3 ~~~ tokens - node5 ~~~~ tokens - node7 --- node6 & node8 - node7 ~~~ tokens - node9 --- node1 & node3 & node5 & node7 - node9 ~~~ tokens +**Tokens**: + +```ascii ++-----+ +---+ +---+ +-----+ +---+ +---+ +---+ +---+ +---+ +| var | | x | | : | | i32 | | = | | y | | + | | 1 | | ; | ++-----+ +---+ +---+ +-----+ +---+ +---+ +---+ +---+ +---+ ``` -```mermaid -flowchart BT - subgraph storage["Storage"] - storage1[var] - storage2[x] - storage4[i32]:::moved - storage3[:]:::moved - storage5[=] - storage6[y] - storage8[1]:::moved - storage7[+]:::moved - storage9[;] - end - - classDef moved fill:#0F0,color:#000 +**Parse tree**: + +```ascii + +---+ +-----+ +---+ +---+ + | x | | i32 | | y | | 1 | + +---+ +-----+ +---+ +---+ + | | | | + +------+------+ +-----+-----+ + | | ++-----+ +---+ +---+ +---+ +| var | | : | | = | | + | ++-----+ +---+ +---+ +---+ + | | | | + +--------------------+-----+-----------------+-----+ + | + +---+ + | ; | + +---+ +``` + +**Flattened for storage**: + +```ascii ++-----+ +---+ +-----+ +---+ +---+ +---+ +---+ +---+ +---+ +| 
var | | x | | i32 | | : | | = | | y | | 1 | | + | | ; | ++-----+ +---+ +-----+ +---+ +---+ +---+ +---+ +---+ +---+ ``` The structural concepts of bracketing nodes (`var` and `;`) and parent nodes From b6737fa990f006046f806ee77b49485807b3579e Mon Sep 17 00:00:00 2001 From: Jon Ross-Perkins Date: Tue, 3 Sep 2024 14:06:03 -0700 Subject: [PATCH 12/16] Committing for comparison Co-authored-by: Geoff Romer --- toolchain/docs/parse.md | 14 +++----------- 1 file changed, 3 insertions(+), 11 deletions(-) diff --git a/toolchain/docs/parse.md b/toolchain/docs/parse.md index df614347405c8..698e28af747d3 100644 --- a/toolchain/docs/parse.md +++ b/toolchain/docs/parse.md @@ -291,17 +291,13 @@ changed much from the original code, we can do the reordering for postorder with a minimal number of nodes being delayed for later output: it will be linear with respect to the depth of the parse tree. -**Tokens**: - ```ascii +Tokens: +-----+ +---+ +---+ +-----+ +---+ +---+ +---+ +---+ +---+ | var | | x | | : | | i32 | | = | | y | | + | | 1 | | ; | +-----+ +---+ +---+ +-----+ +---+ +---+ +---+ +---+ +---+ -``` -**Parse tree**: - -```ascii +Parse tree: +---+ +-----+ +---+ +---+ | x | | i32 | | y | | 1 | +---+ +-----+ +---+ +---+ @@ -317,11 +313,7 @@ respect to the depth of the parse tree. 
+---+ | ; | +---+ -``` - -**Flattened for storage**: - -```ascii +Flattened for storage: +-----+ +---+ +-----+ +---+ +---+ +---+ +---+ +---+ +---+ | var | | x | | i32 | | : | | = | | y | | 1 | | + | | ; | +-----+ +---+ +-----+ +---+ +---+ +---+ +---+ +---+ +---+ From 5589b0e120d16865f61789f5b4bcfe11381ffa76 Mon Sep 17 00:00:00 2001 From: jonmeow Date: Tue, 3 Sep 2024 15:02:55 -0700 Subject: [PATCH 13/16] refactor diagrams --- toolchain/docs/parse.md | 188 ++++++++++++++++++++++------------------ 1 file changed, 102 insertions(+), 86 deletions(-) diff --git a/toolchain/docs/parse.md b/toolchain/docs/parse.md index 698e28af747d3..e114373bdf098 100644 --- a/toolchain/docs/parse.md +++ b/toolchain/docs/parse.md @@ -175,9 +175,11 @@ Lexing creates distinct tokens for each syntactic element, which will form the basis of the parse tree: ```ascii -+-----+ +---+ +---+ +-----+ +---+ +---+ +---+ +---+ +---+ -| var | | x | | : | | i32 | | = | | y | | + | | 1 | | ; | -+-----+ +---+ +---+ +-----+ +---+ +---+ +---+ +---+ +---+ +Tokens: + ++-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +| var | | x | | : | | i32 | | = | | y | | + | | 1 | | ; | ++-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ ``` First the `var` keyword is used as a "bracketing" node (VariableIntroducer). @@ -185,12 +187,14 @@ When this is seen in a postorder traversal, it tells us to expect the basics of a variable declaration structure. ```ascii - +---+ +---+ +-----+ +---+ +---+ +---+ +---+ +---+ - | x | | : | | i32 | | = | | y | | + | | 1 | | ; | - +---+ +---+ +-----+ +---+ +---+ +---+ +---+ +---+ -``` +Tokens: + + +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ + | x | | : | | i32 | | = | | y | | + | | 1 | | ; | + +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ + +Parse tree: -```ascii +-----+ | var | +-----+ @@ -202,21 +206,23 @@ two children, the name and type expression. 
Because it always has two direct children, it doesn't need to be bracketed. ```ascii - +---+ +---+ +---+ +---+ +---+ - | = | | y | | + | | 1 | | ; | - +---+ +---+ +---+ +---+ +---+ -``` +Tokens: -```ascii - +---+ +-----+ - | x | | i32 | - +---+ +-----+ - | | - +------+------+ - | -+-----+ +---+ -| var | | : | -+-----+ +---+ + +-----+ +-----+ +-----+ +-----+ +-----+ + | = | | y | | + | | 1 | | ; | + +-----+ +-----+ +-----+ +-----+ +-----+ + +Parse tree: + + +-----+ +-----+ + | x | | i32 | + +-----+ +-----+ + | | + +-------+-------+ + | ++-----+ +-----+ +| var | | : | ++-----+ +-----+ ``` We use the `=` as a separator (instead of a node with children like `:`) to help @@ -224,42 +230,46 @@ indicate the transition from binding to assignment expression, which is important for expression parsing during checking. ```ascii - +---+ +---+ +---+ +---+ - | y | | + | | 1 | | ; | - +---+ +---+ +---+ +---+ -``` +Tokens: -```ascii - +---+ +-----+ - | x | | i32 | - +---+ +-----+ - | | - +------+------+ - | -+-----+ +---+ +---+ -| var | | : | | = | -+-----+ +---+ +---+ + +-----+ +-----+ +-----+ +-----+ + | y | | + | | 1 | | ; | + +-----+ +-----+ +-----+ +-----+ + +Parse tree: + + +-----+ +-----+ + | x | | i32 | + +-----+ +-----+ + | | + +-------+-------+ + | ++-----+ +-----+ +-----+ +| var | | : | | = | ++-----+ +-----+ +-----+ ``` The expression is a subtree with `+` as the parent, and the two operands as child nodes. 
```ascii - +---+ - | ; | - +---+ -``` +Tokens: -```ascii - +---+ +-----+ +---+ +---+ - | x | | i32 | | y | | 1 | - +---+ +-----+ +---+ +---+ - | | | | - +------+------+ +-----+-----+ - | | -+-----+ +---+ +---+ +---+ -| var | | : | | = | | + | -+-----+ +---+ +---+ +---+ + +-----+ + | ; | + +-----+ + +Parse tree: + + +-----+ +-----+ +-----+ +-----+ + | x | | i32 | | y | | 1 | + +-----+ +-----+ +-----+ +-----+ + | | | | + +-------+-------+ +-------+-------+ + | | ++-----+ +-----+ +-----+ +-----+ +| var | | : | | = | | + | ++-----+ +-----+ +-----+ +-----+ ``` Finally, the `;` is used as the "root" of the variable declaration. It's @@ -267,21 +277,23 @@ explicitly tracked as the `;` for a variable declaration so that it's unambiguously bracketed by `var`. ```ascii - +---+ +-----+ +---+ +---+ - | x | | i32 | | y | | 1 | - +---+ +-----+ +---+ +---+ - | | | | - +------+------+ +-----+-----+ - | | -+-----+ +---+ +---+ +---+ -| var | | : | | = | | + | -+-----+ +---+ +---+ +---+ - | | | | - +--------------------+-----+-----------------+-----+ - | - +---+ - | ; | - +---+ +Parse tree: + + +-----+ +-----+ +-----+ +-----+ + | x | | i32 | | y | | 1 | + +-----+ +-----+ +-----+ +-----+ + | | | | + +-------+-------+ +-------+-------+ + | | ++-----+ +-----+ +-----+ +-----+ +| var | | : | | = | | + | ++-----+ +-----+ +-----+ +-----+ + | | | | + +-----------------------+-------+-----------------------+-------+ + | + +-----+ + | ; | + +-----+ ``` This is the completed parse tree. @@ -293,30 +305,34 @@ respect to the depth of the parse tree. 
```ascii Tokens: -+-----+ +---+ +---+ +-----+ +---+ +---+ +---+ +---+ +---+ -| var | | x | | : | | i32 | | = | | y | | + | | 1 | | ; | -+-----+ +---+ +---+ +-----+ +---+ +---+ +---+ +---+ +---+ + ++-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +| var | | x | | : | | i32 | | = | | y | | + | | 1 | | ; | ++-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ Parse tree: - +---+ +-----+ +---+ +---+ - | x | | i32 | | y | | 1 | - +---+ +-----+ +---+ +---+ - | | | | - +------+------+ +-----+-----+ - | | -+-----+ +---+ +---+ +---+ -| var | | : | | = | | + | -+-----+ +---+ +---+ +---+ - | | | | - +--------------------+-----+-----------------+-----+ - | - +---+ - | ; | - +---+ + + +-----+ +-----+ +-----+ +-----+ + | x | | i32 | | y | | 1 | + +-----+ +-----+ +-----+ +-----+ + | | | | + +-------+-------+ +-------+-------+ + | | ++-----+ +-----+ +-----+ +-----+ +| var | | : | | = | | + | ++-----+ +-----+ +-----+ +-----+ + | | | | + +-----------------------+-------+-----------------------+-------+ + | + +-----+ + | ; | + +-----+ + Flattened for storage: -+-----+ +---+ +-----+ +---+ +---+ +---+ +---+ +---+ +---+ -| var | | x | | i32 | | : | | = | | y | | 1 | | + | | ; | -+-----+ +---+ +-----+ +---+ +---+ +---+ +---+ +---+ +---+ + ++-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +| var | | x | | i32 | | : | | = | | y | | 1 | | + | | ; | ++-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ ``` The structural concepts of bracketing nodes (`var` and `;`) and parent nodes From 66fdc469e0185b666ff90345a82025395a2a7917 Mon Sep 17 00:00:00 2001 From: jonmeow Date: Tue, 3 Sep 2024 15:04:04 -0700 Subject: [PATCH 14/16] empty --- toolchain/docs/parse.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/toolchain/docs/parse.md b/toolchain/docs/parse.md index e114373bdf098..79c3a1c523dbf 100644 --- a/toolchain/docs/parse.md +++ b/toolchain/docs/parse.md @@ -277,6 +277,12 @@ explicitly tracked as the 
`;` for a variable declaration so that it's unambiguously bracketed by `var`. ```ascii +Tokens: + + + + + Parse tree: +-----+ +-----+ +-----+ +-----+ From b645d2b1c10105d63169bc6ba18eddfc5827044b Mon Sep 17 00:00:00 2001 From: jonmeow Date: Tue, 3 Sep 2024 15:07:06 -0700 Subject: [PATCH 15/16] pre --- toolchain/docs/parse.md | 56 ++++++++++++++++++++--------------------- 1 file changed, 28 insertions(+), 28 deletions(-) diff --git a/toolchain/docs/parse.md b/toolchain/docs/parse.md index 79c3a1c523dbf..d363dd0542647 100644 --- a/toolchain/docs/parse.md +++ b/toolchain/docs/parse.md @@ -174,45 +174,45 @@ var x: i32 = y + 1; Lexing creates distinct tokens for each syntactic element, which will form the basis of the parse tree: -```ascii -Tokens: +
+Tokens:
 
 +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
 | var | |  x  | |  :  | | i32 | |  =  | |  y  | |  +  | |  1  | |  ;  |
 +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
-```
+
 
 First the `var` keyword is used as a "bracketing" node (VariableIntroducer).
 When this is seen in a postorder traversal, it tells us to expect the basics of
 a variable declaration structure.
 
-```ascii
-Tokens:
+
+Tokens:
 
         +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
         |  x  | |  :  | | i32 | |  =  | |  y  | |  +  | |  1  | |  ;  |
         +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
 
-Parse tree:
+Parse tree:
 
 +-----+
 | var |
 +-----+
-```
+
 
 Next, we can consider the pattern binding. Here, `x` is the identifier and `i32`
 is the type expression. The `:` provides a parent node that must always contain
 two children, the name and type expression. Because it always has two direct
 children, it doesn't need to be bracketed.
 
-```ascii
-Tokens:
+
+Tokens:
 
                                 +-----+ +-----+ +-----+ +-----+ +-----+
                                 |  =  | |  y  | |  +  | |  1  | |  ;  |
                                 +-----+ +-----+ +-----+ +-----+ +-----+
 
-Parse tree:
+Parse tree:
 
         +-----+ +-----+
         |  x  | | i32 |
@@ -223,20 +223,20 @@ Parse tree:
 +-----+                 +-----+
 | var |                 |  :  |
 +-----+                 +-----+
-```
+
 
 We use the `=` as a separator (instead of a node with children like `:`) to help
 indicate the transition from binding to assignment expression, which is
 important for expression parsing during checking.
 
-```ascii
-Tokens:
+
+Tokens:
 
                                         +-----+ +-----+ +-----+ +-----+
                                         |  y  | |  +  | |  1  | |  ;  |
                                         +-----+ +-----+ +-----+ +-----+
 
-Parse tree:
+Parse tree:
 
         +-----+ +-----+
         |  x  | | i32 |
@@ -247,19 +247,19 @@ Parse tree:
 +-----+                 +-----+ +-----+
 | var |                 |  :  | |  =  |
 +-----+                 +-----+ +-----+
-```
+
 
 The expression is a subtree with `+` as the parent, and the two operands as
 child nodes.
 
-```ascii
-Tokens:
+
+Tokens:
 
                                                                 +-----+
                                                                 |  ;  |
                                                                 +-----+
 
-Parse tree:
+Parse tree:
 
         +-----+ +-----+                 +-----+ +-----+
         |  x  | | i32 |                 |  y  | |  1  |
@@ -270,20 +270,20 @@ Parse tree:
 +-----+                 +-----+ +-----+                 +-----+
 | var |                 |  :  | |  =  |                 |  +  |
 +-----+                 +-----+ +-----+                 +-----+
-```
+
 
 Finally, the `;` is used as the "root" of the variable declaration. It's
 explicitly tracked as the `;` for a variable declaration so that it's
 unambiguously bracketed by `var`.
 
-```ascii
-Tokens:
+
+Tokens:
 
 
 
 
 
-Parse tree:
+Parse tree:
 
         +-----+ +-----+                 +-----+ +-----+
         |  x  | | i32 |                 |  y  | |  1  |
@@ -300,7 +300,7 @@ Parse tree:
                                                                 +-----+
                                                                 |  ;  |
                                                                 +-----+
-```
+
 
 This is the completed parse tree.
 
@@ -309,14 +309,14 @@ changed much from the original code, we can do the reordering for postorder with
 a minimal number of nodes being delayed for later output: it will be linear with
 respect to the depth of the parse tree.
 
-```ascii
-Tokens:
+
+Tokens:
 
 +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
 | var | |  x  | |  :  | | i32 | |  =  | |  y  | |  +  | |  1  | |  ;  |
 +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
 
-Parse tree:
+Parse tree:
 
         +-----+ +-----+                 +-----+ +-----+
         |  x  | | i32 |                 |  y  | |  1  |
@@ -334,12 +334,12 @@ Parse tree:
                                                                 |  ;  |
                                                                 +-----+
 
-Flattened for storage:
+Flattened for storage:
 
 +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
 | var | |  x  | | i32 | |  :  | |  =  | |  y  | |  1  | |  +  | |  ;  |
 +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
-```
+
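
The "flattened for storage" row in the diagram above is the postorder traversal of the completed parse tree. As a minimal sketch, assuming a pointer-based tree that the toolchain itself may not use, the ordering can be reproduced as follows. `Node`, `Leaf`, `Parent`, `Flatten`, and `FlattenExample` are hypothetical names for illustration only and are not part of `Parse::Tree`.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node type for illustration; not the toolchain's representation.
struct Node {
  std::string text;
  std::vector<std::unique_ptr<Node>> children;
};

// Creates a node with no children, such as `var`, `x`, or `=`.
auto Leaf(std::string text) -> std::unique_ptr<Node> {
  auto node = std::make_unique<Node>();
  node->text = std::move(text);
  return node;
}

// Creates a node with exactly two children, such as `:` or `+`.
auto Parent(std::string text, std::unique_ptr<Node> lhs,
            std::unique_ptr<Node> rhs) -> std::unique_ptr<Node> {
  auto node = Leaf(std::move(text));
  node->children.push_back(std::move(lhs));
  node->children.push_back(std::move(rhs));
  return node;
}

// Appends the subtree rooted at `node` in postorder: every child subtree is
// written before the node itself.
void Flatten(const Node& node, std::vector<std::string>& out) {
  for (const auto& child : node.children) {
    assert(child != nullptr);
    Flatten(*child, out);
  }
  out.push_back(node.text);
}

// Builds the tree drawn above for `var x: i32 = y + 1;` and flattens it.
auto FlattenExample() -> std::vector<std::string> {
  auto root = Leaf(";");
  root->children.push_back(Leaf("var"));
  root->children.push_back(Parent(":", Leaf("x"), Leaf("i32")));
  root->children.push_back(Leaf("="));
  root->children.push_back(Parent("+", Leaf("y"), Leaf("1")));

  std::vector<std::string> out;
  Flatten(*root, out);
  return out;
}
```

Calling `FlattenExample()` yields `var`, `x`, `i32`, `:`, `=`, `y`, `1`, `+`, `;`, matching the flattened row. The recursion is only a way to check the ordering; the surrounding text describes producing it during parsing by delaying a small number of nodes.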
The structural concepts of bracketing nodes (`var` and `;`) and parent nodes with a known child count (`:` and `+` with 2 children, but also `=` with 0 From e32817d6eca662262a1bb8e9655b28b31d996d60 Mon Sep 17 00:00:00 2001 From: jonmeow Date: Tue, 3 Sep 2024 15:10:30 -0700 Subject: [PATCH 16/16] blanks --- toolchain/docs/parse.md | 30 ++++++++++++++++++++++++++++++ 1 file changed, 30 insertions(+) diff --git a/toolchain/docs/parse.md b/toolchain/docs/parse.md index d363dd0542647..397fc208739b3 100644 --- a/toolchain/docs/parse.md +++ b/toolchain/docs/parse.md @@ -195,9 +195,21 @@ a variable declaration structure. Parse tree: + + + + + + +-----+ | var | +-----+ + + + + + + Next, we can consider the pattern binding. Here, `x` is the identifier and `i32` @@ -223,6 +235,12 @@ children, it doesn't need to be bracketed. +-----+ +-----+ | var | | : | +-----+ +-----+ + + + + + + We use the `=` as a separator (instead of a node with children like `:`) to help @@ -247,6 +265,12 @@ important for expression parsing during checking. +-----+ +-----+ +-----+ | var | | : | | = | +-----+ +-----+ +-----+ + + + + + + The expression is a subtree with `+` as the parent, and the two operands as @@ -270,6 +294,12 @@ child nodes. +-----+ +-----+ +-----+ +-----+ | var | | : | | = | | + | +-----+ +-----+ +-----+ +-----+ + + + + + + Finally, the `;` is used as the "root" of the variable declaration. It's