Process comments (and extra nodes) separately from the rest of the code #500

nbacquey · 2023-06-07T14:02:13Z

This PR is an attempt at processing the comments separately from the rest of the code, so that formatting queries can be both resilient to the presence of random comments, and simple to write.

Identifying comments

This is a challenge in itself:

- We can't use the same mechanism we use for leaf identification, i.e. a formatting directive in the query file that would look like `(block_comment) @comment`. Indeed, it would mean either running the query twice, or accounting for the presence of comments in the rest of the query file, both of which we want to avoid.
- The current implementation selects comment nodes from the CST with the following heuristic:
  `node.is_extra() && node.kind().to_string().contains("comment")`. I think it should capture all types of comments for all supported languages, but I haven't checked that yet.
- Another option would be to have a separate query file for comments, one for each language, which would contain selecting and formatting directives for comments. It could also contain instructions on how to re-insert comments back into the code once it's formatted.
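
To make the heuristic concrete, here is a minimal sketch of the predicate. `NodeInfo` is a hypothetical stand-in for the two node properties the heuristic reads (it is not the tree-sitter `Node` type), and the kind names in the usage note are only illustrative.

```rust
/// Hypothetical stand-in for the two properties of a CST node that the
/// heuristic inspects. This is NOT the tree-sitter `Node` type.
struct NodeInfo<'a> {
    kind: &'a str,
    is_extra: bool,
}

/// The heuristic from the description: a node counts as a comment when
/// the grammar marks it as "extra" and its kind contains "comment".
fn looks_like_comment(node: &NodeInfo) -> bool {
    node.is_extra && node.kind.contains("comment")
}
```

With this predicate, an extra node of kind `block_comment` is selected, while an extra node whose kind doesn't mention "comment" is skipped. That is the weak spot of a name-based check.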

Anchoring comments

Each comment should be anchored to a non-comment node of the code. Ideally, it should be the node it's supposed to comment, but such semantics can't be deduced from the CST alone. This PR uses the following heuristics instead:

- If the comment node is alone on its line, anchor it to its next sibling in the CST. If it's the last of its siblings, anchor it to its previous sibling instead.
- If the comment node isn't alone on its line, anchor it to its previous sibling in the CST. If it's the first of its siblings, anchor it to its next sibling instead.
- I haven't thought about what it would mean for a comment node to be an only child in the CST, nor whether that case can actually happen. In that case, it would probably be a safe bet to anchor it to its parent node.
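
The rules above can be written as a small decision function. This is only a sketch of the heuristic as described, not the code in this PR: `Anchor`, `CommentContext` and `choose_anchor` are hypothetical names, and "sibling" is assumed to mean a non-comment sibling.

```rust
/// Where a comment gets anchored (hypothetical type, for illustration).
#[derive(Debug, PartialEq)]
enum Anchor {
    NextSibling,
    PreviousSibling,
    Parent,
}

/// Hypothetical summary of a comment's surroundings in the CST.
struct CommentContext {
    alone_on_line: bool,
    has_previous_sibling: bool,
    has_next_sibling: bool,
}

fn choose_anchor(ctx: &CommentContext) -> Anchor {
    match (ctx.alone_on_line, ctx.has_previous_sibling, ctx.has_next_sibling) {
        // Only child: fall back to the parent node.
        (_, false, false) => Anchor::Parent,
        // Alone on its line: prefer the next sibling...
        (true, _, true) => Anchor::NextSibling,
        // ...unless the comment is the last of its siblings.
        (true, true, false) => Anchor::PreviousSibling,
        // Shares its line with code: prefer the previous sibling...
        (false, true, _) => Anchor::PreviousSibling,
        // ...unless the comment is the first of its siblings.
        (false, false, true) => Anchor::NextSibling,
    }
}
```

Writing it as an exhaustive `match` means the compiler checks that every combination, including the only-child case, has an explicit outcome.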

Extracting comments

Tree-sitter grammars and queries offer few tools for editing an existing CST. The only reasonable way I've found is to use `input.replace_range()` to remove the comment's bytes from the input, then `tree.edit()` to mark a node as edited in the CST, and finally `reparse(old_tree, new_input, grammar)` to get the new CST, without comments. The query file would then be applied to this new CST.
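
As a rough illustration of the first step only, the byte removal can be done with `String::replace_range` from the standard library. `strip_ranges` is a hypothetical helper; it assumes the ranges don't overlap and fall on character boundaries. The `tree.edit()` and reparse steps are left out because they need the tree-sitter crate.

```rust
use std::ops::Range;

/// Removes the given byte ranges from `input` and returns the result.
/// Assumes the ranges are non-overlapping and lie on char boundaries
/// (otherwise `replace_range` panics).
fn strip_ranges(input: &str, ranges: &[Range<usize>]) -> String {
    let mut sorted: Vec<Range<usize>> = ranges.to_vec();
    sorted.sort_by_key(|r| r.start);

    let mut output = input.to_string();
    // Remove from the back so that earlier byte offsets stay valid.
    for range in sorted.into_iter().rev() {
        output.replace_range(range, "");
    }
    output
}
```

For example, removing bytes `2..9` from `"a /* c */ b"` leaves `"a  b"`. The surrounding whitespace is untouched and would be handled by the formatter later.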

Re-inserting comments

This is a part I haven't had time to experiment with, but I think re-insertion should be done after processing all append/prepend directives, but before post-processing. I imagine something like this should work:

- If the anchor was before the comment, re-insert with `(anchor).(space).(comment)` (line breaks should already have been taken care of).
- If the anchor was after the comment, re-insert with `(comment).(line break).(anchor)`.
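
A string-level sketch of these two rules, to show the intended output shape. The real thing would presumably operate on formatting atoms rather than strings; `AnchorSide` and `reinsert` are hypothetical names.

```rust
/// Which side of the comment its anchor was on in the original input
/// (hypothetical type, for illustration).
enum AnchorSide {
    Before,
    After,
}

/// Joins an already formatted anchor with its comment.
fn reinsert(anchor: &str, comment: &str, side: AnchorSide) -> String {
    match side {
        // (anchor).(space).(comment)
        AnchorSide::Before => format!("{anchor} {comment}"),
        // (comment).(line break).(anchor)
        AnchorSide::After => format!("{comment}\n{anchor}"),
    }
}
```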

State of the PR

314eter · 2023-06-08T01:37:07Z

In order to fix #430, this should be done for OCaml attributes too. They are not included with the current heuristic.

Attributes can appear almost everywhere in OCaml. To avoid overly complicating the grammar, tree-sitter-ocaml really allows them everywhere, even in some invalid places. If you replace the comments in the example of #489 with attributes, the idempotency rule will still be violated (the input will be parseable by tree-sitter, but the result will not be). But since the code was invalid OCaml to begin with, that's not a huge problem.

ErinvanderVeen · 2023-06-15T12:43:46Z

I think this heuristic can be avoided by using the languages.toml file.

Xophmeister

I mostly follow this and it feels like it's on the right track. I did a commit-wise review, so some of my suggestions against earlier commits may no longer apply, or apply elsewhere.

I appreciate the modularisation of comments and types and the comments you added to the source to describe what's going on. This is quite a tricky operation, so those comments are super-encouraged to save our future selves' sanity 😅

topiary-core/src/tree_sitter.rs

topiary-core/src/comments.rs

If there were the following atoms in succession: `[Space, Hardline, IndentStart]` then the function would process them as `[Empty, Hardline, IndentStart]` instead of `[Empty, IndentStart, Hardline]` because the whitespace collation code would skip a step.

Instead of using ad hoc logic to identify comment nodes, this commit uses a separate query file.

Xophmeister

This is as awesome as it is epic! Amazing work 🙇

I've gone through this pretty thoroughly, but I must confess to not fully understanding everything. That said, you've named things very clearly, so I've used those names as a crutch wherever I don't follow. I've added comments if/when this "method" breaks down a bit.
The PR description is very clear and I think -- if updated to reflect some of the changes since it was written -- it would make for good documentation in the README. (I've added some notes about the PR description below.)
With regard to line-embedded comments /* like this */ what's the reason why they can't be anchored to the node (or the sibling node) immediately before the comment?

(Similarly, for "orphaned" comments: Can they just assume the formatting of the current context, appended with a hardline, or is that not a safe fallback?)
Meta: It's a real shame that Tree-sitter doesn't have a way, AFAIK, to introspect grammars. It would be really nice if one could just get a list of all "extra" nodes in any grammar, then you wouldn't need to bother with the language.comment.scm files... Perhaps worth an issue/feature request in the Tree-sitter repo 🤷
Meta: @Niols, would you be able to experiment with this branch against your OCaml codebase(s), paying attention to how it affects comment formatting? It would be really interesting to see how it plays with real world code.

Notes on the PR description, so it can be converted into documentation easily:

Identifying comments

It would be good to expand on why this isn't ideal, for the sake of the documentation:
- We can't use the same mechanism we use for leaf identification, i.e. a formatting directive in the query file that would look like (block_comment) @comment. Indeed, it would mean either running the query twice, or accounting for the presence of comments in the rest of the query file, both situations we want to avoid.
Update this to match that you've gone with the second option; that is, comment query files:
- The current implementation selects comment nodes from the CST with the following heuristics:
  node.is_extra() && node.kind().to_string().contains("comment"). I think it should capture all types of comments for all supported languages, but I haven't checked that yet.
- Another option would be to have a separate query file for comments, one for each language, which would contain selecting and formatting directives for comments. It could also contain instructions on how to re-insert comments back in the code, once it's formatted.

Anchoring comments

Your heuristics make perfect sense to me and seem eminently sensible. Some ASCII-art diagrams for the first two cases would make this section ideal for the documentation.

It would also be worth documenting:
- the interleaved comment edge case (modulo my question, above);
- the case that the OCaml sample exposes, when comments are put after their node.
Presumably, the second sentence in this heuristic isn't implemented:
- If the comment node is alone on its line, anchor it to its next sibling in the CST. If it's the last of its siblings, anchor it to its previous sibling instead.
That's why we see final child comments losing their indent level, for example.
It looks like you bail out in the only child case:
- I haven't thought at what it would mean for a comment node to be an only child in the CST, nor if that case can actually happen. In that case, it would probably be a safe bet to anchor it to its parent node.
```
$ cargo run -- fmt -lbash <<'SH'
> (
>   # foo
> )
> SH
[2025-01-13T16:10:48Z ERROR topiary] Found an anchored comment, but could not attach it back to its anchor
    # foo
    The anchor was CommentedAfter { section: InputSection { start: Position { row: 2, column: 3 }, end: Position { row: 2, column: 3 } }, blank_line_after: false, blank_line_before: false }
```
I agree with you that instances of only child comments are likely to be rare. Perhaps an option could be to just assume the formatting imposed by the parent as a fallback. Does that even make sense?

Re-inserting comments

I think this reads well; it just needs to be tweaked to change from the subjunctive to the indicative.

Xophmeister · 2025-01-13T16:31:38Z

topiary-cli/tests/samples/expected/rust.rs

 }

 // Empty block inside of impl function
 impl MyTrait for MyStruct {
    fn foo() {
-        // ... logic ...
+    // ... logic ...


Is this not a case of an only child comment? How come this one doesn't bail out?

Xophmeister · 2025-01-13T16:34:31Z

topiary-cli/tests/samples/expected/tree_sitter_query.scm

+; just doing _ above doesn't work, because it matches the final named node as
+; well as the final non-named node, causing double indentation.


Shouldn't the second sentence of this part of the heuristic happen?

If the comment node is alone on its line, anchor it to its next sibling in the CST. If it's the last of its siblings, anchor it to its previous sibling instead.

I haven't looked at the implementation yet, so perhaps that part is not done.

Xophmeister · 2025-01-13T16:39:54Z

topiary-queries/queries/ocaml.comment.scm

@@ -0,0 +1,2 @@
+; Identify nodes for comment processing
+(comment) @comment


I guess this PR is designed specifically for comments, but I wonder if it can apply more generally to other extra nodes, like OCaml attributes? @314eter's comment suggests that the heuristic you've defined wouldn't be sufficient, so perhaps not...

Xophmeister · 2025-01-13T16:55:35Z

topiary-cli/src/io.rs

                        // The user specified a query file
-                        Some(p) => p,
+                        Some(p) => (p, None),


So if `--query` is used, then the comment query is `None`? That feels a bit surprising and ought to be clearly documented.

Alternatively, what do you think about: If the query file is `p`, then the comment query file is:

- `Some(q)`, where `q := replace(p, '${language}.scm', '${language}.comment.scm')`, if `q` exists.
- `None`, otherwise.

Similar to what you do in topiary-config/src/language.rs... Too magical?
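
Roughly what I have in mind, as a sketch using only `std::path`. Both function names are hypothetical, and I'm assuming the query file is named exactly `<language>.scm`.

```rust
use std::path::{Path, PathBuf};

/// Candidate comment-query path derived from the query path:
/// `<dir>/<language>.scm` becomes `<dir>/<language>.comment.scm`.
/// Returns `None` when the file isn't named `<language>.scm`.
fn comment_query_candidate(query: &Path, language: &str) -> Option<PathBuf> {
    let expected = format!("{language}.scm");
    if query.file_name()?.to_str()? != expected.as_str() {
        return None;
    }
    Some(query.with_file_name(format!("{language}.comment.scm")))
}

/// `Some(q)` only if the candidate actually exists on disk.
fn comment_query_for(query: &Path, language: &str) -> Option<PathBuf> {
    comment_query_candidate(query, language).filter(|q| q.exists())
}
```

Keeping the path derivation separate from the existence check makes the "magic" part easy to test without touching the filesystem.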

Xophmeister · 2025-01-13T17:12:20Z

topiary-core/src/language.rs

@@ -13,6 +13,9 @@ pub struct Language {
    /// The Query Topiary will use to get the formating captures, must be
    /// present. The topiary engine does not include any formatting queries.
    pub query: TopiaryQuery,
+    /// The Query Topiary will use to determine which nodes are comments.
+    /// When missing, ther ewill be no separate comment processing.


Suggested change

/// When missing, ther ewill be no separate comment processing.

/// When missing, there will be no separate comment processing.

Xophmeister · 2025-01-14T12:02:52Z