feat: Add comments to format output #4397

max-sixty · 2024-04-20T21:27:44Z

This needs some work, as it's really hacky. But wanted to get back into making progress. Will refine before merging.

Any feedback welcome; comments are inline describing a couple of the issues...

max-sixty · 2024-04-20T22:30:12Z

One basic problem here is that we're trying to use PL. But that doesn't actually fit with our approach for using comments.

For example, if a binary expression has a comment within it:

from foo
derive bar = x + # comment
  \ y

...it can't fit into our current approach, because the code for writing the binary expr doesn't allow writing any comments in the middle of that binary expression:

prql/prqlc/prqlc/src/codegen/ast.rs

Lines 109 to 121 in c7f1235

    
           Binary(BinaryExpr { op, left, right }) => { 
        
               let mut r = String::new(); 
        
               let left = write_within(left.as_ref(), self, opt.clone())?; 
        
               r += opt.consume(&left)?; 
        
               r += opt.consume(" ")?; 
        
               r += opt.consume(&op.to_string())?; 
        
               r += opt.consume(" ")?; 
        
               r += &write_within(right.as_ref(), self, opt)?; 
        
               Some(r) 
        
           }

So we may need to rethink the current approach — for example something that iterates through the lexer output, and looks up their meaning in the PL.

aljazerzen · 2024-05-06T09:40:54Z

I don't see a problem here. The comment would be emitted by the write_within(right).

max-sixty · 2024-05-06T17:05:09Z

I don't see a problem here. The comment would be emitted by the write_within(right).

But how would the example above be written? Currently BinaryOp is written as a single item — how would we write a comment within a BinaryOp?

aljazerzen · 2024-05-06T19:14:53Z

I was thinking something like this:

     // writing out BinOp(x, +, y)

     let mut r = String::new(); 
  
     let left = write_within(left.as_ref(), self, opt.clone())?;
     r += opt.consume(&left)?;            // writes "x"
  
     r += opt.consume(" ")?;              // writes " "
     r += opt.consume(&op.to_string())?;  // writes "+"
     r += opt.consume(" ")?;              // writes " "
  
     r += &write_within(right.as_ref(), self, opt)?; // writes "# comment\n \ y"
     Some(r)

max-sixty · 2024-05-06T19:26:36Z

My question is — what if the comment is in the middle of these? For example:

     r += opt.consume(" ")?;              // writes " "
     r += opt.consume(&op.to_string())?;  // writes "+"
     // What if there's a comment here??
     r += opt.consume(" ")?;              // writes " "

aljazerzen · 2024-05-07T11:31:12Z

Oh I see.

A way back when we talked about emitting comments, I imagined to have a wrapper around expr codegen that would:

compare the span of the expr and span of the next comment that needs to be printed,
if expr span is after the comment, this means that the comment should actually come before the expr,
in this case we consume the comment and write it before the expr.

Your case would then look like:

comments:
- 26:35 '# comment'

ast:
...
  - BinOp:
      left: Ident: x
      op: +
      Ident: y

     // writing out BinOp(x, +, y)
     let mut r = String::new(); 
  
     let left = write_within(left.as_ref(), self, opt.clone())?; // this wrapper would see the next comment, but determine that is not ready to be written
     r += opt.consume(&left)?;            // writes "x"
  
     r += opt.consume(" ")?;              // writes " "
     r += opt.consume(&op.to_string())?;  // writes "+"
     r += opt.consume(" ")?;              // writes " "
  
     r += &write_within(right.as_ref(), self, opt)?; // this wrapper would see the next comment and see that it appears before `y`, so it should be written before `y`
     Some(r)

max-sixty · 2024-05-07T19:35:02Z

~~Yes, good idea — that could work — though at the cost of not honoring the original placement of comments or line-wraps. So:~~

from foo
derive bar = x + # comment
  \ y

~~...becomes...~~

from foo
# comment
derive bar = x + y

What do you think the best path forward is? This seems like a tractable change which could let us land prql fmt. Should I try and implement this so we have something that works for the moment? I do find it a bit annoying when auto-formatting stops me expressing something (I think there's a TOML formatter out there which moves comment around!), but it wouldn't be terrible.

Whereas rewriting this to iterate through tokens — which I think is the closest approach which honors exact comment / line-wrap placement — might be a big rewrite...

aljazerzen · 2024-05-07T19:55:56Z

I don't think you understood me: approach I've described would not move comments "trough" expressions. In your example it would keep the comment before y.

It would however sometimes move it "trough" tokens that are not full expressions (+ for example).

aljazerzen · 2024-05-07T19:59:13Z

My general approach would be to make something, even if we later decide that it needs to be rewritten for whatever reason.

But this quite a hard thing to do in general, so if it not something you would enjoy doing, there are things that would be more beneficial to the project. At least that's my perception of importance.

max-sixty · 2024-05-07T20:25:58Z

I don't think you understood me: approach I've described would not move comments "trough" expressions. In your example it would keep the comment before y.

It would however sometimes move it "trough" tokens that are not full expressions (+ for example).

Sorry, you're right, my example wasn't good.

A better one is from

prql/prqlc/prqlc/src/codegen/ast.rs

Line 372 in c7f1235

r += opt.consume(&format!("let {} {}", var_def.name, typ))?;

let foo # comment
  \ <int> = bar

...but this is very esoteric.

OK great, thanks, let's go ahead with this! I will work on it

for more information, see https://pre-commit.ci

max-sixty · 2024-05-12T23:20:40Z

I think this mostly "works", but is extremely bad code, some of the worst I've written :). I simplified the before / after comment functions, but this means that I need to do a quick change to allow comments that appear before any tokens to work.

One major issue is that in:

from employees  # comment

...from employees and employees are both expressions, both of which are followed by # comment. Because we write comments when they follow an expression we're writing, we somehow need to tell one of these expressions "don't write the comment after yourself".

I'm not sure the best way to do that — I currently have some really hacky flag in the opt struct which gets turned off for expressions within another expression.

It's also really inefficient, which isn't super important at this stage, but as an FYI.

I'm still thinking that iterating through the tokens and deciding where to put them based on looking them up in the PL would be a better model. But as above — that probably requires a big rewrite, and I'd be quite keen to land something.

aljazerzen · 2024-05-13T20:30:25Z

prqlc/prqlc/src/codegen/ast.rs

+        // FIXME: why isn't this working?
+        // .position(|t| t.1.start == span.start && t.1.end == span.end)


Because if expression 1 + x has span=0..5, the tokens will have spans 1: 0..1 +: 2..3 x: 4..5

aljazerzen · 2024-05-13T20:39:12Z

prqlc/prqlc/src/codegen/ast.rs

+        .unwrap_or_else(|| panic!("{:?}, {:?}", &tokens, &span));
+
+    let mut out = vec![];
+    for token in tokens.0.iter().skip(index + 1) {


You probably want this here:

tokens.0.iter().skip_while(|t| t.span.start < span.end)

... skipping all tokens that were before or within the current expression.

aljazerzen · 2024-05-13T20:40:25Z

prqlc/prqlc/src/codegen/ast.rs

+        match token.kind {
+            TokenKind::NewLine | TokenKind::Comment(_) => out.push(token.clone()),
+            _ => break,
+        }


oh, and this is:

.take_while(|token| matches!(token.kind, TokenKind::NewLine | TokenKind::Comment(_))

aljazerzen · 2024-05-13T20:42:33Z

prqlc/prqlc/src/lib.rs

+    assert_snapshot!(format_prql( r#"
+        from employees 
+        # test comment
+        select {name}
+        "#
+    ).unwrap(), @r###"
+    from employees
+    # test comment
+
+    select {name}
+
+    "###);
+
+    assert_snapshot!(format_prql( r#"
+        # test comment
+        from employees # inline comment
+        # another test comment
+        select {name}"#
+    ).unwrap(), @r###"
+    from employees  # inline comment
+    # another test comment
+
+    select {name}
+    "###);


Nice, this is got me excited about this feature!

aljazerzen · 2024-05-13T20:44:22Z

prqlc/prqlc/src/codegen/mod.rs

+    /// The lexer tokens that were used to produce this source; used for
+    /// comments.
+    pub tokens: TokenVec,


Just a question if we could prematurely optimize this:

Do we need all of the tokens? Wouldn't just comments (and new lines maybe) be enough to get the same functionality?

aljazerzen · 2024-05-19T07:54:29Z

A heads up: the WriteOpt is not meant to be a mutable "singleton context" but as a clonable object that changes with each call.

It was supposed to describe in what conditions something should be written in:

A few examples:

|----- this is the length of the whole line -----|
let a_long_name = |-- this is remaining width ---|
^
 \___ this is current indentation level

|----- this is the length of the whole line -----|
let my_func = func param_a rel -> (
  rel
  filter this.x == |-- this is remaining width --|
  ^
   \___ this is current indentation level

An important notion here is that when you enter an context of, for example, parenthesis, the new indentation is present only within the parenthesis. When that call ends, the indentation needs to be reset. That's why it is convenient that these WriteOpts are cloned when entering a context, so parent context cannot be affected.

I assume you wanted to have global place for the tokens vec so it is not cloned all the time. I suggest you use Rc<TokensVec> instead. It is cheap to clone but does not allow mutation.

max-sixty · 2024-05-19T19:20:42Z

I assume you wanted to have global place for the tokens vec so it is not cloned all the time. I suggest you use Rc<TokensVec> instead. It is cheap to clone but does not allow mutation.

It wasn't for the perf! It was because I think need a way to hold state, to pass down whether we should be grabbing comments from before the expr. We only want to do that for the first expr of the statement, and need to tell the other exprs that they shouldn't look before themselves, only after¹.

I think possibly we should iterate through the tokens, but then defer to this when we hit something that isn't a comment. That way it'll have decent perf and a tractable way of handling tokens that aren't in the PL — seems like a reasonable way to get something decent without a big rewrite. Lmk if I'm missing some obvious way of doing this though...

The only comments that gets missed are those that come before the first expr, all others are covered by the rule "write comments after yourself, unless you're the last expr in a nested context, because your parent has written those" ↩

aljazerzen · 2024-05-21T12:18:11Z

Yup, iterating trough the tokens sounds like a good idea. It would benefit the perf and would also take care of the problem you describe where child exprs find comments already consumed by parent expr.

max-sixty · 2024-05-29T06:56:10Z

OK so I did some light research and one approach seems close to our current one, and will avoid the current issues in the PR:

Iterate over the tokens, and add each token to a node in the PL AST. That way, we don't have the ambiguity of whether the comment in a + b # c comes after a + b or after b; we add it to one of them in the pass over the tokens.

It does mean that we either need:

a spare field in the PL AST which we mutate
another struct which holds PL Expr (and can hold itself for nested structs)
a HashMap-type object which we use as a lookup

The HashMap seems like the easiest to do; though also the least inspectable and the least easy to move to the full "iterate over both the PL and the tokens". I'll probably do that....

This is an attempt to get around the complications of managing lexer + parser output, which PRQL#4397 has hit in a few incarnations by just adding comments ('aesthetics') to PL. This very very nearly works -- with chumsky we can create a function that wraps anything that might have a comment, implement a trait on the AST items that contain it, and away we go (though it did require a lot of debugging in the end). This would then be really easy to write back out. I think there's literally a single case where it doesn't work -- where a comment doesn't come directly before or directly after an AST item -- in the final trailing comma of a tuple or array. So tests fail at the moment. Next we need to consider: - Can we workaround that one case? We don't actually care about whether there's a trailing comma, so we could likely hack around it... - Are there actually other cases of this model failing? I know this approach -- of putting aesthetic items into AST -- is not generally favored, and it's really rare that there's even a single case of something not working.

max-sixty · 2024-06-24T18:16:27Z

To keep this updated: in #4639 I tried a version of this (though slightly different implementation — instead it adds them at parse time):

OK so I did some light research and one approach seems close to our current one, and will avoid the current issues in the PR:

Iterate over the tokens, and add each token to a node in the PL AST. That way, we don't have the ambiguity of whether the comment in a + b # c comes after a + b or after b; we add it to one of them in the pass over the tokens.

The main issue is that trailing commas in lists aren't contiguous to any expression, since they fall between two punctuation tokens (more explanation on this in the PR).

[
  foo,
  bar, # comment
]

Additionally, even in non-trailing commas, we'd need to think about where to attach comments. Since these two appear at similar locations in the approach of "put the comment on the nearest token" but probably comment on different items:

[
  foo,
  # comment about bar
  bar,
]

vs.

[
  foo, # comment about foo
  bar,
]

...in the latter case, the expr that's "contiguous with the comment apart from the whitespace" is bar but the comment is actually likely about foo.

(FWIW I now understand why my TOML formatter moves the latter case into the former case; a quite annoying feature!)

max-sixty added 6 commits February 17, 2024 21:32

feat: Add comments to format output

a45287c

Merge branch 'main' into format-comments

1b351c8

Merge branch 'main' into format-comments

8435108

Merge branch 'main' into format-comments

161c28a

4e8a29f

.

1a4858d

max-sixty changed the title ~~feat: Add comments to format output~~ feat: Add comments to format output Apr 20, 2024

max-sixty added 2 commits May 9, 2024 10:09

Merge branch 'main' into format-comments

7f51d48

Merge branch 'main' into format-comments

e7ad240

max-sixty mentioned this pull request May 12, 2024

feat: Add prqlc lex command to the CLI #4467

Merged

max-sixty added 2 commits May 12, 2024 15:24

Merge branch 'main' into format-comments

07eacad

.

73c0504

max-sixty force-pushed the format-comments branch from f7157c5 to 73c0504 Compare May 12, 2024 22:44

pre-commit-ci bot and others added 4 commits May 12, 2024 22:44

[pre-commit.ci] auto fixes from pre-commit.com hooks

51a85b3

for more information, see https://pre-commit.ci

7021c49

[pre-commit.ci] auto fixes from pre-commit.com hooks

99328e8

for more information, see https://pre-commit.ci

.

47f4fb2

e0c02e7

aljazerzen reviewed May 13, 2024

View reviewed changes

4e47f09

max-sixty force-pushed the format-comments branch from 8263423 to 4e47f09 Compare May 19, 2024 20:35

Merge branch 'main' into format-comments

94e7955

max-sixty mentioned this pull request Jun 20, 2024

feat(fmt): An attempt at aesthetic items into PL #4639

Open

max-sixty mentioned this pull request Jul 1, 2024

Add comments to fmt #4116

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add comments to format output #4397

feat: Add comments to format output #4397

max-sixty commented Apr 20, 2024

max-sixty commented Apr 20, 2024

aljazerzen commented May 6, 2024

max-sixty commented May 6, 2024

aljazerzen commented May 6, 2024

max-sixty commented May 6, 2024

aljazerzen commented May 7, 2024

max-sixty commented May 7, 2024 •

edited

Loading

aljazerzen commented May 7, 2024

aljazerzen commented May 7, 2024

max-sixty commented May 7, 2024

max-sixty commented May 12, 2024 •

edited

Loading

aljazerzen May 13, 2024 •

edited

Loading

aljazerzen May 13, 2024

aljazerzen May 13, 2024

aljazerzen May 13, 2024

aljazerzen May 13, 2024

aljazerzen commented May 19, 2024 •

edited

Loading

max-sixty commented May 19, 2024 •

edited

Loading

aljazerzen commented May 21, 2024

max-sixty commented May 29, 2024

max-sixty commented Jun 24, 2024 •

edited

Loading

		// FIXME: why isn't this working?
		// .position(\|t\| t.1.start == span.start && t.1.end == span.end)

feat: Add comments to format output #4397

Are you sure you want to change the base?

feat: Add comments to format output #4397

Conversation

max-sixty commented Apr 20, 2024

max-sixty commented Apr 20, 2024

aljazerzen commented May 6, 2024

max-sixty commented May 6, 2024

aljazerzen commented May 6, 2024

max-sixty commented May 6, 2024

aljazerzen commented May 7, 2024

max-sixty commented May 7, 2024 • edited Loading

aljazerzen commented May 7, 2024

aljazerzen commented May 7, 2024

max-sixty commented May 7, 2024

max-sixty commented May 12, 2024 • edited Loading

aljazerzen May 13, 2024 • edited Loading

Choose a reason for hiding this comment

aljazerzen May 13, 2024

Choose a reason for hiding this comment

aljazerzen May 13, 2024

Choose a reason for hiding this comment

aljazerzen May 13, 2024

Choose a reason for hiding this comment

aljazerzen May 13, 2024

Choose a reason for hiding this comment

aljazerzen commented May 19, 2024 • edited Loading

max-sixty commented May 19, 2024 • edited Loading

Footnotes

aljazerzen commented May 21, 2024

max-sixty commented May 29, 2024

max-sixty commented Jun 24, 2024 • edited Loading

max-sixty commented May 7, 2024 •

edited

Loading

max-sixty commented May 12, 2024 •

edited

Loading

aljazerzen May 13, 2024 •

edited

Loading

aljazerzen commented May 19, 2024 •

edited

Loading

max-sixty commented May 19, 2024 •

edited

Loading

max-sixty commented Jun 24, 2024 •

edited

Loading