Grievances with the tree-sitter query language #558

nbacquey · 2023-02-09T17:22:04Z

nbacquey
Feb 9, 2023
Collaborator

This issue will only serve to discuss weaknesses of the tree-sitter query language. We will be able to address them separately if needed.

No negation predicate

We often want to express something like "match any node that is not of a particular type". However, the query language has no negation predicate yet, so the only way to do that is to enumerate all node types that can appear in the given context, and remove the one we want to exclude.
This makes for unnecessarily cumbersome code.

This is a known issue in the tree-sitter repo, which has been dormant for 2+ years.
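
Until negation exists, the enumeration can at least be generated rather than written by hand. The following is only a sketch: `any_type_except` and the `ALL_TYPES` list are hypothetical names made up for illustration, and the node types are not taken from any real grammar.

```python
def any_type_except(all_types, excluded):
    """Build an alternation listing every given node type except the excluded ones."""
    kept = [t for t in all_types if t not in excluded]
    if not kept:
        raise ValueError("nothing left to match after exclusion")
    return "[" + " ".join(f"({t})" for t in kept) + "]"


# Hypothetical node types, for illustration only.
ALL_TYPES = ["number", "string", "comment", "list_expression"]

print(any_type_except(ALL_TYPES, {"comment"}))
# prints: [(number) (string) (list_expression)]
```

The generated alternation still has to be kept in sync with the grammar, so this reduces the typing but not the underlying problem.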

The anchor `.` is useful in practice, but useless in theory

The `.` anchor operator is very useful for the performance and correctness of queries, but it is ultimately impossible to use correctly.

This is because most language grammars allow some special types of nodes that can appear anywhere in the syntax tree (e.g. comments). The anchor operator doesn't ignore those special nodes.

Consider this query for OCaml:

(
  (number)
  .
  ";" @tag
)

It will match all three semicolons in the following code:

[
  1.0;
  2.0;
  3.0;
]

But only two in this code:

[
  1.0;
  2.001 (*sic*);
  3.0;
]

The same holds for anchors marking the beginning or end of a node's children.
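
To make the counting explicit, here is a toy model in Python. It is not tree-sitter itself: it treats the anchor as strict adjacency between siblings, and the two sibling lists are assumed flattenings of the examples above (node types only, with the comment appearing as a `comment` sibling).

```python
def count_anchored_semicolons(siblings):
    """Count ';' tokens whose immediately preceding sibling is a 'number'.

    Toy model of the query ((number) . ";" @tag): strict adjacency,
    with nothing skipped, not even comments.
    """
    return sum(
        1
        for prev, cur in zip(siblings, siblings[1:])
        if prev == "number" and cur == ";"
    )


# Assumed sibling sequences for the two OCaml lists above.
plain = ["[", "number", ";", "number", ";", "number", ";", "]"]
commented = ["[", "number", ";", "number", "comment", ";", "number", ";", "]"]

print(count_anchored_semicolons(plain))      # prints: 3
print(count_anchored_semicolons(commented))  # prints: 2
```

The comment sitting between `2.001` and its semicolon is what removes the second match.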

Worse: even if we decide to bite the bullet and replace every `.` by `.(comment)*.`, it won't work as expected either (see next point).

The anchor `.` ignores non-named nodes, except when it doesn't

Consider this OCaml query:

(
  (#delimiter! "#")
  _ @append_delimiter
  .
  "]"
)

Run on this code:

[
  1;
  2;
  3;
]

Which has the following syntax tree:

{Node list_expression (0, 0) - (4, 1)} - Named: true
  {Node [ (0, 0) - (0, 1)} - Named: false
  {Node number (1, 2) - (1, 3)} - Named: true
  {Node ; (1, 3) - (1, 4)} - Named: false
  {Node number (2, 2) - (2, 3)} - Named: true
  {Node ; (2, 3) - (2, 4)} - Named: false
  {Node number (3, 2) - (3, 3)} - Named: true
  {Node ; (3, 3) - (3, 4)} - Named: false
  {Node ] (4, 0) - (4, 1)} - Named: false

The result is:

[
  1;
  2;
  3#;#
]

Which means that both (number) and ";" were matched by the query (i.e. were considered "just before" "]"). I understand that we may want to ignore non-named nodes when checking adjacency, but having more than one node be the immediate left neighbor of another node severely breaks some invariants you want to have when writing queries.

The Kleene star `*` can match non-consecutive sequences of nodes

The behavior of the * operator is quite surprising. Consider this OCaml query:

(
  "["
  .
  (number)*
  .
  _ @append_delimiter
  (#delimiter! "#")
)

Run on this code:

[
  1;
  2;
  3;
]

Which has the following syntax tree:

{Node list_expression (0, 0) - (4, 1)} - Named: true
  {Node [ (0, 0) - (0, 1)} - Named: false
  {Node number (1, 2) - (1, 3)} - Named: true
  {Node ; (1, 3) - (1, 4)} - Named: false
  {Node number (2, 2) - (2, 3)} - Named: true
  {Node ; (2, 3) - (2, 4)} - Named: false
  {Node number (3, 2) - (3, 3)} - Named: true
  {Node ; (3, 3) - (3, 4)} - Named: false
  {Node ] (4, 0) - (4, 1)} - Named: false

The result is:

[
  1#;#
  2#;
  3;
]

This seems to indicate that the (number)* pattern matches:

the empty sequence (expected)
(number) (expected)
(number).";" (unexpected)
but not (number).";".(number) (very unexpected)

I don't know if the exact behavior for the Kleene star is documented anywhere, but it seems very unusual to me.

We can't use the (apparent?) full set of tree-sitter query features

The documentation states that we can use predicates of the form

(
  (identifier) @constant
  (#match? @constant "^[A-Z][A-Z_]+")
)

However, we can't currently compile them in the Rust library.

Inability to specify disjunctions in encompassing node types

We sometimes need to write queries of the form:

(X
  Y
  ; etc.
)

Where X ranges over a few nodes (say, a, b and c) and Y is constant. It would be nice if it were possible to alternate over head nodes (e.g., ([a b c] Y)), rather than rewriting the rule several times and/or using scopes, or being less precise and using a wildcard in the head position.
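
As a stopgap, the repeated rules could be generated mechanically. A minimal sketch, assuming the query files are preprocessed before being handed to tree-sitter; `expand_heads` is a hypothetical helper, not something that exists in Topiary or tree-sitter:

```python
def expand_heads(heads, body):
    """Emit one query per head node type, all sharing the same body."""
    return "\n".join(f"({head}\n  {body}\n)" for head in heads)


print(expand_heads(["a", "b", "c"], "Y"))
```

This prints three copies of the rule, one each for heads a, b and c, which is exactly the duplication we currently write by hand.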

Tokens that are captured with regexes never appear in the CST

In tree-sitter grammars, tokens can either be hardcoded, or captured through regular expressions.
For instance, here is the definition of the additive infix operators in the OCaml grammar:

    _add_operator: $ => choice(
      '+', '-', '+.', '-.',
      token(choice(
        seq('+', repeat1(OP_CHAR)),
        seq('-', choice(repeat1(/[!$%&*+\-./:<=?@^|~]/), repeat2(OP_CHAR)))
      ))
    ),

+, +., -, and -. are hardcoded, while another additive operator like -% isn't. This impacts what appears in the CST:

This is the CST of 1 + 2:

{Node infix_expression (0, 0) - (0, 5)} - Named: true
  {Node number (0, 0) - (0, 1)} - Named: true
  {Node infix_operator (0, 2) - (0, 3)} - Named: true
    {Node + (0, 2) - (0, 3)} - Named: false
  {Node number (0, 4) - (0, 5)} - Named: true

This is the CST of 1 -% 2:

{Node infix_expression (0, 0) - (0, 6)} - Named: true
  {Node number (0, 0) - (0, 1)} - Named: true
  {Node infix_operator (0, 2) - (0, 4)} - Named: true
  {Node number (0, 5) - (0, 6)} - Named: true

Note that there is no {Node -% (0, 2) - (0, 4)} - Named: false.

This means that tree-sitter queries cannot match infix operators that are defined in regular expressions.
For instance, the following tree-sitter query may have matches:

(infix_operator
  "+" @do_something
)

While the following will never match anything, and isn't even a correct query:

(infix_operator
  "-%" @do_something
)

See #418 and #462 for further reference

Xophmeister · 2023-02-09T17:39:16Z

Xophmeister
Feb 9, 2023
Maintainer

Great list 👍 Just to expand a little, from my experience:

Negation is amongst the Tree-Sitter issues (Adding a not quantifier to the tree query syntax tree-sitter/tree-sitter#705), but has been dormant for 2+ years.
Anecdotally, the Kleene star appears to ignore anonymous nodes in its definition of "consecutive". (That mightn't be the exact constraint; more data needed!)
At a glance, the #match? and #eq? predicates feel like they could be useful. However, I wonder how Topiary would use them, as we're using capture names as formatting directives, rather than arbitrary labels.

Xophmeister · 2023-02-13T10:02:18Z

Xophmeister
Feb 13, 2023
Maintainer

Another one:

I have found myself needing, a few times, to write queries of the form:
```
(X
  Y
  ; etc.
)
```
Where X ranges over a few nodes (say, a, b and c) and Y is constant. It would be nice if it were possible to alternate over head nodes (e.g., ([a b c] Y)), rather than rewriting the rule several times and/or using scopes, or being less precise and using a wildcard in the head position.

nbacquey · 2023-02-13T15:20:04Z

nbacquey
Feb 13, 2023
Collaborator Author

Updated the initial message, and added your use-case @Xophmeister

nbacquey · 2023-02-14T09:46:42Z

nbacquey
Feb 14, 2023
Collaborator Author

Added another one. I'm thinking about committing this file to the repository itself, what do you think?

aspiwack · 2023-02-14T09:53:27Z

aspiwack
Feb 14, 2023
Maintainer

I don't see why not. But on the other hand, what purpose would a file that gathers this information serve?

nbacquey · 2023-02-14T09:58:15Z

nbacquey
Feb 14, 2023
Collaborator Author

The same as this issue, I think:

Warn contributors about the known shortcomings of the query language.
Give pointers on which contributions to tree-sitter would be appreciated.

Except that it would stay visible, and its modifications would be git-tracked, which makes following its evolution easier than tracking a particular issue.

nbacquey · 2023-02-14T10:02:52Z

nbacquey
Feb 14, 2023
Collaborator Author

I suppose another way to achieve that would be to add a query language issue tag, and create one issue per point

Xophmeister · 2023-02-14T10:16:26Z

Xophmeister
Feb 14, 2023
Maintainer

This is more of a meta-issue, but I figured it's worth mentioning here as it's somewhat relevant:

Our language query files are declarative formatting directives, where each query targets some syntactic structure attested by the grammar. Targets can overlap and, in general (AFAIK) there is no limit to the number of queries that can target a specific node. For non-trivial languages, this quickly flouts the usual good practices of writing code. For example, language query files can become hundreds-to-thousands of lines long and it's up to whoever wrote them to organise them coherently and remember the state, such that any new queries don't conflict or cause unintended interactions.

By and large, this is doable rather than intractable, because programming languages' syntactic structures are (anecdotally) relatively well siloed. However, edge cases certainly exist where it's not immediately obvious which queries are being applied.

aspiwack · 2023-02-14T11:03:33Z

aspiwack
Feb 14, 2023
Maintainer

@nbacquey I think it depends on the objective (or maybe we need both). If the goal is to discuss how they can be addressed, to create workarounds in our workflows or upstream contributions, then one-issue-per-grievance is the way to go. If the goal is to document gotchas when writing query files, then this should be part of our documentation.

Maybe we need a bit of both.

@Xophmeister I don't think that there is a fundamental solution to this. In that there probably isn't an ideal formatting DSL that doesn't exhibit the problem. Maybe what we could think about is to have debugging tools that make clear which queries have matched a particular piece of code?

Xophmeister · 2023-02-14T11:06:32Z

Xophmeister
Feb 14, 2023
Maintainer

I don't think that there is a fundamental solution to this. In that there probably isn't an ideal formatting DSL that doesn't exhibit the problem. Maybe what we could think about is to have debugging tools that make clear which queries have matched a particular piece of code?

Indeed; some kind of tooling to aid development is what I had in mind 👍

nbacquey · 2023-02-14T11:07:03Z

nbacquey
Feb 14, 2023
Collaborator Author

I was thinking about a debugging tool, or a debugging mode, as well.

torhovland · 2023-02-27T11:47:26Z

torhovland
Feb 27, 2023
Maintainer

One more point:

There is no way to define constants to reduce duplication. For example, the Bash grammar uses this particular construct 9 times:

[(command) (list) (pipeline) (compound_statement) (subshell) (redirected_statement) (variable_assignment)]

Xophmeister · 2023-02-27T12:09:57Z

Xophmeister
Feb 27, 2023
Maintainer

There is no way to define constants to reduce duplication.

A metalanguage that compiles down to Tree-sitter queries sounds quite attractive. (Leveraging Nickel to do the job would be the icing on the cake!)

torhovland · 2023-02-27T12:18:34Z

torhovland
Feb 27, 2023
Maintainer

Good point. And actually this particular problem (reusable constants) could be solved with the most basic templating solution imaginable.
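
For what it's worth, a very basic version of that templating idea fits in a few lines of Python using the standard library's `string.Template`. This is only a sketch: the `$statement` placeholder name, the `CONSTANTS` table and the `expand` helper are all made up for illustration, and nothing like this exists in Topiary today.

```python
from string import Template

# Hypothetical constant table; the value is the alternation quoted above.
CONSTANTS = {
    "statement": "[(command) (list) (pipeline) (compound_statement) "
                 "(subshell) (redirected_statement) (variable_assignment)]",
}


def expand(query_template):
    """Replace $name placeholders with their definitions from CONSTANTS.

    safe_substitute leaves unknown placeholders and stray '$' characters
    untouched, so a regex such as "^a$" inside a predicate survives.
    """
    return Template(query_template).safe_substitute(CONSTANTS)


print(expand("$statement @do_something"))
```

The expanded text is an ordinary query, so tree-sitter itself never sees the placeholders.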

aspiwack · 2023-03-17T07:41:13Z

aspiwack
Mar 17, 2023
Maintainer

@nbacquey I was reading some documentation, and, for completeness, there is a limited form of negation in the query language

Negated Fields

You can also constrain a pattern so that it only matches nodes that lack a certain field. To do this, add a field name prefixed by a ! within the parent pattern. For example, this pattern would match a class declaration with no type parameters:
(class_declaration
  name: (identifier) @class_name
  !type_parameters)

aspiwack · 2023-03-17T07:55:22Z

aspiwack
Mar 17, 2023
Maintainer

However, we can't currently compile them in the Rust library.

The documentation explicitly calls out #match? and #eq? as being understood by the Rust crate (as far as tree-sitter proper is concerned, both are supposed to be just understood as uninterpreted predicates, but the Rust crate is supposed to impose semantics on top of them). So it would suggest that maybe we're calling the API wrong, or that there is a bug in the Rust crate implementation.

aspiwack · 2023-03-17T08:57:40Z

aspiwack
Mar 17, 2023
Maintainer

The anchor . ignores non-named nodes, except when it doesn't

I haven't seen an example of unnamed nodes not being ignored here. Is this grievance's title misleading?

I think that it's clear from the (comment) issue that unnamed nodes are not quite what you would need to skip, though.

Your example does document a pretty important pitfall though: it's typically a bad idea to use `_` as an operand of `.`.

aspiwack · 2023-03-17T09:02:29Z

aspiwack
Mar 17, 2023
Maintainer

The Kleene star * can match non-consecutive sequences of nodes

I'm not sure that your example demonstrates this. It looks like it may be an occurrence of the somewhat counterintuitive behaviour of . discussed above (both above in your issue, and above as in the previous comment 🙂 ).

Maybe we should try a more constrained query:

(
  "["
  .
  (number)
  .
  (number)*
  .
  (number) @append_delimiter
  .
  "]"
  (#delimiter! "#")
)

And run it on lists of various lengths to see which ones match. If the * is well-behaved, I expect this query to match lists of size 2 and of size 3, but not of size 4. If (number)* matches non-consecutive numbers, then size 4 and above should be matched.

torhovland · 2023-03-17T09:15:40Z

torhovland
Mar 17, 2023
Maintainer

However, we can't currently compile them in the Rust library.

The documentation explicitly calls out #match? and #eq? as being understood by the Rust crate (as far as tree-sitter proper is concerned, both are supposed to be just understood as uninterpreted predicates, but the Rust crate is supposed to impose semantics on top of them). So it would suggest that maybe we're calling the API wrong, or that there is a bug in the Rust crate implementation.

Can't remember the exact details now, but I do remember not being able to use built-in predicates, and finding that they were indeed not implemented in the Rust binding. Worth another look.

Xophmeister · 2023-03-17T10:25:16Z

Xophmeister
Mar 17, 2023
Maintainer

[snip] there is a limited form of negation in the query language
Negated Fields
You can also constrain a pattern so that it only matches nodes that lack a certain field. To do this, add a field name prefixed by a ! within the parent pattern. For example, this pattern would match a class declaration with no type parameters:
(class_declaration
  name: (identifier) @class_name
  !type_parameters)

AFAIK, the need for this kind of negation has never come up. It arises in a more general sense, when there's a pattern that applies to a strict subset of what the grammar affords (i.e., you end up with "exceptional nodes") and @do_nothing can't be applied. So far, the only solution we've found is to replicate that pattern for every node in that subset.

nbacquey · 2023-05-16T07:41:32Z

nbacquey
May 16, 2023
Collaborator Author

Added a paragraph on tokens captured by regular expressions in the grammar

314eter · 2023-05-27T20:53:35Z

314eter
May 27, 2023

I never realised the difference between strings and regular expressions in a grammar until tree-sitter/tree-sitter-ocaml#63. I created tree-sitter/tree-sitter-ocaml#77 to make the behaviour for operators more consistent. @nbacquey do you think that's a good idea, or will it make it even more complicated for topiary?

nbacquey · 2023-05-30T08:32:47Z

nbacquey
May 30, 2023
Collaborator Author

Hi @314eter, thanks for taking an interest in this issue!
I think removing the "||" and "&&" from the syntax trees makes sense from a consistency point of view.
Sure, it will break Topiary as it is, but I think your proposal of making _or_operator and the like named nodes instead would enable us to fix the issue properly.

Basically, it would help us a great deal if all the nodes from this rule could appear in the syntax tree:

    infix_operator: $ => choice(
      $._pow_operator,
      $._mult_operator,
      $._add_operator,
      $._concat_operator,
      $._rel_operator,
      $._and_operator,
      $._or_operator,
      $._assign_operator
    ),

314eter · 2023-05-30T21:41:43Z

314eter
May 30, 2023

Done in v0.20.2.

torhovland · 2023-06-27T10:00:57Z

torhovland
Jun 27, 2023
Maintainer

Also see #537 (comment)

josharian · 2023-12-20T21:24:51Z

josharian
Dec 20, 2023

Another one I learned the hard way in a different context: After three captures on any given node, further captures are silently ignored.

E.g. given

(foo) @a @b @c @d

@d will just be silently ignored.
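
Since the failure is silent, a crude textual lint over the query files might catch it early. A rough sketch in Python: `over_captured` is a hypothetical helper, the default limit of 3 is simply the number reported above (not something checked against tree-sitter's source), and the regex is naive in that it does not understand strings or comments inside a query.

```python
import re

# A run of one or more captures separated only by whitespace.
CAPTURE_RUN = re.compile(r"(?:@[\w.-]+\s*)+")


def over_captured(query, limit=3):
    """Return the capture runs that attach more than `limit` captures to one node."""
    runs = (m.group().split() for m in CAPTURE_RUN.finditer(query))
    return [run for run in runs if len(run) > limit]


print(over_captured("(foo) @a @b @c @d"))
# prints: [['@a', '@b', '@c', '@d']]
```

A query with three or fewer captures per node yields an empty list.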

Right now, there aren't enough orthogonal captures in topiary to hit this...but as it grows, this might crop up, and boy does it hurt.
