-
-
Notifications
You must be signed in to change notification settings - Fork 144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tree-sitter rolling fixes: 1.121 edition #1085
Conversation
…and out of `WASMTreeSitterLanguageMode` for reasons of encapsulation.
Fixes issue with parsing of selectors in `:has`, `:is`, and other pseudoclasses.
(Update: we now have a spec for |
…including more consistent firing of the `did-suggest-indent` event.
…and fix a bug with the indentation reparse budget.
…as booleans. Handle `NULL` in a similar way.
…and update our API usages. Tree-sitter harmonized the API differences between `web-tree-sitter` and `node-tree-sitter` in version 0.22.0. This is the first time we’ve had to deal with that. Luckily, the changes were mostly palatable. The biggest API difference is in `Query#captures`; two positional arguments for defining the extent of the query have been moved to keyword arguments. We’ve updated our internal usages, but any community packages that relied on the old function signature would break if we didn’t do anything about it. So we’ve wrapped the `Query#captures` method in one of our own; it detects usages that expect the old signature and rearranges their arguments, issuing a deprecation warning in the process. Hopefully this generates enough noise that any such packages understand what’s going on and can update. Other API changes are more obscure — which is good, because we can’t wrap them the way we wrapped `Query#captures`. They involve conversion of functions to getters (`node.hasErrors` instead of `node.hasErrors()`), and there’s no good way to make both usages work… short of wrapping nodes in `Proxy` objects, and that’s not on the table. Since lots has changed in `tree-sitter` since we last upgraded `web-tree-sitter`, I updated our documentation about building a custom version of `web-tree-sitter`.
Just bumped our Immediately after I bumped The slowness had to do with the fact that every query against the tree was accidentally being executed against the entire tree rather than against the specified range. (Even this was fast on the We can monkeypatch Everything went well locally, including specs, so I'll watch the CI carefully just in case there's something really obscure that regressed. |
…only injections with a shallower depth.
…and modify query files in response to all non–backwards-compatible changes.
…after `tree-sitter-javascript` bump.
…to conform to new `tree-sitter-(java|type)script` node structure for template strings.
…at new repo location.
Enough last-minute changes; time to take this out of draft so that I stop clipping at the edges like it's a bonsai tree. |
…for `tree-sitter-html` and `tree-sitter-embedded-template`. Fixed an inscrutable out-of-memory error I was getting in an EJS file.
…related to updating open documents’ syntax highlighting as grammars are processed.
Just kidding! I ran into a strange WASM memory error when working on an EJS document, and I couldn't figure out why it was happening, so eventually I decided “might as well update the relevant Tree-sitter parsers” and that solved it. Along the way I discovered another subtle bug, so that's also been dealt with. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without diving too deep into the code, with all tests passing and the history of these bumps, I'm confident in merging this one.
But love to see lots of little changes in seemingly every grammar, it's quite impressive tbh
// We do this because folds aggregate, so we can always either (a) delegate | ||
// a job to an arbitrary layer's `FoldResolver`, or (b) ask _all_ layers to | ||
// do something and then assemble the results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a quick comment that this paragraph is not terribly clear to someone (me, lol) unfamiliar with Tree-sitter nuts and bolts.
It's a little unclear to me how the "aggregating of folds" leads to the ability to "delegate a job to an arbitrary layer's [resolver]" or "ask all layers to do something and then assemble the results." Related to this: I have some questions: In this case, what is "something" as in "do something"? Can we do fairly arbitrary stuff, for example related to indenting or highlighting, in the Domain Specific Language Tree-sitter provides (seems like Tree-sitter grammars are in a Tree-sitter-specific DSL??) or is it stuff some arbitrary JavaScript consumer (community package) can do when getting information back from... Tree-sitter? The grammar itself? (Not sure how this works, TBH). Who (what code, what consumer) is "doing something" and who is "assembling the results" and what becomes of those "[assembled] results"?
Probably too many questions, but the type of questions someone (it's me again, lol) who doesn't already know the context in which these terms are being used might ask you if they were trying to dig into it for the first time.
One last thing, "aggregate" (verb) might be a precise word choice in this situation, but feels like it's at a high reading level without providing me clarity/insight, from the perspective of not already knowing what this actually means. Consider easier or lower-reading-level wording? If there are multiple sub-concepts being stated tersely in one go, consider unpacking them out? I take it to mean folds "combine" in a sense that either the result of the folding (decision??) is combined with the result of another fold (decision??) or perhaps that (if I recall correctly) there can be only one fold per line, so any fold reason is effectively more or less equivalent to there being any other fold reason on a given line... ??? But unclear where/when this "aggregation" is happening that we're referring to. Is this a code short-circuit quick path that saves us computing time, without downside to consumers/end-users (they don't have to think about it)? Does an end-user or consumer have to know the internals of this short-circuiting/aggregating, to predict the behaviors they would be interested in as a user/consumer?
Sorry to deep-dive, but I did want to give some feedback as to the readability of the comment. I like to praise comments, but here I'll giv a slightly hard time as you appear to be asking for this type of feedback in the PR body. So I'm letting 'er rip! Hope this is the kind of feedback you were hoping for.
Best,
-DeeDeeG
P.S. still reviewing, just wanted to finish this comment and post it separately since it turned out so long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little unclear to me how the "aggregating of folds" leads to the ability to "delegate a job to an arbitrary layer's [resolver]" or "ask all layers to do something and then assemble the results."
Yeah, I struggle to explain this stuff sometimes.
A buffer consists of anywhere from 1 to Infinity “language layers.” An HTML file has a “root” layer for parsing/highlighting HTML. If the file has a single SCRIPT block with inline JS, that’s a second layer. If it also has a STYLE block with inline CSS, that’s a third. And so on.
If we want to get all the folds in a buffer, we ask all the language layers, and then we assemble their results. By and large, we just aggregate what each layer tells us. That’s why each layer has a FoldResolver
. (I'm simplifying, but probably less than you think!)
By contrast, if we want to figure out the proper indentation level after a newline is inserted, that’s not something we find out by asking all the layers and aggregating the results. It’s a two-phase process, and in each phase we find one layer to ask. The layer is not guaranteed to be the same in both phases.
So it’s not a task for which we gain any simplicity by splitting up the work among different layers. That’s what I was trying to say. The task itself requires being aware of multiple language layers, so we'd actually be complicating the task if we gave each layer an IndentResolver
.
Related to this: I have some questions: In this case, what is "something" as in "do something"? Can we do fairly arbitrary stuff, for example related to indenting or highlighting, in the Domain Specific Language Tree-sitter provides (seems like Tree-sitter grammars are in a Tree-sitter-specific DSL??) or is it stuff some arbitrary JavaScript consumer (community package) can do when getting information back from... Tree-sitter? The grammar itself? (Not sure how this works, TBH). Who (what code, what consumer) is "doing something" and who is "assembling the results" and what becomes of those "[assembled] results"?
Tree-sitter parses our code into a syntax tree. Highlighting, indentation, code folding, and symbol listing are all tasks that we can do by querying that tree. The query language is that Lisp-y stuff you see in the contents of .scm
files.
If we found another useful way to query the tree and make users’ lives easier, we could do it. If a community package wanted to query the tree for its own purposes, it could; the API’s not very public or documented, but I’ve purposefully left some extension points.
The terminology is confusing because Tree-sitter also uses the word “grammar,” but I’ll use the word “parser” instead:
- A language layer has a single
WASMTreeSitterGrammar
. That’s a Pulsar-specific thing. - A single
WASMTreeSitterGrammar
has aLanguage
(a representation of a parser inweb-tree-sitter
) and some number ofQuery
instances (also aweb-tree-sitter
interface) — one for highlights, one for folds, etc. Query
instances are created via (IIRC)Language::query
; you give it a string with some query syntax (the contents of an SCM file).- You can use that
Query
instance to get matches viaQuery::captures(tree, options)
, wheretree
is a Tree-sitter tree andoptions
can include things like an (optional) range of the buffer to query. - We get back results really fast and can do whatever we want with them. The results include node metadata (node name, text contents, buffer range) and capture metadata (capture name, any asserted/refuted properties — i.e., the scope tests we use to winnow results).
- A community package could use
WASMTreeSitterGrammar::createQuery
to pass in some Tree-sitter query syntax and get aQuery
instance in return. To get a tree to query against, they can use some APIs onWASMTreeSitterLanguageMode
to grab a given buffer's tree at “safe” points (i.e., the ends of buffer transactions). They can either analyze the tree immediately (since it won't be around indefinitely!) or copy it so that they can analyze the copy in their own time.
Sorry to deep-dive, but I did want to give some feedback as to the readability of the comment. I like to praise comments, but here I'll giv a slightly hard time as you appear to be asking for this type of feedback in the PR body. So I'm letting 'er rip! Hope this is the kind of feedback you were hoping for.
Yeah, all of that is fair. I’ll probably revise this comment in the future. Code comments don’t provide a clear “entry point” into this knowledge, so it’s hard to know how much knowledge to assume at a given location.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It occurred to me when writing this up, that some sort of gentle ramp up the learning curve, in the form of docs, might be the right way to do it.
If we have some "2 + 2 = 4" fundamentals as the "easiest/lowest-assumptions-made" level of that on-ramp, that'd be great for me, heh. But whatever adds context that's essential for being comfortable with this stuff but might be overwhelming as code comments. (Even if it's linking to some Tree-sitter project docs that they have, if any.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm tapping out reading all of this refactor line by line before review.
I can appreciate the motivation given in the PR body as for why this is being done. So, I will take some of the content on trust, it is simply too large to review closely in a timely manner, IMO. But hopefully this brings benefits as laid out in the PR body.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm angry at the diff; it really does overstate the change. I took three or four methods' worth of logic and put them in a new class at the bottom. The diff makes it seem like I changed this entire file!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple more comments about code comments. (How meta? Not that kind of Meta, though.)
|
||
let indentCapturePosition = null; | ||
let indentDelta = 0; | ||
let dedentNextDelta = 0; | ||
// DEPRECATED | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a blank line above and below this line reading // DEPRECATED
. Is the code above or below the // DEPRECATED
line deprecated, or both, or? Somewhat ambiguous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The two methods below that line are what is deprecated. They're a very TextMate-like way of envisioning the buffer; they're there because some community packages (and probably some builtin packages) expect them to be there.
// Look up an `indent.` capture property applied with a `#set!` directive, | ||
// optionally coercing to a specified type or falling back to a default | ||
// value. | ||
// | ||
// `names` can be an array in cases where the property may have several | ||
// aliases. The first one that exists will be returned. (Omit the leading | ||
// `indent.` when passing property names.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
capture property applied with a
#set!
directive, [ . . . ]
Is this a verb "capture" or a noun "capture"? If a verb: Who captures the property? And dare I ask: why? Or if it's a noun "capture": is this telling us a capture property was applied with a #set!
directive, and how does type coercion or fallbacks play into all that?
https://i.kym-cdn.com/photos/images/newsfeed/002/029/841/a26.png
Sorry, just surfacing some of my specific confusion areas around this subject, as I see it was requested in the PR body to ask clarifying questions if I don't understand something. And joking around a bit, I hope you don't mind the humor. (See PNG above.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the following line adds quite a bit of context, just from the function name and argument names alone, but still. I'm not sure I totally understand.
getProperty(capture, names, coercion = null, fallback = null) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this telling us a capture property was applied with a
#set!
directive, and how does type coercion or fallbacks play into all that?
It's a noun capture! Capture properties are things like the #set!
clause in
((foo) @bar
(#set! indent.offset 1))
Despite the fact that it doesn't require quotes, the 1
in this example will be returned to us as a string. It's all strings to Tree-sitter. Hence the coercion.
We also have indent.force
(which should be coerced to a boolean), and indent.match
) (which should stay a string). So three different data types — hence the coercion.
And joking around a bit, I hope you don't mind the humor. (See PNG above.)
STOP DOING HUMOR
- words were not intended to be spoken deceitfully
- YEARS of satire yet all we ever do is get MORE SARCASTIC
- “An escalator can never break; it can only Become Stairs. I spilled spot remover on my dog; he’s gone now.” — Statements dreamed up by the Utterly Deranged
“Why did the chicken cross the road???“
they have played us for absolute fools
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR does seem potentially a lot smaller than it looks, assuming the diff for src/wasm-tree-sitter-language-mode.js is in fact mostly just refactoring existing code into a slightly different shape/format. (And should provide the opportunity to do some good things, so, worth it, it sounds like.)
Bumping various third-party dependencies for the grammars should be a good thing.
The tree-sitter-web related changes seem reasonable as explained.
I'm once again not in much of a position to deeply review the content, but I am trying to keep abreast of changes at the very least.
Nothing stands out at me that should need changes, to the extent I can review, at least without doing a ton of background learning that seems rather involved to catch up on first.
Anyway, 👍 from me as well.
This one starts off big — a sizable diff that probably overstates the scope of the underlying refactor.
Indentation logic is complex. It's not as complex as it seems, but it's still pretty complex, and I haven't exactly helped the issue because I haven't found an intuitive way to explain it. The first commit in this PR represents the beginning of an effort to demystify indentation queries and make them easier for others to reason about.
The main initial benefit is that indentation logic is now encapsulated in its own
IndentResolver
class. UnlikeFoldResolver
— a different instance of which exists for eachLanguageLayer
— aWASMTreeSitterLanguageMode
instance has a single instance ofIndentResolver
. That's because the logic of indentation hinting crossesLanguageLayer
s; we don't gain anything by subdividing the labor further.But the encapsulation itself is useful. It lets us write new methods to reduce repetition without increasing the API surface of
WASMTreeSitterLanguageMode
.There are two other ideas that I've added in the process:
The
@match
capture is good, but a recent bug report suggested to me that we could benefit from a similar kind of capture that works in the first phase of indentation hinting. So I created@match.next
; it's a different approach to phase-one hinting from what you get with@indent
/@dedent
/@dedent.next
captures.“Phase one” of indentation hinting is where we determine what the “baseline” indentation is for line X before its own content is considered; we look at the content of line X-1 and decide whether line X should start with the same indentation level, one level more, or one level less. I'll illustrate its usefulness below — much easier when I'm not in an unordered list.
For a long time I've wanted to make the indentation decision process more visible. It's not something anyone needs to care about… until they're writing a modern Tree-sitter grammar or trying to figure out why a grammar someone else wrote isn't making the right indentation decision. When I've found myself in this position, I've had no choice but to open dev tools and set some debugger breakpoints.
Now I'm experimenting with a different approach: an obscure API that, once an indentation decision is made, fires an event with metadata that could help a user understand the logic behind an indentation decision. The
IndentResolver::onDidSuggestIndent
method accepts a callback and provides an object with a bunch of properties. It's still a bit intimidating to consume directly, but my eventual goal is to write a community package that can interpret the data (or else add such functionality totree-sitter-tools
).About
@match.next
The impetus for
@match.next
was this example:This is a contrived example, but the idea is that a user should be able to insert whatever sort of hanging indent they want when they're spreading a statement over multiple lines. Since we have Tree-sitter, we should be able to understand that line 4 shouldn't keep the deep indentation of line 3; it should match the indentation level of line 2!
I didn't think about this until a user reported an issue with hanging indents in C. I came up with a fix that's most of the way to what I want, but it uses a
@match
capture, so operates in the second phase of indentation hinting — the phase that considers the content on the current line. For this reason, it doesn't let us correct the indentation until after the user starts typing.With
@match.next
, that fix looks a bit different:I'm still playing around with this, and I'm only 96% sure it's a good idea. But this could be expressed as follows: “the presence of
expression_statement
on row X-1 suggests that row X should start with the same level of indentation as that of the row whereexpression_statement
started.” (Also true forreturn_statement
and all other node types within the square brackets.)Hence in the example above:
expression_statement
, so it produces a@match.next
capture.expression_statement
didn't end on row 3. But theindent.matchesComparisonRow
test passes, so we proceed.suggestedIndentForBufferRow
interprets the@match.next
capture by using theindent.matchIndentOf
predicate to figure out which line's indentation should be copied. In this case, it usesstartPosition
— i.e., the starting position of theexpression_statement
node.I still sense that I'm not explaining this well enough. I'm not certain that these captures have the right names; one further advantage of the
IndentResolver
encapsulation is that it would make it a bit easier to introduce new aliases for these captures and predicates while preserving backward compatibility. (For instance,indent.matchIndentOf
is now aliased to the terserindent.match
, andindent.offsetIndent
is aliased toindent.offset
.)Another reason to introduce this new encapsulation is to give us a place to hang new indentation-specific query tests. In order to implement what I described above, I needed a way to introduce the concepts of “current row” and “comparison row” to query tests, and those are indentation-specific ideas. (The current row is the row whose suggested indentation we're determining; the comparison row is the one we're using as reference; typically it's the row just above the current row, but we tend to skip over whitespace-only rows.)
Anyway, if any of this doesn't make sense, and you want to understand it better, please ask! In the process of answering your questions I hope I find ways to make this subject less intimidating.
In other news: a fix I made for
tree-sitter-css
months ago has been accepted, so I've also bumped ourtree-sitter-css
to the latest release.Changelog
web-tree-sitter
to version 0.23.0.tree-sitter-css
to the latest version.tree-sitter-markdown
to the latest version.tree-sitter-html
andtree-sitter-embedded-template
to their latest versions.tree-sitter-javascript
to the latest version.tree-sitter-typescript
to the latest version.@match.next
capture for advanced control of how indentation should change from one line to the next.Added a new event that is emitted when Pulsar makes an indentation decision, with the ultimate goal of making indentation hinting easier to understand for grammar authors. The metadata on the event is preliminary and subject to change.(Maybe I'll mention this somewhere, but probably not in the changelog, since it's not part of a public API yet.)indent.matchesComparisonRow
andindent.matchesCurrentRow
for comparing arbitrary positions in a Tree-sitter node tree to the operative rows in an indentation suggestion query. Makes it possible to say things like “decrease the indent on line 10 if a statement ends on line 9.”indent.matchIndentOf
andindent.offsetIndent
toindent.match
andindent.offset
, respectively. The old names still work as aliases.