Modifying the tree-sitter grammar #4
Not sure what "teach the mode about some macros" means. Would you mind elaborating on that? Is that referring to something other than highlighting? |
I was thinking both of highlighting (as I assume that tree-sitter by itself can't tell apart a function from a macro and those have to be specified explicitly) and semantic indentation (for macros that take forms as arguments). |
Thanks for the clarification. Yes, the current grammar does not distinguish between functions and macros. It also doesn't try to identify special forms. There is a summary of some of the background about why here. The short of it is that the (multiple) attempts I made before to add support for things like To me the (more) correct parsing was more important, but there's nothing that says that technically you can't have more than one grammar. It's just that I wanted something that would work well to be able to do things like structural editing / navigation decently. Perhaps you are already familiar with the following, but for the sake of clarity... In the tree-sitter world, one grammar can inherit from another (e.g. tree-sitter-commonlisp inherits from tree-sitter-clojure), so that's one path. What you mentioned earlier about customizing an existing grammar is another path. In either case though, I don't think one can currently expect any kind of runtime tuning (i.e. editing the grammar file, generating parser source from that, and finally creating another dynamic library seems unavoidable [1]). Also, at least for the pre-Emacs-29 world that used elisp-tree-sitter, I asked the maintainer at one point about using more than one grammar in a single buffer, and my understanding was that it probably could be made to work:
I bring up this idea because this route might make it possible to use one grammar for one purpose while using another for another, each possibly more suited (e.g. being more accurate) for a different sort of use. Of course, a single choice that worked well would be nicer. No idea how things are in Emacs 29+ though. Now that @dannyfreeman is up-to-speed on how the grammar works, maybe he'd be interested in seeing if adding support for additional constructs is feasible / practical. Supporting more than the bare basics is done by one of the Fennel grammars and one of the Janet grammars [2], but as far as I know, there is a cost involved in accuracy (I wrote simpler versions of both of those and compared at one point -- but it's been a while so it's possible things have changed). Compared to Clojure I think both of those lisp-likes have smaller cores and I believe neither grammar tried to cover all constructs but I'd need to check what the status is currently. If it's found that adding a few things is feasible, I would imagine one of the next things to consider might be what else. We all know how large clojure.core is... [1] This currently requires the [2] IIUC, there are also grammars for Emacs Lisp and Racket that might be worth examining. I don't know how much testing has been done for these. |
To elaborate more, there really isn't a way to extend the grammar without creating new tree-sitter binaries. Supporting things like fancy macros will best be done in Emacs Lisp, or whatever platform is consuming the grammar. Queries can be written to match forms like This sort of thing is going to be required for semantic indentation. I already do it to some extent for syntax highlighting (example) |
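To make the "keep the grammar generic, apply meaning on the consumer side" idea concrete, here is a minimal Python sketch. It is a toy, not the tree-sitter API: the reader, the node shapes (`"list"`, `"sym"`), and the `DEF_LIKE` pattern are all assumptions made up for illustration. The parse step knows nothing about `def` or `defn`; a separate query-like pass looks for lists whose head symbol matches a pattern.

```python
import re


def tokenize(src):
    # Split into parens and atoms. Enough for this toy, not real Clojure
    # (no strings, comments, or reader macros).
    return re.findall(r"[()]|[^\s()]+", src)


def read(tokens):
    """Build a generic tree: ("list", [children]) or ("sym", text)."""
    tok = tokens.pop(0)
    if tok == "(":
        children = []
        while tokens[0] != ")":
            children.append(read(tokens))
        tokens.pop(0)  # drop the closing paren
        return ("list", children)
    return ("sym", tok)


def read_all(src):
    tokens = tokenize(src)
    forms = []
    while tokens:
        forms.append(read(tokens))
    return forms


# Hypothetical consumer-side pattern; the parser above never sees it.
DEF_LIKE = re.compile(r"^(def|defn|defmacro)$")


def find_definitions(node, found=None):
    """Collect (head, name) pairs for lists whose head symbol looks def-like."""
    if found is None:
        found = []
    kind, value = node
    if kind == "list":
        if (len(value) >= 2
                and value[0][0] == "sym"
                and value[1][0] == "sym"
                and DEF_LIKE.match(value[0][1])):
            found.append((value[0][1], value[1][1]))
        for child in value:
            find_definitions(child, found)
    return found


source = "(defn square [x] (* x x)) (def answer 42) (println answer)"
results = [d for form in read_all(source) for d in find_definitions(form)]
print(results)
```

Changing what counts as a definition only means editing `DEF_LIKE`; the parser is untouched, which is the property being argued for in the comment above.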
Thanks for the detailed explanations from both of you. Now I understand the situation a lot better. I think it would definitely make sense to document some of those design decisions and limitations, so it's easier for the end users to understand why certain things were done the way they were. I'll also take a closer look at the resources shared by @sogaiu. Btw, it might also be a good idea to add some general "understanding tree-sitter and how major modes based on tree-sitter work" sections with some pointers to external resources, so potential new contributors would have a good starting point. |
Re:
There is a list of grammar repositories here along with some tree-sitter-related questions / summaries. In addition to the grammars mentioned earlier, there's at least one grammar for "scheme" and a fork-with-changes of the elisp one mentioned above. There may be still others. Re:
I don't know how up-to-date the following is, but apart from looking through existing On the Emacs end of things, paying attention to the emacs-devel mailing list and the source repository seems to work for keeping up-to-date. I don't have a good idea of how stable things are / will be -- maybe that's something that could be queried about at the mailing list. In the not-so-distant past, a Discord server was started for discussing tree-sitter things. It was announced here. One of the maintainers (though AFAIK, not the original creator of tree-sitter) hangs out there and has been helpful. This channel of communication might be preferred over the tree-sitter repository's issues / discussions for some types of queries. One thing I didn't mention earlier is that the grammar currently being used in clojure-ts-mode has existing users in other programs -- here is an incomplete list. I mention this as making major changes to this grammar at this point from a feature perspective may lead to breakage elsewhere, so I'm not so inclined to go in that direction [1]. That's not to say that a different one couldn't be created of course :) [1] At least not without some way to find out who is using the grammar in what way and establishing good communication channels with those folks...not something that seems practical unfortunately. |
I'll work on this, probably over the weekend. Either expanding on the README or a new markdown doc linked from the README.
And just to add to this, it is very hard to make changes to the grammar that are not breaking in some way to one of the downstream users of the grammar. Even adding new nodes could be breaking in some way, because before the change the same text captured by the new node was captured by a different type of node. |
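A small Python sketch of why even an additive grammar change can break consumers. The trees are hand-written stand-ins for parser output; `def_keyword` is a hypothetical new node type invented for this example, and the other node names are used only illustratively.

```python
# Hand-written stand-ins for what two grammar versions might produce
# for the text "(defn f [])". "def_keyword" is hypothetical.
old_tree = ("list_lit", [("sym_lit", "defn"), ("sym_lit", "f"), ("vec_lit", [])])
new_tree = ("list_lit", [("def_keyword", "defn"), ("sym_lit", "f"), ("vec_lit", [])])


def capture(tree, node_type):
    """Return the text of every leaf node of the given type, in order."""
    kind, value = tree
    hits = []
    if isinstance(value, list):
        for child in value:
            hits.extend(capture(child, node_type))
    elif kind == node_type:
        hits.append(value)
    return hits


# A downstream tool that highlights every symbol asks for "sym_lit".
before = capture(old_tree, "sym_lit")
after = capture(new_tree, "sym_lit")
print(before, after)
```

The downstream query did not change, yet it silently stops matching `defn` once the new node type takes over that text.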
Re:
In the last few days I revisited:
It turns out there is a feature that allows one to perform queries of supertypes. AFAIK, this isn't included in the official docs, though it is mentioned in the Tree-sitter 1.0 Checklist:
I have tried it out a bit:
and at least so far it hasn't resulted in large numbers of parse errors. I'm not sure yet whether this would be compatible with the existing grammar, but it might be worth further investigation. Addendum: It looks like I tried this out a while back and wrote about it here. It's not clear to me with the above approach whether one will be able to tell apart a use of |
This is the main problem with trying to apply semantic meaning to clojure code with tree-sitter. Potentially you could detect a def inside a quoted list, with basically 2 divergent parse paths, where everything (lists, symbols, keywords, etc) have a
and then calling it with
because tree-sitter has no way of understanding that To really get an accurate understanding of Clojure source code, tooling needs context of what has already been parsed. It basically needs to execute the code. Knowing that I'm content with the grammar as is and writing some simple queries at run time to guess when something is a definition. It's more flexible to do it in Emacs Lisp. When we inevitably parse some weird All of this is good info to have in the documentation bbatsov is requesting. Going to try to work on it today. I'll probably draw from a lot of your prior art @sogaiu. |
@dannyfreeman FYI, maybe you've seen already but I've updated some of the docs in one of the pre-release branches. Possibly the content may be helpful / more accurate than what is currently "released". |
I'd focus first on the more common cases (provide accurate indentation for the known Clojure special forms and built-in macros that do something with code) and not fret too much about what macros might do in general. I've always been fond of tackling problems a step at a time - getting something perfect from the get-go is a pretty tall order. If we can add some mechanism where end users can specify the indentation rules for their own macros as in the current |
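One way to picture the "built-in rules plus user-supplied rules" mechanism is a lookup table keyed by the head symbol of a form. The Python sketch below is purely illustrative: the rule names, the `body_indent` helper, and the `with-db` macro are all hypothetical, and it does not reflect how any existing mode stores its rules.

```python
# Heads whose bodies get a fixed 2-column indent. Illustrative, not exhaustive.
BUILTIN_BODY_FORMS = {"when", "let", "defn", "do"}


def body_indent(head, open_paren_col, body_forms):
    """Column for continuation lines of a form that starts with `head`."""
    if head in body_forms:
        # Body-style: indent two columns past the opening paren.
        return open_paren_col + 2
    # Function-call style: align with the first argument, i.e. past
    # the paren, the head symbol, and one space.
    return open_paren_col + 1 + len(head) + 1


# A user teaches the mode about their own macro (hypothetical name).
user_body_forms = {"with-db"}
all_rules = BUILTIN_BODY_FORMS | user_body_forms

print(body_indent("with-db", 0, BUILTIN_BODY_FORMS))  # treated as a call
print(body_indent("with-db", 0, all_rules))           # treated as a body form
```

Because the rules are plain data consulted at indentation time, adding one needs no grammar change and no rebuilt parser binary.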
That's my general strategy. You can see it somewhat in the current clojure-ts-mode font-lock rules that check for common definition forms (def, defn, defmacro, etc). For the time being that's all clojure-ts-mode will be capable of.
I agree, it would be nice to get there. I do not believe we will be able to use the provided in |
I've got a document going for this now BTW https://github.com/clojure-emacs/clojure-ts-mode/blob/main/doc/design.md I plan on expanding it more soon, but it is a good start I think. |
Possibly useful to make a distinction between concrete syntax trees and abstract syntax trees. See this section for some details. |
Added. The named vs anonymous nodes were useful to describe as well. |
Some minor things: Looks like there's a bit about "abstract" left from before:
Also a stray character at the end of the following maybe?
|
Great catches, I fixed that up. Thank you :) |
2db016dc64f287fa541c97b922d20f493fedf403 |
As a response to my article https://metaredux.com/posts/2023/03/12/clojure-mode-meets-tree-sitter.html someone asked if it'd be possible/easy to modify the tree-sitter grammar used by clojure-ts-mode (e.g. teach the mode about some macros). I know that obviously they can fork the grammar and build custom binaries, but I'm wondering if there's a simpler way to make some changes. I guess we should document this somewhere.