Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Syntax highlighting is broken for some languages #6668

Closed
lildude opened this issue Jan 2, 2024 · 36 comments
Closed

Syntax highlighting is broken for some languages #6668

lildude opened this issue Jan 2, 2024 · 36 comments
Labels

Comments

@lildude
Copy link
Member

lildude commented Jan 2, 2024

Opening this to have a referable issue.

As @Alhadis reported there is an issue with separate syntax highlighting engine that is causing syntax highlighting issues…

Something's changed with GitHub's handling of TextMate-powered highlighting. There's now a runaway-match issue with my .gitconfig grammar, and I distinctly recall seeing no issues with it on GitHub until relatively recently. Compare GitHub's rendering of the file to that of Atom's:

GitHub screenshot Atom screenshot
GitHubAtom

This doesn't appear to be an issue with regex engine flavours, or how recent they are. I can see @jtbandes's example is affected by the same highlighting failing to terminate at the end pattern specified by the TextMate grammar.

This is not an issue with linguist and not likely to be an issue with the grammars we bundle.

An issue has been opened with the team that manages the highlighting engine and I'll keep this issue updated as I hear more.

@jtbandes
Copy link
Contributor

jtbandes commented Jan 2, 2024

After looking into the reported examples a bit, I believe one possible root cause of these issues is that (?!\G) is not working in end patterns.

This would explain the following issues:

Swift

class A: B {
  let x: Int
}
protocol Foo {
  associatedtype T: Equatable
  var x: String { get }
}
protocol Foo {
  associatedtype T = Int
  var x: String { get }
}
protocol Foo {
  associatedtype T: Equatable = Int
  var x: String { get }
}

gitconfig

[good]
b = c
[bad]

TOML

[good]
b = c  # comment
[bad]

@jtbandes
Copy link
Contributor

jtbandes commented Jan 2, 2024

Here are a couple others which I'm not able to trace to a specific instance of (?!\G), so there might be other types of end patterns that also don't work... or maybe the problem is that \G isn't working in general?

class Foo<A, B> where A: Identifiable, B: Identifiable {
  var x: A
}
macro IntroduceArbitraryPeers<T>(_: T) = #externalMacro(...)
macro IntroduceArbitraryPeers<T>(_: T) = #externalMacro(...)
(...)
macro IntroduceArbitraryPeers<T>(_: T) = #externalMacro(...)
macro IntroduceArbitraryPeers<T>(_: T) = #externalMacro(...)
let SE0168 = """   illegal
        foo (this part should not be illegal)
    illegal"""

@Alhadis
Copy link
Collaborator

Alhadis commented Jan 2, 2024

@lildude Just an FYI, \G and (?!\G) are two of the most important idioms in TextMate grammars, as they're essentially all authors have by way of maintaining state and structure, especially in multiline patterns.

It might be that the location of the last match normally dereferenced by \G is getting reset between pattern lookups.

@Kaspik
Copy link

Kaspik commented Jan 4, 2024

Not sure if this is the right place but tons of problems with Swift files recently...

Screenshot 2024-01-04 at 18 24 24 Screenshot 2024-01-04 at 18 24 33

@lildude
Copy link
Member Author

lildude commented Jan 4, 2024

@Kaspik yes, it was the PR that changed the Swift grammar which first brought this to light.

@sethrj
Copy link

sethrj commented Feb 20, 2024

I can also report a recent issue in C++ highlighting: textmate/c.tmbundle#79

One-line comments now cause the next token to be ignored, which can cause catastrophic failures:

// Here is a one-line comment
/*
 * This is a multi-line comment that's
 * broken because the opening token was missed.
 * Now because of the apostrophe, everything
 * looks like a character string.
 */
int main();

versus

/*
 * This is a multi-line comment that's
 * not broken.
 */
int main();

The former currently renders as:
Screenshot 2024-02-20 at 15 47 04

EDIT: it does look like textmate C.plist is also using (?!\G) for its comments: https://github.com/textmate/c.tmbundle/blob/80c8e886b67227096a56aca6b92fe1587f76d603/Syntaxes/C.plist#L684

@sethrj
Copy link

sethrj commented Feb 22, 2024

Also broken for CMake:

# This is a comment
if(${higlighting} STREQUAL "it works")
endif()

versus

# This is a comment

if(${higlighting} STREQUAL "it works")
endif()

@patrickt
Copy link

patrickt commented Feb 27, 2024

Thanks, everyone, for your reports in this thread, and for your patience with these regressions. The issue is indeed the handling of ?!\G tokens; these were provided by a custom fork of PCRE v1, made by someone who’s since departed the company, that had been bitrotting for nine years (and unsurprisingly did not work on Apple Silicon). We’re currently investigating how we can support these tokens without sacrificing speed or maintainability; in the meantime, we’re moving as many languages as we can to use tree-sitter based highlighting, which does not suffer from these problems and provides a noticeable speedup in the blob view (a win all around).

@Alhadis
Copy link
Collaborator

Alhadis commented Feb 27, 2024

in the meantime, we’re moving as many languages as we can to use tree-sitter based highlighting

Uh… what languages, exactly? How many? I'm interested in how many of my TextMate grammars might be substituted for a lower-quality or buggier tree-sitter version…

@jtbandes
Copy link
Contributor

jtbandes commented Feb 27, 2024

FWIW, VS Code seems to support these patterns. Perhaps if a new regex/scoping implementation is needed then that team could help with integrating theirs, which seems to be actively maintained: https://github.com/microsoft/vscode-textmate

Uh… what languages, exactly? How many? I'm interested in how many of my TextMate grammars might be substituted for a lower-quality or buggier tree-sitter version…

And unfortunately, it’s probably many of the highest-quality TM grammars that will need to be replaced, if these TM patterns are no longer supported 🙁

@Alhadis
Copy link
Collaborator

Alhadis commented Feb 27, 2024

Also, I forgot to ask: why is there an outdated fork of PCRE being used instead of, y'know, something contemporary? This isn't a fault with the TextMate grammar format itself, but the regex engine that powers its pattern-matching. Portable grammars (or at least those that I've written) can leverage a subset of regex features that's supported by both modern PCRE, Perl, and Oniguruma. At no point should a fork be needed…

@patrickt
Copy link

patrickt commented Feb 27, 2024

@Alhadis We have, for some time, been using tree-sitter for C, C#, CSS, Elixir, Gleam, Go, HTML, Java, JavaScript, JSDoc, Nix, PHP, Python, CodeQL, regular expressions, Ruby, Rust, TLA+, and TypeScript. This week I’m hoping to get to C++, .gitconfig and .gitmodules, XML, and R.

We were using an old PCRE fork because, as far as I can tell, neither PCRE1 nor PCRE2 provides equivalents of ONIG_OPTION_NOTBOS, ONIG_OPTION_NOTEOS, and ONIG_OPTION_NOTGPOS. As you can tell from the regressions, certain grammars depended on these flags. (VS Code forks and vendors its own version of Onigurama for this reason.)

As to the technical reasons behind this change, tree-sitter grammars are simply a better fit for the GitHub highlighting service. TextMate grammars made sense for their original use-case, but for the purpose of our highlighting service (a non-interactive, stateless RPC server, very different from any text editor), regular expressions are not competitive, speed-wise, with tree-sitter’s generated C, and speed is one of our most important desiderata, given the scale at which this service has to operate. The speed increase when switching from a TM to TS grammar is noticeable from an end-user perspective. However, as I mentioned above, we’re looking into how we can add back support for \G and ?!\G qualifiers as a stopgap measure.

@jtbandes If you see regressions, we’d be grateful if you’d report them so we can improve the whole tree-sitter ecosystem! VS Code is actually a bit of an exception w/r/t its continued use of TextMate grammars; neovim, Pulsar, Zed, and now Emacs are all using tree-sitter for highlighting. tree-sitter is very much the lingua franca at GitHub for anything parsing-related, and that’s unlikely to change. For what it’s worth, we’ve been very happy with the capabilities of tree-sitter for syntactically rich languages like Ruby (and these syntax highlighting queries are much more concise than the equivalent .tmlanguage JSON files; Ruby’s is only a few hundred SLOC). Tree-sitter grammars also compose well, which is great for languages like TSX where you have multiple grammars/highlighters that need to work in concert.

@tevelee
Copy link

tevelee commented Feb 27, 2024

Thanks for the update on your continued work on this @patrickt.
You mentioned a few languages on your tree-sitter migration roadmap, but I noticed that Swift (which the original issue was reported about) is not on the list. Where can I follow your progress for the upcoming work?

@jtbandes
Copy link
Contributor

jtbandes commented Feb 27, 2024

@patrickt Is the Swift stuff using https://github.com/alex-pinkus/tree-sitter-swift or a different grammar? Let me know if any help with that migration is needed — while I have not worked on the tree-sitter side (yet!), I have intimate knowledge of the TM grammar and recently updated it with support for highlighting regex literals, as well as some other newer language features which I believe the tree-sitter grammar currently does not support. If I can be of assistance, drop me an email or DM.

@Alhadis
Copy link
Collaborator

Alhadis commented Feb 27, 2024

@patrickt Hypothetically, if one were to write a proof-of-concept that demonstrated how TextMate grammars could be used to provide a structured parser grammar based on a combination of their regex patterns and properties associated with scopes that capture groups are tagged with… would that be something of interest to GitHub?

@patrickt
Copy link

patrickt commented Feb 27, 2024

@jtbandes Thank you for your help! We’re using the @alex-pinkus version, yes, and I believe (though don’t quote me on this) he and we have been talking about making it the official Swift grammar in the tree-sitter org. Highlighting of components in regex literals are exactly the kind of thing I don’t want to regress—though it’s EOD for me, so that’s a job for tomorrow morning.

@Alhadis I think that sounds extremely cool, and I’d like to see your results, but if by “of interest to GitHub” you mean “sufficient for us to prefer TextMate grammars over tree-sitter grammars,” then I’m afraid that the answer is no. If there are missing syntax-highlighting features or developer amenities in tree-sitter as compared to what TextMate provides, then that’s a bug in tree-sitter that we'd very much like to fix.

@Alhadis
Copy link
Collaborator

Alhadis commented Feb 27, 2024

but if by “of interest to GitHub” you mean “sufficient for us to prefer TextMate grammars over tree-sitter grammars,” then I’m afraid that the answer is no.

No, I mean “sufficient enough to disincentivise mass-migration of existing TextMate grammars to potentially inferior tree-sitter versions”.

If there are missing syntax-highlighting features or developer amenities in tree-sitter as compared to what TextMate provides, then that’s a bug in tree-sitter that we'd very much like to fix.

It's not a bug in tree-sitter. It's the very design of the system itself that feels over-engineered and extremely hostile to grammar developers. I don't doubt that tree-sitter is performant and well-optimised to GitHub's needs, but TextMate's simplicity and flexibility are its key strengths to a grammar author. Moreover, it's not bound by the restrictions of a formal parser grammar, and it lends itself well to informal, unstructured formats like logs and program output.

@sethrj
Copy link

sethrj commented Feb 27, 2024

Be cool man, let's leave this thread to report issues with the grammar.

Thanks @patrickt for the work you put in to syntax highlighting and for the detailed update! It would've been very easy to leave us in the dark so your posts are much appreciated.

@patrickt
Copy link

No, I mean “sufficient enough to disincentivise mass-migration of existing TextMate grammars to potentially inferior tree-sitter versions”.

I’m afraid not, no. I appreciate your concerns and sympathize with your frustration, but this decision is set in stone at this point. The set of trade-offs with which we are working render us willing to accept temporary regressions (all of which, in my experience, have been trivially fixable) for the several axes on which tree-sitter provides concrete improvements.

@psmskelton
Copy link

github_matlab_issue

MATLAB source is also having issues in that GitHub classifies anything in the document after the first % comment to be a comment.

@rybalkoss
Copy link

Review of matlab code is quite challenging on GitHub:
image

@patrickt
Copy link

Hey, everyone 👋 Thanks to some absolutely heroic work on the part of @dcreager, we’ve addressed the regressions people have been observing. Please let us know if you spot any further infelicities, and thank you all very much for your patience while we fixed these issues.

@Alhadis
Copy link
Collaborator

Alhadis commented Mar 12, 2024

@patrickt I was literally about to point out how the regression appears to be fixed. Thanks for all your hard work!

@jtbandes
Copy link
Contributor

jtbandes commented Mar 12, 2024

Thanks for your hard work. I’m sad to see Swift.tmLanguage go since it’s my baby, but I guess this will put more pressure on tree-sitter-swift to support all the language features. (Just look at all these things that are no longer highlighted: https://github.com/jtbandes/swift-tmlanguage/blob/main/grammar-test.swift) Maybe I will get around to contributing to it some day.

@Kaspik
Copy link

Kaspik commented Mar 12, 2024

@patrickt As of now, still seeing some issues with Swift files (like 'lazy', or 'dataSource' / 'delegate'), but overall seems much better! (or current PR just has no specific code that was problematic 😄 )

Screenshot 2024-03-12 at 18 59 51

@jtbandes
Copy link
Contributor

@Kaspik Those issues would be due to the switch from Swift.tmLanguage to tree-sitter-swift.

@jtbandes
Copy link
Contributor

Interestingly, I'm still seeing the Swift tmLanguage grammar being used in some places. It looks like it's used for diffs in Markdown files but not Swift files: https://gist.github.com/jtbandes/34a7a9fcc972f1d4d3f3ca6a39b86ccd/revisions

image

@patrickt
Copy link

@jtbandes Keen eye regarding Markdown—yes, the fact that embedded Markdown goes through the legacy TextMate path is an implementation detail we hope to fix someday (right now Markdown itself does not go through tree-sitter, which would be a first step, but after that it also requires some thought re. implementation). Thank you also for the Swift grammar test file you’ve written—I hadn’t seen this before, and this will greatly assist getting tree-sitter-swift to parity with what we had in Markdown. I’ve already landed a few fixes (here, here), and am going to knock more out tomorrow and also update the grammar GitHub-side. Hoped to get to that today but, as always, so much to do.

@Kaspik
Copy link

Kaspik commented Mar 13, 2024

I think having Markdown done through old implementation is actually a good thing because you have direct comparison what's needed and left to be improved in the implementation. The Markdown code above is much easier to read than the Swift file. I wouldn't migrate Markdown till the Swift one fixed.

@patrickt
Copy link

@lildude I think this can be closed absent further reports of TM grammars breaking.

@lildude
Copy link
Member Author

lildude commented Mar 18, 2024

Thanks @patrickt.

@lildude lildude closed this as completed Mar 18, 2024
@lildude lildude unpinned this issue Mar 19, 2024
@jtbandes
Copy link
Contributor

jtbandes commented Mar 21, 2024

Edit: seems like this is fixed now

Hey folks, it looks like single-line comments in C++ are broken:

image

https://github.com/jackma/mcap_mqtt_recorder/blob/main/main.cpp#L26

Although it does work properly in a markdown code snippet and gist:

// Return the stamp property if set, otherwise use the current time
mcap::Timestamp GetTimestampFromProperties(
    mqtt::properties properties, const std::string& stamp_property)

https://gist.github.com/jtbandes/6b632b45e4a88ce178d7c9b7b1810b91

@ddelange
Copy link

ddelange commented Apr 10, 2024

Hi 👋

Python code blocks in rst are now 100% underline: https://github.com/piskvorky/smart_open/blob/v7.0.4/README.rst

Is it a remnant of this issue, or should I create a new one?

@lildude
Copy link
Member Author

lildude commented Apr 10, 2024

Is it a remnant of this issue, or should I create a new one?

This is a whole new issue and nothing to do with Linguist or the original issue. github/markup is the repo responsible for rendering codeblocks in rst files so an issue needs to be logged there or directly with GitHub support.

@lildude
Copy link
Member Author

lildude commented Apr 10, 2024

Locking this issue as all of the issues originally reported have been resolved. Any new issues need to be logged with the upstream grammar repo as normal.

@github-linguist github-linguist locked as resolved and limited conversation to collaborators Apr 10, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

10 participants