V2 Parser: Add scanner #340

Merged on Oct 4, 2023 (22 commits)

@kubukoz kubukoz commented Oct 2, 2023

Adds a scanner for all language tokens.

The design is inspired by rowan, although most of that will be visible in the parser stage, which will come in the future.

This scanner is:

  • total: there is no input that should fail to produce tokens

  • lossless: you can render the tokens back to the original code they were produced from. Note that full UTF-8 isn't explicitly supported, so code points that don't fit in a single char may break here. However, preliminary property-based testing hasn't turned up any input that fails to scan and re-render.

  • full fidelity: the tokens include comments and whitespace.

    • worth noting: newlines are treated specially. Normally a whitespace token consists of any number of whitespace characters, but newlines form their own tokens. See the tests for examples; the main reason for doing this is to make sync points easier in the parser (by relying on newline tokens).
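
To make the properties above concrete, here is a minimal sketch of a scanner with the same shape. It is not the code from this PR: the token kinds, the punctuation set, the `//` comment syntax, the choice of one token per newline, and the names `scan` and `render` are all assumptions made for illustration.

```scala
object ScannerSketch {
  // Hypothetical token kinds; the real token set in the PR is larger.
  sealed trait Kind
  case object Ident extends Kind
  case object Punct extends Kind
  case object Space extends Kind
  case object Newline extends Kind
  case object Comment extends Kind
  case object Error extends Kind

  final case class Token(kind: Kind, text: String)

  // Assumed punctuation set, for illustration only.
  private val punctuation = Set('{', '}', '[', ']', '.', ',', '#', '=', ':')

  // Total: every branch consumes at least one char, and unknown chars
  // become Error tokens instead of failing the scan.
  def scan(input: String): List[Token] = {
    val out = List.newBuilder[Token]
    var i = 0

    // Advance `i` while `p` holds and return the consumed slice.
    def takeWhile(p: Char => Boolean): String = {
      val start = i
      while (i < input.length && p(input.charAt(i))) i += 1
      input.substring(start, i)
    }

    while (i < input.length) {
      val c = input.charAt(i)
      if (c == '\n') { i += 1; out += Token(Newline, "\n") }
      else if (c.isWhitespace)
        out += Token(Space, takeWhile(ch => ch.isWhitespace && ch != '\n'))
      else if (input.startsWith("//", i))
        out += Token(Comment, takeWhile(_ != '\n'))
      else if (punctuation(c)) { i += 1; out += Token(Punct, c.toString) }
      else if (c.isLetter || c == '_')
        out += Token(Ident, takeWhile(ch => ch.isLetterOrDigit || ch == '_'))
      else { i += 1; out += Token(Error, c.toString) }
    }
    out.result()
  }

  // Lossless: concatenating the token texts reproduces the input.
  def render(tokens: List[Token]): String = tokens.map(_.text).mkString
}
```

In this sketch `render(scan(src)) == src` holds for any `src`, because every branch copies exactly the characters it consumes into a token. Comments and whitespace stay in the stream as tokens of their own, and a newline never ends up inside a `Space` token, which is what a parser could rely on for sync points.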
  • Scanner

    • Single-char tokens (punctuation)
      • missed: colons
    • Identifiers
    • Single-line comments
    • Multi-character tokens (keywords)
      • Keywords after comments, after whitespace etc.
    • String literals (unescaped, multi-line: consistent with current parser)
    • Numeric literals (full JSON number syntax)
      • Using Cats Parse for this, we can switch to a custom implementation later on. I'm not gonna waste hours of my life just to avoid using a third-party library ;P
    • Boolean/null literals
    • Parity testing
      • include scanner test in all generative tests
      • test for a non-empty list of non-error tokens
      • any error tokens in valid inputs should be reported as test failures
    • support arbitrary utf-8 codepoints?
      • Maybe later. For now, not a priority.
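
For the numeric literals item in the checklist above: the PR uses Cats Parse for this, so the following is only a stand-in to show what "full JSON number syntax" covers. It is a regex-based sketch using the number grammar from the JSON specification (RFC 8259); the object name `NumberSketch` and the function `scanNumber` are hypothetical.

```scala
object NumberSketch {
  // JSON number grammar: optional minus, integer part without leading
  // zeros, optional fraction, optional exponent.
  private val jsonNumber =
    """-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?""".r

  // Returns the numeric literal at the start of `input` (if there is one)
  // together with the remaining, unconsumed input.
  def scanNumber(input: String): Option[(String, String)] =
    jsonNumber.findPrefixOf(input).map(lit => (lit, input.drop(lit.length)))
}
```

For example, `NumberSketch.scanNumber("1.5e10 rest")` yields `Some(("1.5e10", " rest"))`, while `NumberSketch.scanNumber("01")` only consumes the leading `0`, since JSON does not allow leading zeros. A scanner built this way would go on to tokenize the remainder separately.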

@kubukoz kubukoz mentioned this pull request Oct 3, 2023
@kubukoz kubukoz changed the title V2 Parser V2 Parser: Add scanner Oct 3, 2023
@kubukoz kubukoz marked this pull request as ready for review October 3, 2023 02:15
@kubukoz kubukoz merged commit 7d9e6b1 into main Oct 4, 2023
@kubukoz kubukoz deleted the parser-v2 branch October 4, 2023 00:56