Struggling with what I thought should be a simple grammar #230

Open
smontanaro opened this issue Jan 20, 2023 · 1 comment

@smontanaro

I'm trying to parse Gmail-like queries, stuff like

from:[email protected] OR subject:vintage bikes

Since I'm dealing entirely with strings and Parsimonious doesn't seem to have a tokenizer, I'm happy for now to write the operators like this, so that their character set is completely distinct from the character set of words:

from:[email protected] || subject:vintage bikes

Still, I'm not getting that to work. My grammar looks like this:

grammar = Grammar("""
    term = factor add*
    factor = string mult*
    mult = and string
    add = or factor
    and = "&&"
    or = "||"
    word = ~"[-a-z0-9:.@]+"i
    string = word (spc word)*
    spc = ~"\s*"
    """)

When trying to parse the second query string I get:

parsimonious.exceptions.IncompleteParseError: Rule 'term' matched in its entirety, but it didn't consume all the text. The non-matching portion of the text begins with ' || faliero masi' (line 1, column 13).

If I tack optional whitespace around the and/or operators to gobble up the space before the || operator, it parses:

grammar = Grammar("""
    term = factor add*
    factor = string mult*
    mult = and string
    add = or factor
    and = spc* "&&" spc*
    or = spc* "||" spc*
    word = ~"[-a-z0-9:.@]+"i
    string = word (spc word)*
    spc = ~"\s*"
    """)

but that seems like a crude hack. While I'm not totally averse to hacking my way to a solution, it still seems there should be a cleaner way to define the grammar. None of the Parsimonious examples I found dealt with anything like this. Am I missing something?

@erikrose
Owner

erikrose commented Feb 2, 2023

The typical practice with Parsimonious grammars is to add a whitespace term (e.g. spc) to the right side of every "leaf" node of a grammar. Take a look at what I did in the grammar that describes Parsimonious grammars themselves:

rule_syntax = (r'''
# Ignored things (represented by _) are typically hung off the end of the
# leafmost kinds of nodes. Literals like "/" count as leaves.
rules = _ rule*
rule = label equals expression
equals = "=" _
literal = spaceless_literal _
# So you can't spell a regex like `~"..." ilm`:
spaceless_literal = ~"u?r?b?\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\""is /
~"u?r?b?'[^'\\\\]*(?:\\\\.[^'\\\\]*)*'"is
expression = ored / sequence / term
or_term = "/" _ term
ored = term or_term+
sequence = term term+
not_term = "!" term _
lookahead_term = "&" term _
term = not_term / lookahead_term / quantified / atom
quantified = atom quantifier
atom = reference / literal / regex / parenthesized
regex = "~" spaceless_literal ~"[ilmsuxa]*"i _
parenthesized = "(" _ expression ")" _
quantifier = ~r"[*+?]|\{\d*,\d+\}|\{\d+,\d*\}|\{\d+\}" _
reference = label !equals
# A subsequent equal sign is the only thing that distinguishes a label
# (which begins a new rule) from a reference (which is just a pointer to a
# rule defined somewhere else):
label = ~"[a-zA-Z_][a-zA-Z_0-9]*(?![\"'])" _
# _ = ~r"\s*(?:#[^\r\n]*)?\s*"
_ = meaninglessness*
meaninglessness = ~r"\s+" / comment
comment = ~r"#[^\r\n]*"
''')
