Struggling with what I thought should be a simple grammar #230

smontanaro · 2023-01-20T15:07:13Z

I'm trying to parse Gmail-like queries, stuff like

from:[email protected] OR subject:vintage bikes

Since I'm dealing entirely with strings and parsimonious doesn't seem to have a tokenizer, I'm happy for now to use something like this, so the character set of the operators is completely distinct from the character set of words:

from:[email protected] || subject:vintage bikes

Still, I'm not getting that to work. My grammar looks like this:

grammar = Grammar("""
    term = factor add*
    factor = string mult*
    mult = and string
    add = or factor
    and = "&&"
    or = "||"
    word = ~"[-a-z0-9:.@]+"i
    string = word (spc word)*
    spc = ~"\s*"
    """)

When trying to parse the second query string I get:

parsimonious.exceptions.IncompleteParseError: Rule 'term' matched in its entirety, but it didn't consume all the text. The non-matching portion of the text begins with ' || faliero masi' (line 1, column 13).

If I tack optional whitespace around the and/or operators to gobble up the space before the || operator, it parses:

grammar = Grammar("""
    term = factor add*
    factor = string mult*
    mult = and string
    add = or factor
    and = spc* "&&" spc*
    or = spc* "||" spc*
    word = ~"[-a-z0-9:.@]+"i
    string = word (spc word)*
    spc = ~"\s*"
    """)

but that seems like a crude hack. While I'm not totally averse to the idea of hacking my way to a solution, it still seems there should be a cleaner way to define the grammar. None of the parsimonious examples I found dealt with anything like this. Am I missing something?


erikrose · 2023-02-02T01:46:19Z

The typical practice with Parsimonious grammars is to add a whitespace term (e.g. spc) to the right side of every "leaf" node of a grammar. Take a look at what I did in the grammar that describes Parsimonious grammars themselves:

parsimonious/parsimonious/grammar.py, lines 220 to 256 in d5636a6:

    
rule_syntax = (r'''
    # Ignored things (represented by _) are typically hung off the end of the
    # leafmost kinds of nodes. Literals like "/" count as leaves.
    rules = _ rule*
    rule = label equals expression
    equals = "=" _
    literal = spaceless_literal _
    # So you can't spell a regex like `~"..." ilm`:
    spaceless_literal = ~"u?r?b?\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\""is /
                        ~"u?r?b?'[^'\\\\]*(?:\\\\.[^'\\\\]*)*'"is
    expression = ored / sequence / term
    or_term = "/" _ term
    ored = term or_term+
    sequence = term term+
    not_term = "!" term _
    lookahead_term = "&" term _
    term = not_term / lookahead_term / quantified / atom
    quantified = atom quantifier
    atom = reference / literal / regex / parenthesized
    regex = "~" spaceless_literal ~"[ilmsuxa]*"i _
    parenthesized = "(" _ expression ")" _
    quantifier = ~r"[*+?]|\{\d*,\d+\}|\{\d+,\d*\}|\{\d+\}" _
    reference = label !equals
    # A subsequent equal sign is the only thing that distinguishes a label
    # (which begins a new rule) from a reference (which is just a pointer to a
    # rule defined somewhere else):
    label = ~"[a-zA-Z_][a-zA-Z_0-9]*(?![\"'])" _
    # _ = ~r"\s*(?:#[^\r\n]*)?\s*"
    _ = meaninglessness*
    meaninglessness = ~r"\s+" / comment
    comment = ~r"#[^\r\n]*"
    ''')
