Understanding Peppa PEG
Writing a parser is a lot of fun, especially writing a PEG parser in ANSI C, e.g. Peppa PEG. In case you are unfamiliar with parsing, this is a good introductory post. PEG is a unique parsing algorithm that was invented relatively recently, in 2004. Bryan Ford hosts more information here: link.
Unlike older parsing techniques such as lex/yacc, PEG parsers usually don't separate lexing and parsing; everything is completed in one scan. PEG can be seen as a more advanced dialect of EBNF.
PEG parsers, if not cached, can perform poorly because of unlimited backtracking. Some PEG parsers introduce a so-called packrat caching mechanism to boost performance, e.g. holding a big in-memory table for fast lookup of (rule, position) => parse_result. This enables PEG parsers to run in O(length-of-text) time.
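The (rule, position) => parse_result table can be sketched in a few lines. This is an illustrative Python sketch of the packrat idea, not Peppa PEG's actual C implementation; the rule functions and their signature are hypothetical:

```python
# Minimal packrat memoization sketch (illustrative, not Peppa PEG code).
# Each (rule_name, position) pair is computed at most once and cached,
# so even heavily backtracking grammars run in O(length-of-text) time.
# NOTE: the cache here assumes one input text per parser instance.
def make_packrat_parser(rules):
    cache = {}  # (rule_name, position) -> (node, new_position) or None

    def parse(rule_name, text, pos):
        key = (rule_name, pos)
        if key in cache:
            return cache[key]  # reuse the earlier result instead of re-parsing
        # a rule function takes (parse, text, pos) and returns
        # (node, new_position) on success or None on failure
        result = rules[rule_name](parse, text, pos)
        cache[key] = result
        return result

    return parse
```

A rule here is any function returning `(node, new_position)` or `None`; a real packrat parser stores the same information per rule and position.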
- Least Surprise. The PEG grammar is designed to be self-explanatory and to avoid looking "odd".
- Human-Readable Errors. The parser should generate meaningful errors that help both language designers and users quickly identify the root cause.
- Portability. Write once, run anywhere. Most modern languages support C bindings. Besides, Peppa PEG can even be compiled and run on embedded devices, since it's written in ANSI C (C89).
- Develop a code linter.
- Develop a code highlighter.
- Develop a static analysis tool.
- Test new grammars.
- Write a DSL.
The concept of "cut" should be familiar to anyone who writes code in the Prolog programming language. This is not a "conventional" PEG feature, but it has been explored in some papers, such as Cut Points in PEG, Extended Abstract (frankly speaking, I tried but never fully understood those mathematical symbols). It also exists in the Python PEG parser. You can see some discussions in the wild. I also like the discussion in this issue.
A cut point is meaningful only inside a sequence; you will most likely see rules like `array = "[" @cut arr_elem "]";`. It's especially useful when multiple sequences appear in an ordered choice, such as `value = "[" @cut arr_elem "]" / "{" @cut obj_elem "}" / num;`. More often, we break big rules into smaller ones, such as `array = "[" @cut arr_elem "]";`, `object = "{" @cut obj_elem "}";`, and `value = array / object / num;`. In this case, `@cut` still commits to the chosen alternative when a failure occurs. A cut point as a standalone rule is meaningless; you won't see a PEG rule like `rule = @cut;`. It's a syntactically valid rule, although Peppa PEG will refuse to apply it.
Internally, the expression `@cut` always succeeds and updates a local state, e.g. it sets `frame->cut = true`. The scope of `@cut` is limited to a single frame on the stack, i.e. the "parent" sequence where `@cut` resides. When a failed frame with the cut flag tripped is popped from the stack, the error propagates to the top and aborts the whole parse.
A special case is made when `@cut` meets a negative lookahead. Take the expression `!("[" @cut arr_elem "]")`: when a CutError is raised inside the lookahead, we know the lookahead match has definitely failed, so the negation succeeds and recovers from the error propagation.
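The interaction between cut, ordered choice, and negative lookahead can be sketched in Python. This is a toy model of the semantics described above, not Peppa PEG's C internals; `CutError` plays the role of the tripped `frame->cut` flag propagating up the stack:

```python
# Toy sketch of @cut semantics (illustrative, not Peppa PEG's C code).
class MatchError(Exception):
    """Ordinary failure: ordered choice may try the next alternative."""

class CutError(Exception):
    """Failure after @cut: committed to this alternative, abort the choice."""

def lit(ch):
    def parse(text, pos):
        if text[pos:pos+1] == ch:
            return (ch, pos + 1)
        raise MatchError(pos)
    return parse

def seq_with_cut(before, after):
    """Matches `before @cut after`: failures after the cut become CutError."""
    def parse(text, pos):
        node1, pos = before(text, pos)
        try:
            node2, pos = after(text, pos)
        except MatchError as exc:
            raise CutError(pos) from exc  # the cut was crossed: commit
        return ((node1, node2), pos)
    return parse

def choice(*alternatives):
    def parse(text, pos):
        for alt in alternatives:
            try:
                return alt(text, pos)
            except MatchError:
                continue  # not committed yet: try the next alternative
            # CutError is deliberately NOT caught: it propagates to the top
        raise MatchError(pos)
    return parse

def negative(inner):
    def parse(text, pos):
        try:
            inner(text, pos)
        except (MatchError, CutError):
            return (None, pos)  # lookahead failed => negation succeeds;
                                # a CutError is recovered here
        raise MatchError(pos)
    return parse
```

With `value = choice(seq_with_cut(lit("["), lit("]")), lit("n"))`, parsing `"[x"` raises CutError out of the whole choice, while `negative(...)` on the same input recovers and succeeds.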
Consider the rule `S = S a / a`. Although it's syntactically correct, the parser will be trapped in endless recursive calls: parsing S requires parsing S first, which requires parsing S first, and so on, until no stack space is left for function calls.
Nonetheless, this is a pretty common pattern in BNF; in fact, it's routinely recommended in CFG grammars.
I think introducing left recursion into PEG would benefit end users.
In Peppa PEG, such a rule is written as `S = lhs | S rhs`. The parsing algorithm can be described as:
- Parse lhs_node. If failed, S is failed.
- Keep parsing:
- If EOF, stop parsing.
- Parse rhs_node. If it doesn't match, stop parsing.
- Create a node assembling lhs_node and rhs_node.
- The new node becomes lhs_node.
- Return lhs_node.
This implementation is in essence identical to `S = lhs rhs*`, except that it produces a differently shaped AST.
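The steps above can be sketched as a Python loop. This is an illustrative sketch, not Peppa PEG's C implementation; `parse_lhs` and `parse_rhs` are hypothetical sub-parsers returning `(node, new_position)` or `None`:

```python
# Sketch of the left-recursion loop for S = lhs | S rhs
# (illustrative Python, not Peppa PEG's actual C code).
def parse_left_recursive(parse_lhs, parse_rhs, text, pos):
    result = parse_lhs(text, pos)
    if result is None:
        return None                      # step 1: lhs failed => S failed
    lhs_node, pos = result
    while pos < len(text):               # step 2: keep parsing until EOF
        result = parse_rhs(text, pos)
        if result is None:
            break                        # rhs not matching => stop parsing
        rhs_node, pos = result
        lhs_node = (lhs_node, rhs_node)  # assemble; the new node becomes lhs_node
    return (lhs_node, pos)               # step 3: return lhs_node
```

Because the loop folds each new rhs_node onto the left, the resulting tree is left-associative, which is exactly the shape `S = lhs rhs*` does not give you.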
There are some attempts in academia to introduce full left-recursion support (both direct and indirect) by growing seeds. That algorithm is far more complex than the one proposed above. I think left recursion in Peppa PEG, although it doesn't cover indirect left recursion, is good enough for 90% of use cases. It's simple and performs well.
Many programming languages have deep precedence hierarchies; Java, say, has twentyish levels. If we encoded every level with left and right recursion, performance would suffer badly. Instead, many parsers implement precedence climbing. Here are some useful links: 1, 2, 3.
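The appeal of precedence climbing is that one loop replaces one recursive rule per precedence level. A minimal sketch (illustrative Python, not Peppa PEG code; operands are assumed to be single tokens, and the operator table is made up for the example):

```python
# Minimal precedence-climbing sketch (illustrative, not Peppa PEG code).
# Instead of one grammar rule per precedence level, a single loop
# consults an operator precedence table.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def parse_expr(tokens, pos, min_prec=1):
    node = tokens[pos]  # assumption: every operand is a single token
    pos += 1
    while pos < len(tokens) and PRECEDENCE.get(tokens[pos], 0) >= min_prec:
        op = tokens[pos]
        # parse the right side with a higher minimum precedence, so
        # tighter operators bind first and equal ones associate left
        rhs, pos = parse_expr(tokens, pos + 1, PRECEDENCE[op] + 1)
        node = (op, node, rhs)
    return node, pos
```

For example, `1+2*3` yields a tree where `*` binds tighter than `+`, with no dedicated "term" or "factor" rules anywhere.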
peg/leg was developed in the old days. Its use case is to read mixed input of PEG + C code, then produce a C file. Unlike peg/leg, Peppa PEG produces a runtime `P4_Grammar` object that holds all grammar rules. The main consideration here is that Peppa PEG wants to be a generic PEG parser without an extra C codegen step.
I like the Python PEG parser code. The code is easy to understand and maintain. It's helpful to compare the Python PEG parser with Peppa PEG a little bit.
The proposal for shifting from Python's old LL parser to the PEG parser was published as PEP-0617. Guido's PEG Parsers series is helpful for understanding the shift.
The Python parser C source code (cpython/Parser/parser.c) is generated from the Python grammar file and parses Python source code into an AST (not a CST). Here are some interesting implementation details:
- An embedded tool in the Python repo called "pegen" produces the C source code for the Python parser. See pegen. Peppa PEG does not have a similar utility for producing C code; as noted above, it produces a runtime `P4_Grammar` object that holds all grammar rules, so that it can be a generic PEG parser without an extra C codegen step. The downside is that Peppa PEG needs to bootstrap a PEG parser before parsing.
- The Python parser C code does not include the tokenizing process. This is why you don't see definitions like `NAME` or `INDENT` in the Python grammar file. I think it's for a historical reason: the old tokenizing code is mature and there is no need to replace it. See Tokens, Tokenizer.
- Despite the large file size (>1MB), the code isn't hard to read. (Actually, I didn't understand how to implement "cut" correctly until I read through the Python PEG parser code.) Each function matches one PEG expression and follows similar logic:
  - If the parser is tainted with an error in the middle of any parsing step, fail fast and return no AST node (same in Peppa PEG).
  - For each sub-expression of the PEG expression, attempt to get the corresponding AST node (same in Peppa PEG).
  - Return a new AST node wrapping all the gathered AST nodes. Peppa PEG produces a parse tree (CST), not an AST, which gives the user the power to traverse and manipulate the tree. On the other hand, it makes sense for the Python PEG parser to skip the CST step for performance reasons.
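The three-step shape of each generated rule function can be sketched in Python. All names here are hypothetical; this illustrates the pattern described above, not the actual generated CPython code:

```python
# Illustrative sketch of the per-rule function shape described above
# (hypothetical names; not the actual generated CPython parser code).
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.error_indicator = False  # set when the parser is "tainted"

def expect(parser, token):
    if parser.error_indicator:
        return None                   # fail fast on a tainted parser
    if parser.pos < len(parser.tokens) and parser.tokens[parser.pos] == token:
        parser.pos += 1
        return token
    return None

def match_number(parser):
    if parser.error_indicator:
        return None
    tok = parser.tokens[parser.pos] if parser.pos < len(parser.tokens) else None
    if isinstance(tok, int):
        parser.pos += 1
        return tok
    return None

def match_sum(parser):
    """Shape of a generated rule function for: sum = NUMBER '+' NUMBER."""
    if parser.error_indicator:
        return None                   # step 1: fail fast
    mark = parser.pos
    a = match_number(parser)          # step 2: gather sub-expression nodes
    op = a and expect(parser, "+")
    b = op and match_number(parser)
    if a and op and b:
        return ("sum", a, b)          # step 3: wrap them into a new node
    parser.pos = mark                 # backtrack on failure
    return None
```

Every generated function repeats this mark/gather/wrap/backtrack dance, which is why the file is large yet readable.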
Look at the pest syntax and you'll find some extraordinary rule modifiers:
- `WHITESPACE` and `COMMENT`
- `silent = _{ ... }`
- `atomic = @{ ... }`
- `compound_atomic = ${ ... }`
- `non_atomic = !{ ... }`
Peppa PEG is inspired by these pest features and offers similar ones, but with a different taste: `@squashed`, `@lifted`, `@spaced`, `@tight`, `@scoped`, `@nonterminal`.
Let's look at each one of them.
- Having special names like `WHITESPACE` or `COMMENT` is nice. Pest gives these two names implicit meanings. But what if a user wants to define a third or fourth spacing rule? Peppa PEG prefers doing things explicitly: to make a rule a spacing rule, just write it as `@spaced rule1 = ...;`. This allows as many spacing rules as the user wants.
- I feel `silent` isn't a good name, as it may confuse people with "ignoring match errors silently". Peppa PEG replaces it with `@lifted`, which means the node will be "lifted" out of the parse tree when the rule matches the input.
- The names `atomic`/`non_atomic` vs `compound_atomic` are even more confusing, as you can't simply guess the meaning from the name. Again, Peppa PEG renames them based on their effects on the parse tree:
  - `@tight`: `@spaced` rules are not inserted between sequence and repeat items. Both `atomic` and `compound_atomic` have this effect.
  - `@squashed`: all children nodes are squashed, leaving only the parent node. This applies to all of the node's descendants: children, grandchildren, etc. Pest `atomic` has this effect.
  - `@scoped`: regardless of the effects inherited from "parent" rules, this flag creates a fresh scope for the parse. This is especially helpful when most descendant nodes are squashed while some must be kept, such as when parsing a Python f-string: `f"ignore {a_node_is_created} ignore"`.
Here is a table of the feature mappings:

| Function | Pest Modifiers | Peppa PEG Flags |
|---|---|---|
| Insert whitespace into repeat/sequence rules | WHITESPACE | @spaced @lifted |
| Insert comments into repeat/sequence rules | COMMENT | @spaced @lifted |
| Insert a rule into repeat/sequence while keeping the node | 🙅 | @spaced |
| Lift the node from the parse tree | silent | @lifted |
| Don't insert @spaced rules into repeat/sequence; don't generate children nodes | atomic | @tight @squashed |
| Don't insert @spaced rules into repeat/sequence | compound_atomic | @tight |
| Stop the effect of @squashed inherited from the parent rule | non_atomic | @scoped |
| Generate the node but don't generate children nodes | 🙅‍♀️ | @squashed |
| ... | 🙅 🙅‍♀️ | ... |
Since Peppa PEG brings orthogonality into tree modification, it can cover even more cases that pest can't.
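To make the tree effects concrete, here is a toy Python sketch of `@lifted` and `@squashed` as a post-processing pass over a parse tree. This is purely illustrative; Peppa PEG applies these flags during parsing, not afterwards, and a real squashed node keeps its matched text (omitted here):

```python
# Toy sketch of @lifted / @squashed tree effects (illustrative only;
# Peppa PEG applies flags during parsing, and squashed nodes keep
# their matched text). A node is (rule_name, children_list_or_text).
def apply_flags(node, flags):
    name, children = node
    if not isinstance(children, list):
        return [node]                      # leaf: keep as-is
    out = []
    for child in children:
        out.extend(apply_flags(child, flags))
    if "squashed" in flags.get(name, ()):  # drop all descendant nodes
        return [(name, [])]
    if "lifted" in flags.get(name, ()):    # replace the node with its children
        return out
    return [(name, out)]
```

Because each flag is an independent tree transformation, they compose freely, which is the orthogonality pest's fused `atomic`/`silent` modifiers lack.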