syncode/syncode/parsers/grammars at main · uiuc-focal-lab/syncode

History

Name		Name	Last commit message	Last commit date
parent directory ..
README.md		README.md
__init__.py		__init__.py
c_grammar.lark		c_grammar.lark
calc_grammar.lark		calc_grammar.lark
go_grammar.lark		go_grammar.lark
grammar.py		grammar.py
invariants.lark		invariants.lark
java_grammar.lark		java_grammar.lark
json_grammar.lark		json_grammar.lark
prover9_grammar.lark		prover9_grammar.lark
python_grammar.lark		python_grammar.lark
sql_grammar.lark		sql_grammar.lark
tiny_grammar.lark		tiny_grammar.lark

README.md

ℹ️ Adding new Grammars

SynCode utilizes Lark Extended Backus–Naur form (EBNF) grammars to guide generation, ensuring outputs adhere to specific formats. These grammars construct parsers that filter out incompatible tokens during the generation process, resulting in output that follows the grammar's production rules.

Creating grammar for SynCode requires a solid understanding of Lark EBNF grammar. Here's how you can get started:

Read Lark's grammar documentation here.
Review existing grammars for SynCode here.

LALR(1) or LR(1) Grammars

SynCode supports both LALR(1) and LR(1) parser generators, which require grammars to be unambiguously recognizable by these parser generators. An ambiguous grammar may lead to shift-reduce or reduce-reduce conflicts. LR(1) is more powerful in terms of representing certain syntax, however common features of most popular formal languages are known to be representable as LALR(1) (and hence also LR(1))

Currently, similar to Lark, we handle shift-reduce conflicts by prioritizing shift
In the case of reduce-reduce conflicts, SynCode will halt with an error. Fixing these grammars requires a good understanding of how bottom-up parser generators work. In most cases, it should be possible to fix these errors by rewriting some of the grammar rules. However, in some rare cases it is possible that it is impossible to represent the grammar as LR(1)

Debugging Grammars

SynCode provides a flag --debug to help debug grammars. This flag will print out the parsed terminals and their corresponding values in generation. (Refer to this notebook for code example)

For example, consider the following grammar:

start: "foo" "(" ident ")" ";"
ident: [a-z]+

Given the output foo(abc);, the debug flag will print:

--------------------------------------------------
Parsed terminals:
(type: 'FOO', value: 'foo')
(type: 'LPAR', value: '(')
(type: 'IDENT', value: 'abc')
(type: 'RPAR', value: ')')
(type: 'SEMI', value: ';')
--------------------------------------------------

Lexer Ambiguity

When defining a grammar, be cautious of lexer ambiguities that arise when one terminal is a substring of another. In some cases, these ambiguities can lead to unexpected behavior in the parser. We are working on catching these ambiguities and providing more informative error messages. However, it is always a good practice to avoid such ambiguities in the first place. Consider the following example grammar: (Refer to this notebook for code example)

start: "random" "(" ident ")" ";"
ident: chars*                                
chars: "a"..."z"

In this grammar, the ident terminal matches any sequence (chars*) of lowercase letters (chars). However, this setup can lead to ambiguities because "random" and "ident" share the substring "r". To avoid such ambiguities, refine terminal definitions using specific regular expressions. For example:

start: "random" "(" ident ")" ";"
ident: [a-z]+

Here, ident is defined using [a-z]+, ensuring it only matches sequences of lowercase letters without overlap with other terminals.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

grammars

grammars

README.md

ℹ️ Adding new Grammars

LALR(1) or LR(1) Grammars

Debugging Grammars

Lexer Ambiguity

Backreferences and Lookarounds

1-character Lookahead Lexer Assumption

Handling indentation and other non-context-free constraints

Files

grammars

Directory actions

More options

Directory actions

More options

Latest commit

History

grammars

Folders and files

parent directory

README.md

ℹ️ Adding new Grammars

LALR(1) or LR(1) Grammars

Debugging Grammars

Lexer Ambiguity

Backreferences and Lookarounds

1-character Lookahead Lexer Assumption

Handling indentation and other non-context-free constraints