Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: Add a chapter on start states to the book #468

Merged
merged 1 commit into from
Aug 26, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/src/SUMMARY.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
- [Lexing](lexing.md)
- [Lex compatibility](lexcompatibility.md)
- [Hand-written lexers](manuallexer.md)
- [Start States](start_states.md)
- [Parsing](parsing.md)
- [Yacc compatibility](yacccompatibility.md)
- [Return types and action code](actioncode.md)
Expand Down
4 changes: 2 additions & 2 deletions doc/src/lexcompatibility.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,8 +26,8 @@ There are several major differences between Lex and grmtools:
* Both Lex and grmtools lex files support start conditions as an optional prefix
to regular expressions, listing necessary states for the input expression to
be considered for matching against the input. Lex uses a special action
expression `BEGIN(state)` to switch to the named `state`. grmtools lex files
use a token name prefix.
expression `BEGIN(state)` to switch to the named `state`. Start states in grmtools
are described in [start_states](start_states.md).

* Character sets, and changes to internal array sizes are not supported by grmtools.

Expand Down
125 changes: 125 additions & 0 deletions doc/src/start_states.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,125 @@
# Start States

The following explains the syntax and semantics of Start States in lrlex.<br>
A working example can be found in the repository at [lrpar/examples/start_states][1]

[1]: https://github.com/softdevteam/grmtools/tree/master/lrpar/examples/start_states
## Motivation

Start states are a feature from lex which can be used for context sensitive lexing.
For instance, they can be used to implement nested comments (see the example in the repository).
Such that the tokens start/end markers of tokens maintain balance.

This is achieved by making rules which are qualified to match only when the lexer is in a
particular state. Additionally the lexer has a stack of states, and matching rules perform actions
which modify the stack.

## The INITIAL start state
Unless specified otherwise all lex rules are members of the *INITIAL* start state.

```
%%
<INITIAL>a "A"
<INITIAL>[\t \n]+ ;
```

This is the lex file below with no start states specified.

```
%%
a "A"
[\t \n]+ ;
```

## Rules matching multiple states

Rules can be matched in multiple states, just separate the states a rule should match in commas.
The following matches the `a` character when in either of the states `FirstState` or `SecondState`.

```
<FirstState,SecondState>a "A"
```

## Differences from POSIX lex

In posix lex start states are entered via code in the action, through either `BEGIN(STATE)` and
calling combinations of `yy_push_state`, and `yy_pop_state`.

Because lrlex is actionless, and does not support code actions, instead we have operators to
perform the common modifications to the stack of start states.

### Push
The push operator is given by the adding '+' to the target state on the right hand side within
angle brackets. The following when regex matches in the *CURRENT_STATE* pushes *TARGET_STATE* to
the top of a stack of states.

```
<CURRENT_STATE>Regex <+TARGET_STATE>;
```

### Pop
The pop operator is given by the adding '-' to the target state on the right hand side within angle
brackets. Similarly when in the current state, the following pops the current state off of the
stack of states. Similarly to calling `yy_pop_state` from action code.
```
<CURRENT_STATE>Regex <-CURRENT_STATE>;
```

### ReplaceStack
The ReplaceStack operator is given by naming the target state within angle brackets.
The ReplaceStack op clears the entire stack of states, then pushing the target state.

```
<CURRENT_STATE>Regex <TARGET_STATE>;
```

### Returning a token while performing an operator.
Start state operators can be combined with returning a token for example:

```
<CURRENT_STATE>Regex <+TARGET_STATE>"TOKEN"
```

## Adding a start state
Start stats come in two forms, *exclusive* and *inclusive*. These are given by `%x` and `%s`
respectively.

### Exclusive states
In an exclusive state, the rule can be matched *only* if the rule begins with the state specified.
In the following because `ExclState` is *exclusive*, the `#=` rule is only matched during the
`INITIAL` state, while the `a` and `=#` characters are only matched while in the `ExclState`.

```
%x ExclState
%%
#= <+ExclState>;
<ExclState>a "A"
<ExclState>=# <-ExclState>;
```

### Inclusive states

Inclusive states are added to the set of rules to be matched when the start state is unspecified.

```
%s InclusiveState
%%
a "A"
<InclusiveState>b "B"
<INITIAL>#= <+InclusiveState>;
<InclusiveState>=# <-InclusiveState>;
```

Is equivalent to the following using exclusive states.

```
%x Excl
%%
<INITIAL, Excl>a "A"
<Excl>b "B"
<INITIAL>#= <+Excl>;
<Excl>=# <-Excl>;
```