[WIP] Proposal: Change ANSI to Token parsing from regex to state machine #130
base: master
Conversation
Move from regex-based parsing to state-machine-based parsing
```python
def tokenize_ansi(  # pylint: disable=too-many-locals, too-many-branches, too-many-statements
    text: str,
```

```python
def _tokenize_ansi_color(
    params: List[str],
```
You can use `list[str]` here, as we have `annotations` imported from `__future__` for lower Python versions.
```python
for matchobj in RE_ANSI.finditer(text):
    start, end = matchobj.span()
```

```python
if part in ("38", "48"):
    state = "COLOR"
```
I'd use an Enum for state representations; makes it harder to have a typo lead to undefined behaviour.
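A brief illustration of the suggestion (the state names are hypothetical, not the PR's actual code): a mistyped string state silently never matches, while a mistyped Enum member fails loudly at attribute lookup:

```python
from enum import Enum, auto


class State(Enum):
    """Hypothetical parser states; real names would come from the PR."""
    PLAIN = auto()
    CSI = auto()
    OSC = auto()
    COLOR = auto()


state = "COLOR"
# A typo in a string comparison is just False -- the branch is skipped
# silently and the parser drifts into undefined behaviour:
assert (state == "COLOUR") is False

# The same typo on an Enum raises AttributeError immediately:
try:
    State.COLOUR
except AttributeError:
    print("typo caught at lookup time")
```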
Sorry for the delay. I like the idea! Truth is, I've wanted to do this for ages, just never got around to it. I'd say you can go ahead with a full-blown implementation. I marked the bits I had (small) issues with, but generally it looks good already. Thanks for the help! :)
Any updates here? :)
Hi @bczsalba! Sorry for the long time without any updates, I've been quite busy lately and probably will be for the next few months, but if the issue is still open by then, I think I'll come back to it :)
No worries! I've also been away a while, and I'm not really actively working on this project other than a few fixes here and there anyway. Just ping me if you have anything!
This pull request is a proposal showcasing a functional implementation of ANSI parsing using a state machine instead of regex.
Motivation
Up until now, the parsing of ANSI to `Token` representation was done using regex:

pytermgui/pytermgui/regex.py, Lines 7 to 9 in 70f2f58
Basically, the regexes are matched against the ANSI text, and when one matches, a Token is emitted. The problem is that this easily causes issues in corner cases. The last one was that the hyperlink closing sequence `OSC 8;; ESC \` needed to be paired with a hyperlink opening sequence, causing un-opened hyperlink closing sequences to be written out as plain text. The behavior documented here, used as a reference for pytermgui's behavior, specifies that the hyperlink closing sequence should always be sent to the terminal.
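For reference, the sequences in question look like this (a sketch; the helper function is made up for illustration, not part of pytermgui):

```python
ESC = "\x1b"
OSC = ESC + "]"   # Operating System Command introducer: ESC ]
ST = ESC + "\\"   # String Terminator: ESC \


def hyperlink(uri: str, label: str) -> str:
    """Illustrative helper: wrap a label in an OSC 8 hyperlink."""
    opener = f"{OSC}8;;{uri}{ST}"  # OSC 8 ; params ; URI ST
    closer = f"{OSC}8;;{ST}"       # OSC 8 ;; ST -- the closing sequence
    return opener + label + closer


# Per the referenced behaviour, a closer should be passed through to the
# terminal even when no opener preceded it, not printed as plain text.
```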
So changes had to be made to follow this specification.
This change could be made using regex, but it would quickly become hairy: the context would need to be saved somewhere, the regex would become very big and unreadable, etc.
I think that for implementing future features, and for readability, it's better to use a state machine and character-by-character reading of the ANSI input.
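The character-by-character approach can be sketched roughly like this (a minimal illustration, not the PR's actual implementation; the token names and the CSI-only handling are assumptions):

```python
def tokenize(text):
    """Character-by-character scan yielding ("PLAIN", text) and ("CSI", body) pairs."""
    state = "PLAIN"
    buffer = ""
    for char in text:
        if state == "PLAIN":
            if char == "\x1b":          # ESC: flush plain text, enter escape state
                if buffer:
                    yield ("PLAIN", buffer)
                buffer = ""
                state = "ESCAPE"
            else:
                buffer += char
        elif state == "ESCAPE":
            # Only CSI ("[") is handled in this sketch; OSC ("]") etc. are omitted.
            state = "CSI" if char == "[" else "PLAIN"
        elif state == "CSI":
            if char.isalpha():          # a final byte terminates the sequence
                yield ("CSI", buffer + char)
                buffer = ""
                state = "PLAIN"
            else:
                buffer += char
    if state == "PLAIN" and buffer:     # flush any trailing plain text
        yield ("PLAIN", buffer)
```

For example, `list(tokenize("\x1b[1mbold\x1b[0m"))` yields `[("CSI", "1m"), ("PLAIN", "bold"), ("CSI", "0m")]`. The context (which state we are in, what has been buffered) lives in plain local variables instead of being encoded into an ever-growing regex.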
Implementation
The showcased implementation is not as clean as it could be, but I didn't want to go too deep while I'm not sure this change is wanted.
The states are defined like this:
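A sketch of what such state definitions might look like, inferred from the snippets in this thread (the `"COLOR"` string appears in the diff above; the other names are guesses):

```python
# Hypothetical reconstruction: plain string states, with CSI and OSC kept distinct.
STATE_PLAIN = "PLAIN"    # ordinary text
STATE_ESCAPE = "ESCAPE"  # just read ESC (\x1b)
STATE_CSI = "CSI"        # Control Sequence Introducer: ESC [
STATE_OSC = "OSC"        # Operating System Command: ESC ]
STATE_COLOR = "COLOR"    # inside a 38;... / 48;... SGR colour parameter list
```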
You can notice I differentiate `CSI` and `OSC` commands. I think having this granularity at the parsing level will help the implementation of new features. The state machine implementation is in `tokenize_ansi`. Some cases are not taken into account yet:
- What if a `CSI` sequence is never terminated by a known character?
- What if an `OSC` sequence is followed by an unknown command?

Stability
All the tests are currently passing.