[WIP] Proposal : Change ANSI to Token parsing from regex to state machine #130

leonard-IMBERT · 2023-08-23T14:43:00Z

This pull request is a proposal showcasing a functional implementation of ANSI parsing using a state machine instead of regex

Motivation

Up until now the parsing of ANSI to Token representation was done using regex

pytermgui/pytermgui/regex.py

Lines 7 to 9 in 70f2f58

    
           RE_LINK = re.compile(r"(?:\x1b\]8;;([^\\]*)\x1b\\([^\\]*?)\x1b\]8;;\x1b\\)") 
        
           RE_ANSI_NEW = re.compile(rf"(\x1b\[(.*?)[mH])|{RE_LINK.pattern}|(\x1b_G(.*?)\x1b\\)") 
        
           RE_ANSI = re.compile(r"(?:\x1b\[(.*?)[mH])|(?:\x1b\](.*?)\x1b\\)|(?:\x1b_G(.*?)\x1b\\)")

Basically the regex are matched against the ANSI and if it trigger a Token is emitted. Problem is that it easily cause issue for corner cases. Last one was that hyperlink closing sequence OSC 8;; ESC \ needed to be paired with an hyperlink opening sequence, causing un-opened hyperlink closing sequence to be written in plain as a result.

The behavior documented here, used as a reference for pytermgui behavior, specify that hyperlink closing sequence should always be sent to the terminal

As such, in terminal emulators an OSC 8 escape sequence just changes the hyperlink (or lack thereof) to the new value. It is perfectly legal to switch from one hyperlink to another without explicitly closing the first one. It is also perfectly legal to close a hyperlink when it's not actually open (e.g. to make sure to clean up after a potentially unclean exit of an application).

So changes had to be made to follow this specification.

This change could be made using regex but would quickly become hairy with the need to save the context somewhere, regex becoming very big and unreadable, etc...

I think that for implementing future feature, and for readability, it's better to use a state machine and a character per character reading of the ANSI.

Implementation

Showcased implementation is not as cleaned as it could be but I didn't want to go too deep while I'm not sure this change is wanted.

The state are defined like this:

ESC="\x1b"

CSI="["
SGR="m"
CURSOR="H"

OSC="]"
HYPERLINK="8"

ST="\\"

SEP=";"

You can notice I differentiate CSI and OSC commands. I think it will help implementation of new feature to have this granularity at the parsing level.

The state machine implementation is in tokenize_ansi. Some cases are not taken into account yet

For example what happens if after a CSI is never terminated by a known character ?
What if and OSC if followed by an unknow command ?
What error should we throw and when to throw them ?

Stability

All the test are currently passing.

Move from a regex base parsing to a state-machine base parsing

bczsalba · 2023-09-06T07:45:49Z

pytermgui/markup/parsing.py

-def tokenize_ansi(  # pylint: disable=too-many-locals, too-many-branches, too-many-statements
-    text: str,
+def _tokenize_ansi_color(
+    params: List[str],


You can use list[str] here as we have __annotations__ imported for lower Python versions.

bczsalba · 2023-09-06T07:46:35Z

pytermgui/markup/parsing.py

-    for matchobj in RE_ANSI.finditer(text):
-        start, end = matchobj.span()
+            if part in ("38", "48"):
+                state = "COLOR"


I'd use an Enum for state representations; makes it harder to have a typo lead to undefined behaviour.

bczsalba · 2023-09-06T07:48:55Z

Sorry for the delay.

I like the idea! Truth is I've wanted to do this for ages, just never got around to it. I'd say you can go ahead with a full blown implementation. I marked the bits I had (small) issues with, but generally it looks good already.

Thanks for the help! :)

bczsalba · 2024-08-31T21:40:30Z

Any updates here? :)

leonard-IMBERT · 2024-09-02T14:42:08Z

Hi @bczsalba ! Sorry, for the long time without any updates, was quite busy lately and will probably be for the next few months but, if the issue is still open at that time, I think I'll come back to it :)

bczsalba · 2024-09-02T15:40:55Z

No worries! I've also been away a while, and I'm not really actively working on this project other than a few fixes here and there anyways. Just ping me if you have anything!

Change the how ANSI is parsed

581809b

Move from a regex base parsing to a state-machine base parsing

leonard-IMBERT mentioned this pull request Aug 23, 2023

Fix multiline Label widget to incorrectly display message with hyperlink #108

Merged

bczsalba reviewed Sep 6, 2023

View reviewed changes

bczsalba marked this pull request as draft February 24, 2024 12:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Proposal : Change ANSI to Token parsing from regex to state machine #130

[WIP] Proposal : Change ANSI to Token parsing from regex to state machine #130

leonard-IMBERT commented Aug 23, 2023

bczsalba Sep 6, 2023

bczsalba Sep 6, 2023

bczsalba commented Sep 6, 2023

bczsalba commented Aug 31, 2024

leonard-IMBERT commented Sep 2, 2024

bczsalba commented Sep 2, 2024

	RE_LINK = re.compile(r"(?:\x1b\]8;;([^\\])\x1b\\([^\\]?)\x1b\]8;;\x1b\\)")
	RE_ANSI_NEW = re.compile(rf"(\x1b\[(.?)[mH])\|{RE_LINK.pattern}\|(\x1b_G(.?)\x1b\\)")
	RE_ANSI = re.compile(r"(?:\x1b\[(.?)[mH])\|(?:\x1b\](.?)\x1b\\)\|(?:\x1b_G(.*?)\x1b\\)")

[WIP] Proposal : Change ANSI to Token parsing from regex to state machine #130

Are you sure you want to change the base?

[WIP] Proposal : Change ANSI to Token parsing from regex to state machine #130

Conversation

leonard-IMBERT commented Aug 23, 2023

Motivation

Implementation

Stability

bczsalba Sep 6, 2023

Choose a reason for hiding this comment

bczsalba Sep 6, 2023

Choose a reason for hiding this comment

bczsalba commented Sep 6, 2023

bczsalba commented Aug 31, 2024

leonard-IMBERT commented Sep 2, 2024

bczsalba commented Sep 2, 2024