olivier edited this page Mar 8, 2018 · 22 revisions

Generic Lexer

The generic lexer aims at solving the performance issues of the Regex Lexer. The idea is to start from a limited set of classical lexemes and to refine this set to fit your needs. These lexemes are recognized through a Finite State Machine, which is far more efficient than looping through a set of regexes.

The basic lexemes are:

  • GenericToken.Identifier: an identifier. From version 2.0.3, Identifier accepts an extra parameter to specify an identifier pattern:
    • IdentifierType.Alpha: only alpha characters (default value; the only pattern available before version 2.0.3)
    • IdentifierType.AlphaNum: starting with an alpha char, followed by alpha or numeric chars
    • IdentifierType.AlphaNumDash: starting with an alpha or '_' (underscore) char, followed by alphanumeric, '-' (dash) or '_' (underscore) chars
  • GenericToken.String: a classical string delimited by double quotes ("). See below for more details.
  • GenericToken.Int: an int (i.e. a series of one or more digits)
  • GenericToken.Double: a float number (the decimal separator is the dot '.')
  • GenericToken.KeyWord: a keyword, i.e. an identifier with a special meaning (it comes with the same constraints as GenericToken.Identifier; here again performance comes at the price of less flexibility). This lexeme is configurable.
  • GenericToken.SugarToken: a general purpose lexeme with no special constraint, except that it cannot start with an alpha char. This lexeme is configurable.
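For instance, the identifier patterns above are selected through the extra Lexeme attribute parameter. A minimal sketch (the enum name MyToken is illustrative, not part of the library):

```csharp
public enum MyToken
{
    // accepts identifiers such as my_var-2: an alpha or '_' char first,
    // then alphanumeric, '-' or '_' chars (AlphaNumDash pattern)
    [Lexeme(GenericToken.Identifier, IdentifierType.AlphaNumDash)]
    IDENTIFIER = 1,

    // a series of one or more digits
    [Lexeme(GenericToken.Int)]
    INT = 2,
}
```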

To build a generic lexeme, the Lexeme attribute has 2 different constructors:

  • static generic lexemes: this constructor does a 1 to 1 mapping between a generic token and your lexer token. It takes only one parameter, the mapped generic token: [Lexeme(GenericToken.String)] (the static lexemes are String, Int, Double and Identifier)
  • configurable lexemes (KeyWord and SugarToken): this constructor takes 2 parameters:
    • the mapped GenericToken
    • the value of the keyword or sugar token.
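Both constructor forms can be mixed in the same token enum; a minimal sketch (the enum name SketchToken is illustrative):

```csharp
public enum SketchToken
{
    // static lexeme: 1 to 1 mapping to the generic Double token
    [Lexeme(GenericToken.Double)]
    DOUBLE = 1,

    // configurable lexeme: the mapped GenericToken plus the keyword value
    [Lexeme(GenericToken.KeyWord, "while")]
    WHILE = 2,

    // configurable lexeme: the mapped GenericToken plus the sugar token value
    [Lexeme(GenericToken.SugarToken, ":=")]
    ASSIGN = 3,
}
```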

strings

String lexeme definitions take 2 parameters:

  • a string delimiter char. Default is " (double quote)
  • an escape char to allow the use of the delimiter char inside a string. Default is \ (backslash). Using the same char for the delimiter and the escape char is allowed.

examples

  // matches 'hello \' world' => 'hello ' world'
  [Lexeme(GenericToken.String,"'","\\")]
  STRING

or

  // matches 'that''s my hello world' => 'that's my hello world'
  [Lexeme(GenericToken.String,"'","'")]
  STRING

Many string patterns

Many string patterns are allowed in the same lexer. For instance you may want to match double quote delimited strings as well as single quote delimited strings. For this you can simply apply many Lexeme attributes to the same enum value:

    // matches 'hello \' world' => 'hello ' world'
    // as well as "hello \" world" => "hello " world"
    [Lexeme(GenericToken.String,"'","\\")]
    [Lexeme(GenericToken.String,"\"","\\")]
    STRING

comments

The generic lexer offers support for comments. It introduces a special [Comment(singleLine, multiLineStart, multiLineEnd)] attribute on an enum value that declares the comment delimiters:

  • singleLine: the single line comment delimiter ("//" for all C derived languages)
  • multiLineStart: the starting multi line comment delimiter ("/*" for all C derived languages)
  • multiLineEnd: the closing multi line comment delimiter ("*/" for all C derived languages)

Comments are removed from the token stream before the parse starts, so the parser ignores them. Nevertheless you can still get them, for any special purpose, directly from the lexer.

full example, for a simple language (generic token based)

    public enum WhileTokenGeneric
    {

        #region keywords 0 -> 19
        
        [Lexeme(GenericToken.KeyWord,"if")]
        IF = 1,

        [Lexeme(GenericToken.KeyWord, "then")]
        THEN = 2,

        [Lexeme(GenericToken.KeyWord, "else")]
        ELSE = 3,

        [Lexeme(GenericToken.KeyWord, "while")]
        WHILE = 4,

        [Lexeme(GenericToken.KeyWord, "do")]
        DO = 5,

        [Lexeme(GenericToken.KeyWord, "skip")]
        SKIP = 6,

        [Lexeme(GenericToken.KeyWord, "true")]
        TRUE = 7,

        [Lexeme(GenericToken.KeyWord, "false")]
        FALSE = 8,
        [Lexeme(GenericToken.KeyWord, "not")]
        NOT = 9,

        [Lexeme(GenericToken.KeyWord, "and")]
        AND = 10,

        [Lexeme(GenericToken.KeyWord, "or")]
        OR = 11,

        [Lexeme(GenericToken.KeyWord, "(print)")]
        PRINT = 12,

        #endregion

        #region literals 20 -> 29

        // identifier with IdentifierType.AlphaNumDash pattern
        [Lexeme(GenericToken.Identifier, IdentifierType.AlphaNumDash)]
        IDENTIFIER = 20,

        [Lexeme(GenericToken.String)]
        STRING = 21,

        [Lexeme(GenericToken.Int)]
        INT = 22,

        #endregion

        #region operators 30 -> 49

        [Lexeme(GenericToken.SugarToken,">")]
        GREATER = 30,

        [Lexeme(GenericToken.SugarToken, "<")]
        LESSER = 31,

        [Lexeme(GenericToken.SugarToken, "==")]
        EQUALS = 32,

        [Lexeme(GenericToken.SugarToken, "!=")]
        DIFFERENT = 33,

        [Lexeme(GenericToken.SugarToken, ".")]
        CONCAT = 34,

        [Lexeme(GenericToken.SugarToken, ":=")]
        ASSIGN = 35,

        [Lexeme(GenericToken.SugarToken, "+")]
        PLUS = 36,

        [Lexeme(GenericToken.SugarToken, "-")]
        MINUS = 37,


        [Lexeme(GenericToken.SugarToken, "*")]
        TIMES = 38,

        [Lexeme(GenericToken.SugarToken, "/")]
        DIVIDE = 39,

        #endregion 

        #region sugar 50 -> 99

        [Lexeme(GenericToken.SugarToken, "(")]
        LPAREN = 50,

        [Lexeme(GenericToken.SugarToken, ")")]
        RPAREN = 51,

        [Lexeme(GenericToken.SugarToken, ";")]
        SEMICOLON = 52,

        #endregion
        
        #region comments : C like comments
        
        [Comment("//","/*","*/")]
        COMMENTS = 100,

        #endregion

        EOF = 0

    }