test harness can't deal with Unicode regular expressions #395

otterley · 2020-02-14T23:42:47Z

The test harness (cfn test) can't handle regex patterns in Resource Provider schemas that use Unicode matchers like \p{L} (Unicode letter).

Here's an example that is based on AWS::IAM::Role from the CloudFormation documentation:

        "Description": {
            "description": "A description of the role",
            "type": "string",
            "maxLength": 1000,
            "pattern": "[\\p{L}\\p{M}\\p{Z}\\p{S}\\p{N}\\p{P}]*"
        },

This is likely due to a limitation of Python's built-in re module, which does not support \p{...} Unicode property escapes.
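A minimal sketch to reproduce the limitation outside the test harness, using only the standard library. The helper name `compiles_with_re` is made up for illustration; the pattern is the one from the schema snippet above, and the behavior shown is what the traceback below reports for Python 3.8.

```python
import re

# The "Description" pattern from the schema snippet, after JSON unescaping.
PATTERN = r"[\p{L}\p{M}\p{Z}\p{S}\p{N}\p{P}]*"


def compiles_with_re(pattern: str) -> bool:
    """Return True if the standard-library re module accepts the pattern.

    Hypothetical helper, not part of the cfn CLI.
    """
    try:
        re.compile(pattern)
    except re.error:
        # re rejects \p as a "bad escape", as in the traceback in this issue.
        return False
    return True


if __name__ == "__main__":
    print(compiles_with_re(PATTERN))
    print(compiles_with_re(r"[A-Za-z0-9 ]*"))
```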

Consider using the more modern regex library (https://pypi.org/project/regex/), which understands these Unicode property escapes.
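For comparison, a rough sketch of the same pattern under the third-party regex package, assuming it has been installed with `pip install regex`. The function name `description_matches` is hypothetical, and this only shows that the pattern compiles and matches there; whether the package fits into the test harness is a separate question.

```python
try:
    import regex  # third-party package, not the standard-library re
except ImportError:  # keep this sketch importable when regex is absent
    regex = None

# The "Description" pattern from the schema snippet, after JSON unescaping.
DESCRIPTION_PATTERN = r"[\p{L}\p{M}\p{Z}\p{S}\p{N}\p{P}]*"


def description_matches(text: str) -> bool:
    """Return True if the whole string matches the schema pattern.

    Hypothetical helper for illustration only.
    """
    if regex is None:
        raise RuntimeError("the third-party 'regex' package is not installed")
    return regex.fullmatch(DESCRIPTION_PATTERN, text) is not None


if __name__ == "__main__" and regex is not None:
    # Letters, spaces and punctuation are covered by the listed categories.
    print(description_matches("Rôle d'accès"))
    # A newline is a control character, which none of the categories include.
    print(description_matches("line one\nline two"))
```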

Error follows:

source = <sre_parse.Tokenizer object at 0x10bbc6850>, escape = '\\p'

    def _class_escape(source, escape):
        # handle escape code inside character class
        code = ESCAPES.get(escape)
        if code:
            return code
        code = CATEGORIES.get(escape)
        if code and code[0] is IN:
            return code
        try:
            c = escape[1:2]
            if c == "x":
                # hexadecimal escape (exactly two digits)
                escape += source.getwhile(2, HEXDIGITS)
                if len(escape) != 4:
                    raise source.error("incomplete escape %s" % escape, len(escape))
                return LITERAL, int(escape[2:], 16)
            elif c == "u" and source.istext:
                # unicode escape (exactly four digits)
                escape += source.getwhile(4, HEXDIGITS)
                if len(escape) != 6:
                    raise source.error("incomplete escape %s" % escape, len(escape))
                return LITERAL, int(escape[2:], 16)
            elif c == "U" and source.istext:
                # unicode escape (exactly eight digits)
                escape += source.getwhile(8, HEXDIGITS)
                if len(escape) != 10:
                    raise source.error("incomplete escape %s" % escape, len(escape))
                c = int(escape[2:], 16)
                chr(c) # raise ValueError for invalid code
                return LITERAL, c
            elif c == "N" and source.istext:
                import unicodedata
                # named unicode escape e.g. \N{EM DASH}
                if not source.match('{'):
                    raise source.error("missing {")
                charname = source.getuntil('}', 'character name')
                try:
                    c = ord(unicodedata.lookup(charname))
                except KeyError:
                    raise source.error("undefined character name %r" % charname,
                                       len(charname) + len(r'\N{}'))
                return LITERAL, c
            elif c in OCTDIGITS:
                # octal escape (up to three digits)
                escape += source.getwhile(2, OCTDIGITS)
                c = int(escape[1:], 8)
                if c > 0o377:
                    raise source.error('octal escape value %s outside of '
                                       'range 0-0o377' % escape, len(escape))
                return LITERAL, c
            elif c in DIGITS:
                raise ValueError
            if len(escape) == 2:
                if c in ASCIILETTERS:
>                   raise source.error('bad escape %s' % escape, len(escape))
E                   re.error: bad escape \p at position 1

../../.pyenv/versions/3.8.1/lib/python3.8/sre_parse.py:349: error

johnttompkins · 2020-02-18T17:56:42Z

Thanks for raising this. We are using hypothesis strategies to generate the examples. Let me see if the suggested library would play well with hypothesis.

PatMyron · 2021-04-29T17:27:02Z

JSON Schema itself recommends sticking to a minimal subset of regular expression syntax, and we're now encouraging the same:
#675 (comment)
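As an illustration of what sticking to a minimal subset could mean in practice, here is a sketch that walks a schema and reports "pattern" values containing \p{...} or \P{...}. It is not the check implemented in #675; the names `iter_patterns` and `find_unicode_property_patterns` are invented for this example, and it only looks at "pattern" keywords (not "patternProperties" keys).

```python
import re
from typing import Any, Iterator, List, Tuple

# \p{ or \P{ preceded by an even number of backslashes, so that an escaped
# backslash followed by a literal "p{" is not reported.
UNICODE_PROPERTY = re.compile(r"(?<!\\)(?:\\\\)*\\[pP]\{")


def iter_patterns(node: Any, path: str = "#") -> Iterator[Tuple[str, str]]:
    """Yield (path, pattern) for every string-valued "pattern" keyword."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}/{key}"
            if key == "pattern" and isinstance(value, str):
                yield child, value
            else:
                yield from iter_patterns(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from iter_patterns(item, f"{path}/{index}")


def find_unicode_property_patterns(schema: Any) -> List[Tuple[str, str]]:
    """Return the patterns that use Unicode property escapes."""
    return [
        (path, pattern)
        for path, pattern in iter_patterns(schema)
        if UNICODE_PROPERTY.search(pattern)
    ]


if __name__ == "__main__":
    example_schema = {
        "properties": {
            "Description": {"type": "string", "pattern": "[\\p{L}\\p{M}]*"},
            "Name": {"type": "string", "pattern": "^[a-z]+$"},
        }
    }
    for path, pattern in find_unicode_property_patterns(example_schema):
        print(f"{path}: {pattern}")
```

Reporting the path alongside the pattern makes it easier to locate the offending property in a large schema.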

johnttompkins added the contract tests label Feb 18, 2020

PatMyron closed this as completed May 3, 2021

PatMyron added the schema processing label May 5, 2021

PatMyron mentioned this issue May 5, 2021

catch common resource schema issues in cfn validate #675

Merged
