safety: `safeify_url()` function can't handle (invalid) URLs with `[hostname]` placeholder #2644

dgw · 2024-11-17T00:56:48Z

If someone sends a link while safety is active in the channel, and the URL contains a placeholder hostname in square brackets, Sopel will spit out an "Unexpected ValueError" message. Note: this seems to happen only on Python 3.11 or higher.

This comes from the safeify_url() function added in #2279. It uses urllib.parse.urlparse() to make sanitizing URL parts easier; urlparse() in turn calls ipaddress.ip_address() so it can raise an error for bracketed IPv4 addresses, and that call trips on this weird edge case:

sopel/sopel/builtins/safety.py

Lines 127 to 133 in e7d8648

    def safeify_url(url: str) -> str:
        """Replace bits of a URL to make it hard to browse to."""
        parts = urlparse(url)
        scheme = "hxx" + parts.scheme[3:]  # hxxp
        netloc = parts.netloc.replace(".", "[.]")  # google[.]com and IPv4
        netloc = netloc.replace(":", "[:]")  # IPv6 addresses (bad lazy method)
        return urlunparse((scheme, netloc) + parts[2:])

Simple examples using the Python console:

>>> # Python 3.10
>>> safety.safeify_url('http://[Target-IP]/cgi-bin/account_mgr.cgi')
'hxxp://[Target-IP]/cgi-bin/account_mgr.cgi'

>>> # Python 3.11
>>> safety.safeify_url('http://[Target-IP]/cgi-bin/account_mgr.cgi')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dgw/github/sopel-irc/sopel/sopel/builtins/safety.py", line 129, in safeify_url
    parts = urlparse(url)
            ^^^^^^^^^^^^^
  File "/home/dgw/.pyenv/versions/3.11.10/lib/python3.11/urllib/parse.py", line 395, in urlparse
    splitresult = urlsplit(url, scheme, allow_fragments)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dgw/.pyenv/versions/3.11.10/lib/python3.11/urllib/parse.py", line 500, in urlsplit
    _check_bracketed_host(bracketed_host)
  File "/home/dgw/.pyenv/versions/3.11.10/lib/python3.11/urllib/parse.py", line 446, in _check_bracketed_host
    ip = ipaddress.ip_address(hostname) # Throws Value Error if not IPv6 or IPv4
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dgw/.pyenv/versions/3.11.10/lib/python3.11/ipaddress.py", line 54, in ip_address
    raise ValueError(f'{address!r} does not appear to be an IPv4 or IPv6 address')
ValueError: 'Target-IP' does not appear to be an IPv4 or IPv6 address
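
The same behavior can probably be reproduced without Sopel at all, since the traceback ends inside the standard library. The snippet below is a minimal sketch for comparing interpreters; `try_parse()` is a hypothetical helper name, not something from Sopel or urllib.

```python
import sys
from urllib.parse import urlparse


def try_parse(url: str) -> str:
    """Return the parsed netloc, or a description of the ValueError raised."""
    try:
        return "netloc=" + urlparse(url).netloc
    except ValueError as exc:
        # Newer Pythons validate the contents of a bracketed host.
        return "ValueError: " + str(exc)


if __name__ == "__main__":
    # Run under different interpreters to compare the results.
    url = "http://[Target-IP]/cgi-bin/account_mgr.cgi"
    print(sys.version_info[:2], try_parse(url))
```

Running it under both 3.10 and 3.11 should show whether the difference comes from urllib alone.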

The version inconsistency is going to be the worst part of designing a "correct" fix for this. A simple fallback approach (such as return url.replace('http', 'hxxp', 1) if url.startswith('http') else url) will miss more complicated cases that are still handled fine in older Python versions (output using 3.10 shown here):

>>> safety.safeify_url('http://[Target.IP]/cgi-bin/account_mgr.cgi')  # dot gets bracketed
'hxxp://[Target[.]IP]/cgi-bin/account_mgr.cgi'
>>> safety.safeify_url('http://[Target:IP]/cgi-bin/account_mgr.cgi')  # colon gets bracketed
'hxxp://[Target[:]IP]/cgi-bin/account_mgr.cgi'

Do note though that all this is an edge case of an edge case. People must intentionally construct these invalid URLs, and can be trained to simply use another type of bracket for placeholders instead, such as http://<Target-IP>/cgi-bin/account_mgr.cgi.


half-duplex · 2024-11-21T02:34:28Z

PR opened to be more graceful about it, but considering http://<foo>/ and http://999.999.999.999/ are parsed successfully, I think this is a urllib bug?

>>> urlparse("http://<test>/")
ParseResult(scheme='http', netloc='<test>', path='/', params='', query='', fragment='')

dgw · 2024-11-21T03:40:48Z

> but considering http://<foo>/ and http://999.999.999.999/ are parsed successfully, I think this is a urllib bug?

Square brackets are special. The reported "error" URL is actually invalid per RFC 3986 § 3.2.2:

> A host identified by an Internet Protocol literal address, version 6 [RFC3513] or later, is distinguished by enclosing the IP literal within square brackets ("[" and "]"). This is the only place where square bracket characters are allowed in the URI syntax.

The only thing that's supposed to go in square brackets in a URI is an IPv6 (or later) literal, full stop. You could probably even get away with not "safeifying" these invalid links at all, since they can't be followed anyway.
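
As a rough illustration of that rule, a check along these lines could tell a placeholder apart from a real IPv6 literal. It is a simplified sketch: `has_invalid_bracketed_host()` and `BRACKETED_HOST` are hypothetical names, and the "IPvFuture" form that RFC 3986 also allows in brackets is deliberately ignored.

```python
import ipaddress
import re

# Scheme, "://", optional userinfo, then a host wrapped in square brackets.
BRACKETED_HOST = re.compile(
    r"^[a-zA-Z][a-zA-Z0-9+.-]*://(?:[^/?#@]*@)?\[([^\]/?#]*)\]"
)


def has_invalid_bracketed_host(url: str) -> bool:
    """Return True if the URL has a bracketed host that is not an IPv6 literal."""
    match = BRACKETED_HOST.match(url)
    if match is None:
        return False  # no bracketed host at all
    try:
        address = ipaddress.ip_address(match.group(1))
    except ValueError:
        return True  # e.g. a placeholder like [Target-IP]
    # IPv4 addresses are valid hosts, but not inside brackets.
    return not isinstance(address, ipaddress.IPv6Address)
```

A caller could use the result to skip defanging entirely, or to pick a string-based path before urlparse() ever sees the URL.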

And yes, I know that still leaves us with inconsistent behavior between different Python versions. But Python apparently decided urllib should be stricter about following the URI spec. 🤷‍♂️

dgw added the Bug label Nov 17, 2024

half-duplex linked a pull request Nov 21, 2024 that will close this issue

safety: fix safeify_url() exception on python 3.11 #2646

