Support non-unicode hostname #153

frankdilo · 2023-09-27T13:29:34Z

URLExtract does not match this URL as it should: сайт.com

Olaf- · 2023-10-27T11:20:13Z

This also applies to other examples like rohlík.cz or neovlivní.cz.

lipoja · 2023-12-26T20:14:17Z

@frankdilo, @Olaf-: Unfortunately, those URLs are not valid according to the RFC.

RFC3986
host = IP-literal / IPv4address / reg-name
where
reg-name = *( unreserved / pct-encoded / sub-delims )
and from that
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
and from that and RFC2234
ALPHA = %x41-5A / %x61-7A ; A-Z / a-z

As you can see, a domain name can't contain non-ASCII characters (letters with accents, hooks, ...).
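To make the grammar above concrete, here is a minimal sketch that checks a host string against the quoted `reg-name` rule. The helper name `is_reg_name` is hypothetical and is not part of URLExtract; the `pct-encoded` and `sub-delims` definitions are taken from RFC 3986, since they are not spelled out in this comment.

```python
import re

# reg-name = *( unreserved / pct-encoded / sub-delims ), per RFC 3986
UNRESERVED = r"[A-Za-z0-9\-._~]"   # ALPHA / DIGIT / "-" / "." / "_" / "~"
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"   # "%" HEXDIG HEXDIG
SUB_DELIMS = r"[!$&'()*+,;=]"      # sub-delims from RFC 3986
REG_NAME = re.compile(rf"(?:{UNRESERVED}|{PCT_ENCODED}|{SUB_DELIMS})*")


def is_reg_name(host: str) -> bool:
    """Hypothetical helper: True if host matches the reg-name grammar."""
    return REG_NAME.fullmatch(host) is not None


print(is_reg_name("example.com"))   # plain ASCII host
print(is_reg_name("сайт.com"))      # Cyrillic letters are outside ALPHA
print(is_reg_name("rohlík.cz"))     # "í" is outside ALPHA
```

This is a syntax check only: it says nothing about whether the host resolves, and because of the `*` repetition an empty string also matches.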

I am open to discussion, but I would suggest a workaround: convert all characters to ASCII, then use URLExtract to find the URLs and their positions, and extract the URLs from the original text at those positions.
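One possible reading of this workaround is sketched below. It assumes the ASCII conversion is length-preserving (each non-ASCII character becomes exactly one ASCII placeholder), so that positions found in the converted text are valid in the original. The names `ascii_mask`, `find_urls_in_original`, and `demo_finder` are hypothetical. `demo_finder` is only a regex stand-in so the sketch runs without dependencies; with URLExtract installed, the finder could be `lambda t: URLExtract().find_urls(t, get_indices=True)`, which returns `(url, (start, end))` pairs.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int]
Finder = Callable[[str], List[Tuple[str, Span]]]


def ascii_mask(text: str, placeholder: str = "x") -> str:
    """Replace each non-ASCII character with one ASCII character.

    The output has the same length as the input, so indices stay aligned.
    Non-ASCII whitespace becomes a plain space so words are not glued together.
    """
    return "".join(
        ch if ord(ch) < 128 else (" " if ch.isspace() else placeholder)
        for ch in text
    )


def find_urls_in_original(text: str, finder: Finder) -> List[str]:
    """Find URLs in the masked text, then slice them out of the original."""
    masked = ascii_mask(text)
    return [text[start:end] for _url, (start, end) in finder(masked)]


def demo_finder(text: str) -> List[Tuple[str, Span]]:
    """Stand-in for a real URL finder; only knows the .com and .cz TLDs."""
    pattern = re.compile(r"[A-Za-z0-9-]+\.(?:com|cz)\b")
    return [(m.group(), m.span()) for m in pattern.finditer(text)]


text = "Visit сайт.com or rohlík.cz today"
print(find_urls_in_original(text, demo_finder))
```

A limitation of this masking approach: a non-ASCII TLD such as .рф is masked as well, so the finder no longer sees a known TLD and fully non-ASCII domains would still be missed.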

hwo411 · 2024-03-04T08:28:54Z

Also applies to fully Cyrillic domains like сайт.рф (even if you prepend it with https://). Would be great to see it fixed.

E.g., twitter-text in Ruby handles this properly: https://github.com/twitter/twitter-text/blob/master/rb/lib/twitter-text/regex.rb#L257

lipoja added the need info label Feb 26, 2024
