HTML character references &foo; are cut at the semi-colon #62

jayvdb · 2020-02-27T18:04:43Z

A URL containing an XML entity/HTML character reference, such as http://.../..?foo&bar;baz, will be cut at the semi-colon.


jayvdb · 2020-04-05T03:57:38Z

Another example is &#39; in https://docs.red-dove.com/cfg/python.html, which results in a 404 at https://freeotp.github.io/&#39 (the ; is omitted, but it is the & that causes the 404, as it doesn't follow a ?).
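For illustration only, here is a minimal sketch of how decoding character references before looking for URLs avoids both symptoms. It uses only the standard library `html.unescape`; the example.com URL is made up for the demonstration, and nothing here is urlextract's own API.

```python
import html

# Raw HTML source: the query-string separator is written as the
# character reference "&amp;" (hypothetical URL, for illustration).
raw_url = "http://example.com/path?foo=1&amp;bar=2"

# Decoding first turns the reference back into a plain "&", so no
# semi-colon is left inside the URL to cut it short.
decoded_url = html.unescape(raw_url)
print(decoded_url)

# A numeric reference such as "&#39;" decodes to an apostrophe, which
# an extractor can then treat as ordinary surrounding punctuation.
apostrophe = html.unescape("&#39;")
print(apostrophe)
```

Whether decoding belongs before or after extraction is a design choice; decoding first is the simpler option when the input is known to be HTML.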

jayvdb · 2020-04-13T09:00:19Z

It might be useful to have the caller tell the parser what type of text is being provided (such as html, xml, md, or rst). That would give the parser clues when it is trying to find the start and end of URLs, and tell it what decoding to perform on the URL.

Or have a hook in the class which is given the location of the hostname, so the hook can decide the start and end of the URL which surrounds the hostname. Then I could subclass URLExtract once per doctype and implement this hook for each. _complete_url almost does this, but it would need to be a public member of the API.
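A rough sketch of what such a hook could look like follows. Everything in it is hypothetical: `HostnameHookExtractor`, `url_bounds`, `decode`, and `stop_chars` are names invented for this example and are not part of urlextract. The base class stops at `;` purely to mimic the symptom reported here; the HTML subclass overrides the stop characters and the decoding step.

```python
import html
import re


class HostnameHookExtractor:
    """Hypothetical extractor with overridable hooks (not urlextract)."""

    # Characters that end a URL. Including ";" is an assumption made
    # only to reproduce the cut-at-semi-colon symptom in this sketch.
    stop_chars = set(' \t\n"<>;')

    def url_bounds(self, text, host_start, host_end):
        """Hook: return (start, end) of the URL around the hostname."""
        start = host_start
        while start > 0 and text[start - 1] not in self.stop_chars:
            start -= 1
        end = host_end
        while end < len(text) and text[end] not in self.stop_chars:
            end += 1
        return start, end

    def decode(self, url):
        """Hook: decode the raw URL text; plain text is returned as-is."""
        return url

    def extract(self, text, hostname):
        """Return the decoded URL around each occurrence of hostname."""
        urls = []
        for match in re.finditer(re.escape(hostname), text):
            start, end = self.url_bounds(text, match.start(), match.end())
            urls.append(self.decode(text[start:end]))
        return urls


class HtmlExtractor(HostnameHookExtractor):
    """HTML variant: ";" may belong to a character reference."""

    stop_chars = set(' \t\n"<>')

    def decode(self, url):
        # Resolve character references such as "&amp;" after slicing.
        return html.unescape(url)


sample = '<a href="http://example.com/path?foo=1&amp;bar=2">link</a>'
plain_result = HostnameHookExtractor().extract(sample, "example.com")
html_result = HtmlExtractor().extract(sample, "example.com")
print(plain_result)
print(html_result)
```

With this shape, a subclass for xml, md, or rst would only need to override the stop characters and the `decode` hook, which is roughly the per-doctype override described above.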

jayvdb mentioned this issue Apr 5, 2020

pypidb issues #68

Open

lipoja added bug high labels Apr 11, 2020

lipoja added this to the 0.15.0 milestone Apr 11, 2020

lipoja modified the milestones: 1.0.0, 1.1.0 Jun 20, 2020

lipoja removed this from the 1.1.0 milestone Oct 1, 2020
