
Refactor to allow content_length limitation to webpreview generation #22

Open · wants to merge 2 commits into master
Conversation

@inkhey commented Feb 28, 2022

Hello, and thanks for this nice library.
This proposal is a big refactoring to limit the resource usage of webpreview, which we need for the Tracim software:

  • Add an optional content_length limit to cap the amount of data loaded by a query (see the sketch after this list).
  • Reduce to only one query per call to the webpreview function (no longer one request per preview type).
  • No longer check the URL scheme with an extra query (this will fix "duplicate requests sometimes not necessary" #15).
  • Better separation between the query code and the extraction code, making it easier for external programs to reuse the response to get extra information if necessary.
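
For illustration, here is a minimal sketch of how such a content_length cap could work using the streaming API of requests. The helper fetch_limited and its parameters are hypothetical, not the actual API introduced by this PR:

    import requests

    def fetch_limited(url, content_length=None, timeout=10.0):
        """Fetch at most content_length bytes of the response body.

        Hypothetical helper illustrating the idea behind a
        content_length limit; not the actual webpreview API.
        """
        with requests.get(url, stream=True, timeout=timeout) as resp:
            resp.raise_for_status()
            if content_length is None:
                return resp.text
            total, chunks = 0, []
            for chunk in resp.iter_content(chunk_size=8192):
                chunks.append(chunk)
                total += len(chunk)
                if total >= content_length:
                    break  # stop downloading once the cap is reached
            body = b"".join(chunks)[:content_length]
            return body.decode(resp.encoding or "utf-8", errors="replace")

Because stream=True defers the body download, breaking out of the loop means bytes past the cap are never transferred, which is what bounds the resource usage.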

@vduseev (Collaborator) commented May 25, 2022

Hi @inkhey,

Would you be open to the idea of retrieving the content only up to the content_limit_length instead of dropping it altogether? Below is some reasoning behind this idea.

  • BeautifulSoup is able to parse even broken, half-completed HTML content:

    In [5]: from bs4 import BeautifulSoup

    In [6]: example = """
     ...: <!DOCTYPE html>
     ...: <html>
     ...:     <head>
     ...:         <meta charset="utf-8">
     ...:         <meta name="viewport" content="width=device-width">
     ...:         <title></title>
     ...:         <meta property="og:title" content="a title" />
     ...:         <meta property="og:price:amo
     ...: """
    
    In [7]: soup = BeautifulSoup(example, "html.parser")
    
    In [9]: soup.title
    Out[9]: <title></title>
  • Tying the implementation to inheriting from requests.Response limits future options for easily adding support for async libraries such as aiohttp (see the sketch after this list).

  • There might be valid cases where users just want the first 1,000-10,000 characters of the webpage for parsing, saving memory and improving retrieval speed.
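
To illustrate the aiohttp point above, here is a minimal sketch of a capped read that is not tied to requests.Response; the helper fetch_partial and its limit parameter are hypothetical, not part of webpreview:

    import asyncio

    import aiohttp
    from bs4 import BeautifulSoup

    async def fetch_partial(url, limit=10_000):
        # Read at most `limit` bytes of the body, then stop;
        # the connection is released when the context exits.
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                raw = await resp.content.read(limit)
                return raw.decode(resp.charset or "utf-8", errors="replace")

    async def main():
        html = await fetch_partial("https://example.com", limit=5_000)
        soup = BeautifulSoup(html, "html.parser")  # partial HTML parses fine
        print(soup.title)

    asyncio.run(main())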

Do you think that could be a valid approach?
