
Refactor to allow content_length limitation to webpreview generation #22

Open · wants to merge 2 commits into master
Conversation

@inkhey commented Feb 28, 2022

Hello, and thanks for this nice library.
This proposal is a big refactoring to limit the resource usage of webpreview, which we need for the Tracim software:

  • Add an optional content_length limit to cap the amount of data loaded by a query (see the sketch after this list).
  • Reduce to only one query per call to the webpreview function (no longer one request per preview type).
  • No longer check the URL scheme with an extra query (this will fix "duplicate requests sometimes not necessary" #15).
  • Better separation between the query code and the extraction code, making it easier for external programs to reuse the response to get extra information if necessary.
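
For illustration, here is a minimal sketch of how such a content_length cap could work using the streaming API of requests. The helper fetch_limited and its parameters are hypothetical, not the actual API introduced by this PR:

    import requests

    def fetch_limited(url, content_length=None, timeout=10.0):
        """Fetch at most content_length bytes of the response body.

        Hypothetical helper illustrating the idea behind a
        content_length limit; not the actual webpreview API.
        """
        with requests.get(url, stream=True, timeout=timeout) as resp:
            resp.raise_for_status()
            if content_length is None:
                return resp.text
            total, chunks = 0, []
            for chunk in resp.iter_content(chunk_size=8192):
                chunks.append(chunk)
                total += len(chunk)
                if total >= content_length:
                    break  # stop downloading once the cap is reached
            body = b"".join(chunks)[:content_length]
            return body.decode(resp.encoding or "utf-8", errors="replace")

Because stream=True defers the body download, breaking out of the loop means bytes past the cap are never transferred, which is what bounds the resource usage.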

@vduseev (Collaborator) commented May 25, 2022

Hi @inkhey,

Would you be open to the idea of retrieving the content only up to the content_limit_length instead of dropping it altogether? Below is some reasoning behind this idea.

  • BeautifulSoup is able to parse even broken, half-completed HTML content:

    In [5]: from bs4 import BeautifulSoup

    In [6]: example = """
     ...: <!DOCTYPE html>
     ...: <html>
     ...:     <head>
     ...:         <meta charset="utf-8">
     ...:         <meta name="viewport" content="width=device-width">
     ...:         <title></title>
     ...:         <meta property="og:title" content="a title" />
     ...:         <meta property="og:price:amo
     ...: """
    
    In [7]: soup = BeautifulSoup(example, "html.parser")
    
    In [9]: soup.title
    Out[9]: <title></title>
  • Tying the implementation to inheriting from requests.Response limits future options for easily adding support for async libraries such as aiohttp (see the sketch after this list).

  • There might be valid cases where users just want the first 1,000-10,000 characters of the webpage for parsing, saving memory and improving retrieval speed.
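
To illustrate the aiohttp point above, here is a minimal sketch of a capped read that is not tied to requests.Response; the helper fetch_partial and its limit parameter are hypothetical, not part of webpreview:

    import asyncio

    import aiohttp
    from bs4 import BeautifulSoup

    async def fetch_partial(url, limit=10_000):
        # Read at most `limit` bytes of the body, then stop;
        # the connection is released when the context exits.
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                raw = await resp.content.read(limit)
                return raw.decode(resp.charset or "utf-8", errors="replace")

    async def main():
        html = await fetch_partial("https://example.com", limit=5_000)
        soup = BeautifulSoup(html, "html.parser")  # partial HTML parses fine
        print(soup.title)

    asyncio.run(main())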

Do you think that could be a valid approach?
