Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise exceptions when receiving 400-5xx responses #38

Merged
merged 23 commits into from
May 20, 2022
Merged

Conversation

BurnzZ
Copy link
Contributor

@BurnzZ BurnzZ commented Apr 25, 2022

Built on top of #22.

TODO:

  • Tutorial Docs
  • Include execute() and batch_execute()
  • have the ability to allow all status codes
  • Explore both inclusion and exclusion list for status codes
    • opting for the default errors on 400-5xx and having an exclusion list using allow_status param

@BurnzZ BurnzZ requested a review from kmike April 25, 2022 14:12
@BurnzZ BurnzZ mentioned this pull request Apr 25, 2022
4 tasks
For responses that are not really errors like in the 100-3xx status code range,
this exception shouldn't be raised at all. However, for responses with status
codes in the 400-5xx range, the implementing framework should properly raise
the exception.
Copy link
Member

@kmike kmike Apr 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey! What do you think about a slightly different approach?

  1. State that frameworks must not raise exceptions for 400-5xx status code. They must return an HttpResponse when a response is received, with a proper status code set. I'm actually not sure how allow_status works in this PR, if a framework raises an exception for 400-5xx codes.
  2. Frameworks must raise HttpRequestError for connections errors, etc. I'm not sure about timeouts though; it's fine to make frameworks handle it for now, but we'd need to think about it further.
  3. We should define framework behavior for 3xx status codes, for redirects. For example, we can say that frameworks must follow redirects.
  4. Probably it's a right time to add another exception class, to be able to distinguish errors without a response, and errors where response is present. We can even have a response as an attribute of this exception: because of (1), this information would be available.

Copy link
Contributor Author

@BurnzZ BurnzZ Apr 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kmike , let me know what do you think about the points below?

  1. Sounds good. It's currently the expectation on the master branch. So not much changes here aside from perhaps improving the docs to reflect on it more clearly.

    • EDIT: To clarify the discussion, web-poet should still be raising errors for 400-5xx responses. The framework could utilize the allow_status param here.
    • I was also thinking of perhaps having a similar interface to aiohttp and request raise_for_status() method could cover 400-5xx errors. This shifts the focus of having the errors being the default into an intentional one dictated by the developer.
  2. We'll need to improve the docs on this regard regarding raising HttpRequestError for connection errors. For timeouts, we can create a new exception subclassing from HttpRequestError.

  3. Got it, we'll need to improve the docs for this one. Perhaps create an entirely new section for a framework's checklist.

  4. I think we can use the current HttpRequestError for errors without a response and create a new HttpReponseError for errors with responses. In addition, we could also create a parent class for both of them, perhaps calling it HttpError.

We can create a new PR for this as it has a different approach due to (1).

Copy link
Member

@kmike kmike Apr 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @BurnzZ! Regarding (1) - I was thinking that frameworks would not be raising these exceptions, but web-poet would. It could be a place where allow_status is checked as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I got that wrong. Thanks for the clarification! Will edit my points above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding exceptions: +1 to have 3 exceptions - a base exception, and 2 exceptions for these conditions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for timeouts, the main issue is who sets the timeout value - is it a framework, or is it something which can be controlled in PO? I think so far we've been side-stepping it, and letting framework to manage it.

But it means that using a specific timeout for a specific request, e.g.

  • decreasing it for requests which should be fast, or
  • increasing for requests which are expected to take long, or
  • setting a timeout if a library doesn't have a timeout

can't be done. I don't think this is something to solve in this PR, but it's something to think about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kmike ! I've reorganized the docs to better document all of these things. The main change is to properly distinguish the docs for Page Object developers vs developers that would implement a framework for web-poet. 3ebb297.

The new base class exception named HttpError was also introduced. 470bc6f.

Kindly let me know if there are other specific things we need to explicitly document here that were missed out.


Regarding the timeouts, yeah, that's indeed tricky. It also adds some additional complexity to Page Objects, since aside from knowing how to construct the requests, it would also need to consider (and establish an expectation) how long it would take the response to arrive.

There would be some factors that the Page Objects might not have access to, like the number of requests made to the same domain in the past minute (which could timeout due to rate limiting) or the shifting tide of non-200 responses for a given domain (where the site could throttle down requests as well). Letting a Page Object decide on timeouts based on limited knowledge is difficult.

I think at least for the short-term, we can let the framework handle it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good points about timeouts, I haven't considered those!

Base automatically changed from additional-requests to master April 28, 2022 04:03
@codecov
Copy link

codecov bot commented Apr 29, 2022

Codecov Report

Merging #38 (43bb5d9) into master (f2dce80) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master       #38   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           14        14           
  Lines          289       320   +31     
=========================================
+ Hits           289       320   +31     
Impacted Files Coverage Δ
web_poet/exceptions/http.py 100.00% <100.00%> (ø)
web_poet/page_inputs/client.py 100.00% <100.00%> (ø)

@BurnzZ BurnzZ marked this pull request as ready for review April 29, 2022 08:33
with pytest.raises(HttpResponseError) as excinfo:
await method(url_or_request)
assert isinstance(excinfo.value.request, HttpRequest)
assert isinstance(excinfo.value.response, HttpResponse)
Copy link
Contributor Author

@BurnzZ BurnzZ Apr 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kmike could you refresh my memory on why the HttpRequest and HttpResponse classes were set using @attr.define(..., eq=False)? link to code

Setting eq=True could allow quick equality comparisons here in the test. For example, assert excinfo.value.request == expected_request.

Copy link
Member

@kmike kmike Apr 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sorry for that, I should have added a comment. I was probably thinking it'd be reflected in a commit history, but then commited in a single commit.

My goal was to make these objects hashable, so that they can be used as keys in dictionaries, or put into sets. As per https://www.attrs.org/en/stable/hashing.html recommendations, I was choosing between frozen=True and eq=False. For some reason I was unable to make frozen=True work (e.g. it'd probably require request headers to be an immutable dict), so I decided to use eq=False. Give up equality by value, get hashing support. It's probably possible to have both, but it requires more thought.

(see :ref:`framework-exception-handling`).

The Downloader should be able to properly resolve **redirections** except when
the method is ``HEAD``. This means that the :class:`~.HttpResponse` that it'll
Copy link
Member

@kmike kmike May 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

except when the method is HEAD

That's an interesting behavior, a nice touch 👍 . It seems right to me. I wonder how easy would it be to implement it though. For example, Scrapy seems to redirect in case of HEAD: https://github.com/scrapy/scrapy/blob/b2afcbfe2bf090827540d072866bef0d1ab3a3e8/scrapy/downloadermiddlewares/redirect.py#L102

Copy link
Contributor Author

@BurnzZ BurnzZ May 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising this @kmike ! I didn't realize that this was the default behavior of Scrapy.

I think the easiest route would be for scrapy-poet to set meta={'dont_redirect'=True} when it sees a HEAD request method. Added a note in the TODO section of scrapinghub/scrapy-poet#62 so we wouldn't forget.

web_poet/exceptions/http.py Outdated Show resolved Hide resolved
web_poet/exceptions/http.py Outdated Show resolved Hide resolved
web_poet/requests.py Outdated Show resolved Hide resolved
web_poet/requests.py Outdated Show resolved Hide resolved
web_poet/exceptions/http.py Outdated Show resolved Hide resolved
Copy link
Member

@kmike kmike left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @BurnzZ!

@kmike kmike merged commit de6e24c into master May 20, 2022
@kmike
Copy link
Member

kmike commented May 20, 2022

Thanks @BurnzZ!

@kmike kmike deleted the status-code-err branch October 26, 2022 20:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants