Raise exceptions when receiving 400-5xx responses #38
For responses that are not really errors like in the 100-3xx status code range,
this exception shouldn't be raised at all. However, for responses with status
codes in the 400-5xx range, the implementing framework should properly raise
the exception.
Hey! What do you think about a slightly different approach?
- State that frameworks must not raise exceptions for 400-5xx status codes. They must return an HttpResponse when a response is received, with a proper status code set. I'm actually not sure how allow_status works in this PR, if a framework raises an exception for 400-5xx codes.
- Frameworks must raise HttpRequestError for connection errors, etc. I'm not sure about timeouts though; it's fine to make frameworks handle it for now, but we'd need to think about it further.
- We should define framework behavior for 3xx status codes, for redirects. For example, we can say that frameworks must follow redirects.
- It's probably the right time to add another exception class, to be able to distinguish errors without a response from errors where a response is present. We can even have a response as an attribute of this exception: because of (1), this information would be available.
Hi @kmike, let me know what you think about the points below.
- Sounds good. It's currently the expectation on the master branch. So not much changes here aside from perhaps improving the docs to reflect it more clearly.
  - EDIT: To clarify the discussion, web-poet should still be raising errors for 400-5xx responses. The framework could utilize the allow_status param here.
  - I was also thinking of perhaps having a similar interface to aiohttp and requests, where a raise_for_status() method could cover 400-5xx errors. This turns raising errors from the default behavior into an intentional choice made by the developer.
- We'll need to improve the docs regarding raising HttpRequestError for connection errors. For timeouts, we can create a new exception subclassing from HttpRequestError.
- Got it, we'll need to improve the docs for this one. Perhaps create an entirely new section for a framework's checklist.
- I think we can use the current HttpRequestError for errors without a response and create a new HttpResponseError for errors with responses. In addition, we could also create a parent class for both of them, perhaps calling it HttpError.
We can create a new PR for this as it has a different approach due to (1).
Hey @BurnzZ! Regarding (1) - I was thinking that frameworks would not be raising these exceptions, but web-poet would. It could be a place where allow_status is checked as well.
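To make the idea concrete, here is a minimal sketch of a status check that lives on the web-poet side rather than in the framework. The function name `check_status`, its signature, and the stand-in exception class are assumptions made for illustration; they are not taken from the PR's code.

```python
from typing import Iterable, Optional


class HttpResponseError(Exception):
    """Stand-in for the exception discussed in this thread (hypothetical shape)."""

    def __init__(self, status: int):
        super().__init__(f"Unexpected HTTP status: {status}")
        self.status = status


def check_status(status: int, allow_status: Optional[Iterable[int]] = None) -> None:
    """Raise for 400-5xx statuses unless the caller explicitly allowed them.

    Hypothetical helper: the framework only returns the response, and this
    check runs afterwards on the library side.
    """
    allowed = set(allow_status or ())
    if 400 <= status < 600 and status not in allowed:
        raise HttpResponseError(status)
```

With this split, a framework never needs to know about allow_status: it hands back whatever response it received, and the check decides whether that response becomes an exception.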
I see, I got that wrong. Thanks for the clarification! Will edit my points above.
Regarding exceptions: +1 to have 3 exceptions - a base exception, and 2 exceptions for these conditions.
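A rough sketch of the three-exception layout might look like the following. The class names come from this thread; the constructor signatures and the choice of `Exception` as the root are assumptions for illustration, not the PR's actual code.

```python
from typing import Any, Optional


class HttpError(Exception):
    """Base class, so callers can catch both cases with one except clause."""

    def __init__(self, msg: str = "", request: Optional[Any] = None):
        super().__init__(msg)
        self.request = request


class HttpRequestError(HttpError):
    """No response was received (connection errors and similar)."""


class HttpResponseError(HttpError):
    """A response was received, but its status signals an error."""

    def __init__(
        self,
        msg: str = "",
        request: Optional[Any] = None,
        response: Optional[Any] = None,
    ):
        super().__init__(msg, request=request)
        # Keeping the response on the exception lets callers inspect it later.
        self.response = response
```

Code that only cares whether a request failed can catch HttpError, while code that wants to inspect a failed response can catch HttpResponseError and read its response attribute.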
As for timeouts, the main issue is who sets the timeout value - is it the framework, or is it something which can be controlled in the PO? I think so far we've been side-stepping it, and letting the framework manage it.
But it means that using a specific timeout for a specific request, e.g.
- decreasing it for requests which should be fast, or
- increasing for requests which are expected to take long, or
- setting a timeout if a library doesn't have a timeout
can't be done. I don't think this is something to solve in this PR, but it's something to think about.
Hi @kmike ! I've reorganized the docs to better document all of these things. The main change is to properly distinguish the docs for Page Object developers vs developers that would implement a framework for web-poet. 3ebb297.
The new base class exception named HttpError was also introduced. 470bc6f.
Kindly let me know if there are other specific things we need to explicitly document here that were missed out.
Regarding the timeouts, yeah, that's indeed tricky. It also adds some additional complexity to Page Objects, since aside from knowing how to construct the requests, it would also need to consider (and establish an expectation) how long it would take the response to arrive.
There would be some factors that the Page Objects might not have access to, like the number of requests made to the same domain in the past minute (which could timeout due to rate limiting) or the shifting tide of non-200 responses for a given domain (where the site could throttle down requests as well). Letting a Page Object decide on timeouts based on limited knowledge is difficult.
I think at least for the short-term, we can let the framework handle it.
good points about timeouts, I haven't considered those!
Codecov Report
@@            Coverage Diff            @@
##            master      #38    +/-   ##
=========================================
  Coverage   100.00%  100.00%
=========================================
  Files           14       14
  Lines          289      320    +31
=========================================
+ Hits           289      320    +31
with pytest.raises(HttpResponseError) as excinfo:
    await method(url_or_request)
assert isinstance(excinfo.value.request, HttpRequest)
assert isinstance(excinfo.value.response, HttpResponse)
@kmike could you refresh my memory on why the HttpRequest and HttpResponse classes were set using @attr.define(..., eq=False)? (link to code)
Setting eq=True could allow quick equality comparisons here in the test. For example, assert excinfo.value.request == expected_request.
Yeah, sorry for that, I should have added a comment. I was probably thinking it'd be reflected in the commit history, but then committed everything in a single commit.
My goal was to make these objects hashable, so that they can be used as keys in dictionaries, or put into sets. As per https://www.attrs.org/en/stable/hashing.html recommendations, I was choosing between frozen=True and eq=False. For some reason I was unable to make frozen=True work (e.g. it'd probably require request headers to be an immutable dict), so I decided to use eq=False. Give up equality by value, get hashing support. It's probably possible to have both, but it requires more thought.
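The trade-off can be illustrated with a small sketch. It uses the standard library's `dataclasses` as a stand-in for attrs, on the assumption that the `eq` and `frozen` flags affect hashing the same way in both; the class names below are hypothetical and unrelated to the real HttpRequest.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class IdentityRequest:
    """eq=False: no generated __eq__, so identity-based hashing is kept."""
    url: str
    headers: dict = field(default_factory=dict)


@dataclass  # eq=True by default and not frozen: __hash__ is set to None
class ValueRequest:
    url: str
    headers: dict = field(default_factory=dict)


@dataclass(frozen=True)  # hash is computed from the fields, including the dict
class FrozenRequest:
    url: str
    headers: dict = field(default_factory=dict)


def is_hashable(obj) -> bool:
    try:
        hash(obj)
    except TypeError:
        return False
    return True


a = IdentityRequest("https://example.com")
b = IdentityRequest("https://example.com")

seen = {a, b}            # works: both objects can be put into a set
equal_by_value = a == b  # False: equality by value was given up

value_hashable = is_hashable(ValueRequest("https://example.com"))
frozen_hashable = is_hashable(FrozenRequest("https://example.com"))
```

In this sketch the frozen variant is not hashable either, because its hash is derived from a mutable dict field; this matches the remark above that frozen=True would probably require immutable headers.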
(see :ref:`framework-exception-handling`).

The Downloader should be able to properly resolve **redirections** except when
the method is ``HEAD``. This means that the :class:`~.HttpResponse` that it'll
"except when the method is HEAD"
That's an interesting behavior, a nice touch 👍. It seems right to me. I wonder how easy it would be to implement though. For example, Scrapy seems to redirect in case of HEAD: https://github.com/scrapy/scrapy/blob/b2afcbfe2bf090827540d072866bef0d1ab3a3e8/scrapy/downloadermiddlewares/redirect.py#L102
Thanks for raising this @kmike! I didn't realize that this was the default behavior of Scrapy.
I think the easiest route would be for scrapy-poet to set meta={'dont_redirect': True} when it sees a HEAD request method. Added a note in the TODO section of scrapinghub/scrapy-poet#62 so we wouldn't forget.
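As a rough sketch of that idea, the helper below builds the meta dict to pass along with a Scrapy request. The function name `build_request_meta` is hypothetical, and how scrapy-poet would actually wire this in is left open; only the `dont_redirect` meta key is Scrapy's own.

```python
from typing import Any, Dict


def build_request_meta(method: str) -> Dict[str, Any]:
    """Return request meta that disables redirects for HEAD requests.

    Hypothetical helper: the returned dict is meant to be passed as the
    `meta` argument when constructing a Scrapy request.
    """
    meta: Dict[str, Any] = {}
    if method.upper() == "HEAD":
        # `dont_redirect` is the meta key checked by Scrapy's redirect middleware.
        meta["dont_redirect"] = True
    return meta
```

Other methods get an empty meta dict here, so redirects would still be followed for them by default.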
This adjusts the code after the said PR refactored the project structure. Reference: #37
Co-authored-by: Mikhail Korobov <[email protected]>
Looks good @BurnzZ!
Thanks @BurnzZ!
Built on top of #22.
TODO:
- execute() and batch_execute()
- 400-5xx and having an exclusion list using allow_status param