implementation of additional requests #22

BurnzZ · 2022-02-08T06:52:47Z

Progress

Testing across a wide variety of scenarios/cases
- contextvars and Twisted
- scrapy-poet and scrapy integration
Docs
Tests
Changelog

codecov · 2022-02-08T06:53:24Z

Codecov Report

Merging #22 (753e6ad) into master (a29d86d) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master       #22   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            6        11    +5     
  Lines          224       285   +61     
=========================================
+ Hits           224       285   +61

Impacted Files	Coverage Δ
web_poet/__init__.py	`100.00% <100.00%> (ø)`
web_poet/_base.py	`100.00% <100.00%> (ø)`
web_poet/exceptions/__init__.py	`100.00% <100.00%> (ø)`
web_poet/exceptions/core.py	`100.00% <100.00%> (ø)`
web_poet/exceptions/http.py	`100.00% <100.00%> (ø)`
web_poet/page_inputs.py	`100.00% <100.00%> (ø)`
web_poet/requests.py	`100.00% <100.00%> (ø)`

kmike · 2022-02-17T10:32:23Z

web_poet/requests.py

+    url: str
+    method: str = "GET"
+    headers: Optional[mapping] = None
+    body: Optional[str] = None


I think it should be bytes, not str

This could be split into raw_data: bytes and html: str in a separate HttpResponseBody, as per our recent discussions.

I've opened a separate PR, #24, for a deeper discussion of this.

We addressed it for Response; now we need to fix it for the Request in some way (which could actually be different).

Addressed this in 3a76bc3
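
A minimal sketch of the bytes-versus-str point in this thread. `SketchRequest` and `with_text_body` are hypothetical names used only for illustration; they are not the classes in web_poet/requests.py. The idea is that the body is stored as raw bytes and any text payload is encoded explicitly before it is attached.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class SketchRequest:
    """Hypothetical stand-in for a request object; not web-poet's actual class."""

    url: str
    method: str = "GET"
    headers: Optional[Mapping[str, str]] = None
    body: Optional[bytes] = None  # raw bytes, as suggested in the review


def with_text_body(url: str, text: str, encoding: str = "utf-8") -> SketchRequest:
    """Build a POST request, encoding the text payload explicitly."""
    return SketchRequest(url=url, method="POST", body=text.encode(encoding))


request = with_text_body("https://example.com/api", '{"q": "café"}')
print(type(request.body).__name__)
```

Keeping the encoding step at the call site means the request object never has to guess which encoding a str body was meant to use.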

web_poet/requests.py

docs/advanced/additional-requests.rst

kmike · 2022-04-12T07:56:45Z

docs/advanced/additional-requests.rst

+implementation** from a given framework is injected to it.
+
+
+Exception Handling


I think we should describe error handling before describing backends. Just give some examples of how to handle errors for:

- individual requests (e.g. get or post)
- batch requests: what are the semantics, is it possible to get partial data, etc.

There is no need to move the whole section above. I think our docs could have 2 main parts, aimed at different types of users:

1. People who write Page Objects and need additional requests.
2. People who are implementing web-poet integrations. Information about backends, etc. is mostly for this group.

I think it'd be good to make it clear what to do for (1) first, and then provide information for (2).

Gotcha, that makes sense. Thanks for raising it!

I moved the Exception Handling section right after the discussion of how to make additional requests, so it's a smoother transition for users developing POs (as you've pointed out). I've also written down more examples of exception handling in 7947d58.

Regarding batch_requests(), thanks for pointing out the case of retrieving partial data, as I hadn't considered it before. The behavior is now updated in 5d2a751.
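
A minimal sketch of handling an error for an individual request at the call site, which is the first case raised in this thread. `SketchRequestError`, `fake_get`, and `fetch_or_none` are hypothetical stand-ins and not part of web-poet; the downloader is faked so that the example runs on its own.

```python
import asyncio


class SketchRequestError(Exception):
    """Hypothetical error type standing in for a failed additional request."""


async def fake_get(url: str) -> str:
    """Hypothetical downloader: fails for URLs containing 'broken'."""
    await asyncio.sleep(0)
    if "broken" in url:
        raise SketchRequestError(f"could not fetch {url}")
    return f"<html>{url}</html>"


async def fetch_or_none(url: str):
    """Handle the error where the request is made and fall back to None."""
    try:
        return await fake_get(url)
    except SketchRequestError:
        return None


ok = asyncio.run(fetch_or_none("https://example.com/a"))
failed = asyncio.run(fetch_or_none("https://example.com/broken"))
print(ok, failed)
```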

docs/advanced/additional-requests.rst

tests/test_page_inputs.py

kmike · 2022-04-18T09:51:18Z

web_poet/requests.py

+        """
+
+        coroutines = [self.request_downloader(r) for r in requests]
+        responses = await asyncio.gather(*coroutines, return_exceptions=True)


What do you think about exposing the return_exceptions option? It could make sense to keep the interface similar to asyncio.gather. I also wonder if we should keep False as the default; it may be more intuitive. Otherwise, the user would always need to add some code to the PO itself to handle the exceptions; with return_exceptions=False an error would just be raised and propagated, which may be the right behavior in many cases.

Exposing return_exceptions would be a good idea, as it could save the user some additional effort in manually tweaking the behavior. 👍 Updated this in bc4baea

Regarding having its default value set to False: I do agree that it would be consistent with the default of the underlying asyncio.gather() API.

However, I'm not quite sure we'd be encouraging the right behavior for the user. For our use case, executing HTTP requests costs more than most other coroutine operations run under asyncio. As such, we'd be interested in encouraging the behavior of salvaging any usable HTTP responses that the user can. Letting the exceptions propagate when return_exceptions=False would lose the usable responses altogether.

Moreover, it doesn't allow for future possibilities like retrying a specific failed request from a batch_request(), since the developer doesn't know which request had the issue.

Let me know what you think. :)

Hi @kmike , as discussed offline last week, we'll be having the default value of return_exceptions set to False.

This leads to a more meaningful exception message that explicitly logs HttpRequestError, as opposed to something like "css method not found in ..." when the developer has not handled the returned exception. Raising exceptions by default makes the developer more aware of how to handle them.

Updated in 80f95a8.
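
A minimal sketch of the two behaviors discussed in this thread, using asyncio.gather() directly. `SketchRequestError`, `fake_download`, and `batch` are hypothetical stand-ins; they are not web-poet's HttpClient or its exception classes.

```python
import asyncio


class SketchRequestError(Exception):
    """Hypothetical error type; stands in for a request failure."""


async def fake_download(url: str) -> str:
    """Hypothetical downloader: fails for URLs containing 'broken'."""
    await asyncio.sleep(0)
    if "broken" in url:
        raise SketchRequestError(url)
    return f"body of {url}"


async def batch(urls, return_exceptions=False):
    """Mirror the gather() call quoted in the diff above."""
    coroutines = [fake_download(u) for u in urls]
    return await asyncio.gather(*coroutines, return_exceptions=return_exceptions)


urls = [
    "https://example.com/1",
    "https://example.com/broken",
    "https://example.com/3",
]

# return_exceptions=True: results keep request order; failures appear in place,
# so the failed requests can be identified (and, for example, retried).
results = asyncio.run(batch(urls, return_exceptions=True))
failed_urls = [u for u, r in zip(urls, results) if isinstance(r, Exception)]

# return_exceptions=False: the first failure propagates to the caller.
try:
    asyncio.run(batch(urls))
    raised = None
except SketchRequestError as exc:
    raised = exc
```

With return_exceptions=True, code that treats `results[1]` as a response and calls a response method on it fails with an AttributeError, which is the less meaningful message mentioned above.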

web_poet/requests.py

web_poet/page_inputs.py

kmike · 2022-04-18T10:44:19Z

web_poet/__init__.py

-from .page_inputs import HttpResponse, HttpResponseBody, HttpResponseHeaders
+from .requests import (
+    request_backend_var,
+    HttpClient,


What's interesting is that HttpClient can also be a PO dependency :) I wonder if we should move away from having a page_inputs module and just split it, e.g. into web_poet.http or something like this.

This is a good thought to ponder. I think we have 3 main choices for this:

1. Move HttpClient from requests.py to page_inputs.py
2. Create a page_inputs/ subpackage which houses page_inputs/meta.py and page_inputs/http.py (HttpClient is here)
3. Disband page_inputs.py and simply have modules like meta.py and http.py (HttpClient is here)

I'm torn between 1 and 2 since having some sort of subpackage/module named page_inputs helps portray the idea that anything within it could be a PO dependency.

What do you think? Perhaps there might be other good options aside from these that I might've missed?

I'm also not sure :) Let's handle it separately.

Sure. I've filed a draft in #37 which explores option 2 to see what it might look like.

docs/advanced/additional-requests.rst

BurnzZ · 2022-04-25T14:19:03Z

Hello @kmike, Thanks for reviewing this PR again last week. 🙌

Aside from these large changes:

- update the return_exceptions default from True to False (80f95a8)
- expose execute() for single requests and rename batch_requests() to batch_execute() (5d82a37)

there are small doc-related changes in 753e6ad, which include (but are not limited to):

- fixing some incorrect code samples (they couldn't run due to syntax errors)
- emphasizing that order is preserved for batch_execute()
- mentioning that batch_execute() is merely a shortcut for asyncio.gather(), and that other asyncio functions like asyncio.as_completed() can be used depending on the need
- describing what to expect for redirections

Lastly, I've created another PR in #38 to explore the default behavior of raising exceptions when receiving 400-5xx responses.
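
A minimal sketch of the ordering difference between asyncio.gather() and asyncio.as_completed() mentioned in the list above. `fake_download` and the job names are hypothetical; the delays are chosen only to make the completion order visible.

```python
import asyncio


async def fake_download(url: str, delay: float) -> str:
    """Hypothetical downloader that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return url


async def main():
    jobs = [("slow", 0.1), ("fast", 0.0), ("medium", 0.05)]

    # gather(): results come back in the order the awaitables were passed in.
    gathered = await asyncio.gather(*(fake_download(u, d) for u, d in jobs))

    # as_completed(): results are yielded as each awaitable finishes.
    completed = []
    for future in asyncio.as_completed([fake_download(u, d) for u, d in jobs]):
        completed.append(await future)
    return list(gathered), completed


gathered, completed = asyncio.run(main())
print(gathered, completed)
```

gather() suits the case where each result must be matched to its request by position; as_completed() suits the case where results should be processed as soon as they arrive.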

kmike · 2022-04-27T17:35:37Z

Thanks so much @BurnzZ! I've checked the PR again, it looks good 💪

We might think about re-organizing docs a bit, but the content is there; this can be done later.

BurnzZ · 2022-04-28T04:03:58Z

Thanks for your patience in reviewing this one @kmike ! 🙏

add basic implementation of additional requests

012a199

BurnzZ self-assigned this Feb 8, 2022

BurnzZ mentioned this pull request Feb 8, 2022

integration for web-poet's support on additional requests and Meta scrapinghub/scrapy-poet#62

Merged

11 tasks

BurnzZ added 3 commits February 8, 2022 17:20

introduce concepts of GenericRequest and HttpClient

6d38fd0

prevent HttpClient from being keyword arg in ItemWebPage

cf7de8d

revert changes to page.py

391a449

BurnzZ force-pushed the additional-requests branch from 6507672 to 391a449 Compare February 8, 2022 10:28

BurnzZ added 3 commits February 8, 2022 19:01

assign perform_request as the default downloader for HttpClient

f3bfa95

add status and headers to ResponseData

e0f247b

fix mypy errors on ResponseData.headers

feff29d

kmike reviewed Feb 17, 2022

web_poet/requests.py

BurnzZ added 2 commits February 18, 2022 13:42

support multiple inputs in HttpClient.request

15d0525

add tests for additional request components

c5537ce

BurnzZ force-pushed the additional-requests branch from 2e973aa to c5537ce Compare February 18, 2022 05:57

BurnzZ mentioned this pull request Feb 18, 2022

Introduce Meta as a way to pass information inside a PO #23

Merged

3 tasks

refactor HttpClient's request interface

74e5c89

BurnzZ force-pushed the additional-requests branch from 326c9ed to 74e5c89 Compare February 18, 2022 06:36

BurnzZ mentioned this pull request Feb 25, 2022

add core exceptions #26

Draft

3 tasks

BurnzZ added 2 commits March 14, 2022 20:08

add docs for additional requests

f99a6e1

update CHANGELOG about HttpClient support for additional requests

9025660

BurnzZ force-pushed the additional-requests branch from b02b2dd to 9025660 Compare March 14, 2022 12:08

BurnzZ marked this pull request as ready for review March 14, 2022 12:08

BurnzZ added 6 commits March 14, 2022 20:13

fix failing test on refactored _perform_request()

4e803a4

fix incorrect type annotation for HttpClient.batch_requests()

355294e

introduce a concept of Meta to pass things around inside Page Objects

1d29328

add __repr__ to Meta

706f698

add tests for Meta

9ae60e6

refactor Meta to subclass from a dict with added value restrictions and key requirements

658d1b7

BurnzZ requested a review from kmike April 11, 2022 08:53

BurnzZ added 2 commits April 11, 2022 17:42

make init params of HttpRequest/HttpResponse kw-only except for url

d88a0e5

reverted change on HttpResponse.body being keyword-only

cf6e663

kmike reviewed Apr 12, 2022

docs/advanced/additional-requests.rst

kmike reviewed Apr 12, 2022

docs/advanced/additional-requests.rst

kmike reviewed Apr 12, 2022

tests/test_page_inputs.py

BurnzZ added 3 commits April 13, 2022 15:58

update batch_requests() to return errors in failed requests

5d2a751

improve docs to add more examples and fix existing ones

7947d58

update test to check out positional args in HttpResponse's body

e7b4f08

BurnzZ requested a review from kmike April 13, 2022 08:54

kmike reviewed Apr 18, 2022

web_poet/requests.py

kmike reviewed Apr 18, 2022

web_poet/requests.py

kmike reviewed Apr 18, 2022

web_poet/page_inputs.py

kmike reviewed Apr 18, 2022

BurnzZ added 2 commits April 19, 2022 12:49

expose 'return_exceptions' param to HttpClient.batch_requests()

bc4baea

make Headers and Body type variables private only to requests.py

9f43187

kmike reviewed Apr 22, 2022

docs/advanced/additional-requests.rst

BurnzZ added 3 commits April 25, 2022 13:31

update batch_requests default param value of return_exceptions from True to False

80f95a8

expose execute() in HttpClient and rename batch_requests() into batch_execute()

5d82a37

improve additional request docs

753e6ad

This was referenced Apr 25, 2022

reorganize page_inputs.py as a submodule; move HttpClient to it #37

Merged

Raise exceptions when receiving 400-5xx responses #38

Merged

BurnzZ requested a review from kmike April 25, 2022 14:19

kmike merged commit 025f5b1 into master Apr 27, 2022

BurnzZ deleted the additional-requests branch April 28, 2022 04:03
