implementation of additional requests #22

Merged
merged 49 commits into from
Apr 27, 2022
012a199
add basic implementation of additional requests
BurnzZ Feb 8, 2022
6d38fd0
introduce concepts of GenericRequest and HttpClient
BurnzZ Feb 8, 2022
cf7de8d
prevent HttpClient from being keyword arg in ItemWebPage
BurnzZ Feb 8, 2022
391a449
revert changes to page.py
BurnzZ Feb 8, 2022
f3bfa95
assign perform_request as the default downloader for HttpClient
BurnzZ Feb 8, 2022
e0f247b
add status and headers to ResponseData
BurnzZ Feb 9, 2022
feff29d
fix mypy errors on ResponseData.headers
BurnzZ Feb 9, 2022
15d0525
support multiple inputs in HttpClient.request
BurnzZ Feb 11, 2022
c5537ce
add tests for additional request components
BurnzZ Feb 18, 2022
74e5c89
refactor HttpClient's request interface
BurnzZ Feb 18, 2022
f99a6e1
add docs for additional requests
BurnzZ Mar 14, 2022
9025660
update CHANGELOG about HttpClient support for additional requests
BurnzZ Mar 14, 2022
4e803a4
fix failing test on refactored _perform_request()
BurnzZ Mar 14, 2022
355294e
fix incorrect type annotation for HttpClient.batch_requests()
BurnzZ Mar 14, 2022
1d29328
introduce a concept of Meta to pass things around inside Page Objects
BurnzZ Feb 9, 2022
706f698
add __repr__ to Meta
BurnzZ Feb 10, 2022
9ae60e6
add tests for Meta
BurnzZ Feb 18, 2022
658d1b7
refactor Meta to subclass from a dict with added value restrictions a…
BurnzZ Feb 17, 2022
1e038fe
remove feature to specify required data in Meta
BurnzZ Mar 14, 2022
9d4940d
rename is_restricted_value() into enforce_value_restriction() in Meta
BurnzZ Mar 14, 2022
1caec0d
add doc for Meta
BurnzZ Mar 14, 2022
a404189
update CHANGELOG for Meta
BurnzZ Mar 14, 2022
a8d9530
remove value restriction of Meta
BurnzZ Mar 21, 2022
079ccc1
remove unused import
BurnzZ Mar 21, 2022
396ab8e
Merge pull request #23 from scrapinghub/meta
kmike Mar 21, 2022
f769147
Merge branch 'master' into additional-requests
BurnzZ Mar 28, 2022
83839cf
update doc filenames to be consistent in naming format
BurnzZ Apr 4, 2022
3a76bc3
refactor Request to become HttpRequest
BurnzZ Apr 4, 2022
a853d16
update docs to have a dedicated section for HttpRequest
BurnzZ Apr 5, 2022
f0a3419
general doc improvements to additional requests and meta
BurnzZ Apr 5, 2022
e74ca50
fix code examples in docs for HttpClient and Meta
BurnzZ Apr 5, 2022
109a670
remove the alternative constructor from_anystr() from HttpRequestBody
BurnzZ Apr 5, 2022
eda3352
formalize exceptions and handling them
BurnzZ Apr 6, 2022
65badca
refactor to create base classes for HttpRequest and HttpBody
BurnzZ Apr 11, 2022
99e9b41
move HttpRequest* classes from into the page_inputs.py module
BurnzZ Apr 11, 2022
6a53b90
use keyword args for HttpRequest in HttpClient
BurnzZ Apr 11, 2022
c1f7675
remove return type annotation on HttpResponseBody.json()
BurnzZ Apr 11, 2022
3310c1b
remove proposed _HttpBody
BurnzZ Apr 11, 2022
29f84c6
Merge branch 'master' into additional-requests
BurnzZ Apr 11, 2022
d88a0e5
make init params of HttpRequest/HttpResponse kw-only except for url
BurnzZ Apr 11, 2022
cf6e663
reverted change on HttpResponse.body being keyword-only
BurnzZ Apr 11, 2022
5d2a751
update batch_requests() to return errors in failed requests
BurnzZ Apr 13, 2022
7947d58
improve docs to add more examples and fix existing ones
BurnzZ Apr 13, 2022
e7b4f08
update test to check out positional args in HttpResponse's body
BurnzZ Apr 13, 2022
bc4baea
expose 'return_exceptions' param to HttpClient.batch_requests()
BurnzZ Apr 19, 2022
9f43187
make Headers and Body type variables private only to requests.py
BurnzZ Apr 19, 2022
80f95a8
update batch_requests default param value of return_exceptions from T…
BurnzZ Apr 25, 2022
5d82a37
expose execute() in HttpClient and rename batch_requests() into batch…
BurnzZ Apr 25, 2022
753e6ad
improve additional request docs
BurnzZ Apr 25, 2022
4 changes: 4 additions & 0 deletions CHANGELOG.rst
@@ -5,6 +5,10 @@ Changelog
TBR
------------------

* Added support for performing additional requests using
``web_poet.HttpClient``.
* Introduced ``web_poet.Meta`` to pass arbitrary information
inside a Page Object.
* added a ``PageObjectRegistry`` class which has the ``handle_urls`` decorator
to conveniently declare and collect ``OverrideRule``.
* removed support for Python 3.6
921 changes: 921 additions & 0 deletions docs/advanced/additional-requests.rst

Large diffs are not rendered by default.

134 changes: 134 additions & 0 deletions docs/advanced/meta.rst
@@ -0,0 +1,134 @@
.. _`advanced-meta`:

============================
Passing information via Meta
============================

In some cases, Page Objects might require additional information to be passed to
them. Such information can dictate the behavior of the Page Object or affect its
data entirely depending on the needs of the developer.

Recall from the basic tutorials that Page Objects inheriting from
:class:`~.WebPage` or :class:`~.ItemWebPage` require an :class:`~.HttpResponse`,
which holds the HTTP response that the Page Object represents.

To standardize how arbitrary information is passed into Page Objects, declare
:class:`~.Meta` as a requirement, similar to how :class:`~.HttpResponse` is
required to instantiate Page Objects:

.. code-block:: python

    import attrs
    import web_poet

    @attrs.define
    class SomePage(web_poet.ItemWebPage):
        # The HttpResponse attribute is inherited from ItemWebPage
        meta: web_poet.Meta

    # Assume that it's constructed with the necessary arguments obtained elsewhere.
    response = web_poet.HttpResponse(...)

    # It uses Python's dict interface.
    meta = web_poet.Meta({"arbitrary_value": 1234, "cool": True})

    page = SomePage(response=response, meta=meta)

However, as with :class:`~.HttpResponse`, developers using :class:`~.Meta`
shouldn't need to care about how it is passed into Page Objects. That depends
on the framework that uses **web-poet**.

Let's check out some examples of how to use it inside a Page Object.
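
As a rough illustration of what such a framework might do, the sketch below
wraps user-supplied parameters and injects them when building a Page Object.
It does not use **web-poet** itself: ``Meta``, ``SomePage`` and ``build_page``
are hypothetical stand-ins, with ``Meta`` modeled as a plain ``dict`` subclass.

```python
from dataclasses import dataclass
from typing import Any, Dict


class Meta(dict):
    """Hypothetical stand-in for web_poet.Meta: a plain dict subclass."""


@dataclass
class SomePage:
    """Hypothetical page object that declares its requirements as fields."""

    response: str
    meta: Meta


def build_page(page_cls, response: str, user_params: Dict[str, Any]):
    """What a framework might do: wrap the user input in Meta and inject it."""
    return page_cls(response=response, meta=Meta(user_params))


page = build_page(SomePage, "<html></html>", {"arbitrary_value": 1234})
print(page.meta["arbitrary_value"])
```

The Page Object only declares that it needs ``meta``; where the values come
from is left entirely to the caller.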

Controlling item values
-----------------------

.. code-block:: python

    import attrs
    import web_poet


    @attrs.define
    class ProductPage(web_poet.ItemWebPage):
        meta: web_poet.Meta

        default_tax_rate = 0.10

        def to_item(self):
            item = {
                "url": self.url,
                "name": self.css("#main h3.name ::text").get(),
                "price": self.css("#main .price ::text").get(),
            }
            self.calculate_price_with_tax(item)
            return item

        def calculate_price_with_tax(self, item):
            # Needs access to self.meta, so this is a regular method.
            tax_rate = self.meta.get("tax_rate") or self.default_tax_rate
            # The extracted price is text; this assumes a plain number like "9.99".
            item["price_with_tax"] = float(item["price"]) * (1 + tax_rate)


In the example above, we provide optional information about the **tax rate**
of the product. This could be useful when supporting different tax rates for
each state or territory. However, since we're treating **tax_rate** as optional
information, notice that we also keep a ``default_tax_rate`` as a fallback
value in case it's not available.
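
The fallback logic can be tried out in isolation. The sketch below is
self-contained and does not import **web-poet**: ``Meta`` is a hypothetical
``dict`` subclass standing in for the real class, and ``price_with_tax`` and
``DEFAULT_TAX_RATE`` are illustrative names. It assumes the price text is a
plain number.

```python
class Meta(dict):
    """Hypothetical stand-in for web_poet.Meta: a plain dict subclass."""


DEFAULT_TAX_RATE = 0.10


def price_with_tax(price_text: str, meta: Meta) -> float:
    # dict.get() with a default only falls back when the key is absent,
    # so an explicit rate of 0 is respected.
    tax_rate = meta.get("tax_rate", DEFAULT_TAX_RATE)
    return round(float(price_text) * (1 + tax_rate), 2)


default_price = price_with_tax("100.00", Meta())
custom_price = price_with_tax("100.00", Meta({"tax_rate": 0.2}))
tax_free_price = price_with_tax("100.00", Meta({"tax_rate": 0}))
print(default_price, custom_price, tax_free_price)
```

One subtlety worth noting: a fallback written as
``meta.get("tax_rate") or DEFAULT_TAX_RATE`` treats a rate of ``0`` as missing,
whereas ``meta.get("tax_rate", DEFAULT_TAX_RATE)`` keeps it. Which behavior is
appropriate depends on whether a zero rate is meaningful for your data.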


Controlling Page Object behavior
--------------------------------

Let's try an example in which :class:`~.Meta` controls how
:ref:`advanced-requests` are used. Specifically, we'll use :class:`~.Meta` to
limit the number of pagination requests being made.

.. code-block:: python

    from typing import List

    import attrs
    import web_poet


    @attrs.define
    class ProductPage(web_poet.ItemWebPage):
        http_client: web_poet.HttpClient
        meta: web_poet.Meta

        default_max_pages = 5

        async def to_item(self):
            return {"product_urls": await self.get_product_urls()}

        async def get_product_urls(self) -> List[str]:
            # Simulates scrolling to the bottom of the page to load the next
            # set of items in an "Infinite Scrolling" category list page.
            max_pages = self.meta.get("max_pages") or self.default_max_pages
            requests = [
                self.create_next_page_request(page_num)
                for page_num in range(2, max_pages + 1)
            ]
            responses = await self.http_client.batch_execute(*requests)
            # Flatten the per-response URL lists into a single list.
            return [
                url
                for response in responses
                for url in self.parse_product_urls(response)
            ]

        @staticmethod
        def create_next_page_request(page_num):
            next_page_url = f"https://example.com/category/products?page={page_num}"
            return web_poet.HttpRequest(url=next_page_url)

        @staticmethod
        def parse_product_urls(response: web_poet.HttpResponse):
            return response.css("#main .products a.link ::attr(href)").getall()

In the example above, :class:`~.Meta` limits the pagination behavior through an
optional **max_pages** value. Note that a ``default_max_pages`` value is also
present in the Page Object in case the :class:`~.Meta` instance does not
provide one.
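
To see which pages end up being requested for a given **max_pages**, the URL
generation can be sketched on its own, without any HTTP client. This is a
minimal illustration rather than part of **web-poet**: ``Meta`` is a
hypothetical ``dict`` subclass, and ``next_page_urls`` and
``DEFAULT_MAX_PAGES`` are assumed names.

```python
from typing import List


class Meta(dict):
    """Hypothetical stand-in for web_poet.Meta: a plain dict subclass."""


DEFAULT_MAX_PAGES = 5
BASE_URL = "https://example.com/category/products"


def next_page_urls(meta: Meta) -> List[str]:
    # Page 1 is the response the Page Object already has, so the
    # additional requests start from page 2.
    max_pages = meta.get("max_pages") or DEFAULT_MAX_PAGES
    return [f"{BASE_URL}?page={num}" for num in range(2, max_pages + 1)]


default_urls = next_page_urls(Meta())
limited_urls = next_page_urls(Meta({"max_pages": 3}))
single_page_urls = next_page_urls(Meta({"max_pages": 1}))
print(len(default_urls), limited_urls, single_page_urls)
```

With ``max_pages`` set to ``1``, the range is empty and no additional requests
would be made.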
19 changes: 17 additions & 2 deletions docs/api_reference.rst
@@ -8,8 +8,8 @@ Page Inputs
===========

.. automodule:: web_poet.page_inputs
:members:
:undoc-members:
:members:
:undoc-members:

Pages
=====
@@ -48,6 +48,21 @@ Mixins
:members:
:no-special-members:

Requests
========

.. automodule:: web_poet.requests
:members:
:undoc-members:

Exceptions
==========

.. automodule:: web_poet.exceptions.core
:members:

.. automodule:: web_poet.exceptions.http
:members:

.. _`api-overrides`:

1 change: 1 addition & 0 deletions docs/conf.py
@@ -194,4 +194,5 @@
'scrapy': ('https://docs.scrapy.org/en/latest', None, ),
'url-matcher': ('https://url-matcher.readthedocs.io/en/stable/', None, ),
'parsel': ('https://parsel.readthedocs.io/en/latest/', None, ),
'multidict': ('https://multidict.readthedocs.io/en/latest/', None, ),
}
7 changes: 7 additions & 0 deletions docs/index.rst
@@ -35,6 +35,13 @@ and the motivation behind ``web-poet``, start with :ref:`from-ground-up`.
intro/from-ground-up
intro/overrides

.. toctree::
:caption: Advanced
:maxdepth: 1

advanced/additional-requests
advanced/meta

.. toctree::
:caption: Reference
:maxdepth: 1