Merge pull request #22 from scrapinghub/additional-requests
implementation of additional requests
kmike authored Apr 27, 2022
2 parents 33dbdb5 + 753e6ad commit 025f5b1
Showing 16 changed files with 1,709 additions and 86 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.rst
@@ -5,6 +5,10 @@ Changelog
TBR
------------------

* Added support for performing additional requests using
``web_poet.HttpClient``.
* Introduced ``web_poet.Meta`` to pass arbitrary information
inside a Page Object.
* added a ``PageObjectRegistry`` class which has the ``handle_urls`` decorator
to conveniently declare and collect ``OverrideRule``.
* removed support for Python 3.6
921 changes: 921 additions & 0 deletions docs/advanced/additional-requests.rst

Large diffs are not rendered by default.

134 changes: 134 additions & 0 deletions docs/advanced/meta.rst
@@ -0,0 +1,134 @@
.. _`advanced-meta`:

============================
Passing information via Meta
============================

In some cases, Page Objects might require additional information to be passed to
them. Such information can dictate the behavior of the Page Object or affect its
data entirely depending on the needs of the developer.

As you may recall from the basic tutorials, Page Objects that inherit from
:class:`~.WebPage` or :class:`~.ItemWebPage` require an
:class:`~.HttpResponse`. It holds the HTTP response information that the
Page Object represents.

To standardize how arbitrary information is passed into Page Objects, we use
:class:`~.Meta`, declared as a requirement in the same way as
:class:`~.HttpResponse`:

.. code-block:: python

    import attrs
    import web_poet


    @attrs.define
    class SomePage(web_poet.ItemWebPage):
        # The HttpResponse attribute is inherited from ItemWebPage
        meta: web_poet.Meta


    # Assume that it's constructed with the necessary arguments taken somewhere.
    response = web_poet.HttpResponse(...)

    # It uses Python's dict interface.
    meta = web_poet.Meta({"arbitrary_value": 1234, "cool": True})

    page = SomePage(response=response, meta=meta)

However, as with :class:`~.HttpResponse`, developers using :class:`~.Meta`
shouldn't need to care about how it is passed into Page Objects. That
depends on the framework that uses **web-poet**.
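
To make the idea of framework-provided inputs more concrete, here is a minimal
sketch of how a framework *could* build a Page Object from its declared fields.
Everything in it is a hypothetical stand-in written with the standard library
only: ``Meta`` is assumed to behave like a ``dict``, and ``FakeResponse``,
``SomePage`` and ``build_page`` are illustrative names, not part of
**web-poet** or of any real framework.

```python
from dataclasses import dataclass, fields


class Meta(dict):
    """Stand-in for web_poet.Meta (assumed here to behave like a dict)."""


@dataclass
class FakeResponse:
    """Hypothetical stand-in for web_poet.HttpResponse."""

    url: str
    body: str


@dataclass
class SomePage:
    """Stand-in Page Object that declares its inputs as typed fields."""

    response: FakeResponse
    meta: Meta


def build_page(page_cls, available):
    """Hypothetical framework helper: fill each declared field by its type."""
    kwargs = {field.name: available[field.type] for field in fields(page_cls)}
    return page_cls(**kwargs)


# The "framework" decides where the inputs come from; the page only declares them.
available = {
    FakeResponse: FakeResponse(url="https://example.com", body="<html></html>"),
    Meta: Meta({"arbitrary_value": 1234, "cool": True}),
}
page = build_page(SomePage, available)

print(page.meta["arbitrary_value"])          # dict-style lookup
print(page.meta.get("missing", "fallback"))  # dict-style lookup with a default
```

The Page Object never constructs its own inputs here, which is the point made
above: how ``Meta`` reaches the page is the framework's concern.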

Let's check out some examples of how to use it inside a Page Object.

Controlling item values
-----------------------

.. code-block:: python

    import attrs
    import web_poet


    @attrs.define
    class ProductPage(web_poet.ItemWebPage):
        meta: web_poet.Meta

        default_tax_rate = 0.10

        def to_item(self):
            item = {
                "url": self.url,
                "name": self.css("#main h3.name ::text").get(),
                "price": self.css("#main .price ::text").get(),
            }
            self.calculate_price_with_tax(item)
            return item

        def calculate_price_with_tax(self, item):
            # An instance method (not a staticmethod), since it reads self.meta.
            tax_rate = self.meta.get("tax_rate") or self.default_tax_rate
            # The extracted price is text; this assumes a plain number such as "9.99".
            item["price_with_tax"] = float(item["price"]) * (1 + tax_rate)

In the example above, we provide optional information about the **tax rate**
of the product. This is useful for supporting different tax rates for each
state or territory. Since **tax_rate** is optional, the Page Object also
defines ``default_tax_rate`` as a fallback value for when it's not available.
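
One caveat of the ``or`` fallback is worth spelling out: a tax rate of ``0`` is
falsy, so it would be replaced by the default. The sketch below uses a plain
``dict`` as a stand-in for :class:`~.Meta` (assuming only the dict interface
described earlier) and two hypothetical helper functions to compare ``or``
with passing a default to ``dict.get``.

```python
DEFAULT_TAX_RATE = 0.10


def tax_rate_with_or(meta):
    # Falls back whenever the stored value is falsy: missing, None, or 0.
    return meta.get("tax_rate") or DEFAULT_TAX_RATE


def tax_rate_with_default(meta):
    # Falls back only when the key is missing entirely.
    return meta.get("tax_rate", DEFAULT_TAX_RATE)


for meta in ({}, {"tax_rate": 0.2}, {"tax_rate": 0}):
    print(meta, tax_rate_with_or(meta), tax_rate_with_default(meta))
```

Which behavior is preferable depends on whether a zero rate is a meaningful
value for the data being extracted.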


Controlling Page Object behavior
--------------------------------

Let's try an example in which :class:`~.Meta` controls how
:ref:`advanced-requests` are used. Specifically, we are going to use
:class:`~.Meta` to control the number of pagination requests being made.

.. code-block:: python

    from typing import List

    import attrs
    import web_poet


    @attrs.define
    class ProductPage(web_poet.ItemWebPage):
        http_client: web_poet.HttpClient
        meta: web_poet.Meta

        default_max_pages = 5

        async def to_item(self):
            return {"product_urls": await self.get_product_urls()}

        async def get_product_urls(self) -> List[str]:
            # Simulates scrolling to the bottom of the page to load the next
            # set of items in an "Infinite Scrolling" category list page.
            max_pages = self.meta.get("max_pages") or self.default_max_pages
            requests = [
                self.create_next_page_request(page_num)
                for page_num in range(2, max_pages + 1)
            ]
            # The client is an attribute of the Page Object, hence "self.".
            responses = await self.http_client.batch_execute(*requests)
            # parse_product_urls() returns a list of URLs per response;
            # flatten them into a single list.
            return [
                url
                for response in responses
                for url in self.parse_product_urls(response)
            ]

        @staticmethod
        def create_next_page_request(page_num):
            next_page_url = f"https://example.com/category/products?page={page_num}"
            return web_poet.Request(url=next_page_url)

        @staticmethod
        def parse_product_urls(response: web_poet.HttpResponse):
            return response.css("#main .products a.link ::attr(href)").getall()

From the example above, we can see how :class:`~.Meta` is able to arbitrarily
limit the pagination behavior by passing an optional **max_pages** info. Take
note that a ``default_max_pages`` value is also present in the Page Object in
case the :class:`~.Meta` instance did not provide it.
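
The pagination logic can be tried in isolation with the standard library
alone. The following is a rough sketch under stated assumptions:
``FakeHttpClient`` is a hypothetical stand-in that performs no network I/O,
its ``batch_execute`` is emulated with ``asyncio.gather`` and takes plain URL
strings rather than request objects, and a plain ``dict`` plays the role of
:class:`~.Meta`. It is not how **web-poet** implements its client.

```python
import asyncio
from typing import Dict, List

DEFAULT_MAX_PAGES = 5


class FakeHttpClient:
    """Hypothetical stand-in for an HTTP client; no real requests are sent."""

    async def get(self, url: str) -> str:
        await asyncio.sleep(0)  # yield control, as real I/O would
        return f"response for {url}"

    async def batch_execute(self, *urls: str) -> List[str]:
        # Run all requests concurrently; results keep the input order.
        return list(await asyncio.gather(*(self.get(url) for url in urls)))


def next_page_urls(meta: Dict[str, int]) -> List[str]:
    # Page 1 is the page already loaded, so extra requests start at page 2.
    max_pages = meta.get("max_pages") or DEFAULT_MAX_PAGES
    return [
        f"https://example.com/category/products?page={page_num}"
        for page_num in range(2, max_pages + 1)
    ]


async def main(meta: Dict[str, int]) -> List[str]:
    client = FakeHttpClient()
    return await client.batch_execute(*next_page_urls(meta))


responses = asyncio.run(main({"max_pages": 3}))
print(len(responses), "additional responses")
```

With ``max_pages`` set to 3, only pages 2 and 3 are requested; with an empty
mapping, the default of 5 yields pages 2 through 5.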
19 changes: 17 additions & 2 deletions docs/api_reference.rst
@@ -8,8 +8,8 @@ Page Inputs
===========

.. automodule:: web_poet.page_inputs
:members:
:undoc-members:
:members:
:undoc-members:

Pages
=====
@@ -48,6 +48,21 @@ Mixins
:members:
:no-special-members:

Requests
========

.. automodule:: web_poet.requests
:members:
:undoc-members:

Exceptions
==========

.. automodule:: web_poet.exceptions.core
:members:

.. automodule:: web_poet.exceptions.http
:members:

.. _`api-overrides`:

1 change: 1 addition & 0 deletions docs/conf.py
@@ -194,4 +194,5 @@
'scrapy': ('https://docs.scrapy.org/en/latest', None, ),
'url-matcher': ('https://url-matcher.readthedocs.io/en/stable/', None, ),
'parsel': ('https://parsel.readthedocs.io/en/latest/', None, ),
'multidict': ('https://multidict.readthedocs.io/en/latest/', None, ),
}
7 changes: 7 additions & 0 deletions docs/index.rst
@@ -35,6 +35,13 @@ and the motivation behind ``web-poet``, start with :ref:`from-ground-up`.
intro/from-ground-up
intro/overrides

.. toctree::
:caption: Advanced
:maxdepth: 1

advanced/additional-requests
advanced/meta

.. toctree::
:caption: Reference
:maxdepth: 1