add documentation
BurnzZ committed Mar 14, 2022
1 parent 74e5c89 commit b02b2dd
Showing 4 changed files with 361 additions and 48 deletions.
278 changes: 278 additions & 0 deletions docs/advanced/additional_requests.rst
@@ -0,0 +1,278 @@
.. _`advanced-requests`:

===================
Additional Requests
===================

Websites nowadays need a lot of page interactions to display or load some key
information. In most cases, these are done via AJAX requests. Some examples are:

* Clicking a button on a page to reveal other similar products.
* Clicking the `"Load More"` button to retrieve more images of a given item.
* Scrolling to the bottom of the page to load more items `(i.e. infinite scrolling)`.
* Hovering that reveals a tool-tip containing additional page info.

As such, performing additional requests inside Page Objects is inevitable when
properly extracting data from some websites.

.. warning::

    Additional requests made inside a Page Object aren't meant to represent
    the **Crawling Logic** at all. They are simply a low-level way to interact
    with today's websites, which rely on a lot of page interactions to display
    their contents.


HttpClient
==========

The main interface for executing additional requests is :class:`~.HttpClient`.
It fully supports :mod:`asyncio`, enabling developers to perform
the additional requests asynchronously.

Let's look at a few quick examples of how it's used.

A simple ``GET`` request
------------------------

.. code-block:: python

    import attr
    import web_poet


    @attr.define
    class ProductPage(web_poet.ItemWebPage):
        http_client: web_poet.HttpClient

        async def to_item(self):
            item = {
                "url": self.url,
                "name": self.css("#main h3.name ::text").get(),
                "product_id": self.css("#product ::attr(product-id)").get(),
            }

            # Simulates clicking on a button that says "View All Images"
            response: web_poet.ResponseData = await self.http_client.get(
                f"https://api.example.com/v2/images?id={item['product_id']}"
            )
            page = web_poet.WebPage(response)
            item["images"] = page.css(".product-images img::attr(src)").getall()
            return item

There are a few things to note in this example:

* A ``GET`` request can be done via :class:`~.HttpClient`'s
  :meth:`~.HttpClient.get` method.
* We're now using the ``async/await`` syntax.
* The response is of type :class:`~.ResponseData`.

  * Though in order to use :meth:`~.ResponseShortcutsMixin.css`
    `(and other shortcut methods)` we'll need to feed it into
    :class:`~.WebPage`.

As the example shows, the additional request lets us extract product images
that would otherwise be unavailable. The website only fetches the complete set
of product images via AJAX after an additional button is clicked, so the Page
Object has to reproduce that request itself.
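
The example above interpolates ``product_id`` directly into the URL. If the ID
may contain characters that are reserved in a query string, it could be encoded
first. A minimal sketch using only the standard library, assuming the same
example endpoint (the helper name ``build_images_url`` is hypothetical and not
part of web-poet):

```python
from urllib.parse import urlencode


def build_images_url(product_id: str) -> str:
    """Build the image-listing URL for a product (hypothetical helper)."""
    # urlencode() percent-escapes characters that are unsafe in a query string.
    query = urlencode({"id": product_id})
    return f"https://api.example.com/v2/images?{query}"


print(build_images_url("A1 B&2"))
```

The resulting string could then be passed to :meth:`~.HttpClient.get` in place
of the f-string.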

A ``POST`` request with `header` and `body`
-------------------------------------------

Let's see another example which needs ``headers`` and ``body`` data to process
additional requests.

In this example, we'll paginate related items in a carousel. These are
usually lazily loaded by the website to reduce the amount of information
rendered in the DOM that might not otherwise be viewed by all users anyway.

Thus, additional requests inside the Page Object are typically needed to retrieve them:

.. code-block:: python

    import json
    from typing import List

    import attr
    import web_poet


    @attr.define
    class ProductPage(web_poet.ItemWebPage):
        http_client: web_poet.HttpClient

        async def to_item(self):
            item = {
                "url": self.url,
                "name": self.css("#main h3.name ::text").get(),
                "product_id": self.css("#product ::attr(product-id)").get(),
                "related_product_ids": self.parse_related_product_ids(self),
            }

            # Simulates "scrolling" through a carousel that loads related product items
            response: web_poet.ResponseData = await self.http_client.post(
                url="https://www.api.example.com/related-products/",
                headers={
                    'Host': 'www.example.com',
                    'Content-Type': 'application/json; charset=UTF-8',
                },
                body=json.dumps(
                    {
                        "Page": 2,
                        "ProductID": item["product_id"],
                    }
                ),
            )
            second_page = web_poet.WebPage(response)
            related_product_ids = self.parse_related_product_ids(second_page)
            # Append to the IDs found on the first page instead of replacing them
            item["related_product_ids"].extend(related_product_ids)
            return item

        @staticmethod
        def parse_related_product_ids(page: web_poet.WebPage) -> List[str]:
            return page.css("#main .related-products ::attr(product-id)").getall()

Here's the key takeaway in this example:

* Similar to :class:`~.HttpClient`'s :meth:`~.HttpClient.get` method,
  a :meth:`~.HttpClient.post` method is also available that's
  typically used to submit forms.
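
The headers and body of such a ``POST`` request can be prepared and checked
separately from the Page Object. The following is only a sketch under that
assumption: the helper ``build_related_products_post`` is hypothetical, and the
field names are taken from the example above.

```python
import json
from typing import Dict, Tuple


def build_related_products_post(product_id: str, page: int) -> Tuple[Dict[str, str], str]:
    """Return (headers, body) for the carousel pagination request (hypothetical helper)."""
    headers = {"Content-Type": "application/json; charset=UTF-8"}
    # The body is a JSON document, matching the declared Content-Type.
    body = json.dumps({"Page": page, "ProductID": product_id})
    return headers, body


headers, body = build_related_products_post("123", page=2)
print(body)
```

Keeping this in a small function makes the payload easy to unit-test without
performing any download.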

Batch requests
--------------

We can also choose to process requests in a **batch** instead of sequentially.
Let's modify the example in the previous section to see how it can be done:

.. code-block:: python

    import json
    from typing import List

    import attr
    import web_poet


    @attr.define
    class ProductPage(web_poet.ItemWebPage):
        http_client: web_poet.HttpClient

        default_pagination_limit = 10

        async def to_item(self):
            item = {
                "url": self.url,
                "name": self.css("#main h3.name ::text").get(),
                "product_id": self.css("#product ::attr(product-id)").get(),
                "related_product_ids": self.parse_related_product_ids(self),
            }

            # Build all the requests first; nothing is downloaded at this point
            requests: List[web_poet.Request] = [
                self.create_request(item["product_id"], page_num=page_num)
                for page_num in range(2, self.default_pagination_limit)
            ]
            responses: List[web_poet.ResponseData] = await self.http_client.batch_requests(*requests)
            pages = map(web_poet.WebPage, responses)
            related_product_ids = [
                product_id
                for page in pages
                for product_id in self.parse_related_product_ids(page)
            ]

            item["related_product_ids"].extend(related_product_ids)
            return item

        def create_request(self, product_id, page_num=2):
            # Simulates "scrolling" through a carousel that loads related product items
            return web_poet.Request(
                url="https://www.api.example.com/product-pagination/",
                method="POST",
                headers={
                    'Host': 'www.example.com',
                    'Content-Type': 'application/json; charset=UTF-8',
                },
                body=json.dumps(
                    {
                        "Page": page_num,
                        "ProductID": product_id,
                    }
                ),
            )

        @staticmethod
        def parse_related_product_ids(page: web_poet.WebPage) -> List[str]:
            return page.css("#main .related-products ::attr(product-id)").getall()

The key takeaways for this example are:

* A :class:`~.Request` can be instantiated to represent a generic HTTP request.
  It only holds the HTTP request information and isn't executed when created.
  This is useful for factory methods that build requests without triggering
  any download.
* :class:`~.HttpClient` has a :meth:`~.HttpClient.batch_requests` method that
  can process a series of :class:`~.Request` instances.

  * Note that it can accept different types of :class:`~.Request` that might
    not be related *(e.g. a mixture of* ``GET`` *and* ``POST`` *requests)*.
    Processing them in a batch takes advantage of async execution.
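
To illustrate why batching benefits from async execution, here is a
standard-library sketch. It is not how :meth:`~.HttpClient.batch_requests` is
necessarily implemented: ``SketchRequest``, ``fake_download`` and ``run_batch``
are hypothetical stand-ins, and the "download" merely sleeps.

```python
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class SketchRequest:
    """Hypothetical request description; creating one sends nothing."""
    url: str
    method: str = "GET"


async def fake_download(request: SketchRequest) -> str:
    """Pretend to download: sleeps instead of touching the network."""
    await asyncio.sleep(0.01)
    return f"{request.method} {request.url}"


async def run_batch(requests: List[SketchRequest]) -> List[str]:
    # gather() runs all downloads concurrently and returns the results
    # in the same order as the inputs.
    return await asyncio.gather(*(fake_download(r) for r in requests))


requests = [
    SketchRequest("https://example.com/a"),
    SketchRequest("https://example.com/b", method="POST"),
]
results = asyncio.run(run_batch(requests))
print(results)
```

Because the waits overlap, the whole batch takes roughly as long as the
slowest request rather than the sum of all of them.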

.. _advanced-downloader-impl:

Downloader Implementation
=========================

On its own, :class:`~.HttpClient` doesn't know how to execute a request.
Frameworks or projects that want to use additional requests in Page Objects
therefore need to supply the implementation that downloads a
:class:`~.Request`.

For more information, see the API specification for :class:`~.HttpClient`.

In any case, frameworks that wish to support **web-poet** could provide the
HTTP downloader implementation in two ways:

.. _setup-contextvars:

1. Context Variable
-------------------

:mod:`asyncio` natively supports :mod:`contextvars` for setting and accessing
context-aware values. A framework using **web-poet** can therefore assign the
implementation through the readily available context variable named
``web_poet.request_backend_var``.

This can be set using:

.. code-block:: python

    import web_poet


    def request_implementation(r: web_poet.Request) -> web_poet.ResponseData:
        ...


    from web_poet import request_backend_var
    request_backend_var.set(request_implementation)

Once this is set, a :class:`~.HttpClient` instance uses the request
implementation by default.

.. warning::

    If no value for ``web_poet.request_backend_var`` is set, a
    :class:`~.RequestBackendError` is raised. However, no exception is
    raised if **option 2** below is used instead.
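
The lookup behaviour described above can be sketched with the standard library
alone. The names below (``backend_var``, ``perform``) are hypothetical and only
mirror the roles of ``web_poet.request_backend_var`` and the client; a plain
``RuntimeError`` stands in for :class:`~.RequestBackendError`.

```python
import contextvars
from typing import Callable

# Hypothetical variable playing the role of ``web_poet.request_backend_var``.
backend_var: contextvars.ContextVar = contextvars.ContextVar("backend")


def perform(url: str) -> str:
    """Look up the configured backend and delegate to it."""
    try:
        backend: Callable[[str], str] = backend_var.get()
    except LookupError:
        # ContextVar.get() raises LookupError when no value was ever set.
        raise RuntimeError("no request backend configured") from None
    return backend(url)


try:
    perform("https://example.com")
except RuntimeError as exc:
    unset_error = str(exc)

# After a backend is set, the same call succeeds.
backend_var.set(lambda url: f"downloaded {url}")
result = perform("https://example.com")
print(unset_error, "|", result)
```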


2. Dependency Injection
-----------------------

The framework using **web-poet** might rely on other libraries that don't
fully support :mod:`contextvars` `(e.g. Twisted)`. In that case, an
alternative approach is to supply the request implementation when creating
an :class:`~.HttpClient` instance:


.. code-block:: python

    import web_poet


    def request_implementation(r: web_poet.Request) -> web_poet.ResponseData:
        ...


    from web_poet import HttpClient
    http_client = HttpClient(request_downloader=request_implementation)
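
The injection pattern itself can be sketched independently of web-poet. In this
hypothetical example, ``SketchClient`` and ``default_downloader`` are
illustrative names only; the point is that a callable passed to the constructor
takes precedence over any fallback.

```python
from typing import Callable, Optional


def default_downloader(url: str) -> str:
    """Hypothetical fallback, used when nothing is injected."""
    return f"default:{url}"


class SketchClient:
    """Hypothetical client that delegates every request to a callable."""

    def __init__(self, request_downloader: Optional[Callable[[str], str]] = None):
        # Prefer the injected callable; otherwise fall back to the default.
        self.request_downloader = request_downloader or default_downloader

    def get(self, url: str) -> str:
        return self.request_downloader(url)


client = SketchClient(request_downloader=lambda url: f"injected:{url}")
print(client.get("https://example.com"))
```

Since the dependency is passed explicitly, this approach does not rely on any
context-local state.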
7 changes: 7 additions & 0 deletions docs/api_reference.rst
@@ -45,3 +45,10 @@ Mixins
.. autoclass:: web_poet.mixins.ResponseShortcutsMixin
   :members:
   :no-special-members:

Requests
========

.. automodule:: web_poet.requests
   :members:
   :undoc-members:
6 changes: 6 additions & 0 deletions docs/index.rst
@@ -34,6 +34,12 @@ and the motivation behind ``web-poet``, start with :ref:`from-ground-up`.
   intro/tutorial
   intro/from-ground-up

.. toctree::
   :caption: Advanced
   :maxdepth: 1

   advanced/additional_requests

.. toctree::
   :caption: Reference
   :maxdepth: 1