New ItemPage #74

Merged
merged 26 commits into from
Aug 25, 2022
Changes from 15 commits
2 changes: 1 addition & 1 deletion .github/workflows/test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,7 @@ jobs:
fail-fast: false
matrix:
python-version: ['3.10']
tox-job: ["mypy", "docs", "linters"]
tox-job: ["mypy", "docs", "linters", "types"]

steps:
- uses: actions/checkout@v2
Expand Down
174 changes: 71 additions & 103 deletions docs/advanced/fields.rst
Expand Up @@ -47,23 +47,15 @@ This approach has 2 main advantages:
However, writing and maintaining ``to_item()`` method can get tedious,
especially if there is a lot of properties.

web_poet.fields
---------------

@field decorator
----------------
To aid writing Page Objects in this style, ``web-poet`` provides
a few utilities:

* :func:`@web_poet.field <web_poet.fields.field>` decorator,
* :func:`web_poet.item_from_fields <web_poet.fields.item_from_fields>`
and :func:`web_poet.item_from_fields_sync <web_poet.fields.item_from_fields_sync>`
functions.

We can rewrite the example like this:
the :func:`@web_poet.field <web_poet.fields.field>` decorator:

.. code-block:: python

import attrs
from web_poet import ItemPage, HttpResponse, field, item_from_fields_sync
from web_poet import ItemPage, HttpResponse, field


@attrs.define
Expand All @@ -78,64 +70,51 @@ We can rewrite the example like this:
def price(self):
return self.response.css(".price").get()

def to_item(self) -> dict:
return item_from_fields_sync(self)
:class:`~.ItemPage` has a default ``to_item()``
implementation: it uses all the properties created with
:func:`@field <web_poet.fields.field>` decorator, and returns
a dict with the result, where keys are method names, and values are
property values. In the example above, ``to_item()`` returns a
``{"name": ..., "price": ...}`` dict with the extracted data.

Methods annotated with :func:`@field <web_poet.fields.field>` decorator
become properties; for ``page = MyPage(...)`` instance
you can access them as ``page.name``.

As you can guess, :func:`~.item_from_fields_sync` uses all the properties
created with :func:`@field <web_poet.fields.field>` decorator, and returns
a dict with the result, where keys are method names, and values are
property values.
It's important to note that the default
:meth:`ItemPage.to_item() <web_poet.pages.ItemPage.to_item>` implementation
is an ``async def`` function - make sure to await its result:
``item = await page.to_item()``
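The idea of a default ``to_item()`` that gathers every ``@field`` and awaits the async ones can be sketched with the standard library alone. This is only an illustration of the mechanism described above, not web-poet's implementation: the ``field`` marker class and ``SketchItemPage`` below are hypothetical stand-ins.

```python
import asyncio
import inspect


class field(property):
    """Hypothetical stand-in for @field: a property subclass used as a marker."""


class SketchItemPage:
    """Hypothetical base class with a default async to_item()."""

    async def to_item(self) -> dict:
        item = {}
        for name in dir(type(self)):
            # Looking the name up on the class returns the descriptor itself.
            if isinstance(getattr(type(self), name, None), field):
                value = getattr(self, name)
                # Async fields return an awaitable; sync fields return the value.
                if inspect.isawaitable(value):
                    value = await value
                item[name] = value
        return item


class MyPage(SketchItemPage):
    @field
    def name(self):
        return "Chair"

    @field
    async def price(self):
        await asyncio.sleep(0)  # placeholder for an additional request
        return "10.00"


item = asyncio.run(MyPage().to_item())
```

Because sync and async fields can be mixed, the collecting method has to be a coroutine itself, which is why the result must be awaited.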

Asynchronous fields
-------------------

``async def`` fields are also supported, as well as a mix of
sync and async methods - use :func:`~.item_from_fields` in ``to_item``
to make it work.
The reason :class:`~.ItemPage` provides an async ``to_item`` method by
default is that both regular and ``async def`` fields are supported.

For example, you might need to send :ref:`advanced-requests` to extract some
of the attributes:

.. code-block:: python

import attrs
from web_poet import ItemPage, HttpResponse, HttpClient, field, item_from_fields
from web_poet import ItemPage, HttpResponse, HttpClient, field


@attrs.define
class MyPage(ItemPage):
response: HttpResponse
http_client: HttpClient
http: HttpClient

@field
def name(self):
return self.response.css(".name").get()

@field
async def price(self):
resp = self.http_client.get("...")
resp = await self.http.get("...")
return resp.json()['price']

async def to_item(self) -> dict:
return await item_from_fields(self)

Because :func:`~.item_from_fields` supports both sync and async fields,
it's recommended to use it over :func:`~.item_from_fields_sync`, even
if there are no async fields yet. The only reason to use
:func:`~.item_from_fields_sync` would be to avoid using
``async def to_item`` method.

If you want to get a value of an async field, make sure to await it:

.. code-block:: python

page = MyPage(...)
price = await page.price

Using Page Objects with async fields
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Expand Down Expand Up @@ -185,38 +164,43 @@ attrs instances) instead of unstructured dicts to hold the data:
from web_poet import ItemPage, HttpResponse

@attrs.define
class Item:
class Product:
name: str
price: str


@attrs.define
class MyPage(ItemPage):
class ProductPage(ItemPage):
# ...
def to_item(self) -> Item:
return Item(
def to_item(self) -> Product:
return Product(
name=self.name,
price=self.price
)

:mod:`web_poet.fields` supports it, by allowing to pass an item class to the
:func:`~.item_from_fields` / :func:`~.item_from_fields_sync` functions:
:mod:`web_poet.fields` supports this by allowing you to parametrize
:class:`~.ItemPage` with an item class:

.. code-block:: python

@attrs.define
class MyPage(ItemPage):
class ProductPage(ItemPage[Product]):
# ...

async def to_item(self) -> Item:
return await item_from_fields(self, item_cls=Item)
When :class:`~.ItemPage` is parametrized with an item class,
its ``to_item()`` method starts to return item instances, instead
of ``dict`` instances. In the example above ``ProductPage.to_item`` method
returns ``Product`` instances.
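One way a base class could pick up the item class from its type parameter is to inspect ``__orig_bases__`` when a subclass is created. The sketch below shows that general technique under stated assumptions: ``SketchItemPage``, ``item_cls`` and ``field_values`` are hypothetical names, ``dataclasses`` stands in for ``attrs``, and web-poet may well do this differently.

```python
import asyncio
from dataclasses import dataclass
from typing import Generic, TypeVar, get_args

ItemT = TypeVar("ItemT")


class SketchItemPage(Generic[ItemT]):
    """Hypothetical base class; returns dicts unless parametrized."""

    item_cls = dict

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # For SketchItemPage[Product], __orig_bases__ keeps the type argument.
        for base in getattr(cls, "__orig_bases__", ()):
            args = get_args(base)
            if args and not isinstance(args[0], TypeVar):
                cls.item_cls = args[0]

    def field_values(self) -> dict:
        raise NotImplementedError

    async def to_item(self):
        return self.item_cls(**self.field_values())


@dataclass
class Product:
    name: str
    price: str


class ProductPage(SketchItemPage[Product]):
    def field_values(self) -> dict:
        # Stand-in for values that would come from @field properties.
        return {"name": "Chair", "price": "10.00"}


item = asyncio.run(ProductPage().to_item())
```

Here ``item`` is a ``Product`` instance rather than a ``dict``, because the item class was taken from the type parameter.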

Defining an Item class may be overkill if you only have a single Page Object,
but item classes are a great help when

* you need to extract data in the same format from multiple websites, or
* if you want to define the schema upfront.

Item classes can also be used to hold common attribute
pre-processing and validation logic.

Error prevention
~~~~~~~~~~~~~~~~

Expand All @@ -229,35 +213,32 @@ Consider the following badly written page object:
.. code-block:: python

import attrs
from web_poet import ItemPage, HttpResponse, field, item_from_fields
from web_poet import ItemPage, HttpResponse, field

@attrs.define
class Item:
class Product:
name: str
price: str


@attrs.define
class MyPage(ItemPage):
class ProductPage(ItemPage[Product]):
response: HttpResponse

@field
def nane(self):
return self.response.css(".name").get()

async def to_item(self) -> Item:
return await item_from_fields(self, item_cls=Item)

Because Item class is used, a typo ("nane" instead of "name") is detected
at runtime: creation of Item instance would fail with a ``TypeError``, because
Because Product item class is used, a typo ("nane" instead of "name") is detected
at runtime: creation of Product instance would fail with a ``TypeError``, because
of unexpected keyword argument "nane".

After fixing it (renaming "nane" method to "name"), another error is going to be
detected: ``price`` argument is required, but there is no extraction method for
this attribute, so ``Item.__init__`` will raise another ``TypeError``,
this attribute, so ``Product.__init__`` will raise another ``TypeError``,
indicating that a required argument is missing.

Without an Item class, none of these errors are detected.
Without an item class, none of these errors are detected.
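Both failure modes can be reproduced without any page object, since they come from the item class constructor. A minimal sketch, assuming a ``dataclasses`` stand-in for the ``attrs`` class above; ``try_build`` is a hypothetical helper:

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    price: str


def try_build(values: dict) -> str:
    """Build a Product from keyword arguments and report what happened."""
    try:
        Product(**values)
    except TypeError as exc:
        return f"TypeError: {exc}"
    return "ok"


# Typo in a field name: unexpected keyword argument.
typo_result = try_build({"nane": "Chair"})
# Typo fixed, but no value for the required ``price`` argument.
missing_result = try_build({"name": "Chair"})
# Every required attribute is provided.
ok_result = try_build({"name": "Chair", "price": "10.00"})
```

With a plain ``dict`` as the item, the first two calls would succeed silently and the mistakes would only surface later, if at all.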

Changing Item type
~~~~~~~~~~~~~~~~~~
Expand All @@ -278,15 +259,15 @@ different, using the original Page Object as a dependency is a good approach:

import attrs
from my_library import FooPage, StandardItem
from web_poet import ItemPage, HttpResponse, field, ensure_awaitable, item_from_fields
from web_poet import ItemPage, HttpResponse, field, ensure_awaitable

@attrs.define
class CustomItem:
new_name: str
new_price: str

@attrs.define
class CustomFooPage(ItemPage):
class CustomFooPage(ItemPage[CustomItem]):
response: HttpResponse
standard: FooPage

Expand All @@ -300,9 +281,6 @@ different, using the original Page Object as a dependency is a good approach:
async def new_price(self):
...

async def to_item(self) -> CustomItem:
return await item_from_fields(self, item_cls=CustomItem)

However, if items are similar, and share many attributes, this approach
could lead to boilerplate code. For example, you might be extending an item
with a new field, and it'd be required to duplicate definitions for all
Expand All @@ -314,63 +292,62 @@ to the item:

.. code-block:: python

import attrs
from my_library import FooPage, StandardItem
from web_poet import SetItemType, HttpResponse, field, ensure_awaitable

@attrs.define
class CustomItem(StandardItem):
new_field: str

@attrs.define
class CustomFooPage(FooPage):
class CustomFooPage(FooPage, SetItemType[CustomItem]):

@field
def new_field(self) -> str:
# ...

async def to_item(self) -> CustomItem:
# we need to override to_item to ensure CustomItem is returned
return await item_from_fields(self, item_cls=CustomItem)
Note how :class:`~.SetItemType` is used as one of the base classes of
``CustomFooPage``; it changes the item type returned by a page object.

Removing fields (as well as renaming) is more tricky with inheritance though.
Removing fields (as well as renaming) is a bit more tricky.

The caveat is that by default :func:`item_from_fields` uses all fields
The caveat is that by default :class:`~.ItemPage` uses all fields
defined as ``@field`` to produce an item, passing all these values to
``Item.__init__``. So, if you follow the previous example, and inherit from
the "base", "standard" Page Object, there could be a ``@field`` from the base
class which is not present in the ``CustomItem``. It'd be still passed
to ``CustomItem.__init__``, causing an exception.
item's ``__init__`` method. So, if you follow the previous example, and
inherit from the "base", "standard" Page Object, there could be a ``@field``
from the base class which is not present in the ``CustomItem``.
It'd be still passed to ``CustomItem.__init__``, causing an exception.

To solve it, you can either
One way to solve it is to make the original Page Object a dependency
instead of inheriting from it, as explained in the beginning.

* make the original Page Object a dependency instead of inheriting from it
(as explained in the beginning), or
* use ``item_cls_fields=True`` argument of :func:`item_from_fields`:
when ``item_cls_fields`` parameter is True, ``@fields`` which
are not defined in the item are skipped.
Alternatively, you can use ``skip_nonitem_fields=True`` class argument - it tells
Contributor


Suggested change
Alternatively, you can use ``skip_nonitem_fields=True`` class argument - it tells
Alternatively, you can use ``skip_nonitem_fields=True`` class argument (*default*: ``False``) - it tells

Member Author


I'm on the fence about this suggestion :) On one hand, it makes sense. On the other hand, class arguments are so rarely used that thinking about their default values could be too much. I think for most users it'd be just "copy-paste skip_nonitem_fields=True". No strong opinion on it though.

:meth:`~.ItemPage.to_item` to skip ``@fields`` which are not defined
in the item:

.. code-block:: python

@attrs.define
class CustomItem(Item):
class CustomItem:
# let's pick only 1 attribute from StandardItem, nothing more
name: str

@attrs.define
class CustomFooPage(FooPage):
# inheriting from a page object which defines all StandardItem fields
class CustomFooPage(FooPage, SetItemType[CustomItem], skip_nonitem_fields=True):
pass

async def to_item(self) -> CustomItem:
return await item_from_fields(self, item_cls=CustomItem,
item_cls_fields=True)

Here, ``CustomFooPage.to_item`` only uses ``name`` field of the ``FooPage``, ignoring
all other fields defined in ``FooPage``, because ``item_cls_fields=True``
all other fields defined in ``FooPage``, because ``skip_nonitem_fields=True``
is passed, and ``name`` is the only field ``CustomItem`` supports.

To recap:

* Use ``item_cls_fields=False`` (default) when your Page Object corresponds
to an item exactly, or when you're only adding fields. This is a safe option,
which allows to detect typos in field names, even for optional fields.
* Use ``item_cls_fields=True`` when it's possible for the Page Object
* Don't use ``skip_nonitem_fields=True`` when your Page Object corresponds
to an item exactly, or when you're only adding fields. This is a safe
approach, which allows to detect typos in field names, even for optional
fields.
* Use ``skip_nonitem_fields=True`` when it's possible for the Page Object
to contain more ``@fields`` than defined in the item class, e.g. because
Page Object is inherited from some other base Page Object.
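The trade-off in the recap can be illustrated with a small filtering helper. This is a sketch of the idea only: ``build_item`` is a hypothetical function, ``dataclasses`` stands in for ``attrs``, and only the ``skip_nonitem_fields`` name is taken from the text above.

```python
from dataclasses import dataclass, fields


@dataclass
class CustomItem:
    # Only one attribute is kept from the larger "standard" item.
    name: str


def build_item(item_cls, values: dict, skip_nonitem_fields: bool = False):
    """Create an item, optionally dropping values the item class does not define."""
    if skip_nonitem_fields:
        allowed = {f.name for f in fields(item_cls)}
        values = {key: value for key, value in values.items() if key in allowed}
    return item_cls(**values)


# Values as they might come from a base page object with more fields.
page_values = {"name": "Chair", "price": "10.00"}

item = build_item(CustomItem, page_values, skip_nonitem_fields=True)

try:
    build_item(CustomItem, page_values)  # strict mode: the extra field is an error
    strict_failed = False
except TypeError:
    strict_failed = True
```

The strict mode keeps typo detection for field names, while the skipping mode trades that check for the ability to reuse a page object that defines more fields than the item.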

Expand Down Expand Up @@ -405,14 +382,7 @@ extracting the heavy operation to a method, and caching the results:

.. code-block:: python

from web_poet import (
ItemPage,
HttpResponse,
HttpClient,
field,
cached_method,
item_from_fields
)
from web_poet import ItemPage, HttpResponse, HttpClient, field, cached_method

class MyPage(ItemPage):
response: HttpResponse
Expand All @@ -437,9 +407,6 @@ extracting the heavy operation to a method, and caching the results:
api_response = await self.api_response()
return api_response["sku"]

async def to_item(self):
return await item_from_fields(self)

As you can see, ``web-poet`` provides :func:`~.cached_method` decorator,
which allows to memoize the function results. It supports both sync and
async methods, i.e. you can use it on regular methods (``def foo(self)``),
Expand Down Expand Up @@ -517,3 +484,4 @@ returns a dictionary, where keys are field names, and values are
fields_dict = get_fields_dict(MyPage)
field_names = fields_dict.keys()
my_field_meta = fields_dict["my_field"].meta

5 changes: 5 additions & 0 deletions pyproject.toml
@@ -1,3 +1,8 @@
[tool.isort]
profile = "black"
multi_line_output = 3

[tool.mypy]
show_error_codes = true
ignore_missing_imports = true
no_warn_no_return = true