integration for web-poet's support on additional requests and Meta #62

Merged 77 commits on Jun 16, 2022
77 commits
d43e8e6
add basic integration for web-poet's support on additional requests
BurnzZ Feb 8, 2022
9bc60d0
create provider for web-poet's new HttpClient and GenericRequest
BurnzZ Feb 8, 2022
7505176
enable tox dep in draft branch of web-poet for CI
BurnzZ Feb 8, 2022
c30546f
use the new status and headers from ResponseData
BurnzZ Feb 9, 2022
c7918eb
accept either web_poet.GenericRequest or scrapy.Request
BurnzZ Feb 9, 2022
7f539bb
create provider for web_poet.page_inputs.Meta
BurnzZ Feb 9, 2022
ed2c489
use 'po_args' inside a Request meta instead of using the entire meta
BurnzZ Feb 17, 2022
0bd3b80
use web-poet's new Request container
BurnzZ Feb 17, 2022
5488504
sync dep to WIP branch to run tox tests
BurnzZ Feb 18, 2022
b5e9c56
add tests
BurnzZ Feb 21, 2022
4dd19b8
remove ContextVar approach and use Dependency Injection in Provider i…
BurnzZ Mar 11, 2022
2a155f5
update CHANGELOG to new support on additional requests
BurnzZ Mar 11, 2022
e8f4c10
add docs for supporting web-poet's HttpClient and Meta
BurnzZ Mar 15, 2022
8340ced
Update to use HttpResponse which replaces ResponseData
BurnzZ Mar 29, 2022
ae4d8a5
remove unused imports
BurnzZ Mar 29, 2022
ba0d8fe
add basic integration for web-poet's support on additional requests
BurnzZ Feb 8, 2022
81df664
create provider for web-poet's new HttpClient and GenericRequest
BurnzZ Feb 8, 2022
1316090
add tests
BurnzZ Feb 21, 2022
f8a7efe
remove ContextVar approach and use Dependency Injection in Provider i…
BurnzZ Mar 11, 2022
cc97213
update CHANGELOG to new support on additional requests
BurnzZ Mar 11, 2022
a25b61e
update callback_for() to have async support
BurnzZ Mar 16, 2022
eb3e837
add docs mentioning async support in callback_for()
BurnzZ Mar 16, 2022
7b2d4cf
force callback_for() to have 'is_async' to be keyword-only param
BurnzZ Mar 16, 2022
5c4326f
update async test spider to use async PO as well
BurnzZ Mar 16, 2022
b79dfa8
remove 'is_async' param in callback_for
BurnzZ Mar 18, 2022
a74d264
remove duplicated test
BurnzZ Mar 29, 2022
af0b802
remove unrelated file
BurnzZ Apr 5, 2022
d23a169
Merge branch 'master' of ssh://github.com/scrapinghub/scrapy-poet int…
BurnzZ May 16, 2022
42bf6da
update imports after web_poet refactoring
BurnzZ May 16, 2022
9ce67ae
Merge branch 'master' of http://github.com/scrapinghub/scrapy-poet in…
BurnzZ May 19, 2022
3983ed6
fix duplicated entry in CHANGELOG
BurnzZ May 27, 2022
d76db34
Remove implementation details about callback_for() in the docs
BurnzZ May 27, 2022
f1126fb
remove else block in callback_for()
BurnzZ May 27, 2022
c2bfe89
Update docs/intro/basic-tutorial.rst
BurnzZ May 27, 2022
10b56e3
Merge pull request #66 from scrapinghub/async_callback_for
kmike May 27, 2022
75d5a13
Fix tests
Gallaecio May 30, 2022
1df25a8
Handle additional request IgnoreError as per web-poet #38
Gallaecio May 30, 2022
2d6da5a
Fix the documentation build
Gallaecio May 30, 2022
063ef20
Support a non-asyncio Twisted reactor
Gallaecio Jun 2, 2022
2a9e8d0
Fix tests
Gallaecio Jun 2, 2022
f47816b
backend: handle unexpected exceptions as HttpRequestError
Gallaecio Jun 2, 2022
c1e8b93
Additional requests: prevent HEAD redirects
Gallaecio Jun 2, 2022
e7989ed
Additional requests: do not filter out duplicate requests
Gallaecio Jun 2, 2022
6252526
Make the latest tests compatible with Python 3.7
Gallaecio Jun 2, 2022
0358146
po_args → po_meta
Gallaecio Jun 2, 2022
0146c83
Document the peculiarities of additional request handling
Gallaecio Jun 2, 2022
b4ce395
Test both asyncio and non-asyncio reactors
Gallaecio Jun 2, 2022
bedbb68
GitHub Actions: test both reactors
Gallaecio Jun 2, 2022
98a47ba
Use raise-from syntax for additional request exceptions
Gallaecio Jun 2, 2022
4d49428
Fix syntax error
Gallaecio Jun 2, 2022
1d7ef88
Move request conversion into a function
Gallaecio Jun 6, 2022
9f9dfa1
On request conversion, silently ignore unknown attributes
Gallaecio Jun 6, 2022
6f5218f
Contextualize additional request exception handling
Gallaecio Jun 6, 2022
850ad4e
Pass user-defined encoding on response conversion
Gallaecio Jun 6, 2022
5f483c4
Support non-string values as meta keys
Gallaecio Jun 6, 2022
5752664
Merge remote-tracking branch 'origin/master' into po-additional-requests
Gallaecio Jun 6, 2022
19a3283
Remove request conversion TypeError handling
Gallaecio Jun 6, 2022
4dcca57
Provide integration tests for good and bad additional responses
Gallaecio Jun 6, 2022
aa09109
Meta → PageParams
Gallaecio Jun 7, 2022
dcb6716
Implement test_additional_requests_connection_issue
Gallaecio Jun 7, 2022
381eb25
Implement test_additional_requests_ignored_request
Gallaecio Jun 7, 2022
3429d49
Implement test_additional_requests_unhandled_downloader_middleware_ex…
Gallaecio Jun 7, 2022
902d41c
Fix pre-3.9 syntax error
Gallaecio Jun 7, 2022
776b768
Implement test_additional_requests_dont_filter
Gallaecio Jun 7, 2022
a1c52f8
Remove unneeded test
Gallaecio Jun 7, 2022
64eeb72
Update docs/intro/advanced-tutorial.rst
Gallaecio Jun 8, 2022
5210b2a
Update docs/intro/advanced-tutorial.rst
Gallaecio Jun 8, 2022
8aecfab
backend → download_func
Gallaecio Jun 8, 2022
dc36f5a
test_additional_requests_dont_filter: ensure additional requests are …
Gallaecio Jun 8, 2022
b839eb1
Cast url from HttpRequest and HttpResponse before comparisons
Gallaecio Jun 9, 2022
995e9b9
fix examples when using callback_for()
BurnzZ Jun 13, 2022
6c07ce4
Merge pull request #75 from scrapinghub/callback-for-docs
kmike Jun 13, 2022
be786fc
backend → downloader
Gallaecio Jun 15, 2022
ded8354
Merge remote-tracking branch 'origin/po-additional-requests' into po-…
Gallaecio Jun 15, 2022
3398dc7
Update install requirements
Gallaecio Jun 15, 2022
566e727
GitHub Actions: test minimum dependency versions
Gallaecio Jun 15, 2022
98ce454
Revert mypy changes
Gallaecio Jun 15, 2022
18 changes: 17 additions & 1 deletion .github/workflows/test.yml
@@ -16,7 +16,21 @@ jobs:
    strategy:
      fail-fast: false
      matrix:
        python-version: ['3.7', '3.8', '3.9', '3.10']
        include:
        - python-version: "3.7"
          toxenv: "min"
        - python-version: "3.7"
          toxenv: "asyncio-min"

        - python-version: "3.8"
          toxenv: "py"
        - python-version: "3.9"
          toxenv: "py"

        - python-version: "3.10"
          toxenv: "py"
        - python-version: "3.10"
          toxenv: "asyncio"

    steps:
    - uses: actions/checkout@v2
@@ -29,6 +43,8 @@ jobs:
        python -m pip install --upgrade pip
        python -m pip install tox
    - name: tox
      env:
        TOXENV: ${{ matrix.toxenv }}
      run: |
        tox -e py
    - name: coverage
8 changes: 8 additions & 0 deletions CHANGELOG.rst
@@ -6,6 +6,12 @@ TBR
---

* Use the new ``web_poet.HttpResponse`` which replaces ``web_poet.ResponseData``.
* Support for the new features in ``web_poet>=0.2.0`` for additional
  requests inside Page Objects:

  * Created new providers for ``web_poet.PageParams`` and
    ``web_poet.HttpClient``.
  * The minimum Scrapy version is now ``2.6.0``.
  * We have these **backward incompatible** changes since
    ``web_poet.OverrideRule`` follows a different structure:

@@ -15,6 +21,8 @@ TBR
    * This results in a newer format in the ``SCRAPY_POET_OVERRIDES`` setting.
    * Removal of this deprecated module: ``scrapy.utils.reqser``

* Add ``async`` support for ``callback_for``.


0.3.0 (2022-01-28)
------------------
24 changes: 24 additions & 0 deletions README.rst
@@ -36,3 +36,27 @@ License is BSD 3-clause.
* Issue tracker: https://github.com/scrapinghub/scrapy-poet/issues

.. _`web-poet`: https://github.com/scrapinghub/web-poet


Quick Start
***********

Installation
============

.. code-block::

    pip install scrapy-poet

Requires **Python 3.7+** and **Scrapy >= 2.6.0**.

Usage in a Scrapy Project
=========================

Add the following inside Scrapy's ``settings.py`` file:

.. code-block:: python

    DOWNLOADER_MIDDLEWARES = {
        "scrapy_poet.InjectionMiddleware": 543,
    }
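
With the middleware enabled, spider callbacks can declare Page Object
dependencies and **scrapy-poet** builds and injects them. A minimal sketch,
using a hypothetical ``BookPage`` Page Object together with the
``callback_for`` helper that **scrapy-poet** provides:

.. code-block:: python

    import scrapy
    import web_poet
    from scrapy_poet import callback_for


    class BookPage(web_poet.ItemWebPage):
        """Hypothetical Page Object extracting a book's title."""

        def to_item(self):
            return {
                "url": self.url,
                "title": self.css("h1 ::text").get(),
            }


    class BooksSpider(scrapy.Spider):
        name = "books"
        start_urls = ["http://books.toscrape.com/"]

        def parse(self, response):
            # scrapy-poet builds a BookPage for each followed response, and
            # callback_for() yields the result of its to_item() method.
            links = response.css(".image_container a")
            yield from response.follow_all(links, self.parse_book)

        parse_book = callback_for(BookPage)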
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -61,7 +61,7 @@
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
language = "en"

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
3 changes: 2 additions & 1 deletion docs/index.rst
@@ -35,7 +35,8 @@ To get started, see :ref:`intro-install` and :ref:`intro-tutorial`.
   :maxdepth: 1

   intro/install
   intro/tutorial
   intro/basic-tutorial
   intro/advanced-tutorial

.. toctree::
   :caption: Advanced
168 changes: 168 additions & 0 deletions docs/intro/advanced-tutorial.rst
@@ -0,0 +1,168 @@
.. _`intro-advanced-tutorial`:

=================
Advanced Tutorial
=================

This section goes over the **web-poet** features supported by
**scrapy-poet**:

* ``web_poet.HttpClient``
* ``web_poet.PageParams``

These are mainly achieved by **scrapy-poet** implementing **providers** for them:

* :class:`scrapy_poet.page_input_providers.HttpClientProvider`
* :class:`scrapy_poet.page_input_providers.PageParamsProvider`

.. _`intro-additional-requests`:

Additional Requests
===================

Using Page Objects that issue additional requests doesn't require anything
special from the spider. They work as-is because the
:class:`scrapy_poet.page_input_providers.HttpClientProvider` is enabled
out of the box.

This supplies the Page Object with the necessary ``web_poet.HttpClient`` instance.

The HTTP client implementation that **scrapy-poet** provides to
``web_poet.HttpClient`` handles requests as follows:

- Requests go through downloader middlewares, but they do not go through
spider middlewares or through the scheduler.

- Duplicate requests are not filtered out.

- In line with the web-poet specification for additional requests,
``Request.meta['dont_redirect']`` is set to ``True`` for requests with the
``HEAD`` HTTP method.
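
If an additional request fails, the error surfaces inside the Page Object and
can be handled around the ``HttpClient`` call. A minimal sketch, assuming
``web_poet.exceptions.HttpError`` is the base exception that **web-poet**
raises for failed additional requests:

.. code-block:: python

    import attr
    import web_poet
    from web_poet.exceptions import HttpError


    @attr.define
    class ResilientProductPage(web_poet.ItemWebPage):
        """Hypothetical Page Object that tolerates a failed additional request."""

        http_client: web_poet.HttpClient

        async def to_item(self):
            item = {"url": self.url}
            try:
                response = await self.http_client.get(
                    "https://api.example.com/v2/images"
                )
            except HttpError:
                # Connection issues and bad responses both surface here;
                # fall back to an empty image list instead of failing.
                item["images"] = []
            else:
                item["images"] = response.css("img::attr(src)").getall()
            return item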

Suppose we have the following Page Object:

.. code-block:: python

    import attr
    import web_poet


    @attr.define
    class ProductPage(web_poet.ItemWebPage):
        http_client: web_poet.HttpClient

        async def to_item(self):
            item = {
                "url": self.url,
                "name": self.css("#main h3.name ::text").get(),
                "product_id": self.css("#product ::attr(product-id)").get(),
            }

            # Simulates clicking on a button that says "View All Images"
            response: web_poet.HttpResponse = await self.http_client.get(
                f"https://api.example.com/v2/images?id={item['product_id']}"
            )
            item["images"] = response.css(".product-images img::attr(src)").getall()
            return item


It can be directly used inside the spider as:

.. code-block:: python

    import scrapy


    class ProductSpider(scrapy.Spider):

        custom_settings = {
            "DOWNLOADER_MIDDLEWARES": {
                "scrapy_poet.InjectionMiddleware": 543,
            }
        }

        start_urls = [
            "https://example.com/category/product/item?id=123",
            "https://example.com/category/product/item?id=989",
        ]

        async def parse(self, response, page: ProductPage):
            return await page.to_item()

Note that we needed to update the ``parse()`` method to be an ``async`` method,
since the ``to_item()`` method of the Page Object we're using is an ``async``
method as well.
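
Alternatively, since :func:`~.callback_for` detects ``async`` ``to_item()``
methods and returns an async generator callback (see
:ref:`intro-basic-tutorial`), the same spider could be sketched as:

.. code-block:: python

    import scrapy
    from scrapy_poet import callback_for


    class ProductSpider(scrapy.Spider):

        custom_settings = {
            "DOWNLOADER_MIDDLEWARES": {
                "scrapy_poet.InjectionMiddleware": 543,
            }
        }

        start_urls = [
            "https://example.com/category/product/item?id=123",
            "https://example.com/category/product/item?id=989",
        ]

        # callback_for() notices that ProductPage.to_item() is a coroutine
        # and produces an async generator callback accordingly.
        parse = callback_for(ProductPage)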


Page params
===========

Using ``web_poet.PageParams`` allows the Scrapy spider to pass any arbitrary
information into the Page Object.

Suppose we update the earlier Page Object so that a page parameter controls
whether the additional request is made, acting as a switch that changes the
Page Object's behavior:

.. code-block:: python

    import attr
    import web_poet


    @attr.define
    class ProductPage(web_poet.ItemWebPage):
        http_client: web_poet.HttpClient
        page_params: web_poet.PageParams

        async def to_item(self):
            item = {
                "url": self.url,
                "name": self.css("#main h3.name ::text").get(),
                "product_id": self.css("#product ::attr(product-id)").get(),
            }

            # Simulates clicking on a button that says "View All Images"
            if self.page_params.get("enable_extracting_all_images"):
                response: web_poet.HttpResponse = await self.http_client.get(
                    f"https://api.example.com/v2/images?id={item['product_id']}"
                )
                item["images"] = response.css(".product-images img::attr(src)").getall()

            return item

Passing the ``enable_extracting_all_images`` page parameter from the spider
into the Page Object is achieved via **Scrapy's** ``Request.meta`` attribute.
Specifically, the ``dict`` value of the ``page_params`` key in
``Request.meta`` is passed into ``web_poet.PageParams``.

Let's see it in action:

.. code-block:: python

    import scrapy


    class ProductSpider(scrapy.Spider):

        custom_settings = {
            "DOWNLOADER_MIDDLEWARES": {
                "scrapy_poet.InjectionMiddleware": 543,
            }
        }

        start_urls = [
            "https://example.com/category/product/item?id=123",
            "https://example.com/category/product/item?id=989",
        ]

        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(
                    url=url,
                    callback=self.parse,
                    meta={"page_params": {"enable_extracting_all_images": True}}
                )

        async def parse(self, response, page: ProductPage):
            return await page.to_item()
44 changes: 40 additions & 4 deletions docs/intro/tutorial.rst → docs/intro/basic-tutorial.rst
@@ -1,8 +1,8 @@
.. _`intro-tutorial`:
.. _`intro-basic-tutorial`:

========
Tutorial
========
==============
Basic Tutorial
==============

In this tutorial, we’ll assume that ``scrapy-poet`` is already installed on your
system. If that’s not the case, see :ref:`intro-install`.
@@ -198,6 +198,42 @@ returning the result of the ``to_item`` method call. We could use
``response.follow_all(links, callback_for(BookPage))``, without creating
an attribute, but currently it won't work with Scrapy disk queues.

.. tip::

    :func:`~.callback_for` also supports `async generators`. So if we have the
    following:

    .. code-block:: python

        class BooksSpider(scrapy.Spider):
            name = 'books'
            start_urls = ['http://books.toscrape.com/']

            def parse(self, response):
                links = response.css('.image_container a')
                yield from response.follow_all(links, self.parse_book)

            async def parse_book(self, response: DummyResponse, page: BookPage):
                yield await page.to_item()

    It could be turned into:

    .. code-block:: python

        class BooksSpider(scrapy.Spider):
            name = 'books'
            start_urls = ['http://books.toscrape.com/']

            def parse(self, response):
                links = response.css('.image_container a')
                yield from response.follow_all(links, self.parse_book)

            parse_book = callback_for(BookPage)

    This is useful when the Page Object uses additional requests, which rely
    heavily on ``async/await`` syntax. See the tutorial section
    :ref:`intro-additional-requests` for more information.

Final result
============

2 changes: 1 addition & 1 deletion docs/intro/install.rst
@@ -16,7 +16,7 @@ If you’re already familiar with installation of Python packages, you can insta

pip install scrapy-poet

Scrapy 2.1.0 or above is required and it has to be installed separately.
Scrapy 2.6.0 or above is required and it has to be installed separately.

Things that are good to know
============================
2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -1,3 +1,3 @@
Scrapy >= 2.1.0
Scrapy >= 2.6.0
Sphinx >= 3.0.3
sphinx-rtd-theme >= 0.4
40 changes: 40 additions & 0 deletions scrapy_poet/api.py
@@ -1,4 +1,5 @@
from typing import Callable, Optional, Type
from inspect import iscoroutinefunction

from scrapy.http import Request, Response

@@ -55,6 +56,38 @@ def parse_book(self, response: DummyResponse, page: BookPage):

It allows writing this:

.. code-block:: python

    class BooksSpider(scrapy.Spider):
        name = 'books'
        start_urls = ['http://books.toscrape.com/']

        def parse(self, response):
            links = response.css('.image_container a')
            yield from response.follow_all(links, self.parse_book)

        parse_book = callback_for(BookPage)

It also supports producing an async generator callable if the Page Object's
``to_item()`` method is a coroutine which uses the ``async/await`` syntax.

So if we have the following:

.. code-block:: python

    class BooksSpider(scrapy.Spider):
        name = 'books'
        start_urls = ['http://books.toscrape.com/']

        def parse(self, response):
            links = response.css('.image_container a')
            yield from response.follow_all(links, self.parse_book)

        async def parse_book(self, response: DummyResponse, page: BookPage):
            yield await page.to_item()

It could be turned into:

.. code-block:: python

    class BooksSpider(scrapy.Spider):
@@ -90,5 +123,12 @@ def parse(self, response):
    def parse(*args, page: page_cls, **kwargs):  # type: ignore
        yield page.to_item()  # type: ignore

    async def async_parse(*args, page: page_cls, **kwargs):  # type: ignore
        yield await page.to_item()  # type: ignore

    if iscoroutinefunction(page_cls.to_item):
        setattr(async_parse, _CALLBACK_FOR_MARKER, True)
        return async_parse

    setattr(parse, _CALLBACK_FOR_MARKER, True)
    return parse