
Handle item overrides #164

Merged
merged 11 commits into master on Sep 25, 2023

Conversation

@BurnzZ (Contributor) commented Sep 18, 2023

Fixes the case where the code below results in a ProviderDependencyDeadlockError. In this example, the issue stems from the declared product: Product dependency: scrapy-poet sees ProductPage as the one that provides it, which causes the deadlock error.

import attrs
from web_poet import WebPage, field, handle_urls
from zyte_common_items import Product

@handle_urls("example.com")
@attrs.define
class ProductPage(WebPage[Product]):
    product: Product

    @field
    def name(self) -> str:
        return f"(modified) {self.product.name}"

The use case for this is when items like Product are produced by a provider rather than by a page object, e.g. scrapy-zyte-api's ZyteApiProvider (reference). The ProductPage in this scenario overrides the item produced by ZyteApiProvider.
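To make the deadlock concrete, here is a toy model of dependency resolution, not scrapy-poet's actual code: the registry says Product is built by ProductPage, but ProductPage itself requires a Product instance, so resolution cycles unless an upstream provider seeds the item first. The `build` helper and its names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str


class ProductPage:
    def __init__(self, product: Product):
        self.product = product


# Registry: "to build X, instantiate provider_of[X]"
provider_of = {Product: ProductPage}
# Declared dependencies of each builder
deps_of = {ProductPage: [Product]}


def build(cls, instances, building=frozenset()):
    """Resolve cls, consulting already-built `instances` first
    (the role prev_instances plays in this PR)."""
    if cls in instances:
        return instances[cls]
    if cls in building:
        raise RuntimeError("ProviderDependencyDeadlockError (cycle)")
    maker = provider_of.get(cls, cls)
    args = [build(d, instances, building | {cls}) for d in deps_of.get(maker, [])]
    instances[cls] = maker(*args)
    return instances[cls]


# Without a provider-built Product, resolution cycles and deadlocks:
try:
    build(Product, {})
except RuntimeError as exc:
    print(exc)

# Seeding the item from an upstream provider breaks the cycle:
seeded = {Product: Product(name="data from provider")}
page = build(ProductPage, seeded)
print(page.product.name)  # data from provider
```

With the seed in place, ProductPage's Product dependency is satisfied from the cache instead of recursing back into ProductPage.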


TODO:

  • add more test cases to handle different scenarios
  • verify that this is the best implementation approach

tests/test_web_poet_rules.py (outdated)
@@ -34,7 +34,7 @@
     PageParamsProvider: 700,
     RequestUrlProvider: 800,
     ResponseUrlProvider: 900,
-    ItemProvider: 1000,
+    ItemProvider: 2000,
@BurnzZ (Contributor, Author):

Increasing this value allows ZyteApiProvider (as well as any other item provider in the future) to run before scrapy-poet's ItemProvider. This ensures that any item dependencies in page objects are available.
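As a sketch of the ordering effect, scrapy-poet runs providers in ascending priority order, so raising ItemProvider to 2000 pushes it after the others. ZyteApiProvider's value (1100) is a hypothetical example, not taken from either codebase; the other values come from the diff hunk above.

```python
# Lower priority number = runs earlier.
priorities = {
    "PageParamsProvider": 700,
    "RequestUrlProvider": 800,
    "ResponseUrlProvider": 900,
    "ZyteApiProvider": 1100,  # hypothetical external item provider
    "ItemProvider": 2000,     # bumped from 1000 so it runs last
}

order = sorted(priorities, key=priorities.get)
print(order)
```

With the old value of 1000, ItemProvider would have sorted before a hypothetical external provider at 1100, leaving the item unavailable when page objects needed it.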

@BurnzZ BurnzZ changed the title WIP: handle item overrides Handle item overrides Sep 19, 2023
 ):
     """Build dependencies handled by registered providers"""
-    instances: Dict[Callable, Any] = {}
+    instances: Dict[Callable, Any] = prev_instances or {}
@BurnzZ (Contributor, Author) commented Sep 19, 2023

Passing prev_instances allows the page object to gain access to the item produced by a provider (as long as that provider was called earlier).

@attrs.define 
class ProductPage(WebPage[Product]):
    product: Product

In the above example, before the ProductPage instance can be created, the product: Product instance must first be built. This is where the new prev_instances parameter is useful.
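A minimal sketch of the prev_instances pattern from the diff above, assuming a simplified build function (the real build_instances in scrapy_poet/injection.py is a generator with more machinery): the instance cache is seeded with results from earlier provider calls, so later resolution reuses them instead of rebuilding.

```python
from typing import Any, Callable, Dict, Optional


def build_instances(
    providers: Dict[Callable, Callable],
    prev_instances: Optional[Dict[Callable, Any]] = None,
) -> Dict[Callable, Any]:
    # Start from what earlier providers already built (the PR's change);
    # previously this was always a fresh empty dict.
    instances: Dict[Callable, Any] = prev_instances or {}
    for cls, provider in providers.items():
        if cls not in instances:  # don't clobber provider-built items
            instances[cls] = provider()
    return instances


class Product:
    def __init__(self, name: str = "from page object"):
        self.name = name


# An earlier provider (e.g. ZyteApiProvider) already produced a Product:
prev = {Product: Product(name="from ZyteApiProvider")}
result = build_instances({Product: Product}, prev_instances=prev)
print(result[Product].name)  # from ZyteApiProvider
```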

codecov bot commented Sep 19, 2023

Codecov Report

Merging #164 (3536421) into master (5339d63) will increase coverage by 0.14%.
The diff coverage is 100.00%.

❗ Current head 3536421 differs from pull request most recent head 92d1f1e. Consider uploading reports for the commit 92d1f1e to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #164      +/-   ##
==========================================
+ Coverage   85.64%   85.78%   +0.14%     
==========================================
  Files          14       14              
  Lines         801      809       +8     
==========================================
+ Hits          686      694       +8     
  Misses        115      115              
Files Changed Coverage Δ
scrapy_poet/downloadermiddlewares.py 100.00% <ø> (ø)
scrapy_poet/commands.py 35.64% <100.00%> (ø)
scrapy_poet/injection.py 98.52% <100.00%> (+0.05%) ⬆️
scrapy_poet/page_input_providers.py 100.00% <100.00%> (ø)

@@ -414,6 +441,7 @@ class MySpider(Spider):
     spider = MySpider()
     spider.settings = settings
     crawler.spider = spider
+    crawler.stats = load_object(crawler.settings["STATS_CLASS"])(crawler)
@BurnzZ (Contributor, Author):

Changes in Scrapy 2.11 broke this, since stats was missing.

Member:

Ouch, I've already fixed it in master before looking here.

But it's good to see we did it in (kinda) the same way!

@BurnzZ BurnzZ marked this pull request as ready for review September 19, 2023 12:14
}
# item from 'to_return'
item, deps = yield crawl_item_and_deps(Kangaroo, override_settings=settings)
assert item == Kangaroo(name="(modified by Joey) data from KangarooProvider")
@BurnzZ (Contributor, Author):

Notice that only JoeyPage touched the data, while KangarooPage didn't: the former page object had higher priority.

@Gallaecio (Member) commented Sep 21, 2023

Conflicts resolved, but john-kurkowski/tldextract#305 breaks tests through url-matcher’s get_domain, will fix that now.

To-do:

  • Freeze url-matcher in the minimum-requirements tests as well (it is already the case 🤦).

@Gallaecio Gallaecio self-requested a review September 21, 2023 14:32
@Gallaecio (Member) commented Sep 21, 2023

I need to investigate a potential issue with DummyResponse not working on code that relies on this branch.

Never mind, it was caused by an outdated test expectation.

@kmike kmike merged commit bd9f0e9 into master Sep 25, 2023
13 checks passed
@kmike (Member) commented Sep 25, 2023

Thanks @BurnzZ and @Gallaecio!

@wRAR wRAR deleted the item-overrides branch September 26, 2023 15:11
4 participants