ZyteApiProvider could make an unneeded API request #91

kmike · 2023-06-12T10:46:38Z

In the example below, ZyteApiProvider makes 2 API requests instead of 1:

@handle_urls("example.com")
@attrs.define
class MyPage(ItemPage[MyItem]):
    html: BrowserHtml
    # ...

class MySpider(scrapy.Spider):
    # ...
    def parse(self, response: DummyResponse, product: Product, my_item: MyItem):
        ...  # callback body omitted

Gallaecio · 2023-06-15T13:26:50Z

Findings so far:

"Remove ItemProvider’s Response dependency" (scrapinghub/scrapy-poet#151) won’t fix this.
This issue seems to be caused by Zyte API provided classes being resolved at different stages. If you request both product and browser_response directly in the callback, a single request is sent. Otherwise, first Product is injected, then MyItem resolves to MyPage, then BrowserHtml is injected. I am not sure yet how to best solve that.
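The staged resolution described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyProvider`, `resolve_in_stages` and `resolve_together` are hypothetical names, and nothing here reflects the actual scrapy-poet or scrapy-zyte-api injection code. It only shows why asking for dependencies in two separate provider calls yields two requests, while asking for them together yields one.

```python
from dataclasses import dataclass, field


@dataclass
class ToyProvider:
    """Hypothetical stand-in for a provider that sends one API request per call."""

    sent: list = field(default_factory=list)

    def provide(self, url, wanted):
        # One call == one API request carrying every field asked for in that call.
        payload = {"url": url, **{name: True for name in sorted(wanted)}}
        self.sent.append(payload)
        return {name: f"<{name} for {url}>" for name in wanted}


def resolve_in_stages(provider, url):
    # Stage 1: the callback asks for Product directly.
    provider.provide(url, {"product"})
    # Stage 2: MyItem resolves to MyPage, which only then asks for BrowserHtml.
    provider.provide(url, {"browserHtml"})


def resolve_together(provider, url):
    # Both dependencies are known up front, so they can share a single request.
    provider.provide(url, {"product", "browserHtml"})


staged = ToyProvider()
resolve_in_stages(staged, "https://example.com")

together = ToyProvider()
resolve_together(together, "https://example.com")

# Number of requests sent by each strategy.
print(len(staged.sent), len(together.sent))
```

In this model the only difference between the two strategies is when the full set of dependencies becomes known, which matches the observation that requesting both directly in the callback sends a single request.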

wRAR · 2023-06-15T14:59:52Z

Yeah, the problem AFAIK is that ItemProvider calls build_instances itself. scrapinghub/scrapy-poet#151 is actually about a third request made in this or a similar use case.

wRAR · 2023-06-15T15:17:32Z

We also thought the solution may involve the caching feature in ItemProvider but didn't investigate further.

Gallaecio · 2023-06-15T15:42:13Z

Indeed.

Gallaecio · 2023-06-20T09:36:38Z

New finding: Switching MyItem to MyPage works, even though there is still some level of indirection. This could explain why scrapinghub/scrapy-poet#153 works.

BurnzZ · 2023-10-03T06:43:42Z

I looked into this further and it still occurs without any Page Objects involved.

The sent Zyte API requests were determined by setting ZYTE_API_LOG_REQUESTS=True.
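As a rough aid for this kind of check, the logged payloads can be tallied per URL. The sketch below assumes the payloads have already been collected from the log as plain dicts; `requests_per_url` is a hypothetical helper, not part of scrapy-zyte-api, and the sample data is the pair of payloads reported for Case 2 later in this comment.

```python
from collections import Counter

# settings.py: the setting named above, which logs each Zyte API request.
ZYTE_API_LOG_REQUESTS = True


def requests_per_url(payloads):
    """Count how many logged request payloads target each URL (hypothetical helper)."""
    return Counter(payload["url"] for payload in payloads)


# Payloads copied by hand from the log output (here: the ones listed under Case 2).
logged = [
    {"browserHtml": True, "url": "https://books.toscrape.com"},
    {"productNavigation": True, "url": "https://books.toscrape.com"},
]

counts = requests_per_url(logged)

# URLs that were requested more than once are candidates for merging.
duplicated = {url: n for url, n in counts.items() if n > 1}
print(duplicated)
```

Any URL that shows up in `duplicated` received more than one Zyte API request and is worth a closer look.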

Given the following spider:

class BooksSpider(scrapy.Spider):
    name = "books"

    def start_requests(self):
        yield scrapy.Request(
            url="https://books.toscrape.com",
            callback=self.parse_nav,
            meta={"zyte_api": {"browserHtml": True}},
        )

Case 1

✅ The following callback setup is correct, since it sends only 1 request:

# {"productNavigation": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response: DummyResponse, navigation: ProductNavigation):
    ...

Case 2

❌ However, the following sends 2 separate requests:

# {"browserHtml": true, "url": "https://books.toscrape.com"}
# {"productNavigation": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response, navigation: ProductNavigation):
    ...

This case should not happen since browserHtml and productNavigation can both be present in the same Zyte API Request.
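To make the expected behavior concrete, here is a minimal sketch of what a single combined payload could look like. `merge_zyte_api_params` is a hypothetical helper written for illustration; it is not how scrapy-zyte-api builds requests, and treating a non-dict `zyte_api` value as carrying no explicit parameters is an assumption of this sketch.

```python
def merge_zyte_api_params(meta, provider_params):
    """Combine params from meta["zyte_api"] with the provider's own (hypothetical)."""
    from_meta = meta.get("zyte_api") or {}
    if not isinstance(from_meta, dict):
        # Assumption: a non-dict value carries no explicit parameters to merge.
        from_meta = {}
    # Provider params win on conflict; the union goes into one request.
    return {**from_meta, **provider_params}


# Request meta as set in start_requests() of the spider above.
meta = {"zyte_api": {"browserHtml": True}}

# Parameters needed to inject ProductNavigation into the callback.
provider_params = {"productNavigation": True, "url": "https://books.toscrape.com"}

merged = merge_zyte_api_params(meta, provider_params)
print(merged)
```

With a payload like `merged`, Case 2 would need only one request for the URL instead of two.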

Case 3

However, if we introduce a Page Object to the same spider:

@handle_urls("")
@attrs.define
class ProductNavigationPage(ItemPage[ProductNavigation]):
    response: BrowserResponse
    nav_item: ProductNavigation

    @field
    def url(self):
        return self.nav_item.url

    @field
    def categoryName(self) -> str:
        return f"(modified) {self.nav_item.categoryName}"

❌ Then the following callback setup sends 3 separate Zyte API requests:

# {"browserHtml": true, "url": "https://books.toscrape.com"}
# {"productNavigation": true, "url": "https://books.toscrape.com"}
# {"browserHtml": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response: DummyResponse, navigation: ProductNavigation):
    ...

Note that the same series of 3 separate requests still occurs with:

def parse_nav(self, response, navigation: ProductNavigation):
    ...

Gallaecio · 2023-10-03T07:23:22Z

I wonder if some of the unexpected requests are related to #135.

BurnzZ · 2024-01-09T12:19:51Z

Re-opening this since Case 2 is still occurring. Case 3 has been fixed though.

wRAR · 2024-01-11T17:01:54Z

@BurnzZ So, after your latest analysis, do you think Case 2 still happens or not?

BurnzZ · 2024-01-12T05:48:44Z

@wRAR I can still reproduce Case 2. 👍

wRAR · 2024-01-12T12:20:15Z

OK, so the difference between this use case and ones that we already test is having "browserHtml": True in meta. Currently the provider doesn't check this at all. It looks like it should? cc: @kmike

wRAR · 2024-01-12T13:01:57Z

OTOH, even if we handle this in the provider, I'm not sure that the request itself won't still be sent.

kmike · 2024-01-12T15:57:17Z

@wRAR Let's try to focus on how Case 2 (or any of these cases) affects https://github.com/zytedata/zyte-spider-templates, not on the case itself. The priority of supporting meta is not clear to me now; it may not be necessary in the end, or it could be.

Gallaecio self-assigned this Jun 15, 2023

Gallaecio mentioned this issue Jun 15, 2023

Cache provider output per request #94

Merged

Gallaecio mentioned this issue Jun 20, 2023

[WIP] Minimize provider calls scrapinghub/scrapy-poet#153

Closed

Gallaecio added a commit to scrapinghub/scrapy-poet that referenced this issue Jun 20, 2023

Reproduce scrapy-plugins/scrapy-zyte-api#91

96bb926

Gallaecio mentioned this issue Jul 13, 2023

Avoid duplicate requests #105

Merged

wRAR mentioned this issue Nov 8, 2023

Remove ItemProvider’s Response dependency scrapinghub/scrapy-poet#151

Closed

wRAR mentioned this issue Dec 26, 2023

Bump scrapy-poet #156

Merged

wRAR closed this as completed in #156 Jan 5, 2024

BurnzZ reopened this Jan 9, 2024

BurnzZ mentioned this issue Jan 10, 2024

Discussion: Handle data type of response data type in HeuristicsProductNavigationPage (BrowserResponse vs HttpResponse) zytedata/zyte-spider-templates#25

Closed
