Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed regression in ItemLoader since 1.0 #35

Open
juraseg opened this issue Feb 15, 2018 · 7 comments
Open

Speed regression in ItemLoader since 1.0 #35

juraseg opened this issue Feb 15, 2018 · 7 comments
Labels
enhancement New feature or request

Comments

@juraseg
Copy link

juraseg commented Feb 15, 2018

Hi

TLDR: ItemLoader does not reuse response.selector when you are passing response to it as argument. And looks like it was reusing it up to Scrapy==1.0

Recently we were trying to upgrade our setup to use newer version of Scrapy (we are on 1.0.5).
We noticed huge slowdown in a lot of our spiders.

I dag down a bit and found that the bottleneck was ItemLoader, in particular when you create many ItemLoaders (one for each Item) passing response to every one of them and the response size is relatively big.

The speed drop down goes back to version 1.1. So on Scrapy==1.1 and above the performance degradation is present.

Test spider code (testfile is just a random file of 1MB size):

# -*- coding: utf-8 -*-
import os.path

import scrapy
from scrapy.loader import ItemLoader
from scrapy.item import Item, Field

HERE = os.path.abspath(os.path.dirname(__file__))


class Product(Item):
    url = Field()
    name = Field()


class TestSpeedSpider(scrapy.Spider):
    name = 'test_speed'

    def start_requests(self):
        file_path = HERE + '/testfile'

        file_url = 'file://' + file_path
        yield scrapy.Request(file_url)

    def parse(self, response):
        for i in xrange(0, 300):
            loader = ItemLoader(item=Product(), response=response)
            loader.add_value('name', 'item {}'.format(i))
            loader.add_value('url', 'http://site.com/item{}'.format(i))

            product = loader.load_item()

            yield product
@ghost
Copy link

ghost commented Feb 16, 2018

Hi @juraseg - thanks for raising this! I'm looking back over the history to get familiar with how these objects were instantiated before, and now. Lots of ground to cover. But, it looks like you could deliberately provide a selector instance to the constructor for ItemLoader, and sidestep the issue entirely. Is this something you've explored?

@ghost
Copy link

ghost commented Feb 16, 2018

As an experiment, I added a line to the ItemLoader constructor that would check response._cached_selector is not None and would use it if present. On Py2.7 and Py3.5 locally, it seems to pass tests well enough. So, this ought to be fixable, but we should probably discuss where is the most appropriate place to re-use selectors in this way. If implemented incautiously, then it might clobber custom selector subclasses silently and unexpectedly.

@kmike
Copy link
Member

kmike commented Feb 16, 2018

I believe regression happened when lxml document cache was removed, as a part of scrapy/scrapy#1409. The idea was to use response.selector explicitly; we fixed it for link extractors (https://github.com/scrapy/scrapy/pull/1409/files#diff-952f1c77a1cfacb08eed58f08daa8870R67), but not for item loaders.

Item loaders are a bit different because they alow to customize selector class, this is where we should be careful implementing a fix.

Accessing response._cached_selector is not None doesn't look clean as a final solution, as we'll be accessing a private attribute. Even if there is no cached selector yet, it makes sense to use response.selector (which caches transparently) - this way if some further code needs a selector, body won't be parsed one more time.

@juraseg
Copy link
Author

juraseg commented Feb 16, 2018

Hi guys

@cathalgarvey Yes, I tried providing selector instead of response and it does the job. So really a matter of reusing selector from response.

If that helps, I did what @kmike suggests for our case to solve the issue - created custom loader, which checks if there is response.selector. To workaround the issue with custom selector this loader checks if response.selector.__class__ is equal to loader's default_selector_class.

The class check looks a bit clunky, but I have not found other way to check (isinstance won't work as it will return True for subclasses as well, which is not what we need here).

@kmike
Copy link
Member

kmike commented Feb 16, 2018

@juraseg I think your approach is fine, but there is an extra tricky part: when you access response.selector.__class__, response body could be parsed in order to fill the "selector" attribute for the first time. If a class is not the same as ItemLoader's default_selector_class, response is parsed again. The second one is inevitable, but parsing response with a wrong selector just to get its type looks unnecessary.

@juraseg
Copy link
Author

juraseg commented Feb 17, 2018

@kmike
Ah, I see what you mean.

So the only way to avoid second parsing is to know which class the would be response.selector is and compare to it instead of accessing response.selector. For example response object could have selector_class read-only property.

@malloxpb
Copy link
Member

malloxpb commented Mar 8, 2018

Hey guys, I tried implementing the modification to ItemLoader constructor. So when there is response.selector available, we will use that instead of creating a new Selector. The result for my sample code has shown that the performance is equal to that of Scrapy 1.0. However, it seems like the discussion is still continuing so I am just wondering if you guys can tell me more about the potential outcome if I do something shown on this diff scrapy/scrapy#3157

@Gallaecio Gallaecio transferred this issue from scrapy/scrapy Oct 30, 2020
@Gallaecio Gallaecio added the enhancement New feature or request label Feb 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants