Speed regression in ItemLoader since 1.0 #35
Comments
Hi @juraseg - thanks for raising this! I'm looking back over the history to get familiar with how these objects were instantiated before and now. Lots of ground to cover. But it looks like you could deliberately provide a selector instance to the constructor for ItemLoader and sidestep the issue entirely. Is this something you've explored?
As an experiment, I added a line to the ItemLoader constructor that would check |
I believe the regression happened when the lxml document cache was removed as part of scrapy/scrapy#1409. The idea was to use response.selector explicitly; we fixed it for link extractors (https://github.com/scrapy/scrapy/pull/1409/files#diff-952f1c77a1cfacb08eed58f08daa8870R67), but not for item loaders. Item loaders are a bit different because they allow customizing the selector class, which is where we should be careful when implementing a fix. Accessing |
Hi guys. @cathalgarvey Yes, I tried providing a selector instead of the response, and it does the job. So it is really a matter of reusing the selector from the response. If that helps, I did what @kmike suggests to solve the issue in our case: created a custom loader, which checks if there is The class check looks a bit clunky, but I have not found another way to check ( |
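For readers following along, here is a minimal sketch of the idea discussed above: reuse the selector already cached on the response, but only when its class is exactly the one the loader expects. It deliberately does not import Scrapy. `CountingSelector`, `CustomSelector`, `FakeResponse`, `ReusingLoader` and `CustomLoader` are hypothetical stand-ins invented for illustration, and the lazily built `selector` property is an assumption modelled on the behaviour described in this thread, not the library's actual code.

```python
class CountingSelector:
    """Hypothetical stand-in for a selector; counts how often a body is parsed."""

    parse_count = 0

    def __init__(self, response):
        # Building a selector stands in for the expensive parse of the body.
        CountingSelector.parse_count += 1
        self.response = response


class CustomSelector(CountingSelector):
    """Hypothetical customized selector class that a loader might require."""


class FakeResponse:
    """Hypothetical response whose ``selector`` is built lazily and cached."""

    def __init__(self, body):
        self.body = body
        self._cached_selector = None

    @property
    def selector(self):
        if self._cached_selector is None:
            self._cached_selector = CountingSelector(self)
        return self._cached_selector


class ReusingLoader:
    """Hypothetical loader that reuses ``response.selector`` when it can."""

    default_selector_class = CountingSelector

    def __init__(self, response=None, selector=None):
        if selector is None and response is not None:
            cached = getattr(response, "selector", None)
            # Exact class check: a subclass or a different selector class
            # must not be silently replaced by the cached one.
            if type(cached) is self.default_selector_class:
                selector = cached
            else:
                selector = self.default_selector_class(response)
        self.selector = selector


class CustomLoader(ReusingLoader):
    """Hypothetical loader that insists on its own selector class."""

    default_selector_class = CustomSelector


response = FakeResponse("x" * 1_000_000)
loaders = [ReusingLoader(response=response) for _ in range(50)]
custom = CustomLoader(response=response)
```

In this sketch the 50 plain loaders share a single parse, while `CustomLoader` builds its own selector because the classes differ. Note that merely reading `response.selector` builds it here, so on a fresh response a loader with a different selector class would pay for two parses.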
@juraseg I think your approach is fine, but there is an extra tricky part: when you access |
@kmike So the only way to avoid second parsing is to know which class the would be |
Hey guys, I tried implementing the modification to the ItemLoader constructor: when response.selector is available, we use it instead of creating a new Selector. With my sample code, the performance is equal to that of Scrapy 1.0. However, the discussion still seems to be ongoing, so I am wondering if you can tell me more about the potential consequences of doing something like what is shown in this diff: scrapy/scrapy#3157
Hi
TLDR: ItemLoader does not reuse response.selector when you pass a response to it as an argument. It looks like it did reuse it up to Scrapy==1.0.
Recently we were trying to upgrade our setup to use a newer version of Scrapy (we are on 1.0.5).
We noticed a huge slowdown in a lot of our spiders.
I dug down a bit and found that the bottleneck was ItemLoader, in particular when you create many ItemLoaders (one for each Item), pass the response to every one of them, and the response is relatively large.
The slowdown goes back to version 1.1, so the performance degradation is present on Scrapy==1.1 and above.
Test spider code (testfile is just a random file of 1MB size):