This repository has been archived by the owner on Dec 28, 2023. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 10
Retry #18
Open
elacuesta
wants to merge
16
commits into
master
Choose a base branch
from
retry
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Retry #18
Changes from all commits
Commits
Show all changes
16 commits
Select commit
Hold shift + click to select a range
3e12067
Retry: settings and docs
elacuesta c3543a9
Import get_retry_request, update settings format in readme
elacuesta 5adda54
Merge branch 'master' into retry
elacuesta dba0789
Deprecate CRAWLERA_FETCH_RAISE_ON_ERROR, add CRAWLERA_FETCH_ON_ERROR
elacuesta 6e56acf
Paint it black, fix type issue (py35)
elacuesta 8e2b2d8
CRAWLERA_FETCH_SHOULD_RETRY
elacuesta fd6b54d
Fallback implementation of get_retry_request
elacuesta e5c5feb
Exclude _utils from coverage report
elacuesta 1e36a03
Retry tests
elacuesta c25600a
Add missing type hint
elacuesta 022f1fc
Add another missing type hint
elacuesta dcd4efa
py35 compliance
elacuesta 872dcda
Full diff coverage
elacuesta 9b9560e
Comment about excluding upstream code from test coverage reports
elacuesta c77983e
Validate CRAWLERA_FETCH_ON_ERROR setting
elacuesta 8e87867
Merge branch 'master' into retry
elacuesta File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
[run] | ||
omit = crawlera_fetch/_utils.py # already tested in upstream Scrapy |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,2 @@ | ||
from .logformatter import CrawleraFetchLogFormatter # noqa: F401 | ||
from .middleware import CrawleraFetchMiddleware, DownloadSlotPolicy # noqa: F401 | ||
from .middleware import CrawleraFetchMiddleware, DownloadSlotPolicy, OnError # noqa: F401 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
from logging import Logger | ||
from typing import Optional, Union | ||
|
||
from scrapy import Request, Spider | ||
from scrapy.utils.python import global_object_name | ||
|
||
|
||
# disable black formatting to avoid syntax error on py35 | ||
# fmt: off | ||
def _get_retry_request( | ||
request: Request, | ||
*, | ||
spider: Spider, | ||
reason: Union[str, Exception] = "unspecified", | ||
max_retry_times: Optional[int] = None, | ||
priority_adjust: Optional[int] = None, | ||
logger: Logger, | ||
stats_base_key: str # black wants to put a comma at the end, but py35 doesn't like it | ||
) -> Optional[Request]: | ||
# fmt: on | ||
""" | ||
Fallback implementation, taken verbatim from https://github.com/scrapy/scrapy/pull/4902 | ||
""" | ||
settings = spider.crawler.settings | ||
stats = spider.crawler.stats | ||
retry_times = request.meta.get("retry_times", 0) + 1 | ||
if max_retry_times is None: | ||
max_retry_times = request.meta.get("max_retry_times") | ||
if max_retry_times is None: | ||
max_retry_times = settings.getint("RETRY_TIMES") | ||
if retry_times <= max_retry_times: | ||
logger.debug( | ||
"Retrying %(request)s (failed %(retry_times)d times): %(reason)s", | ||
{"request": request, "retry_times": retry_times, "reason": reason}, | ||
extra={"spider": spider}, | ||
) | ||
new_request = request.copy() | ||
new_request.meta["retry_times"] = retry_times | ||
new_request.dont_filter = True | ||
if priority_adjust is None: | ||
priority_adjust = settings.getint("RETRY_PRIORITY_ADJUST") | ||
new_request.priority = request.priority + priority_adjust | ||
|
||
if callable(reason): | ||
reason = reason() | ||
if isinstance(reason, Exception): | ||
reason = global_object_name(reason.__class__) | ||
|
||
stats.inc_value("{}/count".format(stats_base_key)) | ||
stats.inc_value("{}/reason_count/{}".format(stats_base_key, reason)) | ||
return new_request | ||
else: | ||
stats.inc_value("{}/max_reached".format(stats_base_key)) | ||
logger.error( | ||
"Gave up retrying %(request)s (failed %(retry_times)d times): " "%(reason)s", | ||
{"request": request, "retry_times": retry_times, "reason": reason}, | ||
extra={"spider": spider}, | ||
) | ||
return None |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use True as default because most of the cases you want to retry. Every API can fail, uncork can fail, spider should retry.
Requirement of 2.5 might be limiting for some users, we don't support this stack in Scrapy Cloud in Zyte at the moment so this would have to wait for release of stack and would force all uncork users to migrate to 2.5. Is there some way to make it compatible with all scrapy not just 2.5?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @pawelmhm:
CRAWLERA_FETCH_SHOULD_RETRY
receives a callable (or the name of a callable within the spider) to be used to determine if a request should be retried. Perhaps it could be named differently, I'm open to suggestions.CRAWLERA_FETCH_ON_ERROR
is the setting to determine what to do with errors. I madeOnError.Warn
the default, just to keep backward-compatibility, but perhapsOnError.Retry
can be a better default.AFAIK, you should be able to use 2.5 with a previous stack, by updating the requirements file. The 2.5 requirement is because of scrapy/scrapy#4902. I wanted to avoid code duplication but I guess I can just use the upstream function if available and fall back to copying the implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for explanation. Is there any scenario where you don't need retry? Because in my experience it is very rare not to want retry of internal server errors, timeouts, or bans?
I think in some projects people are using old Scrapy versions and they will have to update Scrapy to most recent versions, which will be extra work, extra effort for them. If they are stuck on some old versions like 1.6 or 1.7 updating to 2.5 might not be straigtforward.
But my main point after thinking about this is that why do we actually need custom retry? Why can't we handle this in retry middleware by default? like all other HTTP error codes? There are 2 use cases mentioned by Taras in issue on GH, but I'm not convinced about them and after talking with developer of Fetch API I hear they plan to change behavior to return 500 anbd 503 HTTP status codes instead of 200 HTTP status code with error code in response body.