This repository has been archived by the owner on Dec 28, 2023. It is now read-only.

Retry #18

Open
wants to merge 16 commits into
base: master
2 changes: 2 additions & 0 deletions .coveragerc
@@ -0,0 +1,2 @@
[run]
omit = crawlera_fetch/_utils.py # already tested in upstream Scrapy
23 changes: 23 additions & 0 deletions README.md
Expand Up @@ -57,8 +57,18 @@ Crawlera middleware won't be able to handle them.

The endpoint of a specific Crawlera instance

* `CRAWLERA_FETCH_ON_ERROR` (type `enum.Enum` - `crawlera_fetch.OnError`,
default `OnError.Raise`)

What to do if an error occurs while downloading or decoding a response. Possible values are:
* `OnError.Raise` (raise a `crawlera_fetch.CrawleraFetchException` exception)
* `OnError.Warn` (log a warning and return the raw upstream response)
* `OnError.Retry` (retry the failed request, up to `CRAWLERA_FETCH_RETRY_TIMES` times)

* `CRAWLERA_FETCH_RAISE_ON_ERROR` (type `bool`, default `True`)

**_Deprecated, please use `CRAWLERA_FETCH_ON_ERROR`_**

Whether or not the middleware will raise an exception if an error occurs while downloading
or decoding a response. If `False`, a warning will be logged and the raw upstream response
will be returned upon encountering an error.
Expand All @@ -76,6 +86,19 @@ Crawlera middleware won't be able to handle them.
Default values to be sent to the Crawlera Fetch API. For instance, set to `{"device": "mobile"}`
to render all requests with a mobile profile.

* `CRAWLERA_FETCH_SHOULD_RETRY` (type `Optional[Callable, str]`, default `None`)

I would use True as the default, because in most cases you want to retry. Every API can fail, uncork can fail, and the spider should retry.

The Scrapy 2.5 requirement might be limiting for some users. We don't support this stack in Scrapy Cloud at Zyte at the moment, so this would have to wait for the stack release and would force all uncork users to migrate to 2.5. Is there some way to make it compatible with all Scrapy versions, not just 2.5?

Member Author

Hey @pawelmhm:

> I would use True as the default, because in most cases you want to retry. Every API can fail, uncork can fail, and the spider should retry.

CRAWLERA_FETCH_SHOULD_RETRY receives a callable (or the name of a callable within the spider) to be used to determine if a request should be retried. Perhaps it could be named differently, I'm open to suggestions. CRAWLERA_FETCH_ON_ERROR is the setting to determine what to do with errors. I made OnError.Warn the default, just to keep backward-compatibility, but perhaps OnError.Retry can be a better default.

> The Scrapy 2.5 requirement might be limiting for some users. We don't support this stack in Scrapy Cloud at Zyte at the moment, so this would have to wait for the stack release and would force all uncork users to migrate to 2.5. Is there some way to make it compatible with all Scrapy versions, not just 2.5?

AFAIK, you should be able to use 2.5 with a previous stack, by updating the requirements file. The 2.5 requirement is because of scrapy/scrapy#4902. I wanted to avoid code duplication but I guess I can just use the upstream function if available and fall back to copying the implementation.

@pawelmhm pawelmhm Apr 12, 2021

Thanks for the explanation. Is there any scenario where you don't need a retry? In my experience it is very rare not to want to retry internal server errors, timeouts, or bans.

> AFAIK, you should be able to use 2.5 with a previous stack, by updating the requirements file. The 2.5 requirement is because of

I think in some projects people are using old Scrapy versions, and they would have to update to the most recent version, which is extra work for them. If they are stuck on an old version like 1.6 or 1.7, updating to 2.5 might not be straightforward.

But my main point, after thinking about this, is: why do we actually need a custom retry? Why can't we handle this in the retry middleware by default, like all other HTTP error codes? There are 2 use cases mentioned by Taras in the issue on GH, but I'm not convinced by them, and after talking with the developer of the Fetch API I hear they plan to change the behavior to return 500 and 503 HTTP status codes instead of a 200 HTTP status code with an error code in the response body.


A boolean callable that determines whether a request should be retried by the middleware.
If the setting value is a `str`, an attribute by that name will be looked up on the spider
object doing the crawl. The callable should accept the following arguments:
`response: scrapy.http.response.Response, request: scrapy.http.request.Request, spider: scrapy.spiders.Spider`.
If the return value evaluates to `True`, the request will be retried by the middleware.

* `CRAWLERA_FETCH_RETRY_TIMES` (type `Optional[int]`, default `None`)

The maximum number of times a request should be retried.
If `None`, the value is taken from the `RETRY_TIMES` setting.
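
As a rough illustration, the three settings above might be combined as in the sketch below. The callable name `should_retry_on_server_error`, the `settings.py` layout and the `SimpleNamespace` stand-in for a response are assumptions made for the example; the fallback `OnError` class only mirrors the enum added in this diff so that the snippet runs without the package installed.

```python
from enum import Enum
from types import SimpleNamespace

try:
    # Exported from the package root by this PR.
    from crawlera_fetch import OnError
except ImportError:
    # Stand-in mirroring the enum in the diff, so the sketch runs without the package.
    class OnError(Enum):
        Warn = "warn"
        Raise = "raise"
        Retry = "retry"


def should_retry_on_server_error(response, request, spider):
    """Hypothetical callable: ask for a retry when the unwrapped response is a 5xx."""
    return 500 <= response.status < 600


# Values as they might appear in a project's settings.py (assumed layout).
CRAWLERA_FETCH_ON_ERROR = OnError.Retry
# Either the callable itself or the name of a spider attribute, e.g. "should_retry".
CRAWLERA_FETCH_SHOULD_RETRY = should_retry_on_server_error
# Leave unset (None) to fall back to RETRY_TIMES.
CRAWLERA_FETCH_RETRY_TIMES = 3

# The middleware passes keyword arguments, so the parameter names matter.
fake_response = SimpleNamespace(status=503)  # stand-in for scrapy.http.Response
print(should_retry_on_server_error(response=fake_response, request=None, spider=None))
```

Because the middleware invokes the callable with `response=`, `request=` and `spider=` keywords, a callable with differently named parameters would fail at call time.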

### Spider attributes

* `crawlera_fetch_enabled` (type `bool`, default `False`)
2 changes: 1 addition & 1 deletion crawlera_fetch/__init__.py
@@ -1,2 +1,2 @@
from .logformatter import CrawleraFetchLogFormatter # noqa: F401
from .middleware import CrawleraFetchMiddleware, DownloadSlotPolicy # noqa: F401
from .middleware import CrawleraFetchMiddleware, DownloadSlotPolicy, OnError # noqa: F401
59 changes: 59 additions & 0 deletions crawlera_fetch/_utils.py
@@ -0,0 +1,59 @@
from logging import Logger
from typing import Optional, Union

from scrapy import Request, Spider
from scrapy.utils.python import global_object_name


# disable black formatting to avoid syntax error on py35
# fmt: off
def _get_retry_request(
    request: Request,
    *,
    spider: Spider,
    reason: Union[str, Exception] = "unspecified",
    max_retry_times: Optional[int] = None,
    priority_adjust: Optional[int] = None,
    logger: Logger,
    stats_base_key: str  # black wants to put a comma at the end, but py35 doesn't like it
) -> Optional[Request]:
# fmt: on
    """
    Fallback implementation, taken verbatim from https://github.com/scrapy/scrapy/pull/4902
    """
    settings = spider.crawler.settings
    stats = spider.crawler.stats
    retry_times = request.meta.get("retry_times", 0) + 1
    if max_retry_times is None:
        max_retry_times = request.meta.get("max_retry_times")
        if max_retry_times is None:
            max_retry_times = settings.getint("RETRY_TIMES")
    if retry_times <= max_retry_times:
        logger.debug(
            "Retrying %(request)s (failed %(retry_times)d times): %(reason)s",
            {"request": request, "retry_times": retry_times, "reason": reason},
            extra={"spider": spider},
        )
        new_request = request.copy()
        new_request.meta["retry_times"] = retry_times
        new_request.dont_filter = True
        if priority_adjust is None:
            priority_adjust = settings.getint("RETRY_PRIORITY_ADJUST")
        new_request.priority = request.priority + priority_adjust

        if callable(reason):
            reason = reason()
        if isinstance(reason, Exception):
            reason = global_object_name(reason.__class__)

        stats.inc_value("{}/count".format(stats_base_key))
        stats.inc_value("{}/reason_count/{}".format(stats_base_key, reason))
        return new_request
    else:
        stats.inc_value("{}/max_reached".format(stats_base_key))
        logger.error(
            "Gave up retrying %(request)s (failed %(retry_times)d times): " "%(reason)s",
            {"request": request, "retry_times": retry_times, "reason": reason},
            extra={"spider": spider},
        )
        return None
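
The retry budget in the fallback above is resolved in three steps: an explicit `max_retry_times` argument, then `request.meta["max_retry_times"]`, then the `RETRY_TIMES` setting. Below is a simplified sketch of just that decision, using plain dicts in place of Scrapy's request meta and settings objects. The function names `resolve_max_retries` and `can_retry` are made up for illustration and do not exist in the PR or in Scrapy.

```python
from typing import Optional


def resolve_max_retries(explicit: Optional[int], meta: dict, settings: dict) -> int:
    """Pick the retry limit: explicit argument, then request meta, then settings."""
    if explicit is not None:
        return explicit
    from_meta = meta.get("max_retry_times")
    if from_meta is not None:
        return from_meta
    # Assumption: a missing setting counts as 0 retries in this sketch.
    return int(settings.get("RETRY_TIMES", 0))


def can_retry(meta: dict, max_retries: int) -> bool:
    """A request may be retried while its next attempt number stays within the limit."""
    attempt = meta.get("retry_times", 0) + 1
    return attempt <= max_retries


settings = {"RETRY_TIMES": 2}
print(resolve_max_retries(None, {}, settings))                      # 2
print(resolve_max_retries(None, {"max_retry_times": 5}, settings))  # 5
print(resolve_max_retries(1, {"max_retry_times": 5}, settings))     # 1
print(can_retry({"retry_times": 2}, 2))                             # False
```

In the middleware, `self.retry_times` is always passed as the explicit argument, so a per-request `max_retry_times` in `request.meta` would not be consulted on that path.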
144 changes: 126 additions & 18 deletions crawlera_fetch/middleware.py
Expand Up @@ -4,11 +4,13 @@
import logging
import os
import time
import warnings
from enum import Enum
from typing import Optional, Type, TypeVar
from typing import Callable, Optional, Type, TypeVar, Union

import scrapy
from scrapy.crawler import Crawler
from scrapy.exceptions import ScrapyDeprecationWarning
from scrapy.http.request import Request
from scrapy.http.response import Response
from scrapy.responsetypes import responsetypes
Expand All @@ -18,13 +20,21 @@
from scrapy.utils.reqser import request_from_dict, request_to_dict
from w3lib.http import basic_auth_header

try:
    from scrapy.downloadermiddlewares.retry import get_retry_request  # available on Scrapy >= 2.5
except ImportError:
    from crawlera_fetch._utils import _get_retry_request as get_retry_request

logger = logging.getLogger("crawlera-fetch-middleware")

__all__ = [
    "CrawleraFetchException",
    "CrawleraFetchMiddleware",
    "DownloadSlotPolicy",
    "OnError",
]

logger = logging.getLogger("crawlera-fetch-middleware")
MiddlewareTypeVar = TypeVar("MiddlewareTypeVar", bound="CrawleraFetchMiddleware")


META_KEY = "crawlera_fetch"


Expand All @@ -34,6 +44,12 @@ class DownloadSlotPolicy(Enum):
    Default = "default"


class OnError(Enum):
    Warn = "warn"
    Raise = "raise"
    Retry = "retry"


class CrawleraFetchException(Exception):
    pass

Expand Down Expand Up @@ -74,12 +90,57 @@ def _read_settings(self, spider: Spider) -> None:
        self.download_slot_policy = settings.get(
            "CRAWLERA_FETCH_DOWNLOAD_SLOT_POLICY", DownloadSlotPolicy.Domain
        )

        self.raise_on_error = settings.getbool("CRAWLERA_FETCH_RAISE_ON_ERROR", True)

        self.default_args = settings.getdict("CRAWLERA_FETCH_DEFAULT_ARGS", {})

    def spider_opened(self, spider):
        # what to do when errors happen?
        self.on_error_action = None  # type: Optional[OnError]
        if "CRAWLERA_FETCH_RAISE_ON_ERROR" in settings:
            warnings.warn(
                "CRAWLERA_FETCH_RAISE_ON_ERROR is deprecated, "
                "please use CRAWLERA_FETCH_ON_ERROR instead",
                category=ScrapyDeprecationWarning,
                stacklevel=2,
            )
            if settings.getbool("CRAWLERA_FETCH_RAISE_ON_ERROR"):
                self.on_error_action = OnError.Raise
            else:
                self.on_error_action = OnError.Warn
        if "CRAWLERA_FETCH_ON_ERROR" in settings:
            if isinstance(settings["CRAWLERA_FETCH_ON_ERROR"], OnError):
                self.on_error_action = settings["CRAWLERA_FETCH_ON_ERROR"]
            else:
                logger.warning(
                    "Invalid type for CRAWLERA_FETCH_ON_ERROR setting:"
                    " expected crawlera_fetch.OnError, got %s",
                    type(settings["CRAWLERA_FETCH_ON_ERROR"]),
                )
        if self.on_error_action is None:
            self.on_error_action = OnError.Raise

        # should we retry?
        self.should_retry = settings.get("CRAWLERA_FETCH_SHOULD_RETRY")
        if self.should_retry is not None:
            if isinstance(self.should_retry, str):
                try:
                    self.should_retry = getattr(spider, self.should_retry)
                except AttributeError:
                    logger.warning(
                        "Could not find a '%s' callable on the spider - user retries are disabled",
                        self.should_retry,
                    )
                    self.should_retry = None
            elif not isinstance(self.should_retry, Callable):  # type: ignore[arg-type]
                logger.warning(
                    "Invalid type for retry function: expected Callable"
                    " or str, got %s - user retries are disabled",
                    type(self.should_retry),
                )
                self.should_retry = None
        self.retry_times = settings.getint("CRAWLERA_FETCH_RETRY_TIMES")
        if not self.retry_times:
            self.retry_times = settings.getint("RETRY_TIMES")

    def spider_opened(self, spider: Spider) -> None:
        try:
            spider_attr = getattr(spider, "crawlera_fetch_enabled")
        except AttributeError:
Expand Down Expand Up @@ -163,6 +224,21 @@ def process_request(self, request: Request, spider: Spider) -> Optional[Request]
        request.meta[META_KEY] = crawlera_meta
        return request.replace(url=self.url, method="POST", body=body_json)

    def _get_retry_request(
        self,
        request: Request,
        reason: Union[Exception, str],
        stats_base_key: str,
    ) -> Optional[Request]:
        return get_retry_request(
            request=request,
            reason=reason,
            stats_base_key=stats_base_key,
            spider=self.crawler.spider,
            max_retry_times=self.retry_times,
            logger=logger,
        )

    def process_response(self, request: Request, response: Response, spider: Spider) -> Response:
        if not self.enabled:
            return response
Expand Down Expand Up @@ -193,11 +269,19 @@ def process_response(self, request: Request, response: Response, spider: Spider)
                response.status,
                message,
            )
            if self.raise_on_error:
            if self.on_error_action == OnError.Raise:
                raise CrawleraFetchException(log_msg)
            else:
            elif self.on_error_action == OnError.Warn:
                logger.warning(log_msg)
                return response
            elif self.on_error_action == OnError.Retry:
                return self._get_retry_request(
                    request=request,
                    reason=message,
                    stats_base_key="crawlera_fetch/retry/error",
                )
            else:
                raise Exception("Invalid CRAWLERA_FETCH_ON_ERROR setting")

        try:
            json_response = json.loads(response.text)
Expand All @@ -213,14 +297,24 @@ def process_response(self, request: Request, response: Response, spider: Spider)
                exc.lineno,
                exc.colno,
            )
            if self.raise_on_error:
            if self.on_error_action == OnError.Raise:
                raise CrawleraFetchException(log_msg) from exc
            else:
            elif self.on_error_action == OnError.Warn:
                logger.warning(log_msg)
                return response
            elif self.on_error_action == OnError.Retry:
                return self._get_retry_request(
                    request=request,
                    reason=exc,
                    stats_base_key="crawlera_fetch/retry/error",
                )
            else:
                raise Exception("Invalid CRAWLERA_FETCH_ON_ERROR setting")

        server_error = json_response.get("crawlera_error") or json_response.get("error_code")
        original_status = json_response.get("original_status")
        self.stats.inc_value("crawlera_fetch/response_status_count/{}".format(original_status))

        server_error = json_response.get("crawlera_error") or json_response.get("error_code")
        request_id = json_response.get("id") or json_response.get("uncork_id")
        if server_error:
            message = json_response.get("body") or json_response.get("message")
Expand All @@ -237,13 +331,19 @@ def process_response(self, request: Request, response: Response, spider: Spider)
                message,
                request_id or "unknown",
            )
            if self.raise_on_error:
            if self.on_error_action == OnError.Raise:
                raise CrawleraFetchException(log_msg)
            else:
            elif self.on_error_action == OnError.Warn:
                logger.warning(log_msg)
                return response

        self.stats.inc_value("crawlera_fetch/response_status_count/{}".format(original_status))
            elif self.on_error_action == OnError.Retry:
                return self._get_retry_request(
                    request=request,
                    reason=server_error,
                    stats_base_key="crawlera_fetch/retry/error",
                )
            else:
                raise Exception("Invalid CRAWLERA_FETCH_ON_ERROR setting")

        crawlera_meta["upstream_response"] = {
            "status": response.status,
Expand All @@ -260,14 +360,22 @@ def process_response(self, request: Request, response: Response, spider: Spider)
            url=json_response["url"],
            body=resp_body,
        )
        return response.replace(
        response = response.replace(
            cls=respcls,
            request=original_request,
            headers=json_response["headers"],
            url=json_response["url"],
            body=resp_body,
            status=original_status or 200,
        )
        if self.should_retry is not None:
            if self.should_retry(response=response, request=request, spider=spider):
                return self._get_retry_request(
                    request=request,
                    reason="should-retry",
                    stats_base_key="crawlera_fetch/retry/should-retry",
                )
        return response

    def _set_download_slot(self, request: Request, spider: Spider) -> None:
        if self.download_slot_policy == DownloadSlotPolicy.Domain:
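
The precedence between the deprecated `CRAWLERA_FETCH_RAISE_ON_ERROR` and the new `CRAWLERA_FETCH_ON_ERROR` in `_read_settings` can be hard to follow in diff form. The sketch below restates it against a plain dict; it is a simplification, not the middleware's code. It assumes the deprecated value is already a boolean (the diff uses `settings.getbool`), uses the built-in `DeprecationWarning` instead of `ScrapyDeprecationWarning`, and silently ignores an invalid `CRAWLERA_FETCH_ON_ERROR` value where the diff logs a warning. `resolve_on_error` is a hypothetical helper name.

```python
import warnings
from enum import Enum


class OnError(Enum):
    # Mirrors the enum added in this diff.
    Warn = "warn"
    Raise = "raise"
    Retry = "retry"


def resolve_on_error(settings: dict) -> OnError:
    """Return the effective error action for a plain dict of settings."""
    action = None
    if "CRAWLERA_FETCH_RAISE_ON_ERROR" in settings:
        warnings.warn(
            "CRAWLERA_FETCH_RAISE_ON_ERROR is deprecated, "
            "please use CRAWLERA_FETCH_ON_ERROR instead",
            DeprecationWarning,
        )
        # Deprecated boolean maps onto the two pre-existing behaviours.
        action = OnError.Raise if settings["CRAWLERA_FETCH_RAISE_ON_ERROR"] else OnError.Warn
    new_value = settings.get("CRAWLERA_FETCH_ON_ERROR")
    if isinstance(new_value, OnError):
        # A valid new-style setting wins over the deprecated one.
        action = new_value
    # With neither setting present, fall back to raising.
    return action if action is not None else OnError.Raise
```

With neither setting present the result is `OnError.Raise`, which matches the default documented in the README hunk; a string such as `"retry"` is not accepted in place of the enum member.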