- Python 3.7+
- Scrapy 2.6+
pip install scrapy-zyte-api
This package requires Python 3.7+.
Replace the default http and https handlers in Scrapy's DOWNLOAD_HANDLERS setting, in the settings.py of your Scrapy project.
You also need to set the ZYTE_API_KEY setting.
Lastly, make sure to install the asyncio-based Twisted reactor in the settings.py file as well.
Here's an example of the settings needed in a Scrapy project's settings.py file:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_zyte_api.handler.ScrapyZyteAPIDownloadHandler",
    "https": "scrapy_zyte_api.handler.ScrapyZyteAPIDownloadHandler",
}
# Alternatively, set the ZYTE_API_KEY environment variable instead.
ZYTE_API_KEY = "<your API key>"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
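As a quick sanity check, the three requirements above can be verified programmatically. The helper below is a hypothetical sketch (missing_zyte_settings is not part of scrapy-zyte-api). It takes a plain dict of settings and reports which pieces are absent, assuming the API key may come from either the setting or the ZYTE_API_KEY environment variable.

```python
import os

HANDLER = "scrapy_zyte_api.handler.ScrapyZyteAPIDownloadHandler"
REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"


def missing_zyte_settings(settings, environ=None):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper for illustration, not part of scrapy-zyte-api.
    """
    environ = os.environ if environ is None else environ
    problems = []
    handlers = settings.get("DOWNLOAD_HANDLERS", {})
    for scheme in ("http", "https"):
        if handlers.get(scheme) != HANDLER:
            problems.append(
                f"DOWNLOAD_HANDLERS[{scheme!r}] is not the Zyte API handler"
            )
    # The key may live in the settings or in the environment.
    if not (settings.get("ZYTE_API_KEY") or environ.get("ZYTE_API_KEY")):
        problems.append("ZYTE_API_KEY is not set")
    if settings.get("TWISTED_REACTOR") != REACTOR:
        problems.append("TWISTED_REACTOR is not the asyncio reactor")
    return problems
```

An empty list means all three pieces are in place; anything else names what still needs to be configured.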
To send a scrapy.Request through Zyte Data API, the zyte_api key in Request.meta must be present and contain a dict with Zyte API parameters:
import scrapy


class SampleQuotesSpider(scrapy.Spider):
    name = "sample_quotes"

    def start_requests(self):
        yield scrapy.Request(
            url="http://quotes.toscrape.com/",
            callback=self.parse,
            meta={
                "zyte_api": {
                    "browserHtml": True,
                }
            },
        )

    def parse(self, response):
        yield {"URL": response.url, "HTML": response.body}

        print(response.raw_api_response)
        # {
        #     'url': 'https://quotes.toscrape.com/',
        #     'browserHtml': '<html> ... </html>',
        # }
You can see the full list of parameters in the Zyte Data API Specification.
The url parameter is filled automatically from request.url; all other parameters must be set explicitly.
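To make the rule concrete, here is a minimal sketch of how the parameters sent to the API could be assembled from a request. build_api_params is a hypothetical helper, not the plugin's actual code, and it assumes that the request URL wins over any url value placed in the meta dict.

```python
def build_api_params(request_url, zyte_api_meta):
    """Combine the request URL with explicitly set parameters.

    Hypothetical helper for illustration only.
    """
    params = dict(zyte_api_meta)  # copy, so the caller's dict is untouched
    params["url"] = request_url   # assumption: the request URL takes precedence
    return params


print(build_api_params("http://quotes.toscrape.com/", {"browserHtml": True}))
# {'browserHtml': True, 'url': 'http://quotes.toscrape.com/'}
```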
The raw Zyte Data API response can be accessed via the raw_api_response attribute of the response object.
When you use the Zyte Data API parameters browserHtml, httpResponseBody, or httpResponseHeaders, the response body and headers are set accordingly.
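If you work with raw_api_response directly instead, you have to pick the body out of the dict yourself. The sketch below shows one way to do that. It assumes, following the Zyte API documentation, that httpResponseBody is Base64-encoded and browserHtml is plain text. body_from_raw_response is a hypothetical helper, and checking browserHtml first is an arbitrary choice made for this sketch.

```python
from base64 import b64decode, b64encode


def body_from_raw_response(raw_api_response):
    """Pick the page body (as bytes) out of a raw API response dict.

    Assumption: httpResponseBody is Base64-encoded bytes and browserHtml
    is plain text.
    """
    if "browserHtml" in raw_api_response:
        return raw_api_response["browserHtml"].encode("utf-8")
    if "httpResponseBody" in raw_api_response:
        return b64decode(raw_api_response["httpResponseBody"])
    return b""


# Made-up sample data standing in for a real API response.
sample = {
    "url": "https://quotes.toscrape.com/",
    "httpResponseBody": b64encode(b"<html>hello</html>").decode("ascii"),
}
print(body_from_raw_response(sample))  # b'<html>hello</html>'
```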
Note that, for Zyte Data API requests, the spider receives responses of type ZyteAPIResponse or ZyteAPITextResponse, which are subclasses of scrapy.http.Response and scrapy.http.TextResponse respectively.
If multiple requests target the same URL with different Zyte Data API parameters, pass dont_filter=True to Request so that Scrapy's duplicate filter does not drop the later requests.
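The reason is easier to see with a toy model. The sketch below is a simplified stand-in for a duplicate filter that looks only at the URL and ignores meta. It is not Scrapy's real implementation, and the dict-based requests are made up for illustration.

```python
def filter_duplicates(requests):
    """Toy duplicate filter keyed on URL only.

    Each request is a dict with "url", "meta" and an optional
    "dont_filter" flag. Simplified model, not Scrapy's implementation.
    """
    seen = set()
    kept = []
    for request in requests:
        if request.get("dont_filter"):
            kept.append(request)  # bypasses the filter entirely
            continue
        if request["url"] in seen:
            continue  # dropped: URL already seen, meta is ignored
        seen.add(request["url"])
        kept.append(request)
    return kept


url = "http://quotes.toscrape.com/"
html = {"url": url, "meta": {"zyte_api": {"browserHtml": True}}}
body = {"url": url, "meta": {"zyte_api": {"httpResponseBody": True}}}

print(len(filter_duplicates([html, body])))  # 1
print(len(filter_duplicates([html, {**body, "dont_filter": True}])))  # 2
```

Without the flag, the second request is silently discarded even though it asks the API for different data.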
Often the same configuration needs to be used for all Zyte API requests.
For example, all requests may need to set the same geolocation, or the spider may only use browserHtml requests.
To set the default parameters for Zyte API enabled requests, add the following to the settings.py file, or set it through any other Scrapy settings mechanism:
ZYTE_API_DEFAULT_PARAMS = {
    "browserHtml": True,
    "geolocation": "US",
}
ZYTE_API_DEFAULT_PARAMS only applies when the zyte_api key in Request.meta is set; defining ZYTE_API_DEFAULT_PARAMS does not by itself make all requests go through Zyte Data API. Parameters in ZYTE_API_DEFAULT_PARAMS are merged with parameters set via the zyte_api meta key, with the values in meta taking priority.
import scrapy


class SampleQuotesSpider(scrapy.Spider):
    name = "sample_quotes"
    custom_settings = {
        "ZYTE_API_DEFAULT_PARAMS": {
            "geolocation": "US",  # You can set any Geolocation region you want.
        }
    }

    def start_requests(self):
        yield scrapy.Request(
            url="http://quotes.toscrape.com/",
            callback=self.parse,
            meta={
                "zyte_api": {
                    "browserHtml": True,
                    "javascript": True,
                    "echoData": {"some_value_I_could_track": 123},
                }
            },
        )

    def parse(self, response):
        yield {"URL": response.url, "HTML": response.body}

        print(response.raw_api_response)
        # {
        #     'url': 'https://quotes.toscrape.com/',
        #     'browserHtml': '<html> ... </html>',
        #     'echoData': {'some_value_I_could_track': 123},
        # }

        print(response.request.meta)
        # {
        #     'zyte_api': {
        #         'browserHtml': True,
        #         'geolocation': 'US',
        #         'javascript': True,
        #         'echoData': {'some_value_I_could_track': 123}
        #     },
        #     'download_timeout': 180.0,
        #     'download_slot': 'quotes.toscrape.com'
        # }
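The merge rule in the example above can be sketched in a few lines of plain Python. This assumes a shallow, key-by-key merge. merge_params is a hypothetical name, not a function of the plugin, and the "GB" value is only there to show an override.

```python
def merge_params(default_params, meta_params):
    """Merge defaults with per-request parameters; per-request values win."""
    return {**default_params, **meta_params}


defaults = {"geolocation": "US"}
per_request = {
    "browserHtml": True,
    "javascript": True,
    "echoData": {"some_value_I_could_track": 123},
}
merged = merge_params(defaults, per_request)
print(merged["geolocation"])  # US

# A per-request value overrides the default with the same key.
print(merge_params(defaults, {"geolocation": "GB"}))  # {'geolocation': 'GB'}
```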
There is a shortcut for requests that use exactly the parameters defined in the ZYTE_API_DEFAULT_PARAMS setting, without any further customization: set the zyte_api meta key to True or {}:
import scrapy


class SampleQuotesSpider(scrapy.Spider):
    name = "sample_quotes"
    custom_settings = {
        "ZYTE_API_DEFAULT_PARAMS": {
            "browserHtml": True,
        }
    }

    def start_requests(self):
        yield scrapy.Request(
            url="http://quotes.toscrape.com/",
            callback=self.parse,
            meta={"zyte_api": True},
        )

    def parse(self, response):
        yield {"URL": response.url, "HTML": response.body}

        print(response.raw_api_response)
        # {
        #     'url': 'https://quotes.toscrape.com/',
        #     'browserHtml': '<html> ... </html>',
        # }

        print(response.request.meta)
        # {
        #     'zyte_api': {
        #         'browserHtml': True,
        #     },
        #     'download_timeout': 180.0,
        #     'download_slot': 'quotes.toscrape.com'
        # }
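Putting the cases together, the way the zyte_api meta key decides what is sent can be modeled roughly as follows. resolve_params is a hypothetical helper written for this illustration. It covers only the cases described in the text (key absent, True, {}, or a dict of parameters) and says nothing about how the plugin treats other values.

```python
_ABSENT = object()  # sentinel: distinguishes "key missing" from falsy values


def resolve_params(meta, default_params):
    """Return the effective Zyte API parameters, or None if the request
    should not go through the API at all.

    Hypothetical helper; only the cases described in the text are handled.
    """
    value = meta.get("zyte_api", _ABSENT)
    if value is _ABSENT:
        return None  # key not set: defaults alone do not opt the request in
    if value is True:
        value = {}  # shortcut: use the defaults unchanged
    return {**default_params, **value}


defaults = {"browserHtml": True}
print(resolve_params({}, defaults))                  # None
print(resolve_params({"zyte_api": True}, defaults))  # {'browserHtml': True}
print(resolve_params({"zyte_api": {}}, defaults))    # {'browserHtml': True}
```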
API requests are retried automatically using the default retry policy of python-zyte-api.
API requests that exceed retries are dropped. You cannot manage API request retries through Scrapy downloader middlewares.
Use the ZYTE_API_RETRY_POLICY setting or the zyte_api_retry_policy request meta key to override the default python-zyte-api retry policy with a custom retry policy.
A custom retry policy must be an instance of tenacity.AsyncRetrying.
For example, to also retry HTTP 521 errors the same as HTTP 520 errors, you can subclass RetryFactory as follows:
# settings.py
from tenacity import RetryCallState, retry_if_exception
from zyte_api.aio.errors import RequestError
from zyte_api.aio.retry import RetryFactory


def is_http_521(exc: BaseException) -> bool:
    return isinstance(exc, RequestError) and exc.status == 521


class CustomRetryFactory(RetryFactory):
    # Retry on everything the default policy retries, plus HTTP 521.
    retry_condition = (
        RetryFactory.retry_condition
        | retry_if_exception(is_http_521)
    )

    def wait(self, retry_state: RetryCallState) -> float:
        if is_http_521(retry_state.outcome.exception()):
            return self.temporary_download_error_wait(retry_state=retry_state)
        return super().wait(retry_state)

    def stop(self, retry_state: RetryCallState) -> bool:
        if is_http_521(retry_state.outcome.exception()):
            return self.temporary_download_error_stop(retry_state)
        return super().stop(retry_state)


ZYTE_API_RETRY_POLICY = CustomRetryFactory().build()