stats telemetry stops collecting after it encounters a server error #1771

Open

inqueue opened this issue Aug 25, 2023 · 1 comment

Labels: bug (Something's wrong), :Telemetry (Telemetry Devices that gather additional metrics)

Comments

inqueue (Member) commented Aug 25, 2023

Rally version (get with esrally --version): esrally 2.9.0.dev0 (git revision: 50ebcb68d9f09de545a1bfb217fc9840b97a367e)

esrally race --pipeline=benchmark-only --track-repository="default" --track="nyc_taxis" --challenge="autoscale" --telemetry='["node-stats", "shard-stats", "blob-store-stats"]' --on-error="continue" --target-hosts=target-hosts.json --client-options=client-options.json --track-params=track-params.json --telemetry-params=telemetry-params.json --user-tags=user-tags.json --race-id=c5420fb2-d073-4a6f-a54a-f98244e9b74b --load-driver-hosts=127.0.0.1

Description of the problem including expected versus actual behavior:

Rally stops retrying stats telemetry collection once it has failed too many times.

  • At the time of the last stats collection attempt, the benchmark showed a steady and prolonged increase in average bulk indexing latency.
  • Rally recorded 0 bulk indexing failures, though indexing throughput dropped significantly.
  • Subsequent manual stats calls to the cluster were successful.

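For context, here is a minimal sketch of the behavior I would expect from a periodic stats sampler: treat a transient server error such as the 503 below as recoverable and only give up after many consecutive failures. This is illustrative only, not Rally's actual telemetry code; the function name, threshold, and metrics-store hand-off are assumptions.

# Illustrative sketch only -- not Rally's actual sampler implementation.
import logging
import time

from elasticsearch import ApiError, Elasticsearch

def sample_node_stats(es: Elasticsearch, interval: float = 1.0, max_consecutive_errors: int = 10) -> None:
    consecutive_errors = 0
    while True:
        try:
            stats = es.nodes.stats()  # same family of client calls as in the traceback below
            consecutive_errors = 0    # a successful call resets the error budget
            # ... hand `stats` off to the metrics store here (omitted) ...
        except ApiError as e:
            consecutive_errors += 1
            logging.warning("node stats call failed (%s), %d/%d consecutive errors",
                            e, consecutive_errors, max_consecutive_errors)
            if consecutive_errors >= max_consecutive_errors:
                logging.error("giving up on node stats after %d consecutive errors", max_consecutive_errors)
                return
        time.sleep(interval)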

Provide logs (if relevant):

2023-08-25 16:30:33,699 ActorAddr-(T|:45481)/PID:7942 esrally.telemetry ERROR Could not determine master node stats
Traceback (most recent call last):
  File "~/rally/esrally/telemetry.py", line 172, in run
    self.recorder.record()
  File "~/rally/esrally/telemetry.py", line 2249, in record
    info = self.client.nodes.info(node_id=state["master_node"], metric="os")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.11/site-packages/elasticsearch/_sync/client/utils.py", line 414, in wrapped
    return api(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.11/site-packages/elasticsearch/_sync/client/nodes.py", line 249, in info
    return self.perform_request(  # type: ignore[return-value]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.local/lib/python3.11/site-packages/elasticsearch/_sync/client/_base.py", line 390, in perform_request
    return self._client.perform_request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/rally/esrally/client/synchronous.py", line 226, in perform_request
    raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(message=message, meta=meta, body=resp_body)
elasticsearch.ApiError: ApiError(503, "{'ok': False, 'message': 'The requested resource is currently unavailable.'}")

The benchmark was using the default node-stats-sample-interval of 1s. One second seems aggressive, and I will try with a value of 10s. We might consider a new default.
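For example, the telemetry-params.json referenced in the command above could raise the interval along these lines (illustrative; the file's other parameters, if any, are omitted here, and the value is in seconds):

{
  "node-stats-sample-interval": 10
}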

inqueue added the :Telemetry (Telemetry Devices that gather additional metrics) label on Aug 25, 2023
inqueue (Member, Author) commented Aug 25, 2023

The issue appears to only affect the node-stats telemetry device.

b-deam added the bug (Something's wrong) label on Nov 10, 2023