We see this randomly in CI; it used to be very rare, but recently the failures have become more frequent.
When it happened locally, the issue was that volsync deleted the cephfs volume snapshot, but the snapshot was never actually removed.
When this happens, retrying the e2e job will keep failing, since the next run uses the same namespace and resource names.
Fixing the cluster requires manual cleanup. I think removing the finalizer on the volumesnapshot was enough, but I did not check whether there are leftovers in cephfs.
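Something like the following should work as the manual cleanup; the context, namespace, and snapshot name are illustrative and must be adjusted to the stuck resource:

# Find the volumesnapshot that is stuck in the test namespace (names are examples)
kubectl get volumesnapshot --context dr2 -n volsync-test-file

# Clear its finalizers so the namespace deletion can complete
kubectl patch volumesnapshot <snapshot-name> --context dr2 -n volsync-test-file \
    --type merge -p '{"metadata":{"finalizers":null}}'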
Example error:
drenv.commands.Error: Command failed:
command: ('addons/volsync/test', 'dr1', 'dr2')
exitcode: 1
error:
Traceback (most recent call last):
File "/home/github/actions-runner/_work/ramen/ramen/test/addons/volsync/test", line 259, in <module>
t.result()
File "/usr/lib64/python3.12/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib64/python3.12/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/usr/lib64/python3.12/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/github/actions-runner/_work/ramen/ramen/test/addons/volsync/test", line 243, in test
teardown(cluster1, cluster2, variant)
File "/home/github/actions-runner/_work/ramen/ramen/test/addons/volsync/test", line 223, in teardown
kubectl.wait(
File "/home/github/actions-runner/_work/ramen/ramen/test/drenv/kubectl.py", line [144](https://github.com/RamenDR/ramen/actions/runs/11955746683/job/33338240384?pr=1548#step:7:145), in wait
_watch("wait", *args, context=context, log=log)
File "/home/github/actions-runner/_work/ramen/ramen/test/drenv/kubectl.py", line 216, in _watch
for line in commands.watch(*cmd, input=input):
File "/home/github/actions-runner/_work/ramen/ramen/test/drenv/commands.py", line 207, in watch
raise Error(args, error, exitcode=p.returncode)
drenv.commands.Error: Command failed:
command: ('kubectl', 'wait', '--context', 'dr2', 'ns', 'volsync-test-file', '--for=delete', '--timeout=120s')
exitcode: 1
error:
error: timed out waiting for the condition on namespaces/volsync-test-file
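To see what is still blocking the namespace deletion, listing everything left in the stuck namespace helps; this command is illustrative and not part of the test:

# List all namespaced resources that still exist in the stuck namespace on the failing cluster
kubectl api-resources --verbs=list --namespaced -o name \
    | xargs -n1 kubectl get --context dr2 -n volsync-test-file --ignore-not-found --show-kind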
Failed builds: