I tried benchmarking RBD volumes by running fio in a Pod. Unfortunately, fio got stuck multiple times, and trying to terminate it left its worker subprocesses in disk sleep forever. In this situation I am unable to stop either the processes or the container: it stays Terminating, with "cannot stop container: tried to kill container, but did not receive an exit event". My only option is to reboot the whole node, which is very disruptive.
I should add that the mount is not completely broken in that situation: I can still write to other files without hanging. But the hung processes never recover or exit.
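For anyone diagnosing the same symptom: a process in uninterruptible sleep (state D in ps, reported as "disk sleep" in /proc/<pid>/status) does not act on signals, including SIGKILL, until the kernel call it is blocked in returns. That is consistent with the container runtime never seeing an exit event. Below is a minimal sketch for listing such processes on the node, assuming a Linux host with /proc mounted. It reads the state field from /proc/<pid>/stat and the kernel wait channel from /proc/<pid>/wchan. The function names are hypothetical helpers, not part of any tool, and only each process's main thread is checked (per-thread state lives under /proc/<pid>/task).

```python
import sys
from pathlib import Path
from typing import List, Tuple


def parse_stat(stat_text: str) -> Tuple[int, str, str]:
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split on the LAST ')' instead of on whitespace.
    """
    pid_part, _, rest = stat_text.partition(" (")
    comm, _, after = rest.rpartition(")")
    state = after.split()[0]
    return int(pid_part), comm, state


def find_d_state(proc: Path = Path("/proc")) -> List[Tuple[int, str, str]]:
    """List (pid, comm, wchan) for every process whose main thread is in state D."""
    if not proc.is_dir():
        # Not a Linux host (or /proc not mounted): nothing to scan.
        return []
    stuck = []
    for entry in proc.iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/self or /proc/meminfo
        try:
            pid, comm, state = parse_stat((entry / "stat").read_text())
            if state != "D":
                continue
            # wchan names the kernel function the task sleeps in; it may read
            # as "0" or be empty depending on kernel config and permissions.
            wchan = (entry / "wchan").read_text().strip() or "?"
        except OSError:
            # The process exited during the scan, or access was denied.
            continue
        stuck.append((pid, comm, wchan))
    return stuck


if __name__ == "__main__":
    for pid, comm, wchan in find_d_state():
        print(f"{pid}\t{comm}\t{wchan}", file=sys.stdout)
```

Run as root on the affected node, /proc/<pid>/stack additionally shows the full kernel stack of a stuck task, which can help pinpoint where (for example in the RBD or filesystem path) it is blocked.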
I am very worried about running production workloads on the cluster if enough sustained disk operations can apparently crash the node.
How can I recover when this happens? How can I prevent it from happening?
I am using Ubuntu 20.04 with Linux 5.4.0-109.