-
Notifications
You must be signed in to change notification settings - Fork 142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ImagePartitionAction: retry losetup.Attach() #524
Conversation
losetup.Attach() can fail due to concurrent attaches in other processes as seen in go-debos#522 . The problem is a race condition between finding a free loop device and attaching the image. Now that we have go-losetup v2, which does report the error, we can do what util-linux does ( https://github.com/util-linux/util-linux/blob/4c4b248c68149089c8be2f830214bb2be693307e/sys-utils/losetup.c#L662 ) and retry on failure. I only sleep for 200 ms as opposed to 1 second as in https://github.com/go-debos/debos/blob/78aad24dc068ec2aac0355c165f760b953379b8f/actions/image_partition_action.go#L668 because the race condition should immediately resolve without waiting at all. I still sleep for 200 ms as this is what util-linux does to prevent spinning ( util-linux/util-linux@3ff6fb8 ). Fixes: go-debos#522
Draft as I want to let this run for a week or so in our build farm. |
Thanks for the fix ! Hopefully this solves the issue for you, also happy to wait a while before merging. It'd be really good to see some stats on how this improves things in your setup :-) (e.g drop from 20/100 build failures to 0/wk or similar) |
go-losetup v2 now returns a meaningful error. Add it to the returned error message.
Unfortunately (fortunately?) it looks like the issue is harder to trigger than I thought. Since 958e68d I have only one failed build of about 100. It failed, as expected, with
instead of something exploding later on. So this is good. |
Great news! It's much better to fail early than later. |
@jakob-tsd Do you have an update? :-) |
I think this is good to go. I have zero failures since I have implemented the change. Additionally I did the stress test below, which triggers "Setup loop device: try 1/60 failed: device or resource busy" but always succeeds with at most 2 retries. Not sure what is going on with the sfdisk partition table warning, but I this in all of our builds, and the images are fine. stress.sh :
losetup.yml :
Output:
|
Still all green. Marked ready. |
losetup.Attach() can fail due to concurrent attaches in other processes as seen in #522 .
The problem is a race condition between finding a free loop device and attaching the image.
Now that we have go-losetup v2, which does report the error, we can do what util-linux does
( https://github.com/util-linux/util-linux/blob/4c4b248c68149089c8be2f830214bb2be693307e/sys-utils/losetup.c#L662 ) and retry on failure.
I only sleep for 200 ms as opposed to 1 second as in
debos/actions/image_partition_action.go
Line 668 in 78aad24
I still sleep for 200 ms as this is what util-linux does to prevent spinning ( util-linux/util-linux@3ff6fb8 ).
Fixes: #522