Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ImagePartitionAction: retry losetup.Attach() #524

Merged
merged 2 commits into from
Dec 17, 2024

Conversation

jakob-tsd
Copy link
Contributor

losetup.Attach() can fail due to concurrent attaches in other processes as seen in #522 .

The problem is a race condition between finding a free loop device and attaching the image.

Now that we have go-losetup v2, which does report the error, we can do what util-linux does
( https://github.com/util-linux/util-linux/blob/4c4b248c68149089c8be2f830214bb2be693307e/sys-utils/losetup.c#L662 ) and retry on failure.

I only sleep for 200 ms as opposed to 1 second as in

if i.usingLoop {
because the race condition should immediately resolve without waiting at all.

I still sleep for 200 ms as this is what util-linux does to prevent spinning ( util-linux/util-linux@3ff6fb8 ).

Fixes: #522

losetup.Attach() can fail due to concurrent attaches in other processes
as seen in go-debos#522 .

The problem is a race condition between finding a free loop device
and attaching the image.

Now that we have go-losetup v2, which does report the error, we can do
what util-linux does
( https://github.com/util-linux/util-linux/blob/4c4b248c68149089c8be2f830214bb2be693307e/sys-utils/losetup.c#L662 )
and retry on failure.

I only sleep for 200 ms as opposed to 1 second as in
https://github.com/go-debos/debos/blob/78aad24dc068ec2aac0355c165f760b953379b8f/actions/image_partition_action.go#L668
because the race condition should immediately resolve without waiting
at all.

I still sleep for 200 ms as this is what util-linux does to
prevent spinning ( util-linux/util-linux@3ff6fb8 ).

Fixes: go-debos#522
@jakob-tsd
Copy link
Contributor Author

Draft as I want to let this run for a week or so in our build farm.

@obbardc
Copy link
Member

obbardc commented Dec 2, 2024

Draft as I want to let this run for a week or so in our build farm.

Thanks for the fix ! Hopefully this solves the issue for you, also happy to wait a while before merging.

It'd be really good to see some stats on how this improves things in your setup :-) (e.g drop from 20/100 build failures to 0/wk or similar)

go-losetup v2 now returns a meaningful error. Add it to the returned
error message.
@jakob-tsd
Copy link
Contributor Author

Unfortunately (fortunately?) it looks like the issue is harder to trigger than I thought. Since 958e68d I have only one failed build of about 100. It failed, as expected, with

2024/12/02 01:01:19 Action image-partition failed at stage PreNoMachine, error: Failed to setup loop device

instead of something exploding later on. So this is good.

@obbardc
Copy link
Member

obbardc commented Dec 2, 2024

Unfortunately (fortunately?) it looks like the issue is harder to trigger than I thought. Since 958e68d I have only one failed build of about 100. It failed, as expected, with

2024/12/02 01:01:19 Action image-partition failed at stage PreNoMachine, error: Failed to setup loop device

instead of something exploding later on. So this is good.

Great news! It's much better to fail early than later.

@obbardc
Copy link
Member

obbardc commented Dec 12, 2024

@jakob-tsd Do you have an update? :-)

@jakob-tsd
Copy link
Contributor Author

jakob-tsd commented Dec 12, 2024

I think this is good to go. I have zero failures since I have implemented the change.

Additionally I did the stress test below, which triggers "Setup loop device: try 1/60 failed: device or resource busy" but always succeeds with at most 2 retries.

Not sure what is going on with the sfdisk partition table warning, but I this in all of our builds, and the images are fine.

stress.sh :

#!/bin/bash

for i in 1 2 3 4 5 6; do
rm -f $i.img
debos --disable-fakemachine -t OUTPUT_FILE:$i.img losetup.yml |& sed -u -e "s/^/[$i] /" &
done

wait

losetup.yml :

architecture: arm64

actions:
  - action: image-partition
    imagename: {{.OUTPUT_FILE}}
    imagesize: 629145600
    partitiontype: gpt
    partitions:
      - name: STRESS
        fs: fat32
        parttype: B921B045-1DF0-41C3-AF44-4C6F280D3FAE
        start: 16MiB
        end: 100%

Output:

root@jaguar1:~/debos-stress# ./stress.sh
[3] 2024/12/12 12:23:06 fakemachine not supported, running on the host!
[2] 2024/12/12 12:23:06 fakemachine not supported, running on the host!
[6] 2024/12/12 12:23:06 fakemachine not supported, running on the host!
[1] 2024/12/12 12:23:06 fakemachine not supported, running on the host!
[4] 2024/12/12 12:23:06 fakemachine not supported, running on the host!
[5] 2024/12/12 12:23:06 fakemachine not supported, running on the host!
[3] 2024/12/12 12:23:06 ==== image-partition ====
[1] 2024/12/12 12:23:06 Setup loop device: try 1/60 failed: device or resource busy
[5] 2024/12/12 12:23:06 ==== image-partition ====
[4] 2024/12/12 12:23:06 Setup loop device: try 1/60 failed: device or resource busy
[6] 2024/12/12 12:23:06 Setup loop device: try 1/60 failed: device or resource busy
[2] 2024/12/12 12:23:06 ==== image-partition ====
[4] 2024/12/12 12:23:07 Setup loop device: try 2/60 failed: device or resource busy
[1] 2024/12/12 12:23:07 ==== image-partition ====
[6] 2024/12/12 12:23:07 ==== image-partition ====
[4] 2024/12/12 12:23:07 ==== image-partition ====
[3] 2024/12/12 12:23:07 sfdisk |
[5] 2024/12/12 12:23:07 sfdisk |
[2] 2024/12/12 12:23:07 sfdisk |
[3] 2024/12/12 12:23:07 sfdisk | The partition table has been altered.
[5] 2024/12/12 12:23:07 sfdisk | The partition table has been altered.
[2] 2024/12/12 12:23:07 sfdisk | The partition table has been altered.
[3] 2024/12/12 12:23:07 sfdisk | Calling ioctl() to re-read partition table.
[5] 2024/12/12 12:23:07 sfdisk | Calling ioctl() to re-read partition table.
[2] 2024/12/12 12:23:07 sfdisk | Calling ioctl() to re-read partition table.
[2] 2024/12/12 12:23:07 sfdisk | Re-reading the partition table failed.: Invalid argument
[5] 2024/12/12 12:23:07 sfdisk | Re-reading the partition table failed.: Invalid argument
[3] 2024/12/12 12:23:07 sfdisk | Re-reading the partition table failed.: Invalid argument
[3] 2024/12/12 12:23:08 sfdisk | The kernel still uses the old table. The new table will be used at the next reboot or after you run partprobe(8) or partx(8).
[3] 2024/12/12 12:23:08 sfdisk | Syncing disks.
[2] 2024/12/12 12:23:08 sfdisk | The kernel still uses the old table. The new table will be used at the next reboot or after you run partprobe(8) or partx(8).
[2] 2024/12/12 12:23:08 sfdisk | Syncing disks.
[5] 2024/12/12 12:23:08 sfdisk | The kernel still uses the old table. The new table will be used at the next reboot or after you run partprobe(8) or partx(8).
[5] 2024/12/12 12:23:08 sfdisk | Syncing disks.
[3] 2024/12/12 12:23:08 Formatting partition 1 | mkfs.fat 4.2 (2021-01-31)
[5] 2024/12/12 12:23:08 Formatting partition 1 | mkfs.fat 4.2 (2021-01-31)
[2] 2024/12/12 12:23:08 Formatting partition 1 | mkfs.fat 4.2 (2021-01-31)
[6] 2024/12/12 12:23:08 sfdisk |
[6] 2024/12/12 12:23:08 sfdisk | The partition table has been altered.
[6] 2024/12/12 12:23:08 sfdisk | Calling ioctl() to re-read partition table.
[6] 2024/12/12 12:23:08 sfdisk | Re-reading the partition table failed.: Invalid argument
[6] 2024/12/12 12:23:08 sfdisk | The kernel still uses the old table. The new table will be used at the next reboot or after you run partprobe(8) or partx(8).
[6] 2024/12/12 12:23:08 sfdisk | Syncing disks.
[6] 2024/12/12 12:23:08 Formatting partition 1 | mkfs.fat 4.2 (2021-01-31)
[1] 2024/12/12 12:23:08 sfdisk |
[1] 2024/12/12 12:23:08 sfdisk | The partition table has been altered.
[4] 2024/12/12 12:23:08 sfdisk |
[4] 2024/12/12 12:23:08 sfdisk | The partition table has been altered.
[1] 2024/12/12 12:23:08 sfdisk | Calling ioctl() to re-read partition table.
[1] 2024/12/12 12:23:08 sfdisk | Re-reading the partition table failed.: Invalid argument
[4] 2024/12/12 12:23:08 sfdisk | Calling ioctl() to re-read partition table.
[4] 2024/12/12 12:23:08 sfdisk | Re-reading the partition table failed.: Invalid argument
[1] 2024/12/12 12:23:08 sfdisk | The kernel still uses the old table. The new table will be used at the next reboot or after you run partprobe(8) or partx(8).
[1] 2024/12/12 12:23:08 sfdisk | Syncing disks.
[4] 2024/12/12 12:23:08 sfdisk | The kernel still uses the old table. The new table will be used at the next reboot or after you run partprobe(8) or partx(8).
[4] 2024/12/12 12:23:08 sfdisk | Syncing disks.
[1] 2024/12/12 12:23:08 Formatting partition 1 | mkfs.fat 4.2 (2021-01-31)
[4] 2024/12/12 12:23:08 Formatting partition 1 | mkfs.fat 4.2 (2021-01-31)
[1] 2024/12/12 12:23:09 ==== Recipe done ====
[6] 2024/12/12 12:23:10 ==== Recipe done ====
[5] 2024/12/12 12:23:10 ==== Recipe done ====
[3] 2024/12/12 12:23:10 ==== Recipe done ====
[2] 2024/12/12 12:23:10 ==== Recipe done ====
[4] 2024/12/12 12:23:10 ==== Recipe done ====

@jakob-tsd jakob-tsd marked this pull request as ready for review December 17, 2024 08:50
@jakob-tsd
Copy link
Contributor Author

Still all green. Marked ready.

@obbardc obbardc added this pull request to the merge queue Dec 17, 2024
Merged via the queue into go-debos:main with commit f219e0a Dec 17, 2024
31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Native arm64 builds: loop device woes
2 participants