
Problem copying snapshots to regions when building #122

Closed
wolfsage opened this issue Feb 7, 2024 · 4 comments · Fixed by #123
wolfsage commented Feb 7, 2024

Overview of the Issue

Since yesterday, packer build has bailed out when copying an image to other regions like so:

2024/02/06 08:30:14 packer-plugin-digitalocean_v1.1.1_x5.0_linux_amd64 plugin: 2024/02/06 08:30:14 Looking up snapshot ID for snapshot: test
2024/02/06 08:30:16 ui: ==> digitalocean: transferring Snapshot ID: 2030587156
2024/02/06 08:30:16 packer-plugin-digitalocean_v1.1.1_x5.0_linux_amd64 plugin: 2024/02/06 08:30:16 Waiting for up to 1200 seconds for image transfer to become completed
2024/02/06 08:30:16 packer-plugin-digitalocean_v1.1.1_x5.0_linux_amd64 plugin: 2024/02/06 08:30:16 Checking action status... (attempt: 1)
2024/02/06 08:30:16 ui error: ==> digitalocean: Error waiting for snapshot transfer: GET https://api.digitalocean.com/v2/images/2030587156/actions/2030585883: 404 (request "b3086244-9084-4591-bb59-0d3c9164c9a1") The resource you were accessing could not be found.

What's interesting here is that 2030587156 is not a valid image ID: it's an action ID! And it's easy to see where this value came from:

https://github.com/digitalocean/packer-plugin-digitalocean/blob/main/builder/digitalocean/step_snapshot.go#L98

Breaking down what's happening:

			imageTransfer, _, err := client.ImageActions.Transfer(context.TODO(), images[0].ID, transferRequest)
			if err != nil {
				err := fmt.Errorf("Error transferring snapshot: %s", err)
				state.Put("error", err)
				ui.Error(err.Error())
				return multistep.ActionHalt
			}
			ui.Say(fmt.Sprintf("transferring Snapshot ID: %d", imageTransfer.ID))
			if err := WaitForImageState(godo.ActionCompleted, imageTransfer.ID, action.ID,
				client, 20*time.Minute); err != nil {
				// If we get an error the first time, actually report it
				err := fmt.Errorf("Error waiting for snapshot transfer: %s", err)
				state.Put("error", err)
				ui.Error(err.Error())
				return multistep.ActionHalt
			}

First we create the transfer request. This uses images[0].ID, so it POSTs to /images/1234/actions with {type: transfer, region: ...}:

		imageTransfer, _, err := client.ImageActions.Transfer(context.TODO(), images[0].ID, transferRequest)

And then we're supposed to poll that action to wait for it to finish. Polling it would hit /images/1234/actions/<imageTransfer.ID>.

But instead, we're passing what seem to me to be the wrong arguments to WaitForImageState:

		if err := WaitForImageState(godo.ActionCompleted, imageTransfer.ID, action.ID,

We're passing imageTransfer.ID (which is an action ID!) into the images/<...>/ slot, and then we're passing action.ID, which comes from much higher up:

	action, _, err := client.DropletActions.Snapshot(context.TODO(), dropletId, c.SnapshotName)

... so that's the ID of a snapshot action on a droplet. This feels like a bug, and I'm not sure how this ever worked. Shouldn't the code be:

			imageTransfer, _, err := client.ImageActions.Transfer(context.TODO(), images[0].ID, transferRequest)
			if err != nil {
				err := fmt.Errorf("Error transferring snapshot: %s", err)
				state.Put("error", err)
				ui.Error(err.Error())
				return multistep.ActionHalt
			}
			ui.Say(fmt.Sprintf("transferring Snapshot ID: %d", imageTransfer.ID))
-			if err := WaitForImageState(godo.ActionCompleted, imageTransfer.ID, action.ID,
+			if err := WaitForImageState(godo.ActionCompleted, images[0].ID, imageTransfer.ID,
				client, 20*time.Minute); err != nil {
				// If we get an error the first time, actually report it
				err := fmt.Errorf("Error waiting for snapshot transfer: %s", err)
				state.Put("error", err)
				ui.Error(err.Error())
				return multistep.ActionHalt
			}
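To make the argument order concrete, here's a minimal standalone sketch (not the plugin's actual code) of the URL shape behind godo's image-action polling, GET /v2/images/{imageID}/actions/{actionID}. With the arguments swapped, the transfer action ID lands in the image slot, reproducing the exact 404 URL from the log:

```go
package main

import "fmt"

// pollPath mimics the URL shape godo requests when polling an image action:
// GET /v2/images/{imageID}/actions/{actionID}
func pollPath(imageID, actionID int) string {
	return fmt.Sprintf("/v2/images/%d/actions/%d", imageID, actionID)
}

func main() {
	transferActionID := 2030587156 // imageTransfer.ID from the log above
	snapshotActionID := 2030585883 // action.ID from the droplet snapshot step

	// Buggy call: WaitForImageState(..., imageTransfer.ID, action.ID, ...)
	// puts the transfer action ID in the image slot, producing the 404 URL
	// seen in the log: /v2/images/2030587156/actions/2030585883
	fmt.Println(pollPath(transferActionID, snapshotActionID))

	// Fixed call: WaitForImageState(..., images[0].ID, imageTransfer.ID, ...)
	// polls the transfer action under the real snapshot image
	// (145678901 is a hypothetical image ID).
	fmt.Println(pollPath(145678901, transferActionID))
}
```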

I don't understand how this ever worked to be honest. My two theories are:

  • we were running old versions of things that didn't have this seeming bug
  • DigitalOcean changed their API? Maybe the "ID" that came back from the image transfer action was mistakenly the ID of the image. But that doesn't explain how passing action.ID would have worked too... unless DigitalOcean started returning an error for invalid actions where they previously returned OK responses. That would account for it.

Reproduction Steps

packer build with a configuration like this:

{
      "type": "digitalocean",
      "droplet_name": "some-droplet",
      "image": "debian-11-x64",
      "region": "sfo3",
      "size": "g-4vcpu-16gb",

      "ssh_username": "root",
      "ssh_timeout": "60m",

      "snapshot_name": "some-snapshot-{{isotime \"20060102\"}}",
      "snapshot_regions": ["sfo3","nyc3","ams3","syd1"],
      "tags": ["sometag"],

      "private_networking": false
}

Plugin and Packer version

Packer v1.9.2
packer-plugin-digitalocean_v1.1.1_x5.0_linux_amd64

Operating system and Environment details

Linux 4.19.0-23-cloud-amd64 #1 SMP Debian 4.19.269-1 (2022-12-20) x86_64 GNU/Linux

@andrewsomething
Member

Hi @wolfsage,

Thanks for flagging this issue for us! I've been able to reproduce the problem. Like you, I am pretty confused about this one. git blame shows that this code has not changed in many years, but your analysis seems correct. It's unclear to me how this ever worked. The bug goes back to the initial addition of the transfer functionality: hashicorp/packer#4893

I've opened a PR to address this #123

wolfsage commented Feb 8, 2024

Thanks!

wolfsage commented Feb 8, 2024

I just want to add this for anyone else having similar trouble:

Before this fix, it seems that with the buggy behaviour, we'd kick off a bunch of transfers in quick succession, but never actually wait for them to complete.

But with the bugfix in #123, each transfer is waited on before the next is started. This is causing us new problems because one transfer took 1h20m, and so the build timed out.

I've hacked around this for now by disabling the "wait for transfers to complete" step, which gets us back to the earlier behaviour:

diff --git a/builder/digitalocean/step_snapshot.go b/builder/digitalocean/step_snapshot.go
index 372b2b4..bc4d9c4 100644
--- a/builder/digitalocean/step_snapshot.go
+++ b/builder/digitalocean/step_snapshot.go
@@ -97,14 +97,16 @@ func (s *stepSnapshot) Run(ctx context.Context, state multistep.StateBag) multis
                                return multistep.ActionHalt
                        }
 
-                       if err := WaitForImageState(godo.ActionCompleted, images[0].ID, imageTransfer.ID,
-                               client, 20*time.Minute); err != nil {
-                               // If we get an error the first time, actually report it
-                               err := fmt.Errorf("Error waiting for snapshot transfer: %s", err)
-                               state.Put("error", err)
-                               ui.Error(err.Error())
-                               return multistep.ActionHalt
-                       }
+                       ui.Say(fmt.Sprintf("Not waiting for transfers (id %d), they take too long", imageTransfer.ID))
+
+//                     if err := WaitForImageState(godo.ActionCompleted, images[0].ID, imageTransfer.ID,
+//                             client, 20*time.Minute); err != nil {
+//                             // If we get an error the first time, actually report it
+//                             err := fmt.Errorf("Error waiting for snapshot transfer: %s", err)
+//                             state.Put("error", err)
+//                             ui.Error(err.Error())
+//                             return multistep.ActionHalt
+//                     }
                }
        }

and it seems to be working.

Cheers

@andrewsomething
Member

@wolfsage That's a good call out. We should really initiate these transfers in parallel. I've opened another issue to track that #125
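For what it's worth, a rough sketch of that shape (transferAndWait is a hypothetical stand-in for the ImageActions.Transfer call plus WaitForImageState, not the plugin's real code): start every region's transfer first, then wait on all of them, so one slow region no longer serializes the rest:

```go
package main

import (
	"fmt"
	"sync"
)

// transferAndWait is a hypothetical stand-in for "call ImageActions.Transfer
// for one region, then poll the resulting action until it completes".
func transferAndWait(region string) error {
	// ... the godo Transfer call and WaitForImageState would go here ...
	return nil
}

// transferAll kicks off every region's transfer concurrently and returns the
// first error collected, instead of waiting for each transfer to finish
// before starting the next one.
func transferAll(regions []string) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(regions))
	for _, region := range regions {
		wg.Add(1)
		go func(r string) {
			defer wg.Done()
			if err := transferAndWait(r); err != nil {
				errs <- fmt.Errorf("transfer to %s: %w", r, err)
			}
		}(region)
	}
	wg.Wait()
	close(errs)
	return <-errs // nil if every transfer succeeded
}

func main() {
	if err := transferAll([]string{"nyc3", "ams3", "syd1"}); err != nil {
		fmt.Println("error:", err)
	} else {
		fmt.Println("all transfers complete")
	}
}
```

The per-region timeout could stay at 20 minutes each since the waits now overlap rather than accumulate.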
