Concurrency-safe unpacking of TF providers. #36085

pietrodn · 2024-11-23T12:35:33Z

The original implementation unpacked the downloaded provider .zip files directly in the Terraform plugin cache directory. This made the terraform init command prone to race conditions when multiple Terraform modules using the same cache were downloading, checksumming, and validating the provider files at the same time. This is a serious problem in CI workflows, as it forces developers to choose between two unattractive options:

initializing modules serially and using the cache, or
initializing modules in parallel, but download the same modules every time, increasing the consumed bandwidth.

This change unpacks the .zip files in a temporary directory with a unique name inside the plugin cache directory, and only then moves the files to the expected location.

It also takes a file lock on the target cache directory to prevent two Terraform processes from installing the same provider and conflicting with each other.

This change is a simpler version of this change by @mplzik, rebased on the latest main branch: #33479

Fixes #31964

Target Release

1.11.x

Draft CHANGELOG entry

BUG FIXES

Concurrent write access to plugin cache directory has been fixed.

The original implementation unpacked the downloaded provider .zip files directly in the Terraform plugin cache directory. This made the `terraform init` command prone to race conditions when multiple Terraform modules using the same cache were downloading, checksumming, and validating the provider files at the same time. This is a serious problem in CI workflows, as it forces developers to choose between initializing modules serially and using the cache, or initializing modules in parallel, but download the same modules every time, increasing the consumed bandwidth. This change unpacks the .zip files in a temporary directory with a unique name inside the plugin cache directory, and only then moves the files to the expected location. This change is inspired by this PR: hashicorp#33479

liamcervante · 2024-11-25T09:06:50Z

Hi @pietrodn, thanks for putting this together. I think the scope for this has changed from the original issue, and I worry we're not solving this in the right way.

As I understood, the previous issue was a relatively minor change in the way Terraform unpacked archives. The outcome of that change was to minimise the chance that multiple Terraform binaries had to overlap with each other, improving the behaviour of Terraform when executing in parallel, but ultimately not actually fixing the root cause of the problem. As the fix was relatively simple, I was happy to merge it as a small improvement that wasn't particularly risky to the overall function of Terraform itself.

I think with this change we're now past the point of simpleness that the original change had to a point where it is worth re-evaluating the approach we're taking. If Terraform discovers a file they are attempting to write is locked, that means another version of Terraform is already attempting to download a provider that we need. In that situation, I don't think the answer is to simply wait until the file is unlocked and then do the work again. Instead, this version of Terraform should move on to the next provider and start downloading that before coming back later and checking if the other version of Terraform did in fact download the correct provider. This way we avoid all the competing versions of Terraform overwriting each other.

With that in mind, I wonder if the place to check for and validate the new lock file we'll be creating is here: https://github.com/hashicorp/terraform/blob/main/internal/providercache/installer.go#L576. If we fail to acquire the lock at that point, we can simply continue onto checking the next provider or pause with the same backoff implementation and then recheck the earlier validation to see if whatever held the lock has actually just downloaded what we need.

I think moving the lock to that external point would also mean we don't need to refactor the archive unpacking at all, as it should be concurrency-safe however we choose to write to the file.

I'd be interested to know your thoughts here? I just think if we're taking the implementation of this into the area of "proper" fixes, it's worth thinking about a real proper fix.

mplzik · 2024-11-25T10:19:27Z

Hello @liamcervante . Since I might've caused some misunderstanding, firstly -- IIUC the original change that was similar to #33479 was not acceptable at all (first second comment) -- was that the case?

If I misunderstood something, apologies to @pietrodn for the extra work he put in to make things work correctly.

mplzik · 2024-11-25T10:31:09Z

Also, I we assume that we can't guarantee atomicity of the file writes, there's one more race condition that we might need to take care of. When launching the provider binary, there's currently no consistency checking when launching the provider. Thus, I'd propose to either:

add locking to https://github.com/hashicorp/terraform/blob/main/internal/providercache/cached_provider.go#L106 to ensure that the code only returns executable path once no locks are being held, or
add lock handling to all the places where the binary is being called after getting its path using ExecuteFile().

liamcervante · 2024-11-25T11:15:45Z

Hi @mplzik! I think that once a provider has been installed, Terraform itself will never delete it from either the local or the global cache. As such, I don't think we need to lock the providers during plan or apply operations. If one execution of Terraform is using a provider during a plan or an apply, and init is called in another directory that needs the same provider then the init call will just reuse the already downloaded provider - it won't try and delete the existing version and redownload it.

pietrodn · 2024-11-25T12:08:02Z

I think moving the lock to that external point would also mean we don't need to refactor the archive unpacking at all, as it should be concurrency-safe however we choose to write to the file.

It's harder to implement (as we need more complex logic to retry the locked providers), but it'd be a superior solution: it would avoid to download any provider more than once. The current solution fixes the issue but doesn't prevent redundant downloads.

liamcervante · 2024-12-04T09:27:38Z

Hi @pietrodn - I'm trying to replicate this issue to verify the fix and I haven't been able to so far. Do you have some configuration and reproduction steps we can use to test the behaviour of Terraform before and after this fix? Thanks!

pietrodn requested a review from a team as a code owner November 23, 2024 12:35

pietrodn mentioned this pull request Nov 23, 2024

Unpack TF plugins in a more atomic way. #33479

Open

pietrodn force-pushed the tf-atomic-plugin-cache branch from 565c605 to 6d51f7f Compare November 23, 2024 13:53

pietrodn added 2 commits November 23, 2024 21:22

Add rename_with_retry to mitigate Windows non-atomic renames.

d024465

Add file locks to prevent concurrent installations.

ec25335

crw added the enhancement label Nov 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Concurrency-safe unpacking of TF providers. #36085

Concurrency-safe unpacking of TF providers. #36085

pietrodn commented Nov 23, 2024 •

edited

Loading

liamcervante commented Nov 25, 2024

mplzik commented Nov 25, 2024

mplzik commented Nov 25, 2024

liamcervante commented Nov 25, 2024

pietrodn commented Nov 25, 2024

liamcervante commented Dec 4, 2024

Concurrency-safe unpacking of TF providers. #36085

Are you sure you want to change the base?

Concurrency-safe unpacking of TF providers. #36085

Conversation

pietrodn commented Nov 23, 2024 • edited Loading

Target Release

Draft CHANGELOG entry

BUG FIXES

liamcervante commented Nov 25, 2024

mplzik commented Nov 25, 2024

mplzik commented Nov 25, 2024

liamcervante commented Nov 25, 2024

pietrodn commented Nov 25, 2024

liamcervante commented Dec 4, 2024

pietrodn commented Nov 23, 2024 •

edited

Loading