Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concurrency-safe unpacking of TF providers. #36085

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

pietrodn
Copy link

@pietrodn pietrodn commented Nov 23, 2024

The original implementation unpacked the downloaded provider .zip files directly in the Terraform plugin cache directory. This made the terraform init command prone to race conditions when multiple Terraform modules using the same cache were downloading, checksumming, and validating the provider files at the same time. This is a serious problem in CI workflows, as it forces developers to choose between two unattractive options:

  1. initializing modules serially and using the cache, or
  2. initializing modules in parallel, but download the same modules every time, increasing the consumed bandwidth.

This change unpacks the .zip files in a temporary directory with a unique name inside the plugin cache directory, and only then moves the files to the expected location.

It also takes a file lock on the target cache directory to prevent two Terraform processes from installing the same provider and conflicting with each other.

This change is a simpler version of this change by @mplzik, rebased on the latest main branch: #33479

Fixes #31964

Target Release

1.11.x

Draft CHANGELOG entry

BUG FIXES

  • Concurrent write access to plugin cache directory has been fixed.

The original implementation unpacked the downloaded provider .zip files directly in the Terraform plugin cache directory. This made the `terraform init` command prone to race conditions when multiple Terraform modules using the same cache were downloading, checksumming, and validating the provider files at the same time.
This is a serious problem in CI workflows, as it forces developers to choose between initializing modules serially and using the cache, or initializing modules in parallel, but download the same modules every time, increasing the consumed bandwidth.

This change unpacks the .zip files in a temporary directory with a unique name inside the plugin cache directory, and only then moves the files to the expected location.

This change is inspired by this PR: hashicorp#33479
@pietrodn pietrodn force-pushed the tf-atomic-plugin-cache branch from 565c605 to 6d51f7f Compare November 23, 2024 13:53
@liamcervante
Copy link
Member

Hi @pietrodn, thanks for putting this together. I think the scope for this has changed from the original issue, and I worry we're not solving this in the right way.

As I understood, the previous issue was a relatively minor change in the way Terraform unpacked archives. The outcome of that change was to minimise the chance that multiple Terraform binaries had to overlap with each other, improving the behaviour of Terraform when executing in parallel, but ultimately not actually fixing the root cause of the problem. As the fix was relatively simple, I was happy to merge it as a small improvement that wasn't particularly risky to the overall function of Terraform itself.

I think with this change we're now past the point of simpleness that the original change had to a point where it is worth re-evaluating the approach we're taking. If Terraform discovers a file they are attempting to write is locked, that means another version of Terraform is already attempting to download a provider that we need. In that situation, I don't think the answer is to simply wait until the file is unlocked and then do the work again. Instead, this version of Terraform should move on to the next provider and start downloading that before coming back later and checking if the other version of Terraform did in fact download the correct provider. This way we avoid all the competing versions of Terraform overwriting each other.

With that in mind, I wonder if the place to check for and validate the new lock file we'll be creating is here: https://github.com/hashicorp/terraform/blob/main/internal/providercache/installer.go#L576. If we fail to acquire the lock at that point, we can simply continue onto checking the next provider or pause with the same backoff implementation and then recheck the earlier validation to see if whatever held the lock has actually just downloaded what we need.

I think moving the lock to that external point would also mean we don't need to refactor the archive unpacking at all, as it should be concurrency-safe however we choose to write to the file.

I'd be interested to know your thoughts here? I just think if we're taking the implementation of this into the area of "proper" fixes, it's worth thinking about a real proper fix.

@mplzik
Copy link

mplzik commented Nov 25, 2024

Hello @liamcervante . Since I might've caused some misunderstanding, firstly -- IIUC the original change that was similar to #33479 was not acceptable at all (firstsecond comment) -- was that the case?

If I misunderstood something, apologies to @pietrodn for the extra work he put in to make things work correctly.

@mplzik
Copy link

mplzik commented Nov 25, 2024

Also, I we assume that we can't guarantee atomicity of the file writes, there's one more race condition that we might need to take care of. When launching the provider binary, there's currently no consistency checking when launching the provider. Thus, I'd propose to either:

@liamcervante
Copy link
Member

Hi @mplzik! I think that once a provider has been installed, Terraform itself will never delete it from either the local or the global cache. As such, I don't think we need to lock the providers during plan or apply operations. If one execution of Terraform is using a provider during a plan or an apply, and init is called in another directory that needs the same provider then the init call will just reuse the already downloaded provider - it won't try and delete the existing version and redownload it.

@pietrodn
Copy link
Author

I think moving the lock to that external point would also mean we don't need to refactor the archive unpacking at all, as it should be concurrency-safe however we choose to write to the file.

It's harder to implement (as we need more complex logic to retry the locked providers), but it'd be a superior solution: it would avoid to download any provider more than once. The current solution fixes the issue but doesn't prevent redundant downloads.

@liamcervante
Copy link
Member

Hi @pietrodn - I'm trying to replicate this issue to verify the fix and I haven't been able to so far. Do you have some configuration and reproduction steps we can use to test the behaviour of Terraform before and after this fix? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow multiple Terraform instances to write to plugin_cache_dir concurrently
4 participants