Concurrency-safe unpacking of TF providers. #36085
base: main
Conversation
The original implementation unpacked the downloaded provider .zip files directly in the Terraform plugin cache directory. This made the `terraform init` command prone to race conditions when multiple Terraform modules using the same cache were downloading, checksumming, and validating the provider files at the same time. This is a serious problem in CI workflows, as it forces developers to choose between initializing modules serially while using the cache, or initializing modules in parallel but downloading the same providers every time, increasing the consumed bandwidth. This change unpacks the .zip files in a temporary directory with a unique name inside the plugin cache directory, and only then moves the files to the expected location (see the sketch below). This change is inspired by this PR: hashicorp#33479
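A minimal sketch of the temp-dir-then-rename technique described above; the function names, paths, and zip handling here are illustrative assumptions, not Terraform's actual `internal/providercache` code:

```go
// Hypothetical sketch of the temp-dir-then-rename unpacking described
// above; not Terraform's actual internal/providercache code.
package main

import (
	"archive/zip"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

func unpackProviderArchive(zipPath, cacheDir, targetName string) error {
	// Stage into a uniquely named directory *inside* the cache dir, so the
	// final rename stays on one filesystem and is atomic.
	stagingDir, err := os.MkdirTemp(cacheDir, ".staging-*")
	if err != nil {
		return fmt.Errorf("creating staging dir: %w", err)
	}
	defer os.RemoveAll(stagingDir) // clean up if the rename never happens

	if err := extractZip(zipPath, stagingDir); err != nil {
		return fmt.Errorf("extracting %s: %w", zipPath, err)
	}

	target := filepath.Join(cacheDir, targetName)
	if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil {
		return err
	}
	// If a concurrent process won the race and the target already exists,
	// keep its copy and discard ours.
	if err := os.Rename(stagingDir, target); err != nil {
		if _, statErr := os.Stat(target); statErr == nil {
			return nil
		}
		return fmt.Errorf("moving into place: %w", err)
	}
	return nil
}

// extractZip is a minimal extractor; it omits the zip-slip path checks a
// real implementation would need.
func extractZip(zipPath, dest string) error {
	r, err := zip.OpenReader(zipPath)
	if err != nil {
		return err
	}
	defer r.Close()
	for _, f := range r.File {
		path := filepath.Join(dest, f.Name)
		if f.FileInfo().IsDir() {
			if err := os.MkdirAll(path, 0o755); err != nil {
				return err
			}
			continue
		}
		if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
			return err
		}
		src, err := f.Open()
		if err != nil {
			return err
		}
		dst, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, f.Mode())
		if err != nil {
			src.Close()
			return err
		}
		_, copyErr := io.Copy(dst, src)
		src.Close()
		dst.Close()
		if copyErr != nil {
			return copyErr
		}
	}
	return nil
}

func main() {
	err := unpackProviderArchive("provider.zip", "plugin-cache",
		"registry.example.com/acme/demo/1.0.0/linux_amd64")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

The key property is that `os.Rename` within a single filesystem is atomic, so concurrent readers never observe a half-extracted provider directory.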
Force-pushed from 565c605 to 6d51f7f.
Hi @pietrodn, thanks for putting this together. I think the scope of this has changed from the original issue, and I worry we're not solving this in the right way.

As I understood it, the previous issue called for a relatively minor change in the way Terraform unpacked archives. The outcome of that change was to minimise the chance that multiple Terraform binaries overlapped with each other, improving the behaviour of Terraform when executing in parallel, but ultimately not actually fixing the root cause of the problem. As the fix was relatively simple, I was happy to merge it as a small improvement that wasn't particularly risky to the overall function of Terraform itself. I think with this change we're now past the point of simplicity that the original change had, to a point where it is worth re-evaluating the approach we're taking.

If Terraform discovers that a file it is attempting to write is locked, that means another instance of Terraform is already attempting to download a provider that we need. In that situation, I don't think the answer is to simply wait until the file is unlocked and then do the work again. Instead, this instance of Terraform should move on to the next provider and start downloading that, before coming back later and checking whether the other instance did in fact download the correct provider. This way we avoid the competing instances of Terraform overwriting each other.

With that in mind, I wonder if the place to check for and validate the new lock file we'll be creating is here: https://github.com/hashicorp/terraform/blob/main/internal/providercache/installer.go#L576. If we fail to acquire the lock at that point, we can simply continue on to checking the next provider, or pause with the same backoff implementation and then recheck the earlier validation to see if whatever held the lock has actually just downloaded what we need. I think moving the lock to that external point would also mean we don't need to refactor the archive unpacking at all, as it should be concurrency-safe however we choose to write to the file.

I'd be interested to know your thoughts here. I just think if we're taking the implementation of this into the area of "proper" fixes, it's worth thinking about a real proper fix.
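To make the suggested try-lock, skip, and revisit loop concrete, here is a hedged sketch. Everything in it (`tryLock`, the `.lock` sidecar files, the stub `install`) is a hypothetical illustration, not the installer's real API:

```go
// Hypothetical sketch of the "skip and revisit" strategy described above:
// try to lock each provider's cache entry without blocking, install the
// ones we can lock, and requeue the rest, backing off before rechecking
// whether a competing process already installed what we need.
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

// tryLock uses an O_EXCL lock file as a portable, non-blocking advisory
// lock. A real implementation would need to detect stale locks left by
// crashed processes.
func tryLock(cacheDir, provider string) (unlock func(), ok bool, err error) {
	lockPath := filepath.Join(cacheDir, provider+".lock")
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o644)
	if errors.Is(err, fs.ErrExist) {
		return nil, false, nil // someone else holds the lock
	}
	if err != nil {
		return nil, false, err
	}
	f.Close()
	return func() { os.Remove(lockPath) }, true, nil
}

func alreadyInstalled(cacheDir, provider string) bool {
	_, err := os.Stat(filepath.Join(cacheDir, provider))
	return err == nil
}

// install stands in for the real download + checksum + unpack sequence.
func install(cacheDir, provider string) error {
	return os.MkdirAll(filepath.Join(cacheDir, provider), 0o755)
}

func installAll(cacheDir string, providers []string) error {
	pending := providers
	backoff := 100 * time.Millisecond
	for len(pending) > 0 {
		var deferred []string
		for _, p := range pending {
			if alreadyInstalled(cacheDir, p) {
				continue // a competing process finished this one for us
			}
			unlock, ok, err := tryLock(cacheDir, p)
			if err != nil {
				return err
			}
			if !ok {
				deferred = append(deferred, p) // locked elsewhere: revisit later
				continue
			}
			err = install(cacheDir, p)
			unlock()
			if err != nil {
				return err
			}
		}
		if len(deferred) == len(pending) {
			time.Sleep(backoff) // everything is locked by others: back off
			backoff *= 2
		}
		pending = deferred
	}
	return nil
}

func main() {
	if err := installAll("plugin-cache", []string{"null", "aws"}); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

The point of the skip-and-revisit shape is that no process ever blocks on work another process is already doing; at worst it re-checks the cache entry after the lock holder finishes.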
Hello @liamcervante. Since I might've caused some misunderstanding, firstly -- IIUC, the original change that was similar to #33479 was not acceptable at all (first, second comment) -- was that the case? If I misunderstood something, apologies to @pietrodn for the extra work he put in to make things work correctly.
Also, if we assume that we can't guarantee atomicity of the file writes, there's one more race condition that we might need to take care of: there's currently no consistency checking when launching the provider binary. Thus, I'd propose to either:
Hi @mplzik! I think that once a provider has been installed, Terraform itself will never delete it from either the local or the global cache. As such, I don't think we need to lock the providers during plan or apply operations. If one execution of Terraform is using a provider during a plan or an apply, and …
It's harder to implement (as we need more complex logic to retry the locked providers), but it'd be a superior solution: it would avoid downloading any provider more than once. The current solution fixes the issue but doesn't prevent redundant downloads.
Hi @pietrodn - I'm trying to replicate this issue to verify the fix and I haven't been able to so far. Do you have some configuration and reproduction steps we can use to test the behaviour of Terraform before and after this fix? Thanks!
The original implementation unpacked the downloaded provider .zip files directly in the Terraform plugin cache directory. This made the `terraform init` command prone to race conditions when multiple Terraform modules using the same cache were downloading, checksumming, and validating the provider files at the same time. This is a serious problem in CI workflows, as it forces developers to choose between two unattractive options:

- initializing modules serially while using the cache, or
- initializing modules in parallel, but downloading the same providers every time, increasing the consumed bandwidth.

This change unpacks the .zip files in a temporary directory with a unique name inside the plugin cache directory, and only then moves the files to the expected location. It also takes a file lock on the target cache directory to prevent two Terraform processes from installing the same provider and conflicting with each other (see the sketch below).

This change is a simpler version of this change by @mplzik, rebased on the latest `main` branch: #33479

Fixes #31964
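As a rough, Unix-only illustration of the directory-lock idea, here is a sketch using `syscall.Flock` on a sidecar lock file; the helper name and lock-file placement are assumptions, and Terraform's actual locking code may work differently:

```go
//go:build unix

// Hypothetical sketch: hold an advisory flock on a sidecar lock file while
// installing into the target cache directory, so two Terraform processes
// don't unpack the same provider at once. Not Terraform's actual code.
package main

import (
	"fmt"
	"os"
	"syscall"
)

func withDirLock(targetDir string, fn func() error) error {
	f, err := os.OpenFile(targetDir+".lock", os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()

	// LOCK_EX blocks until the lock is free; the kernel drops the lock
	// automatically if the process dies, so stale locks can't wedge us.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		return fmt.Errorf("locking %s: %w", targetDir, err)
	}
	defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)

	return fn()
}

func main() {
	err := withDirLock("demo-provider", func() error {
		fmt.Println("installing provider while holding the lock")
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

An OS-level advisory lock like this avoids the stale-lock-file problem that a plain create-if-absent lock file has, at the cost of platform-specific code.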
Target Release
1.11.x
Draft CHANGELOG entry
BUG FIXES