Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Huge increase in deployment times for v6 upgrade #2987

Closed
jaxxstorm opened this issue Nov 14, 2023 · 10 comments
Closed

Huge increase in deployment times for v6 upgrade #2987

jaxxstorm opened this issue Nov 14, 2023 · 10 comments
Assignees
Labels
impact/performance Something is slower than expected kind/bug Some behavior is incorrect or out of spec needs-repro Needs repro steps before it can be triaged or fixed p1 A bug severe enough to be the next item assigned to an engineer resolution/fixed This issue was fixed

Comments

@jaxxstorm
Copy link
Contributor

What happened?

A customer is reporting a large increase in deployment times when upgrading the AWS provider for a stack of approximately 1k resources.

Example

I have a performance trace for the issue. Please DM me to see it

Output of pulumi about

N/A

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@jaxxstorm jaxxstorm added kind/bug Some behavior is incorrect or out of spec needs-triage Needs attention from the triage team labels Nov 14, 2023
@jaxxstorm
Copy link
Contributor Author

Important notes:

up is run with PULUMI_EXPERIMENTAL=1 PULUMI_SKIP_CHECKPOINTS=true PULUMI_OPTIMIZED_CHECKPOINT_PATCH=true pulumi up --yes

@jaxxstorm
Copy link
Contributor Author

After viewing the trace, it seems every operation is taking considerably longer than expected

CleanShot 2023-11-14 at 13 44 38@2x

@t0yv0 t0yv0 added the impact/performance Something is slower than expected label Nov 14, 2023
@mikhailshilkov
Copy link
Member

@mjeffryes @t0yv0 Does one of you want to work with Lee on reproducing this issue?

@jaxxstorm
Copy link
Contributor Author

@mikhailshilkov we have all the data in this slack thread, we need eyes on this ASAP:
https://pulumi.slack.com/archives/C9SGX9QA1/p1699996996558019

@mjeffryes
Copy link
Member

Summarizing the investigation so far:

  • The customer is able to reliably trigger slow updates and imports with the v6 plugin, and fast updates with the v5 plugin; they have confirmed that the environment and program for each command are identical.
  • Pulumi has not yet successfully reproduced the slowdown.
  • We suspect that the problem might be in the state recorded by v6:
    • The slow traces provided by the customer show many unusually long spans that don't directly depend on the resource provider (eg. the language plugin taking 100x longer to start)
    • The customer has observed that even imports with v5 are slow after running an upgrade with v6 (and vice versa, the initial import with v6 is not slow when the last upgrade was made with v5)

@mikhailshilkov mikhailshilkov added needs-repro Needs repro steps before it can be triaged or fixed p1 A bug severe enough to be the next item assigned to an engineer and removed needs-triage Needs attention from the triage team labels Nov 17, 2023
@t0yv0 t0yv0 assigned t0yv0 and unassigned mjeffryes Nov 17, 2023
@t0yv0
Copy link
Member

t0yv0 commented Nov 21, 2023

Quick update here. It appears that CheckConfig/Configure are taking 10 and 20 minutes respectively, and the customer stack performs them twice (two instances of AWS provider), leading to an unacceptable slowdown.

@jaxxstorm contributed a workaround that is now released as v5.43.0 and reduces the customer urgency to land this fix on v6. This buys a little time but it remains a high priority for us to fix since at some point the customer will be compelled again to upgrade to v6.

In terms of the actual fix, we still do not have a solid repro and we found we have an instrumentation gap. The traces do not explain why Configure/CheckConfig are slow and the verbose logs are silent around the timestamp gaps.

With pulumi/pulumi-terraform-bridge#1534 we're trying to add visibility into Configure/CheckConfig for both SDKv2 an PF sub-providers. This change also adds memory instrumentation to get an idea of RAM statistics for the pulumi-resource-aws Go process.

IN addition we are attempting to fix pulumi/pulumi-terraform-bridge#1536 which should recover Terraform-initiated logs under TF_LOG=trace environment variable.

With these two pieces of instrumentation we'll work through another round of customer testing to gather more information.

Current hypotheses:

  • increased RAM consumption on V6

  • pathologically slow Configure or CheckConfig on the PF sub-provider - similar to Performance degradation with custom endpoints #2880 which is fixed on the version the customer is trying but does indicate this is a problematic area.

t0yv0 added a commit to pulumi/pulumi-terraform-bridge that referenced this issue Nov 27, 2023
IN pulumi/pulumi-aws#2987 a customer is
experiencing very long Configure and CheckConfig times but our traces
and logs provide limited visibility into what's going on.

This PR adds a little more information. This is complicated by the fact
that there's no root span for the provider and the provider process can
be killed by the engine at any moment, so finalizing the root span is
not reliable. Therefore the strategy of collecting mem stats in a
background thread and tagging them to the root span, as in pulumi/pulumi
does not work very well.

Changes:

- add spans for key sdkv2 methods GetPluginInfo Configure, CheckConfig,
Diff, Create, Update, Delete, Invoke
- all the spans sample the go memory stats for the running process and
report them as tags on the span
- since we are investigating Configure and CheckConfig specifically,
additional tags are added to show how much time is spent tin
PreConfigureCallbacks and in provider configuration validation
- the latter is also instrumented for the PF version of Configure

With these changes we have a little more visibility into PF vs SDKv2
Configure costs and have some memory stat samples to get an idea of peak
memory use for the provider. This kind of hopes that sampling memory at
the beginning of each span is sufficient; perhaps can be improved upon
by sampling every 1s for long-running spans.
@lukehoban
Copy link
Contributor

We suspect this may perf regression may have been introduced by #2949. We are working on a patch to change that implementation to avoid the additional synchronization it forces.

If that is correct, there is still an underlying issue - that some Configure call is taking 10+ minutes. However, previously this may not have been observable - the Configure call that mattered (that other resource operations depended on) was happening quickly. We still will need to collect more detailed logs from folks hitting this to root cause where that slow Configure is coming from.

But ideally, the fix to the logic in 2949 will unblock the practical perf regression a few folks are seeing here.

@t0yv0
Copy link
Member

t0yv0 commented Nov 28, 2023

v6.11.0-alpha.1 is released with a fix for the logic in #2949 and additional instrumentation.

@lukehoban
Copy link
Contributor

lukehoban commented Nov 29, 2023

The investigation in the other issue has identified the expected root cause - ultimately due to aws/aws-sdk-go-v2#2353 - notes in #3044 (comment).

@lukehoban lukehoban added the resolution/fixed This issue was fixed label Nov 29, 2023
@lukehoban
Copy link
Contributor

We released https://github.com/pulumi/pulumi-aws/releases/tag/v6.12.0 this morning, which includes the fix from aws/aws-sdk-go-v2#2355. We've also verified that all versions of http://github.com/aws/aws-sdk-go-v2/config and github.com/aws/aws-sdk-go-v2/internal/ini match upstream.

So we'll closed this out as fixed with the latest release. If you continue to see any performance issue, please do reopen!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
impact/performance Something is slower than expected kind/bug Some behavior is incorrect or out of spec needs-repro Needs repro steps before it can be triaged or fixed p1 A bug severe enough to be the next item assigned to an engineer resolution/fixed This issue was fixed
Projects
None yet
Development

No branches or pull requests

5 participants