ArcBox-Full deployment hangs Waiting for Kubernetes control plane to be in Provisioned phase... in multiple subscriptions #2087

Closed
mausolfj opened this issue Aug 31, 2023 · 10 comments
Assignees
Labels
ArcBox Jumpstart ArcBox related user_fix Fixed by the user or user error

Comments

@mausolfj
Contributor

mausolfj commented Aug 31, 2023

Is your issue related to a Jumpstart scenario, ArcBox, HCIBox, or Agora?
https://azurearcjumpstart.io/azure_jumpstart_arcbox/full/

Describe the issue or the bug

When attempting to deploy ArcBox-Full, module.capi_vm[0].azurerm_virtual_machine_extension.custom_script is still creating long after ArcBox-Client is complete. The jumpstart_logs/installCAPI.log shows that the Rancher K3s cluster (arcbox-capi-mgmt) is ready, the providers are installed, and the script begins deploying the Kubernetes workload cluster. The following resources are created before an endless loop of "Waiting for Kubernetes control plane to be in Provisioned phase..." starts:

kubeadmconfigtemplate.bootstrap.cluster.x-k8s.io/arcbox-capi-data-oaeg-md-0 created
cluster.cluster.x-k8s.io/arcbox-capi-data-oaeg created
machinedeployment.cluster.x-k8s.io/arcbox-capi-data-oaeg-md-0 created
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/arcbox-capi-data-oaeg-control-plane created
azurecluster.infrastructure.cluster.x-k8s.io/arcbox-capi-data-oaeg created
azureclusteridentity.infrastructure.cluster.x-k8s.io/cluster-identity created
azuremachinetemplate.infrastructure.cluster.x-k8s.io/arcbox-capi-data-oaeg-control-plane created
azuremachinetemplate.infrastructure.cluster.x-k8s.io/arcbox-capi-data-oaeg-md-0 created
Waiting for Kubernetes control plane to be in Provisioned phase...
Waiting for Kubernetes control plane to be in Provisioned phase...
Waiting for Kubernetes control plane to be in Provisioned phase...

To Reproduce

Expected result: CAPI cluster provisioning completes and ArcBox-Full is provisioned.

Environment summary

$ az version
{
"azure-cli": "2.51.0",
"azure-cli-core": "2.51.0",
"azure-cli-telemetry": "1.1.0",
"extensions": {}
}
$ terraform version
Terraform v1.5.6
on linux_amd64

+ provider registry.terraform.io/hashicorp/azurerm v2.99.0
+ provider registry.terraform.io/hashicorp/random v3.5.1

Have you looked at the Troubleshooting and Logs section?

Yes, I have reviewed the log files and then attempted ArcBox-Full deployment in another subscription with less stringent security measures but got the same results...

Screenshots
VSCode terminal window

image
CAPI VM jumpstart_logs/installCAPI.log
image

CAPI cluster still stuck in provisioning state two hours later

image

More kubectl information

Additional context

@mausolfj mausolfj added the triage issue or feature up for triage label Aug 31, 2023
@github-actions

Hey friend! Thanks for opening this issue. We appreciate your contribution and welcome you to our community! We are glad to have you here and to have your input on the Azure Arc Jumpstart.

@mausolfj
Contributor Author

To Reproduce:

  1. check az version then log in
  2. verify region and vCPUs
  3. register providers
  4. create SP
  5. generate ssh key pair
  6. go to deployment option 4: terraform deployment
  7. clone repo, download terraform
  8. create terraform.tfvars
  9. run terraform init, plan and apply
  10. review terminal output
  11. review jumpstart_logs/installCAPI.log
  12. run kubectl and clusterctl commands to gather information

@zaidmohd zaidmohd self-assigned this Aug 31, 2023
@zaidmohd
Contributor

zaidmohd commented Sep 6, 2023

@mausolfj: Could you please provide the detailed error you got in Terraform, as I was not able to repro it. Meanwhile, I am trying again with Bastion enabled.

image

Also, I recommend deleting the screenshot with SPN secrets and rotating the secrets. Thank you.

@mausolfj
Contributor Author

mausolfj commented Sep 6, 2023

Appreciate you looking into this, Zaid. Additional context is included below.
Following: https://azurearcjumpstart.io/azure_jumpstart_arcbox/full/
Deployment Option 4: Terraform Deployment
Run terraform init, terraform plan -out=infra.out, then terraform apply "infra.out"

...
azurerm_resource_group.rg: Creating...
azurerm_resource_group.rg: Creation complete after 1s [id=/subscriptions/0d4b9684-ad97-4326-8ed0-df8c5b780d35/resourceGroups/ArcBox906]
...
module.management_artifacts.azurerm_network_security_rule.allow_SQLMI_traffic: Creating...
module.management_artifacts.azurerm_network_security_rule.allow_k8s_443: Creating...
module.management_artifacts.azurerm_network_security_rule.allow_traefik_lb_external: Creating...
module.management_artifacts.azurerm_network_security_rule.allow_k8s_80: Creating...
module.management_artifacts.azurerm_network_security_rule.allow_SQLMI_mirroring_traffic: Creating...
module.management_artifacts.azurerm_network_security_rule.allow_Postgresql_traffic: Creating...
module.management_artifacts.azurerm_network_security_rule.allow_k8s_8080: Creating...
...
module.management_artifacts.azurerm_bastion_host.bastionHost[0]: Still creating... [1m30s elapsed]
module.management_policy.azurerm_resource_group_policy_assignment.policies["1"]: Still creating... [1m30s elapsed]
module.management_policy.azurerm_resource_group_policy_assignment.policies["0"]: Still creating... [1m30s elapsed]
module.management_policy.azurerm_resource_group_policy_assignment.policies["2"]: Still creating... [1m30s elapsed]
...
module.management_policy.azurerm_role_assignment.policy_AMA_role_0[0]: Still creating... [20s elapsed]
module.management_policy.azurerm_role_assignment.policy_AMA_role_1[0]: Still creating... [20s elapsed]
module.management_policy.azurerm_role_assignment.policy_AMA_role_0[0]: Creation complete after 21s [id=/subscriptions/0d4b9684-ad97-4326-8ed0-df8c5b780d35/resourceGroups/ArcBox906/providers/Microsoft.Authorization/roleAssignments/c4e25bca-084f-c6db-e140-f9979c1f9edd]
module.management_policy.azurerm_role_assignment.policy_AMA_role_2[0]: Creation complete after 22s [id=/subscriptions/0d4b9684-ad97-4326-8ed0-df8c5b780d35/resourceGroups/ArcBox906/providers/Microsoft.Authorization/roleAssignments/049fceb9-81a0-8ed1-00ae-a2e44b491e58]
module.management_policy.azurerm_role_assignment.policy_defender_kubernetes[0]: Creation complete after 22s [id=/subscriptions/0d4b9684-ad97-4326-8ed0-df8c5b780d35/resourceGroups/ArcBox906/providers/Microsoft.Authorization/roleAssignments/8caacc7c-485f-5b48-f488-995974261330]
...
module.management_artifacts.azurerm_bastion_host.bastionHost[0]: Still creating... [8m0s elapsed]
module.management_artifacts.azurerm_bastion_host.bastionHost[0]: Creation complete after 8m4s [id=/subscriptions/0d4b9684-ad97-4326-8ed0-df8c5b780d35/resourceGroups/ArcBox906/providers/Microsoft.Network/bastionHosts/ArcBox-Bastion]
module.rancher_vm[0].data.azurerm_subscription.primary: Reading...
module.rancher_vm[0].data.azurerm_resource_group.rg: Reading...
module.capi_vm[0].data.azurerm_subscription.primary: Reading...
module.client_vm.data.azurerm_subscription.primary: Reading...
...
module.client_vm.azurerm_virtual_machine.client: Still creating... [10s elapsed]
module.capi_vm[0].azurerm_virtual_machine.client: Still creating... [10s elapsed]
module.rancher_vm[0].azurerm_virtual_machine.client: Still creating... [10s elapsed]
...
module.client_vm.azurerm_virtual_machine_extension.custom_script: Still creating... [2m40s elapsed]
module.capi_vm[0].azurerm_virtual_machine_extension.custom_script: Still creating... [2m30s elapsed]
module.rancher_vm[0].azurerm_virtual_machine_extension.custom_script: Still creating... [2m30s elapsed]
... rancher completes and we go into a tighter loop ..

module.client_vm.azurerm_virtual_machine_extension.custom_script: Still creating... [26m20s elapsed]
module.capi_vm[0].azurerm_virtual_machine_extension.custom_script: Still creating... [26m40s elapsed]
module.client_vm.azurerm_virtual_machine_extension.custom_script: Still creating... [26m30s elapsed]
module.capi_vm[0].azurerm_virtual_machine_extension.custom_script: Still creating... [26m50s elapsed]
module.client_vm.azurerm_virtual_machine_extension.custom_script: Still creating... [26m40s elapsed]
module.client_vm.azurerm_virtual_machine_extension.custom_script: Creation complete after 26m46s [id=/subscriptions/0d4b9684-ad97-4326-8ed0-df8c5b780d35/resourceGroups/ArcBox906pm/providers/Microsoft.Compute/virtualMachines/ArcBox-Client/extensions/ArcBox-Client]
module.capi_vm[0].azurerm_virtual_machine_extension.custom_script: Still creating... [27m0s elapsed]
module.capi_vm[0].azurerm_virtual_machine_extension.custom_script: Still creating... [27m10s elapsed]
module.capi_vm[0].azurerm_virtual_machine_extension.custom_script: Still creating... [27m20s elapsed]
module.capi_vm[0].azurerm_virtual_machine_extension.custom_script: Still creating... [27m30s elapsed]
module.capi_vm[0].azurerm_virtual_machine_extension.custom_script: Still creating... [27m40s elapsed]
module.capi_vm[0].azurerm_virtual_machine_extension.custom_script: Still creating... [27m50s elapsed]
module.capi_vm[0].azurerm_virtual_machine_extension.custom_script: Still creating... [28m0s elapsed]
... this message repeats until the deadline is exceeded ...
module.capi_vm[0].azurerm_virtual_machine_extension.custom_script: Still creating... [59m51s elapsed]

│ Error: Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded

│ with module.capi_vm[0].azurerm_virtual_machine_extension.custom_script,
│ on modules/kubernetes/ubuntuCapi/main.tf line 164, in resource "azurerm_virtual_machine_extension" "custom_script":
│ 164: resource "azurerm_virtual_machine_extension" "custom_script" {

... Terraform reports no error of its own, only a deadline exceeded, which is essentially a timeout on the custom script installCAPI.sh
... Reviewing the logs on the CAPI VM (ArcBox-CAPI-MGMT), we see the repeating line "Waiting for Kubernetes control plane to be in Provisioned phase..."
... In installCAPI.sh, that message corresponds to line 194:
until kubectl get cluster --all-namespaces | grep -q "Provisioned"; do echo "Waiting for Kubernetes control plane to be in Provisioned phase..." && sleep 20 ; done
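
Because the loop above has no exit condition other than success, it keeps printing until the extension times out. A minimal sketch of a bounded variant follows. The function name `wait_for_provisioned` and its two arguments are hypothetical and are not part of installCAPI.sh; the `kubectl` check is the same one the script uses.

```shell
# Hypothetical helper (not part of installCAPI.sh): same check as the original
# loop, but it gives up after a deadline instead of waiting forever.
# Usage: wait_for_provisioned [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
wait_for_provisioned() {
  local timeout="${1:-1800}"   # assumed default: 30 minutes
  local interval="${2:-20}"    # same 20 s pause as the original loop
  local elapsed=0

  until kubectl get cluster --all-namespaces 2>/dev/null | grep -q "Provisioned"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "Timed out after ${elapsed}s waiting for Provisioned phase" >&2
      return 1
    fi
    echo "Waiting for Kubernetes control plane to be in Provisioned phase..."
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done

  echo "Cluster reached Provisioned phase"
}
```

Calling `wait_for_provisioned 1800 20 || exit 1` would make the script fail with a clear message after roughly 30 minutes, which surfaces the stuck cluster much earlier than the extension timeout does.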

Provisioning of the CAPI VM's Kubernetes control plane should take minutes. Instead it hangs here for hours, so the custom script exceeds the default time limit of 90 min.

We have tried to figure out why we are hung here using kubectl and clusterctl commands but have yet to fully understand why the control plane is stuck in the provisioning state.
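
For anyone retracing this investigation, here is a rough sketch of the kind of kubectl and clusterctl commands that can be gathered in one pass on the management cluster. The function name `collect_capi_diagnostics` is hypothetical, and the `default` namespace for the workload cluster and the `capz-system` namespace for the Azure provider controller are assumptions based on a standard Cluster API Provider Azure install, so adjust them to match the environment.

```shell
# Hypothetical helper: gathers the usual Cluster API troubleshooting output.
# Usage: collect_capi_diagnostics CLUSTER_NAME [NAMESPACE]
collect_capi_diagnostics() {
  local cluster="$1"
  local ns="${2:-default}"   # assumption: workload cluster lives in "default"

  if [ -z "$cluster" ]; then
    echo "usage: collect_capi_diagnostics CLUSTER_NAME [NAMESPACE]" >&2
    return 2
  fi

  # Condition tree for the cluster and the objects it owns.
  clusterctl describe cluster "$cluster" --namespace "$ns" --show-conditions all

  # Core Cluster API objects and their Azure infrastructure counterparts.
  kubectl get cluster,kubeadmcontrolplane,machinedeployment,machine --namespace "$ns"
  kubectl get azurecluster,azuremachine --namespace "$ns"

  # Recent events, oldest first, to spot reconcile failures.
  kubectl get events --namespace "$ns" --sort-by=.lastTimestamp

  # Azure provider controller logs (assumed namespace and deployment name).
  kubectl logs --namespace capz-system deployment/capz-controller-manager \
    --all-containers --tail=100
}
```

For example, `collect_capi_diagnostics arcbox-capi-data-oaeg > capi-diagnostics.txt 2>&1` would capture everything in a single file that can be attached to the issue.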

Any suggestions or ideas you may have that may help us isolate or work around this deployment issue would be greatly appreciated.

Thank you - Jeff

@zaidmohd
Contributor

zaidmohd commented Sep 7, 2023

@mausolfj: Yes, I ran a deployment just now, and both Terraform and CAPI completed successfully.

Terraform
image

installCapi.log
image

kubectx
image

@mausolfj
Contributor Author

mausolfj commented Sep 7, 2023

That's good news, Zaid. Here is what we are experiencing in the WWPubSec and MCAPS subscriptions when deploying ArcBox-Full:
Terraform
image

installCAPI.log
image
... then another hour of this
image

kubectl
image
image

image image image

Do these "waiting for control plane provider", "WaitingForAvailableMachines", or "WaitingForInfrastructure" provide any clues about what may be going on with cluster provisioning?

image image

Can you recommend next steps to isolate the problem?

image

@mausolfj
Contributor Author

mausolfj commented Sep 7, 2023

Reviewing the Kubernetes events reveals an issue: the subscription ID is not set in the cluster or in an environment variable.

image

The subscriptionID is set in the terraform.tfvars with spn_tenant_id along with spn_client_id and spn_client_secret.

What is the proper way to set the subscription ID?

Shouldn't this be set after "az login", "az account show", and "az account set --subscription <sub_id>"?
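
As a rough illustration of the sequence in question, the sketch below reads the active subscription from the Azure CLI and exports it. The function name `export_subscription_id` is hypothetical. `AZURE_SUBSCRIPTION_ID` is the variable name used in the Cluster API Provider Azure quick start; whether installCAPI.sh picks it up the same way is an assumption.

```shell
# Hypothetical helper: export the active Azure CLI subscription ID.
# Assumes "az login" (and optionally "az account set --subscription <sub_id>")
# has already been run in this shell.
export_subscription_id() {
  local sub_id
  sub_id="$(az account show --query id --output tsv)" || return 1

  if [ -z "$sub_id" ]; then
    echo "No active subscription found; run 'az login' first" >&2
    return 1
  fi

  # Variable name from the CAPZ quick start; its use by ArcBox is an assumption.
  export AZURE_SUBSCRIPTION_ID="$sub_id"
  echo "AZURE_SUBSCRIPTION_ID=$AZURE_SUBSCRIPTION_ID"
}
```

Running `export_subscription_id` directly (not in a subshell) keeps the exported variable available to later commands in the same session.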

image

Referencing ArcBox docs
image

@zaidmohd zaidmohd added the ArcBox Jumpstart ArcBox related label Sep 8, 2023
@likamrat likamrat removed the triage issue or feature up for triage label Sep 11, 2023
@mausolfj
Contributor Author

We can obtain the subscription ID after az login, which gets us past the error creating the Azure services, but the CAPI cluster is still stuck in the provisioning state. The events show ClusterReconcilerNormalFailed: failed to reconcile cluster services: failed to get availability zones: failed to get zones for location: failed to refresh resource sku cache: could not list resource skus ... Running a describe on the cluster gives:
kubectl-describe-cluster

@likamrat
Contributor

@mausolfj can you please try to create a new service principal and test it? make sure you are assigning it with the proper permissions.
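
A minimal sketch of that suggestion, using the Azure CLI, is shown here. The function names are hypothetical, and the "Owner" role at subscription scope is an assumption; use whatever role and scope the ArcBox prerequisites call for.

```shell
# Hypothetical helper: create a fresh service principal scoped to a subscription.
# Usage: create_arcbox_spn SPN_NAME SUBSCRIPTION_ID
create_arcbox_spn() {
  local name="$1" subscription_id="$2"
  # Role is an assumption; check the ArcBox prerequisites for the required role.
  az ad sp create-for-rbac \
    --name "$name" \
    --role "Owner" \
    --scopes "/subscriptions/${subscription_id}"
}

# Hypothetical helper: list the role assignments held by a service principal.
# Usage: show_spn_roles APP_ID
show_spn_roles() {
  local app_id="$1"
  az role assignment list --assignee "$app_id" --all --output table
}
```

The `appId`, `password`, and `tenant` values printed by the first command correspond to spn_client_id, spn_client_secret, and spn_tenant_id in terraform.tfvars. The second command is a quick way to confirm the assignment exists before redeploying.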

@mausolfj
Contributor Author

@zaidmohd @likamrat My machine restarted after applying updates last night so this morning I went through a fresh ArcBox Full deployment and it completed successfully in under 30 minutes.

@likamrat likamrat added the user_fix Fixed by the user or user error label Oct 3, 2023