SCSI IDs changing on machines built with 2.60+. #2089

Closed
4 tasks done
gavinwill opened this issue Dec 13, 2023 · 11 comments · Fixed by #2115
@gavinwill

Community Guidelines

  • I have read and agree to the HashiCorp Community Guidelines.
  • Vote on this issue by adding a 👍 reaction to the original issue initial description to help the maintainers prioritize.
  • Do not leave "+1" or other comments that do not add relevant information or questions.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Terraform

Terraform v1.3.0

Terraform Provider

v2.6.0

VMware vSphere

7.0.3.01700

Description

Hi

When building a VM from an Ubuntu OVF template, we see the SCSI controller order change and, as a result, the network interface name change. This matters because we use cloud-init and specify the NIC to configure by name.

On a machine deployed with provider 2.5.1 we see the ordering we expect:


03:00.0 Serial Attached SCSI controller: VMware PVSCSI SCSI Controller (rev 02)
04:00.0 Serial Attached SCSI controller: VMware PVSCSI SCSI Controller (rev 02)
0b:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)
13:00.0 Serial Attached SCSI controller: VMware PVSCSI SCSI Controller (rev 02)
1b:00.0 Serial Attached SCSI controller: VMware PVSCSI SCSI Controller (rev 02)

This gives us ens192 on the Ubuntu machine:
2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000

When we deploy a brand new machine with provider 2.6.0+, we see the SCSI order change.

lspci from VM with provider 2.6.0 (bad):

03:00.0 Serial Attached SCSI controller: VMware PVSCSI SCSI Controller (rev 02)
04:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)
0b:00.0 Serial Attached SCSI controller: VMware PVSCSI SCSI Controller (rev 02)
0c:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)
13:00.0 Serial Attached SCSI controller: VMware PVSCSI SCSI Controller (rev 02)
1b:00.0 Serial Attached SCSI controller: VMware PVSCSI SCSI Controller (rev 02)
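The difference between the two listings can also be checked mechanically. The following is a minimal Python sketch, not part of the original report, that extracts the PCI addresses of the Ethernet controllers from the two lspci outputs above. The function name `ethernet_addresses` and the constants `GOOD` and `BAD` are hypothetical names chosen for illustration.

```python
def ethernet_addresses(lspci_output):
    """Return the PCI addresses of all 'Ethernet controller' lines."""
    addresses = []
    for line in lspci_output.strip().splitlines():
        # Each line looks like "<address> <device class>: <device name>".
        address, _, rest = line.strip().partition(" ")
        device_class = rest.split(":", 1)[0]
        if device_class == "Ethernet controller":
            addresses.append(address)
    return addresses


# lspci output from the VM built with provider 2.5.1 (as reported above).
GOOD = """
03:00.0 Serial Attached SCSI controller: VMware PVSCSI SCSI Controller (rev 02)
04:00.0 Serial Attached SCSI controller: VMware PVSCSI SCSI Controller (rev 02)
0b:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)
13:00.0 Serial Attached SCSI controller: VMware PVSCSI SCSI Controller (rev 02)
1b:00.0 Serial Attached SCSI controller: VMware PVSCSI SCSI Controller (rev 02)
"""

# lspci output from the VM built with provider 2.6.0 (as reported above).
BAD = """
03:00.0 Serial Attached SCSI controller: VMware PVSCSI SCSI Controller (rev 02)
04:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)
0b:00.0 Serial Attached SCSI controller: VMware PVSCSI SCSI Controller (rev 02)
0c:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)
13:00.0 Serial Attached SCSI controller: VMware PVSCSI SCSI Controller (rev 02)
1b:00.0 Serial Attached SCSI controller: VMware PVSCSI SCSI Controller (rev 02)
"""

print("2.5.1:", ethernet_addresses(GOOD))
print("2.6.0:", ethernet_addresses(BAD))
```

Comparing the two lists shows that the NICs sit at different PCI addresses on the VM built with the newer provider.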

This change in order means that our NIC interface name has changed, since interface names are generated from the PCI location of the device. For example, for a name such as enp0s31:
# Interface names are generated as:
# en --> ethernet
# p0 --> bus number
# s31 --> slot number

ip link on this machine shows 2: ens161: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
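As a rough illustration of the naming scheme described above, here is a minimal Python sketch that splits an interface name into its bus and slot parts. It assumes only the two name forms that appear in this issue (`ens<slot>` and `enp<bus>s<slot>`); real predictable interface naming has more variants than this covers, and `parse_ifname` is a hypothetical helper, not part of any tool mentioned here.

```python
import re

# Matches "en" + optional "p<bus>" + "s<slot>", e.g. "ens192" or "enp0s31".
# Assumption: other naming variants are deliberately not handled.
NAME_RE = re.compile(r"^en(?:p(?P<bus>\d+))?s(?P<slot>\d+)$")


def parse_ifname(name):
    """Return {'bus': int or None, 'slot': int}, or None if the name does not match."""
    match = NAME_RE.match(name)
    if match is None:
        return None
    bus = match.group("bus")
    return {
        "bus": int(bus) if bus is not None else None,
        "slot": int(match.group("slot")),
    }


for name in ("ens192", "ens161", "enp0s31"):
    print(name, parse_ifname(name))
```

With the names from this report, the slot part is the only thing that differs between the working VM (ens192) and the broken one (ens161), which is why a cloud-init network config that names ens192 no longer applies.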

Note: I need console access in VMware, since I can't SSH to the host because of the network config mismatch.

If I have a machine built with provider 2.5.1, then upgrade the provider to 2.6.0+ and run a plan, Terraform reports that the infrastructure is up to date and no changes are planned. The order seems to be incorrect only when a new VM is created with the later provider.

Upgrading modules...


Initializing provider plugins...
- Finding hashicorp/vsphere versions matching "2.6.0"...
- Installing hashicorp/vsphere v2.6.0...
- Installed hashicorp/vsphere v2.6.0 (signed by HashiCorp)

Plan
No changes. Your infrastructure matches the configuration.


Affected Resources or Data Sources

vsphere_network.network

Terraform Configuration

Will provide details in update

Debug Output

Will provide details in update

Panic Output

No response

Expected Behavior

We would expect no change in the SCSI ordering when using the new provider.

Actual Behavior

SCSI ordering is incorrect, causing issues with disks and NICs.

Steps to Reproduce

Upgrade the provider to 2.6.0+ (from verified good 2.5.1) and create a new machine.

Environment Details

No response

Screenshots

No response

References

No response
@gavinwill gavinwill added bug Type: Bug needs-triage Status: Issue Needs Triage labels Dec 13, 2023

Hello, gavinwill! 🖐

Thank you for submitting an issue for this provider. The issue will now enter into the issue lifecycle.

If you want to contribute to this project, please review the contributing guidelines and information on submitting pull requests.

@tenthirtyam
Collaborator

@vasilsatanasov can you investigate to see if this is related to the SR-IOV introduction?

@gavinwill
Author

I did think it might be SR-IOV related from a quick look at the diff between 2.5.1 and 2.6.0.

I may also be available to submit a PR with a fix.

@vasilsatanasov
Contributor

Looking at it, @gavinwill. Could you please provide an example HCL to reproduce the issue, plus the Ubuntu version you are using?

@gavinwill
Author

Hi @vasilsatanasov
Apologies, it's an Ubuntu 20.04 OVF template.

I have just tested this by building the provider against different commits (go build -o terraform-provider-vsphere) and using dev_overrides for the provider installation. I can confirm that the last commit that works for me is 6211c3b.

If I taint the machine and rebuild with the provider built against 9c25530, it fails: the SCSI device order is wrong, and hence we get the ens161 NIC.

We use a slightly customised module. I am paring that down to minimal standalone code so that you can reproduce it.

@vasilsatanasov
Contributor

Thank you @gavinwill , waiting for the code for reproduction!

@gavinwill
Author

gavinwill commented Dec 14, 2023

Hi

I have "converted" our module to a simple tf file with hard-coded values, and I can reproduce the issue with the config below.

If I pin the provider to 2.5.1 and apply (after cleaning out the .terraform folder and re-running terraform init, to be sure), the machine boots up fine with the expected SCSI order and the NIC is ens192.

If I clean out the .terraform folder and update the provider to 2.6.0+, the machine boots up but the NIC is ens161 and the SCSI ordering has changed. The only change to the Terraform config is the provider version.

terraform {
  required_providers {
    vsphere = {
      source  = "hashicorp/vsphere"
      version = "2.6.1"
    }
  }
  required_version = ">= 1.3.0"
}

provider "vsphere" {
  vsphere_server       = "vsphereserver.com"
  user                 = "[email protected]"
  password             = "hunter2"

}

resource "vsphere_virtual_machine" "vm" {
  name                    = "gt-gavintest-01"
  resource_pool_id        = "resgroup-1234"
  folder                  = "test"
  extra_config = {
    "guestinfo.metadata"          = "our metadata for cloud init including netplan for ens192"
    "guestinfo.userdata"          = "our base64 userdata"
    "guestinfo.userdata.encoding" = "base64"
  }

  extra_config_reboot_required  = false
  firmware                      = "bios"
  efi_secure_boot_enabled       = false
  enable_disk_uuid              = false
  datastore_id                  =  "datastore-1234"

  num_cpus               = 4
  num_cores_per_socket   = 2
  cpu_hot_add_enabled    = true
  cpu_hot_remove_enabled = true
  memory                 = 8192
  guest_id               = "ubuntu64Guest"
  scsi_bus_sharing       = "noSharing"
  scsi_type              = "pvscsi"
  scsi_controller_count  = 4
  wait_for_guest_net_routable = false
  wait_for_guest_ip_timeout   = 0 
  wait_for_guest_net_timeout  = 5

  dynamic "network_interface" {
    for_each = local.networks
    content {
      network_id   = "dvportgroup-1234"
      adapter_type = "vmxnet3"
      ovf_mapping  = "nic${network_interface.key}"
    }
  }

  disk {
    label            = "disk0"
    size             = 72
    unit_number      = 0
    thin_provisioned = true
    eagerly_scrub    = false
    datastore_id     = "datastore-1234"
    io_reservation   = 0
    io_share_level   = "normal"
    io_share_count   = 1000
  }
  
  clone {
    template_uuid = "12345-78910-121213-1415-16171819e"
    linked_clone  = false
    timeout       = 30
  }

  hv_mode                          = "hvAuto"
  ept_rvi_mode                     = "automatic"
  nested_hv_enabled                = false
  enable_logging                   = false
  cpu_performance_counters_enabled = false
  swap_placement_policy            = "inherit"
  latency_sensitivity              = "normal"
  shutdown_wait_timeout = 3
  force_power_off       = false
}

Our locals contain:

locals {    
    networks = [
      { "addresses" : ["10.12.13.14/24"], },
      { "addresses" : [], },
    ]
}

We use the addresses to populate our cloud-init config and do a for_each on the key in the Terraform config above.
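To make the key-to-NIC mapping concrete, here is a small Python sketch of what the dynamic "network_interface" block above produces for ovf_mapping. It relies on the fact that, when for_each iterates a list in a dynamic block, the iterator key is the element index. The function name `ovf_mappings` is hypothetical, and this is an analogue of the HCL, not something the provider runs.

```python
# Same shape as local.networks in the config above.
networks = [
    {"addresses": ["10.12.13.14/24"]},
    {"addresses": []},
]


def ovf_mappings(networks):
    """Mirror ovf_mapping = "nic${network_interface.key}" for a list input."""
    # For a list, the dynamic block key is the zero-based index.
    return [f"nic{key}" for key, _ in enumerate(networks)]


print(ovf_mappings(networks))
```

Each entry in the list therefore yields one network interface, and the cloud-init network config is expected to line up with those interfaces by name inside the guest.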

Hope this helps

@tenthirtyam tenthirtyam added this to the v2.7.0 milestone Dec 14, 2023
@adamhorden

adamhorden commented Dec 20, 2023

I hit this same issue today. I could not work out why the order was incorrect on new VM builds before finding this issue. VMs would come up, but the network would not, so they needed manual intervention via the console.

v2.5.1:

03:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)

v2.6.1:

04:00.0 Ethernet controller: VMware VMXNET3 Ethernet Controller (rev 01)

This causes the network to not come up as ens160 now becomes ens224.
Terraform Plans on v2.6.1 are clean but any new VMs on v2.6.1 have the incorrect order.
For the moment pinning to v2.5.1 works as expected.

Adam Horden

@tenthirtyam
Collaborator

@vasilsatanasov - this might be related to the SR-IOV enhancement?

@vasilsatanasov
Contributor

@vasilsatanasov - this might be related to the SR-IOV enhancement?

Looks like it is, as per @gavinwill's report.

vasilsatanasov added a commit to vasilsatanasov/terraform-provider-vsphere that referenced this issue Jan 15, 2024
After the introduction of the SR-IOV feature, network adapters were
recreated after clone because an inconsistent check for changes in the
`physical_function` attribute always returned `true`. As a result,
instead of updating the existing NICs, the relocate task started after
clone always deleted the existing NICs and created new ones. This
caused the new VM to be disconnected from the network.

Changed the check for changes in the `physical_function` attribute to
treat nil and empty string equally, so a missing `physical_function`
attribute in the device compared to an empty string from the schema is
not considered a change.

Testing done: cloned VM from template with 2 nics and verified that
there is network connectivity. Also verified the output from lspci
command on the template VM and on the clone VM.

Fixes hashicorp#2089

Signed-off-by: Vasil Atanasov <[email protected]>
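The nil-versus-empty-string comparison described in the commit message can be illustrated with a small analogue. This is a Python sketch under stated assumptions, not the provider's Go code: `None` stands in for Go's nil, and both function names are hypothetical.

```python
def naive_changed(device_value, schema_value):
    # Direct comparison: a missing value (None) never equals "",
    # so this reports a change even when neither side is actually set.
    return device_value != schema_value


def physical_function_changed(device_value, schema_value):
    # Normalize None to "" on both sides before comparing, so a missing
    # attribute and an empty string count as the same value.
    return (device_value or "") != (schema_value or "")


# Device has no physical_function; schema default is the empty string.
print(naive_changed(None, ""))
print(physical_function_changed(None, ""))
```

With the naive comparison, every NIC without a physical function looks changed, which matches the behaviour the commit message describes. The normalized comparison reports a change only when one side holds a real value.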
tenthirtyam pushed a commit that referenced this issue Jan 16, 2024

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 16, 2024
@tenthirtyam tenthirtyam removed the needs-triage Status: Issue Needs Triage label Apr 29, 2024