GZ doesn't come online when MTU is configured for absent aggr #454
Comments
The admin NIC is brought up by the early-admin service, using the net-early-admin service method, for exactly this reason, so it already comes up separately from the other NICs. I would need to see your config to determine exactly what went wrong. You may be running into a bug, but it's also possible that you've specified an invalid configuration that we could never make sense of, in which case failure is the only option. Again, I need to see your config file to reproduce the situation.
My config file is as follows:
# cat /usbkey/config
# Note: This file must be source-able by bash
# Ethernet configuration
admin_nic=18:03:73:ad:6c:8e
admin_ip=dhcp
admin_ip6=addrconf
headnode_default_gateway=none
# Aggregated SFP+ configuration
aggr0_aggr=90:1b:e:6d:c4:82,90:1b:e:6d:c4:83
aggr0_lacp_mode=active
aggr0_mtu=9000
internal_nic=aggr0
internal_mtu=9000
internal_ip=10.255.255.2
internal_netmask=255.255.255.252
# Hostname
hostname=blackserver
# DNS setup
dns_domain=internal.dvdgiessen.nl
dns_resolvers=192.168.1.1,8.8.8.8
# NTP servers (see http://www.pool.ntp.org/zone/nl)
ntp_hosts=0.nl.pool.ntp.org,1.nl.pool.ntp.org,2.nl.pool.ntp.org,3.nl.pool.ntp.org
compute_node_ntp_hosts=dhcp
# Load SSH authorized keys
root_authorized_keys_file=authorized_keys
Both interfaces work fine; the problem occurs if I physically remove the SFP+ PCIe card and boot the machine; I'd expect the
EDIT: Clarified that the problem is that the GZ does not come online.
Ok, I'll see what I can figure out.
To further clarify: If I comment out the
# cat /usbkey/config
# Note: This file must be source-able by bash
# Ethernet configuration
admin_nic=18:03:73:ad:6c:8e
admin_ip=dhcp
admin_ip6=addrconf
headnode_default_gateway=none
# Aggregated SFP+ configuration
aggr0_aggr=90:1b:e:6d:c4:82,90:1b:e:6d:c4:83
aggr0_lacp_mode=active
#aggr0_mtu=9000
internal_nic=aggr0
#internal_mtu=9000
internal_ip=10.255.255.2
internal_netmask=255.255.255.252
# Hostname
hostname=blackserver
# DNS setup
dns_domain=internal.dvdgiessen.nl
dns_resolvers=192.168.1.1,8.8.8.8
# NTP servers (see http://www.pool.ntp.org/zone/nl)
ntp_hosts=0.nl.pool.ntp.org,1.nl.pool.ntp.org,2.nl.pool.ntp.org,3.nl.pool.ntp.org
compute_node_ntp_hosts=dhcp
# Load SSH authorized keys
root_authorized_keys_file=authorized_keys
With this config the built-in admin NIC works fine and the GZ comes online regardless of whether the aggr NIC is physically present in the system. Thus, it is specifically the failure to configure the MTU on a nonexistent NIC that seems to cause this. However, if these lines are not commented out (as in the config in the previous comment) AND the aggr NIC is not physically present in the machine, then the global zone does not come online (but the NIC does; other VMs on that NIC do appear on the network).
Ah, I did see the early-admin service but did not assume it was activated by default, since the comments mentioned it was used specifically for PXE-booting compute nodes. After checking, I see the service is active. But looking through the code now, it seems that early-admin does not do anything when
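The failing combination described above (an aggregation with an explicit MTU whose member NICs are absent) can be checked for ahead of time. The following is a rough sketch only: `risky_aggrs` is a hypothetical helper, not part of SmartOS, and the list of present MAC addresses is passed in by hand rather than read from the hardware. It also only looks at `<name>_aggr` and `<name>_mtu` keys, not at NIC tags such as `internal_mtu`.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every aggregation in the config
# that has an uncommented <name>_mtu line but is missing a member NIC.
risky_aggrs() {
    config=$1
    present=$2      # space-separated MAC addresses assumed to be present
    # Turn each "<name>_aggr=<mac>,<mac>" line into "<name> <mac>,<mac>".
    sed -n 's/^\([A-Za-z0-9]*\)_aggr=\(.*\)$/\1 \2/p' "$config" |
    while read -r name members; do
        # Skip aggregations without an explicit (uncommented) MTU.
        grep -q "^${name}_mtu=" "$config" || continue
        for mac in $(echo "$members" | tr ',' ' '); do
            case " $present " in
                *" $mac "*) ;;                 # member is present
                *) echo "$name"; break ;;      # member is missing
            esac
        done
    done
}

# Example with a trimmed-down copy of the config from this thread,
# pretending that only the admin NIC is physically present.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
admin_nic=18:03:73:ad:6c:8e
aggr0_aggr=90:1b:e:6d:c4:82,90:1b:e:6d:c4:83
aggr0_mtu=9000
EOF
risky_aggrs "$cfg" "18:03:73:ad:6c:8e"
rm -f "$cfg"
```

With the `aggr0_mtu` line commented out, as in the second config, the helper reports nothing, which matches the observation that the GZ then comes online.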
I got around to taking another look at this. This time, with some helpful log output. :)
So it is indeed failing because it cannot set the MTU on a non-existing device.
My assumption was a bit off. This wouldn't have helped, because it never reaches this point because sysinfo doesn't have the aggregation and Instead, it fails in
So a variation of this might instead be more appropriate?
When I remove a secondary NIC from one of my SmartOS systems, the admin NIC doesn't get fully configured in the GZ. The global zone doesn't come online; however, VMs configured on that same admin NIC do show up on the network.
From reading the code (it's a headless system built from random consumer-grade parts I had lying around, so there is no IPMI to easily debug the problem in situ while the admin NIC is down), my problem is probably that the NIC I removed was part of an aggr with a custom MTU set in /usbkey/config, and configuring this MTU happens before the admin NIC is initialized, and if configuring that MTU fails we exit with a fatal error.
Note that not being able to create the aggr does not in itself seem to trigger an immediate exit; only when a custom MTU is configured do we end with a fatal error. It would be nice if, when configuring some other NIC fails, the admin NIC were still fully configured so that the GZ is at least accessible over the network.
Context: This is where the aggr setup and MTU setup happens before the admin NIC configuration:
illumos-joyent/usr/src/cmd/svc/milestone/net-physical
Lines 476 to 482 in 5760e8d
To fix this, we can perhaps skip trying to set the MTU on the aggr if creating that aggr failed, since that in itself is not a fatal error (apparently). It would be as simple as moving the MTU part into the if-check above it here:
illumos-joyent/usr/src/cmd/svc/milestone/net-physical
Lines 277 to 289 in 5760e8d
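The idea of moving the MTU step into the if-check can be illustrated roughly as follows. This is not the actual net-physical code: `create_aggr`, `set_link_mtu` and `setup_aggr` are hypothetical stand-ins that simulate the hardware instead of calling dladm, so the sketch only shows the control flow.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical and simulate the hardware;
# they are not the functions used in net-physical.

PRESENT_MACS=""        # simulated: MAC addresses physically present
MTU_LOG=""             # records every MTU change that was attempted

# Hypothetical helper: succeeds only if every member MAC is present.
create_aggr() {
    aggr_name=$1
    members=$2         # comma-separated MAC list, as in aggr0_aggr=...
    old_ifs=$IFS
    IFS=,
    for mac in $members; do
        case " $PRESENT_MACS " in
            *" $mac "*) ;;
            *) IFS=$old_ifs; return 1 ;;
        esac
    done
    IFS=$old_ifs
    return 0
}

# Hypothetical helper: records the request instead of changing a link.
set_link_mtu() {
    MTU_LOG="$MTU_LOG$1=$2 "
}

# The MTU step sits inside the success branch, so a missing aggregation
# yields a warning rather than a later fatal MTU error.
setup_aggr() {
    name=$1; members=$2; mtu=$3
    if create_aggr "$name" "$members"; then
        if [ -n "$mtu" ]; then
            set_link_mtu "$name" "$mtu"
        fi
    else
        echo "warning: could not create $name; skipping MTU" >&2
    fi
    return 0
}

# Example: the SFP+ card is absent, so only the admin NIC is present.
PRESENT_MACS="18:03:73:ad:6c:8e"
setup_aggr aggr0 "90:1b:e:6d:c4:82,90:1b:e:6d:c4:83" 9000
```

In this simulation no MTU change is attempted for `aggr0` when its members are missing, and `setup_aggr` still returns success so that later steps can run.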
(And perhaps something can be said for also moving setup_mtu so that MTU failures don't impact the admin interface being brought up, though that can also be a separate change.)
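A different route to the same goal, sketched here purely as an assumption and not as what net-physical does, is to leave the ordering alone and treat an MTU failure as fatal only when it concerns the admin link. `apply_mtu`, `setup_mtus` and the link name `admin0` are hypothetical, and the MTU step is simulated rather than applied to real links.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers that simulate MTU configuration.

ADMIN_LINK="admin0"     # hypothetical link name for the admin NIC
MISSING_LINKS=""        # simulated: links that do not exist
WARNINGS=0              # number of non-fatal MTU failures seen

# Simulated MTU step: fails for any link listed in MISSING_LINKS.
apply_mtu() {
    case " $MISSING_LINKS " in
        *" $1 "*) return 1 ;;
    esac
    return 0
}

# Apply the MTU to each link; only an admin-link failure aborts.
setup_mtus() {
    for link in "$@"; do
        if apply_mtu "$link"; then
            continue
        fi
        if [ "$link" = "$ADMIN_LINK" ]; then
            echo "fatal: could not set MTU on admin link $link" >&2
            return 1
        fi
        echo "warning: could not set MTU on $link, continuing" >&2
        WARNINGS=$((WARNINGS + 1))
    done
    return 0
}

# Example: the aggregation is missing, the admin link is fine.
MISSING_LINKS="aggr0"
if setup_mtus admin0 aggr0; then
    echo "admin NIC setup can proceed ($WARNINGS warning(s))"
fi
```

Compared with moving setup_mtu, this keeps the MTU step ahead of the admin interface being brought up, which may matter if a link's MTU has to be set before the interface is in use.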