I posted about this on my Mastodon some time ago but I wanted to type it out here so that it may be indexed by search engines and someone might see it so they don’t have to suffer like I did.
A few months ago, I had a genius idea that was going to greatly improve the way we did our automation. It was going to save so much time! What if we could stop relying on VMware to put the interfaces in the right order using the system-assigned interfaces of ethX or ensXXX? Well, I found a little feature called labels in the Netplan documentation that seemed like the perfect solution. I tested it and yes, the interfaces all came up nicely labelled in Ansible and everything looked awesome.
The first hint that something was amiss came in the form of snmpd spamming syslog with an arcane message about an invalid ifIndex. I noticed it long enough after making the change that I didn’t immediately twig to what was causing it. Searching returned absolutely no useful results, and once I figured out it was related to the labels, I tried changing their naming scheme. That stopped the error: for some reason, one of the underlying C libraries makes big assumptions about how interfaces are named.
All was well and everything was happily ticking along until I went to roll out a new Kubernetes cluster. Unfortunately, this one took much longer to troubleshoot because the level of abstraction was much greater. Basically, we were spinning up a new control plane node, all good, but for some reason one of the core DaemonSets wasn’t completing its readiness checks and remained uninitialised. I tried many things, including removing the uninitialized cloud provider taint, which pushed it further along but not far enough.
Eventually I started looking from the CPI (the cloud provider interface) upwards, and I started to make headway. Apparently, when the CPI initialises, it grabs IPs from the GuestInfo that’s shown in vSphere. I checked the VM and hey, look at that, none of the guest machine’s IPs were showing in the interface. Maybe it just needed me to update the network, so I ran a /usr/bin/vmware-toolbox-cmd info update network
to refresh the GuestInfo… nope, that wasn’t it. I started to get that sneaking suspicion that it was those damn labels again, so I removed them from the interface, ran the VMware Tools update again, restarted the rollout and…
Yep.
Labels again.
Don’t do it. Just don’t.
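If you want to check whether a host still has labels hanging around, something like the following might help. It’s only a rough sketch, and it assumes the labels in question are IP address labels (the kind ip address shows) and that your iproute2 supports JSON output. The function names labelled_addresses and read_ip_json are made up for this example.

```python
import json
import subprocess


def labelled_addresses(interfaces):
    """Return (ifname, address, label) for each address whose label
    differs from the name of the interface it sits on.

    `interfaces` is the parsed JSON from `ip -json address show`.
    """
    found = []
    for iface in interfaces:
        ifname = iface.get("ifname", "")
        for addr in iface.get("addr_info", []):
            label = addr.get("label")
            # Unlabelled IPv4 addresses report the interface name as
            # their label; IPv6 addresses usually have no label at all.
            if label and label != ifname:
                address = f"{addr.get('local')}/{addr.get('prefixlen')}"
                found.append((ifname, address, label))
    return found


def read_ip_json():
    """Run `ip -json address show` and return the parsed output."""
    result = subprocess.run(
        ["ip", "-json", "address", "show"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling labelled_addresses(read_ip_json()) on the guest should return an empty list once the labels are gone. It only tells you the labels exist; it doesn’t prove they’re what is upsetting snmpd or the VMware tools on your system.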
Side note: If anyone has a good way of labelling interfaces that is addressable in Ansible and doesn’t mess with every other damn service, I’d love to hear about it in the comments.