Strange HA error after VMware ESX upgrade

Recently, the company brought in a clownsultant (supposedly he'd not only taken a VCP but also been a VCP course instructor) to upgrade our VMware ESX cluster to ESX 3.5 and VC 2.5. This led to a number of interesting issues, which I'll list here, with solutions, for your enjoyment:

 

1. Upgrade failed with the server stuck at GRUB after reboot
Solution: disconnect the fibers from the HBA before upgrading. They seem to confuse grub-install even if you select the right boot device (internal RAID) during the install/upgrade.

2. HA failed to enable after all servers in the cluster were upgraded
This issue will need a long explanation...

The first indication of this problem is the error message "Unable to contact primary HA agent" when turning on HA for the first time after the upgrades/reinstalls. Examining the individual hosts led to the discovery of an interesting message on the first server in the cluster: "Command 'hostname -s' on host vm01.domain.tld failed or returned incorrect name format" (vm01 being the only host the clownsultant actually installed before giving up and going home). A strange error, as all settings seemed okay, not only when viewed from the VC but also from the console. Running "hostname -s" manually, however, resulted in a "hostname: Unknown host" error message.

The solution: run "hostname vm01.domain.tld" manually in the console, and reconfigure for HA once more.
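
If you want to check the other hosts before turning HA back on, something along these lines should do. It's only a rough sketch: short_of and check_short_hostname are names I made up for illustration, vm01.domain.tld stands in for whatever your host is really called, and I'm assuming a plain POSIX shell on the service console.

```shell
#!/bin/sh
# Rough sketch: verify that `hostname -s` returns the short name we expect.
# short_of and check_short_hostname are illustrative names, not VMware tools.

# Strip the domain part from an FQDN: vm01.domain.tld -> vm01
short_of() {
    printf '%s\n' "${1%%.*}"
}

# Compare the output of `hostname -s` with the short form of the expected FQDN.
check_short_hostname() {
    expected_fqdn=$1
    expected_short=$(short_of "$expected_fqdn")

    # This is the command the HA error message complained about.
    if ! actual=$(hostname -s 2>/dev/null); then
        echo "hostname -s failed; try: hostname $expected_fqdn"
        return 1
    fi

    if [ "$actual" != "$expected_short" ]; then
        echo "hostname -s returned '$actual', expected '$expected_short'"
        return 1
    fi

    echo "hostname -s OK: $actual"
}
```

Load the functions into your shell and run "check_short_hostname vm01.domain.tld" on each host. If it complains, set the hostname as described above and reconfigure for HA.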

Things we've learned from this:
1. Don't upgrade; a reinstall is faster and causes less pain.
2. Disconnect the HBAs either way if you're booting from an internal disk/RAID.
3. Don't hire clownsultants without proper background checks.