OK, so jollity aside... the issue is that a large number of APs failed to preload, and we have no pointers as to why. The AP303h and the AP203h appeared to be disproportionately represented. AOS created partitions of approximately 120 APs at a time, and almost every partition had at least one AP that didn't preload. Each failure holds up the entire process for a good 50 minutes.
The rolling upgrade process is definitely our preferred method. We aim to communicate what's going on to users, but we know the vast majority of students living on campus won't actually see the message, for one reason or another. The design in many of our accommodation buildings should allow for a hitless upgrade.
I went to bed. Sadly, when I got up, the upgrade had not completed.
I agree the aim should be to avoid downtime, but it seems clear that once an AP has failed to preload, that's it: it doesn't appear to matter how long you wait or how many times you retry. The AP is in a state where it's not going to work, so it would be much better to move on as soon as a preload fails.
In our case, 67 APs failed to preload, the upgrade took over 11 hours, and those APs ultimately still just had to be rebooted. But now that's happening in the morning, when it affects far more users. If the system had simply continued instead of retrying for 50 minutes per partition, those APs would have been kicked in the middle of the night... much better.
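To put the numbers in perspective, here's a rough sketch (Python) of how fast those per-partition retry holds add up. The ~120-AP partition size and ~50-minute hold are from our run; the fleet size below is a hypothetical figure for illustration only.

```python
# Back-of-envelope estimate of time lost to preload retries.
# PARTITION_SIZE and RETRY_HOLD_MIN are observed values from the post;
# the 1,500-AP fleet size in the example is an assumption.

PARTITION_SIZE = 120   # APs per AOS upgrade partition (observed)
RETRY_HOLD_MIN = 50    # minutes a partition stalls on a failed preload (observed)

def wasted_minutes(total_aps: int, stalled_fraction: float = 1.0) -> float:
    """Minutes lost to retries if `stalled_fraction` of partitions each
    contain at least one AP that fails to preload."""
    partitions = -(-total_aps // PARTITION_SIZE)  # ceiling division
    return partitions * stalled_fraction * RETRY_HOLD_MIN

# A hypothetical 1,500-AP estate where almost every partition stalls:
# 13 partitions * 50 min = 650 minutes (~10.8 hours) of retries alone.
print(wasted_minutes(1500))  # 650.0
```

That lines up with what we saw: when nearly every partition hits a failed AP, the retry holds alone can account for most of an 11-hour upgrade window.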
I like this upgrade approach... I want to use it... I'm trying to be constructive here, people.