What's your time-frame/window for completion of the upgrade? Any particular reason you were asked to upgrade (features, bug fixes someone noticed) or just "standard maintenance" due to the time of the year?
We've been on ArubaOS 8.3.0.6 since the beginning of April, and it finally allowed us to enable AirGroup Centralized Mode in a stable state. Unfortunately, after about a month, we've hit three cluster-type issues. Two of them were already patched somewhere between 8.3.0.4 and 8.3.0.6 (and occurred during a fail-over) but have since returned. Note: the root cause may be different, but the "result" is the same. We have open TAC tickets for all of these.
1. We first hit this issue prior to 8.3.0.6, and it was the reason we needed to upgrade to 8.3.0.6 instead of waiting for 8.3.0.7, which had another minor fix we were looking forward to.
- "Dynamic BSS tunnel could not be setup /Denied; AP not found in STM." - Shortly after a fail-over event, some APs are unable to establish dynamic tunnels to an individual's UAC Controller, which causes the AP to send a deauthentication to the client. Other users in the area may be connected just fine (they're terminating to other UAC Controllers). We can resolve it by rebooting the AP. I built a Splunk regex query that gives me the UAC Controller, Client MAC, and BSSID, which I import into another table to get the AP-Name I need to address. An engineering ticket has been raised again for this.
2. This is the more concerning one, which we discovered a couple of weeks ago, because of what it takes to bring the APs back online. During a fail-over event OR if an MD is rebooted (TAC had us reboot one to clear out a separate stale RADIUS IP address), a small population of APs becomes "isolated" and is no longer reachable by the MDs -> SSIDs are no longer broadcast, but the bridged ports continue to function normally if a down-stream device was connected at that time. The problem -> we have to perform a shut/no shut on the uplink port to cycle the APs and restore connectivity to our MDs. I saw a similar bug in the 8.5.0.0 release notes and mentioned it to TAC. My co-worker consoled into one of the affected APs, and TAC gathered some logs and filed a ticket with engineering. They were able to restart a specific process on the AP without a reboot, but consoling into 100 APs isn't an option. We've had three separate cases of this occurring. 100 APs isn't that many compared to our overall count of 3400 APs, but when those 100 APs are randomly spread across several access switches...
3. This one was new and small. A single, specific AP was unable to establish a tunnel back to the client's UAC: "IKE_Timeouts" were being reported, and clients were deauthenticated due to "UAC Down". This one didn't generate any syslog messages like the dynamic BSS tunnel issue did, so it was a bit more difficult to see, and cycling the port did not resolve it. Although we didn't see any syslog errors, running the linked command shows the UAC Down - https://www.arubanetworks.com/techdocs/ArubaOS_83_Web_Help/content/arubaframestyles/1commandlist/show_ap_remote_debug_sapd.htm
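For anyone who wants to do the lookup from issue 1 outside of Splunk, here is a rough Python sketch of the same idea: pull the UAC Controller, Client MAC, and BSSID out of each matching line, then join the BSSID against a separate BSSID -> AP-Name table. Only the quoted "AP not found in STM" text comes from our logs as described above. The field layout (`client=`, `bssid=`, controller name as the first token), the sample lines, and the names `LINE_RE`, `SAMPLE_LINES`, and `affected_aps` are all made up for illustration, so the regex will need adjusting to whatever your syslog actually looks like.

```python
import re
from collections import defaultdict

MAC = r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}"

# ASSUMPTION: hypothetical line layout. The controller (UAC) name is the first
# token and the MACs appear as client=<mac> and bssid=<mac>. Real ArubaOS
# syslog lines will differ, so treat this pattern as a placeholder.
LINE_RE = re.compile(
    r"^(?P<uac>\S+)\s+.*?"
    r"client=(?P<client>" + MAC + r").*?"
    r"bssid=(?P<bssid>" + MAC + r").*?"
    r"AP not found in STM",
    re.IGNORECASE,
)

# Invented sample lines that follow the assumed layout above.
SAMPLE_LINES = [
    "md-01 client=aa:bb:cc:00:11:22 bssid=00:1A:1E:AA:BB:C0 "
    "Dynamic BSS tunnel could not be setup /Denied; AP not found in STM.",
    "md-02 client=aa:bb:cc:00:11:33 bssid=00:1a:1e:aa:bb:d0 user authenticated",
]


def affected_aps(lines, bssid_to_ap):
    """Group (UAC, client MAC) pairs by the AP name that owns the BSSID."""
    hits = defaultdict(set)
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a dynamic BSS tunnel denial
        bssid = match.group("bssid").lower()
        # Fall back to a placeholder when the BSSID is missing from the table.
        ap_name = bssid_to_ap.get(bssid, "unknown (" + bssid + ")")
        hits[ap_name].add((match.group("uac"), match.group("client").lower()))
    return dict(hits)


if __name__ == "__main__":
    table = {"00:1a:1e:aa:bb:c0": "AP-Lobby-01"}  # hypothetical BSSID table
    for ap, pairs in affected_aps(SAMPLE_LINES, table).items():
        print(ap, sorted(pairs))
```

The output is keyed by AP name, so it doubles as the list of APs to reboot. The BSSID table keys are expected in lowercase here.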
With the engineering tickets open and it being summer, we're waiting for answers. In the meantime, we're keeping our more "critical services", such as ticket scanning, on our legacy 6.5.X.X controllers until we get to the point of a stable cluster.