Yes, this happens to us as well. We have two clusters, each of two 7220s, running 8.10.0.11. AP failure for an hour when a controller comes back up is new with 8.10.0.10.
We have had an ongoing issue since January (and a TAC case since then) that requires frequent reboots of one of the pair of controllers (not always the same one, and not always the same cluster), so we noticed this change in behavior and documented it in our TAC case, but they have not addressed it at all. (Since it is not our primary issue, I have not really pushed on it.)
If you go to the CLI and run "show ap database | include Tot" to get the AP count, and "show ap database status down" to see down APs on the two members, you find an interesting result: the MD that was in the cluster all along just drops APs from its database! They are not down; they are just gone!
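For anyone who wants to watch this happen on their own cluster, here is a minimal polling sketch (Python with netmiko; I am assuming netmiko's "aruba_os" device type and SSH access to the MDs, and the hostnames and credentials below are placeholders) that logs both members' totals every few minutes:

#!/usr/bin/env python3
# Poll both cluster members for their AP database totals during a reboot test.
# A sketch only: hostnames and credentials are placeholders, and it assumes
# netmiko's "aruba_os" device type can reach both MDs over SSH.
import time
from netmiko import ConnectHandler

MDS = ["md1.example.edu", "md2.example.edu"]  # placeholder controller names

def poll(host):
    conn = ConnectHandler(device_type="aruba_os", host=host,
                          username="admin", password="changeme")  # placeholders
    total = conn.send_command("show ap database | include Tot")
    down = conn.send_command("show ap database status down | include Tot")
    conn.disconnect()
    print(time.strftime("%H:%M:%S"), host,
          "total:", total.strip(), "| down:", down.strip())

while True:          # sample every 5 minutes to build a timeline
    for md in MDS:
        poll(md)
    time.sleep(300)

Capturing timestamped totals from both MDs side by side is what made the silent drops from the surviving MD's database easy to document for TAC.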
We start with a cluster of two MDs, with 570 APs, all up on both MDs (though with lossy pings to one of them). We reboot this problem MD. Here is a timeline of the AP status after giving the reload command:
NO APs are lost while one MD is down -- since they all have tunnels to both MDs, they seamlessly fail over to the remaining MD.
The problems begin after the rebooted MD comes back up and rejoins the cluster. Once it is again accepting APs, we get a flood of AirWave alerts, and the MM shows a bunch of APs as down.
About 15 minutes after the reload of the problem MD, the good MD had lost 13 APs from its database. They were not down (it still showed 0 APs down), but its total had dropped from 570 to 557!
About 30 minutes after one MD was restarted, the MM showed 73 APs down! (Shutting down an MD was totally hitless, but having one start up kills APs for many minutes.) The MD that stayed up had only 514 APs in its database, down from the actual 570. Meanwhile, the newly booted MD had 551 (closer to the real total), but it showed 45 as down.
10 minutes later, the MD that stayed up was down to 495 APs in its database. It is unclear why having an MD rejoin the cluster would cause APs to be removed from the active database. (Seems like a bug!)
10 minutes later (50 minutes past reboot), APs were recovering in batches; the MM showed only 32 down.
5 minutes later (55 minutes past reboot), only 25 were down on the MM; that shifted to 24 a minute later.
It held at 24 APs down for the next 15 minutes.
Then it went to 13 down at 1 hour 20 minutes past reboot, and to 6 down a couple of minutes later.
Finally, 1 hour 25 minutes after I rebooted one MD, the MM showed all APs as up again.
Not sure if anyone from TAC looks at these discussions, but this certainly seems like a big change for the worse from 8.10.0.9 to 8.10.0.10 and 8.10.0.11!
------------------------------
Steve Bohrer
IT Infrastructure, Emerson College
------------------------------
Original Message:
Sent: May 16, 2024 03:51 PM
From: Vignljv
Subject: Loss of Access Points in 2 node 7240XM cluster - ArubaOS 8.10.0.11
Cluster supports 1000+ access points, 330 and 340 models. Live upgrade of the 2-node 7240XM cluster from v8.10.0.8 to v8.10.0.11. The live upgrade went well (as usual) until the last controller rebooted to invoke code 8.10.0.11. Within several minutes of the controller booting and joining the cluster at L2, access points started to go down. Almost 300 went down, with nothing common to them: different networks, different AP groups, etc. Example: 30 access points went down in an AP group that has 165 access points.
Without doing a thing, it took 50 minutes to an hour for every AP to reconnect. TAC wrote it off as "sometimes it takes a while for the upgrade to complete on all devices." That was last week.
This morning I rebooted one node of the cluster, something we have done many times over the years running various versions of the Aruba 8.x code. Never have we had an issue with Access Points losing connection with a controller. During the controller reboot, all access points were operational on the other controller. Once the controller finished rebooting and joined the cluster again, access points went offline. Total count offline was 237. I did nothing but run show commands. In approximately 50 minutes the offline access points started reporting online again. It took about 20 minutes for all 237 to show online.
I'll open another TAC case and report my findings.
------------------------------
Lenny
------------------------------