AP outages during the hour after rebooting one MD of a two-member cluster?

skbohrer
Posted 13 days ago

We are on-prem Aruba 8.10.0.10, with two clusters of two 7220s each, covering basically half of campus each. Each AP normally has a connection to a primary and secondary MD, so in our two-MD clusters, every AP is connected to both MDs in its cluster.

As expected, rebooting one 7220 is nearly hitless, as the APs that were primary to the booting 7220 switch to their secondary. On the MM GUI, you see the AP count dip for a few seconds, and then it catches up, and all the APs are up and happy.

But once the booted 7220 comes up and rejoins the cluster, we start seeing lots of APs marked as down: typically a few tens at first, then briefly hundreds, and then the count gradually decreases until all APs are again marked as up.

From the CLIs of the MDs during this phase, we find that the Down APs are just gone from the MD that stayed up (which had been their secondary), that is, the Total AP count has dropped from the expected number. And, on the newly booted MD (which had been primary for these APs) the APs are shown as "status down", but the full count is shown.
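To make that comparison repeatable, here is a minimal Python sketch of the check described above. It assumes the AP names and statuses have already been copied out of each MD's CLI output by hand or by some other means; the function name `compare_ap_views` and the sample AP names are hypothetical and not part of any Aruba tooling.

```python
def compare_ap_views(survivor_aps, rebooted_view):
    """Compare the AP lists seen by the two cluster members.

    survivor_aps:  iterable of AP names listed on the MD that stayed up.
    rebooted_view: dict of AP name -> status string ("up" or "down")
                   as listed on the MD that was rebooted.
    """
    survivor = set(survivor_aps)

    # APs the rebooted MD knows about but the surviving MD no longer lists.
    missing_from_survivor = sorted(set(rebooted_view) - survivor)

    # APs the rebooted MD lists with any status other than "up".
    down_on_rebooted = sorted(
        name for name, status in rebooted_view.items()
        if status.strip().lower() != "up"
    )

    # APs showing both symptoms at once.
    both = sorted(set(missing_from_survivor) & set(down_on_rebooted))

    return {
        "missing_from_survivor": missing_from_survivor,
        "down_on_rebooted": down_on_rebooted,
        "both": both,
    }


if __name__ == "__main__":
    # Hypothetical sample data, not real output.
    survivor = ["ap-101", "ap-102"]
    rebooted = {"ap-101": "up", "ap-102": "up",
                "ap-103": "down", "ap-104": "down"}
    for key, names in compare_ap_views(survivor, rebooted).items():
        print(f"{key}: {len(names)} {names}")
```

Running this at intervals during the recovery hour would show whether the "gone from the survivor" set and the "status down on the rebooted MD" set are really the same APs, which is what we appear to see by eye.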

What is going on here? Seems the system has no trouble with the sudden loss of one MD, but then does a very disruptive job of setting everything back to normal after the reboot. My sense is that the time with lots of APs down after the reboot is much longer with 8.10.0.10 than it was with 8.10.0.9, but I have not actually rolled back to confirm that -- it is more that I never used to notice these down APs for long, and now it takes about an hour.

Anyone else seeing this behavior? Our clusters report that they are happily L2 connected, and TAC has not found any issues yet.

Background:

Beginning in January, when we were on 8.10.0.8, we started having issues where one of the two MDs in a cluster would start having connectivity issues to some of its APs. At first, the MDs and MM would not report it, but Airwave would mark about 15 or 20 APs as down, and all were primary on the same member of the cluster. Pings from the problem MD to the AP's IP would be very lossy (20% to 60% loss, bursty), but the same AP would ping flawlessly from the other MD.
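For putting numbers on "lossy and bursty", a small Python sketch like the following could summarize a ping run. It assumes each probe has already been recorded as a simple reply or no-reply value, however that was collected; the function name `summarize_ping` and the sample sequence are hypothetical.

```python
def summarize_ping(replies):
    """Summarize a sequence of ping results.

    replies: list of booleans, True if that probe got a reply.
    Returns (loss_percent, longest_loss_burst), where the burst is the
    largest number of consecutive probes that got no reply.
    """
    total = len(replies)
    lost = sum(1 for ok in replies if not ok)

    longest = 0
    run = 0
    for ok in replies:
        # Reset the counter on a reply, extend it on a loss.
        run = 0 if ok else run + 1
        longest = max(longest, run)

    loss_percent = 100.0 * lost / total if total else 0.0
    return loss_percent, longest


if __name__ == "__main__":
    # Hypothetical sample: 10 probes, 3 lost, with one burst of 2.
    sample = [True, False, False, True, True, False, True, True, True, True]
    loss, burst = summarize_ping(sample)
    print(f"loss: {loss:.1f}%  longest burst: {burst} probes")
```

Comparing the same summary from both MDs toward the same AP would make the asymmetry easy to show in a TAC case.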

If we left the issue for several hours, more and more APs would go down, and eventually the MM would start showing them as down, and then eventually it would show the problem MD as down. From the problem MD at this point, we had very lossy pings to its peer MD, and to the MM out at the data center. But again, the peer MD, on the same switch with the same path to the data center, had no ping loss. Rebooting the problem MD restored good connectivity, so it seemed like comms were failing inside that MD. This issue has occurred various times on all four MDs, and none of the dual 10G connections to any of the four MDs show any errors on either end of the link, so it really does not seem like a link issue.

But so far Aruba has not been able to find anything, despite extensive log uploads. They had us update to 8.10.0.9 and then 8.10.0.10, but no difference.

After the first incidents, we now reboot at the first sign of down APs in Airwave, hence my now frequent need to reboot one MD of a cluster. And I'm surprised to have it take an hour to recover.

------------------------------
Steve Bohrer
IT Infrastructure, Emerson College
------------------------------
