Wireless Access


Access network design for branch, remote, outdoor, and campus locations with HPE Aruba Networking access points and mobility controllers.

Loss of Access Points in 2 node 7240XM cluster - ArubaOS 8.10.0.11

  • 1.  Loss of Access Points in 2 node 7240XM cluster - ArubaOS 8.10.0.11

    Posted 15 days ago

    The cluster supports 1000+ access points, 330 and 340 models. We did a live upgrade of a 2 node 7240XM cluster from v8.10.0.8 to v8.10.0.11. The live upgrade went well (as usual) until the last controller rebooted to load 8.10.0.11. Within several minutes of that controller booting and joining the cluster at L2, access points started to go down. Almost 300 went down, with nothing in common between them: different networks, different AP groups, etc. Example: 30 access points went down in an AP group that has 165 access points.

    Without doing a thing it took 50 minutes to an hour for every AP to connect. TAC wrote it off as "sometimes it takes a while for the upgrade to complete on all devices". That was last week. 

    This morning I rebooted one node of the cluster, something we have done many times over the years running various versions of the Aruba 8.x code. Never have we had an issue with Access Points losing connection with a controller. During the controller reboot, all access points were operational on the other controller. Once the controller finished rebooting and joined the cluster again, access points went offline. Total count offline was 237. I did nothing but run show commands. In approximately 50 minutes the offline access points started reporting online again. It took about 20 minutes for all 237 to show online.

    I'll open another TAC case and report my findings. 



    ------------------------------
    Lenny
    ------------------------------



  • 2.  RE: Loss of Access Points in 2 node 7240XM cluster - ArubaOS 8.10.0.11

    Posted 13 days ago

    Unfortunately, we experienced similar behaviour yesterday evening when we upgraded our two-node 7205 cluster from 8.10.0.7 to 8.10.0.12. It affected circa 13 of our 66 APs. Because our maintenance window was coming to an end, we powered off one of the MDs, and about 15 minutes later all of the APs successfully connected to the remaining node in the cluster. But this might also have happened if we had left both nodes running?

    This thread from about a month ago also sounds very much alike: https://community.arubanetworks.com/discussion/ap-outages-during-the-hour-after-rebooting-one-of-md-of-a-two-member-cluster




  • 3.  RE: Loss of Access Points in 2 node 7240XM cluster - ArubaOS 8.10.0.11

    Posted 12 days ago

    Yes, this happens to us as well. We have two clusters of two 7220s each, running 8.10.0.11. AP failure for an hour when a controller comes back up is new with 8.10.0.10.

    We have had an ongoing issue since January (and a TAC case since then) which requires frequent boots of one of the pair of controllers (not always the same one, and not always the same cluster) so we noticed this change in behavior, and I documented it in our TAC case, but they have not addressed it at all. (Since it is not our primary issue, I have not really pushed on it.)

    If you go to the CLI and run "sh ap database | inc Tot" to get the AP count, and "sh ap database status down" to see down APs on the two members, you find an interesting result: the MD that was in the cluster all along will just drop APs from its database! They are not down, they are just gone!

    We start with a cluster of two MDs, with 570 APs, all up on both MDs (though having lossy pings to one of them). We reboot this problem MD. Here is a timeline of the AP status after giving the reload command:
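    The comparison of the two members' AP counts could be scripted roughly like this. This is only a sketch: it assumes the output of the "inc Tot" filter is a single line whose first integer is the AP count, the sample strings are hypothetical rather than captured output, and the function names are made up for illustration.

```python
import re


def ap_total(filtered_output: str) -> int:
    """Return the first integer on the first line that contains 'Tot'.

    Assumption: the filtered 'show ap database' output carries the AP
    count as the first number on that line. Adjust if yours differs.
    """
    for line in filtered_output.splitlines():
        if "Tot" in line:
            match = re.search(r"\d+", line)
            if match:
                return int(match.group())
    raise ValueError("no total line found in output")


def db_mismatch(md_a_output: str, md_b_output: str) -> int:
    """Difference in AP database size between two cluster members."""
    return ap_total(md_a_output) - ap_total(md_b_output)


# Hypothetical sample lines; real controller output may look different.
print(db_mismatch("Total APs:557", "Total APs:570"))
```

    A non-zero result while neither member reports APs as down would match the "gone, not down" symptom described here.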

    NO APs are lost while one MD is down -- since they all have tunnels to both MDs, they seamlessly change to the remaining MD.

    The problems begin after the booted MD comes back up and rejoins the cluster. At that point, when it has rebooted and is again accepting APs, we get a bunch of AirWave alerts, and the MM shows a bunch of APs as down.

    At about 15 minutes after the reload of the problem MD, the good MD had lost 13 APs from its database. They were not down, it still had 0 APs down, but its total had changed from 570 to 557!

    About 30 minutes after one MD was restarted, the MM showed 73 APs down! (Shutting down an MD was totally hitless, but having one start up kills APs for many minutes.) The MD that stayed up has only 514 APs in its database, down from the actual 570. Meanwhile, the newly booted MD has 551 (which is closer to the real total) but it shows 45 as down.

    10 minutes later, the MD that stayed up is down to 495 APs in its database. It is unclear why having an MD rejoin the cluster would cause APs to be removed from the active database. (Seems like a bug!)

    10 minutes later (50 minutes past reboot) APs are recovering in batches. MM shows only 32 down.

    5 minutes later (55 past reboot) only 25 down on MM. Then it shifted to 24 a minute later.

    It held at 24 APs down for the next 15 minutes.

    Then went to 13 down at 1 hour 20 minutes past reboot, then 6 down a couple minutes later.

    Finally, 1 hour 25 minutes after I rebooted one MD, the MM showed all APs as up again.

    Not sure if anyone from TAC looks at these discussions, but this certainly seems like a big change for the worse from 8.10.0.9 to 8.10.0.10 and 8.10.0.11!
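    To make the database shrinkage in the timeline above easier to see, here is the arithmetic as a small Python sketch. The counts are copied from this post; the variable and function names are made up for illustration, and the 40 minute mark is my reading of "10 minutes later" after the 30 minute observation.

```python
# AP counts copied from the timeline in this post; 570 is the known total.
EXPECTED_TOTAL = 570

# (minutes after reload, APs listed in the database of the MD that stayed up)
surviving_md_db = [(15, 557), (30, 514), (40, 495)]


def missing_from_db(expected: int, listed: int) -> int:
    """APs absent from the database entirely (not merely marked down)."""
    return expected - listed


for minutes, listed in surviving_md_db:
    missing = missing_from_db(EXPECTED_TOTAL, listed)
    print(f"{minutes:>3} min: {listed} listed, {missing} missing")

# Rebooted MD at about 30 minutes: 551 listed, 45 of those marked down.
rebooted_listed, rebooted_down = 551, 45
rebooted_up = rebooted_listed - rebooted_down
rebooted_missing = missing_from_db(EXPECTED_TOTAL, rebooted_listed)
print(f"rebooted MD: {rebooted_up} up, {rebooted_missing} missing")
```

    So the MD that never went down was missing 13, then 56, then 75 APs from its database, while the rebooted MD listed all but 19 of them.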



    ------------------------------
    Steve Bohrer
    IT Infrastructure, Emerson College
    ------------------------------



  • 4.  RE: Loss of Access Points in 2 node 7240XM cluster - ArubaOS 8.10.0.11

    Posted 11 days ago

    A follow up on my last post:

    As I mentioned, we left our first MD switched off on Thursday evening after experiencing the problems described above during the update to 8.10.0.12, and 15 minutes later all APs were back online on the second MD.

    On Friday afternoon we decided to switch the first MD back on again. After this, 18 of our APs went offline. All of these 18 APs were assigned to the first MD as Active Controller (but some other APs that also had the first MD as Active Controller were working fine).

    Almost exactly one hour after switching the first MD on, all 18 APs came back online (over the course of about 8 minutes between the first and the last of them). I checked the log on the first MD and saw that during this 8-minute timeframe the IKE and IPSEC SAs of the 18 affected APs were cleared ("errcode:ERR_IKESA_CLEARED"), but I'm unsure whether this is the cause or an effect of the APs coming back online.




  • 5.  RE: Loss of Access Points in 2 node 7240XM cluster - ArubaOS 8.10.0.11

    Posted 2 days ago

    We're experiencing a similar issue after an upgrade to 8.10.0.11. We statically assign all our APs' IPs. After the upgrade, we had 160 APs that remained offline/DOWN. Our fix at the moment is to set the server IP address again with "setenv serverip 10.x.x.x". When we try to boot an AP without making this change, the TFTP server is set to "60.106.120.132"; it's as if the firmware update changed the config. Anyhow, I have NOT tried rebooting the controllers. When I mentioned this to TAC on a recent call, their fix was to visit all 160 APs and make the config change. I might need to reach out again.




  • 6.  RE: Loss of Access Points in 2 node 7240XM cluster - ArubaOS 8.10.0.11

    Posted 2 days ago

    We have not opened another TAC case on this issue. We are in the process of adding one 9240 series controller to each of the existing clusters. The end game is that we will have two clusters, each with two 7240XM controllers and one 9240 controller: one cluster in data center A and the other in data center B.

    This should be done in a couple of weeks. It will be interesting to see how the access points behave once the new controller is added.

    I know this does not help others experiencing the same issue but thought I'd share our plan.