For about half a year now, I've had two IAPs in a cluster running stably in a multi-family home. Previously, I had broadcast filtering disabled. Wanting to follow best practices, I enabled broadcast filtering and turned on AirGroup. The cluster ran smoothly for the first day or so, and then I ran into problems.
The cluster is completely unstable in this configuration when I'm running the master controller on the IAP115. Every few hours, CPU usage spikes to 100% and stays there, causing dropped pings and packet loss everywhere. This persists until it's rebooted.
If I move the master controller from the IAP115 to the 225, I see much more stability. Occasionally, though, stability will fall off a cliff (albeit without the CPU spikes I was seeing on the 115) and the cluster again needs to be rebooted.
After opening a case with TAC, they advised that I upgrade to yesterday's firmware (184.108.40.206-220.127.116.11_48114). So far the cluster has been stable on the 225 (I haven't tried the 115 yet) but the newest firmware presents a huge bug:
112117
Symptom: When the 80 MHz support is enabled on an IAP, ARM chooses only 36E as a valid channel.
Scenario: This issue occurs when ARM is enabled on an IAP to allocate 80 MHz channels. This issue is observed in the IAP-22x and IAP-27x devices running 18.104.22.168-22.214.171.124 release.
Workaround: None
It seems I'm caught between an unstable cluster and limited 80 MHz channel availability.
Can anyone from Aruba shed some light on this or when I might expect to see a fix? Does anyone have any other suggestions for this type of setup?
This doesn't even address the fact that I would like to run the cluster on the IAP115 so that I can test reboots/configuration changes safely on the 225. I feel like the firmware has gone from bad to worse (still waiting on mesh support).
I should also add that none of the 80 MHz E channels show up in ARM under the new firmware. Aruba, please fix this!
Here's at least hoping that the new firmware brings some cluster stability.
Question 1) Does anyone know what busybox is? It might be causing the issue.
Also, TAC seems useless on this. Third phone call in three days.
CPU and Memory Usage
--------------------
Timestamp            CPU Util(%)  Memory Util(%)
---------            -----------  --------------
2015-01-20 21:48:02  57           37
2015-01-20 21:47:52  35           37
2015-01-20 21:47:42  20           38
2015-01-20 21:47:32  70           38
2015-01-20 21:47:12  100          38
2015-01-20 21:46:35  100          38
2015-01-20 21:45:52  99           38

Peak CPU Util in the last one hour
----------------------------------
Timestamp            CPU Util(%)  Memory Util(%)
---------            -----------  --------------
2015-01-20 21:36:18  100          37

Output of top
-------------
Mem: 95076K used, 160860K free, 0K shrd, 0K buff, 28348K cached
Load average: 4.91 8.13 6.39 (Status: S=sleeping R=running, W=waiting)
PID    USER  STATUS  RSS    PPID   %CPU  %MEM  COMMAND
16976  root  R N     368    16975  30.5  0.1   busybox
2      root  RWN     0      1      7.6   0.0   ksoftirqd/0
1738   root  S <     13732  1671   0.0   5.3   cli
1748   root  S N     5096   1671   0.0   1.9   sapd
1761   root  S       2648   1671   0.0   1.0   mdns
1752   root  S <     2508   1671   0.0   0.9   stm
1758   root  S       2340   1671   0.0   0.9   snmpd_sap
16518  root  S       1876   1671   0.0   0.7   radiusd-term
16517  root  S       1868   1671   0.0   0.7   radiusd
1737   root  S N     1688   1671   0.0   0.6   awc
1767   root  S       1472   1671   0.0   0.5   meshd
1764   root  S       1316   1671   0.0   0.5   lldpd
1680   root  S       1196   1671   0.0   0.4   tinyproxy
1671   root  S       1116   1      0.0   0.4   nanny
16966  root  S <     1108   1579   0.0   0.4   mini_httpd
1765   root  S       1020   1671   0.0   0.3   rfd
1689   root  S       976    1680   0.0   0.3   tinyproxy
1692   root  S       976    1680   0.0   0.3   tinyproxy
1690   root  S       976    1680   0.0   0.3   tinyproxy
1691   root  S       976    1680   0.0   0.3   tinyproxy
1576   root  S <     736    1      0.0   0.2   mini_httpd
I've escalated the case with TAC. The third rep was more detailed and promised that he would look into the issue.
I've also added much more robust monitoring so that I'm emailed immediately if this occurs again.
I'm also attaching the AP Tech Support Dump here for others to review. Just a note: the log was grabbed 2-3 minutes after the CPU spike. I couldn't grab it at the time of the event because the portal stopped responding and returned an internal server error.
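For anyone who wants to script a similar check, here is a minimal sketch of the idea in Python. It assumes you can capture the "CPU and Memory Usage" rows as text (how you pull them from the AP is up to you), and the threshold of 95% over three consecutive samples is just a guess at what "stuck at 100%" looks like, not anything from Aruba. The function names are made up for illustration.

```python
from datetime import datetime

def parse_samples(text):
    """Parse 'YYYY-MM-DD HH:MM:SS cpu mem' rows; headers and rules are skipped."""
    samples = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # header, separator, or blank line
        try:
            ts = datetime.strptime(parts[0] + " " + parts[1], "%Y-%m-%d %H:%M:%S")
            cpu, mem = int(parts[2]), int(parts[3])
        except ValueError:
            continue  # four tokens, but not a data row
        samples.append((ts, cpu, mem))
    return sorted(samples)  # oldest sample first

def sustained_spike(samples, threshold=95, min_samples=3):
    """True if at least min_samples consecutive samples are >= threshold."""
    run = 0
    for _, cpu, _ in samples:
        run = run + 1 if cpu >= threshold else 0
        if run >= min_samples:
            return True
    return False

# Rows copied from the CPU table posted earlier in this thread.
SAMPLE = """\
Timestamp CPU Util(%) Memory Util(%)
--------- ----------- --------------
2015-01-20 21:48:02 57 37
2015-01-20 21:47:52 35 37
2015-01-20 21:47:42 20 38
2015-01-20 21:47:32 70 38
2015-01-20 21:47:12 100 38
2015-01-20 21:46:35 100 38
2015-01-20 21:45:52 99 38
"""

if __name__ == "__main__":
    if sustained_spike(parse_samples(SAMPLE)):
        print("sustained CPU spike detected")  # hook your email alert in here
```

Requiring several consecutive high samples rather than a single one is deliberate: a lone 100% reading can be a normal burst, while a run of them matches the stuck state described above.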
Thanks for the assist!
I am also seeing some instability with 126.96.36.199. The elected controller spikes to 100% CPU and loses network connectivity. A power cycle temporarily fixes it, but the next AP to be elected as the controller does the same.
Are there known issues with 188.8.131.52?
I have been informed by TAC that this is indeed a bug (a huge one at that). A bug report has been submitted internally and they are investigating the issue. I'm trying to pull a tech support dump from the console when I experience the issue, but I haven't yet been able to access the device physically when the issue occurs, so it's been difficult.
The other strange thing is that it appears to be the same bug affecting the IAP115 and the IAP225. If it happens on the 225, I need to reboot all IAPs. If it happens on the 115, the whole network drops due to a packet storm. >_<
This sounds serious. Do you have a bug/defect ID? Have you heard of any ETA on the fix?
@ComplexMind wrote: I am also seeing some instability with 184.108.40.206. The elected controller spikes to 100% CPU and loses network connectivity. A power cycle temporarily fixes it, but the next AP to be elected as the controller does the same. Are there known issues with 220.127.116.11?
I'm also seeing this issue with a mixed cluster of IAP-105s and IAP-115s. The elected controller becomes completely unresponsive -- fails to respond even to pings when it's in this state, although it continued to send logging info to my syslog host for a few minutes after the failure, consisting mostly of ~100 of these per second (where I've replaced the controller's IP address with xxx.xxx.xxx.xxx, and the controller's MAC address with AA:AA:AA:AA:AA:AA):
Feb 23 15:11:08 2015 xxx.xxx.xxx.xxx <xxx.xxx.xxx.xxx AA <Error>: AA:AA:AA:AA:AA> stm: PAPI_Send: sendto ARM Process failed: No such file or directory Message Code 3115 Sequence Num is 27866
Feb 23 15:11:08 2015 xxx.xxx.xxx.xxx stm <Error>: <304065> <ERRS> <xxx.xxx.xxx.xxx AA:AA:AA:AA:AA:AA> PAPI_Send failed, send_papi_message_with_args, 790: No such file or directory, dstport 8494
The scary part about this is that the network is still basically functional -- clients can connect to other APs and it passes network traffic, but nothing connects to the elected controller and I have no access to the admin UI.
I also just noticed that the controller seems to be spamming ARP requests for various hosts on the network at an unusual rate while it's in this state. I'm not sure what to make of this.
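In case it helps anyone else put a number on the error flood, here is a rough Python sketch that counts PAPI_Send failures per second from saved syslog lines. It assumes each line carries a timestamp in the "Feb 23 15:11:08 2015" form shown above; the function name is my own and nothing here is an Aruba tool.

```python
import re
from collections import Counter

# Matches timestamps such as "Feb 23 15:11:08 2015" or "Mar 4 20:34:43 2015".
STAMP = re.compile(r"([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\d{4})")

def papi_failures_per_second(lines):
    """Count lines mentioning PAPI_Send, keyed by their one-second timestamp."""
    counts = Counter()
    for line in lines:
        if "PAPI_Send" not in line:
            continue  # unrelated log entry
        match = STAMP.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    import sys
    # Usage: python count_papi.py < saved_syslog.txt
    for stamp, count in sorted(papi_failures_per_second(sys.stdin).items()):
        print(stamp, count)
```

A sudden jump in the per-second count would be a reasonable trigger for an alert, since by the time the controller stops answering pings it may also stop logging.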
This seems to be similar to my issue. I would open a TAC case and escalate. I wasn't able to send them tech dumps due to the high CPU spikes, but I did get a syslog server up with some data.
I'm also told that 18.104.22.168 may have some fixes for this and should be out in ~2 weeks. The proof, as they say, is in the pudding.
I downgraded back to 22.214.171.124 and the controller issue cleared up. However, I believe that leaves us exposed to the DoS issues that were supposedly mitigated in 126.96.36.199. Unfortunately, I don't have the bandwidth to troubleshoot with TAC. Hopefully they are watching these threads and can help. I'll wait for a fix to 188.8.131.52 before I upgrade again.
Please message me the case number.
I'm also seeing the same issue on my home office Instant network with an IAP-105 and RAP-109 running 184.108.40.206. I have factory reset both access points one at a time, but the issue still persists: my Wi-Fi clients drop the association and can't reconnect until the virtual controller is rebooted. Here is a copy of my log files during the outage:
2015-03-04 20:34:44 Local1.Error 10.0.20.161 Mar 4 20:34:43 2015 10.0.20.161 <10.0.20.161 6C:F3:7F:XX:XX:XX> stm: PAPI_Send: sendto ARM Process failed: No such file or directory Message Code 3115 Sequence Num is 61082
2015-03-04 20:34:44 Local1.Error 10.0.20.161 Mar 4 20:34:43 2015 10.0.20.161 stm: <304065> <ERRS> <10.0.20.161 6C:F3:7F:XX:XX:XX> PAPI_Send failed, send_papi_message_with_args, 790: No such file or directory, dstport 8494
220.127.116.11 has been removed from the downloads website. Please downgrade to 18.104.22.168 and let us know if you still see the issue. If you cannot obtain the software to downgrade, please message me.
I would like to confirm the software downgrade procedure:
1. Multiple IAP-105s in a single cluster. Log in to the virtual controller, manually upload the older firmware, click Upgrade Now, and reboot. The remaining IAP-105 access points will automatically downgrade to the older firmware.
2. Mix of IAP-105 and IAP-225 in a single cluster. Enable a TFTP server on my computer. Log in to the virtual controller, specify the URL location for each category of Instant AP in the format tftp://<My-PC-IP-Address>/<software name>, click Upgrade Now, and reboot. The remaining mix of access points will automatically downgrade and reboot.
Am I missing anything from this procedure?
You are not missing anything. Just like the procedure here: http://www.arubanetworks.com/techdocs/Instant_41_WebHelp/InstantWebHelp.htm#UG_files/IAP_maintenance/Firmware_Image_Server_in.htm for the Image URL Option.
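If you have several AP categories to fill in, a small helper can build the image URLs in the tftp://<My-PC-IP-Address>/<software name> format from the procedure above and catch typos before you paste them into the UI. This is only a sketch: the function name is made up, it assumes an IPv4 TFTP server address, and the file names in the usage example are placeholders, not real Aruba image names.

```python
import ipaddress

def image_urls(server_ip, images):
    """Build one tftp:// URL per AP category from a {category: file name} mapping."""
    ipaddress.IPv4Address(server_ip)  # raises ValueError if the address is malformed
    urls = {}
    for category, filename in images.items():
        # A bare file name is assumed: no directories, no spaces.
        if not filename or "/" in filename or " " in filename:
            raise ValueError(f"unexpected image file name: {filename!r}")
        urls[category] = f"tftp://{server_ip}/{filename}"
    return urls

if __name__ == "__main__":
    # Placeholder address and file names; substitute your own.
    example = image_urls(
        "192.168.1.10",
        {"IAP-105": "placeholder-105.img", "IAP-225": "placeholder-225.img"},
    )
    for category, url in example.items():
        print(category, url)
```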
OK, has anyone heard an ETA for a new release with this issue fixed?
Don't know yet. I'm trying to pick code for a new deployment of IAP-205s. Guessing 22.214.171.124 is the best shot for now.