Controllerless Networks


Instant Mode - the controllerless Wi-Fi solution that's easy to set up, is loaded with security and smarts, and won't break your budget

Big problems w/ 115/225 Cluster

  • 1.  Big problems w/ 115/225 Cluster

    Posted Jan 20, 2015 01:26 PM

    For about half a year now, I've had two IAPs in a cluster running stably in a multi-family home. Previously, I had broadcast filtering disabled. Wanting to follow best practices, I enabled broadcast filtering and turned on AirGroup. The cluster ran smoothly for the first day or so, and then I ran into problems.

     

    The cluster is completely unstable in this configuration when the master controller runs on the IAP115. Every few hours, CPU usage spikes to 100% and stays there, causing dropped pings and packet loss everywhere. This persists until the AP is rebooted.

     

    If I move the master controller from the IAP115 to the 225, I see much more stability, though occasionally stability falls off a cliff (albeit without the CPU spikes I was seeing on the 115) and the cluster again needs to be rebooted.

     

    After opening a case with TAC, they advised that I upgrade to yesterday's firmware (6.4.2.3-4.1.1.2_48114). So far the cluster has been stable on the 225 (I haven't tried the 115 yet) but the newest firmware presents a huge bug:

     

    112117 Symptom: When the 80 MHz support is enabled on an IAP, ARM chooses only 36E as a valid channel.
    Scenario: This issue occurs when ARM is enabled on an IAP to allocate 80 MHz channels. This issue is observed in the IAP-22x and IAP-27x devices running 6.4.2.3-4.1.1.2 release.
    Workaround: None

     

    It seems I'm caught between an unstable cluster and limited 80 MHz channel availability.
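
    For anyone unsure what "only 36E" costs in practice, here is a rough sketch of the 80 MHz blocks in the 5 GHz band. It assumes the "E" label names a block by its lowest 20 MHz channel (as 36E does in the note above); the function name is made up for illustration, and which blocks are actually permitted depends on the regulatory domain.

```python
# Lowest 20 MHz channel of each 80 MHz block in the 5 GHz band
# (802.11ac channelization). Availability per country is NOT modeled here.
BLOCK_STARTS = [36, 52, 100, 116, 132, 149]


def eighty_mhz_blocks():
    """Return {label: {"members": [...], "center": n}} for each 80 MHz block.

    The "<start>E" label format is an assumption based on the "36E"
    notation quoted in the release note.
    """
    blocks = {}
    for start in BLOCK_STARTS:
        # An 80 MHz channel bonds four adjacent 20 MHz channels,
        # whose channel numbers are spaced 4 apart.
        members = [start + 4 * i for i in range(4)]
        # The 80 MHz center channel number sits in the middle of the block.
        center = start + 6
        blocks[f"{start}E"] = {"members": members, "center": center}
    return blocks


if __name__ == "__main__":
    for label, info in eighty_mhz_blocks().items():
        print(label, info["members"], "center", info["center"])
```

    With ARM restricted to 36E, every 80 MHz-capable AP in range ends up sharing channels 36 through 48.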

     

    Can anyone from Aruba shed some light on this or when I might expect to see a fix? Does anyone have any other suggestions for this type of setup?


    This doesn't even address the fact that I would like to run the cluster on the IAP115 so that I can test reboots/configuration changes safely on the 225. I feel like the firmware has gone from bad to worse (still waiting on mesh support).

     

     


    #AP225


  • 2.  RE: Big problems w/ 115/225 Cluster

    Posted Jan 20, 2015 04:51 PM

    I should also add that none of the 80 MHz E channels show up in ARM under the new firmware. Aruba, please fix this!

     

    Here's at least hoping that the new firmware brings some cluster stability.



  • 3.  RE: Big problems w/ 115/225 Cluster

    Posted Jan 21, 2015 12:09 AM

    Does anyone know what busybox is? It tops the CPU column in the output below, so it might be causing the issue.

     

    Also, TAC seems useless on this. Third phone call in three days.

     

    CPU and Memory Usage
    --------------------
    Timestamp            CPU Util(%)  Memory Util(%)
    -------------------  -----------  --------------
    2015-01-20 21:48:02  57           37
    2015-01-20 21:47:52  35           37
    2015-01-20 21:47:42  20           38
    2015-01-20 21:47:32  70           38
    2015-01-20 21:47:12  100          38
    2015-01-20 21:46:35  100          38
    2015-01-20 21:45:52  99           38

    Peak CPU Util in the last one hour
    ----------------------------------
    Timestamp            CPU Util(%)  Memory Util(%)
    -------------------  -----------  --------------
    2015-01-20 21:36:18  100          37
     

    Output of top
    -------------
    Mem: 95076K used, 160860K free, 0K shrd, 0K buff, 28348K cached
    Load average: 4.91 8.13 6.39 (Status: S=sleeping R=running, W=waiting)
      PID USER STATUS   RSS  PPID %CPU %MEM COMMAND
    16976 root R N      368 16975 30.5  0.1 busybox
        2 root RWN        0     1  7.6  0.0 ksoftirqd/0
     1738 root S <    13732  1671  0.0  5.3 cli
     1748 root S N     5096  1671  0.0  1.9 sapd
     1761 root S       2648  1671  0.0  1.0 mdns
     1752 root S <     2508  1671  0.0  0.9 stm
     1758 root S       2340  1671  0.0  0.9 snmpd_sap
    16518 root S       1876  1671  0.0  0.7 radiusd-term
    16517 root S       1868  1671  0.0  0.7 radiusd
     1737 root S N     1688  1671  0.0  0.6 awc
     1767 root S       1472  1671  0.0  0.5 meshd
     1764 root S       1316  1671  0.0  0.5 lldpd
     1680 root S       1196  1671  0.0  0.4 tinyproxy
     1671 root S       1116     1  0.0  0.4 nanny
    16966 root S <     1108  1579  0.0  0.4 mini_httpd
     1765 root S       1020  1671  0.0  0.3 rfd
     1689 root S        976  1680  0.0  0.3 tinyproxy
     1692 root S        976  1680  0.0  0.3 tinyproxy
     1690 root S        976  1680  0.0  0.3 tinyproxy
     1691 root S        976  1680  0.0  0.3 tinyproxy
     1576 root S <      736     1  0.0  0.2 mini_httpd
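
    In case it helps anyone comparing their own dumps, the CPU samples above can be scanned for sustained spikes with something like the following. This is only a sketch: it assumes each data row has the four whitespace-separated fields shown above, and the 95% threshold and the function names are my own choices, not anything from the AP.

```python
from datetime import datetime

# Rows copied from the "CPU and Memory Usage" table above.
SAMPLE = """\
Timestamp CPU Util(%) Memory Util(%)
2015-01-20 21:48:02 57 37
2015-01-20 21:47:52 35 37
2015-01-20 21:47:42 20 38
2015-01-20 21:47:32 70 38
2015-01-20 21:47:12 100 38
2015-01-20 21:46:35 100 38
2015-01-20 21:45:52 99 38
"""


def parse_samples(text):
    """Return (timestamp, cpu, mem) tuples sorted oldest first.

    Lines that do not look like 'date time cpu mem' are skipped.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue
        try:
            ts = datetime.strptime(f"{parts[0]} {parts[1]}", "%Y-%m-%d %H:%M:%S")
            cpu, mem = int(parts[2]), int(parts[3])
        except ValueError:
            continue
        rows.append((ts, cpu, mem))
    return sorted(rows)


def longest_spike(rows, threshold=95):
    """Return (start, end) of the longest run of consecutive samples
    with CPU at or above threshold, or None if there is no such sample."""
    best = None
    start = None
    for ts, cpu, _mem in rows:
        if cpu >= threshold:
            if start is None:
                start = ts
            if best is None or (ts - start) > (best[1] - best[0]):
                best = (start, ts)
        else:
            start = None
    return best


if __name__ == "__main__":
    spike = longest_spike(parse_samples(SAMPLE))
    if spike:
        print("spike from", spike[0], "to", spike[1], "lasting", spike[1] - spike[0])
```

    On the samples above this reports a run from 21:45:52 to 21:47:12, which is 80 seconds at or above 95%.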



  • 4.  RE: Big problems w/ 115/225 Cluster

    EMPLOYEE
    Posted Jan 21, 2015 12:18 AM
    Busybox is part of the Linux backend. Did you try escalating your case? 


    Thanks, 
    Tim


  • 5.  RE: Big problems w/ 115/225 Cluster

    Posted Jan 21, 2015 12:31 AM

    I've escalated the case with TAC. The third rep was more detailed and promised that he would look into the issue.

     

    I've also added much more robust monitoring so that I'm emailed immediately if this occurs again.

     

    I'm also attaching the AP Tech Support Dump here for others to review. Just a note: the log was grabbed 2-3 minutes after the CPU spike. I couldn't grab it at the time of the event because the portal stopped responding and returned an internal server error.

     

    Thanks for the assist!

    Attachment: IAP115_cpu_max_event.txt (372 KB)


  • 6.  RE: Big problems w/ 115/225 Cluster

    Posted Feb 06, 2015 05:54 PM

    I am also seeing some instability with 4.1.1.2. The elected controller spikes to 100% CPU and loses network connectivity. A power cycle temporarily fixes it, but the next AP to be elected as the controller does the same.

     

    Are there known issues with 4.1.1.2?



  • 7.  RE: Big problems w/ 115/225 Cluster

    Posted Feb 09, 2015 12:04 PM

    I have been informed by TAC that this is indeed a bug (a huge one at that). A bug report has been submitted internally and they are investigating the issue. I'm trying to pull a tech support dump from the console when I experience the issue, but I haven't yet been able to access the device physically when the issue occurs, so it's been difficult.

     

    The other strange thing is that it appears to be the same bug affecting the IAP115 and the IAP225. If it happens on the 225, I need to reboot all IAPs. If it happens on the 115, the whole network drops due to a packet storm. >_<



  • 8.  RE: Big problems w/ 115/225 Cluster

    Posted Feb 20, 2015 05:08 AM

    This sounds serious. Do you have a bug/defect ID? Have you heard of any ETA on the fix?



  • 9.  RE: Big problems w/ 115/225 Cluster

    Posted Mar 02, 2015 07:53 PM

    @ComplexMind wrote:

    I am also seeing some instability with 4.1.1.2. The elected controller spikes to 100% CPU and loses network connectivity. A power cycle temporarily fixes it, but the next AP to be elected as the controller does the same.

     

    Are there known issues with 4.1.1.2?


     

    I'm also seeing this issue with a mixed cluster of IAP-105s and IAP-115s.  The elected controller becomes completely unresponsive -- fails to respond even to pings when it's in this state, although it continued to send logging info to my syslog host for a few minutes after the failure, consisting mostly of ~100 of these per second (where I've replaced the controller's IP address with xxx.xxx.xxx.xxx, and the controller's MAC address with AA:AA:AA:AA:AA:AA):

     

    Feb 23 15:11:08 2015 xxx.xxx.xxx.xxx <xxx.xxx.xxx.xxx AA <Error>: AA:AA:AA:AA:AA> stm[1537]: PAPI_Send: sendto ARM Process failed: No such file or directory Message Code 3115 Sequence Num is 27866 

    Feb 23 15:11:08 2015 xxx.xxx.xxx.xxx stm[1537] <Error>: <304065> <ERRS> <xxx.xxx.xxx.xxx AA:AA:AA:AA:AA:AA>  PAPI_Send failed, send_papi_message_with_args, 790: No such file or directory, dstport 8494

     

    The scary part about this is that the network is still basically functional -- clients can connect to other APs and it passes network traffic, but nothing connects to the elected controller and I have no access to the admin UI.

     

    I also just noticed that the controller seems to be spamming ARP requests for various hosts on the network at an unusual rate while it's in this state.  I'm not sure what to make of this.
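
    To put a number on the "~100 per second" figure, the PAPI_Send errors in a syslog capture can be counted per second with something like this. It is only a sketch: it assumes every line starts with the "Mon DD HH:MM:SS YYYY" stamp seen in the lines above, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Leading timestamp as it appears in the log lines quoted above,
# e.g. "Feb 23 15:11:08 2015". Other syslog layouts will not match.
STAMP = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\b")


def papi_errors_per_second(lines):
    """Count lines mentioning PAPI_Send, keyed by their one-second timestamp."""
    counts = Counter()
    for line in lines:
        if "PAPI_Send" not in line:
            continue
        match = STAMP.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python papi_rate.py < syslog_capture.txt
    counts = papi_errors_per_second(sys.stdin)
    for stamp, n in counts.most_common(5):
        print(stamp, n)
```

    A sudden jump in the per-second count would mark the moment the controller went bad, which might be a useful timestamp to hand to TAC.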



  • 10.  RE: Big problems w/ 115/225 Cluster

    Posted Mar 02, 2015 08:27 PM

    This seems to be similar to my issue. I would open a TAC case and escalate. I wasn't able to send them tech dumps due to the high CPU spikes, but I did get a syslog server up with some data.

     

    I'm also told that 4.1.1.3 may have some fixes for this and should be out in ~2 weeks. The proof, though, as they say, is in the pudding.



  • 11.  RE: Big problems w/ 115/225 Cluster

    Posted Mar 03, 2015 08:42 AM

    I downgraded back to 4.1.1.0 and the controller issue cleared up. However, I believe that leaves us exposed to the DoS issues that were supposedly mitigated in 4.1.1.2. Unfortunately, I don't have the bandwidth to troubleshoot with TAC. Hopefully they are watching these threads and can help. I'll wait for a fix to 4.1.1.2 before I upgrade again.



  • 12.  RE: Big problems w/ 115/225 Cluster

    EMPLOYEE
    Posted Jan 21, 2015 12:23 AM

    Ciscokid85,

     

    Please message me the case number.



  • 13.  RE: Big problems w/ 115/225 Cluster

    Posted Mar 07, 2015 01:30 PM

    I'm also seeing the same issue on my home office Instant network with an IAP-105 and RAP-109 running 4.1.1.2.  I have factory reset both access points one at a time, but the issue persists: my Wi-Fi clients drop the association and can't reconnect until the virtual controller is rebooted.  Here is a copy of my log files during the outage:

     

    2015-03-04 20:34:44 Local1.Error 10.0.20.161 Mar  4 20:34:43 2015 10.0.20.161 <10.0.20.161 6C:F3:7F:XX:XX:XX> stm[1775]: PAPI_Send: sendto ARM Process failed: No such file or directory Message Code 3115 Sequence Num is 61082
    2015-03-04 20:34:44 Local1.Error 10.0.20.161 Mar  4 20:34:43 2015 10.0.20.161 stm[1775]: <304065> <ERRS> <10.0.20.161 6C:F3:7F:XX:XX:XX>  PAPI_Send failed, send_papi_message_with_args, 790: No such file or directory, dstport 8494

     

     



  • 14.  RE: Big problems w/ 115/225 Cluster

    EMPLOYEE
    Posted Mar 07, 2015 02:26 PM

    Rhamaura,

     

    4.1.1.2 has been removed from the downloads website.  Please downgrade to 4.1.1.1 and let us know if you still see the issue.  If you cannot obtain the software to downgrade, please message me.



  • 15.  RE: Big problems w/ 115/225 Cluster

    Posted Mar 07, 2015 02:54 PM

    Colin,

     

    I would like to confirm the software downgrade procedure:

     

    1.  Multiple IAP-105s in a single cluster.  Log in to the virtual controller, manually upload the older firmware, click Upgrade Now, and reboot.  The remaining IAP-105 access points will automatically downgrade to the older firmware.

     

    2.  Mix of IAP-105 and IAP-225 in a single cluster.  Enable a TFTP server on my computer. Log in to the virtual controller, specify the URL location for each category of Instant AP in the format tftp://<My-PC-IP-Address>/<software name>, click Upgrade Now, and reboot.  The remaining mix of access points will automatically downgrade and reboot.
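
    For step 2, the per-category URLs can be generated rather than typed by hand. A rough sketch follows; the category labels and file names in the usage line are placeholders I made up, not real image names, and the only format assumed is the tftp://<My-PC-IP-Address>/<software name> pattern from the step above.

```python
from ipaddress import IPv4Address


def image_urls(tftp_host, images):
    """Build one tftp:// URL per AP category.

    tftp_host: IPv4 address of the PC running the TFTP server.
    images:    {category label: image file name}; both are whatever
               you choose to call them, nothing here is vendor-defined.
    """
    # Raises ValueError (AddressValueError) if the host is not a valid IPv4 address.
    IPv4Address(tftp_host)
    urls = {}
    for category, filename in images.items():
        if not filename or "/" in filename:
            raise ValueError(f"unexpected image file name: {filename!r}")
        urls[category] = f"tftp://{tftp_host}/{filename}"
    return urls


if __name__ == "__main__":
    # Placeholder names only; substitute the real image files you downloaded.
    for category, url in image_urls(
        "192.168.1.10",
        {"category-a": "image-a.bin", "category-b": "image-b.bin"},
    ).items():
        print(category, url)
```

    Checking the address and file names up front avoids pasting a malformed URL (such as one with a missing ">" or stray slash) into the upgrade form.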

     

    Am I missing anything from this procedure?



  • 16.  RE: Big problems w/ 115/225 Cluster

    EMPLOYEE
    Posted Mar 07, 2015 03:30 PM

    You are not missing anything.  Just like the procedure here:  http://www.arubanetworks.com/techdocs/Instant_41_WebHelp/InstantWebHelp.htm#UG_files/IAP_maintenance/Firmware_Image_Server_in.htm for the Image URL Option.



  • 17.  RE: Big problems w/ 115/225 Cluster

    Posted Mar 09, 2015 03:18 AM

    OK, has anyone heard an ETA for a new release with this issue fixed?



  • 18.  RE: Big problems w/ 115/225 Cluster

    EMPLOYEE
    Posted Mar 09, 2015 03:25 AM
    Christoffer,

    We will do a post in this thread when it is released. Is there an issue you are having with the downgraded code?


  • 19.  RE: Big problems w/ 115/225 Cluster

    Posted Mar 09, 2015 03:31 AM

    Don´t know yet, trying to pick code for a new deployment of IAP-205. Guessing 4.1.1.1 is the best shot for now.