Higher Education

Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

  • 1.  Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

    Posted Aug 28, 2013 01:03 PM

    Hello all, we implemented new centralized 7210 controllers in late July, and since our residents returned this past week, 20-30 of our APs (125s and 135s) have been randomly losing contact with our controllers. Investigation with Aruba TAC shows high datapath CPU utilization on the controllers when the issue crops up, and so far it looks like it may be a code issue. We are running 6.2.1.3 on both 7210 and 3600 model controllers. The 3600 was in service all last year on 6.1.x code without this high datapath CPU occurring, but now the datapath CPU is spiking on it as well as on the 7210s. Port utilization looks great (around 10% on Gb links), with no errors to speak of on the ports. Just hopeful we are not alone with this issue.

    Thanks in advance for any feedback

    mike


    #7210
    #3600


  • 2.  RE: Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

    Posted Aug 28, 2013 01:27 PM

     

    We are currently running 6.2.1.3 with three 7240's.

     

    We have been experiencing high CPU utilization on the AP125s (when running show ap debug system-status ap-name <ap-name>), even on those that have no clients, but I haven't seen the same behavior with AP105s or 135s.

     

    I currently have a case open.

     

    Would like to know what you guys find out with TAC.


    #7240


  • 3.  RE: Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

    Posted Sep 05, 2013 08:37 AM

    Yes, please post any resolution or word from TAC. We are running 6.2.1.2 and are experiencing many AP reboots per day. The reason for reboot is given as out of memory. After reviewing our case, TAC recommended upgrading to 6.2.1.3, which resolves a known memory exhaustion issue.

     

    Hoping not to trade one issue for another..

     

     Mike



  • 4.  RE: Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

    Posted Sep 05, 2013 09:37 AM

    Hello All:

    We have some feedback from TAC: the datapath CPUs (show datapath utilization on the controllers) were showing at least one CPU spiking to 100% when our APs were rebooting or bootstrapping and bouncing between master and backup LMS controllers. We had enabled Broadcast/Multicast (BCMC) Optimization on the SSID profiles, but apparently there is also an option to turn on BCMC Optimization on the VLAN interface (thank you, Princeton, for getting this feature added, as I understand it). This solved our issue. However, as we are a heavy iOS (iPad) environment, the consequence of enabling BCMC Optimization on the VLAN interface is that AirPrint and other AirGroup traffic no longer works (we use PaperCut on a Mac server to allow our iPad users to print to campus printers). So we are between the proverbial rock and a hard place until we can solve the BCMC issue on the VLAN where the controllers and APs are located.

     

    In three years with Aruba controllers/APs we have never had to use BCMC Optimization on the VLAN interface, so either we have reached some threshold of BC traffic or, as we are concerned, the move from 6.1.3.x code to 6.2 code has caused the traffic we already had to become an issue for the controllers. So for you folks with APs dropping: periodically check the datapath utilization, check the total times in the ap bss-table to make sure they closely match the uptime of each AP (looking for APs bootstrapping), and check the system log for any instances of heartbeat misses.
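
The uptime comparison described above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: it presumes the AP uptime and BSS total time have already been extracted and converted to seconds, and the dictionary layout, the `find_bootstrapped` name, the AP names, and the 10% tolerance are all hypothetical rather than part of any Aruba tooling.

```python
def find_bootstrapped(aps, tolerance=0.10):
    """Return names of APs whose BSS total time lags well behind AP uptime.

    aps maps AP name -> (uptime_seconds, bss_total_seconds).
    An AP that bootstrapped without rebooting would be expected to keep
    its uptime while its BSS timers start over, so a large gap is a hint.
    The 10% tolerance is an arbitrary, hypothetical starting point.
    """
    flagged = []
    for name, (uptime, bss_total) in sorted(aps.items()):
        if uptime <= 0:
            continue  # no meaningful uptime to compare against
        if (uptime - bss_total) / uptime > tolerance:
            flagged.append(name)
    return flagged


# Hypothetical values, in seconds, purely for illustration.
sample = {
    "dorm-ap-01": (86400, 86100),  # BSS time tracks uptime closely
    "dorm-ap-02": (86400, 5400),   # BSS time is far shorter than uptime
}
print(find_bootstrapped(sample))
```

APs that come back from this check are only candidates; the system log entries for heartbeat misses are still needed to confirm what happened.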

    Hope this is helpful to someone. I'll be staring at packet traces for a while to see if we can fully identify and resolve this errant BC/MC traffic.

     

    Mike



  • 5.  RE: Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

    Posted Sep 05, 2013 09:49 AM
    How many APs are in the network that is seeing the bouncing?


  • 6.  RE: Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

    Posted Sep 05, 2013 10:00 AM

    For us, we have 2 local controllers, one for dorms (7210) and one for the remainder of campus (a 3600 currently; another 7210 will be deployed soon, but we are leaving the 3600 in place for now just to see if we had a specific issue with the newer controller model). The dorm controller has about 84 APs (mostly 125s); the other has 104 APs (a mix of 125s and 135s). About 50% of the APs on either controller would bounce to their backup (our 1 master controller), and at the time of the issue (datapath utilization at 100%) pings to the controllers would either get very slow responses (1.5-2.5 seconds) or none at all (while pings to our gateways were fine, under 2 ms). Typically we would see the datapath utilization spike for about 3-5 minutes and then return to normal for an hour or more before another episode came along. Also, we had almost no case where both local controllers were experiencing the problem at the same time, which led us to believe it might be user traffic rather than the LAN traffic on the VLAN interface, but we are told the BC/MC Optimization on the VLAN interface on the controllers operates on the inbound traffic from the LAN into the controllers.
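
For anyone logging datapath utilization over time, the spike pattern described above (a few minutes at 100%, then quiet for an hour or more) can be pulled out of the samples with something like the Python sketch below. It assumes you already have (timestamp, percent) pairs from your own polling; the `spike_episodes` name, the 90% threshold, and the sample readings are hypothetical.

```python
from datetime import datetime, timedelta


def spike_episodes(samples, threshold=90):
    """Group consecutive samples at or above threshold into episodes.

    samples is an iterable of (datetime, percent) pairs.
    Returns a list of (start, end) datetimes, one pair per episode.
    """
    episodes = []
    start = end = None
    for ts, pct in sorted(samples):
        if pct >= threshold:
            if start is None:
                start = ts          # first high sample opens an episode
            end = ts                # extend the episode to this sample
        elif start is not None:
            episodes.append((start, end))  # a low sample closes it
            start = end = None
    if start is not None:
        episodes.append((start, end))      # episode still open at the end
    return episodes


# Hypothetical one-minute samples, for illustration only.
t0 = datetime(2013, 9, 5, 9, 0)
readings = [5, 8, 100, 100, 97, 100, 6, 4, 100, 3]
samples = [(t0 + timedelta(minutes=i), pct) for i, pct in enumerate(readings)]

for start, end in spike_episodes(samples):
    print(start.time(), "to", end.time())
```

Comparing the episode windows from two controllers is one way to confirm the observation that they rarely spike at the same time.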

     

    Mike


    #7210


  • 7.  RE: Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

    Posted Sep 05, 2013 10:04 AM
    Wow. Ok. I was anticipating you saying 100s of APs. What type of broadcast traffic is coming into the controller during these bouncing periods? I've seen (MANY times) where a loop within datapath processing results in a CPU spike. I would not be surprised if there was a specific frame/packet that was mis-processed resulting in this issue. . . especially since it seems related to the code you are running.


  • 8.  RE: Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

    Posted Sep 05, 2013 11:55 AM

    Ryan:

    I guess that is the $64k question (what BC/MC traffic is causing the issue), and it is what we are trying to determine as quickly as we can. I am hoping we are not the only ones seeing this issue with these controllers and this code version. Our AP and user population is relatively small, with under 200 APs and about 3k total clients, so with much larger organizations out there using Aruba gear I was hoping not to be the only one experiencing this. Off to do more sniffing (looks like I picked the wrong week to stop sniffing glue).

    Mike



  • 9.  RE: Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

    Posted Sep 05, 2013 12:18 PM

    Mike,

    I would recommend disabling BCMC Optimization on all VLANs and running the commands below to determine the root cause, that is, to see what kind of broadcast or multicast traffic is causing this issue.

     

    show datapath utilization

    show datapath maintenance

    show datapath frame

    show datapath message-queue

    show memory

    show cpuload current

    show cpuload

    show processes

    show process monitor statistics

    show storage

     

    Run the above commands 4-5 times in a row and collect the output. Also get the logs.tar with tech-support so that we can have this reviewed by the engineering team.
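
If you would rather script the repeated collection than paste commands by hand, the loop could look something like this Python sketch. It is only an outline: `collect` and `fake_run` are hypothetical names, and `run` stands in for whatever you use to send a command to the controller and read back the text (for example an SSH session), which is not shown here.

```python
import time
from datetime import datetime

# The command list from the post above.
COMMANDS = [
    "show datapath utilization",
    "show datapath maintenance",
    "show datapath frame",
    "show datapath message-queue",
    "show memory",
    "show cpuload current",
    "show cpuload",
    "show processes",
    "show process monitor statistics",
    "show storage",
]


def collect(run, commands=COMMANDS, rounds=5, pause=0.0):
    """Run every command once per round and keep timestamped output.

    run is a caller-supplied function (hypothetical here) that takes a
    command string and returns the controller's output as text.
    """
    records = []
    for round_no in range(1, rounds + 1):
        for cmd in commands:
            records.append({
                "round": round_no,
                "time": datetime.now().isoformat(timespec="seconds"),
                "command": cmd,
                "output": run(cmd),
            })
        if round_no < rounds:
            time.sleep(pause)  # optional gap between rounds
    return records


def fake_run(cmd):
    """Stand-in for a real controller session, for illustration only."""
    return f"(output of {cmd!r} would appear here)"


records = collect(fake_run, rounds=4)
print(len(records), "outputs collected")
```

Keeping the round number and timestamp with each output makes it easier to line the captures up against the moment the datapath CPU spikes.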

     

    thank you.



  • 10.  RE: Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

    Posted Sep 05, 2013 12:39 PM

    Hello:

    We actually did run these exact commands repeatedly with TAC, and they sent the results to engineering (if you can, see case# 1450965). Their only suggestion was to turn on BCMC Optimization at the VLAN interface; they did not have any idea what kind of traffic was spiking the datapath utilization. So I am now trying to determine what BC/MC traffic the controllers are choking on.

    Is anyone using a CheckPoint FW as the gateway to the network from their wireless controllers, and in a HA redundant fashion?

     

    mike



  • 11.  RE: Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

    Posted Sep 05, 2013 12:58 PM

    Thanks for the info. Let me check internally to review the progress on case.



  • 12.  RE: Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

    Posted Sep 09, 2013 09:41 PM

    Are these just client VLANs, or are you hosting AP VLANs on your controllers?



  • 13.  RE: Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

    EMPLOYEE
    Posted Sep 05, 2013 10:54 AM

    <removed> already asked



  • 14.  RE: Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

    Posted Sep 05, 2013 10:59 AM

    We upgraded to code 6.2.1.2 and started having 30-40 AP-125s reboot every day, so we upgraded to 6.2.1.3, which resolved the problem of the APs rebooting. Now we have users complaining about slow wireless performance, which we have never really had before. I'm not sure if it's a code problem, but from 3.x to 6.1 we had almost no complaints of wireless performance problems.

     

    Bob 

     



  • 15.  RE: Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

    Posted Sep 10, 2013 10:27 AM

    We have been noticing an issue very similar to this since we moved from 6.1 to 6.3.  Our APs are flapping between controllers on a regular basis.  We did not notice this issue on 6.1.

     

    We enabled BCMC optimization and the issue is still happening.

     

    We also noticed several of these error messages in the logs when it happens:

    Sep 10 06:00:38 :311020:  <ERRS> |AP xxxxxx sapd|  An internal system error has occurred at file sapd_redun.c function set_route_af line 650 error set_route_af: ioctl (SIOCDELRT) failed: No such process.

     

    We are also seeing these crashes in the logs on each controller frequently:

    Sep 10 00:26:23 :303080: <ERRS> |nanny| Please tar and email the file crash.tar to support@arubanetworks.com
    Sep 10 00:26:23 :303081: <ERRS> |nanny| To tar type the following commands at the Command Line Interface: (1) tar crash (2) copy flash: crash.tar tftp: [serverip] [destn filename]
    Sep 10 00:26:34 :303073: <ERRS> |nanny| Process /mswitch/bin/stm [pid 26321] died: got signal SIGABRT
    Sep 10 00:26:38 :303029: <ERRS> |nanny| Process /mswitch/bin/stm [pid 26321]: crash data saved in dir /flash/crash/process/9-10-2013@00-26-34/stm
    Sep 10 00:26:44 :303079: <ERRS> |nanny| Restarted process /mswitch/bin/stm, new pid 26702
    Sep 10 00:26:44 :303025: <ERRS> |nanny| Found core file /tmp/core.26321.stm.A6xxx_39170, 63582208 bytes, compressing...
    Sep 10 00:28:14 :303080: <ERRS> |nanny| Please tar and email the file crash.tar to support@arubanetworks.com
    Sep 10 00:28:14 :303081: <ERRS> |nanny| To tar type the following commands at the Command Line Interface: (1) tar crash (2) copy flash: crash.tar tftp: [serverip] [destn filename]

     

    Sep 10 03:51:54 :303086: <ERRS> |AP xxxxx nanny| Process Manager (nanny) shutting down - AP will reboot!
    Sep 10 05:51:12 :303073: <ERRS> |nanny| Process /mswitch/bin/stm [pid 10344] died: got signal SIGABRT
    Sep 10 05:51:16 :303029: <ERRS> |nanny| Process /mswitch/bin/stm [pid 10344]: crash data saved in dir /flash/crash/process/9-10-2013@05-51-12/stm
    Sep 10 05:51:22 :303079: <ERRS> |nanny| Restarted process /mswitch/bin/stm, new pid 22076
    Sep 10 05:51:22 :303025: <ERRS> |nanny| Found core file /tmp/core.10344.stm.A6xxx_39170, 88985600 bytes, compressing...
    Sep 10 05:51:49 :311020: <ERRS> |AP xxxx sapd| An internal system error has occurred at file sapd_redun.c function redun_tunnel_up line 4992 error redun_tunnel_up: client not found port:8423.
    Sep 10 05:55:14 :303080: <ERRS> |nanny| Please tar and email the file crash.tar to support@arubanetworks.com
    Sep 10 05:55:14 :303081: <ERRS> |nanny| To tar type the following commands at the Command Line Interface: (1) tar crash (2) copy flash: crash.tar tftp: [serverip] [destn filename]
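
To line up process crashes against AP flaps, the nanny "died" lines in excerpts like the ones above can be pulled out with a short Python script. This is a minimal sketch that assumes the log lines keep the same layout as shown in this post; the `find_crashes` name and the regular expression are my own and not from any Aruba tool.

```python
import re

# Matches nanny lines of the form seen above, e.g.
# "Sep 10 00:26:34 :303073: <ERRS> |nanny| Process /mswitch/bin/stm [pid 26321] died: got signal SIGABRT"
CRASH_RE = re.compile(
    r"^(?P<when>\w{3} +\d+ \d\d:\d\d:\d\d) :303073:\s+<ERRS> \|.*?nanny\| "
    r"Process (?P<path>\S+) \[pid (?P<pid>\d+)\] died: got signal (?P<signal>\w+)"
)


def find_crashes(lines):
    """Return (timestamp, process name, pid, signal) for each crash line."""
    crashes = []
    for line in lines:
        m = CRASH_RE.match(line.strip())
        if m:
            name = m["path"].rsplit("/", 1)[-1]  # keep just the binary name
            crashes.append((m["when"], name, int(m["pid"]), m["signal"]))
    return crashes


# A few lines copied from the excerpt above.
log = """\
Sep 10 00:26:23 :303080: <ERRS> |nanny| Please tar and email the file crash.tar to support@arubanetworks.com
Sep 10 00:26:34 :303073: <ERRS> |nanny| Process /mswitch/bin/stm [pid 26321] died: got signal SIGABRT
Sep 10 00:26:44 :303079: <ERRS> |nanny| Restarted process /mswitch/bin/stm, new pid 26702
Sep 10 05:51:12 :303073: <ERRS> |nanny| Process /mswitch/bin/stm [pid 10344] died: got signal SIGABRT
""".splitlines()

for crash in find_crashes(log):
    print(crash)
```

The resulting timestamps can then be compared against the times the APs moved between controllers.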

     

     

    We have a case open with TAC, and as of yet they do not have a solution.  They tried to say our link was going down, but we have constant pings running to the controllers and the switches in between, and none of them show ping loss when it happens.  The APs don't have any loss unless they decide to reboot, which also seems to happen frequently since moving to 6.3.

     

    I have attached graphs of our AP movements.



  • 16.  RE: Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

    Posted Sep 10, 2013 10:49 AM
    Oh no! Your STM process is crashing. That's a bad issue. Are others seeing this as well?

    - Ryan -


  • 17.  RE: Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

    Posted Sep 10, 2013 01:31 PM

    Is this occurring with AP125s?

     

     



  • 18.  RE: Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

    Posted Sep 10, 2013 02:01 PM

    Can you tell me which version of 6.3 code you are running (6.3.0.1 or some other)?

    We are eagerly awaiting 6.3.1.0 (or something close; we were tentatively told it would be available near the end of this week) to help resolve our datapath utilization spiking issue. Hopefully we don't trade one big issue for another.

    Mike



  • 19.  RE: Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

    Posted Sep 10, 2013 02:19 PM

    We are running 6.3.0.1

     

    We only have 3 AP125s. One of them died in this process, which could be a hardware issue or could possibly be related to this issue.  I can ping it, and the logs show it trying to talk to the controller, but it never comes up.  Going to connect via serial to see what we can figure out.  The other two that we have are having the same issue.

     

     

    Most of our APs are 105s.

     

    The AP flapping issue happens at the same time we see the STM process crash.



  • 20.  RE: Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

    Posted Sep 10, 2013 06:13 PM

    Can you please check whether there's high CPU/memory utilization on the AP125s when you run the following command:

    show ap debug system-status ap-name <ap-name> | begin CPU

     

     

     



  • 21.  RE: Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

    Posted Sep 10, 2013 09:23 PM

    This is from an AP in our dining area with an average of 20-40 clients over the last several hours
    --------------------
    Timestamp            CPU Util(%)  Memory Util(%)
    ---------            -----------  --------------
    2013-09-10 20:15:44  12           90
    2013-09-10 20:15:34  15           90
    2013-09-10 20:15:24  14           90
    2013-09-10 20:15:14  14           90
    2013-09-10 20:15:04  13           90
    2013-09-10 20:14:54  14           90
    2013-09-10 20:14:44  14           90

    Peak CPU Util in the last one hour
    ----------------------------------
    Timestamp            CPU Util(%)  Memory Util(%)
    ---------            -----------  --------------
    2013-09-10 19:17:18  51           91
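
For checking many APs at once, output in the shape shown above can be parsed with a few lines of Python. This is a rough sketch that assumes each data row is a timestamp followed by two integers, as in this post; the `parse_rows` and `high_memory` names and the 85% limit are hypothetical choices, not values recommended by TAC.

```python
import re

# One data row: "YYYY-MM-DD HH:MM:SS  <cpu>  <memory>"
ROW_RE = re.compile(r"^\s*(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)\s+(\d+)\s+(\d+)\s*$")


def parse_rows(text):
    """Return (timestamp, cpu_percent, memory_percent) for each data row."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m:
            rows.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return rows


def high_memory(rows, limit=85):
    """Return the rows whose memory utilization is at or above limit."""
    return [row for row in rows if row[2] >= limit]


# Rows copied from the output above; headers are skipped by the regex.
output = """\
Timestamp            CPU Util(%)  Memory Util(%)
---------            -----------  --------------
2013-09-10 20:15:44  12           90
2013-09-10 20:15:34  15           90
2013-09-10 19:17:18  51           91
"""

rows = parse_rows(output)
print(len(high_memory(rows)), "of", len(rows), "samples at or above the limit")
print("peak CPU:", max(cpu for _, cpu, _ in rows))
```

Running the same check across APs with and without clients would show whether the memory figure tracks client load at all.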



  • 22.  RE: Anyone running 6.2.1.3 code and seeing APs randomly losing connection to controllers?

    Posted Sep 10, 2013 09:34 PM

    Wow, memory is really high. How many clients do you have connected to that AP?