Wireless Access


Access network design for branch, remote, outdoor, and campus locations with HPE Aruba Networking access points and mobility controllers.

Aruba APs Fail to Master and go into Dirty Config

  • 1.  Aruba APs Fail to Master and go into Dirty Config

    Posted Aug 17, 2013 09:08 PM

    Six months ago we upgraded from a single 2400 controller to a 7210 Master/Local setup. Devices come up without issue, and then at random the APs fail over from the Local to the Master and go into a Dirty or No Config state. Once they all clear up, they fail back to the Local and again go into a Dirty or No Config state; once they clear up on the Local, they stay functioning. It happens out of the blue: some days it happens only once or twice, and at other times it goes for a week without occurring. The controllers are located in the same data center and the APs are at many locations.

    I had a support case opened, but support reviewed the devices and said everything was OK configuration-wise, and since the issue couldn't be reproduced while they were on the phone, nothing could be identified.

    Yesterday the devices all got stuck on the Local controller with an ID flag. I could access the controller without issue and power cycled it, which triggered the APs to fail over to the Master and then back to the Local after it rebooted. The ID flag was a first; normally it is only a D flag.

    I was considering moving the controllers to another VLAN and re-IPing the devices, but I'm not certain how much trouble that will be with all the APs offsite, especially not knowing what the issue is. Could this be as simple as a bad network cable?

     

    I notice this in the logs for many of my APs when I run show log system all:

    Aug 6 08:56:16 :311004:  <WARN> |AP RIDOA_AP65.11@158.123.114.148 sapd|  Missed 25 heartbeats; rebootstrapping

     

    I just saw it fail again, and show log system 50 on the Local shows this:

    Aug 17 20:58:30 :311004:  <WARN> |AP RIDOA_AP65.12@158.123.114.160 sapd|  Missed 25 heartbeats; rebootstrapping

     

    Full output of show log system 50 from the Local 7210:

     

    Aug 17 20:57:25  KERNEL:   0:<7>UDP: short packet: From 255.255.255.255:8211 1621/1517 to 129.2.139.140:8419
    Aug 17 20:57:27  KERNEL:   0:<7>UDP: short packet: From 255.255.255.255:8211 1621/245 to 129.2.139.140:8419
    Aug 17 20:57:37  KERNEL:   0:<7>UDP: short packet: From 0.0.0.0:8211 1621/245 to 10.200.200.2:8419
    Aug 17 20:57:37  KERNEL:   0:<7>UDP: short packet: From 255.255.255.255:8211 1621/245 to 129.2.139.140:8419
    Aug 17 20:57:42 :303073:  <ERRS> |nanny|  Process /mswitch/bin/stm [pid 3077] died: got signal SIGSEGV
    Aug 17 20:57:45 :399803:  <ERRS> ||  An internal system error has occurred at file mon_client.c function mon_client_send_query line 114 error PAPI_Send to 8456 failed: Connection refused.
    Aug 17 20:57:47  KERNEL:   0:<7>UDP: short packet: From 255.255.255.255:8211 1621/245 to 129.2.139.140:8419
    Aug 17 20:57:49 :303029:  <ERRS> |nanny|  Process /mswitch/bin/stm [pid 3077]: crash data saved in dir /flash/crash/process/8-17-2013@20-57-42/stm
    Aug 17 20:57:50  KERNEL:   0:<7>UDP: short packet: From 255.255.255.255:8211 1621/245 to 129.2.139.140:8419
    Aug 17 20:57:50  KERNEL:   0:<7>UDP: short packet: From 0.0.0.0:8211 1621/245 to 10.200.200.4:8419
    Aug 17 20:57:50  KERNEL:   0:<7>UDP: short packet: From 255.255.255.255:8211 1621/245 to 129.2.139.140:8419
    Aug 17 20:57:53 :303079:  <ERRS> |nanny|  Restarted process /mswitch/bin/stm, new pid 7139
    Aug 17 20:57:53 :303025:  <ERRS> |nanny|  Found core file /tmp/core.3077.stm.A72xx_38532, 65011712 bytes, compressing...
    Aug 17 20:58:01  KERNEL:   0:<7>UDP: short packet: From 255.255.255.255:8211 1621/245 to 129.2.139.140:8419
    Aug 17 20:58:03  KERNEL:   0:<7>UDP: short packet: From 255.255.255.255:8211 1621/1517 to 129.2.139.140:8419
    Aug 17 20:58:04 :304001:  <ERRS> |stm|  Unexpected stm (Station management) runtime error at handle_ap_statistics, 1019, Length mismatch expected 1527 received 1387 from             AP with eth_mac 00:24:6c:c9:96:34, and phy_type is 1
    Aug 17 20:58:04 :304001:  <ERRS> |stm|  Unexpected stm (Station management) runtime error at handle_ap_statistics, 1019, Length mismatch expected 1527 received 1387 from             AP with eth_mac 00:24:6c:c9:96:20, and phy_type is 1
    Aug 17 20:58:04 :304001:  <ERRS> |stm|  Unexpected stm (Station management) runtime error at handle_ap_statistics, 1019, Length mismatch expected 1527 received 1387 from             AP with eth_mac 00:24:6c:c9:a7:56, and phy_type is 1
    Aug 17 20:58:04 :304001:  <ERRS> |stm|  Unexpected stm (Station management) runtime error at handle_ap_statistics, 1019, Length mismatch expected 1527 received 1387 from             AP with eth_mac 00:24:6c:c9:a6:f6, and phy_type is 1
    Aug 17 20:58:05  KERNEL:   0:<7>UDP: short packet: From 255.255.255.255:8211 1621/245 to 129.2.139.140:8419
    Aug 17 20:58:12 :304001:  <ERRS> |stm|  Unexpected stm (Station management) runtime error at handle_ap_statistics, 1019, Length mismatch expected 1527 received 1387 from             AP with eth_mac 00:1a:1e:c7:c0:4e, and phy_type is 1
    Aug 17 20:58:12 :304001:  <ERRS> |stm|  Unexpected stm (Station management) runtime error at handle_ap_statistics, 1019, Length mismatch expected 1527 received 1387 from             AP with eth_mac 00:0b:86:cf:bd:d1, and phy_type is 1
    Aug 17 20:58:12  KERNEL:   0:<7>UDP: short packet: From 255.255.255.255:8211 1621/1517 to 129.2.139.140:8419
    Aug 17 20:58:12 :304001:  <ERRS> |stm|  Unexpected stm (Station management) runtime error at handle_ap_statistics, 1019, Length mismatch expected 1527 received 1387 from             AP with eth_mac 00:24:6c:c9:a6:e4, and phy_type is 1
    Aug 17 20:58:18 :303080:  <ERRS> |nanny|  Please tar and email the file crash.tar to support@arubanetworks.com
    Aug 17 20:58:18 :303081:  <ERRS> |nanny| To tar type the following commands at the Command Line Interface: (1) tar crash (2) copy flash: crash.tar tftp: [serverip] [destn filename]
    Aug 17 20:58:20  KERNEL:   0:<7>UDP: short packet: From 255.255.255.255:8211 1621/245 to 129.2.139.140:8419
    Aug 17 20:58:25  KERNEL:   0:<7>UDP: short packet: From 255.255.255.255:8211 1621/1517 to 129.2.139.140:8419
    Aug 17 20:58:27  KERNEL:   0:<7>UDP: short packet: From 255.255.255.255:8211 1621/245 to 129.2.139.140:8419
    Aug 17 20:58:28 :311004:  <WARN> |AP RIEOC_AP105.8@10.200.200.12 sapd|  Missed 25 heartbeats; rebootstrapping
    Aug 17 20:58:28 :311004:  <WARN> |AP RIPUC_AP105.1@10.203.1.4 sapd|  Missed 25 heartbeats; rebootstrapping
    Aug 17 20:58:28 :311004:  <WARN> |AP RIEOC_AP105.3@10.200.200.8 sapd|  Missed 25 heartbeats; rebootstrapping
    Aug 17 20:58:28 :311004:  <WARN> |AP RIETH_AP105.2@10.230.40.4 sapd|  Missed 25 heartbeats; rebootstrapping
    Aug 17 20:58:28 :311004:  <WARN> |AP RIDOA_AP65.4@158.123.114.147 sapd|  Missed 25 heartbeats; rebootstrapping
    Aug 17 20:58:28 :311004:  <WARN> |AP RIDOTMT_AP105.1@10.203.36.11 sapd|  Missed 25 heartbeats; rebootstrapping
    Aug 17 20:58:28 :311004:  <WARN> |AP RIEOC_AP105.9@10.200.200.13 sapd|  Missed 25 heartbeats; rebootstrapping
    Aug 17 20:58:29 :311004:  <WARN> |AP RIDOA_AP105.8@158.123.114.202 sapd|  Missed 25 heartbeats; rebootstrapping
    Aug 17 20:58:29 :311004:  <WARN> |AP RIEOC_AP105.2@10.200.200.5 sapd|  Missed 25 heartbeats; rebootstrapping
    Aug 17 20:58:29 :311004:  <WARN> |AP RIDOA_AP65.2@158.123.114.146 sapd|  Missed 25 heartbeats; rebootstrapping
    Aug 17 20:58:29 :311004:  <WARN> |AP RISH_AP65.10@10.230.4.11 sapd|  Missed 25 heartbeats; rebootstrapping
    Aug 17 20:58:29 :311004:  <WARN> |AP RISH_AP65.8@10.230.4.4 sapd|  Missed 25 heartbeats; rebootstrapping
    Aug 17 20:58:29 :311004:  <WARN> |AP RIEOC_AP105.12@10.200.200.6 sapd|  Missed 25 heartbeats; rebootstrapping
    Aug 17 20:58:29 :311004:  <WARN> |AP RIDOA_AP65.11@158.123.114.148 sapd|  Missed 25 heartbeats; rebootstrapping
    Aug 17 20:58:29 :311004:  <WARN> |AP RIPUC_AP65.2@10.203.1.2 sapd|  Missed 25 heartbeats; rebootstrapping
    Aug 17 20:58:29 :311004:  <WARN> |AP RISH_AP65.5@10.230.4.8 sapd|  Missed 25 heartbeats; rebootstrapping
    Aug 17 20:58:29 :311004:  <WARN> |AP RISH_AP65.3@10.230.4.7 sapd|  Missed 25 heartbeats; rebootstrapping
    Aug 17 20:58:29 :311004:  <WARN> |AP RIDOA_AP65.6@158.123.114.150 sapd|  Missed 25 heartbeats; rebootstrapping
    Aug 17 20:58:30 :311004:  <WARN> |AP RISH_AP65.2@10.230.4.5 sapd|  Missed 25 heartbeats; rebootstrapping
    Aug 17 20:58:30 :311004:  <WARN> |AP RISH_AP65.1@10.230.4.6 sapd|  Missed 25 heartbeats; rebootstrapping
    Aug 17 20:58:30 :311004:  <WARN> |AP RIDOA_AP65.12@158.123.114.160 sapd|  Missed 25 heartbeats; rebootstrapping


    #7210


  • 2.  RE: Aruba APs Fail to Master and go into Dirty Config

    EMPLOYEE
    Posted Aug 18, 2013 06:52 AM

    You did not say what version of ArubaOS you have, but you should start with opening a case with TAC.  Why?  There are many, many, many reasons why this could be happening and the simple output from the log does not point to any of them specifically.  All it says is that a number of access points lost connectivity and one crashed....



  • 3.  RE: Aruba APs Fail to Master and go into Dirty Config

    Posted Aug 18, 2013 02:59 PM
    Currently running 6.2.1.2. I had opened a TAC case, but the issue sometimes clears up within 10 minutes, so by the time I get through to support they never see anything wrong. They stated that everything looked correctly configured.


  • 4.  RE: Aruba APs Fail to Master and go into Dirty Config

    EMPLOYEE
    Posted Aug 18, 2013 03:51 PM

     

    Evan,

     

    Colin's right on here. If I read things correctly, the initial failure scenario is lost heartbeats from AP to controller, causing the failover.

     

    When the APs fail over to the master controller, they show the "D" (dirty) flag, which means they have not received a complete configuration.

     

    Reviewing your post, a few notes:

     

    These indicate that the AP has missed heartbeats and is rebootstrapping:

     

    Aug 6 08:56:16 :311004:  <WARN> |AP RIDOA_AP65.11@158.123.114.148 sapd|  Missed 25 heartbeats; rebootstrapping

     

    It sounds like there is some intermittent connectivity issue between a number of APs and the controller(s) where they are currently connected.
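
    One way to check whether the missed-heartbeat events are isolated or hit many sites at once is to group them by AP subnet. The following is only a rough sketch: the regular expression is written against the 311004 lines quoted in this thread, and the helper names (rebootstrap_events, events_per_subnet) and the "first three octets" grouping are illustrative assumptions, not anything provided by ArubaOS.

```python
import re
from collections import Counter

# Matches the sapd warning (message ID 311004) as it appears in this thread.
REBOOTSTRAP = re.compile(
    r"^\s*(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+:311004:\s+<WARN>\s+"
    r"\|AP (?P<name>[^@|]+)@(?P<ip>[\d.]+) sapd\|\s+"
    r"Missed (?P<missed>\d+) heartbeats; rebootstrapping"
)

def rebootstrap_events(lines):
    """Return (timestamp, ap_name, ap_ip) for every rebootstrap warning."""
    events = []
    for line in lines:
        m = REBOOTSTRAP.match(line)
        if m:
            events.append((m["ts"], m["name"], m["ip"]))
    return events

def events_per_subnet(events):
    """Count events per prefix, using the first three octets as a crude
    stand-in for the site subnet (an assumption; real masks may differ)."""
    return Counter(ip.rsplit(".", 1)[0] for _, _, ip in events)

# A few lines copied from the log in the first post.
sample = [
    "Aug 17 20:58:28 :311004:  <WARN> |AP RIEOC_AP105.8@10.200.200.12 sapd|  Missed 25 heartbeats; rebootstrapping",
    "Aug 17 20:58:28 :311004:  <WARN> |AP RIPUC_AP105.1@10.203.1.4 sapd|  Missed 25 heartbeats; rebootstrapping",
    "Aug 17 20:58:29 :311004:  <WARN> |AP RIDOA_AP65.2@158.123.114.146 sapd|  Missed 25 heartbeats; rebootstrapping",
    "Aug 17 20:57:53 :303079:  <ERRS> |nanny|  Restarted process /mswitch/bin/stm, new pid 7139",
]

events = rebootstrap_events(sample)
for subnet, count in events_per_subnet(events).items():
    print(subnet, count)
```

    If APs in several unrelated subnets rebootstrap within the same second or two, that would suggest looking at something they share (the controller or the data center path) before looking at individual sites.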

     

     

    These messages indicate frame size issues:

     

    Aug 17 20:57:25  KERNEL:   0:<7>UDP: short packet: From 255.255.255.255:8211 1621/1517 to 129.2.139.140:8419
    Aug 17 20:57:27  KERNEL:   0:<7>UDP: short packet: From 255.255.255.255:8211 1621/245 to 129.2.139.140:8419
    Aug 17 20:57:37  KERNEL:   0:<7>UDP: short packet: From 0.0.0.0:8211 1621/245 to 10.200.200.2:8419
    Aug 17 20:57:37  KERNEL:   0:<7>UDP: short packet: From 255.255.255.255:8211 1621/245 to 129.2.139.140:8419
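
    To see how much of each datagram is going missing, the two numbers in these lines can be pulled out and compared. This sketch assumes they mean "length claimed in the UDP header / bytes actually received", which is how the stock Linux kernel formats its short packet message; the ArubaOS kernel may differ, so treat the field names below as hypothetical.

```python
import re

# Field names "claimed" and "received" are an interpretation, not documented
# ArubaOS terminology.
SHORT_PACKET = re.compile(
    r"UDP: short packet: From (?P<src>[\d.]+):(?P<sport>\d+) "
    r"(?P<claimed>\d+)/(?P<received>\d+) to (?P<dst>[\d.]+):(?P<dport>\d+)"
)

def parse_short_packet(line):
    """Return a dict describing one short-packet line, or None if no match."""
    m = SHORT_PACKET.search(line)
    if m is None:
        return None
    claimed = int(m["claimed"])
    received = int(m["received"])
    return {
        "src": m["src"],
        "dst": m["dst"],
        "claimed": claimed,
        "received": received,
        "missing": claimed - received,  # bytes that never arrived
    }

line = ("Aug 17 20:57:25  KERNEL:   0:<7>UDP: short packet: "
        "From 255.255.255.255:8211 1621/1517 to 129.2.139.140:8419")
print(parse_short_packet(line))
```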

     

     

    There is a crash present. I'd run "tar crash" on the controller CLI, which will gather any crash data present.


    Aug 17 20:57:42 :303073:  <ERRS> |nanny|  Process /mswitch/bin/stm [pid 3077] died: got signal SIGSEGV
    Aug 17 20:57:45 :399803:  <ERRS> ||  An internal system error has occurred at file mon_client.c function mon_client_send_query line 114 error PAPI_Send to 8456 failed: Connection refused.

     

    Aug 17 20:57:49 :303029:  <ERRS> |nanny|  Process /mswitch/bin/stm [pid 3077]: crash data saved in dir /flash/crash/process/8-17-2013@20-57-42/stm
    Aug 17 20:57:50  KERNEL:   0:<7>UDP: short packet: From 255.255.255.255:8211 1621/245 to 129.2.139.140:8419
    Aug 17 20:57:50  KERNEL:   0:<7>UDP: short packet: From 0.0.0.0:8211 1621/245 to 10.200.200.4:8419
    Aug 17 20:57:50  KERNEL:   0:<7>UDP: short packet: From 255.255.255.255:8211 1621/245 to 129.2.139.140:8419
    Aug 17 20:57:53 :303079:  <ERRS> |nanny|  Restarted process /mswitch/bin/stm, new pid 7139
    Aug 17 20:57:53 :303025:  <ERRS> |nanny|  Found core file /tmp/core.3077.stm.A72xx_38532, 65011712 bytes, compressing...
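
    It may also help to line up the timestamps of the stm crash, the restart, and the first rebootstrap warning. The sketch below uses only timestamps that appear in the posted log; the year 2013 is taken from the post date, and parse_ts is just an illustrative helper name. The ordering alone does not prove that the crash caused the rebootstraps.

```python
from datetime import datetime

def parse_ts(ts, year=2013):
    """Parse a syslog-style timestamp; the log omits the year, so it is supplied."""
    return datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S")

crash = parse_ts("Aug 17 20:57:42")              # stm died with SIGSEGV
restart = parse_ts("Aug 17 20:57:53")            # nanny restarted stm
first_rebootstrap = parse_ts("Aug 17 20:58:28")  # first "Missed 25 heartbeats"

crash_to_restart = (restart - crash).total_seconds()
crash_to_rebootstrap = (first_rebootstrap - crash).total_seconds()

print(f"stm down for about {crash_to_restart:.0f} s")
print(f"first rebootstrap {crash_to_rebootstrap:.0f} s after the crash")
```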

     

     

    These messages indicate a packet length issue. It sounds as if the MTU is limited between the AP and the controller, which may affect controller-to-AP configuration communications.

    Aug 17 20:58:04 :304001:  <ERRS> |stm|  Unexpected stm (Station management) runtime error at handle_ap_statistics, 1019, Length mismatch expected 1527 received 1387 from             AP with eth_mac 00:24:6c:c9:96:34, and phy_type is 1
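
    A quick way to see whether every affected AP is short by the same number of bytes is to extract the expected and received lengths from the 304001 lines. This is a minimal sketch written against the lines quoted in this thread; the helper name length_shortfalls is hypothetical.

```python
import re

LENGTH_MISMATCH = re.compile(
    r"Length mismatch expected (?P<expected>\d+) received (?P<received>\d+) "
    r"from\s+AP with eth_mac (?P<mac>[0-9a-f:]{17})"
)

def length_shortfalls(lines):
    """Map each AP MAC to (expected - received) for its mismatch messages."""
    shortfalls = {}
    for line in lines:
        m = LENGTH_MISMATCH.search(line)
        if m:
            shortfalls[m["mac"]] = int(m["expected"]) - int(m["received"])
    return shortfalls

# Two of the lines from the first post, shortened to the relevant part.
sample = [
    "Length mismatch expected 1527 received 1387 from             AP with eth_mac 00:24:6c:c9:96:34, and phy_type is 1",
    "Length mismatch expected 1527 received 1387 from             AP with eth_mac 00:1a:1e:c7:c0:4e, and phy_type is 1",
]

result = length_shortfalls(sample)
print(result)
print("distinct shortfalls:", set(result.values()))
```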

     

     

    I'd suggest looking at the initial event causing the missed heartbeats, and also at the MTU along the path between the AP and the master controller.
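
    For the MTU check, one common approach is to send pings with the Don't Fragment bit set and search for the largest payload that still gets through. The sketch below assumes a Linux host with iputils ping (-M do, -s, -c, -W); the function names are illustrative, and the search takes any probe function so the logic can be tried without touching the network. For IPv4, the path MTU is the largest passing payload plus 28 bytes (20 IP + 8 ICMP).

```python
import subprocess

def ping_df(host, payload):
    """Send one ICMP echo with DF set (Linux iputils ping). True if it got a reply."""
    result = subprocess.run(
        ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(payload), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def largest_passing_payload(probe, low=0, high=1472):
    """Binary-search the largest payload in [low, high] for which probe() is True.

    Returns None if even the smallest payload fails (host unreachable).
    Assumes that if a payload passes, every smaller payload passes too.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low

# Offline demonstration with a simulated path that passes payloads up to 1372.
simulated = largest_passing_payload(lambda size: size <= 1372)
print("largest payload:", simulated, "-> path MTU:", simulated + 28)

# Against a real AP (run from a host near the controller):
# largest_passing_payload(lambda size: ping_df("10.200.200.12", size))
```

    A result well below 1472 for some sites but not others would point at a specific WAN or tunnel segment rather than at the controllers.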