Wireless Access

Access network design for branch, remote, outdoor and campus locations with Aruba access points, and mobility controllers.

Backup MM lost network connection, but not to Master MM

  • 1.  Backup MM lost network connection, but not to Master MM

    Posted Oct 29, 2019 11:05 AM

    Hi

     

    I have two Mobility Masters running version 8.3.7.

     

    All of a sudden, the backup has lost contact with the outside network. However, it still has a tunnel up towards the Master, VRRP is up, and the database is synced as it should be.


    From the backup I can reach the master's IP, the VRRP IP and the backup's own IP; everything else is unreachable, not even devices on the same layer 2 domain.


    No change has occurred on the hypervisor; both VMs are on the same host, which allows promiscuous mode.


    All IP configuration (gateway, routing table, subnet mask) is correct.

     

    My theory is that all traffic somehow goes through the tunnel, but I cannot verify this.

     

    Is there anyone who recognizes the problem and has a solution?

     

    BR



  • 2.  RE: Backup MM lost network connection, but not to Master MM

    Posted Oct 30, 2019 07:28 PM

    Wow, I've run into the same issue maybe twice, but I just figured it must be my network. In my case I actually lost the Primary MM: it lost all communication with the outside world, but VRRP kept working and it kept talking to the Standby, so VRRP never failed over. Of course this took my MM offline, because the primary server kept the VIP. I could not ping the default gateway from the MM console, but I could ping the peer. In my case the only thing that brought it back online was a reboot. Since my MM was offline, I didn't have time to troubleshoot with TAC.

     

    If you have time to troubleshoot with TAC while the issue is happening that might help us all figure out what's going on. 

     

    In my case I can't say for sure what version I was on; most likely I was either on 8.3.0.7 or 8.5.0.1. I've been on 8.5.0.2 for a few weeks now and haven't had the issue again.



  • 3.  RE: Backup MM lost network connection, but not to Master MM

    Posted Oct 30, 2019 07:51 PM

    The #1 reason we see for controllers losing connectivity in the field is too much traffic (broadcast/multicast, for example) on their subnet. Controllers have a firewall that will drop traffic if it exceeds certain limits. "show firewall | include Rate" will show the rates above which traffic is limited:

    (Babarella) #show firewall | include Rate
    Policy                                       Action                                          Rate        Port
    Rate limit CP untrusted ucast traffic        Enabled                                         9765 pps     
    Rate limit CP untrusted mcast traffic        Enabled                                         3906 pps     
    Rate limit CP trusted ucast traffic          Enabled                                         65535 pps    
    Rate limit CP trusted mcast traffic          Enabled                                         3906 pps     
    Rate limit CP route traffic                  Enabled                                         976 pps      
    Rate limit CP session mirror traffic         Enabled                                         976 pps      
    Rate limit CP auth process traffic           Enabled                                         976 pps      
    Rate limit CP vrrp traffic                   Enabled                                         512 pps      
    Rate limit CP ARP traffic                    Enabled                                         3906 pps     
    Rate limit CP L2 protocol/other traffic      Enabled                                         1953 pps     
    Rate limit CP IKE traffic                    Disabled                                                     

    "show datapath bwm" will tell you if that traffic has even been exceeded (the policed" column.

     

    (Babarella) #show datapath bwm 
    
    Datapath Bandwidth Management Table Entries
    -------------------------------------------
    Contract Types : 
       0 - CP Dos 1 - Configured contracts 2 - Internal contracts
    ------------------------------------------------
    Flags: Q - No drop, P - No shape(Only Policed), 
           T - Auto tuned 
    --------------------------------------------------------------------
    Rate: pps - Packets-per-second (256 byte packets), bps - Bits-per-second
    --------------------------------------------------------------------
          Cont                          Avail     Queued/Pkts 
    Type  Id    Rate       Policed     Credits  Bytes         Flags    CPU      Status
    ----  ----  ---------  ----------  -------  ------------  -------  -------  ----------
    0     1     9792 pps   0           305            0/0              4        ALLOCATED
    0     2     3936 pps   0           123            0/0              4        ALLOCATED
    0     3     65536 pps  0           2048           0/0              4        ALLOCATED
    0     4     3936 pps   0           123            0/0              4        ALLOCATED
    0     5     992 pps    0           31             0/0              4        ALLOCATED
    0     6     992 pps    0           31             0/0              4        ALLOCATED
    0     7     992 pps    0           31             0/0              4        ALLOCATED
    0     8     512 pps    0           16             0/0              4        ALLOCATED
    0     9     3936 pps   0           123            0/0              4        ALLOCATED
    0     10    1984 pps   0           62             0/0              4        ALLOCATED
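
    If a Policed counter is climbing, the corresponding control-plane contract can be raised (or, better, the offending traffic removed from the subnet). The commands below are a sketch only; the exact "firewall cp-bandwidth-contract" syntax and the 7812 pps value are illustrative, so verify against the CLI reference for your ArubaOS version:

    ! Sketch: raise the trusted multicast control-plane rate limit
    ! (7812 pps is an arbitrary example value, not a recommendation)
    (host) [mynode] (config) #firewall cp-bandwidth-contract trusted-mcast pps 7812
    ! Then re-check whether packets are still being policed
    (host) [mynode] #show datapath bwm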

    Long story short:

    - Make sure the management subnet of your controllers is not a large broadcast domain. In addition, avoid putting APs directly on your MD or MM management subnet; when broadcast and ARP traffic spikes, the MD or MM will protect itself by dropping useful traffic, such as the VRRP and ARP traffic necessary to communicate with outside components.

    - If you can, make the management subnet of your MM different from the management subnet of your MDs, so that the VRRP/broadcast traffic is on separate subnets, avoiding the same issue as above.

    - Enable bcmc-optimization on all VLANs to drop unnecessary broadcast/multicast traffic so that the controller does not consume cycles attempting to process it.

    - Make sure that you are only trunking VLANs from your switch to your controller that the controller will actually be using. Traffic on unnecessary VLANs forces the controller to process it, and it will be policed if it goes over a threshold. Conversely, don't enable VLANs on an MD that will not be used by that MD; the controller would have to process traffic that will never be seen or used by the MD.

    - Do not force any unnecessary physical redundancy. Don't feel the need to dual-connect controllers to different switches; you could introduce an inadvertent loop that completely or partially blackholes your controller.

    - Do not span MDs in a cluster physically far from each other. If the latency between two MDs in a cluster increases due to the distance between them, it can create a "split brain" situation where you will not know which MD has control over which APs. This can easily happen when a lot of traffic is being generated and the MDs in a cluster are far away from each other. Avoid this design so that AP redundancy is predictable.
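
    The bcmc-optimization tip above can be sketched as follows in ArubaOS 8; VLAN 25 is a placeholder, and the exact prompt/submode name may differ slightly by version:

    (host) [mynode] (config) #interface vlan 25
    (host) [mynode] (config-submode) #bcmc-optimization
    (host) [mynode] (config-submode) #exit

    Broadly, with bcmc-optimization enabled the controller drops broadcast/multicast on that VLAN while still permitting essentials such as ARP and DHCP; confirm the exact behavior for your release before enabling it on VLANs that rely on other broadcast protocols.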

     

    *None of these tips will eliminate the chance of your MDs and MMs losing connectivity, but they will (1) decrease the likelihood and (2) make it easier to understand an issue when you encounter one.*

     



  • 4.  RE: Backup MM lost network connection, but not to Master MM

    Posted Oct 31, 2019 03:43 AM


  • 5.  RE: Backup MM lost network connection, but not to Master MM

    Posted Oct 31, 2019 03:48 AM

    Thanks for the elaborate answer.

    I didn't know this, so it was useful information.

     

    However, the MM is not in a large L2 domain, and only the backup has problems. The master shows the same output but is not affected.

    However, the symptoms do not point to the firewall: the output below shows 0 on all policed rates.
    I have restarted the backup MM without results.

    Any idea on what to do next?

     

    show firewall | include Rate
    Policy                                       Action                                            Rate       Port
    Rate limit CP untrusted ucast traffic        Enabled                                           9765 pps
    Rate limit CP untrusted mcast traffic        Enabled                                           1953 pps
    Rate limit CP trusted ucast traffic          Enabled                                           98304 pps
    Rate limit CP trusted mcast traffic          Enabled                                           1953 pps
    Rate limit CP route traffic                  Enabled                                           976 pps
    Rate limit CP session mirror traffic         Enabled                                           976 pps
    Rate limit CP auth process traffic           Enabled                                           976 pps
    Rate limit CP vrrp traffic                   Enabled                                           512 pps
    Rate limit CP ARP traffic                    Enabled                                           976 pps
    Rate limit CP L2 protocol/other traffic      Enabled                                           976 pps
    Rate limit CP IKE traffic                    Enabled                                           1953 pps
    show datapath bwm
    
    Datapath Bandwidth Management Table Entries
    -------------------------------------------
    Contract Types :
       0 - CP Dos 1 - Configured contracts 2 - Internal contracts
    ------------------------------------------------
    Flags: Q - No drop, P - No shape(Only Policed),
           T - Auto tuned
    --------------------------------------------------------------------
    Rate: pps - Packets-per-second (256 byte packets), bps - Bits-per-second
    --------------------------------------------------------------------
          Cont                          Avail     Queued/Pkts
    Type  Id    Rate       Policed     Credits  Bytes         Flags    CPU      Status
    ----  ----  ---------  ----------  -------  ------------  -------  -------  ----------
    0     1     9792 pps   0           306            0/0              1        ALLOCATED
    0     2     1984 pps   0           62             0/0              1        ALLOCATED
    0     3     98304 pps  0           3072           0/0              1        ALLOCATED
    0     4     1984 pps   0           62             0/0              1        ALLOCATED
    0     5     992 pps    0           31             0/0              1        ALLOCATED
    0     6     992 pps    0           31             0/0              1        ALLOCATED
    0     7     992 pps    0           31             0/0              1        ALLOCATED
    0     8     512 pps    0           16             0/0              1        ALLOCATED
    0     9     992 pps    0           31             0/0              1        ALLOCATED
    0     10    992 pps    0           31             0/0              1        ALLOCATED
    0     11    1984 pps   0           62             0/0              1        ALLOCATED


  • 6.  RE: Backup MM lost network connection, but not to Master MM

    Posted Oct 31, 2019 05:18 AM

    Please open a TAC case, so that they can explore what is wrong in your specific situation.

     

    It could be a bug, or configuration issue or a combination.  Please report back to us here when you get any insight on that.



  • 7.  RE: Backup MM lost network connection, but not to Master MM

    Posted Dec 20, 2019 01:52 PM

    I just ran into this issue again. It seems to manifest after the MM VM is moved due to host maintenance in our environment. The Primary MM lost outside network connectivity, but not to the standby, so MM VRRP did not fail over. Unlike in the past, this actually caused a service disruption. It also took down one of our MDs, which lost outside connectivity at the exact same time. But that MD is in a completely different datacenter, so the only reason it lost connectivity is (I'm guessing) something with the IPsec tunnel to the MM VRRP. Same symptoms: the MD lost outside connectivity, yet it could still see its cluster peer. Even weirder, we lost connectivity to all the APs registered to that MD, but the APs did not lose contact with the controller, so they did not fail over to the backup! So half of all our APs were effectively black-holed. We rebooted the Primary MM; as soon as that happened, the Standby took over VRRP, the MD came back online, and all of its APs came back online! A very strange issue. It's got to have something to do with internal routing between APs, MDs and the MM via the IPsec tunnel.

     

    This is all on 8.5.0.4, by the way. Opening a TAC case.



  • 8.  RE: Backup MM lost network connection, but not to Master MM

    Posted Dec 20, 2019 03:55 PM

    The MD does not have any dependency on the MM to pass user traffic.  I am sitting here with popcorn to understand what could be making that happen.



  • 9.  RE: Backup MM lost network connection, but not to Master MM

    Posted Dec 26, 2019 05:46 PM

    I'm not exactly sure how much *user* traffic was affected, but definitely management traffic to the MD and all APs connected to it.

     

    Here's an interesting piece: while the MM was 'offline', or not reachable, the MD pulled its default route out of the route table.

     

    MD during outage

    show ip route 
    
    Codes: C - connected, O - OSPF, R - RIP, S - static, B - Bgw peer uplink
           M - mgmt, U - route usable, * - candidate default, V - RAPNG VPN/Branch
           I - Ike-overlay, N - not redistributed
    
    Gateway of last resort is Imported from DHCP to network 0.0.0.0 at cost 10
    Gateway of last resort is Imported from CELL to network 0.0.0.0 at cost 10
    Gateway of last resort is Imported from PPPOE to network 0.0.0.0 at cost 10
    C    10.20.0.0/16 is directly connected, VLAN20
    C    10.210.0.0/16 is directly connected, VLAN810
    C    10.212.0.0/16 is directly connected, VLAN812
    C    10.20.40.51/32 is an ipsec map default-ha-ipsecmap10.20.40.51
    C    10.150.20.31/32 is an ipsec map default-local-master-ipsecmap

    MD normal operation

    (adc-mod-awlc01) [MDC] #show ip route
    
    Codes: C - connected, O - OSPF, R - RIP, S - static, B - Bgw peer uplink
           M - mgmt, U - route usable, * - candidate default, V - RAPNG VPN/Branch
           I - Ike-overlay, N - not redistributed
    
    Gateway of last resort is Imported from DHCP to network 0.0.0.0 at cost 10
    Gateway of last resort is Imported from CELL to network 0.0.0.0 at cost 10
    Gateway of last resort is Imported from PPPOE to network 0.0.0.0 at cost 10
    Gateway of last resort is 10.20.0.1 to network 0.0.0.0 at cost 1
    S*    0.0.0.0/0  [0/1] via 10.20.0.1*
    C    10.20.0.0/16 is directly connected, VLAN20
    C    10.210.0.0/16 is directly connected, VLAN810
    C    10.212.0.0/16 is directly connected, VLAN812
    C    10.20.40.51/32 is an ipsec map default-ha-ipsecmap10.20.40.51
    C    10.150.20.31/32 is an ipsec map default-local-master-ipsecmap
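
    The "Imported from DHCP/CELL/PPPOE" lines suggest the default route was learned dynamically and withdrawn during the outage. A possible workaround, offered as a sketch rather than a confirmed fix (10.20.0.1 is taken from the normal-operation output above; verify it against your topology), is to pin a static default gateway so it cannot be pulled:

    (adc-mod-awlc01) [MDC] (config) #ip default-gateway 10.20.0.1
    (adc-mod-awlc01) [MDC] (config) #write memory

    Depending on how the MD is managed, this may need to be configured from the appropriate node in the MM hierarchy rather than directly on the MD.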


  • 10.  RE: Backup MM lost network connection, but not to Master MM

    Posted Dec 26, 2019 06:03 PM

    Sounds to me like 10.20.0.1 is unreachable for some reason.

     

    EDIT:  I would do a "show datapath route-cache" and see if you see the default gateway in there.
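
    When the issue recurs, a quick triage sequence along these lines might help (standard ArubaOS show commands; the text after each "!" is an annotation, not part of the command):

    show datapath route-cache      ! is the gateway still resolved in the datapath?
    show arp                       ! does the control plane have an ARP entry for 10.20.0.1?
    show datapath route            ! is the default route still present in the datapath?
    ping 10.20.0.1                 ! can the controller reach the gateway at all?

    Comparing these during the outage versus normal operation should narrow down whether the route or the ARP/MAC resolution is what disappears.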



  • 11.  RE: Backup MM lost network connection, but not to Master MM

    Posted Jan 16, 2020 12:21 PM

    We have a bug-id for this now while the developers investigate. AOS-198880.



  • 12.  RE: Backup MM lost network connection, but not to Master MM

    Posted Apr 27, 2020 12:04 PM

    We had one of these issues again, so it reminded me to update this thread. It's pretty obvious now that what is happening in our case is that a VMware vMotion of the VMM triggers this issue. The VMM that is migrated seems to lose layer 3 connectivity but maintains layer 2 connectivity, so VRRP never fails over, which either takes the VIP offline or creates a one-way traffic issue on the VMM. Something about this series of events causes the MDs to freak out, and usually at least one of the MDs and all of its associated APs stop responding over the network.

     

    Aruba has responded to our case that VMotion is not supported on VMM, but has offered no solutions on how to prevent this issue. There is no good/reliable way to prevent VMotion on a single VM, there are only a handful of workarounds that I have found. I'm going to try escalating to Aruba again for some kind of enhancement request to figure out the VMotion issue, or at least prevent the MDs from freaking out when the VMM goes weird.