VSX routed keep alive design learnings

  • 1.  VSX routed keep alive design learnings

    Posted Jun 15, 2023 06:59 PM

    I recently had to lab up a new VSX design with routed VSX keepalives, on the 8360 running 10.10. Documentation for this setup is very sparse: only Appendix D in the 2020 version 1.3 of 'VSX Configuration Best Practices for Aruba CX' mentions it, and it leaves out a few key considerations. It took me a few days to work out how to set it up correctly so that it didn't suffer extended user traffic outages under certain failure conditions.

    My situation was that the VSX switches had to be split across two different comms rooms. The 8360s are my site L3 core/agg switches, and each comms room has a WAN router. The WAN routers are independent of each other, and each 8360 connects only to its local WAN router, not to both. So it's a typical campus deployment scenario, but without the core switch pair that Appendix D talks about.

    Rather than OSPF I needed to use BGP, so it's eBGP to the WAN routers and iBGP between the VSX switches over the ISL link, with BFD configured for all peerings. I have multiple VRFs. Because of a common duct between the main comms rooms, I needed a routed VSX keepalive design using the switch loopbacks, so that under normal conditions keepalive packets route over the ISL links, but if those fail they re-route via the WAN. That re-routing needs to happen within 3 seconds to avoid split brain.
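
    To make the 3-second budget concrete, here is a rough Python sketch of the arithmetic, not anything taken from the switch. It assumes the usual simplified BFD rule (detection time = interval x detect multiplier); the 1000 ms interval and the 200 ms re-convergence figure are hypothetical numbers picked for illustration.

```python
from dataclasses import dataclass


@dataclass
class BfdTimers:
    """BFD timers for one session (the values used below are assumptions)."""
    interval_ms: int        # negotiated transmit/receive interval
    detect_multiplier: int  # missed packets before the session is declared down


def bfd_detection_ms(t: BfdTimers) -> int:
    # Simplified BFD detection time: interval multiplied by the detect multiplier.
    return t.interval_ms * t.detect_multiplier


def keepalive_survives(detect_ms: int, reconverge_ms: int, budget_ms: int) -> bool:
    # The keepalive path must be restored before the budget runs out.
    return detect_ms + reconverge_ms < budget_ms


BUDGET_MS = 3000     # the 3-second limit mentioned above
RECONVERGE_MS = 200  # hypothetical time for BGP to withdraw and re-install the route

for mult in (5, 2):
    timers = BfdTimers(interval_ms=1000, detect_multiplier=mult)  # hypothetical interval
    detect = bfd_detection_ms(timers)
    ok = keepalive_survives(detect, RECONVERGE_MS, BUDGET_MS)
    print(f"multiplier={mult}: detection {detect} ms, within budget: {ok}")
```

    With these made-up numbers a multiplier of 5 blows the budget while a multiplier of 2 fits inside it, which is the kind of margin to check against your own timers.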

    I found that failing the ISL links or powering off one of the VSX switches resulted in split brain and a traffic outage for end users of up to 7 seconds. I finally tracked this down to keepalive traffic not re-routing quickly enough, and there were two mistakes in my design. First, I experimented with the default 'bfd detect-multiplier', changing it to 2, which improved the failover time to 3s or so. But I realised the root cause was that the point-to-point VLAN interface between the two switches, which iBGP runs over, was not going down when the ISL link failed. That was because I had another LAG interface on the switches allowing all VLANs. That LAG was not going down, so my P2P VLAN 2 stayed up, and BGP had to time out before withdrawing the route for the remote keepalive loopback.
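
    The root cause is easier to see as a toy model. This is only a Python sketch of the general rule that a VLAN interface stays up while at least one up port carries that VLAN; the Port class and the port names are made up for illustration and are not how the switch implements it.

```python
from dataclasses import dataclass, field


@dataclass
class Port:
    """Hypothetical stand-in for a trunk port or LAG."""
    name: str
    up: bool
    allowed_vlans: set = field(default_factory=set)
    allow_all: bool = False  # models 'vlan trunk allowed all'

    def carries(self, vlan: int) -> bool:
        return self.allow_all or vlan in self.allowed_vlans


def svi_is_up(vlan: int, ports: list) -> bool:
    # The VLAN interface stays up while any up port still carries the VLAN.
    return any(p.up and p.carries(vlan) for p in ports)


isl_failed = Port("lag 256", up=False, allow_all=True)

# Mistake: the access LAG allows all VLANs, so VLAN 2 survives the ISL failure.
access_all = Port("lag 1", up=True, allow_all=True)
print(svi_is_up(2, [isl_failed, access_all]))

# Fix: the access LAG only carries the VLANs it needs, so VLAN 2 goes down with the ISL.
access_pruned = Port("lag 1", up=True, allowed_vlans={550})
print(svi_is_up(2, [isl_failed, access_pruned]))
```

    In the first case the iBGP session over VLAN 2 has nothing telling it the path is gone, so it waits for BFD or BGP timers. In the second the interface drops immediately and the route is withdrawn straight away.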

    With the configurations below I was able to get sub-second failover for loss of the ISL links or power-down of SW1. Upstream WAN router failure took approx 1-2 seconds to recover and re-route.

    I'm trying to get my design validated by Aruba TAC, but that isn't going very well, as my TAC engineer seems a little out of his depth. So I thought I'd post it here to get feedback.

    VSX Switch #1
    ! Lag to my 2930 access switch
    interface lag 1 multi-chassis
        description uplink
        vsx-sync vlans
        no shutdown
        no routing
        vlan trunk native 1
        ! need to ensure you don't have any other interface that allows all VLANs
        ! otherwise the routed keepalive link between the switches doesn't go down
        vlan trunk allowed all
        lacp mode active
        spanning-tree root-guard
    
    ! ISL LAG
    interface lag 256
        description VSX ISL link
        no shutdown
        no routing
        vlan trunk native 1 tag
        vlan trunk allowed all
        lacp mode active
    
    
    
    ! my 2 x ISL links between switches
    interface 1/1/21
        description ISL physical link
        no shutdown
        mtu 9198
        lag 256
    interface 1/1/22
        description ISL physical link
        no shutdown
        mtu 9198
        lag 256
    
    ! My loopback which I use for the keep alive
    interface loopback 0
        ip address 192.168.1.1/32
    
    ! My routed VLAN going over the ISL link for the heartbeat when the link is OK
    ! Modified BFD timer to limit the outage should VLAN 2 not go down
    ! through someone misconfiguring another interface
    interface vlan 2
        ip mtu 9128
        ip address 10.0.2.1/30
        bfd detect-multiplier 2
    
    ! LAN VLAN my downstream 2930F devices connect into
    interface vlan 550
        description local LAN
        vsx-sync active-gateways
        ip address 10.55.1.2/24
        active-gateway ip mac 12:01:00:00:01:00
        active-gateway ip 10.55.1.1
    
    ! Routed sub interfaces for upstream router to avoid the potential
    ! VLAN2 not going down issue
    interface 1/1/9
        description WAN router
        no shutdown
    interface 1/1/9.12
        encapsulation dot1q 12
        ip address 192.168.12.1/31
    interface 1/1/9.13
        encapsulation dot1q 13
        vrf attach lno
        ip address 192.168.13.1/31
    
    ! typical VSX setup, just using the loopbacks as the source/dst
    vsx
        system-mac 02:01:00:00:01:00
        inter-switch-link lag 256
        role primary
        keepalive peer 192.168.1.2 source 192.168.1.1
        vsx-sync dhcp-relay dhcp-server mclag-interfaces snmp static-routes stp-global time vsx-global
    
    
    ! routes learned from the WAN on this switch are more preferred
    route-map localpref-120 permit seq 10
         set local-preference 120
    !
    
    ! two BGP peers, iBGP to other 8360 switch and one to the WAN router
    ! BFD configured for both peerings
    router bgp 65503
        bgp router-id 192.168.1.1
        neighbor 10.0.2.2 remote-as 65503
        neighbor 10.0.2.2 description 8360-2
        neighbor 10.0.2.2 fall-over bfd
        neighbor 192.168.12.0 remote-as 65504
        neighbor 192.168.12.0 description WAN router
        neighbor 192.168.12.0 fall-over bfd
        address-family ipv4 unicast
            neighbor 10.0.2.2 activate
            neighbor 10.0.2.2 next-hop-self
            neighbor 192.168.12.0 activate
            neighbor 192.168.12.0 route-map localpref-120 in
            neighbor 192.168.12.0 soft-reconfiguration inbound
            redistribute connected
            redistribute local loopback
        exit-address-family
    !
    !
    
    VSX Switch #2
    ! Lag to my 2930 access switch
    interface lag 1 multi-chassis
        description uplink
        vsx-sync vlans
        no shutdown
        no routing
        vlan trunk native 1
        ! need to ensure you don't have any other interface that allows all VLANs
        ! otherwise the routed keepalive link between the switches doesn't go down
        vlan trunk allowed all
        lacp mode active
        spanning-tree root-guard
    
    ! ISL LAG
    interface lag 256
        description VSX ISL link
        no shutdown
        no routing
        vlan trunk native 1 tag
        vlan trunk allowed all
        lacp mode active
    
    
    ! my 2 x ISL links between switches
    interface 1/1/21
        description ISL physical link
        no shutdown
        mtu 9198
        lag 256
    interface 1/1/22
        description ISL physical link
        no shutdown
        mtu 9198
        lag 256
    
    ! My loopback which I use for the keep alive
    interface loopback 0
        ip address 192.168.1.2/32
    
    
    ! My routed VLAN going over the ISL link for the heartbeat when the link is OK
    ! Modified BFD timer to limit the outage should VLAN 2 not go down
    ! through someone misconfiguring another interface
    interface vlan 2
        ip mtu 9128
        ip address 10.0.2.2/30
        bfd detect-multiplier 2
    
    ! LAN VLAN my downstream 2930F devices connect into
    interface vlan 550
        description local LAN
        vsx-sync active-gateways
        ip address 10.55.1.3/24
        active-gateway ip mac 12:01:00:00:01:00
        active-gateway ip 10.55.1.1
    
    ! Routed sub interfaces for upstream router to avoid the potential
    ! VLAN2 not going down issue
    interface 1/1/9
        description VCE LAN
        no shutdown
    interface 1/1/9.12
        encapsulation dot1q 12
        ip address 192.168.12.3/31
    interface 1/1/9.13
        encapsulation dot1q 13
        vrf attach lno
        ip address 192.168.13.3/31
    
    ! typical VSX setup, just using the loopbacks as the source/dst
    vsx
        system-mac 02:01:00:00:01:00
        inter-switch-link lag 256
        role secondary
        keepalive peer 192.168.1.1 source 192.168.1.2
        vsx-sync dhcp-relay dhcp-server mclag-interfaces snmp static-routes stp-global time vsx-global
    
    ! routes learned from the WAN on this switch are less preferred
    route-map localpref-90 permit seq 10
         set local-preference 90
    !
    
    ! two BGP peers, iBGP to other 8360 switch and one to the WAN router
    ! BFD configured for both peerings
    router bgp 65503
        bgp router-id 192.168.1.2
        neighbor 10.0.2.1 remote-as 65503
        neighbor 10.0.2.1 description 8360-1
        neighbor 10.0.2.1 fall-over bfd
        neighbor 192.168.12.2 remote-as 65504
        neighbor 192.168.12.2 description WAN
        neighbor 192.168.12.2 fall-over bfd
        address-family ipv4 unicast
            neighbor 10.0.2.1 activate
            neighbor 10.0.2.1 next-hop-self
            neighbor 192.168.12.2 activate
            neighbor 192.168.12.2 route-map localpref-90 in
            neighbor 192.168.12.2 soft-reconfiguration inbound
            redistribute connected
            redistribute local loopback
        exit-address-family
    !
    !
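
    Since the whole design hinges on no other trunk carrying VLAN 2, a quick offline check of a saved config can help. This is a hedged Python sketch, not an Aruba tool: the helper names are my own, and it assumes the allowed list is either 'all' or a comma-separated list of VLAN IDs and ranges such as '1,10-20'.

```python
def allows_vlan(spec: str, vlan: int) -> bool:
    """True if a 'vlan trunk allowed' value such as 'all' or '1,10-20' includes vlan."""
    if spec.strip() == "all":
        return True
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        lo, _, hi = part.partition("-")
        if int(lo) <= vlan <= int(hi or lo):
            return True
    return False


def trunks_carrying(config: str, vlan: int, isl: str) -> list:
    """Return interfaces other than the ISL whose trunk allows the given VLAN."""
    offenders = []
    current = None
    for raw in config.splitlines():
        line = raw.strip()
        if line.startswith("interface "):
            current = line[len("interface "):]
        elif line.startswith("vlan trunk allowed ") and current is not None:
            spec = line[len("vlan trunk allowed "):]
            if current != isl and allows_vlan(spec, vlan):
                offenders.append(current)
    return offenders


# Trunk lines in the same shape as the config above.
SAMPLE = """
interface lag 1 multi-chassis
    no routing
    vlan trunk allowed all
interface lag 256
    no routing
    vlan trunk allowed all
"""

print(trunks_carrying(SAMPLE, vlan=2, isl="lag 256"))
```

    Any interface it lists is one that would keep the VLAN 2 interface up after an ISL failure, so its allowed list should be pruned to just the VLANs it needs.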
    
    


  • 2.  RE: VSX routed keep alive design learnings

    Posted Oct 16, 2023 03:28 AM

    It's not best practice to run the keepalive over the ISL at all. I would avoid that at all times.

    In newer releases I tend to use the mgmt interface for keepalive, preserving switchports.

    The ACSX study guide states:

    In a routed design for keepalive northbound you should implement a full mesh topology with cross connections: each aggregation switch must have a physical connection to both cores.

    Suppose you use a non-mesh topology and each aggregation switch only has a single connection to the core. If a link fails between agg and core, routing convergence must occur, impacting traffic. Tweaking timers will help. So I guess that's what you have done, and you found a feasible solution.

    Thanks for posting!



    ------------------------------
    Ole Morten Kårbø
    ACEA ACSP
    Netnordic Norway
    ------------------------------