VSX routed keep alive design learnings

1. VSX routed keep alive design learnings

Campbell

Posted Jun 15, 2023 06:59 PM

I recently had to lab up a new VSX design with routed VSX keep alives, on the 8360 running the 10.10 OS. Documentation for this setup is very sparse: only Appendix D in the 2020 1.3 version of the 'VSX Configuration Best Practices for Aruba CX' mentions it, and it misses a few key considerations. It took me a few days to work out how to set it up so that it didn't suffer extended user traffic outages under certain failure conditions.

My design constraint was that the VSX switches had to be split across two different comms rooms. The 8360s are my site L3 core/agg switches and each comms room has a WAN router. The WAN routers are independent of each other, and each 8360 connects only to its local WAN router, not to both. So it is a typical campus deployment scenario, but without the core switch pair that Appendix D talks about.

I needed to use BGP rather than OSPF, so it was eBGP to the WAN routers and iBGP between the VSX switches over the ISL link, with BFD configured for all peerings. I have multiple VRFs. Because of a common duct between the comms rooms, I needed a routed VSX keep alive design using the switch loopbacks, so that under normal conditions keep alive packets route over the ISL links but, if those fail, re-route via the WAN. That re-routing needs to happen within 3 seconds to avoid split brain.

I found that failing the ISL links or powering off one of the VSX switches resulted in split brain and an end-user traffic outage of up to 7 seconds. I finally tracked this down to keep alive traffic not re-routing quickly enough, caused by two mistakes in my design. First, I tried changing 'bfd detect-multiplier' from its default to 2, which improved the fail over time to around 3s. But I then realised the root cause: the point-to-point VLAN interface between the switches, which iBGP runs over, was not going down when the ISL link failed. That was because I had another LAG interface on the switches allowing all VLANs. That LAG stayed up, so my P2P VLAN 2 stayed up too, and BGP had to time out before withdrawing the route for the remote keep alive loopback.
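
The fix described above can be sketched as a trunk on the access LAG that lists its VLANs explicitly, so that VLAN 2 is carried only on the ISL and its VLAN interface goes down when the ISL does. This is a minimal sketch rather than a validated config: the VLAN list (native VLAN 1 plus LAN VLAN 550) is an assumption based on the VLANs that appear in the configurations in this post, and any other VLANs the access switch needs would have to be added.

```
! Access LAG: list the VLANs explicitly instead of 'allowed all'
! (the VLAN list 1,550 is an assumption for illustration)
interface lag 1 multi-chassis
    no routing
    vlan trunk native 1
    vlan trunk allowed 1,550
    lacp mode active
```

The same idea applies to every non-ISL trunk on both switches: if any of them carries VLAN 2, the P2P VLAN interface can stay up after the ISL fails.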

With the below configurations I was able to get sub second fail over for loss of ISL links or power down of SW1. Upstream WAN router failure took approx 1-2 sec to recover and re-route.
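
To check the behaviour during a failure test, show commands along the following lines can be run on switch 1 before and after failing the ISL. This is a sketch from memory of the AOS-CX CLI; the exact options and output vary by release, so treat the commands as assumptions to verify against the command reference for your version.

```
! VSX keep alive state towards the peer loopback
show vsx status keepalive
! BFD sessions backing the iBGP and eBGP peerings
show bfd
! BGP neighbour state
show bgp ipv4 unicast summary
! next hop currently used to reach the peer's keep alive loopback
show ip route 192.168.1.2/32
```

After the ISL fails, the route to 192.168.1.2/32 should move from the VLAN 2 next hop to the WAN router next hop while the keep alive stays established.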

I'm trying to get my design validated by Aruba TAC, but that isn't going very well, as my TAC engineer seems a little out of his depth. So I thought I'd post it here to get feedback.

VSX Switch #1

! Lag to my 2930 access switch
interface lag 1 multi-chassis
    description uplink
    vsx-sync vlans
    no shutdown
    no routing
    vlan trunk native 1
    ! make sure no interface other than the ISL allows all VLANs,
    ! otherwise the routed keep alive VLAN between the switches won't go down
    vlan trunk allowed all
    lacp mode active
    spanning-tree root-guard

! ISL LAG
interface lag 256
    description VSX ISL link
    no shutdown
    no routing
    vlan trunk native 1 tag
    vlan trunk allowed all
    lacp mode active



! my 2 x ISL links between switches
interface 1/1/21
    description ISL physical link
    no shutdown
    mtu 9198
    lag 256
interface 1/1/22
    description ISL physical link
    no shutdown
    mtu 9198
    lag 256

! My loopback which I use for the keep alive
interface loopback 0
    ip address 192.168.1.1/32

! My routed VLAN going over the ISL link for the heartbeat when the link is OK
! Modified BFD multiplier to limit the outage should VLAN 2 not go down
! because someone misconfigures another interface
interface vlan 2
    ip mtu 9128
    ip address 10.0.2.1/30
    bfd detect-multiplier 2

! LAN VLAN my downstream 2930F devices connect into
interface vlan 550
    description local LAN
    vsx-sync active-gateways
    ip address 10.55.1.2/24
    active-gateway ip mac 12:01:00:00:01:00
    active-gateway ip 10.55.1.1

! Routed sub interfaces for upstream router to avoid the potential
! VLAN2 not going down issue
interface 1/1/9
    description WAN router
    no shutdown
interface 1/1/9.12
    encapsulation dot1q 12
    ip address 192.168.12.1/31
interface 1/1/9.13
    encapsulation dot1q 13
    vrf attach lno
    ip address 192.168.13.1/31

! typical VSX setup, just using the loopbacks as the source/dst
vsx
    system-mac 02:01:00:00:01:00
    inter-switch-link lag 256
    role primary
    keepalive peer 192.168.1.2 source 192.168.1.1
    vsx-sync dhcp-relay dhcp-server mclag-interfaces snmp static-routes stp-global time vsx-global


! routes learned from the WAN on this switch are more preferred
route-map localpref-120 permit seq 10
     set local-preference 120
!

! two BGP peers, iBGP to other 8360 switch and one to the WAN router
! BFD configured for both peerings
router bgp 65503
    bgp router-id 192.168.1.1
    neighbor 10.0.2.2 remote-as 65503
    neighbor 10.0.2.2 description 8360-2
    neighbor 10.0.2.2 fall-over bfd
    neighbor 192.168.12.0 remote-as 65504
    neighbor 192.168.12.0 description WAN router
    neighbor 192.168.12.0 fall-over bfd
    address-family ipv4 unicast
        neighbor 10.0.2.2 activate
        neighbor 10.0.2.2 next-hop-self
        neighbor 192.168.12.0 activate
        neighbor 192.168.12.0 route-map localpref-120 in
        neighbor 192.168.12.0 soft-reconfiguration inbound
        redistribute connected
        redistribute local loopback
    exit-address-family
!
!

VSX Switch #2

! Lag to my 2930 access switch
interface lag 1 multi-chassis
    description uplink
    vsx-sync vlans
    no shutdown
    no routing
    vlan trunk native 1
    ! make sure no interface other than the ISL allows all VLANs,
    ! otherwise the routed keep alive VLAN between the switches won't go down
    vlan trunk allowed all
    lacp mode active
    spanning-tree root-guard

! ISL LAG
interface lag 256
    description VSX ISL link
    no shutdown
    no routing
    vlan trunk native 1 tag
    vlan trunk allowed all
    lacp mode active


! my 2 x ISL links between switches
interface 1/1/21
    description ISL physical link
    no shutdown
    mtu 9198
    lag 256
interface 1/1/22
    description ISL physical link
    no shutdown
    mtu 9198
    lag 256

! My loopback which I use for the keep alive
interface loopback 0
    ip address 192.168.1.2/32


! My routed VLAN going over the ISL link for the heartbeat when the link is OK
! Modified BFD multiplier to limit the outage should VLAN 2 not go down
! because someone misconfigures another interface
interface vlan 2
    ip mtu 9128
    ip address 10.0.2.2/30
    bfd detect-multiplier 2

! LAN VLAN my downstream 2930F devices connect into
interface vlan 550
    description local LAN
    vsx-sync active-gateways
    ip address 10.55.1.3/24
    active-gateway ip mac 12:01:00:00:01:00
    active-gateway ip 10.55.1.1

! Routed sub interfaces for upstream router to avoid the potential
! VLAN2 not going down issue
interface 1/1/9
    description VCE LAN
    no shutdown
interface 1/1/9.12
    encapsulation dot1q 12
    ip address 192.168.12.3/31
interface 1/1/9.13
    encapsulation dot1q 13
    vrf attach lno
    ip address 192.168.13.3/31

! typical VSX setup, just using the loopbacks as the source/dst
vsx
    system-mac 02:01:00:00:01:00
    inter-switch-link lag 256
    role secondary
    keepalive peer 192.168.1.1 source 192.168.1.2
    vsx-sync dhcp-relay dhcp-server mclag-interfaces snmp static-routes stp-global time vsx-global

! routes learned from the WAN on this switch are less preferred
route-map localpref-90 permit seq 10
     set local-preference 90
!

! two BGP peers, iBGP to other 8360 switch and one to the WAN router
! BFD configured for both peerings
router bgp 65503
    bgp router-id 192.168.1.2
    neighbor 10.0.2.1 remote-as 65503
    neighbor 10.0.2.1 description 8360-1
    neighbor 10.0.2.1 fall-over bfd
    neighbor 192.168.12.2 remote-as 65504
    neighbor 192.168.12.2 description WAN
    neighbor 192.168.12.2 fall-over bfd
    address-family ipv4 unicast
        neighbor 10.0.2.1 activate
        neighbor 10.0.2.1 next-hop-self
        neighbor 192.168.12.2 activate
        neighbor 192.168.12.2 route-map localpref-90 in
        neighbor 192.168.12.2 soft-reconfiguration inbound
        redistribute connected
        redistribute local loopback
    exit-address-family
!
!

2. RE: VSX routed keep alive design learnings

OK96
Posted Oct 16, 2023 03:28 AM

It's not best practice to run the keep alive over the ISL at all. I would always avoid that.

Nevertheless, I tend to use the mgmt interface for the keep alive, which preserves switchports.
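
For comparison, a keep alive over the management interface could look roughly like the sketch below. The addresses are hypothetical placeholders, not taken from this thread, and it assumes the two mgmt ports can reach each other over an out-of-band network.

```
! hypothetical mgmt addresses: 10.99.0.1 (this switch), 10.99.0.2 (peer)
vsx
    keepalive peer 10.99.0.2 source 10.99.0.1 vrf mgmt
```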

From the ACSX study guide it states:

In a routed design with the keep alive running northbound, you should implement a full mesh topology with cross connections: each aggregation switch must have a physical connection to both cores.

Suppose you use a non-mesh topology and each aggregation switch has only a single connection to the core. If a link fails between agg and core, routing convergence must occur, which impacts traffic. Tweaking timers will help. So I guess that's what you have done, and you have found a feasible solution.

Thanks for posting!

------------------------------
Ole Morten Kårbø
ACEA ACSP
Netnordic Norway
------------------------------
